
Web scraping using Python vs. a web scraping tool

Monday, September 23, 2019

Web scraping has become a widely used technique for gathering and extracting data from websites. People develop or use a variety of software to achieve this goal, and the approaches generally fall into two camps: writing code and using ready-made tools. In this post, we will present a demo of scraping tweets using both methods.

 

Scraping Twitter with Python

To scrape Twitter with Python, we first need to apply for Twitter API access through this link. Once the application is approved, we receive four credentials: the API key, API secret key, Access token, and Access token secret.

 

The results we got after applying for Twitter API

 

Now that we have API access, we can start building our Twitter crawler. We will use two libraries to build it: json and tweepy.

 

json is a built-in package that can be used to manipulate JSON data. Tweepy is an open-source package for accessing the Twitter API; it contains many useful functions and classes that handle various implementation details.

 

Using the libraries json and tweepy to build the crawler.

 

Stream helps us run the crawler and extract the tweets. OAuthHandler helps us submit our keys and secrets to Twitter. StreamListener lets us choose which fields we need from each tweet.
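As a minimal sketch, the imports might look like the following. This assumes tweepy 3.x, the version current when this post was written (tweepy 4.x later reworked the streaming classes).

```python
import json

from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener
```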

Then fill in the keys and secrets you applied for through the previous link.

 

Fill in the keys and secrets we applied through the previous link.
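A sketch of this step, with placeholder values standing in for the real credentials:

```python
# Replace the placeholders with the credentials from your own
# Twitter developer application.
API_KEY = "YOUR_API_KEY"
API_SECRET_KEY = "YOUR_API_SECRET_KEY"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"
```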

 

 

Here we create a class inheriting from StreamListener to specify which fields we want to scrape from Twitter. We can collect information such as the tweet text, location, username, user ID, followers count, friends count, favourites count, and time zone. Some of these fields, such as the tweet text, username, and time zone, may contain words in other languages, so we should use a character encoding such as UTF-8 rather than the platform's default encoding.

 

Modify what kind of fields we need to scrape from Twitter.
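A minimal sketch of such a listener, assuming tweepy 3.x and a hypothetical tweets.txt output file. It writes the three fields described in the output section below (username, location, tweet), separated by ";;"; the other fields mentioned above are available under tweet["user"] in the same way.

```python
class TweetListener(StreamListener):
    """Collects the fields we want from each incoming tweet."""

    def on_data(self, data):
        try:
            tweet = json.loads(data)
            user = tweet["user"]
            # Join the fields with ";;" and write them out as UTF-8
            # so non-English usernames and tweets are preserved.
            line = ";;".join([
                user["screen_name"],
                user["location"] or "",
                tweet["text"].replace("\n", " "),
            ])
            with open("tweets.txt", "a", encoding="utf-8") as f:
                f.write(line + "\n")
        except KeyError:
            pass  # skip stream messages that are not tweets
        return True

    def on_error(self, status_code):
        print(status_code)
        return False  # disconnect on errors such as rate limiting
```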

 

Next, we can submit our key and secret using the OAuthHandler class from tweepy.

 

Submit our key and secret using OAuthHandler
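In tweepy 3.x this step is two calls, sketched here with the placeholder credentials from earlier:

```python
auth = OAuthHandler(API_KEY, API_SECRET_KEY)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
```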

 

Now we only need a few more steps to run the crawler and extract the information. Here we will search for all tweets related to the keyword "Big data" and start our extraction with Stream.

 

Search all tweets related to the keyword “Big data” and start the extraction with Stream
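A sketch of the final step, wiring the listener and auth handler into a Stream and filtering on the keyword (again assuming tweepy 3.x):

```python
listener = TweetListener()
stream = Stream(auth=auth, listener=listener)

# Track every incoming tweet that mentions the keyword.
stream.filter(track=["Big data"])
```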

 

Each line of data is the information from one tweet. Fields are separated by two semicolons (";;"). The first field is the username, the second is the location, and the last is the tweet. We can load the data into a spreadsheet with ";;" set as the delimiter, or apply other libraries like pandas, numpy, and re to clean the data further.
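For example, a minimal sketch of loading the file with pandas, assuming the hypothetical tweets.txt output from the listener above:

```python
import pandas as pd

# ";;" is a multi-character delimiter, so we need pandas'
# Python parsing engine rather than the default C engine.
df = pd.read_csv(
    "tweets.txt",
    sep=";;",
    engine="python",
    names=["username", "location", "tweet"],
    encoding="utf-8",
)
print(df.head())
```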

 

Scraping Twitter with Octoparse

Unlike scraping with Python, we don't need to start by applying for an API; simply inputting the URL into Octoparse will do.

 

  • Click "+ Task" to start a task using Advanced Mode
  • Paste the URL into the "Website" box and click "Save URL" to move on

 

Open the target website in Octoparse.

 

  • To fully load the listings here, we need to scroll the page down to the bottom continuously, so we'll set "Scroll down" for the Go To Web Page button:
  • Check the box for "Scroll down to bottom of the page when finished loading"
  • Set "Scroll times" as "20" and "Interval" as "3" seconds (this is for demonstration; you can set the numbers based on your needs)
  • Select "Scroll down to the bottom of the page" as the "Scroll way" and click the "OK" button

 

"Interval" is the time interval between every two scrolls. Theoretically, the higher the number we input for "Scroll times", the more data we can extract.

 

Set up scroll down for the Go To Web Page button.

  •  Click on the first item of the listing page

 

Click on the first item of the listing page.

 

 

  • Click on the second item

The "Action Tips" now reads "21 elements selected", so we can go on to select "Extract text of the selected elements”.

 Click on the second item to help Octoparse recognize the other similar items.

 

Then a "Loop Item" will be automatically generated and added to the workflow. By default, Octoparse extracts all fields from the selected item; we can delete the unwanted ones and add the data fields we need.

 

Delete the unwanted data fields.


Select the data fields you want to scrape.

 

 

 

After that, we can rename the data field and start the extraction.

 Rename the data field and start the extraction.

 

This is the sample output of the extracted data:

 

Sample output of the extracted data.

 

Based on the demos above, we can now weigh the pros and cons of coding with Python versus using Octoparse.

1. Learning cost: Octoparse has a lower learning cost than Python. To build a crawler with Python, you not only need to be familiar with various libraries and coding techniques, but you must also understand web page structure and recognize anti-scraping techniques. In Octoparse, the developers have already handled these situations for you; all you need is a few clicks and the data is ready.

 

2. Setup speed: From the demos above, both methods may seem simple and easy to use. However, the most time-consuming part is not building the crawler; it is the initial analysis of the website before you start building. Different websites use different development methods and anti-scraping techniques, so coding your own crawler means spending more time analyzing each site. The developers of Octoparse have already accounted for most situations, so you can get your data without analyzing websites first.

 

3. Flexibility: Python offers better flexibility than Octoparse. We can change our crawler's behaviour by editing a few lines of code, and we can import powerful libraries or APIs to access data with just a few more lines. Even some of the most difficult anti-scraping techniques, such as CAPTCHA or reCAPTCHA, can now be solved with deep-learning methods in Python.

 

Honestly, Python and Octoparse each have their strong points. Octoparse is more suitable for people with no coding skills, while Python provides great flexibility for experts. If learning Python coding sounds difficult, please contact us! Octoparse can provide data services and help you extract data with just a few clicks.

Feel free to contact us when you need a powerful web-scraping tool for your business or project!

 

Author: Momo

 

Japanese article: Python vs. Octoparse! Which scraping method is better for beginners?
Articles about web scraping can also be read on the official website.
Spanish article: Web Scraping Using Python vs. an Automated Tool
You can also read web scraping articles on the official website.

