Web scraping using Python vs web scraping tool
Monday, September 23, 2019
Web scraping has become a widely used technique for gathering and extracting data from websites.
People build or adopt a wide range of software to do it. Broadly, the approaches fall into two camps: writing code and using ready-made tools. In this article, we demonstrate scraping tweets with both.
Scraping Twitter with Python
To scrape Twitter with Python, we first need to apply for access to the Twitter API. Once the application is approved, we receive four credentials: the API key, API secret key, access token, and access token secret.
json is a built-in Python package for working with JSON data. Tweepy is an open-source package for accessing the Twitter API; it provides many useful functions and classes that handle the implementation details.
Stream runs the connection and delivers the tweets. OAuthHandler submits our keys and secrets to Twitter. StreamListener lets us choose which fields to keep from each tweet.
Next, fill in the keys and secrets you received when you applied for API access.
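One way to supply the four values without pasting them into the script is to read them from environment variables. The sketch below is only an illustration of that idea, not the code from the original demo; the variable names (TWITTER_API_KEY and so on) and the helper names are hypothetical.

```python
import os
from typing import List, Mapping, NamedTuple


class TwitterCredentials(NamedTuple):
    """The four values issued with Twitter API access."""
    api_key: str
    api_secret_key: str
    access_token: str
    access_token_secret: str


# Hypothetical environment variable names; pick whatever suits your setup.
ENV_NAMES = {
    "api_key": "TWITTER_API_KEY",
    "api_secret_key": "TWITTER_API_SECRET_KEY",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def missing_names(env: Mapping[str, str]) -> List[str]:
    """Return the variable names that are unset or empty in env."""
    return [name for name in ENV_NAMES.values() if not env.get(name)]


def load_credentials(env: Mapping[str, str] = os.environ) -> TwitterCredentials:
    """Build the credentials tuple, failing early if anything is missing."""
    missing = missing_names(env)
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return TwitterCredentials(
        **{field: env[name] for field, name in ENV_NAMES.items()}
    )
```

Calling load_credentials() with no argument reads the real environment; passing a plain dict makes the function easy to try out without touching it.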
Here we create a class that inherits from StreamListener to define which fields we scrape from Twitter. We can collect information such as the tweet text, location, username, user ID, followers count, friends count, favourites count and time zone. Some of these fields, such as the tweet text, username and time zone, may contain words in other languages. We should therefore encode the output explicitly as UTF-8 rather than rely on the default encoding.
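The field selection itself can be sketched without Tweepy, because the listener receives each tweet as a raw JSON string. The example below is a minimal sketch under assumptions: the function names are hypothetical, only three of the fields are kept (username, location, tweet text), and the keys used are those of the standard Twitter v1.1 tweet object. In a StreamListener subclass, this logic would sit inside the method that handles incoming data.

```python
import json

FIELD_SEPARATOR = ";;"


def tweet_to_line(raw_data):
    """Turn one raw tweet (a JSON string) into a ';;'-separated line."""
    tweet = json.loads(raw_data)
    user = tweet.get("user") or {}
    # Other user fields (id, followers_count, friends_count,
    # favourites_count, time_zone) could be added to this list the same way.
    fields = [user.get("screen_name"), user.get("location"), tweet.get("text")]
    # Missing values become empty strings; newlines and repeated whitespace
    # are collapsed so that each tweet stays on a single line.
    cleaned = [
        "" if value is None else " ".join(str(value).split())
        for value in fields
    ]
    return FIELD_SEPARATOR.join(cleaned)


def append_line(path, line):
    """Append one line to a text file, explicitly encoded as UTF-8."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
```

Opening the file with encoding="utf-8" is what keeps non-English text intact regardless of the platform's default encoding.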
Next, we submit our keys and secrets using the OAuthHandler class that we imported from Tweepy.
Now we only need a few more steps to run and extract the information. Here we will search all tweets related to the keyword “Big data” and start our extraction with Stream.
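The whole flow can be sketched as follows. This is a hedged reconstruction, not the original listing: it assumes Tweepy 3.x, where StreamListener, OAuthHandler and Stream are used as described above (later Tweepy releases changed this interface), and the names run_stream and FieldListener are hypothetical.

```python
import json


def run_stream(api_key, api_secret_key, access_token, access_token_secret,
               keywords=("Big data",)):
    """Print matching tweets as 'username;;location;;tweet' lines."""
    # Assumption: Tweepy 3.x is installed. The import is done here so that
    # the module still loads where Tweepy is absent.
    from tweepy import OAuthHandler, Stream
    from tweepy.streaming import StreamListener

    class FieldListener(StreamListener):
        def on_data(self, data):
            # data is the raw JSON string for one streamed message.
            tweet = json.loads(data)
            user = tweet.get("user") or {}
            values = (user.get("screen_name"), user.get("location"),
                      tweet.get("text"))
            print(";;".join(str(value or "") for value in values))
            return True  # keep the stream open

        def on_error(self, status_code):
            print("Stream error:", status_code)
            return False  # stop streaming after an error

    # Submit the keys and secrets.
    auth = OAuthHandler(api_key, api_secret_key)
    auth.set_access_token(access_token, access_token_secret)

    # Start the stream and keep only tweets matching the keywords.
    stream = Stream(auth, FieldListener())
    stream.filter(track=list(keywords))
```

Calling run_stream with the four credentials blocks while the stream is open and prints one line per matching tweet.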
Each line of data holds the information from one tweet. Fields are separated by two semicolons (“;;”). The first field is the username, the second is the location and the last is the tweet. We can load the data into a spreadsheet with “;;” set as the delimiter, or use libraries such as pandas, numpy and re to clean the data further.
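As a small illustration of the spreadsheet route, the sketch below converts such lines into CSV using only the standard library. It assumes exactly three fields per line (username, location, tweet), as in the description above; the function names are hypothetical.

```python
import csv
import io

FIELD_NAMES = ["username", "location", "tweet"]


def parse_line(line):
    """Split one ';;'-separated line into a dict keyed by field name."""
    # Splitting at most twice keeps any ';;' inside the tweet text intact.
    parts = line.rstrip("\n").split(";;", len(FIELD_NAMES) - 1)
    # Pad short lines so that every field name gets a value.
    parts += [""] * (len(FIELD_NAMES) - len(parts))
    return dict(zip(FIELD_NAMES, parts))


def lines_to_csv_text(lines):
    """Return CSV text (with a header row) for the given lines."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELD_NAMES)
    writer.writeheader()
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow(parse_line(line))
    return buffer.getvalue()
```

The resulting text can be written to a .csv file and opened directly in a spreadsheet, with no custom delimiter needed.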
Scraping Twitter with Octoparse
Unlike scraping with Python, we don’t need to apply for API access first; simply entering the URL into Octoparse will do.
- Click "+ Task" to start a task using Advanced Mode
- Paste the URL into the "Website" box and click "Save" to move on
- To fully load the listings here, we need to scroll the page down to the bottom continuously. So we’ll set "Scroll down" for the Go To Web Page button:
- Check the box for "Scroll down to bottom of the page when finished loading"
- Set "Scroll times" to "20" and "Interval" to "3" seconds (this is for demonstration; you can set the numbers based on your needs)
- Select "Scroll down to the bottom of the page" as the "Scroll way" and click the "OK" button
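For comparison with the coding approach, the two scroll settings amount to a simple loop. The sketch below is an illustration only and says nothing about how Octoparse implements scrolling; scroll_page and its parameters are hypothetical, and scroll_once stands for whatever scrolls the page in your own crawler (for example, a browser-automation call).

```python
import time


def scroll_page(scroll_once, times=20, interval=3.0, sleep=time.sleep):
    """Call scroll_once() `times` times, pausing between consecutive scrolls.

    Returns the total number of seconds spent pausing.
    """
    waited = 0.0
    for index in range(times):
        scroll_once()
        # Pause only between two scrolls, not after the last one.
        if index < times - 1:
            sleep(interval)
            waited += interval
    return waited
```

With 20 scrolls and a 3-second interval, the loop pauses 19 times, so the page takes a little under a minute to load before extraction starts.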
"Interval" is the time interval between every two scrolls. Theoretically, the higher the number we input for "Scroll times", the more data we can extract.
- Click on the first item of the listing page
- Click on the second item
In the "Action Tips" panel, select "Extract text of the selected elements".
A "Loop Item" is then generated automatically and added to the workflow. By default, Octoparse extracts the text of the selected item; we can also delete that field and add the data fields we need.
Now we can choose the data fields we want, much like ordering dishes from a menu.
After that, we can rename the data fields and start the extraction.
Based on the demo above, we can now summarize the pros and cons of coding with Python and using Octoparse.
1. Learning cost: Octoparse has a lower learning cost than Python. To build a crawler with Python, you need to be familiar with various libraries and coding techniques, understand web page structure well and recognize anti-scraping techniques. In Octoparse, the developers have already handled most of these situations for you; a few clicks are enough to get the data.
2. Rapid setup: From the demo above, both methods seem simple and easy to use. However, the most time-consuming step is not building the crawler but the initial analysis of the website before building it. Different websites use different development methods and anti-scraping techniques, so we need more time to analyze them if we choose to code. The developers of Octoparse have already considered most situations, so you can get to your data without analyzing the website yourself.
3. Flexibility: Python is more flexible than Octoparse. We can change a crawler’s behaviour by editing a few lines of code, and we can import powerful libraries or APIs to access the data with just a few more. Even some of the hardest anti-scraping measures, such as CAPTCHA or reCAPTCHA, can now sometimes be solved with deep-learning methods in Python.
Python and Octoparse each have their strong points. Octoparse suits people with no coding skills, while Python offers great flexibility to experts. If learning to code in Python seems too hard, please contact us! Octoparse also provides a data service and can help you extract data with only a few clicks.