
Web scraping using Python vs. a web scraping tool


Web scraping has become a widely used technique for gathering and extracting data from websites.

People develop or adopt a variety of software to achieve this goal, and it generally falls into two camps: writing code and using ready-made tools. In this article, we will present a demo of scraping tweets with both methods.

Scraping Twitter with Python

To scrape Twitter with Python, we first need to apply for access to the Twitter API on Twitter’s developer site. After the application is approved, we receive four credentials: the API key, API secret key, Access token, and Access token secret.

Now that we have the API credentials, we can start building our Twitter crawler. We will use two libraries: json and tweepy.

json is a built-in package that can be used to manipulate JSON data. Tweepy is an open-source package for accessing the Twitter API; it contains many useful functions and classes that handle various implementation details.

Stream runs the crawler and extracts the tweets, OAuthHandler submits our keys and secrets to Twitter, and StreamListener lets us define which fields we need from each tweet.
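For reference, the imports might look like this (a minimal sketch assuming Tweepy 3.x, where StreamListener is still a separate class; in Tweepy 4+ it was merged into Stream):

import json
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener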

Then fill in the keys and secrets you obtained in the previous step.
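For example (the values below are placeholders, not real credentials):

# replace these placeholders with your own credentials
ckey = "YOUR_API_KEY"
csecret = "YOUR_API_SECRET_KEY"
atoken = "YOUR_ACCESS_TOKEN"
asecret = "YOUR_ACCESS_TOKEN_SECRET"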

Here we will create a class inheriting from StreamListener to define which fields we want to scrape from Twitter. We can extract information such as the tweet text, location, username, user ID, followers count, friends count, favorites count, and time zone. Since some fields, such as tweets, usernames, and time zones, may contain text in other languages, we should write the output with an explicit character encoding such as UTF-8 rather than relying on the default.

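A minimal sketch of such a listener is shown below. It keeps just three of the fields mentioned above (username, location, and tweet text, following Twitter’s v1.1 tweet JSON); the class name TweetListener and the output file tweets.txt are our own choices for this demo:

class TweetListener(StreamListener):
    # called once for every tweet the stream delivers
    def on_data(self, data):
        tweet = json.loads(data)
        user = tweet.get("user", {})
        # join the fields with ";;" so they can be split apart later
        line = ";;".join([
            user.get("screen_name", ""),
            user.get("location") or "",
            tweet.get("text", ""),
        ])
        # write as UTF-8 so non-English characters are preserved
        with open("tweets.txt", "a", encoding="utf-8") as f:
            f.write(line + "\n")
        return True

    # stop the stream if Twitter returns an error status
    def on_error(self, status):
        print(status)
        return False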

Next, we can submit our keys and secrets using the OAuthHandler class from Tweepy.

# submit the key and secret
auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

Now we only need a few more steps to run the crawler and extract the information. Here we will track all tweets related to the keyword “Big data” and start our extraction with Stream.
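Using the TweetListener sketched above, the final step might look like this:

# start streaming all tweets that mention the keyword
twitterStream = Stream(auth, TweetListener())
twitterStream.filter(track=["Big data"])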

Each line of output holds the information from one tweet, with fields separated by a double semicolon “;;”. The first field is the username, the second is the location, and the last is the tweet text. We can load the data into a spreadsheet with “;;” as the delimiter, or apply other libraries like pandas, numpy, and re to further clean the data.
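As a sketch, loading the “;;”-delimited output file from the listener above into pandas could look like this:

import pandas as pd

# a multi-character separator requires pandas' pure-Python parser
df = pd.read_csv("tweets.txt", sep=";;", engine="python",
                 names=["username", "location", "tweet"])
print(df.head())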

Scraping Twitter with Octoparse

Please note that the new version 8.4 has been launched. Contact the customer service team @octoparse.com if you have any questions while scraping data!

Unlike scraping with Python, we don’t need to start by applying for an API; simply entering the URL into Octoparse will do.

  • Click “+ Task” to start a task using Advanced Mode
  • Paste the URL into the “Website” box and click “Save” to move on
  • To fully load the listings, we need to scroll the page down to the bottom repeatedly, so we’ll set “Scroll down” for the “Go To Web Page” step:
  • Check the box for “Scroll down to the bottom of the page when finished loading”
  • Set “Scroll times” as “20” and “Interval” as “3” seconds (This is for demonstration, and you can set the numbers based on your needs)
  • Select “Scroll down to the bottom of the page” as the “Scroll way” and click the “OK” button

“Interval” is the time between two consecutive scrolls. Theoretically, the higher the number we set for “Scroll times”, the more data we can extract.

  • Click on the first item of the listing page
  • Click on the second item

In the “Action Tips” panel, we can go on to select “Extract the text of the selected elements”.

Then a “Loop Item” will be automatically generated and added to the workflow. By default, Octoparse extracts from the selected item; we can also delete that field and add the data fields we need.


After that, we can rename the data fields and start the extraction.

Based on the above demo, we can now conclude the pros and cons of coding with Python and Octoparse.

1. Learning cost: Octoparse has a lower learning cost than Python. To build a crawler with Python, you not only need to be familiar with various libraries and coding techniques, but also need to understand web page structure and recognize anti-scraping techniques. In Octoparse, the developers have already handled these situations for you; all you need is a few clicks, and the data is ready.

2. Setup speed: From the demo above, both methods seem simple and easy to use. However, the most time-consuming part is not building the crawler itself; it is the initial analysis of the website before you start building. Different websites use different development methods and anti-scraping techniques, so coding your own crawler means spending more time analyzing the site first. The developers of Octoparse have already accounted for most of these situations, so you can get to your data without analyzing the website yourself.

3. Flexibility: Python offers better flexibility than Octoparse. We can change a crawler’s behavior by editing a few lines of code, and we can import powerful libraries or APIs to access data with just a few lines more. Even some of the toughest anti-scraping measures, such as Captcha or reCaptcha, can now be handled with deep-learning methods in Python.

Honestly, Python and Octoparse each have their strong points. Octoparse is more suitable for people with no coding skills, while Python provides great flexibility for experts. If learning Python coding sounds daunting, please contact us! Octoparse provides data services and can help you extract data with only a few clicks.

Feel free to contact us when you need a powerful web-scraping tool for your business or project!
