How to Extract Data from Twitter Without Coding
Wednesday, June 3, 2020
In this tutorial, I’ll show you how to scrape data from Twitter in 5 minutes without using the Twitter API, Tweepy, or Python, and without writing a single line of code.
To extract data from Twitter, you can use an automated web scraping tool: Octoparse. Octoparse is a web scraper that simulates human interaction with web pages. It allows you to extract the information you see on a website, including Twitter. With its point-and-click interface, you can build a customized crawler and extract the tweets of an account, tweets containing certain hashtags, tweets posted within a specific time frame, and so on. You can then export the extracted data to Excel, CSV, HTML, or SQL, or stream it into your database in real time via the Octoparse APIs.
Read case study: Scrape Twitter discussions for sentiment analysis
Before we get started, you can click here to download Octoparse and install it on your computer. I recommend downloading version 8 because it's more beginner-friendly. Now, let’s take a look at how to build a Twitter crawler in Octoparse within 3 minutes.
Step 1: Enter the URL and set up pagination
Read: What's pagination?
Let’s say we are trying to scrape all the tweets of a certain handle. In this case, we are scraping the official Twitter account of Octoparse. As you can see, the website is loaded in the built-in browser. Many websites have a “next page” button that Octoparse can click to move from page to page and grab more information. Twitter, however, uses “infinite scrolling”, which means that you need to scroll down the page to let Twitter load a few more tweets, and then extract the data shown on the screen. So the final extraction process will work like this: Octoparse scrolls down the page a little, extracts the tweets, scrolls down a little more, extracts again, and so on.
To tell the crawler to scroll down the page repeatedly, we can build a pagination loop by clicking on a blank area of the page and then clicking “loop click single element” on the Tips panel. A pagination loop now appears in the workflow area, which means that we’ve set up pagination successfully.
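For readers curious about what the scroll-then-extract loop does conceptually, here is a minimal Python sketch. It is not how Octoparse is implemented: `FakeTimeline` is a hypothetical stand-in for an infinitely scrolling page, and the window and step sizes are made-up values. The sketch also assumes one extraction before the first scroll, so that the tweets visible at the start are not missed, and it skips tweets that stay on screen across scrolls.

```python
from typing import List


class FakeTimeline:
    """Hypothetical stand-in for a page that renders only a few tweets at a time."""

    def __init__(self, all_tweets: List[str], window: int = 3, step: int = 2) -> None:
        self._tweets = all_tweets
        self._window = window  # number of tweets rendered on screen at once
        self._step = step      # number of new tweets revealed by one scroll
        self._top = 0          # index of the first rendered tweet

    def visible(self) -> List[str]:
        """Return the tweets currently rendered on screen."""
        return self._tweets[self._top:self._top + self._window]

    def scroll_down(self) -> None:
        """Move the rendered window down, stopping at the end of the feed."""
        last_top = max(len(self._tweets) - self._window, 0)
        self._top = min(self._top + self._step, last_top)


def scrape(page: FakeTimeline, scrolls: int) -> List[str]:
    """Alternate extracting and scrolling, keeping each tweet only once."""
    seen = set()
    collected = []
    for _ in range(scrolls + 1):  # one extraction before the first scroll
        for tweet in page.visible():
            if tweet not in seen:  # consecutive windows overlap
                seen.add(tweet)
                collected.append(tweet)
        page.scroll_down()
    return collected


timeline = FakeTimeline([f"tweet {i}" for i in range(10)])
result = scrape(timeline, scrolls=4)
print(len(result), "tweets collected")
```

The deduplication step matters because consecutive screens overlap, so the same tweet can be visible both before and after a scroll.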
Step 2: Build a loop item to extract the data
Read: What's a loop item?
Now, let’s extract the tweets. Let’s say we want to get the handle, publish time, text content, and the number of comments, retweets, and likes.
First, let’s build an extraction loop to get the tweets one by one. We can hover the cursor on the corner of the first tweet and click on it. When the whole tweet is highlighted in green, it means that it is selected. Repeat this action on the second tweet. As you can see, Octoparse is an intelligent bot and it has automatically selected all the following tweets for you. Click on “extract text of the selected elements” and you will find an extraction loop is built in the workflow.
But we want to extract the different data fields into separate columns instead of just one, so we need to modify the extraction settings and select our target data manually. This is easy to do. Go into the "action setting" of the “extract data” step, click on the handle, and click “extract the text of the selected element”. Repeat this action for every data field you want. Once you are finished, delete the first giant column, which we don’t need, and save the crawler. Now, our final step awaits.
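To picture what "separate columns" means for the exported data, here is a small Python sketch that writes one row per tweet and one column per field. The column names and the two sample tweets are invented for illustration and are not Octoparse output; the sketch also assumes that a field missing from the page becomes an empty cell.

```python
import csv
import io

# Hypothetical column names, one per data field chosen in the tutorial.
FIELDS = ["handle", "publish_time", "text", "comments", "retweets", "likes"]

# Made-up sample rows; the second tweet has no comment or retweet counts.
tweets = [
    {"handle": "@example_a", "publish_time": "2020-06-01", "text": "Hello, world",
     "comments": "3", "retweets": "5", "likes": "12"},
    {"handle": "@example_b", "publish_time": "2020-06-02", "text": "No replies yet",
     "likes": "1"},
]


def to_csv(rows, fields=FIELDS):
    """Serialize the rows to CSV text, leaving missing fields as empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(tweets)
print(csv_text)
```

Keeping each field in its own column is what makes the export easy to sort and filter later, for example by number of likes.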
Step 3: Modify the pagination setting and run the crawler
We built a pagination loop earlier, but the workflow settings still need a small adjustment. Because we want Twitter to load the content fully before the bot extracts it, let’s set the AJAX timeout to 5 seconds, which gives Twitter 5 seconds to load after each scroll. Then, let’s set both the scroll repeats and the wait time to 2 to make sure that Twitter loads the content successfully. Now, for each scroll, Octoparse will scroll down 2 screens and wait 2 seconds on each screen.
Head back to the loop item settings and set the number of loop repetitions to 20. This means that the bot will repeat the scrolling 20 times. You can now run the crawler on your local device to get the data, or run it on Octoparse Cloud servers to schedule your runs and save local resources. Note that blank cells in the columns mean that the page had no data for that field, so nothing was extracted.
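As a rough sanity check on these settings, the short Python sketch below estimates how many screens a run covers and how long the waiting adds up to. The way the timings combine is an assumption made for illustration (the AJAX timeout is counted once per loop repetition, on top of the per-screen waits); Octoparse may schedule them differently, so treat the result as an upper-bound estimate and not a measured figure.

```python
AJAX_TIMEOUT_S = 5     # AJAX timeout set in this step
SCROLL_REPEATS = 2     # screens scrolled per loop repetition
WAIT_PER_SCROLL_S = 2  # seconds waited on each screen
LOOP_COUNT = 20        # number of loop repetitions


def run_budget(loop_count, scroll_repeats, wait_s, ajax_timeout_s):
    """Return (screens scrolled, seconds of scroll waits, assumed worst-case seconds)."""
    screens = loop_count * scroll_repeats
    scroll_wait = screens * wait_s
    # Assumption: the full AJAX timeout is spent once per loop repetition.
    worst_case = scroll_wait + loop_count * ajax_timeout_s
    return screens, scroll_wait, worst_case


screens, scroll_wait, worst_case = run_budget(
    LOOP_COUNT, SCROLL_REPEATS, WAIT_PER_SCROLL_S, AJAX_TIMEOUT_S
)
print(f"{screens} screens, {scroll_wait} s of scroll waits, up to {worst_case} s in total")
```

Under these assumptions, 20 repetitions cover 40 screens with 80 seconds of scroll waits, and at most 180 seconds once the AJAX timeouts are included. Raising the loop count scales all three numbers linearly.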
Update: If you are on Octoparse 8.4, check the latest tutorial here.
If you have any questions on scraping Twitter or any other websites, email us at firstname.lastname@example.org. We are so ready to help!