How to Extract Data from Twitter Without Coding
Wednesday, June 03, 2020
In this tutorial, I’ll show you how to scrape Twitter data in 5 minutes without using the Twitter API, Tweepy, or Python, and without writing a single line of code.
To extract data from Twitter, you can use an automated web scraping tool, Octoparse. Because Octoparse simulates human interaction with a webpage, it lets you pull the information you see on a website such as Twitter. For example, you can extract the tweets of a given handle, tweets containing certain hashtags, or tweets posted within a specific time frame. All you need to do is grab the URL of your target webpage and paste it into Octoparse's built-in browser. With a few clicks, you can build a crawler from scratch by yourself. When the extraction is complete, you can export the data to Excel, CSV, HTML, or SQL, or stream it into your database in real time via the Octoparse APIs.
Read case study: Scrape Twitter discussions for sentiment analysis
Before we get started, you can click here to install Octoparse on your computer. Now, let’s take a look at how to build a Twitter crawler within 3 minutes.
Step 1: Input the URL and build a pagination
Read: What's pagination?
Let’s say we are trying to scrape all the tweets of a certain handle. In this case, we are scraping the official Twitter account of Octoparse. As you can see, the website is loaded in the built-in browser. Many websites have a “next page” button that Octoparse can click to move from page to page and grab more information. Twitter, however, uses infinite scrolling, which means that you first need to scroll down the page to let Twitter load a few more tweets, and then extract the data shown on the screen. So the final extraction process will work like this: Octoparse scrolls down the page a little, extracts the tweets, scrolls down a little more, extracts again, and so on.
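For readers curious about what this scroll-then-extract cycle amounts to, here is a minimal Python sketch of the idea. It is only an illustration, not how Octoparse is implemented: `FakeTimeline` and `scrape_infinite_scroll` are hypothetical names, and the "page" is a plain list that reveals a few more items on each scroll.

```python
from typing import Callable, List


def scrape_infinite_scroll(
    load_more: Callable[[], List[str]],
    max_scrolls: int,
) -> List[str]:
    """Alternate 'scroll' and 'extract' until nothing new shows up."""
    seen = set()
    collected: List[str] = []
    for _ in range(max_scrolls):
        visible = load_more()  # scroll once; returns the tweets now on screen
        new_items = [t for t in visible if t not in seen]
        if not new_items:  # nothing new was loaded, so stop early
            break
        seen.update(new_items)
        collected.extend(new_items)
    return collected


class FakeTimeline:
    """Hypothetical stand-in for a page that reveals more tweets per scroll."""

    def __init__(self, tweets: List[str], window: int = 3) -> None:
        self._tweets = tweets
        self._window = window
        self._pos = 0

    def scroll(self) -> List[str]:
        # Overlapping windows mimic tweets that stay on screen between scrolls.
        start = max(0, self._pos - 1)
        self._pos = min(len(self._tweets), self._pos + self._window)
        return self._tweets[start:self._pos]


timeline = FakeTimeline([f"tweet {i}" for i in range(1, 8)])
result = scrape_infinite_scroll(timeline.scroll, max_scrolls=20)
print(result)
```

The `seen` set matters because some tweets remain visible after a scroll; without it, the same tweet would be collected more than once.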
To let the bot scroll down the page repeatedly, we can build a pagination loop by clicking on the blank area and then clicking “loop click single element” on the Tips panel. As you can see here, a pagination loop now appears in the workflow area, which means that we’ve built the pagination successfully.
Step 2: Build a loop item to extract the data
Read: What's loop item?
Now, let’s extract the tweets. Let’s say we want to get the handle, publish time, text content, and the number of comments, retweets, and likes.
First, let’s build an extraction loop to get the tweets one by one. Hover the cursor over the corner of the first tweet and click on it. When the whole tweet is highlighted in green, it is selected. Repeat this action on the second tweet. As you can see, Octoparse then automatically selects all the following tweets for you. Click on “extract text of the selected elements” and you will find that an extraction loop has been built in the workflow.
But we want to extract the different data fields into separate columns instead of just one, so we need to modify the extraction settings to select our target data manually. It is very easy to do this. Make sure you go into the "action setting" of the “extract data” step. Click on the handle, and click “extract the text of the selected element”. Repeat this action to get all the data fields you want. Once you are finished, delete the first giant column, which we don’t need, and save the crawler. Now, our final step awaits.
Step 3: Modify the pagination setting and execute the crawler
We built a pagination loop earlier, but the workflow settings still need a small adjustment. Because we want Twitter to load the content fully before the bot extracts it, let’s set the AJAX timeout to 5 seconds, which gives Twitter 5 seconds to load after each scroll. Then, let’s set both the scroll repeats and the wait time to 2 to make sure that Twitter loads the content successfully. Now, on each pass, Octoparse will scroll down 2 screens and spend 2 seconds on each screen.
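To get a feel for how these settings translate into run time, the arithmetic can be sketched in a few lines of Python. This is a rough estimate only: `ScrollSettings` is a hypothetical container, not an Octoparse object, and it assumes that the scroll waits and the AJAX timeout simply add up on every pass, which may overstate the real duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrollSettings:
    """Hypothetical container for the values entered in this step."""

    ajax_timeout_s: int = 5  # time allowed for content to load
    scroll_repeats: int = 2  # screens scrolled on each pass
    scroll_wait_s: int = 2   # seconds spent on each screen

    def seconds_per_pass(self) -> int:
        # Assumption: the scroll waits and the AJAX timeout simply add up.
        return self.scroll_repeats * self.scroll_wait_s + self.ajax_timeout_s

    def max_run_seconds(self, loop_count: int) -> int:
        # Upper bound for a run that repeats the pass loop_count times.
        return loop_count * self.seconds_per_pass()


settings = ScrollSettings()
print(settings.seconds_per_pass())   # 9
print(settings.max_run_seconds(20))  # 180
```

Under that assumption, one pass takes at most 9 seconds, so 20 passes would finish in roughly 3 minutes at most.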
Head back to the loop item settings and set the number of loop repetitions to 20, so that the bot repeats the scrolling 20 times. You can now run the crawler on your local device to get the data, or run it on Octoparse Cloud servers to schedule your runs and save local resources. Note that blank cells in the columns mean that the page had no data for that field, so nothing was extracted.
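If you later open the exported CSV in a script, the one-column-per-field layout and the blank cells look like the sketch below. It uses only Python's standard `csv` module. The two rows are made-up sample data, and the column names are an assumption based on the fields chosen in Step 2, not the exact headers Octoparse produces.

```python
import csv
import io

# Assumed column names, one per data field selected in Step 2.
FIELDS = ["handle", "published", "text", "comments", "retweets", "likes"]

# Made-up sample rows; the second one has no comment or retweet counts.
rows = [
    {"handle": "@example_a", "published": "2020-06-01",
     "text": "First sample tweet", "comments": "3",
     "retweets": "5", "likes": "12"},
    {"handle": "@example_b", "published": "2020-06-02",
     "text": "Second sample tweet", "likes": "4"},
]


def to_csv(records):
    """Write the records as CSV text, leaving missing fields blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

The second data row comes out as `@example_b,2020-06-02,Second sample tweet,,,4`. The empty cells simply record that no value was present, which is what a blank cell in the exported sheet means.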
If you have any questions on scraping Twitter or any other websites, email us at firstname.lastname@example.org. We are so ready to help!