With a reported 211 million daily active users, Twitter has proven its worth in social media marketing. Users post an average of 6,000 tweets every second, which adds up to over 500 million tweets a day. All of this chatter is a treasure chest of valuable information for marketers, brands, researchers, and analysts. Marketers and brands often scrape Twitter data from specific accounts (such as influencers and competitors) to analyze engagement and plan effective strategies.
Due to popular demand, the Octoparse team has prepared a series of tutorials for users who need Twitter data. This tutorial is the second in that series.
In this post, we are going to teach you how to scrape tweets from a public account.
If you don't want to bother creating a custom crawler on your own, you can search for a ready-to-use Twitter Task Template from the main screen to save some time.
If you want to know how to build the task from scratch, continue reading this tutorial or watch the video below.
You can use the following sample link to follow through:
The main steps are shown in the menu on the right, and you can download the sample task file here.
1. Create a Go to Web Page - to open the target Twitter link
Every workflow in Octoparse starts by telling Octoparse which web page to open.
Enter the sample URL into the search bar at the top of the home screen and click Start.
2. Log into Twitter in Browse mode - to save cookies for authentication
Twitter forbids direct access to followers/following lists unless you've logged in first.
Turn on Browse mode and log into Twitter as you would in a normal browser
Click the Go to Web Page action to open its settings panel (located at the bottom right)
Go to the Options tab and tick Use cookies
Click Use cookie from the current page
Click Apply to save the settings
Turn off Browse mode
We have now saved the login information in the task workflow, so the task is logged into our Twitter account whenever it runs.
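To illustrate what saving cookies achieves, here is a minimal Python sketch of the general idea: a logged-in session is represented by cookie name/value pairs that are sent again with later requests. This is not how Octoparse implements the Use cookies option internally, and the cookie names and values below are hypothetical placeholders.

```python
from http.cookies import SimpleCookie


def cookie_header(cookies: dict) -> str:
    """Join cookie name/value pairs into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Hypothetical cookies captured after logging in (names and values are made up).
saved = {"auth_token": "abc123", "ct0": "xyz789"}

# The header a client would send on later requests to stay logged in.
header = cookie_header(saved)

# Parsing the header back shows that the same pairs are recoverable.
jar = SimpleCookie()
jar.load(header)
restored = {name: morsel.value for name, morsel in jar.items()}

print(header)
print(restored)
```

As long as the saved cookies remain valid, the site treats each run as the same logged-in session. If they expire, repeating the login steps above should refresh them.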
3. Create a Loop Item - to loop through each tweet
Next, we need to create a loop for all the tweets.
Select the first tweet on the web page (make sure to select the whole tweet block; it turns green when the whole tweet is selected)
Continue to select the second tweet
Choose Text from the Tips Panel
4. Create another Loop Item - to scroll down the web page
Twitter uses infinite scroll to load content dynamically, so the task workflow needs a few tweaks to minimize data loss.
Add a new Loop Item to the workflow
Drag the original loop inside the new loop (Loop Item inside Loop Item1)
Click Loop Item1 and set its Loop Mode to Scroll Page in the General tab
Set the scroll pattern to scroll for one screen, with a wait time of 1s, and repeat 100 times (or more)
Tick Capture data as page scrolls dynamically (possibly duplicates) (Important!)
Click Apply to confirm
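The reason for capturing data while the page scrolls can be sketched with a small Python simulation. It assumes, purely for illustration, that only a few tweets are rendered at any moment and that older ones are unloaded as new ones appear. The window size, step, and function names are hypothetical and do not reflect Octoparse internals.

```python
from typing import List


def visible_window(timeline: List[str], top: int, size: int) -> List[str]:
    """Return the tweets currently rendered; anything above `top` is unloaded."""
    return timeline[top:top + size]


def scrape_while_scrolling(timeline: List[str], size: int = 3,
                           step: int = 1, repeats: int = 100) -> List[str]:
    """Capture the visible tweets after every scroll (duplicates expected)."""
    captured: List[str] = []
    top = 0
    for _ in range(repeats):
        captured.extend(visible_window(timeline, top, size))
        if top + size >= len(timeline):  # reached the end of the timeline
            break
        top += step
    return captured


def scrape_after_scrolling(timeline: List[str], size: int = 3,
                           step: int = 1, repeats: int = 100) -> List[str]:
    """Scroll to the end first, then capture only what is still rendered."""
    top = 0
    for _ in range(repeats):
        if top + size >= len(timeline):
            break
        top += step
    return visible_window(timeline, top, size)


timeline = [f"tweet-{i}" for i in range(1, 8)]  # seven made-up tweets
during = scrape_while_scrolling(timeline)
after = scrape_after_scrolling(timeline)

print(len(during), "rows captured while scrolling,", len(set(during)), "unique")
print("captured only after scrolling:", after)
```

In this toy model, capturing during the scroll sees every tweet at the cost of repeated rows, while capturing only at the end misses everything that was unloaded along the way.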
5. Rewrite some of the XPath - to locate the web elements more accurately
The auto-generated XPath may not be accurate enough, so we need to rewrite the XPath for some data fields.
Click Loop Item (not Loop Item1!) and enter the XPath //article[@role="article"]/../../..
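To see what this XPath selects, the following Python sketch runs it against a simplified, made-up page structure using the standard library's ElementTree. The markup is an assumption for illustration only and is far simpler than Twitter's real HTML. ElementTree requires a relative path, so the expression is prefixed with a dot.

```python
import xml.etree.ElementTree as ET

# Made-up markup: each tweet's <article> sits three levels below its cell.
PAGE = """
<main>
  <div class="cell">
    <div><div>
      <article role="article"><span>First tweet</span></article>
    </div></div>
  </div>
  <div class="cell">
    <div><div>
      <article role="article"><span>Second tweet</span></article>
    </div></div>
  </div>
  <div class="cell"><div><div><aside>Who to follow</aside></div></div></div>
</main>
"""

root = ET.fromstring(PAGE)

# Find every tweet article, then step up three levels with "..".
cells = root.findall('.//article[@role="article"]/../../..')

# Collect the text inside each matched cell.
texts = ["".join(cell.itertext()).strip() for cell in cells]

for cell, text in zip(cells, texts):
    print(cell.get("class"), "->", text)
```

Each `..` moves one level up from the matched article, so the loop item becomes the outer block that wraps a tweet. Blocks that contain no tweet article, such as the third cell here, are not matched.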
6. Add more data fields - to scrape the desired data
Click Extract Data
Select any text you want to scrape
Choose Text from the Tips Panel
Repeat the action to extract the name, time, text, reply count, retweet count, and like count
Double-click a field header to rename the field
You may notice that Tweet post time is shown as "3m". We need to clean the data field to show the exact post date/time.
Click the More button on the field
Choose Customize field
Choose to extract the DateTime attribute instead of the displayed text
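The difference between the displayed text and the attribute can be sketched in Python. The snippet assumes the post time is rendered as a time element whose datetime attribute holds an ISO 8601 timestamp. The sample markup and timestamp are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def parse_tweet_time(datetime_attr: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' (UTC) is rewritten first
    because datetime.fromisoformat() rejects it on older Python versions."""
    return datetime.fromisoformat(datetime_attr.replace("Z", "+00:00"))


# Made-up element: the visible text is relative, the attribute is exact.
element = ET.fromstring('<time datetime="2022-03-01T12:34:56.000Z">3m</time>')

displayed = element.text           # what the page shows
exact = element.get("datetime")    # what the customized field extracts
posted = parse_tweet_time(exact)

print("displayed:", displayed)
print("attribute:", exact)
print("parsed:", posted)
```

The displayed text depends on when the page was loaded, whereas the attribute value stays the same and can be sorted or filtered after export.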
7. Run the task - to get your desired data
Click Save on the upper right to save your task
Click Run next to it and wait for a Run Task window to pop up
Select Run on your device to run the task on your local device
Wait for the task to complete
Here is the sample output from a local run.
Tip: It is normal to get duplicates, since each scroll loads only one or two new tweets.
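If you prefer to clean up duplicates after exporting, a minimal Python sketch follows. It assumes the data was exported as CSV. The column names and sample rows are hypothetical, so adjust them to match your own field names.

```python
import csv
import io
from typing import Dict, Iterable, List, Sequence


def drop_duplicate_rows(rows: Iterable[Dict[str, str]],
                        key_fields: Sequence[str]) -> List[Dict[str, str]]:
    """Keep the first occurrence of each row, judged by the key fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Stand-in for an exported file; replace with open("your_export.csv", newline="").
exported = io.StringIO(
    "name,time,text\n"
    "Alice,2022-03-01T12:00:00.000Z,Hello\n"
    "Alice,2022-03-01T12:00:00.000Z,Hello\n"
    "Bob,2022-03-01T12:05:00.000Z,Hi there\n"
    "Alice,2022-03-01T12:00:00.000Z,Hello\n"
)

rows = list(csv.DictReader(exported))
unique_rows = drop_duplicate_rows(rows, ("name", "time", "text"))

print(len(rows), "rows exported,", len(unique_rows), "unique")
```

Using the name, time, and text together as the key keeps distinct tweets apart while removing rows that were captured more than once during scrolling.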
Local runs are great for task troubleshooting and quick runs. If you are dealing with more complicated tasks, it is recommended that you select Run in the Cloud to run the task in Octoparse's cloud-based platform for higher speed. Try out this premium feature by signing up for the 14-day free trial here. You can also schedule your task to run hourly, daily, or weekly and get data delivered to you regularly.