Scrape tweets from a public Twitter account
Updated over a week ago

You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!

With a reported 211 million daily active users, Twitter has proven its worth in social media marketing. Users post an average of 6,000 tweets every second, which adds up to over 500 million tweets per day. All of this chatter is a treasure chest of valuable information for marketers, brands, researchers, and analysts. Marketers and brands often scrape Twitter data from specific accounts (influencers and competitors) to analyze engagement and plan effective strategies.

Due to popular demand, this tutorial is the second in a series of tutorials that the Octoparse team has prepared for users with a need for Twitter data.

In this post, we are going to teach you how to scrape tweets from a public account.

If you don't want to bother creating a custom crawler on your own, you can search for a ready-to-use Twitter Task Template from the main screen to save some time.

If you want to know how to build the task from scratch, continue with the tutorial below or watch the video.

You can use the following sample link to follow along:

The main steps are shown in the menu on the right, and you can download the sample task file here.


1. Create a Go to Web Page action - to open the target Twitter link

Every workflow in Octoparse starts by telling Octoparse which web page to open.

  • Enter the sample URL into the search bar at the top of the home screen and click Start.


2. Log into Twitter in Browse mode - to save cookies for authentication

Twitter restricts access to much of its content, including followers/following lists, unless you've logged in first.

  • Toggle on Browse mode and log into Twitter as you would in a normal browser

  • Click the Go to Web Page action to open its settings panel (located at the bottom right)

  • Go to the Options tab and tick Use cookies

  • Click Use cookie from the current page

  • Click Apply to save the settings

  • Turn off Browse Mode


We have now saved the login information in the task workflow, so the task will run with our Twitter account already logged in.


3. Create a Loop Item - to loop through each tweet

Next, we need to create a loop for all the tweets.

  • Select the first tweet on the web page (make sure you select the whole tweet block; it turns green when the whole tweet is selected)

  • Continue to select the second tweet

  • Choose Text from the Tips Panel


4. Create another Loop Item - to scroll down the web page

The infinite scroll pattern of Twitter is designed to load content dynamically, requiring a few necessary tweaks in the task workflow to minimize data loss.

  • Add a new Loop Item to the workflow

  • Drag the original loop inside the new loop (Loop Item inside Loop Item1)

  • Click the Loop Item1 and set its Loop Mode to Scroll Page in the General tab

  • Set the scroll pattern to scroll for one screen, with a wait time of 1s, and repeat 100 times (or more)

  • Tick Capture data as page scrolls dynamically (possibly duplicates) (Important!)

  • Click Apply to confirm
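
To see why capturing during the scroll matters, here is a minimal Python sketch of the idea. It is not how Octoparse works internally; it assumes, for illustration only, that the page keeps just a small window of tweets loaded at any moment. The function names, the window size of 4, and the advance of 2 tweets per scroll are all hypothetical.

```python
# Hypothetical model: only `window` tweets are loaded at once, and each
# scroll moves the window forward by `advance` tweets.
def visible_window(timeline, scroll_step, window=4, advance=2):
    start = scroll_step * advance
    return timeline[start:start + window]


def capture_while_scrolling(timeline, scrolls, window=4, advance=2):
    """Extract the visible tweets after every scroll (duplicates expected)."""
    captured = []
    for step in range(scrolls):
        captured.extend(visible_window(timeline, step, window, advance))
    return captured


def capture_after_scrolling(timeline, scrolls, window=4, advance=2):
    """Extract only once, after the last scroll (earlier tweets are lost)."""
    return visible_window(timeline, scrolls - 1, window, advance)


timeline = [f"tweet-{i}" for i in range(10)]
during = capture_while_scrolling(timeline, scrolls=4)
after = capture_after_scrolling(timeline, scrolls=4)

print(len(during), len(set(during)), len(after))
```

In this toy model, capturing on every scroll sees all ten tweets at the cost of repeated rows, while extracting once at the end sees only the last window.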


5. Rewrite some of the XPath - to locate the web elements more accurately

The auto-generated XPath may not be accurate enough, so we need to rewrite it for the loop that selects the tweets.

  • Click Loop Item (not Loop Item1!) and enter the XPath //article[@role="article"]/../../..
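
The effect of the trailing /../../.. can be tried outside Octoparse. The sketch below uses Python's standard xml.etree.ElementTree on a simplified, hypothetical page fragment (the real Twitter markup is more complex and is not reproduced here). ElementTree supports only a subset of XPath and needs the relative form .//, so the expression is adapted slightly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: each tweet's <article> sits three
# levels below the container that holds the whole tweet block.
PAGE = """
<section>
  <div class="cell" id="cell-1">
    <div><div>
      <article role="article"><span>First tweet</span></article>
    </div></div>
  </div>
  <div class="cell" id="cell-2">
    <div><div>
      <article role="article"><span>Second tweet</span></article>
    </div></div>
  </div>
  <div class="cell" id="promo">
    <div><div><aside>Who to follow</aside></div></div>
  </div>
</section>
"""

root = ET.fromstring(PAGE)

# Find every <article role="article">, then step up three parents.
cells = root.findall(".//article[@role='article']/../../..")

cell_ids = [cell.get("id") for cell in cells]
cell_texts = ["".join(cell.itertext()).strip() for cell in cells]
print(cell_ids)
print(cell_texts)
```

Only containers that actually hold a tweet article are matched, so the promo block in this fragment is skipped.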


6. Add more data fields - to scrape the desired data

  • Click Extract Data

  • Select any text you want to scrape

  • Choose Text from the Tips Panel

  • Repeat the action to get the name, time, text, replies, retweets, and likes

  • Double-click each field header to rename it


You may notice that the tweet's post time is shown as a relative value such as "3m". We need to adjust the data field so that it shows the exact post date and time.

  • Click the More button on the field

  • Choose Customize field

  • Choose to extract the DateTime attribute
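
Once the exact timestamp is extracted, it can be converted for further analysis. The following is a minimal Python sketch that assumes the attribute value is a UTC timestamp in the form 2023-01-15T08:30:00.000Z; that format and the sample value are assumptions, so check them against your own output. The function name parse_tweet_time is hypothetical.

```python
from datetime import datetime, timezone


def parse_tweet_time(value):
    """Parse an assumed 'YYYY-MM-DDTHH:MM:SS.fffZ' string into an aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    # The trailing "Z" means UTC, so attach the UTC timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


# Made-up sample value, not taken from a real tweet.
posted = parse_tweet_time("2023-01-15T08:30:00.000Z")
print(posted.isoformat())
```

A timezone-aware value like this can be sorted or compared reliably, which a relative label such as "3m" cannot.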


7. Run the task - to get your desired data

  • Click Save on the upper right to save your task

  • Click Run next to it and wait for a Run Task window to pop up

  • Select Run on your device to run the task on your local device

  • Wait for the task to complete


Here is the sample output from a local run.

[Image: sample output from a local run]

Tip: Duplicates are normal. Each scroll loads only one or two new tweets, so tweets that are still on screen get captured again.
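
If you post-process the exported data yourself, the duplicates are easy to drop. Below is a minimal Python sketch; it assumes each exported row is available as a dictionary and that the fields were renamed to name, time, and text as in step 6. The function name and sample rows are hypothetical.

```python
def drop_duplicates(rows, key_fields=("name", "time", "text")):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows standing in for an export with repeated captures.
rows = [
    {"name": "Example", "time": "2023-01-15T08:30:00.000Z", "text": "Hello"},
    {"name": "Example", "time": "2023-01-15T08:30:00.000Z", "text": "Hello"},
    {"name": "Example", "time": "2023-01-15T09:00:00.000Z", "text": "World"},
]

unique_rows = drop_duplicates(rows)
print(len(rows), len(unique_rows))
```

Engagement counts are left out of the key on purpose, since they can change between two captures of the same tweet.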

Local runs are great for task troubleshooting and quick runs. If you are dealing with more complicated tasks, it is recommended that you select Run in the Cloud to run the task in Octoparse's cloud-based platform for higher speed. Try out this premium feature by signing up for the 14-day free trial here. You can also schedule your task to run hourly, daily, or weekly and get data delivered to you regularly.
