Web Scraping Case Study | Scrolling down to scrape Tweets
Saturday, October 8, 2016 1:42 AM
With Octoparse, you can easily scrape data such as top news, hot topics, and worldwide trends from a variety of social media websites, including Twitter. By scraping data from Twitter, you can (1) keep up with the latest trends worldwide, (2) identify potential customers for your business, and (3) analyze the marketing value of hot topics.
You can go to "Task Templates" on the main screen of the Octoparse scraping tool and start with the ready-to-use Twitter Template to save time. With this template, there is no need to configure a scraping task yourself. For further details, see: Task Templates
If you would like to know how to build the task from scratch, you may continue reading the following tutorial or check the video below.
You may need this link to follow through:
Here are the main steps in this tutorial. [Download the demo task: click here]
1. Go to Web Page - Open the target web page
- Enter the URL on the home page and click Start
This target webpage is a news page from Twitter that doesn't require login. To extract data behind login, refer to the following tutorial:
2. Create a Loop Item and extract data - loop extract each tweet
- Select the first tweet on the web page (make sure to select the whole tweet block; it turns green when the whole tweet is selected)
- Continue to select the second tweet
- Choose Extract text of the selected elements
3. Create a Loop Item - to scroll down the web page
- Add a new Loop Item in the workflow
- Drag the original loop inside the new loop (Loop Item inside Loop Item1)
- Click the Loop Item1
- Set its Loop Mode as Scroll Page
- Set scroll pattern to Scroll for one screen, wait time to 1s, and Repeats to 100
- Remember to tick "Capture data as page scrolls dynamically (possibly duplicates)"
- Click "Apply" to confirm
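To see why nesting the extraction loop inside a scrolling loop captures the same tweet more than once, here is a minimal Python simulation. It does not use Octoparse or Twitter: `FakeTimeline` is a hypothetical stand-in for a page that shows a small window of tweets and reveals one more per scroll, and the order of "extract, then scroll, then wait" is an assumption made for illustration.

```python
import time


class FakeTimeline:
    """Hypothetical page: shows a sliding window of tweets (not a real API)."""

    def __init__(self, total, window=3):
        self.tweets = [f"tweet {i}" for i in range(total)]
        self.window = window
        self.top = 0

    def visible(self):
        # Tweets currently rendered on screen.
        return self.tweets[self.top:self.top + self.window]

    def scroll_one_screen(self):
        # Assume each scroll reveals only one new tweet, until the end.
        last_top = max(len(self.tweets) - self.window, 0)
        self.top = min(self.top + 1, last_top)


def scrape(page, repeats, wait_seconds):
    rows = []
    for _ in range(repeats):           # outer loop, like Loop Item1 (Scroll Page)
        for tweet in page.visible():   # inner loop, like Loop Item (each tweet)
            rows.append(tweet)
        page.scroll_one_screen()
        time.sleep(wait_seconds)       # wait for new content to load
    return rows


# The tutorial uses 100 repeats and a 1s wait; smaller values keep the demo fast.
rows = scrape(FakeTimeline(total=5), repeats=4, wait_seconds=0)
print(len(rows), "rows captured,", len(set(rows)), "unique")
```

In this sketch most rows are repeats, which mirrors the "possibly duplicates" note on the capture option.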
4. Modify the Loop Item XPath
- Click the "Loop Item" (not Loop Item1!) and enter the XPath //article[@role="article"]/../../..
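The XPath above matches every article element whose role is "article" and then steps up three parent levels, so the loop iterates over the outer container of each tweet. The sketch below shows that behavior with Python's standard xml.etree.ElementTree module. The markup is a simplified, hypothetical stand-in (the real Twitter page structure is not reproduced here), and ElementTree needs the relative form `.//` at the start of the path.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup: each <article> sits three levels
# below the <div> that wraps the whole tweet block.
PAGE = """
<section>
  <div class="block">
    <div><div>
      <article role="article"><span>First tweet</span></article>
    </div></div>
  </div>
  <div class="block">
    <div><div>
      <article role="article"><span>Second tweet</span></article>
    </div></div>
  </div>
</section>
"""

root = ET.fromstring(PAGE)

# Find each matching <article>, then climb three parents with "..".
blocks = root.findall(".//article[@role='article']/../../..")

# Collect the text inside each selected block.
texts = ["".join(block.itertext()).strip() for block in blocks]
print(texts)
```

Each entry in `blocks` is the outer wrapper of one tweet, which is the element the inner loop would step through.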
5. Extract Data - select text to scrape
- Click the "Extract Data" action and you will see a tweet being highlighted in red
- Select text within the red area (name, time, text, reply, retweet, like) and choose to "Extract the text of selected element"
Double-click a field at the bottom of the page to rename it.
You may have noticed that the tweet's post time is shown in a relative form such as "20m", which makes it hard to tell the exact post date and time. You can modify this field to get the detailed time.
- Click the More button of the field
- Choose the Customize field
- Select to extract the attribute of datetime
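Once the datetime attribute is extracted, the value can be converted into a proper timestamp after export. The snippet below is a small sketch that assumes the attribute holds an ISO 8601 UTC string with milliseconds and a trailing "Z"; the sample value is made up, so check the format of your own export before relying on it.

```python
from datetime import datetime, timezone

# Assumed example of an extracted datetime attribute value (ISO 8601, UTC).
raw = "2016-10-08T01:42:00.000Z"

# Parse the string and mark the result as UTC explicitly.
posted = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

print(posted.isoformat())
```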
6. Start data extraction - run your task and get data
- Click Save
- Click Run on the upper right side
- Select Run on your device to run the task on your computer, or select Run in the Cloud to run the task in the Cloud (for premium users only)
You can export the resulting data in formats such as Excel, CSV, and JSON, or to your database.
Here is the sample output.
Duplicates are normal: each time the page scrolls, it loads only one or two new tweets, so tweets that are still on screen get captured again.
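If you would rather clean up duplicates after exporting, a short script can do it. The example below is a sketch in Python that assumes a CSV export with hypothetical column names (name, time, text); your columns depend on how you renamed the fields. It keeps the first occurrence of each identical row.

```python
import csv
import io

# Hypothetical export; real column names depend on your field names.
EXPORT = """name,time,text
Alice,2016-10-08T01:42:00.000Z,Hello world
Bob,2016-10-08T01:43:00.000Z,Second tweet
Alice,2016-10-08T01:42:00.000Z,Hello world
"""


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# For a real file, pass open("export.csv", newline="") instead of StringIO.
rows = list(csv.DictReader(io.StringIO(EXPORT)))
unique_rows = drop_duplicates(rows)
print(len(rows), "rows,", len(unique_rows), "unique")
```

Counts such as replies or likes can change between captures, so if you extract them you may prefer to build the key from stable fields only, for example name, time, and text.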
Happy Data Hunting!
Author: The Octoparse Team