Web Scraping Case Study | Scrolling down to scrape Tweets
Saturday, October 08, 2016 1:42 AM

Brief Intro
More and more websites update their content dynamically. A dynamic web page uses client-side scripts that run in the browser as the page loads: JavaScript, AJAX, DHTML and similar technologies determine how the HTML is fetched and parsed into the Document Object Model (DOM) that represents the loaded page. Such pages add new items as users scroll down. Twitter, Facebook and Amazon are all typical dynamic websites: they display only a certain amount of content initially and load more when you scroll down. In this tutorial, I will use Twitter as an example to show you how to scrape a dynamic website automatically.
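To make the idea concrete, here is a minimal Python sketch (not Octoparse code) of how an infinite-scroll feed behaves: each simulated scroll triggers one AJAX-style request for the next batch of items, and loading stops when the server has nothing more to return. The function names and data are hypothetical.

```python
# Conceptual sketch of infinite scrolling: every "scroll" fetches one
# more page of items, the way Twitter's feed loads tweets on demand.
# fetch_page() and its data are made up for illustration.

def fetch_page(page, page_size=3):
    """Simulate a server endpoint that returns one 'screen' of tweets."""
    all_tweets = [f"tweet {i}" for i in range(10)]  # pretend backend data
    start = page * page_size
    return all_tweets[start:start + page_size]

def load_with_scrolling(max_scrolls):
    """Each simulated scroll triggers one AJAX-style fetch for more items."""
    items = []
    for page in range(max_scrolls):
        batch = fetch_page(page)
        if not batch:          # empty batch: we have reached the bottom
            break
        items.extend(batch)
    return items

print(len(load_with_scrolling(5)))  # → 10 (all tweets loaded in 4 fetches)
```

This is exactly why a scraper must scroll: until the scrolls happen, most of the content simply does not exist in the DOM yet.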
Features covered in this tutorial:
- Scrolling down
- Setting AJAX
- Building a loop list
(Download my extraction task of this tutorial HERE just in case you need it.)
Now, let's get started!
Step 1. Set up basic information and navigate to the target website
- Click "Start" (Advanced Mode)
- Complete the “Basic Information”
- Click “Next”
- Enter the target URL in the built-in browser ( URL of the example: https://twitter.com/search?q=news&src=typd&lang=en )
- Click "Go" icon to open webpage
Step 2. Scroll down to load the web content
We are now on the search result page. Wait until the page has fully loaded.
- Select the “Advanced Options”
- Choose “Scroll down to page bottom when finished loading”
- Then enter how many times you want to scroll
- Here I choose “Scroll down for one screen”. You can also choose “Scroll down to bottom of the page”
Note that we suggest not setting an excessively high number of "Scroll times", such as 10,000 or more.
- Click “Save”.
Step 3. Create a list of items
Move your cursor over the articles with a similar layout; these are the items whose content we want to extract.
- Click anywhere on the first section of the web page
- When prompted, Click “Create a list of items” (sections with similar layout)
- Click “Add current item to the list”
Now that the first item has been added to the list, we need to add the remaining items.
- Click “Continue to edit the list”
- Click a second section with similar layout
- Click “Add current item to the list” again
Now all the sections with a similar layout have been added to the list.
- Click “Finish Creating List”
- Click "Loop". This action tells Octoparse to click each section in the list and extract the selected data.
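The "list of items" step boils down to finding every page section that shares the same layout and visiting each one in turn. The following Python sketch shows the same idea with the standard library's `html.parser`; the sample HTML and the `class="tweet"` marker are invented for illustration and are not Twitter's real markup.

```python
# Conceptual sketch of a "loop list": collect text from every section
# that shares the same layout (here, <div class="tweet">).
# The HTML snippet and class name are made up for illustration.
from html.parser import HTMLParser

class TweetListParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_tweet = False
        self.tweets = []

    def handle_starttag(self, tag, attrs):
        # Treat every div with the same class as one item of the list
        if tag == "div" and ("class", "tweet") in attrs:
            self.in_tweet = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_tweet = False

    def handle_data(self, data):
        if self.in_tweet and data.strip():
            self.tweets.append(data.strip())

html = """
<div class="tweet">Defense News</div>
<div class="tweet">Breaking update</div>
<div class="tweet">Weather report</div>
"""
parser = TweetListParser()
parser.feed(html)
print(parser.tweets)  # one entry per similar-layout section
```

Octoparse does this selection visually, but the underlying logic is the same: one selector matches many sections, and the loop extracts from each match.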
Step 4. Select the data to be extracted
- Click the data field “Defense News”.
- Select “Extract text”
- Follow the same steps to extract the other data.
- Click "Save"
Step 5. Rename data fields
All the selected content appears under "Data Fields".
- Click a "Field Name" to rename it.
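In plain Python terms, renaming fields is just mapping default column names to descriptive ones, as in this small sketch. The default names (`Field1`, `Field2`) and the sample data are hypothetical.

```python
# Sketch of the "rename data field" step: map auto-generated field
# names to descriptive ones. Names and values are made up.
records = [
    {"Field1": "Defense News", "Field2": "2016-10-08"},
    {"Field1": "Weather report", "Field2": "2016-10-07"},
]
rename = {"Field1": "tweet_text", "Field2": "date"}

# Rebuild each record with the new key names
renamed = [{rename.get(k, k): v for k, v in row.items()} for row in records]
print(renamed[0]["tweet_text"])  # → Defense News
```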
Step 6. Start running your task
- After saving your extraction configuration, click "Next"
- Select “Local Extraction”
- Click “OK” to run the task on your computer.
Octoparse will automatically extract all the data selected.
Step 7. Check the data and export
- Check the data extracted
- Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer
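If you later want to reproduce the export step in a script of your own, Python's standard `csv` module writes the same kind of spreadsheet-friendly file. This sketch uses an in-memory buffer and made-up data; for a real file, swap the buffer for `open("tweets.csv", "w", newline="")`.

```python
# Sketch of the export step with Python's csv module: write extracted
# rows to CSV, which Excel can open directly. Data values are made up.
import csv
import io

rows = [
    {"tweet_text": "Defense News", "date": "2016-10-08"},
    {"tweet_text": "Weather report", "date": "2016-10-07"},
]

buf = io.StringIO()  # stands in for a real file here
writer = csv.DictWriter(buf, fieldnames=["tweet_text", "date"])
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue().splitlines()[0])  # header row: tweet_text,date
```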
Author: The Octoparse Team