Web Scraping Case Study | Scrolling down to scrape Tweets

Saturday, October 08, 2016 1:42 AM

Brief Intro

More and more websites update their content dynamically. A dynamic web page uses scripts that run in the browser as the page loads: JavaScript, AJAX, DHTML and similar technologies modify the Document Object Model (DOM) that represents the loaded page, adding new items as the user scrolls down. Twitter, Facebook and Amazon are typical dynamic websites: they display only a limited amount of content at first and load more when you scroll down. In this tutorial, I will use Twitter as an example to show you how to scrape a dynamic website automatically.
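To see why an ordinary downloader is not enough here, consider a minimal Python sketch (the requests library and the URL are illustrative assumptions, not part of Octoparse): a plain HTTP GET returns only the initial HTML, without the items that JavaScript appends on scroll.

    # A minimal sketch, assuming the requests library is installed.
    # A plain HTTP GET returns only the initial HTML; the tweets that
    # JavaScript appends on scroll are NOT in this response.
    import requests

    url = "https://twitter.com/search?q=web%20scraping"  # illustrative URL
    html = requests.get(url).text
    print(len(html))  # only the initially rendered markup is counted here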

 

Features covered in this tutorial

  • Scrolling down
  • Setting AJAX
  • Building a loop list

(Download my extraction task file for this tutorial HERE, just in case you need it.)

 

Now, let's get started!

 

Step 1. Set up basic information and navigate to the target website

 

Step 2. Scroll down to load the web content

We are now on the search results page. Wait until the page has finished loading.

  • Select the “Advanced Options”
  • Choose “Scroll down to page bottom when finished loading”
  • Then enter how many times you want to scroll
  • Here I choose “Scroll down for one screen”. You can also choose “Scroll down to bottom of the page”.

Note: we suggest that you do not set the "Scroll times" too high, e.g. 10,000 or more. (A code sketch of this scrolling step follows the list below.)

  • Click “Save”.
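For readers curious about what happens under the hood at this step, here is a rough Python/Selenium sketch of the same scrolling behavior. This is an assumption about the mechanism, not Octoparse's actual code; the URL and wait times are placeholders.

    # A rough code equivalent of the "scroll down" option, using Selenium.
    # Octoparse performs this scrolling for you; this is illustrative only.
    import time
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://twitter.com/search?q=web%20scraping")  # placeholder URL

    SCROLL_TIMES = 20  # keep this modest, as advised above
    for _ in range(SCROLL_TIMES):
        # "Scroll down for one screen"; to mimic "Scroll down to bottom
        # of the page" instead, scroll by document.body.scrollHeight.
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        time.sleep(2)  # give the AJAX requests time to load new tweets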

 

Step 3. Create a list of items

Move your cursor over one of the articles with a similar layout, from which you want to extract content.

  • Click anywhere on the first section on the web page
  • When prompted, click “Create a list of items” (sections with similar layout)
  • Click “Add current item to the list”

Now that the first item has been added to the list, we need to add the remaining items to it.

  • Click “Continue to edit the list”
  • Click a second section with similar layout
  • Click “Add current item to the list” again

Now all the sections with a similar layout have been added to the list.

  • Click “Finish Creating List”
  • Click “Loop”. This tells Octoparse to go through each section in the list and extract the selected data (a rough code equivalent of this loop follows the list).
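In code terms, the loop list amounts to finding every section that matches a shared pattern and iterating over it. This sketch continues the Selenium example above; the CSS selector is a guess at Twitter's markup, so inspect the live page for the real one.

    from selenium.webdriver.common.by import By

    # "li.stream-item" is an assumed selector for the repeated tweet
    # sections; replace it with whatever the live page actually uses.
    sections = driver.find_elements(By.CSS_SELECTOR, "li.stream-item")
    print(len(sections), "sections added to the loop")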

 

Step 4. Select the data to be extracted 

  • Click the data field “Defense News”.
  • Select “Extract text”
  • Follow the same steps to extract the other data fields (a code sketch of this extraction follows the list).
  • Click "Save"

 

Step 5. Rename the data fields

All the selected content will appear under “Data Fields”.

  • Click a “Field Name” to modify it.

 

Step 6. Start running your task 

  • After saving your extraction configuration, click “Next”
  • Select “Local Extraction”
  • Click “OK” to run the task on your computer.

Octoparse will automatically extract all the data selected.

 

Step 7. Check the data and export

  • Check the data extracted
  • Click "Export" button to export the results to Excel file, databases or other formats and save the file to your computer

 

Author: The Octoparse Team

 

 

 

Download Octoparse Today

 

 

For more information about Octoparse, please click here.

Sign up today!

 

 

Author's Picks

 

 

Octoparse Smart Mode -- Get Data in Seconds

Getting started with XPath 1

Getting started with XPath 2

Collect Data from LinkedIn

Top 30 Free Web Scraping Software

Collect Data from Amazon
