Automatically Scrape Dynamic Websites (Example: Twitter)

Saturday, October 08, 2016 1:42 AM

More and more websites load their content dynamically. A dynamic web page runs scripts in the browser as it loads: JavaScript, AJAX, DHTML and similar techniques determine how the HTML is fetched and parsed into the Document Object Model (DOM) that represents the loaded page. Such pages append new items as the user scrolls down. Twitter, Facebook and Amazon are all typical dynamic websites: they display only a limited amount of content initially and load more as you scroll.
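Under the hood, each scroll typically fires an AJAX request that fetches the next batch of items using a cursor or offset. As a rough illustration of that loading loop (the feed data, page size, and function names below are invented for the sketch, not Twitter's real API):

```python
# Sketch of how an infinite-scroll page loads data via AJAX.
# fetch_page() stands in for the XHR request the browser sends on each
# scroll; the "cursor" mimics the pattern such endpoints commonly use.
# All names and data here are invented for illustration.

FEED = [f"tweet {i}" for i in range(25)]  # pretend server-side data
PAGE_SIZE = 10

def fetch_page(cursor=0):
    """Return one batch of items plus the cursor for the next batch."""
    batch = FEED[cursor:cursor + PAGE_SIZE]
    next_cursor = cursor + PAGE_SIZE if cursor + PAGE_SIZE < len(FEED) else None
    return batch, next_cursor

def scrape_all():
    """Keep 'scrolling' (requesting the next batch) until no cursor remains."""
    items, cursor = [], 0
    while cursor is not None:
        batch, cursor = fetch_page(cursor)
        items.extend(batch)
    return items

print(len(scrape_all()))  # all 25 items, gathered across three 'scrolls'
```

This is why a scraper must either trigger the scrolling itself or wait for the page to finish loading, which is exactly what the Octoparse settings in the steps below configure.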

 

The previous tutorial, How to extract data in the list on eBay, showed how to scrape data from dynamic websites that use AJAX. Here we use the same method to scrape data from Twitter. A list of additional alternatives is available in Scraping websites with AJAX.

(Download my extraction task of this tutorial HERE just in case you need it.)

 

Step 1. You need to configure a rule first. Choose “Advanced Mode” ➜ Complete the basic information.

Enter the target Twitter URL in the built-in browser. ➜ Click the “Go” icon to open the web page. (URL of the example: https://twitter.com/search?q=news&src=typd&lang=en )

 

Step 2. We are now on the search result page. Wait until the page has loaded, then select “Advanced Options”. ➜ Choose “Scroll down to page bottom when finished loading”. ➜ Then enter how many times you want to scroll.

Here I choose “Scroll down for one screen”. You can also choose “Scroll down to bottom of the page”. ➜ Click “Save”.

 

Step 3. Now we have finished configuring the AJAX settings, and you can extract the list information you want. Move your cursor over a section with a similar layout where you want to extract data.

Click the first highlighted link ➜ Click “Create a list of items” (sections with similar layout) ➜ “Add current item to the list”. The first highlighted link is now added to the list. ➜ Click “Continue to edit the list”.

Click the second highlighted link ➜ Click “Add current item to the list” again. Now all the links with a similar layout are in the list. ➜ Click “Finish Creating List” ➜ Click “loop” to process the list and extract the elements on each page.

 

Step 4. Extract the title of the first section. ➜ Click the title. ➜ Select “Extract text”. Other content can be extracted in the same way.

 

Step 5. All the selected content will appear under “Data Fields”. ➜ Click a “Field Name” to rename it.

 

Step 6. Click “Next” ➜ Click “Next” ➜ Click “Local Extraction”. Octoparse will automatically extract all the selected data.

 

Step 7. The extracted data will be shown in the “Data Extracted” pane. Click the “View Data” button to view it. You can then export the results to an Excel file: click the “Export” button and save the file as Excel to your computer.

 

Author: The Octoparse Team

 

 

 

Download Octoparse Today

 

 

For more information about Octoparse, please click here.

Sign up today!

 

 

Author's Picks

 

 

Octoparse Smart Mode -- Get Data in Seconds

Getting started with XPath 1

Getting started with XPath 2

Collect Data from LinkedIn

Top 30 Free Web Scraping Software

Collect Data from Amazon
