Web Crawling Basic Concepts | What You Need to Know About WorkFlow Sequence

Wednesday, May 10, 2017 9:46 AM

Basic Intro:

You may run into this scenario: when you change the execution order of certain actions, such as "Loop Items" and "Click to Paginate", you get a completely different set of extracted data.

 

Explanation:

To help you understand how to crawl the data you need, I will use the "Pagination" and "Loop Items" actions as an example to explain the execution order of a crawling task, and to show how the extracted data changes when their relative order is changed.

 

Now, let's see how it works!

(URL of the example: https://www.yelp.com/search?find_desc=&find_loc=Washington,+DC&ns=1)

 

Case 1: Build a loop list before pagination

In this case, we will see what happens if we build a loop list before setting up pagination.

 

[WorkFlow]

First, we build a loop list as below. 

(To learn more about the detailed steps to create a loop list, check out the tutorial: Web Scraping Case Study | Scraping Data from Yelp)

 

Then, we continue to set up pagination.

Notice that the pagination action "Click to Paginate" is executed after "Loop Items", which extracts data from the list of loop items on the current web page.

That means Octoparse will first extract the data on the first page, then click to paginate to the next page. (The same process repeats on Page #2, Page #3 ...)

(To learn more about the detailed steps to paginate, check out the tutorial: Scraping from multi-pages: pagination with "Next" button)
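
The Case 1 order can be illustrated with a minimal Python sketch. This is not how Octoparse is implemented; it is only a simulation of the order of steps. The PAGES list and the function name loop_then_paginate are hypothetical stand-ins invented for this illustration, with each inner list representing one page of results.

```python
# Hypothetical stand-in for a paginated listing: each inner list is one page.
PAGES = [
    ["page1-item1", "page1-item2"],
    ["page2-item1", "page2-item2"],
    ["page3-item1", "page3-item2"],
]


def loop_then_paginate(pages):
    """Simulate Case 1: extract the current page, then click 'next'."""
    extracted = []
    page_index = 0  # the task opens on the first page
    while True:
        # "Loop Item" step: extract every item on the current page.
        for item in pages[page_index]:
            extracted.append(item)
        # "Click to Paginate" step: stop when there is no next page.
        if page_index + 1 >= len(pages):
            break
        page_index += 1
    return extracted


case1_rows = loop_then_paginate(PAGES)
print(case1_rows)
```

Because extraction runs before the click, the first row in the simulated result comes from the first page, and no page is skipped.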

 

[Data Results]

The data extracted is as below.  

 

The web content is displayed through the web browser as below.

Comparing the extracted data with the page content, we can see that the crawl started on the first web page and then paginated through the following pages.

 

 

 

Case 2: Paginate before creating a loop list

In contrast to Case 1, we will now see what happens if we change the execution order so that pagination runs before the "Loop Item" action.

 

[WorkFlow]

As in Case 1, we create a loop list first. Then, we set up the pagination.

However, this time we drag the "Click to Paginate" action so that it sits right before the "Loop Item" action, as shown in the figure below.

That means that after navigating to Page #1, Octoparse will do nothing there except click to paginate. Then Octoparse will start executing the "Loop Item" action, extracting data from Page #2. (The same process repeats on each following page.)
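
The effect of the swapped order can be sketched in Python as well. Again, this is only a simulation under stated assumptions, not Octoparse's actual implementation: PAGES and paginate_then_loop are hypothetical names made up for this illustration.

```python
# Hypothetical stand-in for a paginated listing: each inner list is one page.
PAGES = [
    ["page1-item1", "page1-item2"],
    ["page2-item1", "page2-item2"],
    ["page3-item1", "page3-item2"],
]


def paginate_then_loop(pages):
    """Simulate Case 2: click 'next' first, then extract the page reached."""
    extracted = []
    page_index = 0  # the task still opens on the first page
    while page_index + 1 < len(pages):
        # "Click to Paginate" runs first, so we leave the current page
        # before anything has been extracted from it.
        page_index += 1
        # "Loop Item" then extracts from the page we just arrived on.
        extracted.extend(pages[page_index])
    return extracted


case2_rows = paginate_then_loop(PAGES)
print(case2_rows)
```

In this simulation, the items from the first page never appear in the result, because the click happens before any extraction on that page.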

 

 

 

[Data Results]

The data extracted is as below.  

 

The web content is displayed through the web browser as below.

Comparing the extracted data with the web page content, we can see that extraction started from the second page, and the first page was skipped.

 

Now you should understand why pagination should not be placed before a loop list: the data on the first page is never extracted. To learn about more powerful crawling features, you can check out the tutorials below:

How to get current page title when scraping in Octoparse

How to Extract Data from Webpages Loaded with AJAX (Example: gumtree.com)

Web Scraping Feature Study | Scraping from multi-pages: pagination with "Next" button

Octoparse Smart Mode -- Get Data in Seconds

 

 

 

 

 

 
