
Web Crawling Basic Concepts | What You Need to Know About WorkFlow Sequence

Wednesday, May 10, 2017 9:46 AM


 

After going through the intro lessons, you should have grasped the basics of Octoparse and managed to create a few tasks successfully. In this article, I will go into a little more depth on how Octoparse extracts data from a web page and, more importantly, how the various actions work together in a workflow. A good understanding of these basic principles is the backbone for creating more successful and more complex scraping tasks.

1. How Octoparse works to extract web data

 

1.1 Octoparse simulates human browsing behaviors

Octoparse works by simulating human browsing behaviors in its built-in browser. Actions like opening web pages, clicking page elements, clicking the next page button, or scrolling down the page can all be done in Octoparse. The simulated scraping process mirrors how you would access the web data in any everyday browser.

 

1.2 Octoparse scrapes data automatically through workflow

When you build a scraping task in Octoparse, you are essentially creating a scraping workflow: a series of instructions for Octoparse to follow. Octoparse creates this workflow automatically as you interact with the built-in browser. In some cases the auto-created workflow needs no changes; in others, you may need to build or troubleshoot the workflow manually if things are not working as expected. Either way, it is strongly recommended that you grasp the basics of the workflow so you can scrape more precisely and accurately.

 

2. Understanding workflow

A workflow consists of a list of actions that are put together in a specific order to scrape the target web data.

The steps of the workflow should always be read from top to bottom, and from inside to outside for nested actions. Let's take a look at some examples. 
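One way to picture this reading order is as a depth-first walk over nested actions. The sketch below is only an illustration, not Octoparse's internal format: the nested tuple structure and the `reading_order` helper are assumptions made here, and the nesting (Extract Data inside Loop Item inside Pagination) is inferred from the step order listed in Example 1.

```python
# Hypothetical nested representation of a workflow: (action name, nested actions).
# The nesting is an assumption inferred from the step order in Example 1.
WORKFLOW = [
    ("Go to Web Page", []),
    ("Pagination", [
        ("Loop Item", [
            ("Extract Data", []),
        ]),
        ("Click to Paginate", []),
    ]),
]


def reading_order(actions, depth=0):
    """Yield (depth, name) top to bottom, descending into nested actions
    before moving on to the next action at the same level."""
    for name, nested in actions:
        yield depth, name
        yield from reading_order(nested, depth + 1)


order = list(reading_order(WORKFLOW))
for depth, name in order:
    # Indentation shows how deeply each action is nested.
    print("  " * depth + name)
```

Printed with indentation, the actions come out in the same order as Steps 1 to 5 of Example 1, with the nested actions finishing before the enclosing one continues.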

 

Example 1 - Extract from a list of elements to get data


Step 1: Go to Web Page, to open the target web page

Step 2: Pagination, to locate the next page button on the page (you are currently on Page 1)

Step 3: Loop Item, to locate the list of elements on the page

Step 4: Extract Data, to extract the needed data from the list of the elements

Step 5: Click to Paginate, to click on the next page button to go to Page 2

Step 6: Continue to extract data from the loop, and click the next page button until Octoparse gets to the last page

Step 7: No next page button located on the last page and the workflow ends
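The control flow of these steps can be sketched in plain Python. This is a simulation under stated assumptions, not Octoparse code: the in-memory `PAGES` list stands in for a paginated website, and `run_list_workflow` is a hypothetical name chosen for this illustration.

```python
# Hypothetical in-memory "site": each inner list is one page of list elements.
PAGES = [
    ["item A", "item B"],
    ["item C", "item D"],
    ["item E"],
]


def run_list_workflow(pages):
    """Mimic Example 1: Pagination wraps Loop Item, which wraps Extract Data."""
    rows = []
    if not pages:
        return rows
    page_index = 0                          # Go to Web Page: start on Page 1
    while True:                             # Pagination
        for element in pages[page_index]:   # Loop Item
            # Extract Data: record one row per element in the list
            rows.append({"page": page_index + 1, "text": element})
        if page_index + 1 >= len(pages):
            break                           # no next page button: workflow ends
        page_index += 1                     # Click to Paginate
    return rows


rows = run_list_workflow(PAGES)
```

Note that the inner loop over elements finishes for the current page before the next page button is clicked, which is the "inside to outside" order described above.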

 

Example 2 - Click a list of elements on the web page and extract data from the detail page


Step 1: Go to Web Page, to open the target web page

Step 2: Pagination, to locate the next page button on the page (you are currently on Page 1)

Step 3: Loop Item, to locate the list of elements on the page

Step 4: Click Item, to click the elements from the Loop Item and go to the detail page

Step 5: Extract Data, to extract the needed data from the detail page

Step 6: Click to Paginate, to click on the next page button to go to Page 2

Step 7: Continue to click elements from the loop, extract data from the detail page and click the next page button until Octoparse gets to the last page

Step 8: No next page button located on the last page and the workflow ends
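The same kind of simulation shows where the extra Click Item step fits. Again, this is a sketch rather than Octoparse code: `LIST_PAGES`, `DETAILS`, and `run_detail_workflow` are hypothetical names, and a dictionary lookup stands in for clicking through to a detail page.

```python
# Hypothetical in-memory site: list pages hold item names, and DETAILS maps
# each item to the data shown on its detail page.
LIST_PAGES = [["widget", "gadget"], ["gizmo"]]
DETAILS = {
    "widget": {"title": "Widget", "price": "9.99"},
    "gadget": {"title": "Gadget", "price": "19.99"},
    "gizmo": {"title": "Gizmo", "price": "4.50"},
}


def run_detail_workflow(list_pages, details):
    """Mimic Example 2: each looped element is clicked before data is extracted."""
    rows = []
    if not list_pages:
        return rows
    page_index = 0                               # Go to Web Page
    while True:                                  # Pagination
        for element in list_pages[page_index]:   # Loop Item
            detail_page = details[element]       # Click Item: open the detail page
            rows.append(dict(detail_page))       # Extract Data from the detail page
        if page_index + 1 >= len(list_pages):
            break                                # no next page button: workflow ends
        page_index += 1                          # Click to Paginate
    return rows


detail_rows = run_detail_workflow(LIST_PAGES, DETAILS)
```

Compared with Example 1, the only structural change is the Click Item step nested inside Loop Item, so extraction happens on the detail page instead of the list page.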

 

Example 3 - Load more elements by clicking the Load More button and scrape data from the list of elements


Step 1: Go to Web Page, to open the target web page

Step 2: Pagination, to locate the Load More button on the page

Step 3: Click to Paginate, to click on the Load More button to load more elements on the page

Step 4: Continue to click on the Load More button until it disappears

Step 5: Loop Item, to locate the list of elements on the page

Step 6: Extract Data, to extract the target data from the list of the elements
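The difference from the first two examples is that all loading happens before any extraction. The sketch below illustrates that ordering with a hypothetical `LoadMorePage` class, which simulates a page that reveals one extra batch of elements per click; none of these names come from Octoparse.

```python
class LoadMorePage:
    """Hypothetical page that shows one more batch of elements per click."""

    def __init__(self, batches):
        self.visible = list(batches[0]) if batches else []
        self._pending = [list(batch) for batch in batches[1:]]

    def has_load_more(self):
        # The Load More button disappears once nothing is left to load.
        return bool(self._pending)

    def click_load_more(self):
        self.visible.extend(self._pending.pop(0))


def run_load_more_workflow(page):
    """Mimic Example 3: click Load More until it disappears, then extract."""
    clicks = 0
    while page.has_load_more():       # Pagination: locate the Load More button
        page.click_load_more()        # Click to Paginate
        clicks += 1
    # Loop Item and Extract Data run once, over the fully loaded list.
    rows = [{"text": element} for element in page.visible]
    return clicks, rows


page = LoadMorePage([["a", "b"], ["c", "d"], ["e"]])
clicks, loaded_rows = run_load_more_workflow(page)
```

Here the Loop Item sits after the pagination loop rather than inside it, so each element is extracted exactly once from the fully expanded page.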

 

3. Testing the workflow

It is important to test-run the workflow step by step before running the task. When you click a step in the workflow, Octoparse performs that action in the built-in browser, so you can check that it works as expected and modify it if it does not. For example, when you click Go to Web Page, Octoparse loads the web page in the built-in browser automatically.

You can check more details about testing the workflow here.

 

Tip!

There is no single fixed way to build a workflow. You can add any actions as long as they work logically together.

You can use multiple click actions or loop items to scrape data from pages at multiple levels, for example, the list page and the product page of a directory website. You can also drag an action and move it to the right spot in the workflow.

 

For any questions, you are welcome to submit a request here. Our support team will get back to you within 24 hours.

 

Happy Data Hunting!

Author: The Octoparse Team
