AI Web Scraping: Scrape Ecommerce Website with Auto-detection

Monday, May 18, 2020

We have put together a series of web scraping tutorials to help you get up to speed with our latest version, Octoparse 8. By the end of the series, you will be able to build a crawler from scratch and pull data from any website you want.

In this lesson, we will go through how to scrape eCommerce data using the auto-detect algorithm in Octoparse 8.

 


Many websites share similar layouts. A listing page on eBay, for example, contains many items arranged in a list.

Octoparse's brand-new auto-detect algorithm is designed specifically for this kind of page. It automatically detects listing data (including text elements and links), "Next page" buttons, and "load more" buttons, scrolls down the page, and then generates the scraping task for you.
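
To give a feel for what list detection involves, here is a minimal Python sketch of one common heuristic: find the group of sibling elements that repeat the same tag and class most often under a single parent. This is not Octoparse's actual algorithm, which this post does not describe. The names `RepeatedBlockFinder`, `find_list_signature`, and `min_repeats`, as well as the sample HTML, are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class RepeatedBlockFinder(HTMLParser):
    """Counts how often each (tag, class) pair repeats under the same parent."""

    def __init__(self):
        super().__init__()
        self.stack = []              # open elements as (element_id, tag)
        self.next_id = 1             # 0 is reserved for the document root
        self.signatures = Counter()  # (parent_id, tag, class) -> occurrences

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        parent_id = self.stack[-1][0] if self.stack else 0
        self.signatures[(parent_id, tag, css_class)] += 1
        if tag not in VOID_TAGS:
            self.stack.append((self.next_id, tag))
            self.next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][1] == tag:
                del self.stack[i:]
                break


def find_list_signature(html, min_repeats=3):
    """Return (tag, class, count) for the most repeated sibling block, or None."""
    finder = RepeatedBlockFinder()
    finder.feed(html)
    if not finder.signatures:
        return None
    (_, tag, css_class), count = finder.signatures.most_common(1)[0]
    return (tag, css_class, count) if count >= min_repeats else None


# Hypothetical listing page: three product items and a "Next" link.
SAMPLE = """
<ul class="products">
  <li class="item"><a href="/p/1">Phone</a><span class="price">$199</span></li>
  <li class="item"><a href="/p/2">Laptop</a><span class="price">$899</span></li>
  <li class="item"><a href="/p/3">Camera</a><span class="price">$349</span></li>
</ul>
<a class="next" href="/page/2">Next</a>
"""

print(find_list_signature(SAMPLE))  # ('li', 'item', 3)
```

Counting per parent element, rather than per tag name across the whole page, is what keeps the sketch from mistaking the three price spans (one inside each item) for the list itself.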

Before we get started, note that Octoparse HelloWorld provides some test sites for you to practice on. Check out this eCommerce site and grab the URL we use in this tutorial.

 

Step 1: Create a new task

Enter the example URL into the search box. Click "Start" to create a new task.

 

Step 2: Get data via auto-detect

Octoparse loads the webpage in its built-in browser and starts the auto-detect process. Wait until the process completes and further instructions appear on the "Tips" panel.

 

Step 3: Check the data

When auto-detection completes, follow the instructions on the "Tips" panel and check your data in the preview section. You can rename data fields or remove any you do not need. The detected data is also highlighted on the webpage.
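
If you later work with exported data outside Octoparse, the same cleanup (renaming fields and dropping unneeded ones) can be expressed in a few lines of Python. This is only a sketch: the function `tidy_rows` and the field names `Field1`, `Field2`, and `Field3` are hypothetical placeholders, not names taken from Octoparse.

```python
def tidy_rows(rows, rename=None, drop=()):
    """Return copies of the rows with some fields renamed and others removed."""
    rename = rename or {}
    return [
        {rename.get(key, key): value for key, value in row.items() if key not in drop}
        for row in rows
    ]


# Hypothetical scraped rows with auto-generated field names.
rows = [
    {"Field1": "Phone", "Field2": "$199", "Field3": "/p/1"},
    {"Field1": "Laptop", "Field2": "$899", "Field3": "/p/2"},
]

tidy = tidy_rows(rows, rename={"Field1": "title", "Field2": "price"}, drop={"Field3"})
print(tidy[0])  # {'title': 'Phone', 'price': '$199'}
```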

 

 

Step 4: Confirm your options

Now go to the "Tips" panel and review your options. The options offered depend on the type of data detected. In this example, list data was detected, so the following options are available:

Option 1: Scrape the data in the list 

This option is selected by default, since Octoparse assumes that scraping the detected list is what you want.

Option 2: Click the "Next" button to capture multiple pages

Octoparse has detected a "Next" button on the page. Check this option if you want Octoparse to click the "Next" button and scrape data from more pages.

To verify that the detected button is the correct one, click "Check" and it will be highlighted on the webpage. If you need to re-select the "Next" button, click "Edit" and follow the instructions on the "Tips" panel.

Option 3: Click the "links" to capture data on the page that follows

Octoparse is asking whether you want it to click the detected links and scrape more information from the detail pages. Check this option if that is what you need.

To confirm that the links are the ones you want to click through, click "Check" to have them highlighted on the webpage.

In this case, we only want to scrape the list information across all pages, so we'll check the first and second options.
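
Conceptually, checking the first two options amounts to a loop: scrape the list on the current page, follow the "Next" link, and repeat until there is no next page. The Python sketch below illustrates that idea under stated assumptions. It is not how Octoparse implements it, and `ListPageParser`, `scrape_all_pages`, the `fetch` callable, the `item` and `next` class names, and `FAKE_SITE` are all hypothetical.

```python
from html.parser import HTMLParser


class ListPageParser(HTMLParser):
    """Collects the text of each list item and the URL of the "Next" link."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.next_url = None
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "li" and attributes.get("class") == "item":
            self._in_item = True
            self.items.append("")
        elif tag == "a" and attributes.get("class") == "next":
            self.next_url = attributes.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            self.items[-1] += data


def scrape_all_pages(start_url, fetch, max_pages=50):
    """Scrape list items page by page, following "Next" until it disappears."""
    rows, url, seen = [], start_url, set()
    # The seen set and max_pages guard against pagination loops.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = ListPageParser()
        parser.feed(fetch(url))
        rows.extend(parser.items)
        url = parser.next_url
    return rows


# A two-page fake site stands in for real HTTP requests.
FAKE_SITE = {
    "/page/1": '<ul><li class="item">Phone</li><li class="item">Laptop</li></ul>'
               '<a class="next" href="/page/2">Next</a>',
    "/page/2": '<ul><li class="item">Camera</li></ul>',
}

rows = scrape_all_pages("/page/1", FAKE_SITE.__getitem__)
print(rows)  # ['Phone', 'Laptop', 'Camera']
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP library, so the same function can be exercised against a fake site like the one above.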

 

Step 5: Save task settings

Octoparse generates a workflow automatically based on the detected data and the settings you saved. You can choose to run the task now or edit the workflow manually.

When everything looks good, save the task and run it to get your data.

 

Don't forget to practice with the HelloWorld test site. If you run into any difficulties, feel free to submit a ticket or email us at support@octoparse.com. To learn how to optimize your task, check out lesson 2.

 

 

 
