AI Web Scraping: Scrape Ecommerce Website with Auto-detection
Monday, May 18, 2020
We made a series of web scraping tutorials to help you get up to speed quickly with our latest version, Octoparse 8. By the end of the series, you will be able to build a crawler from scratch and pull data from any website you want.
In this lesson, we will go through how to scrape eCommerce data using the auto-detect algorithm in Octoparse 8.
Many websites share similar layouts. For example, an eBay listing page contains many items nested in a list.
Octoparse's brand-new auto-detect algorithm is designed specifically to scrape this kind of page. It automatically detects listing data (including text elements and links), "Next page" buttons and "load more" buttons, scrolls down the page, and then generates the scraping task automatically.
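To give a feel for what "detecting listing data" can mean, here is a minimal sketch in Python. It is not Octoparse's actual algorithm, which is not described in this lesson. It assumes one simple heuristic: a list is the element whose direct children most often share the same tag and class. The class name RepeatedChildFinder, the function detect_list and the sample HTML are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class RepeatedChildFinder(HTMLParser):
    """For every element, count how many direct children share a (tag, class) signature."""

    def __init__(self):
        super().__init__()
        self.stack = []  # indices into self.nodes for the currently open elements
        self.nodes = []  # one (signature, Counter of child signatures) per element

    def handle_starttag(self, tag, attrs):
        signature = (tag, dict(attrs).get("class") or "")
        if self.stack:
            # Register this element as a child of the innermost open element.
            self.nodes[self.stack[-1]][1][signature] += 1
        if tag in VOID_TAGS:
            return
        self.nodes.append((signature, Counter()))
        self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag, plus anything left open inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.nodes[self.stack[i]][0][0] == tag:
                del self.stack[i:]
                break


def detect_list(html):
    """Return (container signature, item signature, item count) for the most repetitive container."""
    finder = RepeatedChildFinder()
    finder.feed(html)
    best = None
    for signature, children in finder.nodes:
        if not children:
            continue
        child_signature, count = children.most_common(1)[0]
        if best is None or count > best[2]:
            best = (signature, child_signature, count)
    return best


# Hypothetical listing page used only for illustration.
SAMPLE = """
<html><body>
<div class="nav"><a href="/">Home</a><a href="/help">Help</a></div>
<ul class="results">
  <li class="item"><a href="/p/1">Phone</a><span class="price">$199</span></li>
  <li class="item"><a href="/p/2">Laptop</a><span class="price">$899</span></li>
  <li class="item"><a href="/p/3">Camera</a><span class="price">$349</span></li>
</ul>
</body></html>
"""

print(detect_list(SAMPLE))  # (('ul', 'results'), ('li', 'item'), 3)
```

A real detector would need to weigh many more signals (nested repeats, visual layout, link density), which is why the tool asks you to check the preview before accepting the result.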
Step 1: Create a new task
Enter the example URL into the search box. Click "Start" to create a new task.
Step 2: Get data via auto-detect
Octoparse will load the webpage in the built-in browser and start the auto-detect process. Wait until the process completes and more information appears on the "Tips" panel.
Step 3: Check the data
When the auto-detection completes, follow the instructions provided on "Tips" and check your data in the preview section. You can rename the data fields or remove those that are not needed. The detected data will also be highlighted on the webpage for you.
Step 4: Confirm your options
Now, go to "Tips" and check your options. Based on the type of data detected, a number of options are provided for you to choose from. In this example, list data is detected, so the following options are provided:
Option 1: Scrape the data in the list
This option is selected by default because scraping the list data is most likely what you need.
Option 2: Click the "Next" button to capture multiple pages
Octoparse has detected a "Next" button on the page. Check this option if you want Octoparse to click the "Next" button to scrape data from more pages.
To find out whether the detected button is the correct one, click "Check" and watch it get highlighted on the webpage. If you need to re-select the "Next" button, click "Edit" and follow the instructions on "Tips".
Option 3: Click the "links" to capture data on the page that follows
Now Octoparse is asking if you want to click on the links detected and scrape more information from the detail pages. Check this option if this is what you need.
To confirm that the links are the ones you'd like to click through, click "Check" to have the links highlighted on the web page.
In this case, we only want to scrape the list information across all pages, so we'll go ahead and check the first and second options.
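Option 2 depends on locating the right "Next" button, which is why the tool lets you verify it with "Check". As a rough illustration only, the Python sketch below finds a candidate "Next" link in a page by looking at two signals: a rel="next" attribute and the visible link text. The names NextLinkFinder, find_next_link and NEXT_LABELS, and the sample HTML, are hypothetical. How Octoparse actually identifies the button is not covered in this lesson.

```python
from html.parser import HTMLParser

# Link texts treated as "go to the next page" (an assumption for this sketch).
NEXT_LABELS = {"next", "next page", ">"}


class NextLinkFinder(HTMLParser):
    """Collect anchors that look like a 'Next' pagination link."""

    def __init__(self):
        super().__init__()
        self.candidates = []    # (href, reason) pairs in document order
        self._open_href = None  # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        self._open_href = attrs.get("href")
        self._text = []
        rel = (attrs.get("rel") or "").lower().split()
        if self._open_href and "next" in rel:
            self.candidates.append((self._open_href, "rel=next"))

    def handle_data(self, data):
        if self._open_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open_href is not None:
            label = "".join(self._text).strip().lower()
            if label in NEXT_LABELS:
                self.candidates.append((self._open_href, "link text"))
            self._open_href = None


def find_next_link(html):
    """Return the href of the first 'Next' candidate, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates[0][0] if finder.candidates else None


# Hypothetical pagination bar used only for illustration.
SAMPLE_PAGE = """
<div class="pagination">
  <a href="/search?page=1">Previous</a>
  <a href="/search?page=2">2</a>
  <a href="/search?page=3">Next</a>
</div>
"""

print(find_next_link(SAMPLE_PAGE))  # /search?page=3
```

Text matching like this can pick the wrong element on pages with unusual pagination, which mirrors why re-selecting the button through "Edit" is offered.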
Step 5: Save task settings
Octoparse generates a workflow automatically based on the data detected and the saved settings. You can choose to run the task now or edit the workflow manually.
When everything looks good, you can hit save and run to get your data.
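Conceptually, a workflow built from Option 1 and Option 2 is a loop: extract the list on the current page, click "Next", and repeat until there is no next page. The Python sketch below imitates that loop over a small in-memory site so it runs without network access. The function run_task, its parameters and the PAGES data are hypothetical and only illustrate the idea; they are not the workflow format Octoparse produces.

```python
# Hypothetical two-page site: each URL maps to its list items and its "Next" target.
PAGES = {
    "/search?page=1": {
        "items": [{"title": "Phone", "price": "$199"}, {"title": "Laptop", "price": "$899"}],
        "next": "/search?page=2",
    },
    "/search?page=2": {
        "items": [{"title": "Camera", "price": "$349"}, {"title": "Tablet", "price": "$249"}],
        "next": None,
    },
}


def run_task(start_url, pages, follow_next=True, max_pages=50):
    """Collect list rows page by page, following 'Next' when follow_next is enabled."""
    rows = []
    visited = set()
    url = start_url
    # Stop when there is no next page, a page repeats, or the safety limit is reached.
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page = pages[url]
        rows.extend(page["items"])                      # Option 1: scrape the list
        url = page["next"] if follow_next else None     # Option 2: click "Next"
    return rows


all_rows = run_task("/search?page=1", PAGES)
first_page_only = run_task("/search?page=1", PAGES, follow_next=False)
print(len(all_rows), len(first_page_only))  # 4 2
```

The visited set and max_pages limit guard against pagination that loops back on itself, a common reason to review a generated workflow before running it at scale.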
Don’t forget to practice with the HelloWorld test site. If you encounter any difficulties, feel free to submit a ticket or email us at firstname.lastname@example.org. To learn how to optimize your task, check out lesson 2.