How to extract data in the list on eBay
Friday, May 06, 2016 1:32 AM
This tutorial demonstrates how to collect information from a list on a web page, and covers some background on AJAX loading. I will show you how to create a loop with pagination and collect data from eBay. To start, I search for an item (a TV, for example) on eBay and copy the URL once the results page has loaded.
Set the basic information in Octoparse.
First of all, enter the task name and save your task to a category. Then click "Next" to go to the second step.
Open the example link in the built-in browser. Click Save in the Customize Current Action pane.
Next, we configure the pagination action so that data is extracted from all of the result pages.
Wait until the page has loaded and scroll down to the bottom of the page. Click on the “Next” page link. Select “Loop click the element” and click “Save”.
In fact, pagination on this kind of page relies on AJAX. AJAX is a scripting technique for asynchronous updates: it updates a portion of the web page without reloading the entire page, by exchanging a small amount of data with the server in the background.
AJAX loading has two obvious signs. One is that the URL does not change when you click an option on the web page. The other is that the page is not fully reloaded; only part of it changes. If a website shows both signs, it is an AJAX website.
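The two signs above can be written down as a simple check. This is only an illustrative sketch: the function name, the example URLs, and the HTML snippets are made up for this tutorial, and it is not how Octoparse itself detects AJAX.

```python
def looks_like_ajax_update(url_before, url_after, html_before, html_after):
    """Return True when the URL stayed the same but the page content changed.

    These are the two signs of AJAX loading described in the text:
    no change in the address, and only a partial change on the page.
    """
    url_unchanged = url_before == url_after
    content_changed = html_before != html_after
    return url_unchanged and content_changed


# Hypothetical before/after snapshots taken around a click on the page.
before = "<ul><li>Item 1</li></ul>"
after = "<ul><li>Item 1</li><li>Item 2</li></ul>"

print(looks_like_ajax_update("https://example.com/list",
                             "https://example.com/list",
                             before, after))
```

If the URL changes after the click, the check returns False, which corresponds to a normal full page load.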
Sometimes, when we run a local extraction, the collection stops immediately or completes with no data extracted. In this case we may need to configure AJAX loading in Octoparse.
The reason is that when the website updates only part of the content and the URL stays unchanged, the built-in browser in Octoparse opens the web page but does not receive a signal that the page has changed. As a result, Octoparse collects no data or the process stops.
In addition to choosing the option "Load page with AJAX", we also need to set the timeout parameter: the amount of time to wait for the AJAX requests to finish before the next step is executed.
So click the "Click to paginate" step in the workflow designer, tick "Load page with AJAX" and select "10 seconds" from the Ajax Timeout list in the advanced options. Click Save.
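To make the idea of an AJAX timeout concrete, here is a minimal Python sketch of "wait until the update is finished, but no longer than N seconds". The function wait_for_ajax and its parameters are hypothetical names chosen for this illustration; Octoparse handles this internally through the Ajax Timeout setting.

```python
import time


def wait_for_ajax(is_done, timeout=10.0, poll_interval=0.1):
    """Poll is_done() until it returns True or the timeout expires.

    Returns True if the update finished in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_done():
            return True
        time.sleep(poll_interval)
    # One last check in case the update finished right at the deadline.
    return is_done()


# Simulated AJAX request that finishes about 0.2 seconds after the click.
clicked_at = time.monotonic()
finished = wait_for_ajax(lambda: time.monotonic() - clicked_at > 0.2,
                         timeout=10.0)
print("AJAX update finished in time:", finished)
```

A longer timeout is safer on slow connections but makes every step wait longer when the page never signals completion, which is the trade-off behind picking a value such as 10 seconds.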
Then we need to extract all the item links on the page. Create a list of items before going into the detail page of each item.
Click on the first title > Create a list of item > Select Add current item to the list > Select Continue to edit the list.
Click on the second one > Add current item to the list > Finish creating list > Loop
Now on the detail page, we extract data. Click on the data you want to extract.
Click on the title and price, then choose “extract text”. Click Save.
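For readers curious what "extract text" amounts to, the sketch below pulls the text of two elements out of a small HTML snippet using only the Python standard library. The element ids "title" and "price" and the sample HTML are assumptions made up for this example; they are not eBay's real markup, and Octoparse does this for you when you click the elements.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside elements whose id is in wanted_ids."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted = set(wanted_ids)
        self.result = {}
        self._current = None  # (id, tag) of the element being captured
        self._depth = 0       # nesting depth of tags with the same name

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            if tag == self._current[1]:
                self._depth += 1
            return
        elem_id = dict(attrs).get("id")
        if elem_id in self.wanted:
            self._current = (elem_id, tag)
            self._depth = 1
            self.result[elem_id] = ""

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current[1]:
            self._depth -= 1
            if self._depth == 0:
                elem_id = self._current[0]
                self.result[elem_id] = self.result[elem_id].strip()
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.result[self._current[0]] += data


# Hypothetical detail page, not real eBay HTML.
SAMPLE_HTML = (
    '<h1 id="title">Sample TV <span>55 inch</span></h1>'
    '<span id="price"> $199.99 </span>'
)

parser = TextById(["title", "price"])
parser.feed(SAMPLE_HTML)
print(parser.result)
```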
Each time we open a web page with AJAX, we need to set the timeout parameter to wait for the AJAX requests to finish. So in each “Click Item” step in the workflow designer, we need to tick the option “Load page with AJAX” and select “10 seconds” from the “Ajax Timeout” list in the advanced options. Click Save.
In the “Cycle Pages” box, drag the “Loop Item” box before the “Click to paginate” action.
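The reason for this ordering is that every item on the current page has to be visited before moving on to the next page. The workflow is roughly equivalent to the nested loop below. This is a sketch over made-up in-memory data: the PAGES dictionary, its keys, and the crawl function are hypothetical stand-ins for the real result pages, and no network access is involved.

```python
# Hypothetical result pages: each has a list of items and a "next" pointer.
PAGES = {
    1: {"items": [{"title": "Sample TV A", "price": "199.99"},
                  {"title": "Sample TV B", "price": "249.00"}],
        "next": 2},
    2: {"items": [{"title": "Sample TV C", "price": "329.50"}],
        "next": None},  # last page: there is no "Next" link
}


def crawl(pages, start_page):
    """Visit every item on a page, then follow the 'Next' link."""
    rows = []
    page_id = start_page
    while page_id is not None:            # "Cycle Pages" loop
        page = pages[page_id]
        for item in page["items"]:        # "Loop Item": runs first
            rows.append({"title": item["title"], "price": item["price"]})
        page_id = page["next"]            # "Click to paginate": runs last
    return rows


rows = crawl(PAGES, start_page=1)
for row in rows:
    print(row["title"], row["price"])
```

If the pagination step ran before the item loop, the items on the first page would be skipped.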
We've finished configuring the rule. Click “Next” in the top right corner. Tick the option "Disable image loading" to speed up the extraction. Click “Next”. Then click “Local Extraction” to run the task.
The extracted data is shown in this pane, and we can also see the configured rule of the task. Check the built-in browser to see if the task runs as expected.
Export the results to Excel files or other formats, and save the file to your computer.
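As an illustration of what an export to a spreadsheet-friendly format looks like, the sketch below writes rows of extracted data as CSV, which Excel can open. The function rows_to_csv and the sample rows are made up for this example; Octoparse's own export is done through its interface, not through this code.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical extracted data.
sample_rows = [
    {"title": "Sample TV A", "price": "199.99"},
    {"title": "Sample TV B", "price": "249.00"},
]

csv_text = rows_to_csv(sample_rows, fieldnames=["title", "price"])
print(csv_text)
```

To save the result to disk, the same writer can be pointed at a file opened with open(path, "w", newline="") instead of the in-memory buffer.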
This is the screenshot of some of the data extracted.