Step-by-step tutorials for you to get started with web scraping
Lesson 5: Getting data - Click on a list and capture data from each item page
Thursday, August 16, 2018
1) Select the links that lead to the individual item pages
To do this, we will create a "Loop Item" that clicks each product link on the result page in turn.
- Click on the first product title that contains the URL to access the item page. The selected item will be highlighted in green while items with the same layout will be highlighted in red.
- Click on the second product title containing the URL.
- Select "Loop click each URL" from "Action Tips". Notice that a "Loop Item" for the clicking action is auto-generated and added to the workflow.
To loop click through the items on the list, it is important that you select the anchor texts. Octoparse automatically identifies the tags of the selected items, so when you select an item with a URL, the selected tag will be "A", the anchor tag, which usually links one page to another.
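For readers who want to see what "selecting the anchor texts" amounts to outside the Octoparse interface, here is a minimal Python sketch using only the standard library. It is not how Octoparse works internally; the sample HTML, the `example.com` URL, and the names `AnchorCollector` and `collect_item_links` are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects an (absolute URL, anchor text) pair for every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> that is currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the result page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical result page: each product title is wrapped in an anchor.
RESULT_PAGE = """
<ul>
  <li><a href="/item/1">Blue Kettle</a> <span>$20</span></li>
  <li><a href="/item/2">Red Toaster</a> <span>$35</span></li>
</ul>
"""


def collect_item_links(html, base_url):
    """Return the item links found on one result page."""
    parser = AnchorCollector(base_url)
    parser.feed(html)
    return parser.links


for url, title in collect_item_links(RESULT_PAGE, "https://example.com/search"):
    print(url, title)
```

Only the text inside the "A" tags yields a URL to follow; the price spans next to them do not, which is why the product title is the element to click.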
2) Select details on the item page to extract
Once the "Loop Item" is completed, Octoparse will load the first item page in the built-in browser.
Now, set up an extraction template by designating the specific data fields to capture from the page; Octoparse will apply this template to the other item pages.
- Click on target data fields such as title, review, price, etc.
- When you finish selecting, choose "Extract data" from "Action Tips" to complete the extraction action. Notice that an "Extract data" step is auto-generated and added to the workflow. The extracted data fields are displayed in the "Data field" pane next to the workflow designer.
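The idea of defining the fields once and reusing that template on every item page can be sketched in Python as follows. This is an illustration under stated assumptions, not Octoparse's implementation: the class names in `TEMPLATE`, the sample pages, and the helper names are hypothetical, and each field is assumed to be a simple element with no nested markup.

```python
from html.parser import HTMLParser

# Hypothetical template: output field name -> CSS class that marks it on the page.
TEMPLATE = {"title": "product-title", "price": "price", "review": "review-count"}


class FieldExtractor(HTMLParser):
    """Fills one row of data from one item page, guided by a template."""

    def __init__(self, template):
        super().__init__()
        self.class_to_field = {cls: field for field, cls in template.items()}
        self.row = {field: None for field in template}  # None = field not found
        self._current = None   # field currently being captured
        self._buffer = []      # text fragments of that field

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls in self.class_to_field:
                self._current = self.class_to_field[cls]
                self._buffer = []
                break

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Assumption: fields contain no nested tags, so the first closing tag
        # seen while capturing is the field's own closing tag.
        if self._current is not None:
            self.row[self._current] = "".join(self._buffer).strip()
            self._current = None


def extract_row(html, template):
    """Apply the template to one item page and return its data row."""
    parser = FieldExtractor(template)
    parser.feed(html)
    return parser.row


# Hypothetical item pages that share one layout; the second has no reviews.
ITEM_PAGES = [
    '<h1 class="product-title">Blue Kettle</h1>'
    '<span class="price">$20</span><p class="review-count">12 reviews</p>',
    '<h1 class="product-title">Red Toaster</h1><span class="price">$35</span>',
]

rows = [extract_row(page, TEMPLATE) for page in ITEM_PAGES]
for row in rows:
    print(row)
```

A field that is missing on a page stays `None` instead of shifting the other columns, so every row keeps the same shape.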
Setting up a wait time in "Advanced Options" for steps like "Click Item" or "Extract Data" can help avoid skipped data and makes the crawling process more human-like. (Usually 2-5 seconds works well.)
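In a hand-written scraper, the same wait could look roughly like the Python sketch below. The function name `wait_between_steps` and its parameters are hypothetical; the 2-5 second range is taken from the tip above. A randomized pause is used here as one possible way to look less mechanical, which is a choice of this sketch and not an Octoparse setting.

```python
import random
import time


def wait_between_steps(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Pause for a random time in [min_seconds, max_seconds] and return it.

    `sleep` is injectable so the pause can be recorded instead of performed.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Demonstration with a recorder in place of time.sleep, so it finishes instantly.
# In real use, call wait_between_steps() with no arguments before each step.
recorded_pauses = []
for step in ("Click Item", "Extract Data"):
    wait_between_steps(sleep=recorded_pauses.append)
    print(f"{step}: would wait {recorded_pauses[-1]:.1f} s")
```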
Done! Learn how to set up pagination in lesson 6 to complete your scraping project!