Why does Octoparse only extract the first item and duplicate?
Thursday, August 16, 2018
“Loop Item” is one of the most frequently used steps when building a scraping task in Octoparse, so it is important to set it up correctly.
If your task extracts only the first item and keeps producing duplicates, you probably need to revise the “Loop Item” you created in the task.
There are two main reasons why this happens:
1) The data to be extracted is not in the selected area (e.g., you select only the title to create the loop, but then click data outside the title area to extract it).
This mistake usually happens when you extract data from a list page.
In this case, delete the entire “Loop Item” and build a new one. Note that you need to select the entire area of an item when creating the loop, because data can only be extracted from within the selected area. If you cannot select the entire area directly, expand the selection by clicking this icon on "Action Tips" until it includes all the data you need.
2) The extraction was not started from the first item. When you finish creating a loop, Octoparse marks the first item in red, as shown in the screenshots below, to remind you to start extracting data from the first item.
If you ignore Octoparse’s hints and start extracting data from the second item or any other item, Octoparse may keep scraping that item’s data and produce duplicates. To fix this, delete the “Extract Data” step and drag a new “Extract Data” step into your loop, following Octoparse’s instructions.
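The two causes above can be pictured with a small Python analogy. This is only a sketch of the general idea (data is looked up relative to the current loop item), not a description of how Octoparse is implemented; the sample page, the tag names, and all variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical list page: each <li> is one listing with a title and a price.
PAGE = """<ul>
  <li><h2>Item A</h2><span>$10</span></li>
  <li><h2>Item B</h2><span>$20</span></li>
  <li><h2>Item C</h2><span>$30</span></li>
</ul>"""

root = ET.fromstring(PAGE)

# Reason 1: the loop is built on the titles only, so the price lies
# outside every loop item and cannot be found relative to it.
title_loop = root.findall("li/h2")
prices_from_title_loop = [title.findtext("span") for title in title_loop]

# Reason 2: the loop covers whole listings, but the extraction is pinned to
# one fixed listing (here the second) instead of the current loop item.
listing_loop = root.findall("li")
pinned = listing_loop[1]
pinned_rows = [(pinned.findtext("h2"), pinned.findtext("span")) for _ in listing_loop]

# Intended setup: each loop item is a whole listing, and the data is
# read from inside the item currently being visited.
relative_rows = [(item.findtext("h2"), item.findtext("span")) for item in listing_loop]

print(prices_from_title_loop)
print(pinned_rows)
print(relative_rows)
```

In this sketch the title-only loop finds no prices at all, the pinned extraction repeats the same listing three times, and only the last variant yields one distinct row per listing.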
You can follow these two steps to check the “Loop Item” manually:
- Click the first item in your “Loop Item” and check the extracted data, as shown in the screenshot below.
- Click the second item in the “Loop Item” and check the data again. If the extracted data stays the same even after you select the second item, follow the solutions above to revise your task.
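The same check can be applied to data that has already been exported. The snippet below is a minimal sketch that assumes the rows have been loaded into Python as tuples; the helper name `all_rows_identical` and the sample rows are hypothetical and not part of Octoparse.

```python
def all_rows_identical(rows):
    """Return True when there are at least two rows and every row equals the first."""
    return len(rows) > 1 and all(row == rows[0] for row in rows[1:])


# Hypothetical exports: one from a broken loop, one from a working loop.
duplicated = [("Item A", "$10"), ("Item A", "$10"), ("Item A", "$10")]
healthy = [("Item A", "$10"), ("Item B", "$20")]

print(all_rows_identical(duplicated))
print(all_rows_identical(healthy))
```

If the helper reports identical rows for a page that clearly lists different items, the “Loop Item” is the first thing to re-check.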