Lesson 4: Getting data - Capture a list of items
Monday, November 22, 2021
A newer version of this tutorial is available; check it for the most current steps.
In the previous lesson, we learned how to capture simple text (see how to capture text from a page in lesson 3). We are now ready to move on to a more advanced scraping technique: capturing a list of items.
1. Build the list by defining a pattern
Tell Octoparse which items to include in the list by selecting any two items from it.
· Click any two product sections one after the other. Notice that the other product sections on the page are automatically selected and highlighted in green, with all their sub-elements highlighted in red.
· Click "Extract text of the selected elements". A "Loop Item" is automatically generated and added to the workflow. By default, Octoparse extracts data from the selected item. If this is not exactly what you are looking for, you can delete it and add the data fields you need in the next step.
1. For the list to be built correctly with the desired items, it is critical to keep the two selections identical in structure, i.e. the highlighted content should have the same "look".
You can always expand the selection area by clicking on the tags (e.g. DIV, A, LI, etc.) at the bottom of "Action Tips".
2. If certain items on the list are still missing after the first two clicks, keep clicking on more products from the same list until all items desired are selected and highlighted in green.
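To make the idea of "defining a pattern" more concrete, here is a minimal Python sketch of how two clicked elements could be generalized into a full list. It is only an illustration: Octoparse does this through its point-and-click interface, and its actual matching logic is not described in this tutorial. The page fragment, the signature() helper, and the build_list() function are all hypothetical names invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (real pages are rarely this tidy).
PAGE = """
<div id="results">
  <ul>
    <li class="product"><h3>Kettle</h3><span class="price">$25</span></li>
    <li class="product"><h3>Toaster</h3><span class="price">$40</span></li>
    <li class="product"><h3>Blender</h3><span class="price">$60</span></li>
    <li class="ad"><h3>Sponsored</h3></li>
  </ul>
</div>
"""


def signature(elem):
    """Describe an element by its tag, class, and the tags of its children."""
    return (elem.tag, elem.get("class"), tuple(child.tag for child in elem))


def build_list(root, first, second):
    """Return every element that shares the structure of the two clicked ones."""
    if signature(first) != signature(second):
        # Mirrors the tip above: both selections must have the same "look".
        raise ValueError("the two selections differ in structure")
    wanted = signature(first)
    return [elem for elem in root.iter() if signature(elem) == wanted]


root = ET.fromstring(PAGE)
products = root.findall(".//li")

# Simulate clicking the first two product sections.
items = build_list(root, products[0], products[1])
print([item.find("h3").text for item in items])
# prints ['Kettle', 'Toaster', 'Blender']
```

The sponsored entry is left out because its structure differs from the two selected items, which is the same reason the two clicks need to land on sections with an identical layout.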
2. Capture sub-elements within the selected item
2.1 From the item highlighted in green (usually the first one on the list), click to capture the sub-elements desired. This is to set an extraction template for the other items on the list. Configure the extraction step for the first item, then Octoparse will apply the template to the remaining items on the list.
· Click to capture any sub-elements within the highlighted section
· When you finish selecting, click "Extract text of the selected elements"
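The "extraction template" idea in 2.1 can be pictured as a set of paths that are recorded once on the first item and then replayed on every other item in the loop. The Python below is a rough sketch under that assumption, not Octoparse's implementation; the page fragment, the TEMPLATE mapping, and the extract() helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the second product has no price on purpose.
PAGE = """
<ul>
  <li class="product"><h3>Kettle</h3><span class="price">$25</span></li>
  <li class="product"><h3>Toaster</h3></li>
  <li class="product"><h3>Blender</h3><span class="price">$60</span></li>
</ul>
"""

# Template "recorded" on the first item: column name -> path relative to the item.
TEMPLATE = {"title": "h3", "price": "span[@class='price']"}


def extract(item, template):
    """Apply the template to one list item and return one row of data."""
    row = {}
    for column, path in template.items():
        node = item.find(path)
        has_text = node is not None and node.text
        row[column] = node.text.strip() if has_text else None
    return row


root = ET.fromstring(PAGE)
loop_items = root.findall(".//li[@class='product']")

# The same template is applied to every item in the loop.
rows = [extract(item, TEMPLATE) for item in loop_items]
for row in rows:
    print(row)
```

Because the paths are relative to each item, a sub-element that is missing from one product (the price of the second item here) simply yields an empty value for that row instead of shifting data between columns.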
2.2. Capture all sub-elements automatically
Besides the steps provided in 2.1, there is an alternative way to capture sub-elements in Octoparse 7.x. While you are adding items to the list, Octoparse automatically detects all sub-elements within the selected sections and highlights them in red. You can then click "Select all sub-elements" from "Action Tips" to select all the detected sub-elements at once.
Now, all sub-elements are selected and shown in the "Action Tips" panel.
· Click the "X" next to the data fields to delete any unnecessary columns.
· Once done, select "Extract data".
Notice that the extracted data fields are added to the "Data field" pane next to the workflow designer, where you can customize them further if necessary.
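As a rough mental model of "Select all sub-elements" followed by deleting unneeded columns, the sketch below lists every text-bearing descendant of a single item and then drops the fields that are not wanted. This is an assumption-laden illustration in Python, not how Octoparse works internally; the item markup, the detect_sub_elements() helper, and the field-naming rule are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a single list item.
ITEM = (
    '<li class="product">'
    '<a href="/p/1"><h3>Kettle</h3></a>'
    '<span class="price">$25</span>'
    '<span class="rating">4.5</span>'
    "<em> </em>"
    "</li>"
)


def detect_sub_elements(item):
    """List every descendant with visible text as (field name, text) pairs."""
    fields = []
    for node in item.iter():
        if node is item:
            continue  # skip the list item itself
        text = (node.text or "").strip()
        if text:
            # Assumed naming rule: use the class if present, else the tag.
            name = node.get("class") or node.tag
            fields.append((name, text))
    return fields


item = ET.fromstring(ITEM)
fields = detect_sub_elements(item)
print(fields)
# prints [('h3', 'Kettle'), ('price', '$25'), ('rating', '4.5')]

# Equivalent of clicking the "X" next to a data field you do not need.
unwanted = {"rating"}
kept = [(name, text) for name, text in fields if name not in unwanted]
print(kept)
# prints [('h3', 'Kettle'), ('price', '$25')]
```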