Conglomerate Data in Octoparse
Wednesday, August 30, 2017 10:00 AM
Data conglomeration is a feature provided in Octoparse for combining data captured from a list into a single data field. This feature comes in handy for combining web content from different pages or paragraphs. Let’s look at an example.
Let’s say that I would like to capture the blog content on this webpage - https://philipyancey.com/a-view-from-abroad. There are two parts to the page, the blog content and the comments section, and I only want the blog content.
Notice that if I try to extract the blog content in one big chunk, the comments are automatically added to my selection. Even worse, there is no way to select the whole article on its own; only individual paragraphs can be selected.
After looking closely at the source code, I found that it is not possible to separate these two sections: both are part of a single body, and any scraper will recognize it as one single element. This means that if we extract the content as a whole, the comments will unavoidably be included.
To tackle this, I will use the text conglomeration function offered in Octoparse.
Step 1: Build a loop list to extract each paragraph separately.
Step 2: Customize the settings for the ‘Extract Data’ action: click the field customization button ➜ select ‘Customize data conglomeration’ ➜ select ‘Combine data captured from the same field into the same row when extracted multiple times’ ➜ OK ➜ Save
Now each paragraph of the blog content is captured as a separate item in the list, and the items are then combined into a single field. Run the task to see the result.
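For readers who want to see the idea outside the Octoparse interface, here is a minimal Python sketch of the same two steps: collect each paragraph as a separate list item, then combine the items into one field. It is not how Octoparse works internally. The sample markup, the `ParagraphCollector` class, the `combine_paragraphs` function, and the newline separator are all assumptions made for illustration, and the real page's structure may differ.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element as a separate list item."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # one entry per <p>, like the items of a loop list
        self._buffer = None   # text pieces of the <p> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def combine_paragraphs(html, separator="\n"):
    """Extract every paragraph, then join them into a single field."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return separator.join(parser.paragraphs)


# Hypothetical markup: article paragraphs and a comment share one container.
SAMPLE_HTML = """
<div class="body">
  <p>First paragraph of the post.</p>
  <p>Second paragraph of the post.</p>
  <div class="comment">A reader comment.</div>
</div>
"""

combined = combine_paragraphs(SAMPLE_HTML)
print(combined)
```

Because only the paragraph elements are collected, the comment in the sample markup never reaches the combined field, which mirrors what the loop list plus data conglomeration achieves in the task.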
Export to Excel for a better view.
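If you ever need to produce a similar spreadsheet-friendly file yourself, the sketch below writes a combined field to CSV text, which Excel can open. This is only an illustration and not Octoparse's export feature. The `rows_to_csv` helper, the `blog_content` column name, and the sample row are hypothetical.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One row whose single field holds several paragraphs joined by a newline.
rows = [{"blog_content": "First paragraph.\nSecond paragraph."}]
csv_text = rows_to_csv(rows, ["blog_content"])
print(csv_text)
```

The `csv` module quotes a field that contains a line break, so the combined text stays in one cell instead of spilling across several rows.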
That's all! I hope you enjoyed reading the tutorial. If there's any feature you would like to learn more about, feel free to reach out to the support team at firstname.lastname@example.org.