Step-by-step tutorials for you to get started with web scraping


Conglomerate data extracted

Thursday, August 16, 2018

In this tutorial, we will show you how to customize data conglomeration in Octoparse to merge different rows of data into one single row.

 

Suppose you need to extract posts from a blog. In some cases, you cannot select the entire post as a single element, so each paragraph is extracted as a separate row. What you want instead is the whole post in one single row.

 

In this case, we suggest using the conglomerate feature in Octoparse when configuring the extraction. It merges the separate rows into one row of data.

Here we use the blog content from https://philipyancey.com/a-view-from-abroad as an example to show how the conglomerate feature merges the extracted data.

 

1) Select the desired data to extract

1. Select one paragraph on the page and click "Select all" to create a "Loop Item" that extracts each paragraph of the post.

 

2. Select "Extract text of the selected elements"
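To make the result of this step concrete, here is a minimal Python sketch of what a loop that extracts the text of each paragraph produces: one row per paragraph. It is only an illustration, not how Octoparse works internally. The `ParagraphExtractor` class, the `extract_paragraphs` helper, and the sample HTML are all made up for this example.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element, one entry per paragraph."""

    def __init__(self):
        super().__init__()
        self.rows = []              # one extracted "row" per paragraph
        self._in_paragraph = False  # True while inside a <p> element
        self._parts = []            # text fragments of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._parts = []

    def handle_data(self, data):
        # Text inside inline tags such as <em> also arrives here.
        if self._in_paragraph:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            text = "".join(self._parts).strip()
            if text:
                self.rows.append(text)
            self._in_paragraph = False


def extract_paragraphs(html):
    """Return the text of each paragraph in the given HTML as a list."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical stand-in for a blog post page.
SAMPLE_HTML = """
<article>
  <h1>Sample post</h1>
  <p>First paragraph of the post.</p>
  <p>Second paragraph, with <em>inline</em> markup.</p>
  <p>Third paragraph.</p>
</article>
"""

rows = extract_paragraphs(SAMPLE_HTML)
for row in rows:
    print(row)
```

Each paragraph ends up as its own entry in `rows`, which corresponds to the situation where one post is spread over several rows.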

                                                                                 

 

2) Customize data conglomeration to combine the data extracted 

1. Click on the "Extract Data" action and then the data field to customize

 

 

2. Click on the icon to customize the data field

 

3. Select "Customize data conglomeration"

 

4. Select "Conglomerate data captured for the same data field into a single row." 

Now, the paragraphs captured in the "Text" field will be merged into one single row when the task is executed.
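Conceptually, the conglomeration configured in the steps above can be sketched in Python as follows. This is an assumption-laden illustration rather than Octoparse's actual implementation: the `conglomerate` function is hypothetical, and the newline separator is a choice made for this example.

```python
def conglomerate(rows, field="Text", separator="\n"):
    """Merge the values of one field across all rows into a single row.

    The separator is an assumption for this sketch; the tool itself may
    join the values differently.
    """
    values = [row[field] for row in rows if row.get(field)]
    return {field: separator.join(values)}


# Before: one row per paragraph, all in the same "Text" field.
rows = [
    {"Text": "First paragraph of the post."},
    {"Text": "Second paragraph of the post."},
    {"Text": "Third paragraph of the post."},
]

# After: a single row whose "Text" field holds the whole post.
merged = conglomerate(rows)
print(merged["Text"])
```

The input has three rows and the output has exactly one, which is the effect described for the "Text" field.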

 

Let's run the task and export the result to Excel for a better view.

You can see that the paragraphs captured in the "Text" field are now combined into a single row as one big chunk.

 

 

Tips!

1. Data conglomeration is especially useful for extracting articles from the web. You can extract the article as one whole chunk, without other elements such as blank lines, comments, and images.

2. When the data is conglomerated into one chunk, you can use the data reformat tools to add a prefix or suffix, such as "|" or "\", so that the individual items are easier to tell apart.
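The second tip can be illustrated with a short Python sketch. It assumes the prefix or suffix is attached to each item before the items are merged. The `add_affixes` helper is hypothetical and only mimics the idea of the reformat step, not the tool itself.

```python
def add_affixes(items, prefix="", suffix=""):
    """Return a copy of items with the prefix and suffix attached to each one."""
    return [f"{prefix}{item}{suffix}" for item in items]


paragraphs = ["First paragraph.", "Second paragraph.", "Third paragraph."]

# Mark the end of each item with "|" before merging.
marked = add_affixes(paragraphs, suffix="|")
combined = "".join(marked)
print(combined)

# The marker makes it possible to recover the individual items later.
recovered = [part for part in combined.split("|") if part]
print(recovered)
```

Pick a marker that does not occur in the extracted text itself, otherwise the items cannot be separated reliably afterwards.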

 

 

 

Related articles:

Select and extract data/URL/image/HTML 

Use lists to extract 

Extract multiple pages through pagination 

Re-format data extracted 

Extract data from a list of URLs 
