Web Crawling Case Study | Scraping information from Capterra

Wednesday, April 05, 2017 8:00 AM

In this tutorial, I will show you how to scrape data from www.capterra.com by searching with multiple keywords. 

 

 

Some features that we will touch upon include:

  • Create a loop for a text list
  • Extract data

 

Now, let's get started!

Step 1. Start a new task  

  • Choose "Advanced Mode" and click "Start".
  • Complete the basic information.
  • Click "Next" to proceed to extraction setup.

 

Step 2. Navigate to the webpage

Enter the target URL in the built-in browser, then click the "Go" icon to open the webpage.

The URL we used for this example is http://www.capterra.com/search

 

Step 3. Build a loop for a text list

To search with multiple keywords, we first need to build a loop over the list of keywords we have in mind.

  • Wait until the page finishes loading
  • Drop a "Loop" action into the Workflow Designer

 

Next, we need to specify that the loop iterates over a text list (the list of keywords).

  • Go to "Loop Mode"
  • Select "Text list"

Here, we are telling Octoparse to loop through a text list, which is the list of keywords we will search with. Now, we need to enter the list.

 

  • Enter the list of keywords into the Text List input box. Here I will enter "data" and "travel" as an example.
  • Click "OK"
  • Click "Save"

 

Step 4. Enter text into the search bar

Now that the text list loop has been built, we will add an "Enter Text" action within the loop. This tells Octoparse to enter each keyword from the text list into the search bar, one by one.

  • Click on the search bar of the website in the built-in browser
  • Choose "Enter text value"

  • Now, drag "Enter text value" into the "Loop Item" box manually.

 

After "Enter Text" has been added, there is one more important step: match the text list in the loop to the "Enter Text" action.

  • Select “Use current loop text to fill the text box”
  • Then click "Save"

 

Step 5. Click to search

  • Click the “Search” button of the website
  • Select “Click an item”
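For readers who think in code, Steps 3 to 5 amount to a loop that submits one search per keyword. The following is only a rough Python sketch of that idea, not how Octoparse works internally. The base URL is the one used in this tutorial, but the `query` parameter name and the `build_search_urls` helper are assumptions made for illustration.

```python
from urllib.parse import urlencode

# The base URL comes from this tutorial. The "query" parameter name is a
# hypothetical placeholder, not a confirmed detail of the site.
SEARCH_URL = "http://www.capterra.com/search"
QUERY_PARAM = "query"


def build_search_urls(keywords):
    """Return one search URL per keyword, in the order given."""
    urls = []
    for keyword in keywords:
        cleaned = keyword.strip()
        if not cleaned:
            continue  # skip blank entries in the text list
        urls.append(f"{SEARCH_URL}?{urlencode({QUERY_PARAM: cleaned})}")
    return urls


if __name__ == "__main__":
    # Same example keywords as the text list above.
    for url in build_search_urls(["data", "travel"]):
        print(url)
```

Each pass through the loop plays the role of one "Loop Item": the current keyword fills the search box, and the search is submitted.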

 

Step 6. Check the Workflow

Before moving on, it is a good idea to re-run the workflow every few steps: click the first action, wait until it completes, then click the next one. This confirms that the workflow behaves as expected. Since we just dragged an action into the loop, we need to re-run the workflow to reach the next page.

  • Click through the steps

Doing so will take us to the search result page.

 

Step 7. Build a list of items to extract

Now you are on the search result page, and the data you are interested in is arranged in a list. We need to build a list that tells Octoparse to click open each item and extract the detailed information.

  • Click the first item of the list, and make sure all the data you’d like to capture is highlighted.
  • When prompted, select “Create a list of items”, then click “Add current item to the list”.

Now the first selected item should have been added to the list. To add the other items, click “Continue to edit the list”.

 

  • Click the second item of the list. When prompted, select “Add current item to the list” one more time.

Now you should see that all items in the list have been added.

  • Finish up by clicking on “Finish Creating List”
  • Last but not least, click “Loop”.

This tells Octoparse to click each item in the list and extract the data you want.
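Conceptually, building a list of items means finding every element on the result page that matches the same pattern as the items you clicked. Below is a minimal Python sketch of that idea using only the standard library. The sample markup, the `result` class name, and the `collect_result_links` helper are all made up for illustration; the real page's structure will differ.

```python
from html.parser import HTMLParser

# Made-up markup standing in for a search result page.
SAMPLE_RESULTS = """
<ul>
  <li class="result"><a href="/p/1/alpha">Alpha</a></li>
  <li class="result"><a href="/p/2/beta">Beta</a></li>
  <li class="ad"><a href="/sponsored">Sponsored</a></li>
</ul>
"""


class ResultLinkCollector(HTMLParser):
    """Collect the first link inside each <li class="result">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_result = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "li":
            self._in_result = attributes.get("class") == "result"
        elif tag == "a" and self._in_result and "href" in attributes:
            self.links.append(attributes["href"])
            self._in_result = False  # keep only the first link per item

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_result = False


def collect_result_links(html):
    """Return the detail-page links of all matching list items."""
    parser = ResultLinkCollector()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for link in collect_result_links(SAMPLE_RESULTS):
        print(link)
```

Selecting two items in Octoparse serves a similar purpose to the class check here: it gives the tool enough examples to infer which elements belong to the list and which do not.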

 

Step 8. Extract data

  • Once the detail page gets loaded, click on the data fields you would like to capture 
  • Select "Extract Text"
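As a rough illustration of what "Extract Text" does for each selected field, here is a small Python sketch that pulls the text of chosen elements out of a detail page and tidies the whitespace. The sample markup, the field-to-class mapping, and the `extract_text` helper are hypothetical, and the sketch assumes the target elements are not nested inside one another.

```python
from html.parser import HTMLParser

# Made-up markup standing in for a product detail page.
SAMPLE_DETAIL = """
<h1 class="name"> Alpha Analytics </h1>
<p class="description">Dashboards for
   small teams.</p>
<p class="footer">Ignored</p>
"""

# Hypothetical mapping: output field name -> class attribute to capture.
FIELDS = {"name": "name", "description": "description"}


class TextExtractor(HTMLParser):
    """Accumulate the text of elements whose class matches a field."""

    def __init__(self, fields):
        super().__init__()
        self._class_to_field = {cls: field for field, cls in fields.items()}
        self._current = None
        self.record = {field: "" for field in fields}

    def handle_starttag(self, tag, attrs):
        # Assumes flat (non-nested) target elements.
        self._current = self._class_to_field.get(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current:
            self.record[self._current] += data


def extract_text(html, fields):
    """Return one record with whitespace collapsed in every field."""
    parser = TextExtractor(fields)
    parser.feed(html)
    return {key: " ".join(value.split()) for key, value in parser.record.items()}


if __name__ == "__main__":
    print(extract_text(SAMPLE_DETAIL, FIELDS))
```

Each key in `FIELDS` corresponds to one data field you click on the detail page, and the returned dictionary corresponds to one row of extracted data.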


 

Step 9. Set up Extraction Options

Now we’re done configuring the extraction rule. Optionally, choose not to load images to speed up the extraction. When you are done, click “Next”.

 

Step 10. Start running your task

Congratulations! You have just finished configuring the task.

You can now choose to:

  1. Run the task locally - on your own machine
  2. Run the task in the Cloud for a more sophisticated scraping experience
  3. Schedule the task to run in the Cloud

We'll run a local extraction for demonstration purposes.

 

The scraped data will be shown in the "Data Extracted" pane. The workflow is shown on the right side for your reference. You can also check the built-in browser to see whether the task runs as expected.

 

Step 11. Export data

Export the output to Excel files or another format of your choice, or export it directly to a database.
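If you later want to post-process the results in your own scripts, CSV is a convenient interchange format. The sketch below shows one way to serialize extracted rows to CSV text with Python's standard `csv` module. The rows and column names are invented for illustration and are not actual output from this task.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented sample rows, one per extracted detail page.
    sample_rows = [
        {"keyword": "data", "name": "Alpha, Inc."},
        {"keyword": "travel", "name": "Beta Tours"},
    ]
    print(rows_to_csv(sample_rows, ["keyword", "name"]))
```

Values that contain commas are quoted automatically, so the file opens cleanly in Excel or any other spreadsheet tool.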

 

This is the data extracted:

 

Good job on completing this tutorial!

 


 

Author: The Octoparse Team

 

 

 
