Web Crawling Case Study | Scraping information from Capterra
Wednesday, April 05, 2017 8:00 AM
In this tutorial, I will show you how to scrape data from www.capterra.com by searching with multiple keywords.
Some features that we will touch upon include:
- Create a loop for a text list
- Extract data
Now, let's get started!
Step 1. Start a new task
- Choose "Advanced Mode" and click "Start".
- Complete the basic information.
- Click "Next" to proceed to extraction setup.
Step 2. Navigate to the webpage
Enter the target URL in the built-in browser, then click the "Go" icon to open the webpage.
The URL we used for this example is http://www.capterra.com/search
Step 3. Build a loop of text list
To search with multiple keywords, we first need to build a loop over the list of keywords we have in mind.
- Wait till the page finishes loading
- Drop a "Loop" action into the Workflow Designer
We need to specify that we are looping a text list (list of keywords).
- Go to "Loop Mode"
- Select "Text list"
Here, we are telling Octoparse to loop through a text list, which is the list of keywords we will use to search with. Now, we need to input the list.
- Enter the list of keywords into the Text List input box. Here I will enter "data" and "travel" as an example.
- Click "OK"
- Click "Save"
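For readers who think in code, here is a minimal Python sketch of what a text-list loop does conceptually. It is not Octoparse's implementation: the function names `parse_text_list` and `run_loop` are made up for illustration, and the sketch assumes the list is entered one keyword per line.

```python
def parse_text_list(raw):
    """Split a pasted text list into keywords: one per line, blanks skipped."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def run_loop(keywords, action):
    """Call `action` once per keyword, in order, and collect the results."""
    return [action(keyword) for keyword in keywords]


# The two example keywords used in this tutorial.
keywords = parse_text_list("data\ntravel\n")

# A stand-in action; in the real task this is "enter text, then search".
results = run_loop(keywords, lambda kw: f"search for {kw!r}")
```

Whatever actions sit inside the loop are repeated once for each keyword, which is why the next steps place their actions inside the "Loop Item" box.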
Step 4. Enter text to the search bar
Now that the text list loop has been built, we will add an "Enter Text" action within the loop. This tells Octoparse to enter each keyword from the text list into the search bar, one by one.
- Click on the search bar of the website in the built-in browser
- Choose "Enter text value"
- Now, drag "Enter text value" into the "Loop Item" box manually.
After "Enter Text" has been added, there's one more important step: match the text list in the loop to the "Enter Text" action.
- Select “Use current loop text to fill the text box”
- Then click "Save"
Step 5. Click to search
- Click the “Search” button of the website
- Select “Click an item”
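Typing a keyword and clicking "Search" usually amounts to requesting a search URL with the keyword attached. The sketch below shows that idea in Python, outside Octoparse. The parameter name `query` is an assumption for illustration only and is not taken from Capterra; check the address bar after a real search to see what the site actually uses.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.capterra.com/search"  # URL used in this tutorial
QUERY_PARAM = "query"  # assumption: the site's real parameter name may differ


def build_search_url(keyword, base=SEARCH_URL, param=QUERY_PARAM):
    """Return the search URL for one keyword, with the keyword URL-encoded."""
    return f"{base}?{urlencode({param: keyword})}"


# One URL per keyword, mirroring one pass of the loop per keyword.
urls = [build_search_url(kw) for kw in ["data", "travel", "project management"]]
```

Encoding matters once a keyword contains spaces or punctuation; `urlencode` takes care of that, so the keywords can be kept exactly as typed in the text list.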
Step 6. Check the Workflow
Before moving on, it's a good idea to re-run the workflow every few steps: click the first action, wait until it completes, then click the next one. This confirms that the workflow works as expected. Since we just dragged an action, we need to re-run the workflow to reach the next page.
- Click through the steps
Doing so takes us to the search results page.
Step 7. Build a list of items to extract
Now that you’ve reached the page you want, you can see that the data you are interested in is arranged in a list. We need to build a list to tell Octoparse to open each item of the list and extract the detailed information.
- Click the first item of the list and make sure all the data you’d like to capture is highlighted.
- When prompted, select “Create a list of items”, then click “Add current item to the list”.
Now the first selected item should have been added to the list. To add the other items, click “Continue to edit the list”.
- Click the second item of the list. When prompted, select “Add current item to the list” one more time.
Now, you should see all items of the list are added.
- Finish up by clicking on “Finish Creating List”
- Last but not least, click “Loop”.
You are telling Octoparse to click on each item of the list to extract the data you want.
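Building a list of items boils down to finding every element on the page that matches the same pattern as the ones you clicked. Here is a rough Python sketch of that idea using only the standard library. The sample markup and the class name `result` are invented for illustration and do not reflect Capterra's real page structure.

```python
from html.parser import HTMLParser


class ItemLinkCollector(HTMLParser):
    """Collect the href of every <a> tag that carries a given CSS class."""

    def __init__(self, item_class):
        super().__init__()
        self.item_class = item_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.item_class in classes and attrs.get("href"):
            self.links.append(attrs["href"])


# Made-up markup standing in for a search results page.
SAMPLE_RESULTS = """
<ul>
  <li><a class="result" href="/p/1/alpha">Alpha</a></li>
  <li><a class="result" href="/p/2/beta">Beta</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
"""

collector = ItemLinkCollector("result")
collector.feed(SAMPLE_RESULTS)
item_links = collector.links  # the "list of items" to loop over
```

Selecting two items in Octoparse plays the same role as the class name here: it gives the tool enough examples to work out which elements belong to the list and which, like the navigation link, do not.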
Step 8. Extract data
- Once the detail page gets loaded, click on the data fields you would like to capture
- Select "Extract Text"
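"Extract Text" can be pictured as reading the text inside each element you clicked and storing it under a field name. The following is a simplified Python sketch of that, not Octoparse's own logic. The element ids (`name`, `vendor`) and the sample markup are hypothetical, and the sketch assumes the chosen elements contain no nested tags.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Capture the text of elements whose id is in `field_ids`.

    Simplification: assumes the chosen elements contain no nested tags.
    """

    def __init__(self, field_ids):
        super().__init__()
        self.field_ids = set(field_ids)
        self.current = None  # id of the field being read, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.field_ids:
            self.current = element_id
            self.record[element_id] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.record[self.current] += data

    def handle_endtag(self, tag):
        if self.current is not None:
            # Trim surrounding whitespace once the field's element closes.
            self.record[self.current] = self.record[self.current].strip()
            self.current = None


# Made-up markup standing in for one detail page.
SAMPLE_DETAIL = """
<h1 id="name"> Alpha Analytics </h1>
<p id="vendor">Alpha Inc.</p>
<p id="unrelated">Ignore me</p>
"""

extractor = FieldExtractor(["name", "vendor"])
extractor.feed(SAMPLE_DETAIL)
record = extractor.record  # one row of extracted data
```

Each detail page produces one such record, and the records from all pages together form the rows of the final data set.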
Step 9. Set up Extraction Options
Now we’re done configuring the extraction rule. You can choose not to load images to speed up the extraction. When you are done, click “Next”.
Step 10. Start running your task
Congratulations! You have just finished configuring the task.
You can now choose to:
- Run the task locally, on your own machine
- Run the task in the Cloud for a more sophisticated scraping experience
- Schedule the task to run in the Cloud
We'll run a local extraction for demonstration purposes.
The scraped data will be shown in the "Data Extracted" pane. The workflow is shown on the right side for your reference. You can also check the built-in browser to see whether the task runs as expected.
Step 11. Export data
Export the extracted data to Excel files or any other format of your choice, or export it directly to a database.
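If you ever need to post-process or re-export the data yourself, CSV is a convenient format that Excel can open. Below is a minimal Python sketch using the standard `csv` module. The column names and sample rows are invented for illustration and are not real Capterra data.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample rows; the second value contains a comma on purpose.
rows = [
    {"keyword": "data", "name": "Alpha Analytics"},
    {"keyword": "travel", "name": "Beta Trips, Inc."},
]
csv_text = rows_to_csv(rows, ["keyword", "name"])
```

Note that the `csv` module quotes the value containing a comma, so it stays in one column. To save to disk instead, open the file with `newline=""` and pass the file object to `csv.DictWriter` in place of the in-memory buffer.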
This is the data extracted:
Good job on completing this tutorial!
Now learn more about related topics:
- Searching based on multiple text boxes
- Automatically Cycle Text List
- Create A Loop For Pagination Manually
Author: The Octoparse Team