Web Crawling Case Study | Crawling Data from Otlob.com
Thursday, May 11, 2017 10:29 AM
Welcome to Octoparse web scraping case study!
In this tutorial, we will walk through the steps for crawling data from otlob.com.
Now, let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in the Advanced Mode
- Complete the basic information
Step 2. Navigate to the target website
- Enter the target URL in the built-in browser (URL of the example: https://www.otlob.com/restaurant/e7so/mcdonald-s-6th-of-october)
- Click "Go" icon to open webpage
Step 3. Create a list of items
Once the webpage finishes loading, you will notice that the menu items share a very similar section layout. That means we can build a loop list that contains all of these similar sections, then configure Octoparse to extract data from each section.
- Move your cursor over the menu items with a similar layout; these are the sections you will extract data from.
- Click anywhere on the first section of the web page
- Click "Expand the selection area" until the outlined box includes all the content you want.
- Click "Create a list of items"
- Click "Add current item to the list"
Note: "Expand the selection area" is typically needed when the initial selection was not identified properly.
Now that the first item has been added to the list, we need to add the remaining items.
- Click "Continue to edit the list"
- Click a second section with similar layout
- Click "Add current item to the list" again
Octoparse then adds some of the similar sections to the list automatically.
- Click "Finish Creating List"
- Click "loop", which tells Octoparse to click each section in the list in turn and extract the selected data
Step 4. Modify XPath to build a complete loop list
After creating the loop list in Step 3, you will notice that only some of the menu items were added to it. We therefore need to modify the XPath of the variable list so that all of the menu items are included in the loop list.
- Copy the XPath generated by Octoparse into FirePath.
- Modify the XPath until it matches all of the loop items.
- Paste the modified XPath, .//*[@id='menu']/div/div/section/article/article, into the Variable list.
- Click "Save"
Now, we can see that all of the items are fetched by Octoparse.
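To see why removing an overly specific step from an XPath picks up the missing items, here is a minimal sketch using Python's standard library rather than Octoparse or FirePath. The markup is a hypothetical, simplified stand-in for the menu (not the real otlob.com page), and the "narrow" expression is an assumed example of an XPath that matches only part of the list; only the second expression comes from the tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the menu markup (not the real page).
SAMPLE = """
<html><body>
  <div id="menu"><div><div>
    <section>
      <article>
        <article><h4>Item A</h4></article>
        <article><h4>Item B</h4></article>
      </article>
    </section>
    <section>
      <article>
        <article><h4>Item C</h4></article>
      </article>
    </section>
  </div></div></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Assumed example of a too-narrow XPath: pinned to the first section only.
narrow = root.findall(".//*[@id='menu']/div/div/section[1]/article/article")

# The XPath from the tutorial: no positional index, so every section matches.
full = root.findall(".//*[@id='menu']/div/div/section/article/article")

print(len(narrow), len(full))  # 2 3
```

In this sample, the narrow expression returns two items while the tutorial's expression returns all three, which mirrors the "only some items were added" symptom described above.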
Step 5. Select the data to be extracted and rename data fields
Now, we want to extract data from these similar menu items.
Go back to the "Extract Data" action and click it. The first menu item section is now outlined with a dotted box, which means the fields you select will be extracted from that first section. Note that the extraction rules you set up for the first section are applied to every other item in the loop list.
- Click the product name.
- Select "Extract text"
- Follow the same steps to extract other data.
- Rename the fields if necessary.
- Click "Save"
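The idea of defining the fields once on the first item and applying them to every item in the loop can be sketched in Python as follows. This is only an illustration of the concept, not how Octoparse works internally; the markup, the class names, and the field names (name, description, price) are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for two looped menu items; tags and classes are assumptions.
SAMPLE = """
<section>
  <article>
    <article><h4>Item A</h4><p class="desc">First description</p><span class="price">10.00</span></article>
    <article><h4>Item B</h4><p class="desc">Second description</p><span class="price">12.50</span></article>
  </article>
</section>
"""

def extract_item(item):
    """Apply the same field rules to one loop item and return a dict of text values."""
    return {
        "name": (item.findtext("h4") or "").strip(),
        "description": (item.findtext("p[@class='desc']") or "").strip(),
        "price": (item.findtext("span[@class='price']") or "").strip(),
    }

root = ET.fromstring(SAMPLE.strip())

# The loop list: every nested article is one menu item.
rows = [extract_item(item) for item in root.findall("./article/article")]
for row in rows:
    print(row)
```

The dictionary keys play the role of the renamed data fields: choosing clear names here determines the column headers of the exported data.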
Step 6. Start running your task
Now that the task is configured, it's time to run it and get the data we want.
- Click "Next"
- Click "Next"
- Select "Local Extraction"
Octoparse offers Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud Platform, so you can set it up, turn off your desktop or laptop, and the data will still be extracted automatically and saved to the cloud. Scheduled extraction, IP rotation, and API access are also supported in the Cloud.
Step 7. Check and export the data
When the task is completed, you can check the extracted data and export the results.
- Check the extracted data
- Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer
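If you later want to post-process extracted rows yourself, a CSV file is one of the simplest formats to produce. The sketch below is not Octoparse's export feature; it assumes rows shaped as dictionaries with hypothetical field names, and the values are placeholders rather than real otlob.com data.

```python
import csv
import io

# Placeholder rows standing in for extracted results (not real data).
rows = [
    {"name": "Item A", "price": "10.00"},
    {"name": "Item B", "price": "12.50"},
]

def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text, using the dict keys as the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows)
print(csv_text)
```

To save the result to disk, write csv_text to a file opened with newline="" so that the csv module's line endings are preserved.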
Author: The Octoparse Team