Web Crawling Case Study | Crawling real estate data from Zillow
Friday, May 12, 2017 9:52 AM
Welcome to the Octoparse web crawling case study. In this tutorial, we will walk through the detailed steps to crawl real estate data from Zillow.com.
List of features covered
- Modify XPath
- Build a loop list
- Set up pagination
Now, let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in the Advanced Mode
- Complete the basic information
Step 2. Navigate to the target website
To retrieve the web content in the built-in web browser, we need to paste the target URL in the URL text box.
- Enter the target URL in the built-in browser (URL of the example: https://www.zillow.com/digs/)
- Click the "Go" icon to open the webpage
Step 3. Create a list of items
Notice that the properties are arranged in sections with a similar layout. Move your cursor over the property information that you want to extract.
- Click the address of the first section "Woodinville, WA"
- When prompted, click “Create a list of items” (sections with similar layout)
- Click “Add current item to the list”
Note that we need to build the loop list from the addresses of these properties, because the address link leads to the second-level detail page where the data will be extracted. Other items, such as pictures and whole sections, do not navigate directly to the detail page when clicked.
Now that the first item has been added to the list, we need to add the remaining items.
- Click “Continue to edit the list”
- Click a second section with similar layout
- Click “Add current item to the list” again
Now all the sections with a similar layout have been added to the list.
- Click “Finish Creating List”
- Click “Loop”. This tells Octoparse to click each item in the list to extract the selected data
Step 4. Modify XPath to locate the loop items
You can see that the items shown in the "Loop Item" list are not located properly. Specifically, we only want to build a loop list of addresses, so we need to modify the XPath of the variable list to locate these items correctly.
- First, inspect an address, for example "Woodinville, WA", in FirePath.
- Modify the XPath so that it matches all of the addresses with a similar layout.
- Copy the modified XPath .//li[@class='truncate-line'] and paste it into the Variable list of Octoparse.
- Click "Save"
Now, you can see that the loop items are shown properly in the "Loop Item" list.
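To see what the XPath from Step 4 selects, here is a minimal Python sketch that applies the same expression to a small, made-up page fragment using the standard library. The markup, the "photo" class, and the second address are assumptions for illustration only; the real Zillow page is more complex and may change over time.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the listing page.
SAMPLE_PAGE = """
<ul>
  <li class="photo"><a href="/digs/1">photo</a></li>
  <li class="truncate-line"><a href="/digs/1">Woodinville, WA</a></li>
  <li class="photo"><a href="/digs/2">photo</a></li>
  <li class="truncate-line"><a href="/digs/2">Seattle, WA</a></li>
</ul>
"""


def find_addresses(markup):
    """Return the text of every <li> whose class is exactly 'truncate-line'."""
    root = ET.fromstring(markup.strip())
    # Same expression as in Step 4: only the address items match,
    # the photo items are skipped.
    items = root.findall(".//li[@class='truncate-line']")
    return ["".join(item.itertext()).strip() for item in items]


addresses = find_addresses(SAMPLE_PAGE)
print(addresses)
```

Because the predicate tests the class attribute, the photo items are left out of the result, which is the same narrowing effect the modified XPath has on the "Loop Item" list.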
Step 5. Select the data to be extracted
- Navigate to the "Click Item" action and click it
You will now be taken to the detail page.
- Click the address data field
- Select “Extract text”
- Follow the same steps to extract the other data fields
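The "Extract text" action in Step 5 can be pictured as mapping each data field to a location on the detail page and reading the text found there. The sketch below is only an illustration of that idea, not how Octoparse works internally: the markup, the class names, and the field names "address" and "style" are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical detail page; real pages have different, richer markup.
DETAIL_PAGE = """
<div>
  <h1 class="address">Woodinville, WA</h1>
  <span class="style">Contemporary <b>kitchen</b></span>
</div>
"""

# Hypothetical field name -> XPath mapping, one entry per data field.
FIELD_XPATHS = {
    "address": ".//h1[@class='address']",
    "style": ".//span[@class='style']",
}


def extract_text(root, xpath):
    """Return the whitespace-normalized text of the first match, or None."""
    node = root.find(xpath)
    if node is None:
        return None
    # itertext() also collects text inside nested tags such as <b>.
    return " ".join("".join(node.itertext()).split())


def extract_record(markup):
    """Build one record (a dict) from a detail page."""
    root = ET.fromstring(markup.strip())
    return {name: extract_text(root, xpath) for name, xpath in FIELD_XPATHS.items()}


record = extract_record(DETAIL_PAGE)
print(record)
```

Returning None for a missing field keeps one incomplete page from stopping the whole run, at the cost of having to check for empty values afterwards.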
Step 6. Rename data fields
- Rename any field names if necessary.
- Click "Save"
Step 7. Set up pagination
- Click on “>” to the right of page numbers
- When prompted, the loop click action for pagination is still not available, so we need to locate the pagination button manually by selecting the tag "A"
- Then, select "Loop click the element"
- Click "Save"
This tells Octoparse to click through each page and repeat the extraction actions.
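The pagination loop from Step 7 boils down to this: extract the current page, follow the "next" link, and stop when there is none. Here is a minimal sketch of that control flow, assuming a made-up in-memory site in place of real HTTP requests. The page keys, the "items" and "next" fields, and the max_pages guard are illustrative assumptions.

```python
# Hypothetical in-memory "site": each page lists its items and the key of
# the next page (None on the last page).
PAGES = {
    "page-1": {"items": ["A", "B"], "next": "page-2"},
    "page-2": {"items": ["C"], "next": "page-3"},
    "page-3": {"items": ["D", "E"], "next": None},
}


def crawl_all_pages(start, fetch, max_pages=100):
    """Collect items page by page until there is no next page.

    `fetch` is any callable that returns a page dict for a page key.
    """
    collected = []
    seen = set()
    current = start
    # Stop on the last page, on a repeated page (loop guard), or at the cap.
    while current is not None and current not in seen and len(seen) < max_pages:
        seen.add(current)
        page = fetch(current)
        collected.extend(page["items"])
        current = page["next"]
    return collected


all_items = crawl_all_pages("page-1", PAGES.__getitem__)
print(all_items)
```

The set of visited pages and the page cap are defensive choices: they keep the loop from running forever if a site's "next" link ever points back to an earlier page.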
Step 8. Start running your task
- After saving your extraction configuration, click “Next”
- Select “Local Extraction”
- Click “OK” to run the task on your computer.
(Octoparse will automatically extract all the selected data. Check the "Data Extracted" pane for the extraction progress.)
Step 9. Check the data and export
- Check the data extracted
- Click the "Export" button to export the results to an Excel file, a database, or other formats and save the file to your computer
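For readers who post-process the data themselves, the export step amounts to writing one row per extracted record. The sketch below shows this with Python's csv module, producing CSV text that Excel can open. The records and field names are hypothetical sample data, not real extraction results.

```python
import csv
import io

# Hypothetical extracted records, one dict per property.
records = [
    {"address": "Woodinville, WA", "style": "Contemporary kitchen"},
    {"address": "Seattle, WA", "style": "Modern bathroom"},
]


def to_csv(rows, fieldnames):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records, ["address", "style"])
print(csv_text)
```

Addresses contain commas, so the writer wraps them in quotes automatically; this keeps each address in a single column when the file is opened in a spreadsheet.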
You have now learned how to crawl data from Zillow.
Author: The Octoparse Team