Web Crawling Case Study | Crawling real estate data from Zillow

Friday, May 12, 2017 9:52 AM

Welcome to the Octoparse web crawling case study. In this tutorial, we will walk through the detailed steps to crawl real estate data from Zillow.com.

 

List of features covered 

  • Modify XPath
  • Build a loop list
  • Set up pagination

 

 

Now, let's get started!

 

Step 1. Set up basic information

  • Click "Quick Start"
  • Create a new task in the Advanced Mode
  • Complete the basic information

 

 

Step 2. Navigate to the target website

To load the target page in the built-in web browser, paste the target URL into the URL text box.

 

 

Step 3. Create a list of items

Notice that the properties are arranged in sections with a similar layout. Move your cursor over the property information that you want to extract.

  • Click the address of the first section "Woodinville, WA"
  • When prompted, click “Create a list of items” (sections with similar layout)
  • Click “Add current item to the list”

Note that the loop list must be built from the addresses of these properties, because the address link leads to the second-level detail page where the data is extracted. Other loop items, such as pictures and sections, do not navigate directly to the detail page when clicked.

 

Now that the first item has been added to the list, we need to add the remaining items.

  • Click “Continue to edit the list”
  • Click a second section with similar layout
  • Click “Add current item to the list” again

Now all the sections with a similar layout have been added to the list.

  • Click “Finish Creating List”
  • Click “loop”. This tells Octoparse to click each section in the list and extract the selected data

 

 

Step 4. Modify XPath to locate the loop items

You can see that the loop items shown in the "Loop Item" list are not located properly. We only want a loop list of addresses, so we need to modify the XPath of the variable list to locate these items correctly.

  • First, inspect an address, for example "Woodinville, WA", in FirePath.
  • Modify the XPath so that it locates all of the addresses with a similar layout.
  • Copy the modified XPath .//li[@class='truncate-line'] and paste it into the Variable list in Octoparse.
  • Click "Save"

Now you can see that the loop items are shown properly in the "Loop Item" list.
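
To see what the modified XPath selects, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which supports this subset of XPath. The markup is a made-up, simplified stand-in for a results page (the real Zillow markup is different and may change), and "Bothell, WA" is a hypothetical second address added for illustration. Only the XPath expression itself comes from this tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a search results page.
SAMPLE = """
<ul>
  <li class="photo">photo-1.jpg</li>
  <li class="truncate-line">Woodinville, WA</li>
  <li class="photo">photo-2.jpg</li>
  <li class="truncate-line">Bothell, WA</li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# The XPath from this step: every <li> whose class attribute is exactly 'truncate-line'.
addresses = [li.text for li in root.findall(".//li[@class='truncate-line']")]
print(addresses)
```

The photo items are skipped and only the two address items are returned, which is the same effect the modified XPath has on the "Loop Item" list.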

 

 

Step 5. Select the data to be extracted 

  • Navigate to the "Click Item" action and click it

You will now be taken to the detail page.

  • Click the address data field
  • Select “Extract text”
  • Follow the same steps to extract the other data fields

 

 

Step 6. Rename data fields

  • Rename any field names if necessary.
  • Click "Save"

 

 

Step 7. Set up pagination  

  • Click on “>” to the right of page numbers
  • The prompt that appears does not yet offer a loop click action for pagination, so we need to locate the pagination button manually by selecting Tag "A"
  • Then, select "Loop click the element"
  • Click "Save"

This tells Octoparse to open each page in turn and repeat the extraction actions.
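
Conceptually, the workflow built so far is two nested loops: an outer loop that moves from page to page, and an inner loop over the address list on each page. The Python sketch below illustrates that idea only. It is not how Octoparse is implemented, and the in-memory PAGES data and the crawl function are hypothetical names made up for this example.

```python
# Hypothetical in-memory "site": each page has some items and an optional next page.
PAGES = {
    "page-1": {"items": ["Address A", "Address B"], "next": "page-2"},
    "page-2": {"items": ["Address C"], "next": None},
}


def crawl(start):
    """Visit each page in turn, collecting items, until there is no next page."""
    collected = []
    seen = set()  # guards against a pagination link that points back to a visited page
    current = start
    while current is not None and current not in seen:
        seen.add(current)
        page = PAGES[current]
        # Inner loop: corresponds to the loop list of addresses on one page.
        for item in page["items"]:
            collected.append({"page": current, "address": item})
        # Outer loop: corresponds to clicking ">" to open the next page.
        current = page["next"]
    return collected


rows = crawl("page-1")
print(len(rows))
```

The loop ends when a page has no next page, which mirrors the pagination loop stopping once the ">" button can no longer be clicked.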

 

 

Step 8. Start running your task

  • After saving your extraction configuration, click “Next”
  • Select “Local Extraction”
  • Click “OK” to run the task on your computer.

    (Octoparse will automatically extract all the selected data. Check the "Data Extracted" pane for the extraction progress.)

 

 

 

Step 9. Check the data and export

  • Check the data extracted
  • Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer
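
If you want to double-check the exported data outside Octoparse, a short script can flag rows with empty fields. The sketch below assumes the results were exported as CSV and that the fields were named "address" and "price". Both the column names and the sample rows are hypothetical, so adjust them to match your own export.

```python
import csv
import io

# Hypothetical export contents; a real file would be opened with open(path, newline="").
EXPORTED = """address,price
"Woodinville, WA",PRICE-1
"Bothell, WA",
"""

rows = list(csv.DictReader(io.StringIO(EXPORTED)))

# Rows where any field came back empty, which may mean a data field was not located on that page.
incomplete = [
    row for row in rows
    if any(not (value or "").strip() for value in row.values())
]
print(len(rows), len(incomplete))
```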

 

 

 

Done!

 

You have now learned how to crawl data from Zillow.

 

To learn more about how to crawl data from websites, check out these case tutorials:

Web Scraping Case Study | Scraping Data from Yelp

Scrape Article Information from Google Scholar

Extract Facebook Data

How to Extract Information from Yellow Page Websites

 

 

Or you can refer to the tutorials below to learn more about what you can do with these powerful features:

Modify XPath Manually in Octoparse

Web Scraping - Modify X Path For "Load More" Button with Octoparse

Extracting Stock Prices using Regular expression (Example: Finance.Yahoo.com)

 

 

Author: The Octoparse Team

 

 

 


 

 

 

 

 

 

 

 
