
Scrape Real Estate Data (Example:www.realtor.com)

Monday, May 30, 2016 7:19 AM


 

Realtor.com is a website where you can search real estate for sale, discover new homes, shop for mortgages, and find property records. 

 

In this tutorial, we are going to show you how to scrape property data from Realtor.com. The website has anti-scraping techniques, so we need to make sure not to scrape the website too fast.

 

We will use Octoparse to scrape the title, location, price, and other details from each property detail page.

 

To follow along, you can use this URL in the tutorial:

https://www.realtor.com/realestateandhomes-search/Tallassee_AL

We'll use two tasks to get the data from the detail pages.

 

Here are the main steps in this tutorial: 

 

Task 1: Extract all the URLs of detail pages on the search result pages [Download the demo task file here]

  1. "Go To Web Page" - open the target web page
  2. Create a pagination loop - scrape all the results from multiple pages
  3. Create a "Loop Item" - to loop extract URLs of all the listings
  4. Start extraction - run the task and get data

 

Task 2: Collect the property information from scraped URLs [Download the demo task file here]

  1. Input a batch of the scraped URLs - loop opens the detail pages
  2. Extract data - select the data for extraction
  3. Refine the data fields
  4. Set up wait time - slow down the scraping

  5. Start extraction - run the task and get data

 

Task 1: Extract the detail page URLs on the search result pages

1. "Go to Web Page" - open the target web page

  • Enter the example URL and click Start

 

2. Create a pagination loop - scrape all the results from multiple pages

  • Scroll down and click the "Next" button on the web page
  • Click Loop click next page on the Tips panel

loop click next page 

  

Octoparse auto-detects that AJAX is applied to the click action and sets the timeout to 3 seconds. You can modify it based on your local Internet conditions (click to learn more about AJAX: Handling AJAX).

 

  • Set up AJAX timeout as 10 seconds

set ajax timeout
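Conceptually, an AJAX timeout is a bounded wait: after the click, keep checking whether the page has finished updating, and give up once the limit is reached. Here is a minimal Python sketch of that idea (the `wait_for_ajax` and `page_updated` names are hypothetical; Octoparse handles this internally):

```python
import time

def wait_for_ajax(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Illustrative sketch of an AJAX timeout: repeatedly check whether
    the page has finished updating, and stop waiting at the deadline.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    return False

# Example: a condition that becomes true after a few polls
state = {"calls": 0}
def page_updated():
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_for_ajax(page_updated, timeout=10.0, poll_interval=0.01))  # True
```

A longer timeout tolerates slower connections at the cost of slower runs, which is why the tutorial bumps the default 3 seconds up to 10.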

 

 

  • Click on the Pagination step in the workflow and enter the Xpath: //a[@aria-label="Go to next page"][not(contains(@class, "disabled"))]

change pagination xpath 
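The `[not(contains(@class, "disabled"))]` predicate is what stops the loop on the last page: once the "Next" button gets a disabled class, the XPath matches nothing and pagination ends. You can check this with lxml on hypothetical markup modeled on that XPath (the HTML snippets below are illustrative, not copied from realtor.com):

```python
from lxml import html

# Hypothetical next-page markup: an active button, then a disabled one
page_mid = '<div><a aria-label="Go to next page" class="pagination-btn" href="/page/2">Next</a></div>'
page_last = '<div><a aria-label="Go to next page" class="pagination-btn disabled">Next</a></div>'

xpath = '//a[@aria-label="Go to next page"][not(contains(@class, "disabled"))]'

# One match on a middle page -> keep paginating
print(len(html.fromstring(page_mid).xpath(xpath)))
# No match on the last page -> pagination stops
print(len(html.fromstring(page_last).xpath(xpath)))
```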

 

 

3.  Create a "Loop Item" - to loop extract URLs of all the listings

  • Click on the image of the first item on the list
  • Click the A tag at the bottom of the Tips panel (A tag defines a hyperlink, which is used to link from one page to another) 

click the a tag

 

  • Click Select All on the Tips
  • Choose Extract the URLs of the selected elements

 

We can see that some items are not selected, so we need to modify the XPath of the Loop Item.

  • Click on Loop Item
  • Change Loop Mode from Fixed List to Variable list
  • Enter XPath //ul[@data-testid='property-list-container']/li into the text box
  • Click Apply to save
  • Go to Extract Data and modify the URL XPath 
  • Set the XPath as //a[@rel="noopener"] 

change xpath
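The Loop Item XPath selects each listing, and the field XPath is then evaluated relative to the current listing. A small lxml sketch of the same pattern, using simplified, hypothetical listing markup (note that the relative evaluation Octoparse does implicitly becomes an explicit `.//` prefix in lxml):

```python
from lxml import html

# Simplified, hypothetical listing markup modeled on the XPaths above
doc = html.fromstring("""
<ul data-testid="property-list-container">
  <li><a rel="noopener" href="/property/1">Home 1</a></li>
  <li><a rel="noopener" href="/property/2">Home 2</a></li>
</ul>
""")

urls = []
# Loop Item XPath: one node per listing
for item in doc.xpath("//ul[@data-testid='property-list-container']/li"):
    # Field XPath, evaluated relative to the current loop item
    for a in item.xpath('.//a[@rel="noopener"]'):
        urls.append(a.get("href"))

print(urls)  # ['/property/1', '/property/2']
```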

 

 

4. Start extraction - run the task and get data

  • Click the Save icon
  • Click the Run icon on the upper left side
  • Select Run task on your device to run the task on your computer, or select Run task in the Cloud to run the task in the Cloud (for premium users only)

 

Here is the sample output.

sample data

  

 

Task 2: Collect property data from scraped property URLs

 

1. Input a batch of the scraped URLs - loop opens the detail pages

In Task 1, we already have a list of URLs.

  • Click + New to start a task using Advanced Mode to build Task 2
  • Choose Import from the task to get the URLs from Task 1 

import urls

 

 

Tip!

There are 4 ways to input URLs. In this tutorial, we use Import from the task for demonstration. Please note that this option only works when the parent task is run in the Cloud. If we import from a local run's data results, only 100 lines of data will be imported. To learn more about importing URLs, check this guide: Batch URL input.

 

 

After clicking the Save button, you will see a loop item named Loop URLs generated in the workflow.

 

2. Extract data - select the data for extraction

  • Click on the elements you want to scrape
  • Choose Extract text/URL/image URL of the selected element on the Tips panel
  • Double click each field to rename it

 

3. Refine the data fields

To avoid data being extracted into the wrong column, we will need to customize the element XPath.

  • Click the More icon and click Customize XPath
  • Input the revised XPath into the text box and click Apply to save

 

Here are revised XPaths for some common data fields:

  • Presented_by: //div[contains(text(),'Presented')]/following-sibling::span[2]
  • Price: //div[@data-testid="list-price"]
  • Facilities: //div[@data-testid="property-meta"]
  • Address: //div[@data-testid="address"]
  • Property_type: //div[contains(text(),'Property')]/following-sibling::div[1]
  • Time_on_realtor: //div[contains(text(),'Time on realtor.com')]/following-sibling::div[1]
  • Price_per_sqft: //div[contains(text(),'Price per sqft')]/following-sibling::div[1]
  • Year_Built: //div[contains(text(),'Year Built')]/following-sibling::div[1]
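Most of these fields use the same label/value pattern: find the div holding the label text, then take the first following-sibling div as the value. A quick lxml check of that pattern on a hypothetical detail-page fragment (the markup is illustrative, not taken from realtor.com):

```python
from lxml import html

# Hypothetical label/value fragment matching the Year_Built XPath above
doc = html.fromstring("""
<div>
  <div>Year Built</div>
  <div>1998</div>
</div>
""")

# following-sibling::div[1] grabs the value div right after the label div
year_built = doc.xpath("//div[contains(text(),'Year Built')]/following-sibling::div[1]/text()")
print(year_built)  # ['1998']
```

Anchoring fields to a nearby label like this is more robust than positional XPaths, since it keeps working when the page layout shifts around the label.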

 

4. Set up wait time - slow down the scraping

As the website applies anti-scraping techniques, we need to set up wait time to slow down the scraping speed so as to avoid being blocked.

  • Click on the Extract Data
  • Go to Options
  • Tick Wait before action and set it as 7s-10s
  • Click Apply to save

set wait time
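The 7s-10s range means Octoparse picks a random delay inside that window before each action, so requests don't arrive at a perfectly regular, bot-like interval. The same idea in a minimal Python sketch (the `polite_delay` helper is hypothetical, for illustration only):

```python
import random
import time

def polite_delay(min_s=7.0, max_s=10.0):
    """Sleep for a random duration between min_s and max_s seconds.

    Randomizing the wait (rather than using a fixed interval) makes
    request timing look less robotic -- the same idea as the 7s-10s
    'Wait before action' setting. Illustrative sketch only.
    """
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

d = polite_delay(0.01, 0.02)  # short bounds here just for demonstration
print(0.01 <= d <= 0.02)  # True
```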

 

 

5. Start extraction - run the task and get data

  • Click Save to save the task first
  • Click Run on the upper left side
  • Select Run task on your device to run the task on your computer, or select Run task in the Cloud to run the task in the Cloud (for premium users only)

 

Here is the sample output.

sample data

 

 

Author: The Octoparse Team

Download Octoparse Today

 

For more information about Octoparse, please click here.

Sign up today.

 
