You are browsing a tutorial guide for Octoparse version 8.4. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!
Realtor.com is a website where you can search real estate for sale, discover new homes, shop for mortgages, and find property records.
In this tutorial, we are going to show you how to scrape property data from Realtor.com. The website uses anti-scraping techniques, so we need to make sure not to scrape it too fast.
We will scrape data from the property detail pages, extracting the title, location, price, rating, etc., with Octoparse.
To follow along, you may want to use this URL in the tutorial:
We'll use 2 tasks to get the data in the detail pages.
Here are the main steps in this tutorial:
Task 1: Extract all the URLs of detail pages on the search result pages [Download the demo task file here]
Task 2: Collect the product information from scraped URLs [Download the demo task file here]
Task 1: Extract the detail page URLs on the search result pages
1. "Go to Web Page" - open the target web page
Enter the example URL and click Start
2. Create a Pagination - scrape all the results from multiple pages
Scroll down and click the "Next" button on the web page
Click Loop click next page on the Tips panel
Octoparse auto-detects that AJAX is applied to the click action and sets the timeout as 3 seconds by default. You can modify it based on your local Internet condition (click to learn more about AJAX: Handling AJAX).
Set the AJAX timeout as 10 seconds
Click on the Pagination step in the workflow and enter the XPath: //a[@aria-label="Go to next page"][not(contains(@class, "disabled"))]
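To see why the `not(contains(@class, "disabled"))` predicate matters, here is a minimal sketch (assuming lxml is installed) that runs the same XPath against simplified stand-in markup, not Realtor.com's actual HTML:

```python
from lxml import html

NEXT_XPATH = '//a[@aria-label="Go to next page"][not(contains(@class, "disabled"))]'

# Simplified stand-in for a results page that still has more pages
snippet = """
<nav>
  <a aria-label="Go to previous page" class="pagination-btn disabled">Prev</a>
  <a aria-label="Go to next page" class="pagination-btn">Next</a>
</nav>
"""
tree = html.fromstring(snippet)
next_links = tree.xpath(NEXT_XPATH)
print(len(next_links))  # 1: the enabled "Next" link is matched

# On the last page the site marks the link as disabled, so the XPath
# matches nothing and the pagination loop stops instead of clicking forever.
last_page = html.fromstring(
    '<a aria-label="Go to next page" class="pagination-btn disabled">Next</a>'
)
print(len(last_page.xpath(NEXT_XPATH)))  # 0
```

Without the predicate, the loop would keep "clicking" the disabled link on the last page and never terminate.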
3. Create a "Loop Item" - to loop extract URLs of all the listings
Click on the image of the first item on the list
Click the A tag at the bottom of the Tips panel (the A tag defines a hyperlink, which links from one page to another)
Click Select All on the Tips
Choose Extract the URLs of the selected elements
We can see that some items are not selected, so we need to modify the XPath of the Loop Item.
Click on Loop Item
Change Loop Mode from Fixed List to Variable list
Enter XPath //ul[@data-testid='property-list-container']/li into the text box
Click Apply to save
Go to Extract Data and modify the URL XPath
Set the XPath as //a[@rel="noopener"]
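The two XPaths work together: the Loop Item XPath selects each listing card, and the URL XPath picks the hyperlink inside it. A minimal sketch (assuming lxml is installed, and using simplified stand-in markup rather than Realtor.com's actual HTML):

```python
from lxml import html

snippet = """
<ul data-testid="property-list-container">
  <li><a rel="noopener" href="/realestateandhomes-detail/home-1">Home 1</a></li>
  <li><a rel="noopener" href="/realestateandhomes-detail/home-2">Home 2</a></li>
</ul>
"""
tree = html.fromstring(snippet)

# Loop Item XPath: one node per listing card
cards = tree.xpath("//ul[@data-testid='property-list-container']/li")

# URL XPath, evaluated relative to each card (hence the leading dot)
urls = [card.xpath('.//a[@rel="noopener"]/@href')[0] for card in cards]
print(urls)
```

Anchoring the loop on the list container ensures every listing is picked up, including the ones the auto-detection missed.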
4. Start extraction - run the task and get data
Run the task from the upper left side
Select Run task on your device to run the task on your computer, or select Run task in the Cloud to run the task in the Cloud (for premium users only)
Here is the sample output -
Task 2: Collect property data from scraped property URLs
1. Input a batch of the scraped URLs - loop opens the detail pages
In Task 1, we already have a list of URLs.
Click + New to start a task using Advanced Mode to build Task 2
Choose Import from the task to get the URLs from Task 1
TIP: There are 4 ways to input URLs. In this tutorial, we use Import from the task for demonstration. Please note that this option only works when the parent task has been run in the Cloud. If we import from local run data, only the first 100 lines of data will be imported. To learn more about importing URLs, check this guide: Batch URL input.
After clicking the Save button, you will see a loop item named Loop URLs generated in the workflow.
2. Extract data - select the data for extraction
Click on the elements you want to scrape
Choose Extract text/URL/image URL of the selected element on the Tips panel
Double click each field to rename it
3. Refine the data fields
To avoid data being fetched into the wrong column, we need to customize the element XPath.
Click More (...) and select Customize XPath
Input the revised XPath into the text box and click Apply to save
Here are the revised XPaths for some common data fields:
Presented_by: //div[contains(text(),'Presented')]/following-sibling::span[2]
Price: //div[@data-testid="list-price"]
Facilities: //div[@data-testid="property-meta"]
Address: //div[@data-testid="address"]
Property_type: //div[contains(text(),'Property')]/following-sibling::div[1]
Time_on_realtor: //div[contains(text(),'Time on realtor.com')]/following-sibling::div[1]
Price_per_sqft: //div[contains(text(),'Price per sqft')]/following-sibling::div[1]
Year_Built: //div[contains(text(),'Year Built')]/following-sibling::div[1]
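Most of the XPaths above use the same following-sibling pattern: anchor on a label's text, then step to the value element right after it. A minimal sketch (assuming lxml is installed, with simplified stand-in markup for a detail page):

```python
from lxml import html

snippet = """
<div>
  <div>Year Built</div>
  <div>1998</div>
  <div>Price per sqft</div>
  <div>$250</div>
</div>
"""
tree = html.fromstring(snippet)

# Anchor on the label text, then take the first div sibling after it
year_built = tree.xpath(
    "//div[contains(text(),'Year Built')]/following-sibling::div[1]/text()"
)[0]
price_per_sqft = tree.xpath(
    "//div[contains(text(),'Price per sqft')]/following-sibling::div[1]/text()"
)[0]
print(year_built, price_per_sqft)
```

Because the XPath keys on the label rather than on position, each value lands in the right column even when some listings omit a field.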
4. Set up wait time - slow down the scraping
As the website applies anti-scraping techniques, we need to set a wait time to slow down the scraping speed and avoid being blocked.
Click on the Extract Data
Go to Options
Tick Wait before action and set it as 7s-10s
Click Apply to save
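Conceptually, the Wait before action option pauses a random 7-10 seconds before each extraction so the requests look less uniform. A minimal stdlib sketch of the same idea (the function name is illustrative, not an Octoparse API):

```python
import random
import time

def wait_before_action(min_s=7, max_s=10, sleep=time.sleep):
    """Sleep a random duration between min_s and max_s seconds."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay

# Example: inject a no-op sleep so we can inspect the delay without waiting
waited = wait_before_action(sleep=lambda s: None)
print(7 <= waited <= 10)  # True
```

A randomized delay in a range is generally harder for rate-limit detection to fingerprint than a fixed interval.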
5. Start extraction - run the task and get data
Click Save to save the task first
Click Run on the upper left side
Select Run task on your device to run the task on your computer, or select Run task in the Cloud to run the task in the Cloud (for premium users only)
Here is the sample output: