
Scraping Data from Walmart.com

Saturday, December 31, 2016 1:43 AM


 

Walmart is a large retail corporation in the United States. In this tutorial, we are going to show you how to scrape product data from Walmart.com.

Alternatively, go to "Task Templates" on the main screen of the Octoparse scraping tool and start with the ready-to-use Walmart Template to save time. With a template, there is no need to configure the scraping task yourself. For further details, see: Task Templates

If you would like to know how to build the task from scratch, you may continue reading the following tutorial.

Suppose we want to scrape information about headphones. We can start from the home page (https://www.walmart.com/) to create our crawler, then use Octoparse to scrape data such as product title, price, product ID, and reviews from the product details page.

 

Here are the main steps in this tutorial: [Download demo task file here]

1. Open the target web page

2. Create a Pagination - to scrape from multiple pages

3. Scrape data from the product list 

4. Click into each product link to scrape data - to get data from product pages

5. Extract data from the detail page

6. Run extraction - run your task and get data

Note:

Walmart tasks cannot be run in the Cloud due to CAPTCHA issues. For now, you can only run them on your own device.


 

1. Open the target web page

a. Enter the URL (https://www.walmart.com/) on the Octoparse home page and click Start

b. Click the search box and then click Enter text on the Tips panel


c. Type in "Headphone" and confirm

d. Click on "Enter Text", click "Options" below it, select "Hit the Enter/Return key when finish entering", then click "Apply" to confirm

 

2. Create a Pagination - to scrape from multiple pages

a. Click on the "next page" button on the website (beside the page number buttons), select Loop click single element, and set the AJAX timeout to 10s

The auto-generated XPath for Pagination does not always work in this case, so we need to modify the XPath to make it scrape all the pages.

b. Click on Pagination

c. Input the XPath //a[@aria-label="Next Page"] in the Matching XPath box

d. Click Apply to confirm
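
To see what the XPath above selects, here is a minimal Python sketch. It assumes a made-up pagination snippet (the real Walmart markup may differ and can change over time) and uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs the relative form .//a[...] instead of //a[...]. It illustrates the expression only; it is not how Octoparse evaluates it internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup, invented for this illustration.
PAGER = """
<nav>
  <a aria-label="Previous Page" href="?page=1">Prev</a>
  <a href="?page=2">2</a>
  <a aria-label="Next Page" href="?page=3">Next</a>
</nav>
"""

root = ET.fromstring(PAGER.strip())

# Equivalent of //a[@aria-label="Next Page"]: the link is chosen by its
# aria-label attribute, not by its position among the page number buttons.
next_link = root.find(".//a[@aria-label='Next Page']")
next_href = next_link.get("href") if next_link is not None else None

print(next_href)
```

Matching on the aria-label keeps the pagination loop pointed at the "next page" button even when the number of page buttons around it changes, which is likely why it is more robust than a position-based XPath.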


 

3. Scrape data from the product list 

a. Select the first product (note: the whole product section)

b. Choose Select all sub-elements

c. Choose Select all

d. Choose Extract Data

Now, a Loop Item with Extract Data will be created in the workflow


e. Double click the field name to rename it, or click "..." to delete unwanted fields in the data preview table
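
Conceptually, the Loop Item visits each product section in turn, and Extract Data reads the sub-elements of the current section into one row. The Python sketch below mimics that idea on a made-up listing; the markup, class name, and field names are assumptions for illustration, not Walmart's actual page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical product listing, invented for this illustration.
LISTING = """
<div id="results">
  <div class="item"><a href="/ip/1001">Headphone A</a><span>$12.00</span></div>
  <div class="item"><a href="/ip/1002">Headphone B</a><span>$34.50</span></div>
</div>
"""

root = ET.fromstring(LISTING.strip())

rows = []
# "Loop Item": one iteration per product section.
for card in root.findall("./div[@class='item']"):
    link = card.find("a")
    price = card.find("span")
    # "Extract Data": one field per sub-element of the current section.
    rows.append({
        "title": link.text,
        "url": link.get("href"),
        "price": price.text,
    })

for row in rows:
    print(row)
```

Each dictionary corresponds to one row of the data preview table, and each key to a field that can be renamed or deleted.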

If all the data you want can be scraped from the listing page, you can skip to step 6 (Run extraction).

4. Click into each product link to scrape data - to get data from product pages

Some information like product descriptions can only be grabbed from the product detail page. We need to click on each product link to get the data.

a. Click on the first product link

b. Choose Click URL

Then a click item will be created in the workflow.
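
One way to picture the workflow built so far is as nested loops: the pagination loop on the outside, the Loop Item inside it, and the click plus extraction inside that. The sketch below is only a mental model written in Python; PAGES, DETAILS, and run_workflow are hypothetical names with invented data, and nothing here reflects Octoparse's internal implementation.

```python
# Hypothetical in-memory stand-in for a paginated listing.
PAGES = {
    "page1": {"products": ["/ip/1001", "/ip/1002"], "next": "page2"},
    "page2": {"products": ["/ip/1003"], "next": None},
}

# Hypothetical detail pages, keyed by product link.
DETAILS = {
    "/ip/1001": {"title": "Headphone A", "price": "$12.00"},
    "/ip/1002": {"title": "Headphone B", "price": "$34.50"},
    "/ip/1003": {"title": "Headphone C", "price": "$8.75"},
}


def run_workflow(start):
    """Walk every page, open every product link, and collect one row each."""
    rows = []
    page = start
    while page is not None:                    # Pagination (step 2)
        for url in PAGES[page]["products"]:    # Loop Item (step 3)
            detail = DETAILS[url]              # Click item (step 4)
            rows.append({"url": url, **detail})  # Extract Data (step 5)
        page = PAGES[page]["next"]             # follow the "next page" button
    return rows


for row in run_workflow("page1"):
    print(row)
```

The loop ends when a page has no "next" link, which mirrors the pagination loop stopping once the next page button can no longer be found.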

 

5. Extract data from the detail page

a. Select the data you want

b. Click Extract the text of the element or Extract the URL of the selected image

c. Double click the field name to rename it or click ... to delete fields

d. Tick the box "Wait before action", and set the wait time (e.g. 7s) for the Extract Data action

The auto-generated XPath of the data fields may stop working after the web page is updated, in which case we need to modify the XPath of the fields. Don't worry, we have prepared some useful XPaths for this website.

a. Switch Data Preview to Vertical View

b. Double click on the XPath to modify it

c. Replace the XPath with the ones below


Product name: //h1

Price: //span[@itemprop="price"]

Product details: //h2[text()='Product details']/../following-sibling::div[1]

Specifications: //h2[text()='Specifications']/../following-sibling::div[1]
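
To make the four expressions above more concrete, here is a small Python sketch run against a made-up detail page. The sample markup is an assumption; the real Walmart page is more complex and may change. Python's standard xml.etree.ElementTree does not support text() predicates or the following-sibling axis, so the hypothetical helper section_body emulates the last two XPaths by hand: find the h2 with the given text, step up to its parent, then take the first div that follows that parent.

```python
import xml.etree.ElementTree as ET

# Hypothetical product detail markup, invented for this illustration.
SAMPLE = """
<html><body>
  <h1>Example Wireless Headphones</h1>
  <span itemprop="price">$19.99</span>
  <section>
    <div><h2>Product details</h2></div>
    <div>Over-ear design with a padded headband.</div>
    <div>Unrelated block</div>
  </section>
  <section>
    <div><h2>Specifications</h2></div>
    <div>Color: Black</div>
  </section>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def section_body(heading):
    """Emulate //h2[text()=heading]/../following-sibling::div[1]."""
    for h2 in root.iter("h2"):
        if h2.text == heading:
            wrapper = parent_of[h2]               # the "/.." step
            siblings = list(parent_of[wrapper])
            for node in siblings[siblings.index(wrapper) + 1:]:
                if node.tag == "div":             # first following div only
                    return "".join(node.itertext()).strip()
    return None


name = root.find(".//h1").text                          # //h1
price = root.find(".//span[@itemprop='price']").text    # //span[@itemprop="price"]
details = section_body("Product details")
specs = section_body("Specifications")

print(name, price, details, specs, sep="\n")
```

Anchoring on the heading text rather than on auto-generated positions is what lets these expressions keep working when the surrounding layout shifts, as long as the headings themselves stay the same.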

 

6. Run extraction - run your task and get data

a. Click the save icon, then click the run icon at the upper left

b. Select Run task on your device to run the task on your computer

 

Here is the sample output (screenshot: walmart5).

 


 

Happy Data Hunting!

Author: The Octoparse Team
