Amazon Scraping Case Study | Monitoring stock counts on Amazon
Friday, May 19, 2017 7:10 AM
Welcome to Octoparse web scraping case study!
In this series of case study tutorials for Amazon, we will learn how to deal with various difficult-to-handle situations when scraping data from Amazon.
In this tutorial, I will show you how to locate and capture the stock numbers of products on Amazon. Stock numbers are not always shown on the product page.
So what can you do when you want this data? This tutorial shows you how.
List of features covered in this case study:
- Build a URL Loop List
- Set AJAX and AJAX Timeout
- Modify XPath
- Set up Branch Judgment
- Run Local Extraction
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in the Advanced Mode
- Complete the basic information
Step 2. Build a loop for a URL list
In this case, we want to crawl data from a specific set of web pages, which is different from the usual pagination approach of clicking the "next page" button in a loop or scrolling down several times. Luckily, Octoparse makes this easy: enter the list of target URLs, and it will traverse the list and open each web page in turn.
- First, drag a "Loop Item" action into the Workflow Designer pane to create a loop for a given list of URLs
- Copy the list of URLs you want to crawl data from
- Paste this given list of URLs into the "List of URLs" text box
- Click "OK"
- Click "Save"
Now you can see that the given list of URLs has been saved as the "Loop Item".
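Conceptually, a URL loop opens each address in the list in turn and runs the same actions on every page. A minimal Python sketch of that idea follows. The function names `parse_url_list` and `run_loop` and the example.com addresses are placeholders chosen for illustration; they are not part of Octoparse.

```python
from typing import Callable, List


def parse_url_list(pasted: str) -> List[str]:
    """Turn a pasted block of text into a clean list of URLs, one per line."""
    urls: List[str] = []
    for line in pasted.splitlines():
        url = line.strip()
        # Skip blank lines and duplicates, keeping the original order.
        if url and url not in urls:
            urls.append(url)
    return urls


def run_loop(urls: List[str], visit: Callable[[str], None]) -> int:
    """Call visit(url) once per URL, in order, and return how many were visited."""
    for url in urls:
        visit(url)
    return len(urls)


if __name__ == "__main__":
    pasted = """
    https://example.com/product/1
    https://example.com/product/2
    https://example.com/product/1
    """
    run_loop(parse_url_list(pasted), lambda url: print("open", url))
```

Removing duplicates before the loop is a design choice made here, so that the same product is not visited twice.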
Step 3. Select the data to be extracted and rename data fields
After finishing building the URL list, we will be directed to the first URL automatically. Next, we should extract the product information from the first web page.
Note: When directed to the first URL, the page may keep loading for a long time even though the content is already fully displayed. In that case, we can manually click the "×" (stop) button to end the unnecessary loading.
Then we will begin to extract data from these product information sections.
Note that the extraction action we will be setting up for the current page is going to apply to the rest of the list.
- Click the product title in the first web page
- Select "Extract text"
- Follow the same steps to extract the other data fields
- Rename the field names if necessary
- Click "Save"
Step 4. Modify XPath to locate product information
In this case, the "ASIN" and "Rank" data fields are left empty for many of the product loop items after extraction. That means the data has not been located properly, so we need to modify the XPath of these two data fields to match all of the product loop items.
- Open the URL in Firefox and inspect the "ASIN" data field using Firepath (Click here to know more about Firepath)
- Locate and modify the XPath until it can locate the "ASIN" of all the other product items (Click here to know more about XPath)
- Go back to Octoparse and navigate to the "Extract Data" action
- Click the "ASIN" data field
- Click "Customize field"
- Select "Define ways to locate an item"
- Copy the modified XPath .//*[contains(text(), 'ASIN')]/following-sibling::td[1] from Firepath, and paste it in the "Matching XPath" text box
- Click "Save"
Now you can see that the ASIN data is extracted correctly.
Then, we follow the same steps to modify the XPath of "Rank".
- Inspect the "Rank" data field in Firepath
- Locate and modify the XPath until the XPath can locate the "Rank" data of all the other product items
- Go back to Octoparse and navigate to the "Extract Data" action
- Click the "Rank" data field
- Click "Customize field"
- Select "Define ways to locate an item"
- Copy the modified XPath .//*[contains(text(), 'Best Sellers Rank')]/following-sibling::td[1] from Firepath, and paste it in the "Matching XPath" text box
- Click "Save"
Now you can see that the Rank data is extracted correctly.
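Both XPaths in this step follow the same pattern: find an element whose text contains a label, then take the first td sibling that follows it. The sketch below imitates that pattern in Python. The standard library's ElementTree does not support contains() or the following-sibling axis, so the tree is walked by hand. The sample table and the name `value_after_label` are assumptions made for the example, and a real product page is far messier.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, simplified product-details table (not real Amazon markup).
SAMPLE = """
<table>
  <tr><th>ASIN</th><td>B000000000</td></tr>
  <tr><th>Best Sellers Rank</th><td>#12 in Example Category</td></tr>
</table>
"""


def value_after_label(root: ET.Element, label: str) -> Optional[str]:
    """Return the text of the first <td> sibling after an element whose text contains label."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if label in (child.text or ""):
                # Hand-written stand-in for following-sibling::td[1].
                for sibling in children[index + 1:]:
                    if sibling.tag == "td":
                        return "".join(sibling.itertext()).strip()
    return None


if __name__ == "__main__":
    table = ET.fromstring(SAMPLE.strip())
    print(value_after_label(table, "ASIN"))
    print(value_after_label(table, "Best Sellers Rank"))
```

Because the lookup depends on the label text rather than on the position of the row, it keeps working when the rows appear in a different order on different pages, which is presumably why the label-based XPath matches more product items than the auto-generated one.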
Step 5. Click to add the product into cart
After extracting the needed data, we continue with the crawling task.
- First, click the "Add to cart" button to add the first product to the cart
Now, a pop-up order-checking window may appear. Since this content is loaded using AJAX, we need to set AJAX for this click action.
- Go to the "Advanced Options"
- Rename the action caption as “Add Cart” if you want.
- Select "Load the page with AJAX"
- Set the AJAX Timeout to 7 seconds
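The idea behind an AJAX timeout can be pictured as a bounded wait: keep checking whether the new content has arrived, and give up once the time limit is reached. The following Python sketch shows that general pattern only. It assumes the timeout acts as an upper bound on the wait, and `wait_until` is a hypothetical helper, not an Octoparse function.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float, interval: float = 0.1) -> bool:
    """Poll condition() until it is true or timeout seconds have passed.

    Returns True if the condition became true in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    start = time.monotonic()
    # Stand-in for "the pop-up has been rendered": true after about 0.3 seconds.
    popup_ready = lambda: time.monotonic() - start > 0.3
    print(wait_until(popup_ready, timeout=7))
```

A timeout that is too short risks moving on before the content exists, while one that is too long only costs time when the content never appears.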
Step 6. Execute branch judgment based on situations
In this step, we need to make a branch judgment about whether we need to deal with the pop-up window from Step 5.
- Drag a "Branch Judgment" action into the Workflow Designer
- Click the left branch
- Select "When current page contains element"
- Copy the XPath //BUTTON[@id='siAddCoverage-announce'] of the "Add" button located in the pop-up checking order window
- Paste this XPath in the "Element Xpath" text box
- Click "Save"
After setting up the branch condition, we should tell Octoparse what to do next if the condition is satisfied.
- Click the "Add" button
- Select "Click an item"
Note that the generated "Click Item" action is initially placed outside the left branch, so we need to drag it into the left branch box.
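In programming terms, this branch is an if statement with an empty else: click "Add" only when the pop-up button exists on the page, then carry on. A rough Python sketch follows. The button id comes from the XPath above, while the page markup and the function `plan_actions` are invented for the example. The tag name is lowercase here because ElementTree matches tags case-sensitively, and the captions "Click Add" and "Click Cart" are the ones assigned in later steps.

```python
import xml.etree.ElementTree as ET
from typing import List

# Same condition as the tutorial's XPath, written for lowercase sample markup.
ADD_BUTTON = ".//button[@id='siAddCoverage-announce']"


def plan_actions(page: ET.Element) -> List[str]:
    """Return the action captions that would run after "Add Cart" for this page."""
    actions: List[str] = []
    if page.find(ADD_BUTTON) is not None:
        # Left branch: the pop-up is present, so click its "Add" button.
        actions.append("Click Add")
    # Right branch is empty: nothing extra happens when there is no pop-up.
    actions.append("Click Cart")
    return actions


if __name__ == "__main__":
    with_popup = ET.fromstring(
        "<body><div><button id='siAddCoverage-announce'>Add</button></div></body>"
    )
    without_popup = ET.fromstring("<body><div>No pop-up here</div></body>")
    print(plan_actions(with_popup))
    print(plan_actions(without_popup))
```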
Step 7. Set AJAX Loading and AJAX Timeout
Since the button "Add" is displayed using AJAX, we need to set AJAX for this action.
- Navigate to the "Click Item" action
- Select "Load the page with AJAX"
- Set the AJAX Timeout to 7 seconds
- Rename the action caption as “Click Add” if you want.
Note that we leave the right box of the "Branch Judgment" blank, which means that if there is no such pop-up, Octoparse will go directly to the next step.
Now, we need to go back to the "Branch Judgment" action and set a waiting time for it, so that the pop-up has had a chance to load before the condition of the left branch is checked.
- Navigate to the "Branch Judgment"
- Select the same waiting time of 7 seconds before execution
- For a more precise wait, you can paste the XPath of the "Add" button //BUTTON[@id='siAddCoverage-announce'] into the "Wait Until Specific Element Appears" text box
Step 8. Go to "Cart" to identify the stock count
- Click the "Cart" button
- Select "Click an item"
- Drag the "Click Item" action out of the left branch box and place it outside the "Branch Judgment" action
- Rename the action caption in “Advanced Option” as “Click Cart” if you want.
Next, we should set a waiting time long enough for the "Cart" button to be available before the "Click Cart" action runs.
- Set "Wait Before Execution" to 10 seconds
- Or you can specify the exact XPath of the "Cart" button .//*[@id='hlb-view-cart-announce']
Note that we need to cancel the option "Open the link in new tab", which is enabled automatically when a "Click Item" action is created. It opens the clicked item in a new tab, which is unnecessary here and costs more time.
- Go to "New Tab"
- Cancel "Open the link in new tab"
Step 9. Specify the quantities of wanted products
Now, we need to specify how many pieces we want by selecting the quantity number.
- Click "Quantity"
- Select "Click an item"
Note that we still need to set AJAX loading and an AJAX Timeout for the following actions, since they update only part of the page using AJAX.
- Go to the "Advanced Options"
- Select “Load the page with AJAX”
- Set an AJAX Timeout to 2 seconds
- Click "Save"
- Then, click "10+" in the drop-down Quantity list
- Go to the "Advanced Options"
- Cancel "Open the link in new tab"
- Select "Load the page with AJAX"
- Set an AJAX Timeout to 4 seconds
- Click "Save"
Now, we need to specify the exact quantity in the text box.
- Click the "Quantity" text box
- When prompted, select "Enter text value" and enter "999" in the "Enter text" text box
- Click "Save"
Now you can see that the value "999" is saved in the Quantity text box.
Step 10. Update the quantity information
After specifying the quantity, we need to synchronize the cart and check the sellers' stock by clicking the "Update" button.
- Click the "Update" button
- Select “Click an item”
- Go to the "Advanced Options"
- Cancel “Open the link in new tab”
- Select "Load the page with AJAX"
- Set an AJAX Timeout to 4 seconds
- Rename the action caption as “Click Update” if you want
- Click "Save"
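The waits configured in Steps 5 to 10 add up for every URL in the loop. As a rough upper bound, the sketch below sums them, assuming each timeout is used in full (pages often respond sooner) and ignoring page load and extraction time. The captions "Click Quantity" and "Click 10+" are illustrative names, since the tutorial does not name those two actions.

```python
from typing import List, Tuple

# (action caption, configured wait in seconds), values taken from the steps above.
WAITS: List[Tuple[str, int]] = [
    ("Add Cart", 7),
    ("Branch Judgment", 7),
    ("Click Add", 7),
    ("Click Cart", 10),
    ("Click Quantity", 2),
    ("Click 10+", 4),
    ("Click Update", 4),
]


def worst_case_seconds(popup_shown: bool) -> int:
    """Sum the configured waits, skipping "Click Add" when no pop-up appears."""
    return sum(
        seconds
        for caption, seconds in WAITS
        if popup_shown or caption != "Click Add"
    )


if __name__ == "__main__":
    print(worst_case_seconds(popup_shown=True))
    print(worst_case_seconds(popup_shown=False))
```

Under these assumptions the bound is 41 seconds per URL when the pop-up appears and 34 seconds when it does not, which is worth keeping in mind when the URL list is long.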
Step 11. Select the data to be extracted and rename data fields
- Click the data field of stock information
- Select "Extract text"
- Rename the field name as "Stock"
- Click "Save"
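The "Stock" field is extracted as text, so a small post-processing step may be needed to get a number out of it. The sketch below assumes the extracted text contains the available quantity as its first number. The example sentence in the comment is hypothetical, since the tutorial does not quote the actual message, and `parse_stock` is a helper name made up for this example.

```python
import re
from typing import Optional


def parse_stock(message: str) -> Optional[int]:
    """Return the first integer found in message, or None if there is none."""
    # Hypothetical input: "This seller has only 23 of these available."
    match = re.search(r"\d[\d,]*", message)
    if match is None:
        return None
    # Drop thousands separators such as "1,250" before converting.
    return int(match.group().replace(",", ""))


if __name__ == "__main__":
    print(parse_stock("This seller has only 23 of these available."))
```

Returning None instead of 0 when no number is found keeps "no message" distinct from "zero in stock".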
Step 12. Start running your task
Now that we are done configuring the task, it's time to run it and get the data we want.
- Click "Next"
- Click "Next"
- Select "Local Extraction"
- Click "OK" to start
There are two extraction modes: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud platform, which means you can set it up, turn off your desktop or laptop, and the data will be extracted and saved to the cloud automatically. Features such as scheduled extraction, IP rotation, and API are also supported in the Cloud. Find out more about Octoparse Cloud here.
Step 13. Check the data and export
- Check the data extracted
- Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer
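If you later want to reshape the exported records yourself, the data fields from this tutorial map naturally onto CSV columns. The sketch below writes such rows with Python's csv module. The column order, the "Title" field name, and the sample row are assumptions for illustration, not the layout of an actual Octoparse export.

```python
import csv
import io
from typing import Dict, List

# Assumed column order; "ASIN", "Rank" and "Stock" are the field names used above.
FIELDS = ["Title", "ASIN", "Rank", "Stock"]


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize rows to CSV text with a fixed header order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{"Title": "Example, item", "ASIN": "B000000000", "Rank": "#12", "Stock": "23"}]
    print(to_csv(sample), end="")
```

Using the csv module rather than joining strings by hand means product titles that contain commas are quoted correctly.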
Done!
Now you have learned how to crawl stock data from Amazon. Get started with your own crawling task to extract any data you want.
To learn more about how to crawl data from high profile websites:
Web Scraping Case Study | Scraping Data from Yelp
Scrape Article Information from Google Scholar
How to Extract Information from Yellow Page Websites
Or learn more about what you can do with Octoparse:
How to Scrape Websites With Infinite Scroll? (Quora, Facebook,Twitter)
Octoparse Cloud Service - Splitting Tasks to Speed Up Cloud Extraction
Web Scraping - Modify X Path For "Load More" Button with Octoparse
Extracting Stock Prices using Regular expression (Example: Finance.Yahoo.com)
Author: The Octoparse Team