Amazon Scraping Case Study | Monitoring stock counts on Amazon
Friday, May 19, 2017 7:10 AM
Welcome to Octoparse web scraping case study!
In this series of case study tutorials for Amazon, we will learn how to deal with various difficult-to-handle situations when scraping data from Amazon.
In this tutorial, I will show you how to locate and capture stock numbers of products on Amazon. The stock numbers are not always readily available on the webpage.
So what can you do when you want this data? Here is how.
List of features covered in this case study:
Step 1. Set up basic information
Step 2. Build a loop for a URL list
In this case, we want to crawl data from several specific web pages. This is quite different from the usual pagination method, which loops by clicking the next-page button or scrolling down a few times. Luckily, Octoparse makes it easy to retrieve data from these pages: enter a list of target URLs, and Octoparse traverses the list and opens each web page in turn.
Now, you can see that the given list of URLs has been saved as the "Loop Item".
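For readers who prefer to think in code, a loop over a URL list behaves roughly like the Python sketch below. This is only an illustration of the idea, not how Octoparse is implemented; the example.com URLs and the `open_page` callback are hypothetical placeholders.

```python
from typing import Callable, Dict, List

# Hypothetical product URLs; replace them with the pages you want to monitor.
PRODUCT_URLS: List[str] = [
    "https://www.example.com/product/1",
    "https://www.example.com/product/2",
    "https://www.example.com/product/3",
]


def run_url_loop(urls: List[str], open_page: Callable[[str], str]) -> Dict[str, str]:
    """Open each URL in order and collect whatever open_page returns for it."""
    results: Dict[str, str] = {}
    for url in urls:  # one iteration per "Loop Item"
        results[url] = open_page(url)
    return results


def fake_open_page(url: str) -> str:
    """Stand-in for a real browser or HTTP call; returns a dummy page body."""
    return f"<html>content of {url}</html>"


if __name__ == "__main__":
    for url, body in run_url_loop(PRODUCT_URLS, fake_open_page).items():
        print(url, len(body))
```

Passing the page opener in as a function keeps the loop itself independent of how a page is actually fetched.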
Step 3. Select the data to be extracted and rename data fields
After the URL list is built, we will be directed to the first URL automatically. Next, we should extract the product information from the first web page.
Note: When directed to the first URL, the page may keep loading for a long time even though the content is already fully displayed. In that case, we can manually click the "×" (stop) button to cancel the unnecessary loading.
Then we will begin to extract data from these product information sections.
Note that the extraction action we will be setting up for the current page is going to apply to the rest of the list.
Step 4. Modify XPath to locate product information
In this case, the "ASIN" and "Rank" data fields are left empty for many of the product loop items after extraction. That means the data has not been located properly, so we need to modify the XPath of these two data fields to match all of the product loop items.
Now, you can see that the ASIN data is extracted correctly.
Then, we follow the same steps to modify the XPath of "Rank".
Now, you can see that the Rank data is extracted correctly.
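One common reason for empty fields like these is an XPath that depends on an element's position, which breaks when another page lays out its rows differently. The Python sketch below contrasts a position-based lookup with a label-based one. The two tables are hypothetical, simplified markup written for this example (not Amazon's real HTML), and the XPath expressions are illustrations rather than the ones used in the task.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified product-detail tables; the rows are ordered differently.
PAGE_A = """
<table>
  <tr><th>ASIN</th><td>B000TEST01</td></tr>
  <tr><th>Rank</th><td>#12 in Toys</td></tr>
</table>
"""
PAGE_B = """
<table>
  <tr><th>Weight</th><td>1.2 pounds</td></tr>
  <tr><th>ASIN</th><td>B000TEST02</td></tr>
  <tr><th>Rank</th><td>#340 in Toys</td></tr>
</table>
"""


def by_position(page: str, row_number: int):
    """Position-based lookup: the <td> of the n-th row (1-based)."""
    root = ET.fromstring(page.strip())
    cell = root.find(f"./tr[{row_number}]/td")
    return cell.text if cell is not None else None


def by_label(page: str, label: str):
    """Label-based lookup: the <td> of the row whose <th> text equals the label."""
    root = ET.fromstring(page.strip())
    cell = root.find(f".//tr[th='{label}']/td")
    return cell.text if cell is not None else None


if __name__ == "__main__":
    for page in (PAGE_A, PAGE_B):
        print(by_position(page, 1), "|", by_label(page, "ASIN"), "|", by_label(page, "Rank"))
```

On the second table the position-based lookup returns the weight instead of the ASIN, while the label-based lookup still finds the right cell on both pages.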
Step 5. Click to add the product into cart
After extracting the needed data, we will continue with our crawling task.
Now, a pop-up order-check window should appear. Since this content is loaded using AJAX, we need to set AJAX for this click action.
Step 6. Execute branch judgment based on situations
In this step, we need to add a branch judgment to decide whether the pop-up window from Step 5 has to be handled.
After setting up the judgment condition, we should tell Octoparse what to do next if the condition is satisfied.
Note that the generated "Click Item" action is initially placed outside the left branch of the judgment action, so we need to drag it into the left branch box as below.
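In code terms, the branch judgment is a simple conditional: handle the pop-up only when it is present, otherwise move straight on. The Python sketch below illustrates that logic; the element names and action labels are hypothetical strings chosen for the example, not Octoparse identifiers.

```python
from typing import List, Set


def plan_next_actions(visible_elements: Set[str]) -> List[str]:
    """Return the actions to run, depending on whether the pop-up is showing."""
    actions: List[str] = []
    if "popup" in visible_elements:
        # Left branch: the condition is satisfied, so click "Add" in the pop-up.
        actions.append("click Add in pop-up")
    # Right branch is empty: with no pop-up there is nothing extra to do.
    actions.append("click Cart")
    return actions


if __name__ == "__main__":
    print(plan_next_actions({"popup", "cart_button"}))
    print(plan_next_actions({"cart_button"}))
```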
Step 7. Set AJAX Loading and AJAX Timeout
Since the button "Add" is displayed using AJAX, we need to set AJAX for this action.
Note that we leave the right box in the "Branch Judgment" blank, which means that if there is no such pop-up, Octoparse will go directly to the next step.
Now, we need to go back to the "Branch Judgment" action and set AJAX for it as well, so that the execution condition of the left branch can be met.
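The general idea behind a timeout for AJAX content is to wait for something to appear, but only for a bounded amount of time. The Python sketch below shows one common way to express this as a polling loop. It is an illustration of the concept under that assumption, not a description of how Octoparse applies its AJAX Timeout internally; `wait_until` and its parameters are names made up for this example.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float, interval: float = 0.05) -> bool:
    """Poll condition() until it returns True or `timeout` seconds have passed.

    Returns True if the condition was met in time, otherwise False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    started = time.monotonic()
    # Pretend the AJAX content becomes available 0.2 seconds from now.
    appeared = wait_until(lambda: time.monotonic() - started > 0.2, timeout=2.0)
    print("content appeared:", appeared)
```

A timeout that is too short risks acting before the content has loaded, while one that is too long only slows the task down.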
Step 8. Go to "Cart" to identify the stock count
Next, we should set a waiting time long enough for the "Cart" button to become clickable before the "Click Cart" action executes and opens the cart.
Note that we need to uncheck the option "Open the link in new tab", which is selected automatically when we create a "Click Item" action. It opens the clicked item in a new tab, which is unnecessary here and costs more time.
Step 9. Specify the quantities of wanted products
Now, we need to specify how many pieces we want by selecting the quantity number.
Note that we still need to set AJAX Loading and AJAX Timeout for the following actions, since they only partially update the page using AJAX.
Now, we need to enter the exact quantity in the text box.
Now, you can see that the value "999" is saved into the Quantity text box.
Step 10. Update the quantity information
After specifying the quantity, we need to click the "Update" button to synchronize the cart and check the seller's stock.
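The reasoning behind entering a large number such as 999 can be sketched in a few lines of Python. The sketch assumes that the cart reduces a requested quantity to what the seller has available when it is updated; this behavior is an assumption made for illustration, and `SimulatedCartLine` is a hypothetical stand-in, not Amazon's or Octoparse's actual logic.

```python
from dataclasses import dataclass


@dataclass
class SimulatedCartLine:
    """A toy cart line whose real stock is hidden from the shopper."""
    available_stock: int
    quantity: int = 1

    def update(self, requested: int) -> int:
        # Assumed behavior: the cart never keeps more units than are in stock.
        self.quantity = min(requested, self.available_stock)
        return self.quantity


def probe_stock(line: SimulatedCartLine, probe_quantity: int = 999) -> int:
    """Request a large quantity and read back what the cart actually kept."""
    return line.update(probe_quantity)


if __name__ == "__main__":
    print(probe_stock(SimulatedCartLine(available_stock=37)))
    print(probe_stock(SimulatedCartLine(available_stock=5000)))
```

Under this assumption, a result below the probe value is the stock count, while a result equal to the probe value only tells us that at least that many units are available.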
Step 11. Select the data to be extracted and rename data fields
Step 12. Start running your task
Now that we are done configuring the task, it's time to run it and get the data we want.
There are two options: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud platform, so you can set it up, turn off your desktop or laptop, and the data will be extracted and saved to the cloud automatically. Features such as scheduled extraction, IP rotation, and API are also supported in the Cloud. Find out more about Octoparse Cloud here.
Step 13. Check the data and export
Now that you have learned how to crawl stock data from Amazon, get started with your own crawling task to extract any data you want.
Author: The Octoparse Team