Web Scraping - Modify XPath for "Load More" Button with Octoparse
Thursday, March 02, 2017 9:15 PM
Sometimes you run a scraping task on your PC, but the extraction seems to be stuck on a web page and Octoparse does not extract any data from it. In that case, run the task on your PC again with Local Extraction and observe what happens during the extraction.
One common cause is a “Load More” button on the website you want to scrape. Octoparse normally extracts data after all the content has been displayed by clicking the “Load More” button repeatedly, at which point the button usually disappears.
But some websites keep the code that implements the “Load More” button in the page source, so Octoparse keeps clicking it even though the button is no longer shown on the web page. As a result, Octoparse clicks the “Load More” button endlessly, the extraction stays on the web page, and no data is extracted.
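The endless clicking described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `FakePage`, `click_until`, and the `max_clicks` safety cap are hypothetical names made up for illustration, and none of this is how Octoparse itself is implemented. It contrasts a loop that checks whether the button exists in the source with one that checks whether the button is still visible.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for a page with a "Load More" button."""
    total_batches: int   # number of clicks needed to reveal all content
    loaded: int = 0      # batches revealed so far
    clicks: int = 0      # clicks performed so far

    def button_in_source(self) -> bool:
        # The button's markup is never removed from the page source.
        return True

    def button_visible(self) -> bool:
        # The button is hidden once everything has been loaded.
        return self.loaded < self.total_batches

    def click_load_more(self) -> None:
        self.clicks += 1
        if self.loaded < self.total_batches:
            self.loaded += 1


def click_until(page, should_continue, max_clicks=50):
    """Click while should_continue(page) is true; give up after max_clicks."""
    while should_continue(page) and page.clicks < max_clicks:
        page.click_load_more()
    return page.clicks


# Checking only for presence in the source never stops by itself.
naive = click_until(FakePage(total_batches=3), FakePage.button_in_source)
# Checking visibility stops as soon as all content is shown.
fixed = click_until(FakePage(total_batches=3), FakePage.button_visible)
print(naive, fixed)  # 50 3
```

Without the `max_clicks` cap, the first loop would run forever, which is the stuck extraction described in this tutorial.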
This web scraping tutorial offers a possible solution to this situation, using Nature Republic (http://www.naturerepublic.com) as an example: the code that implements the “Load More” button still exists even after all the content is shown and the button has disappeared from the web page.
The possible solution is to check the scraping task again and see whether the XPath expression used for continually clicking the “Load More” button is correct.
I assume that you have already created the scraping task and are familiar with the Octoparse layout and the Octoparse Workflow. Please read HERE to get a basic understanding of XPath.
When a website uses a “Load More” button (or a similar link) to load more content, you can create a Loop Box that clicks the “Load More” button repeatedly to show all the content before extracting data from the web page.
The website URL we will use is http://www.naturerepublic.com/shop/category/2.
You can directly download the task (the .otd file) to begin collecting the data, or you can follow the steps below to build the scraping task yourself for practice. (Download the extraction task of this tutorial HERE in case you need it.)
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".
Step 2. Enter the target URL in the built-in browser. ➜ Click the "Go" icon to open the webpage.
(URL of the example: http://www.naturerepublic.com/shop/category/2)
Step 3. Click the "Load More" button at the bottom of the web page to reveal more data.
Click on the "Load More" button/link. ➜ Choose "Loop click in the element" to turn the page. ➜ Click "Save".
A Cycle Page Box is automatically created in the Workflow for continually clicking the “Load More” button.
The original XPath expression generated is //A[@id='btn_more_new_prod']
Step 4. Modify the XPath expression for the "Load More" Cycle Page Box.
In Window 2, keep clicking the "Load More" button until all the content is shown and the “Load More” button has disappeared.
Copy the XPath expression of the "Load More" button from Octoparse. ➜ Paste it into the XPath text box in both Window 1 and Window 2. You will see that the code that implements the “Load More” button is still retained in the source in Window 2.
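Compare the two blocks of code and find the difference between them. In this example, the button element in Window 2 carries the attribute style='display: none;' once the button is hidden, while the element in Window 1 does not.
So we need to modify the original XPath expression so that it no longer matches the hidden button: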
//A[@id='btn_more_new_prod' and not(@style='display: none;')]
Copy the new XPath expression to Octoparse and click Save.
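To see why the extra condition matters, here is a minimal Python sketch that applies both tests to two small snippets. The markup in `WINDOW_1` and `WINDOW_2` is hypothetical (the real page source will differ), and the function names are made up for illustration. Python's built-in ElementTree does not support `not()`, so the second condition is applied in plain Python rather than in the XPath string itself, and the tag test uses `*` instead of `A` to sidestep tag-name case.

```python
import xml.etree.ElementTree as ET

BUTTON_ID = "btn_more_new_prod"

# Hypothetical markup for the two states of the button.
WINDOW_1 = '<div><a id="btn_more_new_prod" href="#">Load More</a></div>'
WINDOW_2 = ('<div><a id="btn_more_new_prod" href="#" '
            'style="display: none;">Load More</a></div>')


def original_match(source: str) -> int:
    """Count elements matched by the id-only test of the original expression."""
    root = ET.fromstring(source)
    return len(root.findall(f".//*[@id='{BUTTON_ID}']"))


def modified_match(source: str) -> int:
    """Count elements that also satisfy not(@style='display: none;')."""
    root = ET.fromstring(source)
    candidates = root.findall(f".//*[@id='{BUTTON_ID}']")
    # A missing style attribute passes the test, just as not(...) would.
    return len([el for el in candidates if el.get("style") != "display: none;"])


results = {
    "original": (original_match(WINDOW_1), original_match(WINDOW_2)),
    "modified": (modified_match(WINDOW_1), modified_match(WINDOW_2)),
}
print(results)  # {'original': (1, 1), 'modified': (1, 0)}
```

The original test still finds the button after it is hidden, while the modified test finds nothing, so the click loop has nothing left to click. Note that the comparison is an exact string match: a site that writes the style as `display:none` without the space would need a correspondingly adjusted expression.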
Navigate to "Click to Paginate" action ➜ Tick "AJAX Load" checkbox ➜ Set an AJAX timeout of 2 seconds (or longer) ➜ Click "Save".
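The reason a timeout is needed after each click is that the new content arrives asynchronously, a moment after the click. The sketch below shows the general idea of waiting up to a time limit for more items to appear. It is an assumption-laden illustration, not a description of how Octoparse handles its AJAX timeout internally: `wait_for_more_items`, `count_items`, and `FakeClock` are hypothetical names, and the fake clock exists only so the example runs instantly.

```python
import time
from typing import Callable


def wait_for_more_items(
    count_items: Callable[[], int],
    previous_count: int,
    timeout: float = 2.0,
    poll_interval: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the item count grows before the timeout expires."""
    deadline = clock() + timeout
    while clock() < deadline:
        if count_items() > previous_count:
            return True
        sleep(poll_interval)
    # One last check once the deadline has passed.
    return count_items() > previous_count


class FakeClock:
    """Test clock so the demo does not really sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Pretend 10 items are shown and 10 more appear 0.5 seconds after the click.
fake = FakeClock()
loaded = wait_for_more_items(
    count_items=lambda: 20 if fake.now >= 0.5 else 10,
    previous_count=10,
    timeout=2.0,
    clock=fake.time,
    sleep=fake.sleep,
)

# Pretend nothing new ever appears: the wait gives up after the timeout.
fake2 = FakeClock()
stalled = wait_for_more_items(
    lambda: 10, 10, timeout=2.0, clock=fake2.time, sleep=fake2.sleep
)
print(loaded, stalled)  # True False
```

If the timeout is too short for a slow site, the content may not have arrived yet when the next action runs, which is why a value of 2 seconds or longer is suggested above.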
Step 5. Extract the data you want from these items.
After you've done the configuration, don't forget to go through and check the Workflow before you save and run the scraping task.
1. If some data fields have missing values in the output, you can work out why Octoparse could not extract the values for those fields. Click this article to find out the reasons for missing values when using Local Extraction.
The original XPath for some data fields may not select the elements correctly, which results in missing values for those fields. In this case, you can modify the XPath expressions for these data fields. You can follow this tutorial to modify XPath expressions in Octoparse.
Knowing how to edit XPath expressions can help you solve many problems when scraping data from websites. The tutorials or FAQs below could help you pick up XPath quickly.
Author: The Octoparse Team