Web Scraping - Modify XPath For "Load More" Button with Octoparse
Thursday, March 2, 2017 9:15 PM
Sometimes you run a scraping task on your PC, but the extraction appears to get stuck on a web page and Octoparse extracts no data from it. When that happens, run the task again on your PC with Local Extraction and watch what it does.
One common cause is a “Load More” button on the website you want to scrape. Normally, Octoparse clicks the “Load More” button repeatedly until all the content is displayed and the button disappears, and then it extracts the data.
But some websites keep the code that implements the “Load More” button in the page even after the button is no longer visible. Octoparse still finds that element, so it keeps clicking the “Load More” button endlessly, stays on the web page, and never extracts any data.
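The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `LoadMoreButton`, `loop_click`, and the `max_clicks` safety cap are hypothetical names invented for the illustration, and the code does not reflect how Octoparse is implemented internally.

```python
from dataclasses import dataclass


@dataclass
class LoadMoreButton:
    """Toy model of a button that stays in the markup but becomes hidden."""
    pages_left: int
    style: str = ""

    def click(self) -> None:
        # Each click loads one more page. After the last page the site
        # hides the button instead of removing it from the markup.
        if self.pages_left > 0:
            self.pages_left -= 1
        if self.pages_left == 0:
            self.style = "display: none;"


def loop_click(button, is_clickable, max_clicks=50):
    """Click while is_clickable(button) holds; return the number of clicks.

    max_clicks is a safety cap added for this demo so the loop terminates.
    """
    clicks = 0
    while is_clickable(button) and clicks < max_clicks:
        button.click()
        clicks += 1
    return clicks


def exists(button):
    # Only checks that the element is present in the markup.
    return button is not None


def exists_and_visible(button):
    # Also requires that the element is not hidden.
    return button is not None and button.style != "display: none;"


naive_clicks = loop_click(LoadMoreButton(pages_left=3), exists)
fixed_clicks = loop_click(LoadMoreButton(pages_left=3), exists_and_visible)
print(naive_clicks, fixed_clicks)  # prints "50 3"
```

With the presence-only check, the loop stops only because of the demo's cap. With the visibility check, it stops after the three clicks that were actually needed.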
This web scraping tutorial offers a possible solution to this situation, using Nature Republic (http://www.naturerepublic.com) as an example. On this site, the code that implements the “Load More” button still exists after all the content is shown and the button has disappeared from the web page.
The possible solution is to check the scraping task again and see whether the XPath expression used for continually clicking the “Load More” button is correct.
I assume that you have already created the scraping task and are familiar with the Octoparse layout and the Octoparse Workflow. Please read HERE to get a basic understanding of XPath.
When a website uses a “Load More” button (or a similar link) to load more content, you can create a Loop Box that keeps clicking the “Load More” button to show all the content before extracting data from the web page.
You can download the extraction task for this tutorial (the OTD file) HERE and start collecting data right away, or you can follow the steps below to build the scraping task yourself for practice.
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".
Step 2. Enter the target URL in the built-in browser. ➜ Click the "Go" icon to open the webpage.
Step 3. Click the "Load More" button at the bottom of the web page to reveal more data.
Click on the "Load More" button/link. ➜ Choose "Loop click in the element" to turn the page. ➜ Click "Save".
A Cycle Page Box is automatically created in the Workflow for continually clicking the “Load More” button.
The original XPath expression generated is //A[@id='btn_more_new_prod']
Step 4. Modify the XPath expression for the "Load More" Cycle Page Box.
Start Mozilla Firefox and open two new windows (Window 1 and Window 2), both navigated to the web page. (Click HERE to learn how to use Firebug and Firepath.)
In Window 2, keep clicking the "Load More" button until all the content is shown and the “Load More” button has disappeared.
Copy the XPath expression of the "Load More" button from Octoparse. ➜ Paste it into the XPath text box in both Window 1 and Window 2. You will see that the code that implements the “Load More” button is still present in the source in Window 2.
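Compare the two blocks of code and find the difference between them. Here, the hidden button in Window 2 has the style value "display: none;", while the visible button in Window 1 does not.
So we need to modify the original XPath expression so that it no longer matches the hidden button: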
//A[@id='btn_more_new_prod' and not(@style='display: none;')]
Copy the new XPath expression to Octoparse and click Save.
Navigate to the "Click to Paginate" action ➜ Tick the "AJAX Load" checkbox ➜ Set an AJAX timeout of 2 seconds (or longer) ➜ Click "Save".
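To see why the modified expression in Step 4 stops the loop, the sketch below applies both versions of the match to two small markup samples. The samples are hypothetical: only the id and the style value come from the expressions above, and the real page markup will differ. Python's standard `xml.etree.ElementTree` supports only a subset of XPath (no `not()` or `and`), so the added predicate is emulated in plain Python, and the tag is written in lowercase because ElementTree is case-sensitive.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the two browser windows. Only the id and the
# style value are taken from the tutorial's XPath expressions.
WINDOW_1 = '<div><a id="btn_more_new_prod" href="#">Load More</a></div>'
WINDOW_2 = ('<div><a id="btn_more_new_prod" href="#" '
            'style="display: none;">Load More</a></div>')


def match_original(markup):
    """Emulates //A[@id='btn_more_new_prod']."""
    root = ET.fromstring(markup)
    return root.findall(".//a[@id='btn_more_new_prod']")


def match_modified(markup):
    """Emulates the extra predicate not(@style='display: none;')."""
    return [a for a in match_original(markup)
            if a.get("style") != "display: none;"]


for name, markup in (("Window 1", WINDOW_1), ("Window 2", WINDOW_2)):
    print(name, len(match_original(markup)), len(match_modified(markup)))
# prints:
# Window 1 1 1
# Window 2 1 0
```

The original expression still matches the hidden button in Window 2, while the modified one matches nothing there, which gives the click loop nothing left to click. Note that the predicate compares the exact attribute string, so a page that writes the value differently (for example without the space) would need a correspondingly adjusted expression.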
Step 5. You can then go on to extract the data you want from these items.
After you've done the configuration, don't forget to go through and check the Workflow before you save and run the scraping task.
1. If some data fields have missing values in the output, find out why Octoparse could not extract them. See this article for the reasons values go missing when using Local Extraction.
The original XPath for some data fields may not select the elements correctly, which results in missing values for those fields. In this case, you can modify the XPath expressions for these data fields by following this tutorial on modifying XPath expressions in Octoparse.
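A quick way to see which data fields need their XPath reviewed is to count the empty cells per column in the exported data. The following is a minimal sketch that assumes the output has been exported as CSV; the column names and rows are made up for illustration and `count_missing` is a hypothetical helper, not part of Octoparse.

```python
import csv
import io
from collections import Counter

# Hypothetical export: column names and rows are made up for illustration.
EXPORT = """name,price,url
Aloe Gel,5.90,http://example.com/a
Hand Cream,,http://example.com/b
,,http://example.com/c
"""


def count_missing(csv_text):
    """Return a Counter with the number of empty cells per column."""
    missing = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field, value in row.items():
            # Treat None, empty strings, and whitespace-only cells as missing.
            if not (value or "").strip():
                missing[field] += 1
    return missing


missing = count_missing(EXPORT)
for field, count in missing.most_common():
    print(f"{field}: {count} missing")
```

Fields with a high count are the first candidates for a closer look at their XPath expressions.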
Knowing how to edit XPath expressions can help you solve many problems when scraping data from websites. The tutorials or FAQs below could help you pick up XPath quickly.
Author: The Octoparse Team