How to Scrape Web Pages with a Load More Button
Tuesday, September 6, 2022
You have probably run into this problem when web scraping: some sites have a Load More button that you need to click to paginate or load more content, and automating that click is not straightforward. This article shows how to handle the Load More button with a web scraping tool or with Python.
No-Coding Tool to Scrape Pages with Load More Button
If you don't code, we recommend Octoparse as the web scraping tool for solving the Load More button problem. It is a free tool for both Windows and Mac that is easy to use and requires no coding skills. You can scrape almost all kinds of websites with its auto-detection function and preset templates. For the Load More button, Octoparse lets you set up pagination or infinite scroll with a loop item. Follow the simple methods below to give it a try.
1. Scraping Load More Button with Pagination
If you're scraping a multipage site, you can set up pagination with the Load More button, which some sites label Next. Octoparse can detect the button automatically or let you set it manually. Read the detailed user guide, Dealing with pagination with a "Load More" button, or follow the simple steps below.
Step 1: Sign up for a free account and launch Octoparse. Copy and paste the target page link into the main panel; auto-detection starts by default.
Step 2: Octoparse sets the pagination once auto-detection finishes. Click "Load More" in the Tips Panel to check whether the button has been located correctly. If not, click Edit to choose the right button. To set it manually, select the "Load More" button on the web page and choose the Loop click single element option. You can also set a suitable AJAX timeout yourself.
Step 3: After checking all the data fields, run the workflow you just created. You will then get the scraped data, including the items behind the Load More button.
2. Infinite Scroll to Load More Data
On some pages, clicking the "Load More" button repeatedly loads more content on the same page. In this situation, you can set up the pagination with infinite scroll. It supports both automatic and manual setup, much like the Load More methods above.
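For readers who script this instead, the idea of scrolling until nothing new loads can be sketched in Python. This is only a minimal illustration, not how Octoparse works internally. It assumes a Selenium-style driver object that exposes execute_script; the function name scroll_until_stable, the pause value, and the max_rounds cap are hypothetical choices.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=100):
    """Scroll to the bottom until the page height stops growing.

    `driver` is assumed to behave like a Selenium WebDriver, i.e. it has
    an execute_script(script) method. Returns the number of scroll rounds.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom so the page can trigger its lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch and render new items
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            break  # nothing new was added, so assume everything is loaded
        last_height = new_height
    return rounds
```

The pause plays a role similar to a scroll interval: if it is too short, the height check may run before new items arrive and the loop may stop early.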
3. Real Example to Solve Load More Problem with XPath
A real-life example of this issue comes from one of our users, who couldn't scrape all the data items from a website with a Load More button. Below is the situation.
He wrote us an email and said:
“I want help regarding scraping a website with a "show more product" button.
Type Links to scrape: http://dir.indiamart.com/mumbai/industrial-machinery.html
Type of data: 08447563983, Machinery And Spares.
I want to scrape the complete page including the "load more product" button.
I have created primary steps I have attached images in the attachment.
But this only fetched 29 data from page, I want you to tell me how to add load more feature in this process.
Also, tell me more about configuring the extraction rule.
Waiting for your response.”
From the email content we can summarize two key points of his issue:
1. Load More button. (Tutorial: Scrape websites with Load More button)
We need to make sure all the items on the web pages are displayed after clicking the Load More button repeatedly.
2. Only 29 records were fetched.
We need to check the extraction while the task is running with Local Extraction and figure out what the problem is.
So, our response is as follows:
About the Load More button
First of all, we need to make sure that, in your rule, all the items on this web page are displayed by scrolling to the bottom of the page and clicking the Load More button repeatedly.
BTW, some sites keep loading more items as you scroll to the bottom, before the "Load more" button appears. In that case, we can set the scroll times and interval so that the extraction runs smoothly.
About the data extracted
Since only 29 data records were extracted, you need to find out why the extraction stops. I checked your task in Local Extraction and found that:
1. Some pop-up windows appear during the extraction. In this case, you need to click the close button in the built-in browser manually and restart the task.
2. If the extraction completes without any pop-up windows, you need to find out where the extraction stops.
Firstly, open the web page you want to scrape in Firefox. Let’s locate the 28th data item on the web page - we can see that it’s the item named "Mohnot Instruments" in Firefox. We will use the FirePath tool to find its XPath.
(Learn more about FirePath tool: Getting started with XPath)
Secondly, go back to Octoparse and check the Loop Item (Extract Data). In the screenshot below, an item named DIV is extracted. Clearly, something is wrong with the original XPath, and we need to edit it manually.
Let’s copy the original XPath and paste it into Firebug. You will find that the original XPath cannot extract the items starting from the 29th. In this case, we need to modify the XPath used to extract all the items from the web page.
Thirdly, get the XPath of the section of the 29th item on the web page.
Fourthly, the correct XPath should be .//*[contains(@id,'LST')]
After you modify the XPath and save it, you will find that more than 32 items are extracted in the loop.
Don't forget to keep an eye on the built-in browser during the extraction and make sure the workflow is working well.
Through this example, we know how to scrape data from a website with the Load More button and modify the XPath that extracts all the data items from the web page.
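To see what an expression like .//*[contains(@id,'LST')] selects, here is a small sketch in Python. The sample markup and its id values are made up for illustration and are not taken from the site in the example. Python's built-in ElementTree does not support the XPath contains() function, so the sketch emulates it with a substring check on the id attribute; a full XPath engine such as lxml should be able to evaluate the expression directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup: item containers share "LST" in their id.
SAMPLE = """
<div id="listing">
  <div id="LST1"><span>Item A</span></div>
  <div id="LST2"><span>Item B</span></div>
  <div id="banner">Ad</div>
  <div id="more_LST3"><span>Item C</span></div>
</div>
"""


def select_id_contains(root, needle):
    """Emulate .//*[contains(@id, needle)] for descendants of root."""
    # ".//*" walks every descendant element; the filter keeps those whose
    # id attribute contains the needle (case-sensitive, like XPath).
    return [el for el in root.iterfind(".//*") if needle in el.get("id", "")]


root = ET.fromstring(SAMPLE)
ids = [el.get("id") for el in select_id_contains(root, "LST")]
print(ids)  # ['LST1', 'LST2', 'more_LST3']
```

Because the match is on a fragment of the id rather than on a fixed position in the page, items appended later by the Load More button are selected as long as their ids follow the same pattern.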
Solve Web Scraping Load More Button with Python
"How to scrape the website if it has load more button to load more content on the page?"
Even if you know some coding, you may have the same question as the one above, which comes from Stack Overflow. You can find answers and discussions about it there. However, we still recommend trying Octoparse if you remain unsure.
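One common pattern in Python is to drive a real browser and keep clicking the button until it disappears or stops producing new items. The sketch below is a minimal illustration under several assumptions: the driver object behaves like a Selenium WebDriver (find_elements, plus elements with is_displayed and click), and the function name click_load_more, both CSS selectors, and the timeout values are hypothetical. It is one possible approach, not the answer given on Stack Overflow.

```python
import time

# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR,
# written out so the sketch does not need Selenium to be imported.
CSS_SELECTOR = "css selector"


def click_load_more(driver, button_css, item_css,
                    max_clicks=50, ajax_timeout=5.0, poll=0.25):
    """Click the Load More button until it is gone or adds nothing new.

    Returns the number of items found once the loop ends.
    """
    def count_items():
        return len(driver.find_elements(CSS_SELECTOR, item_css))

    clicks = 0
    while clicks < max_clicks:
        buttons = [b for b in driver.find_elements(CSS_SELECTOR, button_css)
                   if b.is_displayed()]
        if not buttons:
            break  # no visible button left: assume everything is loaded
        before = count_items()
        buttons[0].click()
        clicks += 1

        # Wait up to ajax_timeout seconds for the AJAX call to add items.
        deadline = time.monotonic() + ajax_timeout
        loaded = False
        while time.monotonic() < deadline:
            if count_items() > before:
                loaded = True
                break
            time.sleep(poll)
        if not loaded:
            break  # the click produced nothing new, so stop clicking
    return count_items()
```

With Selenium installed, the driver could be a browser instance such as webdriver.Chrome() after calling driver.get() on the target page. The selectors have to be read from the page itself, and max_clicks guards against a button that never goes away.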