Web Scraping - Scrape Web Pages with Load More Button
Wednesday, July 01, 2020
Check out the latest version of this article with Octoparse 7.X : Dealing with Infinitive Scrolling/Load More
It’s easy for a newbie to feel intimidated by the sheer number of automated web scraping tools out there. Their feature lists, the related information, and the limited free-trial versions available on the internet can be overwhelming.
To help you pick the tool that best suits your web scraping needs, we’ve decided to create a category dedicated to typical web scraping issues encountered by our Octoparse users.
To make things even clearer, we will also include practical examples for each topic. To protect privacy, all user information in the examples provided in this article is anonymous.
In this article, we will show you how to scrape data from a website with a Load More button.
Here is a real-life example from one of our users, who couldn't scrape all the data items from a website with a Load More button. Below is the situation.
He wrote us an email and said,
“I want help regarding scraping a website with a "show more product" button.
Type Links to scrape: http://dir.indiamart.com/mumbai/industrial-machinery.html
Type of data: 08447563983, Machinery And Spares.
I want to scrape the complete page including the "load more product" button.
I have created primary steps I have attached images in the attachment.
But this only fetched 29 data from page, I want you to tell me how to add load more feature in this process.
Also, tell me more about configuring the extraction rule.
Waiting for your response.”
From the email content we can summarize two key points of his issue:
1. Load More button. (Tutorial: Scrape websites with Load More button)
We need to make sure all the items on the web pages are displayed after clicking the Load More button repeatedly.
2. Fetched only 29 data.
We need to check the extraction while the task is running with Local Extraction and figure out what the problem is.
So my response is as follows.
About the Load More button
First of all, we need to make sure that, in your rule, all the items on this web page are displayed by scrolling to the bottom of the page and clicking the Load More button repeatedly. (Check out this tutorial: Scrape websites with Load More button)
By the way, sometimes the site keeps loading more items as you scroll down to the bottom, before the "Load more" button appears. In that case, we can set the number of scrolls and the interval between them so that the extraction runs smoothly.
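Octoparse handles this through its workflow settings, but the underlying loop can be sketched in a few lines of Python for readers who want to see the logic spelled out. This is only an illustration under stated assumptions: load_all_items, its three callbacks, and the FakePage class are hypothetical names that are not part of Octoparse or any library, and FakePage merely stands in for a real browser page.

```python
import time
from typing import Callable


def load_all_items(
    count_items: Callable[[], int],
    click_load_more: Callable[[], bool],
    scroll_to_bottom: Callable[[], None],
    max_rounds: int = 50,
    wait_seconds: float = 1.0,
) -> int:
    """Scroll and click "Load More" until no new items appear.

    Returns the number of items visible when loading stops.
    """
    seen = count_items()
    for _ in range(max_rounds):
        scroll_to_bottom()        # scrolling alone may trigger loading
        click_load_more()         # returns False when the button is absent
        time.sleep(wait_seconds)  # interval: give the page time to render
        now = count_items()
        if now == seen:           # nothing new appeared, so stop
            break
        seen = now
    return seen


class FakePage:
    """Hypothetical stand-in for a browser page, used to exercise the loop."""

    def __init__(self, total: int, batch: int) -> None:
        self.total = total
        self.batch = batch
        self.visible = min(batch, total)
        self.clicks = 0

    def count_items(self) -> int:
        return self.visible

    def scroll_to_bottom(self) -> None:
        pass  # this fake page does not load anything on scroll

    def click_load_more(self) -> bool:
        if self.visible >= self.total:
            return False  # button no longer shown
        self.visible = min(self.visible + self.batch, self.total)
        self.clicks += 1
        return True


if __name__ == "__main__":
    page = FakePage(total=100, batch=29)
    shown = load_all_items(
        page.count_items, page.click_load_more, page.scroll_to_bottom,
        wait_seconds=0,
    )
    print(shown, page.clicks)
```

The max_rounds cap plays the same role as a fixed number of scrolls: it keeps the loop from running forever on a page that never stops loading.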
About the data extracted
When only 29 data records are extracted, you need to find out why the extraction stops. I checked your task in Local Extraction and found that:
1. Some pop-up windows appear during the extraction. In this case, you need to click the close button in the built-in browser manually and restart the task.
2. If the extraction is completed without any pop-up windows, you need to find out where the extraction stops.
Firstly, open the web page you want to scrape in Firefox. Let’s locate the 28th data item on the web page; we can see that it’s the item named "Mohnot Instruments". We will use the FirePath tool to find its XPath.
(Learn more about FirePath tool: Getting started with XPath 1)
Secondly, go back to Octoparse and check the Loop Item (Extract Data). In the screenshot below, an item named DIV is extracted. Clearly there is something wrong with the original XPath, and we need to edit it manually.
Let’s copy the original XPath and paste it into FireBug. You will find that the original XPath can’t match the items starting from the 29th. In this case, we need to modify the XPath that is used to extract all items from the web page.
(Don’t know about XPath? We can configure the rule for you. firstname.lastname@example.org)
Thirdly, get the XPath of the section of the 29th item on the web page.
Fourthly, the correct XPath should be: .//*[contains(@id,'LST')]
After you modify the XPath and save it, you will find that more than 32 items are extracted in the loop.
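To see why an XPath based on contains(@id,'LST') can match more items than a narrower one, here is a small Python sketch. The markup is invented for illustration, since the real page structure is not shown in this article, and the narrow path is only a hypothetical example of an XPath that stops early. Python's built-in ElementTree does not support the XPath contains() function, so the helper select_id_contains (a hypothetical name) expresses the same test in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: item ids share the substring "LST", but the items
# sit in two different containers (as if the second batch loaded later).
SAMPLE_HTML = """
<div id="page">
  <div id="first-batch">
    <div id="LST1"><span>Item 1</span></div>
    <div id="LST2"><span>Item 2</span></div>
  </div>
  <div id="loaded-later">
    <div id="LST3"><span>Item 3</span></div>
    <div id="LST4"><span>Item 4</span></div>
  </div>
  <div id="footer"><span>Contact</span></div>
</div>
""".strip()


def select_id_contains(root, needle):
    """Plain-Python equivalent of the XPath .//*[contains(@id, needle)].

    Returns every descendant of root whose id attribute contains needle.
    """
    return [
        el for el in root.iter()
        if el is not root and needle in el.get("id", "")
    ]


root = ET.fromstring(SAMPLE_HTML)

# A narrow path tied to one container only reaches the first batch.
narrow = root.findall("./div[@id='first-batch']/div")

# The broader id-based test reaches the items in both containers.
broad = select_id_contains(root, "LST")

print(len(narrow), [el.get("id") for el in broad])
```

In this made-up page the narrow path finds two items while the id-based test finds all four, which mirrors the kind of gap described in the steps above.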
Don't forget to keep an eye on the built-in browser during the extraction to make sure the workflow is running well.
Through this example, we have seen how to scrape data from a website with a Load More button and how to modify the XPath so that it extracts all the data items from the web page.
If you feel a little lost with the XPath of the Loop Item used for extracting data in the rule, we offer a data collection service and an XPath modification service for you!
Article in Spanish: Web Scraping - Scrape las Páginas Web con el Botón Cargar Más
You can also read web scraping articles on the official website.
Author: The Octoparse Team