Web Scraping - Scrape Web Pages with Load More Button
Wednesday, November 23, 2016
It’s easy for a newbie to feel intimidated by the sheer number of automated web scraping tools out there. Their features, the related information, and the limited free-trial versions available on the internet can be overwhelming.
To help you pick the one that best suits your web scraping needs, we’ve created a category dedicated to typical web scraping issues encountered by our Octoparse users.
To make things clearer while keeping them private, we will also include practical examples for each topic. All information related to the examples in this article is provided on an anonymous basis.
In this article, we will show you how to scrape data from a website with a Load More button.
Here is a real-life example from one of our users, who couldn't scrape all the data items from a website with a Load More button. Below is the situation.
He wrote us an email and said,
“I want a help regarding scraping a website with "show more product" button.
Type Links to scrape: http://dir.indiamart.com/mumbai/industrial-machinery.html
Type of data: 08447563983, Machinery And Spares.
I want to scrape complete page including "load more product" button.
I have created primary steps i have attached images in attachment.
But this only fetched 29 data from page, I want you to tell me how add load more feature in this process.
Also Tell me more about configuring the extraction rule.
Waiting for your response.”
From the email content we can summarize two key points of his issue:
1. Load More button. (Tutorial: Scrape websites with Load More button)
We need to make sure all the items on the web pages are displayed after clicking the Load More button repeatedly.
2. Only 29 data records were fetched.
We need to check the extraction while the task is running with Local Extraction and figure out what the problem is.
So my response is as follows.
About the Load More button
First of all, we need to make sure that, in your rule, all the items on this web page are displayed by scrolling to the bottom of the page and clicking the Load More button repeatedly. (Check out this tutorial: Scrape websites with Load More button)
BTW, sometimes the site keeps loading more items as you scroll down to the bottom, before the "Load more" button appears. In that case, we can set the scroll times and intervals so that the extraction runs smoothly.
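Outside Octoparse, the same "click until nothing new appears" idea can be sketched in a few lines of Python. This is only an illustration, not how Octoparse implements it: `load_all_items`, `FakePage`, and the parameters `wait_seconds` and `max_clicks` are hypothetical names, and the fake page (29 items at first, 10 more per click, 52 in total) is invented so the loop can run without a browser. With a real browser automation library, the two callables would wrap its own "count elements" and "click button" calls.

```python
import time
from typing import Callable


def load_all_items(
    count_items: Callable[[], int],
    click_load_more: Callable[[], bool],
    wait_seconds: float = 1.0,
    max_clicks: int = 100,
) -> int:
    """Click "Load More" until the button is gone or no new items appear."""
    total = count_items()
    for _ in range(max_clicks):
        if not click_load_more():   # button missing: everything is displayed
            break
        time.sleep(wait_seconds)    # give the page time to render the new items
        new_total = count_items()
        if new_total == total:      # the click added nothing: stop instead of looping
            break
        total = new_total
    return total


class FakePage:
    """Stand-in for a real page (hypothetical numbers, for illustration only)."""

    def __init__(self, initial: int = 29, batch: int = 10, available: int = 52):
        self.shown = initial
        self.batch = batch
        self.available = available

    def count(self) -> int:
        return self.shown

    def click_load_more(self) -> bool:
        if self.shown >= self.available:
            return False            # the button disappears once all items are shown
        self.shown = min(self.shown + self.batch, self.available)
        return True


page = FakePage()
total = load_all_items(page.count, page.click_load_more, wait_seconds=0)
print(total)
```

The `max_clicks` cap and the "no new items" check are there so that a button which never disappears cannot keep the loop running forever.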
About the data extracted
When only 29 data records are extracted, you need to find out why the extraction stops. I checked your task in Local Extraction and found that:
1. Some windows pop up during the extraction. In this case, you need to click the close button in the built-in browser manually and restart the task.
2. If the extraction completes without any pop-up windows, you need to find the point where the extraction stops.
Firstly, open the web page you want to scrape in Firefox. Let’s locate the 28th data item on the web page; we can see in Firefox that it’s the item named "Mohnot Instruments". We will use the FirePath tool to find its XPath.
(Learn more about FirePath tool: Getting started with XPath 1)
Secondly, go back to Octoparse and check the Loop Item (Extract Data). In the screenshot below, an item named DIV is extracted. Clearly something is wrong with the original XPath, and we need to edit it manually.
Let’s copy the original XPath and paste it in FireBug. You will find that the original XPath can’t match the items from the 29th onward. In this case, we need to modify the XPath that is used to extract all items from the web page.
(Don’t know about XPath? We can configure the rule for you. email@example.com)
Thirdly, get the XPath of the section that contains the 29th item on the web page.
Fourth, the correct XPath should be .//*[contains(@id,'LST')]
After you modify the XPath and save it, you will find that more than 32 items are extracted in the loop.
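To see why a broader XPath such as .//*[contains(@id,'LST')] picks up more items, here is a small Python sketch. The markup is hypothetical (the real page structure is not shown in this article): it assumes the first items sit in one container and the items added by Load More sit in another, which is one possible reason a narrow XPath stops early. Python's built-in ElementTree does not support contains(), so the predicate is emulated with a plain Python check.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the first listings sit in one container,
# the listings added by "Load More" sit in another.
PAGE = """
<div id="results">
  <div id="first">
    <div id="LST1">Item 1</div>
    <div id="LST2">Item 2</div>
  </div>
  <div id="more">
    <div id="LST3">Item 3</div>
    <div id="LST4">Item 4</div>
  </div>
</div>
"""

root = ET.fromstring(PAGE)

# A narrow path tied to one container misses everything loaded later.
narrow = root.findall("./div[@id='first']/div")

# Emulates .//*[contains(@id,'LST')]: any element whose id contains "LST".
# ElementTree has no contains(), so the test is written in Python instead.
broad = [el for el in root.iter() if "LST" in el.get("id", "")]

print(len(narrow), len(broad))
```

The narrow path only finds the items in the first container, while the id-based match finds every listing wherever it sits on the page.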
Don't forget to keep an eye on the built-in browser during the extraction, and make sure the workflow is working well."
Through this example, we have seen how to scrape data from a website with a Load More button and how to modify the XPath so that it extracts all the data items from the web page.
If you feel a little lost with the XPath of the Loop Item for extracting data in the rule, we offer a data collection service and an XPath modification service for you!
Author: The Octoparse Team