Scraping Feature Study | Extract Data from A List of URLs with Similar Web Content Layouts
Thursday, March 24, 2016 6:04 AM
In some cases, you may have a list of similarly structured URLs (like a batch of product URLs) on hand, and you want to extract the data from them directly. In this tutorial, we will introduce an easy and powerful way to extract data from multiple web pages by using a list of URLs.
When should you consider scraping by using a list of URLs?
Here are some cases where you can start the task with a list of URLs for extraction.
1. All the URLs are under the same domain, sharing the same webpage structure (Most Important).
- Example: I have a list of product URLs, and I want to start a task with a list of URLs directly to scrape updated pricing data regularly.
2. Some websites use infinite scrolling or a "Load more" button to load the content. If you need to collect data by clicking on each URL to scrape details on the deeper layer, then you'll need to split the task into two. One task loads the page and scrapes the URLs, and the other uses the list of extracted URLs to scrape the detailed info.
- Example: Zara's search result page uses infinite scrolling to keep loading new items. If the data you need is on the item page, then you need to set the number of scrolls and collect enough product URLs first for the next task.
3. The website uses AJAX (see "Deal with AJAX") to load new content, which means that after clicking into the first product page, the system fails to go back to the listing page automatically (and click into the second product page from there). In that case, you'll need to extract the detail page URLs first, and then scrape the data you want with the URL list (see the video tutorial).
4. Some websites tend to load pages quite slowly while paginating, which might affect the data scraping of our scheduled tasks, so it's better to loop through page URLs directly to avoid the issue.
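For readers who want to see the "two tasks" idea from cases 2 to 4 outside of Octoparse, here is a minimal Python sketch. It is not how Octoparse works internally. The "/product/" marker, the example.com addresses and the function names are assumptions made up for illustration. The first function collects detail-page URLs from a listing page that has already been loaded, and the second builds page URLs directly instead of clicking through pagination.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects <a href> values that contain a given marker string."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.urls:  # keep first-seen order, drop duplicates
                self.urls.append(absolute)


def collect_detail_urls(listing_html, base_url, marker="/product/"):
    """Task 1: pull detail-page URLs out of an already loaded listing page.

    The "/product/" marker is a hypothetical URL pattern.
    """
    collector = LinkCollector(base_url, marker)
    collector.feed(listing_html)
    return collector.urls


def page_urls(template, first, last):
    """Case 4: build page URLs directly instead of clicking "next page"."""
    return [template.format(page=n) for n in range(first, last + 1)]
```

The list returned by collect_detail_urls is what you would then paste in as the URL list for the second task.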
How do I know if the pages have the same structure?
If you are scraping news articles from any particular website, most likely the article pages will share the same page structure, like:
Another example is from Google maps. Every business page is like this:
To scrape using a list of URLs, we'll simply set up a loop over all the URLs we need to scrape, then add a data extraction action right after it to get the data we need. Octoparse will load the URLs one by one and scrape the data from each page.
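The same loop can be written in a few lines of Python. This is only a sketch of the idea, not Octoparse's implementation. The helper names are made up, and extracting the page title stands in for whatever fields you would actually select.

```python
import re
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one page and return its HTML as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Placeholder extraction step: return the <title> text, or None."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def scrape_url_list(urls, fetch_page=fetch, extract=extract_title):
    """Load the URLs one by one and run the same extraction on each page."""
    rows = []
    for url in urls:
        html = fetch_page(url)
        rows.append({"url": url, "data": extract(html)})
    return rows
```

Because every page goes through the same extract function, this only gives consistent results when the pages share one layout.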
With a "List of URLs" loop, Octoparse does not need extra steps like "Click to paginate" or "Click Item" to enter the item page. As a result, extraction is faster, especially for Cloud Extraction. Check how to speed up with URL list.
1. Can I use URLs that do not share the same page layout?
Unfortunately, only URLs that share the same page structure can be extracted using "List of URLs". To make sure data is extracted consistently and accurately, it is necessary to ensure that these pages share the same page layout.
To learn more about the "List of URLs" mode, you can check out the following article: Loop Item
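If you are unsure whether two pages share a layout, one rough check you can run yourself is to compare the sets of tag paths in their HTML. The sketch below is an assumption-laden heuristic, not an Octoparse feature. The function names are invented, and any cut-off you apply to the score (for example 0.8) is arbitrary.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Records every tag path (e.g. "html/body/div") seen in a page."""

    # Tags that never have a closing tag, so they must not stay on the stack.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths


def structure_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' tag-path sets (0.0 to 1.0)."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A score near 1.0 suggests the pages are built from the same template. A low score suggests they should go into separate tasks.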
2. Is there a limit to the number of URLs that I can add at a time?
Yes. We suggest adding no more than 10,000 URLs if you copy and paste the URLs directly into Octoparse. However, using the Batch URL input feature, you can input up to 1 million URLs.
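If your list is longer than the suggested 10,000 URLs for copy and paste, one simple workaround is to split it into batches before pasting. A minimal sketch, with a made-up function name:

```python
def split_into_batches(urls, batch_size=10_000):
    """Split a long URL list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```

For example, 25,000 URLs would come back as three batches of 10,000, 10,000 and 5,000.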
3. Can Octoparse automatically collect and add URLs?
Octoparse Advanced API enables modifying the list of URLs without accessing the App.
To extract data from a list of URLs, the extraction process can generally be broken down into 3 simple steps:
In Octoparse, there are two ways to create a "List of URLs" loop.
1. Start a new task with a list of URLs
1). Select "+New" and click "Advanced Mode" to create a new task
2). Paste the list of URLs in the textbox and click "Save URL"
After clicking "Save URL", the "Loop URLs" action (which loops through each URL of the list) is automatically created in the workflow. If you click "Loop URLs", you can see that the URLs you entered have been added to the "Loop Item".
2. Create a "List of URLs" loop in Workflow Designer
1). Add a "Loop Item" in the workflow
2). Go to "Loop mode", select "List of URLs", and paste in the list of URLs. Don't forget to click "Apply" to save the settings.
3). Add an "Open Page" under the "Loop Item", then tick "Load URLs in the loop" and "Apply" to confirm
If the scraping stops right after we start the extraction, we can try setting a longer Timeout for the "Open Page" step, so the system will wait longer for the webpage to be fully loaded.
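The effect of a longer timeout can be illustrated in Python. This is a sketch of the general idea only. The timeout values and function names are assumptions, and Octoparse's own Timeout setting is configured in the app, not in code.

```python
import socket
from urllib.error import URLError
from urllib.request import urlopen


def open_page(url, timeout):
    """Open one page, giving up after `timeout` seconds."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_growing_timeout(url, timeouts=(10, 30, 60), opener=open_page):
    """Retry a slow page with progressively longer timeouts (in seconds)."""
    if not timeouts:
        raise ValueError("at least one timeout is required")
    last_error = None
    for timeout in timeouts:
        try:
            return opener(url, timeout)
        except (socket.timeout, URLError) as error:
            last_error = error  # page too slow or unreachable; try a longer wait
    raise last_error
```

A page that fails at 10 seconds but loads within 30 is returned on the second attempt instead of stopping the run.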
3. Extract data from the page
After the URLs are saved, the first page will open automatically, and you can select the data on the page to extract (element text, URL, image, HTML, or attribute).
1. If Octoparse works too fast, a page may not have loaded completely before the data extraction step runs, which can lead to missing or incomplete data. To avoid this, we can set up a "Wait before execution".
Click on the "Options" settings for the "Extract Data" step and set a wait time before the action is executed (2-3 seconds will usually work).
2. If you want the exported data to line up with the original URL list you entered, you can add the current page URL as an extra field.
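Both tips can be sketched in Python for readers who post-process their exports. The field name "page_url" and the function names are assumptions for illustration. The sketch also assumes that rows may come back in a different order than the input list, or with some URLs missing.

```python
import time


def extract_after_wait(extract, page, wait_seconds=2.0, sleep=time.sleep):
    """Tip 1: pause for a fixed time, then run the extraction step."""
    sleep(wait_seconds)
    return extract(page)


def align_with_input(urls, rows, url_field="page_url"):
    """Tip 2: reorder exported rows to match the original URL list.

    URLs with no exported row get a placeholder row holding only the URL,
    so gaps are easy to spot.
    """
    by_url = {row[url_field]: row for row in rows}
    return [by_url.get(url, {url_field: url}) for url in urls]
```

With the page URL stored in each row, a missing or failed page shows up as a row with only the URL filled in.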
Once the steps above are set up and you run the task, you will find that after it finishes scraping one page, Octoparse goes to the next URL automatically.
Should you have any questions, feel free to leave a message.
Happy Data Hunting!
Author: The Octoparse Team