Scrape Data from a List of URLs by Creating a Simple Scraper

Thursday, August 10, 2017 10:13 AM

Web scraping can be done by writing a web crawler in Python. Before coding a Python-based crawler, you need to inspect the page source and understand the structure of the target website, and of course you need to learn Python itself. This is much easier if you already know how to code, but for a tech novice it is very difficult to learn everything from scratch. That's why we created our app Octoparse: to help people who know little to nothing about coding easily scrape web data.
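To give a sense of what hand-coding involves, here is a minimal sketch of the kind of Python you would write for even a single data point. It uses only the standard library; the sample HTML and the choice of the page title as the target are illustrative assumptions, not part of any real task.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside a page's <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def scrape_title(html):
    """Return the <title> text of an HTML document."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

# A real crawler would first fetch the page, e.g. with
# urllib.request.urlopen(url).read().decode(), then parse it:
print(scrape_title("<html><head><title>Example</title></head></html>"))
# prints "Example"
```

And this still leaves out fetching, error handling, pagination, and scheduling, which is exactly the work a visual tool takes off your hands.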

In this tutorial we will learn how to create the simplest and easiest web scraper to scrape a list of URLs, without any coding at all. This method is best suited to beginners. (We will assume Octoparse is already installed on your computer. If it isn't, download it here.)

 

This tutorial will walk you through these steps:

    1. Create a “Loop Item” in the workflow

    2. Add a list of URLs to the “Loop Item”

    3. Click to extract data points from one webpage

    4. Run the scraper

    5. Export the extracted data

1. Create a “Loop Item” in the workflow

After setting up basic information for your task, drag a “Loop Item” and drop it into the workflow designer.


2. Add a list of URLs to the “Loop Item”

 

After creating a “Loop Item” in the workflow designer, add a list of URLs to the “Loop Item” to create a pattern for navigating each webpage.

      · Select the “List of URLs” loop mode under the advanced options of the “Loop Item” step

      · Copy and paste all the URLs into the “List of URLs” text box

      · Click “OK” and then save the configuration

Note:

     1) All the URLs should share a similar page layout

     2) Add no more than 20,000 URLs

     3) You will need to manually copy and paste the URLs into the “List of URLs” text box

     4) After you enter the URLs, a “Go To Webpage” action will be created automatically inside the “Loop Item”
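The workflow built so far behaves like a plain loop over the URL list. A rough Python equivalent might look like the sketch below. The URLs and both helper functions are placeholders standing in for the workflow's actions, not real Octoparse internals.

```python
# Rough code equivalent of the "Loop Item" workflow: visit each URL in turn.
# The URLs below are placeholders; a real task uses pages sharing one layout.
urls = [
    "https://example.com/item/1",
    "https://example.com/item/2",
    "https://example.com/item/3",
]

def go_to_webpage(url):
    # Stands in for the auto-created "Go To Webpage" action.
    # A real fetch would be: urllib.request.urlopen(url).read().decode()
    return f"<html><body>page for {url}</body></html>"

def extract_data(html):
    # Stands in for the "Extract Data" step; here it just measures the page.
    return len(html)

results = []
for url in urls:                 # the "Loop Item"
    html = go_to_webpage(url)    # the "Go To Webpage" action
    results.append({"url": url, "value": extract_data(html)})

print(len(results))  # prints 3: one record per URL in the list
```

Because every URL runs through the same extraction step, the results stay uniform, which is why the URLs need to share a similar layout.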


 3. Click to extract data points from one webpage

Once the webpage has completely loaded, click the data points on the page that you want to extract.

      · Click the data you need and select “Extract Data” (an “Extract Data” action will be created automatically)


4. Run the scraper

The scraper is now created. Run the task with either “Local Extraction” or “Cloud Extraction”.

In this tutorial we run the scraper with “Cloud Extraction”.

      ·  Click “Next”, then click “Cloud Extraction” to run the scraper on the cloud platform


Note:

     1) You can close the app, or even shut down your computer, while the scraper runs with “Cloud Extraction”.

         Just sit back and relax, then come back for the data. There is no need to worry about Internet interruptions or hardware limitations.

     2) You can also run the scraper with “Local Extraction” (on your local machine).


5. Export the extracted data

 

      · Click “View Data” to review the extracted data

      · Choose “Export Current Page” or “Export All” to export the extracted data

Note:

       1) Octoparse supports exporting data to Excel (2003), Excel (2007), or CSV, or delivering it directly to your database.

       2) You can also access the data through the Octoparse API. See more at: http://www.octoparse.com/tutorial/api


Now we've learned how to scrape data from a list of URLs by creating a simple scraper, without any coding at all! Easy, right? Try it for yourself.

 

The extracted demo data looks like the sample below:

(I have also attached the demo task and the demo data exported to Excel. Find them here.)


Now check out similar case studies:

     · URLs - Advanced Mode 

     · Scrape data from multiple web pages

     · Speed up Cloud Extraction (1)

     · Speed up Cloud Extraction (2)

 
