In this article, I will show you how a web crawler extracts data from a list or table on a web page.
Download and install the free edition of Octoparse. Register a new account and log in.
On the start screen, the navigation panel on the left-hand side of the main interface lists all the folders. Here you can quickly start a task, manage all your tasks, and check each task’s status. The operation panel offers two modes (“Wizard Mode” and “Advanced Mode”), and each mode supports four types of web pages (“Single Page”, “List or Table”, “List & Detail”, “List of URL”).
Now, go to “Advanced Mode” > “List or Table” > “Start”.
Enter a task name, and follow the prompt to click “Continue” > “Next”.
Enter the target URL, and follow the prompt to click “Continue” and the “Go” icon to open it in the browser.
Click “Next” > “loop click next page”. This creates a loop action that processes all the web pages, and the pagination action is added to the configuration rule.
In this step, we click the first highlighted section.
Here, we will create a list of sections with similar layouts. Click “Create a list of items” and then “Add current item to the list”. The first highlighted section is now added to the list. Then click “Continue to edit the list”.
Then we click the second highlighted section.
Then click “Add current item to the list” again. Now we get all the sections with similar layouts.
Then click “Finish Creating List”.
Click “loop” to process the list and extract the elements in each section.
In this step, we will extract the music video and its views from the first section. Click these two elements and extract their text.
We can define fields in the table on the right-hand side of the interface.
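For readers who want to see the same idea in code, here is a minimal Python sketch of what "loop over similar sections and extract two text fields" amounts to. It is not how Octoparse is implemented. The sample markup, the class names `item`, `title`, and `views`, and the helper `extract_rows` are all assumptions made up for illustration; a real page will use different tags and classes.

```python
from html.parser import HTMLParser

# Hypothetical markup: each repeated section is a <div class="item">
# holding a title and a view count. Real pages will differ.
SAMPLE_HTML = """
<div class="item"><span class="title">Song A</span><span class="views">1,200 views</span></div>
<div class="item"><span class="title">Song B</span><span class="views">950 views</span></div>
"""


class ItemParser(HTMLParser):
    """Collect one record per section, with the text of its two fields."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished records, one per section
        self._field = None  # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "item" in classes:
            # A new section starts: open an empty record for it.
            self.rows.append({"title": "", "views": ""})
        elif tag == "span" and self.rows:
            for name in ("title", "views"):
                if name in classes:
                    self._field = name

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        # Only keep text that sits inside one of the two fields.
        if self._field:
            self.rows[-1][self._field] += data.strip()


def extract_rows(html):
    """Hypothetical helper: parse the HTML and return the list of records."""
    parser = ItemParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row["title"], "|", row["views"])
```

The keys of each record play the same role as the field names defined in the table on the right-hand side of the interface.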
Before executing the extraction rules, we drag the “Loop Item” into the “Cycle Pages” in the Workflow Designer so that the elements of every section are extracted on each page, not just the first one.
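Placing the item loop inside the page loop corresponds to a nested loop in code. The sketch below shows that structure under simplifying assumptions: the "site" is a hypothetical in-memory dictionary named `PAGES`, and the function `crawl` and its `max_pages` guard are illustrative names, not part of Octoparse.

```python
# Hypothetical in-memory "site": each page lists its items and the key
# of the next page (None on the last page).
PAGES = {
    "page1": {"items": ["Song A", "Song B"], "next": "page2"},
    "page2": {"items": ["Song C"], "next": None},
}


def crawl(start, pages, max_pages=100):
    """Visit pages in order and collect every item on every page."""
    collected = []
    current = start
    visited = 0
    # Outer loop: one iteration per page (the pagination loop).
    while current is not None and visited < max_pages:
        page = pages[current]
        # Inner loop: one iteration per section on the current page.
        for item in page["items"]:
            collected.append({"page": current, "item": item})
        current = page["next"]  # equivalent of clicking "next page"
        visited += 1
    return collected


records = crawl("page1", PAGES)
for record in records:
    print(record["page"], record["item"])
```

If the item loop sat outside the page loop instead, only the sections of a single page would be collected, which is why the order of nesting matters.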
Then click “Next” > “Next” to proceed.
Click “Local Extraction” > “OK” to run the task on your computer. Octoparse will then extract the data automatically.
In the data extraction panel, we can see the target web pages and the extracted data pane on the left-hand side. You can also select extraction options on the right-hand side of this panel to optimize the extraction process.
Click the “Export Data” option at the bottom of the data extraction panel and choose a format in which to save the file on your computer.
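As a rough illustration of what exporting to one such format involves, the following Python sketch writes a few records as CSV text using the standard `csv` module. The sample records, the field names, and the helper `rows_to_csv` are assumptions for this example and are unrelated to Octoparse's own export feature.

```python
import csv
import io

# Hypothetical extracted records; the field names are illustrative.
rows = [
    {"title": "Song A", "views": "1,200 views"},
    {"title": "Song B", "views": "950 views"},
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dictionaries to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["title", "views"])
print(csv_text)
```

Values that contain a comma, such as the first view count here, are quoted automatically by the `csv` module so that the columns stay aligned. To save the result to disk, the same writer can be pointed at a file opened with `open(path, "w", newline="", encoding="utf-8")`.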
Now it’s done!