In this article, I will show you how a web crawler extracts data from a list or table on a web page.
Step 1
Download and install the free edition of Octoparse. Register a new account and log in.
Step 2
On the start screen, the navigation panel sits on the left-hand side of the main interface and lists all the folders. Here you can quickly start a task, manage all your tasks, and check each task's status. The operation panel offers two modes ("Wizard Mode" and "Advanced Mode") and four types of web pages for each mode ("Single Page", "List or Table", "List & Detail", "List of URL").
Now, go to "Advanced Mode" > "List or Table" > "Start".
Step 3
Enter a task name, and follow the prompt to click "Continue" > "Next".
Step 4
Enter the target URL, then follow the prompt to click "Continue" and the "Go" icon to open it in the browser.
Step 5
Click "Next" > "loop click next page". This creates a loop action that processes all the web pages, and the pagination action is added to the configuration rule.
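To make the idea of a pagination loop concrete, here is a minimal Python sketch of the same pattern: keep following the "next page" link until there is none. This is only an illustration, not how Octoparse works internally. The `FAKE_SITE` dictionary, the URLs, and the `fetch` and `iterate_pages` functions are hypothetical names invented for this example, and the "site" lives in memory so the code runs without a network connection.

```python
# Hypothetical in-memory "site": each URL maps to
# (page content, URL of the next page or None on the last page).
FAKE_SITE = {
    "/videos?page=1": ("page 1 content", "/videos?page=2"),
    "/videos?page=2": ("page 2 content", "/videos?page=3"),
    "/videos?page=3": ("page 3 content", None),
}


def fetch(url):
    """Stand-in for a real HTTP request plus locating the 'Next' button."""
    return FAKE_SITE[url]


def iterate_pages(start_url, fetch_page=fetch, max_pages=100):
    """Yield each page's content, following the 'next' link until it runs out."""
    url = start_url
    visited = 0
    # max_pages is a safety limit so a broken 'next' link cannot loop forever.
    while url is not None and visited < max_pages:
        content, next_url = fetch_page(url)
        yield content
        visited += 1
        url = next_url


pages = list(iterate_pages("/videos?page=1"))
print(pages)
```

The loop stops on its own when a page has no next link, which is the same behavior we expect from the "loop click next page" action.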
Step 6
In this step, click the first highlighted section.
Next, build a list of sections that share a similar layout. Click "Create a list of items" and then "Add current item to the list". The first highlighted section is now in the list. Click "Continue to edit the list".
Then click the second highlighted section.
Click "Add current item to the list" again. Octoparse now selects all the sections with a similar layout.
Click "Finish Creating List".
Finally, click "loop" to process the list and extract the elements in each section.
Step 7
In this step, we will extract the music video title and the number of views from the first section. Click each of these two elements and choose to extract its text.
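For readers who want to see what "a list of similarly laid-out sections, each with two text fields" looks like in code, here is a rough Python sketch using only the standard library. It assumes a made-up page structure: the `SAMPLE_HTML` markup, the `item`, `title`, and `views` class names, and the `ItemParser` class are all hypothetical and do not come from any real site or from Octoparse itself.

```python
from html.parser import HTMLParser

# Hypothetical markup: every section shares the same layout.
SAMPLE_HTML = """
<ul>
  <li class="item"><span class="title">Song A</span><span class="views">1,200 views</span></li>
  <li class="item"><span class="title">Song B</span><span class="views">980 views</span></li>
</ul>
"""


class ItemParser(HTMLParser):
    """Collect one dict per <li class="item">, with 'title' and 'views' text."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # dict for the section being read, if any
        self._field = None    # name of the field whose text is being captured

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "item":
            self._current = {}
        elif tag == "span" and self._current is not None and css_class in ("title", "views"):
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            self.items.append(self._current)
            self._current = None


def extract_items(html):
    parser = ItemParser()
    parser.feed(html)
    return parser.items


items = extract_items(SAMPLE_HTML)
print(items)
```

Each dictionary in `items` plays the role of one row in the extraction table: one entry per section, one key per field.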
Step 8
We can define the fields in the table on the right-hand side of the interface.
Before executing the extraction rules, drag the "Loop Item" into "Cycle Pages" in the Workflow Designer so that the task grabs the elements of the sections on multiple pages.
Then click "Next" > "Next" to proceed.
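Placing the "Loop Item" inside "Cycle Pages" amounts to nesting one loop inside another: the outer loop walks through the pages, and the inner loop walks through the sections on each page. The short Python sketch below illustrates that nesting under simplified assumptions. The `PAGES` data and the `scrape_all` function are hypothetical, and each page is represented as a list of already-extracted items.

```python
# Hypothetical data: each inner list holds the items found on one page.
PAGES = [
    [{"title": "Song A", "views": "1,200"}, {"title": "Song B", "views": "980"}],
    [{"title": "Song C", "views": "450"}],
]


def scrape_all(pages):
    """Return one row per item, across every page."""
    rows = []
    for page_number, page_items in enumerate(pages, start=1):  # outer loop: pages
        for item in page_items:                                # inner loop: items
            rows.append({"page": page_number, **item})
    return rows


rows = scrape_all(PAGES)
for row in rows:
    print(row)
```

If the inner loop sat outside the outer one, only the items of a single page would be processed, which is why the nesting order matters.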
Step 9
Click "Local Extraction" > "OK" to run the task on your computer. Octoparse will then extract the data automatically.
In the data extraction panel, the target web page and the pane of extracted data appear on the left-hand side. You can also select extraction options on the right-hand side of this panel to optimize the extraction process.
Click the "Export Data" option at the bottom of the data extraction panel and choose a format in which to save the file on your computer.
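As an illustration of what exporting extracted rows to a file format involves, here is a small Python sketch that turns rows into CSV text with the standard `csv` module. It is an assumption-laden example and not Octoparse's export code: the sample `rows`, the field names, and the `rows_to_csv` helper are hypothetical.

```python
import csv
import io

# Hypothetical extracted rows, one dict per section.
rows = [
    {"title": "Song A", "views": "1,200 views"},
    {"title": "Song B", "views": "980 views"},
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["title", "views"])
print(csv_text)
```

To save to disk instead, the same `csv.DictWriter` can write to a file opened with `open(path, "w", newline="", encoding="utf-8")`. Values that contain commas, such as "1,200 views", are quoted automatically by the writer.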
Now it's done!