How to Scrape an HTML Table into Excel in OctoparseTuesday, July 05, 2016 3:11 AM
Often we need to scrape structured tables from web pages, collect data from these tables and export the data extracted to an Excel Spreadsheet. In this tutorial, you will learn how to scrape an HTML table and grab the data from this table in Octoparse.
In this tutorial, we will use http://www.adtoi.in/md/delhi.html as an example. The GIF shows the HTML table in this website. Follow the instructions below to extract the data from the table into Excel Spreadsheet.
This video - Get Started with Octoparse in 2 Minutes will guide you through the extraction process.
Enter the URL of the website - http://www.adtoi.in/md/delhi.html into the address bar, and open it. Don't forget to save the URL.
There are two columns in the table with the header information in the first row. The first column is the list of contact persons and the second column contains basic company information. Here we will extract the table from the second row.
Click the cell in the second column of the second row. In the pop-up window, click the <TR> tag which defines a row in an HTML table. Then create a list of item.
Select “Create a list of items”. ➜ “Add current item to the list”. ➜ “Continue to edit the list”.
Click the cell in the second column of the third row. ➜ Click the <TR> tag to select the whole row.
Select “Add current item to the list”. ➜ “Finish Creating List”. ➜ Click “Loop" to process the list.
Extract the data. Here we will extract the contact person, and the basic company information of the second row.
Click the cell in the first column of the second row. In the pop-up window, select "Extract Text".
Click the cell in the second column of the second row. In the pop-up window, select "Extract Inner HTML, including the page source code, text with format and image".
Here you can see all the data you has extracted. Click "Save", and then click "Next" to complete your task.
In this step, you can choose to not to load image to speed up the extraction. Then click "Next" and choose "Local Extraction" to run the task on your own computer.
All the data extracted can be seen in the Data Extraction window. After all the data has been extracted, click the “Export Data” button at the bottom of the window and export data into Excel.
Open the Excel file. The company information - the company name, address, telephone No., etc. appear in the same column, so we need to split that information to show in several separate columns.
The following steps will help you split column values into several columns.
- Replace <br> with $
- Cancel all line breaks
- Press “Ctrl + H” button
- Press and hold “Alt”, and in the “Find What”, type “10” with small keyboard.
- Press “Replace All”
- Split cell contents into multiple cells with $ as division symbol.
- Replace all <strong>, </strong>, <b>, </br>, <font color="#000000"> with black space.