If you are looking into web scraping as a way to turn web data into structured formats such as XLS, CSV, TXT, or XML, chances are you don't know much about Python, or simply don't want to deal with it. But it would still be nice to automate all that copying and pasting, which is hugely time-consuming and painful. Here's a quick web scraping recipe you should definitely try: Google Sheets.
Google Sheets has been part of many people's daily work routine since its release, yet many may not know that it offers a family of import functions that can actually be used to grab data from the web automatically, without needing to write any code at all.
In this post, I will first show you the specific steps to build a simple web scraper with Google Sheets. Then, I will walk through the same process side by side using an automatic web scraper, Octoparse. You can then decide which approach works better for your specific web scraping needs.
Build an easy web scraper using ImportXML in Google Sheets
Step 1: Open a new Google sheet
Step 2: Open the target website, Games sales, in Chrome. Right-click on the page and select "Inspect", then press "Ctrl + Shift + C" to activate the element selector. Now, whenever you hover over an item on the page, the corresponding markup is highlighted in the inspection panel.
Step 3: Copy and paste the website URL into the sheet.
Let's try to grab price data with a simple formula: ImportXML
First, we need to copy the XPath of the element from the inspection panel.
The formula we need to enter in the Google sheet takes the form =IMPORTXML("URL", "XPath expression"). Note that the "XPath expression" here is the one we just copied from Chrome; remember to replace the double quotation marks "" inside the XPath expression with single quotes ''.
And here's the result we get.
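Behind the scenes, ImportXML simply evaluates an XPath expression against the page's markup. For readers curious what that means, here is a rough Python sketch of the same lookup; the HTML fragment and the XPath below are made up for illustration and are not taken from the actual Games sales page:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment standing in for the downloaded page.
page = """<div class="game">
  <span class="title">Example Game</span>
  <span class="price">$19.99</span>
</div>"""

tree = ET.fromstring(page)

# Hypothetical XPath; in practice you copy it from Chrome's inspector,
# just as described above. Note the single quotes inside the expression.
prices = [el.text for el in tree.findall(".//span[@class='price']")]
print(prices)  # → ['$19.99']
```

This is exactly what the spreadsheet formula does for you: fetch the page, evaluate the XPath, and spill the matches into cells.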
There's another formula we can use: =IMPORTHTML("URL", "QUERY", Index), where QUERY is either "table" or "list" and Index tells Google Sheets which table or list on the page to pull. With this formula we can easily extract a whole table at once.
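IMPORTHTML, in turn, scans the page for table (or list) elements and returns the one at the given index. A minimal Python sketch of that idea, using a made-up two-row table rather than the real Games sales page:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the rows of the tables in an HTML page."""
    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows, as lists of cell strings
        self._row = None        # row currently being built
        self._in_cell = False   # are we inside a <td>/<th>?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Made-up table standing in for the target page.
page = """<table>
  <tr><th>Game</th><th>Sales</th></tr>
  <tr><td>Example Game</td><td>1,000,000</td></tr>
</table>"""

parser = TableParser()
parser.feed(page)
print(parser.rows)  # → [['Game', 'Sales'], ['Example Game', '1,000,000']]
```

The formula hides all of this: you only supply the URL, the query type, and the index of the table you want.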
Now, let's see how the same scraping task can be accomplished with a visual web scraper, Octoparse.
Step 1: Open Octoparse, go to "Quick Start", and select "New Task (Advanced Mode)".
Step 2: Enter the task name, select a category, then click "Next".
Step 3: Enter the target website URL, and click "Go".
Same as the Google Sheets example, the target website is Games sales.
Step 4: Notice how the data we want to capture is arranged in a list of similarly laid-out sections. Create a list of all the section layouts, then instruct the program to loop through each selected section by choosing the "Loop" option.
Step 5: In the section outlined in blue, click to extract any data needed. The outlined section serves as a template for the rest of the sections in the list: once the extraction runs, the same type of data will be extracted from each individual section, following the set-up of the template section.
Step 6: Create a loop for pagination
Simply locate the pagination button "Next" on the webpage, click on it, then select "Loop click Next Page" from the prompt. Octoparse is now configured to loop through all the pages.
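The "loop through all the pages" behavior Octoparse sets up here is the same pattern any scraper follows: extract data from the current page, then follow the "Next" link until there is none. A minimal Python sketch, where fetch_page and the page contents are made-up stand-ins for real HTTP requests and parsing:

```python
# Fake pages standing in for real HTTP responses; the last page has no
# "next" link, which is what ends the loop.
PAGES = {
    "/page/1": {"items": ["Game A", "Game B"], "next": "/page/2"},
    "/page/2": {"items": ["Game C"], "next": None},
}

def fetch_page(url):
    """Stand-in for downloading and parsing a page."""
    return PAGES[url]

def scrape_all(start_url):
    items, url = [], start_url
    while url is not None:           # stop when there is no "Next" button
        page = fetch_page(url)
        items.extend(page["items"])  # extract data from the current page
        url = page["next"]           # follow the pagination link
    return items

print(scrape_all("/page/1"))  # → ['Game A', 'Game B', 'Game C']
```

In Octoparse, clicking "Loop click Next Page" builds this loop for you, so no code is needed.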
By now, the scraper is done. Click "Next" to follow through the remaining steps and start the extraction.
The data looks nice and clean, and I am happy. Click the outlined button to export the captured data into XLS, CSV, TXT, or other desired formats. Learn more about how Octoparse can quickly help you capture any data from the web.
Author: The Octoparse Team