Writing a crawler that can extract information from websites usually takes a lot of time and effort. But what if you had a web scraping tool that requires no programming skills? If you’re new to web scraping, Octoparse is exactly what you need.
Scraping a website usually means scraping the content of its pages. Here we share the easiest way for beginners to do it.
A website usually has a homepage, list pages, content pages, and labels and classifications. The content pages are the most important.
We’ll use www.realtor.com as an example.
After setting the basic information for your task, click “Next”.
Then, in the browser, open the page you want to scrape data from, for instance the Apartment for Rent category in San Francisco.
Scroll down to the bottom of the page.
To scrape across multiple pages, you need to create a “Loop to keep clicking next page”. When you click the “Next page” button here, however, the usual “Loop click next page” option does not appear.
In this case, you need to configure the pagination scraping rule in another way.
To do this, drop a “Loop” item into the Workflow designer. Choose a “Loop Mode” under “Advanced Options”. Select “Single Element Option”.
Make sure you locate the pagination link correctly.
Then click the XPath, copy it, and paste it into the text box.
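If you are curious what an XPath for a pagination link does, here is a minimal Python sketch. The markup, the class names, and the `find_next_link` helper are all hypothetical; the real page's structure will differ, and Octoparse evaluates the XPath for you inside the tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup (not taken from the real site).
SAMPLE_HTML = """
<div class="pagination">
  <a class="page" href="/page-1">1</a>
  <a class="page" href="/page-2">2</a>
  <a class="next" href="/page-2">Next</a>
</div>
"""


def find_next_link(markup, xpath=".//a[@class='next']"):
    """Return the href of the first element matching the XPath, or None."""
    root = ET.fromstring(markup.strip())
    node = root.find(xpath)
    return None if node is None else node.get("href")


next_href = find_next_link(SAMPLE_HTML)
print(next_href)
```

The point of the sketch is that the XPath should match only the “next page” link, not the numbered page links around it.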
Next, drop a “Click element” action into the “Loop item”.
Choose “Click Loop items” under “Advanced Options”.
Now you’ve configured pagination crawling.
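Conceptually, the loop you just configured keeps following the “next page” link until there is none left. The sketch below imitates that idea in Python with a small in-memory stand-in for a website; the `PAGES` data and the `crawl` function are assumptions made for illustration, not how Octoparse works internally.

```python
# Hypothetical in-memory "site": each page lists items and names the next page.
PAGES = {
    "/page-1": {"items": ["A", "B"], "next": "/page-2"},
    "/page-2": {"items": ["C"], "next": "/page-3"},
    "/page-3": {"items": ["D", "E"], "next": None},
}


def crawl(start, pages, max_pages=100):
    """Collect items page by page, following "next" until it runs out."""
    collected = []
    visited = set()
    current = start
    # Stop when there is no next page, when a page repeats, or at the page cap.
    while current is not None and current not in visited and len(visited) < max_pages:
        visited.add(current)
        page = pages[current]
        collected.extend(page["items"])
        current = page["next"]
    return collected


all_items = crawl("/page-1", PAGES)
print(all_items)
```

The `visited` set and the `max_pages` cap are there so the loop cannot run forever if a “next” link ever points back to an earlier page.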
Then create a list of items as usual.
Click on the first title > Select “Create a list of items” > Add current item to the list > Continue to edit the list.
Click on the second title > Add current item to the list > Finish creating list > Click Loop to process.
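Picking the first and second titles lets the tool work out the pattern shared by every title on the page. A rough Python equivalent, assuming a hypothetical list page whose titles all share the same class, could look like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical list page; the class names and links are made up for illustration.
LIST_PAGE = """
<ul>
  <li class="listing"><a class="title" href="/detail/1">Sunny studio</a></li>
  <li class="listing"><a class="title" href="/detail/2">Two-bedroom flat</a></li>
  <li class="listing"><a class="title" href="/detail/3">Loft near the park</a></li>
</ul>
"""


def list_item_links(markup):
    """Return (title, href) for every link that follows the shared pattern."""
    root = ET.fromstring(markup.strip())
    links = root.findall(".//li[@class='listing']/a[@class='title']")
    return [(a.text, a.get("href")) for a in links]


for title, href in list_item_links(LIST_PAGE):
    print(title, href)
```

Each link in the result corresponds to one loop item, that is, one detail page to visit.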
Now you’re on the detail page.
Then start scraping the data you need. Click on the name and choose “Extract text”.
You can extract other information in the same way.
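“Extract text” simply takes the visible text of the element you clicked. As a sketch of the same idea in Python, assuming a hypothetical detail page and made-up field names:

```python
import xml.etree.ElementTree as ET

# Hypothetical detail page; the real page's markup will differ.
DETAIL_PAGE = """
<div class="detail">
  <h1 class="name">  Sunny studio </h1>
  <span class="price">$2,500</span>
  <span class="address">123 Example St</span>
</div>
"""

# One XPath per field to extract (the field names are assumptions).
FIELDS = {
    "name": ".//h1[@class='name']",
    "price": ".//span[@class='price']",
    "address": ".//span[@class='address']",
    "phone": ".//span[@class='phone']",  # not on the page, so it yields None
}


def extract_fields(markup, fields):
    """Return a dict mapping each field to its trimmed text, or None if absent."""
    root = ET.fromstring(markup.strip())
    record = {}
    for field, xpath in fields.items():
        node = root.find(xpath)
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


record = extract_fields(DETAIL_PAGE, FIELDS)
print(record)
```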
To get rid of the dollar sign, select the field you want to reformat. Click the Customize Field button. Choose “Reformat extracted data”. Click “Add step”. Click “Replace strings”. Copy the dollar sign and paste it into the “Replace” box.
Don’t type anything in the “With” box. Click “Calculate”, and the dollar sign will be removed. Then click “OK”, and click “done”.
Now there’s no dollar sign in the data you captured.
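The “Replace strings” step amounts to replacing the dollar sign with an empty string. A minimal Python sketch of the same operation, using made-up sample values:

```python
def remove_substring(value, target="$"):
    """Replace every occurrence of target with nothing (an empty "With" box)."""
    return value.replace(target, "")


# Made-up sample values; the last one has no dollar sign and stays unchanged.
prices = ["$2,500", "$3,100", "1,950"]
cleaned = [remove_substring(price) for price in prices]
print(cleaned)
```

Values without a dollar sign pass through untouched, so the step is safe to apply to the whole field.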
Once you have finished configuring the scraping rule, click “Next”.
Now choose “Local extraction” to run the task on your computer.
If you need to scrape a large amount of data, choose “Cloud Extraction” to run your task in the cloud.
The extracted data will be shown in this pane, along with the configured rule of the task. You can also check the built-in browser to see whether the task runs as expected.
Export the results to Excel or another format, such as a database, and save the file to your computer.
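Octoparse does the export for you. If you ever need to do the same step by hand, the sketch below writes records as CSV text using Python's standard `csv` module. CSV is an assumption here, chosen as a plain format that Excel can open; the records and the `rows_to_csv` helper are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up records standing in for scraped data.
records = [
    {"name": "Sunny studio", "price": "2,500"},
    {"name": "Two-bedroom flat", "price": "3,100"},
]
csv_text = rows_to_csv(records, ["name", "price"])
print(csv_text)
```

To save the result, write `csv_text` to a file opened with `newline=""` so the line endings are not altered. Note that values containing a comma, such as the prices here, are quoted automatically.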
Author: The Octoparse Team