In this tutorial, we will show you how to scrape, extract, and mine data from yellowpages.com. With Octoparse, you can extract the data you need to generate local leads for your business and boost your sales: business name, address, phone number, email, and so on. Any data you see on the webpage can be extracted with our free software, no coding needed. Just enter the URL, configure a few steps, and get thousands of potential leads within minutes!
After configuration, simply run the task and get data in structured formats such as CSV or JSON, or have it delivered directly to your database. (You can also connect the Octoparse Data API to your own system.)
By mining data from Yellowpages, you can:
· Create your own local business directory websites
· Collect phone numbers in bulk for cold calling
· Offer scraping services for businesses
· Sell the leads generated to your customers
· Scrape data for email marketing
In this tutorial, we will scrape information on anesthesiologists in New York from yellowpages.com as an example. (To follow along, you may want to use this link)
1) "Go To Web Page" - to open the target page
· Create the task with "Advanced Mode".
· Paste the URL into the "Extraction URL" box and click "Save" to move on.
· We strongly suggest turning on "Workflow" mode to get a better overview of what you are doing with your task, in case you make a mistake in the steps.
· "Advanced Mode" is highly recommended since it lets you handle almost all complex extraction cases, such as keyword searches, scraping behind a login, opening dropdowns, etc.
· Scroll down to the bottom of the target web page and click the "Next" button.
· Select "Loop click next page" in the "Action Tips" panel.
· Click the first two item titles one by one to create a "Loop Item" that clicks through each item on the list
(Make sure you select the area that contains the URL to access the item page)
· Click the "Select all" and "Loop clicks each element" buttons on the "Action Tips" panel.
4) Extract data - to select data you need to scrape
· Select the data you need to scrape on the item page, such as Name, Address, Opening hours, TEL, etc.
· Select "Extract data" and rename the "Field name" column if necessary.
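For readers who want to see how the workflow steps so far fit together, here is a minimal Python sketch of the same control flow: an outer loop that follows "Next" links, an inner loop over the items on each page, and a field extraction per item. It is not how Octoparse works internally. The `FAKE_SITE` data, the `iter_pages` and `scrape` helpers, and the sample names and numbers are all hypothetical and exist only to make the nesting visible.

```python
from typing import Dict, Iterator, List, Optional

# Hypothetical stand-in for a paginated listing site. Each page holds
# some item records and the key of the next page (None on the last page).
FAKE_SITE: Dict[str, dict] = {
    "page1": {
        "items": [
            {"Name": "Clinic A", "TEL": "555-0100"},
            {"Name": "Clinic B", "TEL": "555-0101"},
        ],
        "next": "page2",
    },
    "page2": {
        "items": [{"Name": "Clinic C", "TEL": "555-0102"}],
        "next": None,
    },
}


def iter_pages(site: Dict[str, dict], start: str) -> Iterator[dict]:
    """Yield pages one by one, following the 'next' key until it is None."""
    key: Optional[str] = start
    while key is not None:
        page = site[key]
        yield page
        key = page["next"]


def scrape(site: Dict[str, dict], start: str, fields: List[str]) -> List[Dict[str, str]]:
    """Collect the requested fields from every item on every page."""
    rows: List[Dict[str, str]] = []
    for page in iter_pages(site, start):      # plays the role of "Loop click next page"
        for item in page["items"]:            # plays the role of "Loop Item"
            # plays the role of "Extract data": keep only the chosen fields
            rows.append({field: item.get(field, "") for field in fields})
    return rows


rows = scrape(FAKE_SITE, "page1", ["Name", "TEL"])
for row in rows:
    print(row)
```

The order of nesting matters: the item loop sits inside the pagination loop, so every item on a page is visited before the next page is opened.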
In some cases, the data you need is embedded in the HTML along with extra strings that you don't need. For example, we want to extract the star rating, but it cannot be captured by simply clicking on it. In this case, we extract the HTML first and then reformat the extracted data to trim the strings we don't need. To do this:
· Click the star-rating area and select "Extract outer HTML of the selected element".
· Select the "star-rating" row, click the "Customize data field" icon, select the "Refine extracted data" option and click the "Add step" button.
· Click "Match with Regular Expression" and enter the regular expression (?<=title=")(.+?)(?= star) into the "Regular Expression" box.
· Click the "OK" button.
· In Octoparse, you can use regular expressions to further process or clean the data you extract.
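To see what this regular expression does outside Octoparse, the sketch below applies the same pattern with Python's `re` module. The sample HTML string is an assumption made up for illustration; the real markup on the page may differ. The `extract_rating` helper is likewise hypothetical.

```python
import re
from typing import Optional

# Same pattern as in the tutorial: capture whatever sits between
# title=" and the following " star".
RATING_PATTERN = re.compile(r'(?<=title=")(.+?)(?= star)')


def extract_rating(outer_html: str) -> Optional[str]:
    """Return the text between title=" and " star", or None if absent."""
    match = RATING_PATTERN.search(outer_html)
    return match.group(1) if match else None


# Hypothetical outer HTML of a star-rating element.
sample_html = '<div class="rating" title="4.5 star rating"></div>'
print(extract_rating(sample_html))
```

The lookbehind `(?<=title=")` and lookahead `(?= star)` only assert their context without consuming it, so the match contains the rating value alone. The lazy `.+?` stops at the first " star" that follows.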
6) Run extraction - to run your task and get data
· Click "Start Extraction" and then "Local Extraction".
· Click the "Export" button to export the data after the extraction finishes.
· For much better performance, run your tasks with Octoparse Cloud Extraction. A task run with "Cloud Extraction" runs in the cloud on multiple servers using our IPs. You can shut down the app or your computer while the task is running, so there is no need to worry about hardware limitations. Extracted data is saved in the cloud and can be accessed at any time.
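Once the data is exported, it can be loaded into your own scripts. The following is a minimal sketch, assuming a CSV export whose columns are named Name, Address and TEL; the actual column names depend on the "Field name" values you chose, and the sample rows here are invented. It reads the CSV with Python's standard `csv` module and converts the rows to JSON.

```python
import csv
import io
import json
from typing import Dict, List

# Hypothetical sample of an exported CSV file. Real column names depend
# on the "Field name" values set in the task.
SAMPLE_EXPORT = (
    "Name,Address,TEL\n"
    'Clinic A,"1 Main St, New York, NY",555-0100\n'
    'Clinic B,"2 Park Ave, New York, NY",555-0101\n'
)


def load_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def rows_to_json(rows: List[Dict[str, str]]) -> str:
    """Serialize the parsed rows as a JSON array."""
    return json.dumps(rows, indent=2)


rows = load_rows(SAMPLE_EXPORT)
print(rows_to_json(rows))
```

For a real export, replace `io.StringIO(csv_text)` with a file opened via `open(path, newline="", encoding="utf-8")`, where `path` points to the exported file.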