
COLLECTING DATA USING OCTOPARSE

Thursday, June 30, 2016 2:52 AM


 

With Octoparse, we can collect publicly available data from websites by creating a task workflow. In this article, we will use Yell.com as an example.

Yell is the UK's leading online business directory. You can search for local businesses across the UK on this website. In this tutorial, we will show you how to collect business details on Yell.com with Octoparse.

To demonstrate, we will use the URL below as an example.

https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=627415385&keywords=dentists&location=London
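The example URL carries the search terms in its query string, which can be inspected with a short Python sketch. The helper names search_terms and with_search_terms are illustrative, not part of Octoparse. Whether Yell accepts a URL rebuilt this way, and what the scrambleSeed parameter does, are not covered by this tutorial, so treat both as assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# The example URL from this tutorial.
SEARCH_URL = (
    "https://www.yell.com/ucs/UcsSearchAction.do"
    "?scrambleSeed=627415385&keywords=dentists&location=London"
)


def search_terms(url):
    """Return the keywords and location carried in the URL's query string."""
    query = parse_qs(urlsplit(url).query)
    return {
        "keywords": query.get("keywords", [""])[0],
        "location": query.get("location", [""])[0],
    }


def with_search_terms(url, keywords, location):
    """Hypothetical helper: rebuild the URL with different search terms.

    All other query parameters (such as scrambleSeed) are kept unchanged.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query["keywords"] = [keywords]
    query["location"] = [location]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


print(search_terms(SEARCH_URL))
print(with_search_terms(SEARCH_URL, "plumbers", "Leeds"))
```

A URL produced this way could be pasted into Octoparse in step 1 in place of the example URL, provided the site accepts it.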

We will scrape data such as Title, Address, Phone number, and Website from the web page.

 

Here are the main steps in this tutorial: [Download demo task file here]

  1. Go to Web Page - open the target web page
  2. Auto-detect web page data - to set up the workflow
  3. Extract data - extract phone numbers and websites
  4. Start extraction - run the task and get the data

 

1. Go to Web Page - open the target web page

a. Enter the URL on the home page

b. Click Start to create a new task

 

2. Auto-detect web page data - to set up the workflow

a. Click the Auto-detect web page data

b. Wait for the detection to complete

c. Go to Data preview to see if you're okay with the current data output

You can delete unnecessary data fields or rename data fields directly by clicking the corresponding icon.


d. Uncheck the Add a page scroll option

e. Click Create workflow

Octoparse will automatically generate a workflow with the data fields it has detected.

 

3. Extract data - extract phone numbers and websites

Auto-detection may miss some information. We can select those elements manually and add them to the task.

a. Select the Website link of the first business on the webpage (be sure to select it from the area highlighted in red)

b. Choose Extract the URL of the selected link


c. Click ... and change the XPath of the URL field to //a[contains(text(),'Website')]

d. Click Apply to confirm
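To see what the XPath //a[contains(text(),'Website')] selects, here is a minimal Python sketch that runs outside Octoparse. The HTML fragment is made up for illustration and is not Yell's real markup. Python's standard xml.etree.ElementTree module does not support the contains() function, so the sketch approximates the expression by checking each link's leading text.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (not copied from Yell.com).
WEBSITE_SAMPLE = """
<div>
  <div class="listing">
    <a href="https://example.com/practice-a">Website</a>
    <a href="/call">Call</a>
  </div>
  <div class="listing">
    <a href="https://example.org/practice-b">Visit Website</a>
  </div>
</div>
"""


def website_links(markup):
    """Return the href of every <a> whose leading text contains 'Website'.

    This approximates //a[contains(text(),'Website')] followed by
    "Extract the URL of the selected link".
    """
    root = ET.fromstring(markup)
    return [
        link.get("href")
        for link in root.iter("a")
        if "Website" in (link.text or "")
    ]


print(website_links(WEBSITE_SAMPLE))
```

Matching on the link text rather than on the link's position is what lets the same field pick up the Website link for every listing, while other links such as Call are skipped.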


Scraping phone numbers is trickier here: the numbers are not visible on the web page but are stored in its HTML code. We can add a field and then modify its XPath to get the phone number.

a. Select the Call button on the page and choose Extract the text of the element

b. Click ... and change the XPath of the field to //span[@itemprop="telephone"]

c. Click Apply to confirm

Tip: The email address cannot be scraped in this case because the web page does not include it in its source code. Clicking the Email button takes you to a page where you can submit information.

d. Rename the fields if needed
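The idea behind //span[@itemprop="telephone"] can be sketched in Python: the number sits in an element that is present in the HTML even when it is not displayed. The fragment and phone numbers below are placeholders invented for illustration and do not come from Yell.com. The standard xml.etree.ElementTree module supports this kind of attribute predicate directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the number is in the source but hidden from view.
PHONE_SAMPLE = """
<div>
  <div class="listing">
    <button>Call</button>
    <span itemprop="telephone" style="display:none"> 020 7946 0000 </span>
  </div>
  <div class="listing">
    <button>Call</button>
    <span itemprop="telephone" style="display:none">020 7946 0001</span>
  </div>
</div>
"""


def phone_numbers(markup):
    """Return the text of every <span itemprop="telephone"> element."""
    root = ET.fromstring(markup)
    spans = root.findall(".//span[@itemprop='telephone']")
    # Strip surrounding whitespace, mirroring "Extract the text of the element".
    return [(span.text or "").strip() for span in spans]


print(phone_numbers(PHONE_SAMPLE))
```

Selecting by the itemprop attribute rather than by the visible Call button is the reason the field returns the number instead of the button label.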

 

4. Start extraction - run the task and get the data

a. Click the save icon, then click the run icon in the upper left

b. Select Run on your device to run the task on your computer, or select Run in the Cloud to run the task in the Cloud (for premium users only)

You can export the result data in the provided formats, such as Excel, CSV, or JSON, or export it to your database.

Here is the sample output.

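Once the data is exported, it can be checked with a few lines of Python. The sketch below assumes a CSV export whose columns are named Title, Address, Phone, and Website; the actual column names depend on how you named the fields in the task, and the rows shown are placeholders, not real scraped results.

```python
import csv
import io
import json

# Placeholder export content; a real export would be read from a file.
EXPORTED_CSV = """Title,Address,Phone,Website
Example Dental Practice,"1 Example Street, London",020 7946 0000,https://example.com/practice-a
Sample Smiles,"2 Sample Road, London",,https://example.org/practice-b
"""


def load_rows(csv_text):
    """Parse exported CSV text into a list of dicts with trimmed values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: value.strip() for key, value in row.items()} for row in reader]


def rows_missing(rows, field):
    """Return the Title of every row whose given field is empty."""
    return [row["Title"] for row in rows if not row[field]]


rows = load_rows(EXPORTED_CSV)
print(json.dumps(rows, indent=2))   # the same records as JSON
print(rows_missing(rows, "Phone"))  # listings where no phone was captured
```

A check like rows_missing is a quick way to spot listings where a manually added field, such as the phone number, came back empty.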

 

 


 

Happy Data Hunting!

Author: The Octoparse Team
