Lesson 7: Wrap-up! Build your first scraping task

This is the last lesson of the intro series. We hope you've had fun learning something new and useful. To put all the pieces together, let's recap with a step-by-step tutorial on building a scraping task from scratch. We'll walk you through the entire process, from entering the URL to downloading the extracted data. Let's dive right in.


1. Start a new task

  • Enter the target URL into the search bar, then click Start to create a new task.


2. Start the Auto-detect

As soon as the webpage is loaded in the built-in browser, select Auto-detect web page data from the Tips panel. Octoparse will start detecting web page data right away.


3. Preview your data

Once the auto-detect process is complete, check your data in the Data preview section. Double-click a field name to rename it, or click the trash icon to remove fields you don't need.


4. Save auto-detect settings

Go back to the Tips panel and check the settings below:

  • Check the Add a page scroll box if your target website loads more items as the page scrolls

  • Check the Paginate to scrape more pages box if you'd like to scrape more than one page

  • Check that the correct pagination button (highlighted on the web page) has been selected

Now, click Create workflow and Octoparse will auto-generate the workflow.

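Conceptually, a workflow with pagination enabled behaves like a loop: extract the items on the current page, click the pagination button, and repeat until there is no next page. The sketch below only illustrates that idea in Python. It is not how Octoparse is implemented, and the names PAGES and scrape_all, as well as the in-memory "site", are hypothetical.

```python
from typing import List, Optional

# Hypothetical in-memory "site": each page lists its items and names the next page.
# A real website would be loaded in a browser instead.
PAGES = {
    "page1": {"items": ["Item A", "Item B"], "next": "page2"},
    "page2": {"items": ["Item C"], "next": None},
}


def scrape_all(start: str, max_pages: int = 100) -> List[str]:
    """Collect items page by page until no next page remains."""
    results: List[str] = []
    current: Optional[str] = start
    visited = 0
    # max_pages is a safety limit so a misconfigured "next" link cannot loop forever.
    while current is not None and visited < max_pages:
        page = PAGES[current]
        results.extend(page["items"])  # inner step: extract data from the page
        current = page["next"]         # outer step: follow the pagination button
        visited += 1
    return results


all_items = scrape_all("page1")
print(all_items)
```

If the wrong pagination button is selected, the loop either stops after the first page or keeps following the wrong link, which is why it is worth confirming the highlighted button before creating the workflow.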

If you also want to scrape data from the product detail page, in addition to the listing page, follow the steps below:

  • Select Click on link(s) to scrape the linked page(s)

  • Choose the option Click on an extracted data field, select product_url from the dropdown menu, and click Confirm

Notice that an extra step, Click URL in the list, is added to the workflow.


5. Select data from the detail page

You will now arrive on the detail page. Once again, select Auto-detect web page data from the Tips panel.

TIP: The auto-detection process will start automatically. You can switch between the detected results until you have the right data selected.


Click Create workflow to generate the updated workflow.

You can also select information on the web page manually to scrape it.

6. Clean the extracted data

Looking at the extracted data, there is something we would like to change: the "Location" field contains the preposition "from", which we want to remove. The Clean data option handles this.

Click the More icon in the top-right corner and select Clean data.

Then click Add step and choose Replace. To remove "from", replace it with nothing, so that it is stripped from every row that contains it.
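For readers who think in code, the Replace step corresponds roughly to a plain text substitution applied to every row. The Python sketch below is an illustration under assumptions, not Octoparse's own implementation: the sample rows and the helper name strip_word are hypothetical, and trimming the leftover whitespace is an extra step added here for tidiness.

```python
def strip_word(value: str, word: str = "from") -> str:
    """Replace every occurrence of `word` with nothing, then trim whitespace."""
    return value.replace(word, "").strip()


# Hypothetical sample rows standing in for the extracted data.
rows = [
    {"Title": "Item A", "Location": "from New York"},
    {"Title": "Item B", "Location": "from Austin"},
]

# Apply the replacement to the "Location" field of every row.
cleaned = [{**row, "Location": strip_word(row["Location"])} for row in rows]

for row in cleaned:
    print(row["Location"])
```

Note that a literal replacement removes "from" wherever it appears, including inside longer words, so it is worth checking the preview to confirm that only the intended text was removed.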

7. Test-run the task

The scraping task is now complete. As mentioned before, we recommend testing the workflow step by step to make sure each step does what it should. For example, clicking Go to Web Page should load the web page in the built-in browser without a problem.

Test-run the workflow by clicking through all the steps from top to bottom, and from the inside out for nested steps (such as pagination). Check that the web page responds as expected.


8. Schedule and run

Now that your task is fully tested and working, you can extract the data faster by running the task in the Cloud, or schedule it to run on a recurring basis.

To start a cloud run, click Standard Mode or Boost Mode under Run in the Cloud.

To schedule the task, click Schedule Local Runs or Schedule Cloud Runs.

Pick your desired frequency and designate a day and time for the run.


9. Export your data

Go to the Dashboard, find your task, and open its task status to view the extracted data. Click Export Data at the bottom and choose the format in which you'd like to download the data.
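Once the data is downloaded, you may want to check it in a script. The Python sketch below assumes you chose CSV as the export format and that the file has Title and Location columns; both are assumptions for illustration, and the inline sample text stands in for a real exported file.

```python
import csv
import io

# Hypothetical stand-in for an exported CSV file; the column names are assumptions.
SAMPLE_EXPORT = "Title,Location\nItem A,New York\nItem B,Austin\n"


def load_rows(handle):
    """Read CSV text from a file-like object into a list of dictionaries."""
    return list(csv.DictReader(handle))


rows = load_rows(io.StringIO(SAMPLE_EXPORT))
print(len(rows), "rows; fields:", list(rows[0].keys()))
```

To read a real file instead, pass an open file handle, for example `open("export.csv", newline="", encoding="utf-8")`, where the file name is whatever you chose when saving the export.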

Congrats! You've done a great job making it this far, and you are well on your way to becoming the next web scraping expert. We hope this is not the end of your learning but the beginning of your web scraping journey.
