Step-by-step tutorials for you to get started with web scraping
Lesson 3: Getting data - Capture text from a page
Thursday, August 16, 2018
1) Start a new task and enter the URL of the target web page
[Download the task file in this lesson]
Once you're logged in, click the "+ Task" button under Advanced Mode to create a new task. Then enter one or more URLs.
1. What is a task?
A task is a crawler for scraping data, usually from one website, with no limit on the number of pages or URLs it can request. The crawlers in Octoparse are defined by the scraping tasks you configure. A scraping task tells Octoparse which website to open, what data to extract, and so on.
2. Why should I use Advanced Mode?
Advanced Mode is a flexible mode that can accommodate many different kinds of websites. It lets you customize each individual action needed to perform the extraction, including keyword searches, login authentication, opening dropdowns, etc.
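To make the idea of a task more concrete, here is a minimal Python sketch of what a task conceptually holds: the URLs to open and the data fields to extract. This is only an illustration, not how Octoparse stores tasks internally. The class name `ScrapingTask`, its attributes, and the example URL are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScrapingTask:
    """Hypothetical model of a task: which pages to open and what to extract."""
    name: str
    urls: List[str]
    fields: List[str] = field(default_factory=list)

    def describe(self) -> str:
        # Summarize the task in one line, the way a task list might show it.
        return (f"{self.name}: open {len(self.urls)} URL(s), "
                f"extract {', '.join(self.fields)}")


# Placeholder URL; the lesson does not give the blog post's real address.
task = ScrapingTask(
    name="Blog post example",
    urls=["https://example.com/blog/some-post"],
    fields=["title", "posted_date", "content"],
)
summary = task.describe()
print(summary)
```

A single task can list several URLs, which matches the "enter one or more URLs" step above.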
Here we take one of our blog posts as an example. Suppose our goal is to extract the post's information, such as the title, the posted date, and the content, from the page.
Copy and paste the URL into the "Extraction URL" textbox. Then click "Save URL", and Octoparse will open the web page in the built-in browser.
2) Click on the target data to capture
Now, start to capture the data you need by clicking directly on the various pieces of information.
When the data is selected properly, the selection will be highlighted in green.
Click on the title of the post, the posted date, or the content to capture it.
Notice that the data you clicked on now appears in the Action Panel. You can edit the field names by clicking on them, or leave that until later. Select "Extract Data" to complete the text extraction action.
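For readers curious about what point-and-click text capture corresponds to in code, the sketch below pulls the same three fields out of a page using only Python's standard `html.parser` module. It is a rough analogy under stated assumptions, not Octoparse's own method: the sample markup and the class names `title`, `date`, and `content` are hypothetical, since the lesson does not show the blog page's HTML.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the blog post page.
SAMPLE_HTML = """
<article>
  <h1 class="title">Sample post title</h1>
  <span class="date">2018-08-16</span>
  <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</article>
"""

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class FieldExtractor(HTMLParser):
    """Collect the text inside elements whose class matches a wanted field."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.fields = {name: [] for name in wanted_classes}
        self._current = None  # field currently being captured, if any
        self._depth = 0       # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self._current = name
                self._depth = 1
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current = None  # left the captured element

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.fields[self._current].append(data.strip())


def extract_fields(html, wanted_classes):
    parser = FieldExtractor(wanted_classes)
    parser.feed(html)
    # Join the text fragments of each field into one string.
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


row = extract_fields(SAMPLE_HTML, ["title", "date", "content"])
print(row)
```

Each key in `row` plays the role of a field name in the Action Panel, and each value is the captured text.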
1. Turn on the “Workflow” button to preview the workflow you have designed.
2. In Octoparse 7.X, the task name is generated automatically and shown at the top of the configuration interface. To change it, just click the textbox and type in the desired name. Don’t forget to save your changes.
3) Save and run the task to capture data
Click "Save and run" in the Action Panel or, alternatively, click "Start Extraction" to run the completed task.
Here is the data we got from running the task.
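The result of a run is tabular: one row per page, one column per field. As a rough illustration of that shape, the sketch below writes such rows to CSV text with Python's standard `csv` module. The row values and field names are hypothetical placeholders, not the actual output of the task in this lesson.

```python
import csv
import io

# Hypothetical extracted rows; the lesson's real output is not reproduced here.
rows = [
    {
        "title": "Sample post title",
        "posted_date": "2018-08-16",
        "content": "First paragraph.",
    },
]


def rows_to_csv(rows, fieldnames):
    """Serialize extracted rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["title", "posted_date", "content"])
print(csv_text)
```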