Advanced Mode
Thursday, August 16, 2018
What is Advanced Mode?
Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape from websites with complex structures, we strongly recommend Advanced Mode to start your data extraction project.
With Octoparse Advanced Mode, you can:
· scrape data from almost any kind of web page;
· extract data such as text, URLs, images, and HTML;
· design a workflow that interacts with the web page, for example logging in, searching by keyword, or opening a drop-down menu;
· customize your workflow, for example by setting a wait time, modifying the XPath, or reformatting the extracted data.
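To make the data types listed above concrete, here is a minimal sketch in Python of what extracting text, a URL, an image address, and raw HTML with XPath-style selectors can look like. It is not Octoparse code: the page fragment, the class name "item", and the function name extract_items are all assumptions made up for illustration, and it uses only the limited XPath subset of Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment (an assumption for this sketch). It is kept
# well-formed so that the standard-library XML parser accepts it.
PAGE = """
<div>
  <div class="item">
    <a href="/p/1"><b>Blue</b> Lamp</a>
    <img src="/img/1.png" />
  </div>
  <div class="item">
    <a href="/p/2">Red Chair</a>
    <img src="/img/2.png" />
  </div>
</div>
"""


def extract_items(markup):
    """Return one row per item with its text, URL, image address, and HTML."""
    root = ET.fromstring(markup)
    rows = []
    # Select every <div class="item"> below the root element.
    for item in root.findall(".//div[@class='item']"):
        link = item.find("./a")
        image = item.find("./img")
        rows.append({
            # Text: all text inside the link, including nested tags.
            "text": "".join(link.itertext()).strip(),
            # URL: the link's href attribute.
            "url": link.get("href"),
            # Image: the address stored in the src attribute.
            "image": image.get("src"),
            # HTML: the serialized markup of the link element itself.
            "html": ET.tostring(link, encoding="unicode").strip(),
        })
    return rows


rows = extract_items(PAGE)
for row in rows:
    print(row)
```

Changing the selector string passed to findall is the code-level counterpart of modifying the XPath of an action: the same extraction logic then targets different elements.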
If the website you want to scrape is very simple, you can begin your first data hunting trip with Wizard Mode.
In this tutorial, we will guide you through the three main steps of creating a task with Advanced Mode and cover the features unique to Advanced Mode.
1. Interact with the web page in the built-in browser
· Action Tips
2. Design the workflow
· Task actions in the workflow
· Workflow execution order
3. Customize the workflow
· Customize task actions
1) Create a new task in Advanced Mode
1. Click "+Task" under Advanced Mode
2. Enter the URL and click "Save URL"
2) Design and customize the workflow
After clicking "Save URL", you enter the task configuration interface.
The most critical part of a task is the workflow for your specific data extraction requirements. Octoparse executes every action configured in the workflow to complete your data collection.
Under Advanced Mode, the task configuration interface can switch between two modes: Select Mode and Workflow Mode.
By default, Octoparse opens the task in Select Mode. Use the on/off button at the upper right corner to turn on Workflow Mode. With Workflow Mode on, you get a clearer picture of what each step of your task does and are less likely to mix up the steps.
Now, let's start building the workflow together.
1. Interact with the web page in the built-in browser - to capture any web data with simple clicks
1.1 Action Tips
While building a new task, usually you will begin by selecting the data you want on the web page for Octoparse to scrape.
Under Advanced Mode, when you interact with the web page in the built-in browser, Octoparse responds to you by offering notices and available activities in Action Tips.
You can capture any web data with simple clicks. All you need to do is click on the desired data field to capture and select the appropriate action to perform from Action Tips.
2. Design the workflow - to tell Octoparse where and in which order to select and extract the data you want
2.1 Task actions in the workflow
Once you click an element on the page in the built-in browser, Octoparse predicts the data you might want to capture and lists all the available activities in Action Tips. After you select the activity you need, the corresponding task action is generated automatically in the workflow.
A workflow is built from 10 task actions.
For example, when you click "Extract the text of the selected element" in Action Tips, an Extract Data action is added to the workflow; when you select "Click element", a Click Item action is generated in the workflow.
Besides clicking, you can also add a task action to the workflow by dragging and dropping it, which gives you more flexibility while designing your workflow.
1. The Branch Judgement action can only be added to the workflow manually. Learn more about branch judgement.
2. Pagination Loop is one of the Loop Item types, while Click to paginate is a variant of Click Item. Both are created in the workflow when you extract multiple pages through pagination.
3. If you want to view the full introduction to all task actions in the workflow, click here.
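The idea behind a Pagination Loop with a Click to paginate step can be sketched in a few lines of Python. This is only a conceptual model, not how Octoparse is implemented: the PAGES dictionary stands in for a hypothetical website, and the names scrape_all_pages and max_pages are assumptions introduced for the example.

```python
# Hypothetical site (an assumption for this sketch): each page lists some
# items and names the page reached by clicking "next", or None on the last page.
PAGES = {
    "page1": {"items": ["A", "B"], "next": "page2"},
    "page2": {"items": ["C"], "next": "page3"},
    "page3": {"items": ["D", "E"], "next": None},
}


def scrape_all_pages(start, pages, max_pages=100):
    """Extract the items of every page, following 'next' until it runs out."""
    collected = []
    current = start
    visited = 0
    # max_pages is a safety limit so a site that always offers a
    # "next" link cannot keep the loop running forever.
    while current is not None and visited < max_pages:
        page = pages[current]
        collected.extend(page["items"])  # extract the data on this page
        current = page["next"]           # "click to paginate"
        visited += 1
    return collected


all_items = scrape_all_pages("page1", PAGES)
print(all_items)
```

The loop ends either when there is no next page or when the safety limit is reached, which mirrors the two usual ways a pagination loop stops.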
2.2 Workflow execution order
Octoparse executes the actions in the workflow from top to bottom. Actions wrapped in a Loop Item are executed multiple times. You can change the order of the workflow by dragging an action up or down.
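The execution order described above can be modeled with a short Python sketch. It only illustrates the top-down rule and the repetition inside a Loop Item; the Action and Loop classes and the run function are hypothetical names for this example and do not correspond to any Octoparse API.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """A single step, such as Go To Web Page or Extract Data."""
    name: str


@dataclass
class Loop:
    """A Loop Item: its body runs once for every item in the loop."""
    name: str
    items: list
    body: list = field(default_factory=list)


def run(workflow, log, context=None):
    """Walk the workflow from top to bottom and record each executed action."""
    for step in workflow:
        if isinstance(step, Loop):
            # Everything wrapped in the loop is repeated for each item.
            for item in step.items:
                run(step.body, log, item)
        else:
            suffix = f" [{context}]" if context is not None else ""
            log.append(step.name + suffix)
    return log


workflow = [
    Action("Go To Web Page"),
    Loop("Loop Item", items=["row 1", "row 2"],
         body=[Action("Click Item"), Action("Extract Data")]),
]

log = run(workflow, [])
for entry in log:
    print(entry)
```

In this model, moving an Action to a different position in the list changes when it runs, which corresponds to dragging an action up or down in the workflow.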
3. Customize the workflow - to further configure every single action in the workflow
3.1 Customize task actions
Now you've finished designing the workflow. By clicking each step in the workflow, you can see how Octoparse interacts with the website and whether the target data fields are extracted as expected.
To make scraping effective, Advanced Mode offers a full range of customization options for configuring the extraction actions and the extracted data.
Click an action in the workflow to see all of its available options in the Customize Action area.
For example, for an Extract Data action, you can change the field name of the extracted data from "Field1_Text" to "Title", or delete an extracted field by clicking its delete icon.
For a Go To Web Page action, you can block pop-up windows so that ads do not slow down the extraction.
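Renaming or deleting a field, as in the Extract Data example above, amounts to a simple transformation of the extracted rows. The Python sketch below shows the effect on sample data; the helper names rename_field and delete_field, the field "Field2_URL", and the sample values are assumptions for illustration and are not part of Octoparse.

```python
def rename_field(rows, old, new):
    """Return copies of the rows with the field `old` renamed to `new`."""
    return [
        {(new if key == old else key): value for key, value in row.items()}
        for row in rows
    ]


def delete_field(rows, name):
    """Return copies of the rows without the field `name`."""
    return [
        {key: value for key, value in row.items() if key != name}
        for row in rows
    ]


# Hypothetical extracted data with default field names.
rows = [
    {"Field1_Text": "Blue Lamp", "Field2_URL": "/p/1"},
    {"Field1_Text": "Red Chair", "Field2_URL": "/p/2"},
]

renamed = rename_field(rows, "Field1_Text", "Title")
titles_only = delete_field(renamed, "Field2_URL")
print(renamed)
print(titles_only)
```

Both helpers return new rows and leave the original data unchanged, so the default field names stay available if you need to compare results.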
3) Run the task
Once you have confirmed the configuration, click "Start Extraction" to run your task.