The latest version of this tutorial is available here. Check it for up-to-date instructions.
Generally, a task created in Octoparse begins with opening the target web page. To help with this step, we provide two features: ad blocking and clear cache. Used properly, these features can noticeably speed up your web scraping.
Features covered in this tutorial are ad blocking and clear cache.
The extraction speed of a crawler depends on how fast the page loads. If the web page carries many unexpected ads, such as banners and pop-ups, it loads slowly and the whole task takes longer. Ad blocking reduces the number of requests the page makes and so shortens the loading time.
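To make the idea of "fewer page requests" concrete, here is a minimal Python sketch. It is not how Octoparse implements ad blocking; the blocklist, the host names, and the function names are all hypothetical. It only illustrates that requests to known ad hosts can be skipped before they are sent.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; real ad blockers rely on much larger filter lists.
AD_HOSTS = {"ads.example.com", "tracker.example.net"}


def is_ad_request(url, ad_hosts=AD_HOSTS):
    """Return True if the URL's host is a blocked host or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == blocked or host.endswith("." + blocked)
               for blocked in ad_hosts)


def filter_requests(urls, ad_hosts=AD_HOSTS):
    """Split page requests into those that are sent and those that are skipped."""
    allowed = [u for u in urls if not is_ad_request(u, ad_hosts)]
    blocked = [u for u in urls if is_ad_request(u, ad_hosts)]
    return allowed, blocked


# Made-up requests that a single page might trigger while loading.
page_requests = [
    "https://shop.example.org/products",
    "https://shop.example.org/static/app.js",
    "https://ads.example.com/banner.js",
    "https://cdn.tracker.example.net/pixel.gif",
]

allowed, blocked = filter_requests(page_requests)
print(f"{len(blocked)} of {len(page_requests)} requests skipped")
```

Every skipped request is one less resource to download before the page is ready for extraction.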
How to block Ads
There are two ways in Octoparse to set up "Ad Blocking".
1. Select the "Go To Web Page" step. You can find "Ad Blocking" under "Advanced Options".
2. Alternatively, click "Settings", where you can see the "Block ads" option.
Using ad blocking may change the structure of some web pages. If it does, adjust the XPath to re-locate the elements.
Learn more about locating elements with XPath.
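The following Python sketch shows why an XPath may stop matching once ads are removed. The markup, the element names, and the helper function are invented for illustration, and the sketch uses Python's standard `xml.etree.ElementTree` module, which supports only a subset of XPath. A path that relies on position shifts when the ad element disappears, while a path that relies on an attribute keeps working.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragments: the same page with and without an ad banner.
WITH_ADS = "<body><div class='ad'>Banner</div><div id='price'>19.99</div></body>"
WITHOUT_ADS = "<body><div id='price'>19.99</div></body>"


def first_text(markup, xpath):
    """Return the text of the first element matching xpath, or None."""
    root = ET.fromstring(markup)
    node = root.find(xpath)
    return node.text if node is not None else None


positional = "./div[2]"                 # "the second div": depends on the ad being there
by_attribute = ".//div[@id='price']"    # "the div with id 'price'": independent of ads

for label, markup in (("with ads", WITH_ADS), ("without ads", WITHOUT_ADS)):
    print(label, "| positional:", first_text(markup, positional),
          "| by attribute:", first_text(markup, by_attribute))
```

If an XPath built before enabling ad blocking stops returning data, switching from a positional path to one based on a stable attribute is usually the first adjustment to try.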
In some cases, for example when you need to clear the cookies remembered for extracting data behind a login, you can use Octoparse's clear cache option and then reload the page.
How to clear cache
1. Select the "Go To Web Page" step. You can find "Clear Cache" under "Cache Settings".
2. After the page has opened, you can also have Octoparse remember the new cookie.
Now Octoparse has "remembered" the new cookie.
1. Cookies come in different forms, and their validity periods differ. Some last a long time, while others expire as soon as the browser is closed. In Octoparse, saved cookies stop working once they expire. When that happens, use "Clear Cache" and load the cookie again.
2. Cache Settings are especially important for websites that require a login. Learn more about extracting data behind a login.
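The difference between long-lived cookies and cookies that end with the browser session can be sketched in Python as follows. This is a simplified model, not Octoparse's internal logic: the `SavedCookie` class, its fields, and both functions are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SavedCookie:
    name: str
    # Unix timestamp at which the cookie expires.
    # None marks a session cookie, which ends when the browser is closed.
    expires_at: Optional[float]


def is_usable(cookie, now, browser_restarted):
    """Return True if the saved cookie can still be used at time `now`."""
    if cookie.expires_at is None:
        # Session cookies do not survive a browser restart.
        return not browser_restarted
    return now < cookie.expires_at


def needs_relogin(cookies, now, browser_restarted=True):
    """Return True if any saved cookie is no longer usable."""
    return any(not is_usable(c, now, browser_restarted) for c in cookies)


# Made-up saved cookies: one expires at t=2000, one is a session cookie.
saved = [
    SavedCookie("auth_token", expires_at=2000.0),
    SavedCookie("session_id", expires_at=None),
]

# Same browser session at t=1000: both cookies are still valid.
print(needs_relogin(saved, now=1000.0, browser_restarted=False))
# After a restart the session cookie is gone, so the login must be repeated.
print(needs_relogin(saved, now=1000.0, browser_restarted=True))
```

When a check like this indicates that a saved cookie is no longer usable, the equivalent action in Octoparse is to clear the cache and log in again so that a fresh cookie is saved.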