Retry actions
Thursday, August 16, 2018
Retry action is a feature in Octoparse that reloads the web page you want to scrape when a specified condition is met.
Why set up "Retry"?
When a web page fails to load normally, Octoparse may be unable to scrape data from the page or to execute the subsequent actions. In this case, Octoparse needs to retry loading the page before starting the extraction.
How to set up "Retry"?
The Retry setting is available only in the three page-loading actions in the workflow: Go To Web Page, Click Item, and Click to Paginate.
· Tick the "Retry when" box, then click to configure the condition
Octoparse needs a condition to tell whether the page has loaded normally, so that it can retry loading the page when the load fails.
· Configure the "URL/content/element(XPath) contains" option and the "Contains/Does not contain" option
When a load fails, the web page usually responds with a message in the URL or content of the current page indicating what happened, such as "/errors", "500 Internal Server Error", or "Too many requests". Enter such a string in the textbox as the condition and select "Contains". Octoparse will then retry loading the page whenever it detects that string in the URL or content of the current page.
You can also enter the XPath of an element that appears only when the page loads normally. In this case, select "Does not contain". Then, whenever Octoparse fails to find an element matching that XPath on the current page, it will reload the page.
You can click to add multiple conditions for Octoparse to evaluate.
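The condition logic above can be sketched in Python. This is only an illustration of the described behavior, not Octoparse's actual implementation: each condition is modeled as a hypothetical (where, operator, value) triple, and the element(XPath) case is omitted since it would require an HTML parser.

```python
def needs_retry(url, page_html, conditions):
    """Return True if any condition signals a failed page load.

    conditions: list of (where, operator, value) triples, where
      - where is "url" or "content" (the "element(XPath)" case is left out here)
      - operator is "contains" or "does not contain"
      - value is the marker string to look for
    """
    for where, operator, value in conditions:
        if where == "url":
            found = value in url
        else:  # "content": search the raw page source
            found = value in page_html
        # "Contains" triggers a retry when the error marker IS present;
        # "Does not contain" triggers a retry when the expected marker is MISSING.
        if operator == "contains" and found:
            return True
        if operator == "does not contain" and not found:
            return True
    return False
```

For example, the condition `("url", "contains", "/errors")` makes `needs_retry` return True for a URL like `https://example.com/errors`, mirroring a "Contains" rule on the URL.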
· Set up "Maximum reload times" and interval time
To prevent Octoparse from getting stuck endlessly reloading the web page, you need to set a maximum number of retries. When Octoparse reaches that maximum, it stops retrying and moves on to the next step.
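The retry-with-a-cap behavior can be sketched as a small loop. The function names (`load_page`, `check_failed`) are hypothetical placeholders for whatever loads the page and evaluates the configured conditions; this is a sketch of the described behavior, not Octoparse's code.

```python
import time

def load_with_retry(load_page, check_failed, max_retries=3, interval=2.0):
    """Load a page, retrying up to max_retries times while check_failed(page)
    reports a bad load; once retries are exhausted, return the last result
    so the workflow can move on to the next step."""
    page = load_page()
    for _ in range(max_retries):
        if not check_failed(page):
            break  # page loaded normally; no retry needed
        time.sleep(interval)  # wait the configured interval before retrying
        page = load_page()
    return page
```

Note that the page is returned even when every retry fails, matching the behavior of stopping at the maximum and entering the next step.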