How to Avoid the CookieWall When Scraping a Website in Octoparse
Thursday, April 21, 2016 6:16 AM
Some websites display a cookie message to inform users that a cookie will be created and stored in the browser's cookie file when they access the site.
If you want to scrape pages from a website where a cookiewall message always appears first when the page is loaded in Octoparse, you can configure a rule to remove the cookiewall. Here we take http://www.marktplaats.nl as an example and solve the problem with the following steps.
- Log in to Octoparse and create a task.
- Set basic info and click ‘Next’.
- Open the website http://www.marktplaats.nl. Save the URL and load it.
- After the web page is loaded, the URL changes to http://www.marktplaats.nl/cookiewall/ and a cookie message window appears on the screen.
- Click ‘Cookies accepteren’ and then choose ‘Click an item’.
- In the workflow designer, a Click Item action has now been created. The web page loads again without the cookiewall, and the URL changes back to a normal one.
- Add a ‘Go to the Webpage’ action to the workflow designer and open any subpage of the website. Take the URL below for example:
- After the second web page is loaded, you will find that the cookiewall has been removed, and you can then extract any data from the website.
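The logic behind these steps can be sketched in a few lines of Python. This is only an illustration of the idea, not Octoparse's own API (Octoparse is configured through its workflow designer). The helper names `is_cookiewall` and `plan_workflow`, the `COOKIEWALL_PATH` constant, and the subpage URL are assumptions made for the example; only the `/cookiewall/` redirect and the action names come from the steps above.

```python
from typing import List
from urllib.parse import urlparse

# Assumption: the cookie wall is recognised purely by its URL path,
# as seen in the redirect to http://www.marktplaats.nl/cookiewall/.
COOKIEWALL_PATH = "/cookiewall/"

# Hypothetical placeholder; the tutorial does not name a specific subpage here.
SUBPAGE_URL = "http://www.marktplaats.nl/example-subpage"


def is_cookiewall(url: str) -> bool:
    """Return True if the loaded URL looks like the cookie wall page."""
    path = urlparse(url).path
    # Treat "/cookiewall" and "/cookiewall/" the same way.
    if not path.endswith("/"):
        path += "/"
    return path.startswith(COOKIEWALL_PATH)


def plan_workflow(loaded_url: str, subpage_url: str) -> List[str]:
    """List the workflow actions to add, mirroring the steps above."""
    steps = []
    if is_cookiewall(loaded_url):
        # Accepting once is enough: the cookie is kept for later pages.
        steps.append("Click Item: Cookies accepteren")
    steps.append("Go to the Webpage: " + subpage_url)
    return steps


if __name__ == "__main__":
    for step in plan_workflow("http://www.marktplaats.nl/cookiewall/", SUBPAGE_URL):
        print(step)
```

The point of the sketch is the ordering: the accept click is added only when the first load lands on the cookie wall, and the ‘Go to the Webpage’ action comes after it so that the subpage is opened with the cookie already set.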
If the video tutorial is not available to you, you can click here to see the corresponding graphic tutorial.