6 Tips to Use the Web Scraping Tool Octoparse
Friday, December 16, 2016 4:12 AM
Looking for a beginner’s guide and tips on how to use Octoparse? Having trouble with complex websites? For the most part, it’s easy to learn Octoparse as you go, but a few beginner tips will improve your experience as you begin your quest in the web scraping world.
1. Perform a "Launch Status Check" before your first data run
Little glitches can occur at any stage of your configuration, and you may have no idea why your task is missing data or failing to click open an item or a webpage. To avoid unexpected errors during data runs, check your task workflow step by step before the first run. This "launch status check" lets you find out, in the built-in browser, which steps are not working and fix them by modifying their settings. See the Octoparse tutorials to learn more about how to do that.
2. Set proper action timeouts and add necessary scroll page actions
Sometimes the task workflow works perfectly in the "launch check", but data runs say otherwise. Why is my task missing data? The easiest fix is to set a longer AJAX timeout for actions like “Go to Web Page”, “Click item” and “Click to paginate”. You can also set a wait time before an action executes to make sure the webpage has loaded the data you need.
Some websites use a lazy-load technique to improve their loading performance. As a result, some content is not displayed until the webpage is scrolled. In that case, add a page scrolling action to the workflow.
Remember, every action in Octoparse should only be executed when the page is fully loaded; if not, your task workflow will only work in theory.
One more thing: never tick “Open the link in new tab” and “Load the page with AJAX” at the same time unless you are dealing with tricky websites like LinkedIn.
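The idea behind action timeouts and waits can be sketched outside Octoparse as a simple polling loop: keep checking whether the page is ready, and give up once the timeout expires. This is only an illustration of the concept, not how Octoparse implements it; the names wait_until and FakePage are hypothetical and stand in for a real page check.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page that needs a few checks to finish loading."""

    def __init__(self, ready_after_checks):
        self.checks = 0
        self.ready_after_checks = ready_after_checks

    def is_loaded(self):
        self.checks += 1
        return self.checks >= self.ready_after_checks


page = FakePage(ready_after_checks=3)
loaded = wait_until(page.is_loaded, timeout=2.0, interval=0.01)
print("page loaded:", loaded)
```

A timeout that is too short makes the loop give up before the data arrives, which is the same symptom as a task that passes the step-by-step check but misses data during a real run.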
3. Learn to write XPath on your own
Correct use of XPath is the key to mastering Octoparse, and web scraping in general. If you know how to write XPath, you can troubleshoot issues like pagination, missing data, and misplaced value fields. We strongly suggest that every Octoparse user learn the basics; even a little XPath knowledge can solve a lot of problems. Check the following series of tutorials to pick up XPath quickly.
- How to use an XPath plugin?
- Getting started with XPath 1
- Getting Started With XPath 2
- Modify XPath Manually in Octoparse
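To get a feel for what an XPath expression selects, here is a minimal sketch using Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The sample markup, class names, and URL are made up for illustration, and real web pages usually need a full HTML parser; the point is only to show how a path such as .//div[@class='product']/h2 maps to elements.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup (an assumption for this sketch).
SAMPLE = """
<html>
  <body>
    <div class="product">
      <h2>Blue Widget</h2>
      <span class="price">$19.99</span>
    </div>
    <div class="product">
      <h2>Red Widget</h2>
      <span class="price">$24.50</span>
    </div>
    <a class="next" href="/page/2">Next</a>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())

# Every <h2> that is a direct child of a <div class="product">.
names = [h.text for h in root.findall(".//div[@class='product']/h2")]

# The price <span> inside each product block.
prices = [s.text for s in root.findall(".//div[@class='product']/span[@class='price']")]

# The pagination link, located by its class attribute.
next_link = root.find(".//a[@class='next']").get("href")

print(names, prices, next_link)
```

Matching on attributes such as class tends to survive small layout changes better than a purely positional path, which is why hand-edited XPath often fixes pagination and misplaced-field problems.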
4. Run your tasks in the cloud and make them splittable
Occasionally a task may seem perfectly fine, yet you still can't get all the data records you want. This is likely caused by the volume of data or the complexity of the target webpage. That's why we recommend trying Octoparse's cloud extraction function (exclusive to premium users). Once you start your task in the cloud, you won't need to worry about network issues or computer glitches, even if the task takes a long time to complete.
Furthermore, you can use the "List of URLs" loop mode to speed up extraction. In this mode, Octoparse does not need steps like "Click to paginate" or "Click Item" to reach each item page, so extraction is faster, especially with Cloud Extraction. When a task built with "List of URLs" is set to run in the Cloud, it is split into sub-tasks that run on several cloud servers simultaneously. See the Octoparse tutorial on the List of URLs mode to learn how to use it.
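As a rough illustration of why a URL list is easy to split, the sketch below builds a list of page URLs from a pattern and divides it into chunks that could be processed independently. The URL template, page range, and chunk size are assumptions for the example, and the helper names are hypothetical; how Octoparse actually divides a cloud task is not described here.

```python
def build_page_urls(template, first_page, last_page):
    """Expand a URL template with a {page} placeholder into a list of URLs."""
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


def split_into_subtasks(urls, size):
    """Split the URL list into consecutive chunks of at most `size` URLs."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


# Hypothetical paginated listing on example.com.
urls = build_page_urls("https://example.com/products?page={page}", 1, 10)
subtasks = split_into_subtasks(urls, 4)

for number, chunk in enumerate(subtasks, start=1):
    print(f"sub-task {number}: {len(chunk)} URLs")
```

Because every URL is known up front, no chunk depends on clicking through an earlier page, which is what allows the pieces to run at the same time.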
5. Be careful about website cache and cookies
Sometimes the built-in browser won't open a target URL. This tends to happen after you have opened other websites many times. In that case, clear the cache before opening the web page: click the Go to Web Page action and tick Clear cache before loading the web page.
Also, when extracting from websites that require a log-in, you can tick Use Cookie to save the login information and skip the log-in process for a long period of time.
6. Use the RegEx Tool to clean your data
Sometimes the raw data extracted from a website has a low signal-to-noise ratio. In other words, the data you want is mixed with information you don't need. Occasionally you may also need some information from the HTML source code. To pin down exactly the information you want, use the RegEx Tool to clean the dataset.
We hope the six tips above help you move forward with Octoparse. Happy data hunting!
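The kind of cleanup a regular expression performs can be sketched with Python's re module. The sample values, the price pattern, and the data-sku attribute below are made up for illustration; the same patterns could be adapted for use in a RegEx tool, though syntax details may differ between regex engines.

```python
import re

# Made-up raw field values with noise around the number we want.
raw_values = ["Price: $1,299.00 (incl. tax)", "  Price: $45.5 ", "Call for price"]

# A dollar sign, optional spaces, then digits with optional commas and decimals.
PRICE_RE = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)")


def extract_price(text):
    """Return the first dollar amount in `text` as a float, or None if absent."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


prices = [extract_price(value) for value in raw_values]

# Pulling a value out of HTML source: a hypothetical data-sku attribute.
html = '<div class="item" data-sku="AB-1234"><b>Widget</b></div>'
sku = re.search(r'data-sku="([^"]+)"', html).group(1)

# Stripping tags to keep only the visible text.
text_only = re.sub(r"<[^>]+>", "", html)

print(prices, sku, text_only)
```

Returning None for values that do not match keeps unexpected rows visible in the output instead of silently turning them into wrong numbers.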
Author: The Octoparse Team