How to Build an Image Crawler Without Coding
Thursday, August 29, 2019
Saving an image from the webpage is straightforward. Simply right-click and select "save image as". But what if you have hundreds or even thousands of images that need to be saved? Will the same trick work? At least not for me!
In this article, I want to show you how to quickly build an image crawler with ZERO coding. Even if you have absolutely no tech background, you should be able to nail this within 30 minutes. Whatever you need the pictures for, be it re-blogging, reselling, or training machine learning models, the same trick can be extended to virtually any website. Ready? Let's get started.
Creating a project
Not all images are created equal! Some images can be fetched from the webpage directly, while others load only after you click a thumbnail. In this tutorial, I will show you how to deal with each of these scenarios through a few examples.
Example 1: Fetching Images Directly from Webpage
To demonstrate, we are going to scrape dog images from Pixabay.com. To follow along, search for "dogs" on Pixabay.com and you should arrive on this page.
1) Click "+ Task" to start a new task under Advanced Mode. Then, input the URL of the target webpage into the text box and click "Save URL".
You should arrive here:
2) Next, we are going to tell the bot what images to fetch.
Click on the first image. The Action Tips panel now reads "Image selected, 100 similar images found". This is great, exactly what we need. Go on to choose "Select all", then "Extract image URL in the loop".
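For readers curious about what "extract image URL" amounts to under the hood, here is a minimal Python sketch using only the standard library. It is not how Octoparse is implemented; the class name, function name and sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class ImageURLCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            self.urls.append(src)


def extract_image_urls(html):
    """Return the image URLs found in the given HTML string, in page order."""
    collector = ImageURLCollector()
    collector.feed(html)
    return collector.urls


# Made-up markup standing in for a search results page.
sample_html = """
<div class="results">
  <a href="/photo-1"><img src="https://example.com/img/dog-1.jpg" alt="dog"></a>
  <a href="/photo-2"><img src="https://example.com/img/dog-2.jpg" alt="dog"></a>
</div>
"""

print(extract_image_urls(sample_html))
```

Selecting one image and letting the tool find the "similar" ones plays the same role as the tag check in this sketch: it decides which elements count as images worth collecting.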
3) Of course, we don't just want the images from page 1, but from all pages (or as many pages as needed).
To do this, scroll down to the bottom of the current page, spot the "next page" button, and click on it.
We obviously want to click the "next page" button many times, so it makes sense to select "Loop click the selected link" from the Action Tips panel.
Now, just to confirm that everything was set up properly, toggle the workflow switch in the upper right corner. The finished workflow should look like this:
Also, check the data panel and make sure we have the desired data extracted correctly.
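Conceptually, the pagination loop set up in this step keeps following the "next page" link and collects image URLs from every page it visits. The sketch below shows that idea in Python under simple assumptions: crawl_all_pages, load_page and the in-memory FAKE_SITE are hypothetical names invented for this example, and nothing here talks to a real website.

```python
def crawl_all_pages(first_url, load_page, max_pages=None):
    """Follow "next page" links, collecting image URLs from each page.

    load_page(url) must return (image_urls, next_url), where next_url is
    None on the last page. Both names are hypothetical.
    """
    collected = []
    url = first_url
    pages_seen = 0
    while url is not None and (max_pages is None or pages_seen < max_pages):
        image_urls, next_url = load_page(url)
        collected.extend(image_urls)
        pages_seen += 1
        url = next_url
    return collected


# A three-page fake site so the sketch runs without network access.
FAKE_SITE = {
    "page-1": (["dog-1.jpg", "dog-2.jpg"], "page-2"),
    "page-2": (["dog-3.jpg"], "page-3"),
    "page-3": (["dog-4.jpg"], None),
}

all_urls = crawl_all_pages("page-1", FAKE_SITE.__getitem__)
first_two_pages = crawl_all_pages("page-1", FAKE_SITE.__getitem__, max_pages=2)
print(all_urls)
print(first_two_pages)
```

The max_pages argument mirrors the "as many pages as needed" idea: the loop stops either when there is no next page or when the page limit is reached.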
There's just one more thing to tweak before running the crawler.
While debugging, I happened to notice that the HTML source code is refreshed dynamically as one scrolls down the webpage. In other words, if the webpage is not scrolled down, we will not be able to get the corresponding image URLs from the source code. Lucky for us, Octoparse can scroll down automatically.
We will need to add auto-scroll both when the website first loads and when it paginates.
Click on "Go to Webpage" from the workflow. On the right side of the workflow, spot "Advanced options" and check "Scroll down to the bottom of the page when finish loading".
Then, decide how many times to scroll and at what pace. Here I set scroll times = 40, interval = 1 second, scroll way = scroll down for one screen. This basically means Octoparse will scroll down one screen at a time, 40 times, with 1 second between scrolls.
I did not come up with this setting randomly but did a bit of fine-tuning to make sure it works. I also noticed that it was essential to use "Scroll down for one screen" as opposed to "Scroll down to the bottom of the page", mainly because the image URLs we need are only added to the source code gradually as the page scrolls.
Apply the same settings to the pagination step.
Click on "Click to paginate" on the workflow and use the exact same auto-scroll settings.
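To see why scrolling one screen at a time matters, here is a toy Python model of lazy loading. It assumes an image's URL shows up in the page source only once the viewport has covered that image; the page height, viewport size and image positions are invented numbers, and only the 40 scrolls come from the setting above.

```python
def loaded_images(image_positions, page_height, viewport, step, scrolls):
    """Toy model of lazy loading: an image's URL appears in the page source
    only once the viewport has covered the image's vertical position."""
    loaded = set()
    top = 0
    last_top = max(page_height - viewport, 0)
    for _ in range(scrolls + 1):  # the +1 is the initial view before scrolling
        loaded.update(p for p in image_positions if top <= p < top + viewport)
        top = min(top + step, last_top)
    return loaded


# Hypothetical page: 100 images spread over 4000 units, 100-unit viewport.
positions = list(range(0, 4000, 40))

one_screen_at_a_time = loaded_images(positions, 4000, 100, step=100, scrolls=40)
jump_to_bottom = loaded_images(positions, 4000, 100, step=4000, scrolls=1)

print(len(one_screen_at_a_time), "of", len(positions), "URLs loaded")
print(len(jump_to_bottom), "of", len(positions), "URLs loaded")
```

In this toy setup, scrolling screen by screen loads every URL, while jumping straight to the bottom loads only the images on the first and last screens. A real page may behave differently, which is why the scroll count and interval are worth fine-tuning.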
4) That's it. You are done! Isn't this too good to be true? Let's run the crawler and see if it works.
Click "Start Extraction" from the upper left-hand corner. Pick "local extraction". It basically means you'll be running the crawler on your own computer instead of the Cloud server. [Download the crawler file used in this example and try it out yourself]
Example 2: Scrape Full-sized Images
Question: What if you need the full-sized images?
For this example, we'll use the same website: https://pixabay.com/images/search/dogs/ to demonstrate how you can get the full-sized pictures.
1) Start a new task by clicking on "+ Task" under Advanced mode.
2) Input the URL of the target webpage into the text box then click "Save URL" to proceed.
3) Unlike the previous example where we could capture the images directly, we'll now need to click into each individual image in order to see/fetch the full-sized image.
Click on the first image; the Action Tips panel should read "Image selected, 100 similar images found".
Select "Select all".
Then, "Loop click each image".
4) Now that we've arrived on the page with the full-sized image, things are a lot easier.
Click on the full-sized image, then select "Extract the URL of the selected image".
As always, check the data panel and make sure we have the desired data extracted correctly.
5) Follow the same steps in Example 1 to add pagination steps.
Click on "Go to Webpage", spot the "Next page" button, then click on it. Select "Loop click the selected link" on the Action Tips panel.
The finished workflow should look like this:
If yours doesn't look the same, drag the steps around to rearrange them.
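One way to think about the workflow in this example is as two nested loops: an outer loop over result pages and an inner loop that opens each image's detail page and records the full-sized URL. The Python sketch below illustrates that structure only. The function name and the in-memory site dictionary are hypothetical stand-ins, not part of Octoparse.

```python
def crawl_full_size(first_page, site, max_pages=None):
    """Visit every detail page linked from every results page.

    `site` is a hypothetical in-memory stand-in for the website:
    site["results"] maps a page id to (detail_page_ids, next_page_id) and
    site["details"] maps a detail page id to its full-sized image URL.
    """
    full_size_urls = []
    page = first_page
    pages_seen = 0
    while page is not None and (max_pages is None or pages_seen < max_pages):
        detail_ids, next_page = site["results"][page]
        for detail_id in detail_ids:  # inner loop: open each image in turn
            full_size_urls.append(site["details"][detail_id])
        pages_seen += 1
        page = next_page  # outer loop: move on to the next results page
    return full_size_urls


# Tiny fake site with two results pages and three images.
site = {
    "results": {
        "page-1": (["dog-a", "dog-b"], "page-2"),
        "page-2": (["dog-c"], None),
    },
    "details": {
        "dog-a": "https://example.com/full/dog-a.jpg",
        "dog-b": "https://example.com/full/dog-b.jpg",
        "dog-c": "https://example.com/full/dog-c.jpg",
    },
}

print(crawl_full_size("page-1", site))
```

The extra visit to each detail page is the only difference from Example 1, and it is also why this kind of crawl takes noticeably longer per image.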
Example 3: Getting Full-sized Images from Thumbnails
I am sure you have seen something similar when you shop online or if you happen to run an online store. For product images, thumbnails are definitely the most common form of image display. Using thumbnails substantially reduces bandwidth and loading time, making it much easier for people to browse through different products.
There are two ways to extract the full-sized images from the thumbnails using Octoparse.
Option 1 - You can set up a loop click to click through each of the thumbnails, then proceed to extract the full-sized image once loaded.
Option 2 - Most thumbnail images share exactly the same URL pattern as their full-sized counterparts, differing only in a number that indicates the size. So it makes sense to extract the thumbnail URL, then replace the thumbnail size number with that of the full-sized image. This can be done easily with Octoparse's built-in data cleansing tool.
Since we've already gone through something similar to Option 1 in Example 2, I will elaborate on Option 2 in this example. We will use a product page on Flipkart.com to demonstrate.
Before we start, it's worthwhile to confirm that this tactic applies by comparing the image URL of a thumbnail with that of its full-sized counterpart. So I handpicked one of the thumbnails to check.
Thumbnail URL: https://rukminim1.flixcart.com/image/128/128/jatym4w0/speaker/mobile-tablet-speaker/v/u/7/philips-in-bt40bk-94-original-imafybc9ysphpzhv.jpeg?q=70
Full-size URL: https://rukminim1.flixcart.com/image/416/416/jatym4w0/speaker/mobile-tablet-speaker/v/u/7/philips-in-bt40bk-94-original-imafybc9rqhdna8z.jpeg?q=70
Notice that the two URLs differ in the number indicating the image size: "128" for the thumbnail and "416" for the full-sized image. This means that as long as we have the thumbnail URLs extracted, we can convert them into full-sized URLs simply by replacing "128" with "416". Let's see it in action.
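The same replacement can be sketched in a few lines of Python. This is only an illustration of the idea, not what Octoparse runs internally; to_full_size and SIZE_SEGMENT are names made up here. The pattern is scoped to the size segment right after "/image" so that a "128" appearing elsewhere in a URL (in a file name, say) is left alone.

```python
import re

# Matches the "/128/128/" style size segment that follows "/image" in the path.
SIZE_SEGMENT = re.compile(r"(?<=/image)/(\d+)/(\d+)/")


def to_full_size(thumbnail_url, size=416):
    """Swap the width/height path segment of a thumbnail URL for `size`."""
    return SIZE_SEGMENT.sub(f"/{size}/{size}/", thumbnail_url, count=1)


# The thumbnail URL quoted above, split across lines for readability.
thumbnail = (
    "https://rukminim1.flixcart.com/image/128/128/jatym4w0/speaker/"
    "mobile-tablet-speaker/v/u/7/"
    "philips-in-bt40bk-94-original-imafybc9ysphpzhv.jpeg?q=70"
)
print(to_full_size(thumbnail))
```

A plain text replace of "128" with "416" works just as well for this particular URL; the scoped pattern simply guards against accidental matches elsewhere in the string.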
1) Launch the Octoparse App, start a new task, then input the target URL into the text box.
2) Click on the first thumbnail image. The Action Tips panel now reads "Element selected. 5 similar buttons found." Bravo! Octoparse recognized the remaining thumbnails automatically.
Select "Select all".
Then, select "Extract the text of the selected elements". This is obviously not what we want, but we can change it later.
Toggle the "Workflow" switch at the upper right corner. Notice we had nothing extracted.
Click the "Customize" icon again. This time, click "Refine extracted data". There are a couple of data cleansing steps to add.
Click "Add step", then select "Match with regular expression". If you are not familiar with regular expressions, feel free to use the built-in RegEx tool, which I like a lot.
The RegEx tool is rather self-explanatory. Input the beginning and the end of the desired data string. Click "Generate" and the corresponding regular expression is generated. Click "Match" to see if the desired data can be matched successfully. If you expect more than one line to match, check "Match all".
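If you are wondering what a "beginning and end" style regular expression might look like, the Python sketch below builds one from two literal markers. How Octoparse actually generates its pattern is not documented here, so treat this as an assumption; pattern_between and the sample markup are invented for the example.

```python
import re


def pattern_between(start, end):
    """Build a regex capturing the shortest text between two literal markers."""
    return re.compile(re.escape(start) + r"(.*?)" + re.escape(end))


# Made-up HTML for one thumbnail; the real markup may differ.
html = '<li class="thumb"><img src="https://example.com/image/128/128/a.jpeg?q=70"></li>'

pattern = pattern_between('src="', '"')

# Like clicking "Match": take the first match only.
first = pattern.search(html).group(1)
print(first)

# Like ticking "Match all": collect every match in the text.
two_thumbs = html + html.replace("a.jpeg", "b.jpeg")
print(pattern.findall(two_thumbs))
```

The non-greedy `(.*?)` is what keeps each match from running past the first closing marker, which matters as soon as the text contains more than one URL.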
5) Are we done? Close, but not yet. Remember, this is only the thumbnail URL, and we still need to replace "128" with "416" to turn it into the full-sized image URL.
Click "Add step" one more time. Select "Replace". Replace "128" with "416". Click "Evaluate". Finally, we have the URL we need.
Check the data extracted.
6) Test run the crawler.
In the example above, we had each image URL extracted as an individual row. What if you need to get the URLs extracted all together? This can be done by extracting the outer HTML of all the thumbnails at once. Then, use RegEx to match the individual URLs and replace the size number, and you'll get all the full-sized image URLs in one single row.
1) Load the website and click on one of the thumbnails. Click the "Expand" icon at the lower right corner of the Action Tips panel until the whole thumbnail section is highlighted in green, which basically means they are selected.
2) Select "Extract Outer HTML of the selected element" on the Action Tips panel.
3) Toggle back to workflow mode.
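Put together, the single-row approach boils down to three operations: match every URL in the outer HTML, swap the size number, and join the results. A rough Python equivalent is shown below, assuming made-up markup and an arbitrary " | " separator; neither comes from the actual site or from Octoparse.

```python
import re

# Made-up outer HTML for a thumbnail strip; the real markup may differ.
outer_html = """
<ul class="thumbnails">
  <li><img src="https://example.com/image/128/128/a.jpeg?q=70"></li>
  <li><img src="https://example.com/image/128/128/b.jpeg?q=70"></li>
  <li><img src="https://example.com/image/128/128/c.jpeg?q=70"></li>
</ul>
"""

# Step 1: match every thumbnail URL in the block of HTML.
thumbnail_urls = re.findall(r'src="([^"]+)"', outer_html)

# Step 2: replace the thumbnail size segment with the full size.
full_size_urls = [u.replace("/128/128/", "/416/416/") for u in thumbnail_urls]

# Step 3: join everything into one row (the separator is an arbitrary choice).
single_row = " | ".join(full_size_urls)
print(single_row)
```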