How to Build an Image Crawler Without Coding
Friday, February 12, 2021
Saving an image from a webpage is straightforward: simply right-click and select "Save image as". But what if you have hundreds or even thousands of images to save? Will the same trick work? Not for me, at least! Let's see how image crawling can help.
Table of Contents
Fetching Images Directly from Webpage
Getting Full-sized Images from Thumbnails
Image Crawling Without Coding
In this article, I want to show you how to quickly build an image crawler with ZERO coding. Even if you have absolutely no tech background, you should be able to nail this within 30 minutes. Whatever you need the pictures for, be it re-blogging, re-selling or training a machine learning model, the same trick can be extended to almost any website. Ready? Let's get started.
Installations
You will need the following tools:
• Octoparse: a coding-free visual web scraping tool
• TabSave: Chrome plugin to save images instantly upon providing a list of URLs
Prerequisites
It would be best if you are familiar with how Octoparse works in general. Check out Octoparse Scraping 101 if you are new to the tool.
Video Tutorial (Using Aliexpress as an Example)
Want to bulk download thousands of images? This video tutorial gives a step-by-step guide to scraping and downloading images from Aliexpress with Octoparse. Once you get the hang of the tool, you can download images from other websites with little effort!
Building Your Image Crawler
Not all images are created equal! Some images can be fetched from the webpage directly, while others only load when you click their thumbnails. In this tutorial, I will show you how to deal with each of these scenarios through a few examples.
Example 1: Fetching Images Directly from Webpage
To demonstrate, we are going to scrape dog images from Pixabay.com. To follow along, search for "dogs" on Pixabay.com and you should arrive at this page: https://pixabay.com/images/search/dogs/
Step 1: Enter the URL
Click "+ Task" to start a new task under Advanced Mode. Then, input the URL of the target webpage into the text box and click "Save URL".
Step 2: Select the images you want to crawl
Next, we are going to tell the bot what images to fetch.
Click on the first image. The Action Tips panel now reads "Image selected, 100 similar images found". This is great, exactly what we need. Go on to select "Select all", then "Extract image URL in the loop".
Step 3: Crawl images across pages
Of course, we don't just want the images from page 1, but from all pages (or as many pages as needed).
To do this, scroll down to the bottom of the current page, find the "next page" button and click on it.
We obviously want to click the "next page" button many times, so it makes sense to select "Loop click the selected link" from the Action Tips panel.
Now, confirm that everything was set up properly: toggle the workflow switch in the upper right corner to view the finished workflow.
Also, check the data panel and make sure we have the desired data extracted correctly.
Step 4: Crawl with Auto-scrolling settings
There's just one more thing to tweak before running the crawler.
While debugging, I noticed that the HTML source code is refreshed dynamically as one scrolls down the webpage. In other words, if the webpage is not scrolled down, we will not be able to get the corresponding image URLs from the source code. Luckily for us, Octoparse can scroll down automatically.
We will need to add auto-scroll both when the website first loads and when it paginates.
Click on "Go to Webpage" in the workflow. On the right side of the workflow, find "Advanced options" and check "Scroll down to the bottom of the page when finish loading".
Then, decide how many times to scroll and at what pace. Here I set scroll times = 40, interval = 1 second, scroll way = scroll down for one screen. This means Octoparse will scroll down one screen at a time, 40 times, with 1 second between scrolls.
I did not pick these settings at random; I fine-tuned them to make sure they work. I also noticed that it was essential to use "Scroll down for one screen" as opposed to "Scroll down to the bottom of the page", mainly because the image URLs we need are only added to the source code gradually as the page scrolls.
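To see why scrolling one screen at a time matters, here is a rough Python sketch of a lazy-loading page. It is a toy model, not Pixabay's actual behavior: it assumes an image URL only appears in the source once the viewport has passed over it, and the figure of 4 images per screen is made up for illustration.

```python
def loaded_after_scrolling(total_images, per_screen, scroll_times,
                           one_screen_at_a_time=True):
    """Return the indices of images whose URLs would be in the page source.

    Toy model (an assumption): a URL is written into the source only when
    the viewport actually passes over the image.
    """
    screens_total = -(-total_images // per_screen)  # ceiling division
    if one_screen_at_a_time:
        # The initial screen plus one more screen per scroll, capped at the page end.
        visited = range(min(scroll_times + 1, screens_total))
    else:
        # Jumping straight to the bottom only ever shows the first and last screens.
        visited = sorted({0, screens_total - 1})
    loaded = set()
    for screen in visited:
        start = screen * per_screen
        loaded.update(range(start, min(start + per_screen, total_images)))
    return sorted(loaded)


# 100 images per page (as reported by the Action Tips panel), 4 per screen (hypothetical).
gradual = loaded_after_scrolling(100, 4, scroll_times=40)
jump = loaded_after_scrolling(100, 4, scroll_times=40, one_screen_at_a_time=False)
print(len(gradual), len(jump))
```

Under these assumptions the gradual scroll reaches every image, while the jump to the bottom misses everything in the middle of the page.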
Apply the same setting to the pagination step.
Click on "Click to paginate" in the workflow and use the exact same auto-scroll settings.
Step 5: Start your crawler!
That's it. You are done! Isn't this too good to be true? Let's run the crawler and see if it works.
Click "Start Extraction" from the upper left-hand corner. Pick "local extraction". It basically means you'll be running the crawler on your own computer instead of the Cloud server. [Download the crawler file used in this example and try it out yourself]
Example 2: Scrape Full-sized Images
Question: What if you need the full-sized images?
For this example, we'll use the same website: https://pixabay.com/images/search/dogs/ to demonstrate how you can get the full-sized pictures.
Step 1: Start a new task
Start a new task by clicking on "+ Task" under Advanced mode. Input the URL of the target webpage into the text box then click "Save URL" to proceed.
Step 2: Select the images you want to crawl
Unlike the previous example where we could capture the images directly, we'll now need to click into each individual image in order to see/fetch the full-sized image.
Click on the first image. The Action Tips panel should read "Image selected, 100 similar images found".
Select "Select all".
Then, "Loop click each image".
Step 3: Extract URLs of the images
Now that we've arrived on the page with the full-sized image, things are a lot easier.
Click on the full-sized image, then select "Extract the URL of the selected image".
As always, check the data panel and make sure we have the desired data extracted correctly.
Step 4: Add pagination to crawl across pages
Click on "Go to the webpage", find the "Next page" button, then click on it. Select "Loop click the selected link" on the Action Tips panel.
Check the finished workflow.
If it doesn't look as expected, drag the steps around to rearrange them.
Step 5: Run your crawler!
Done! Test run the crawler. [Download the crawler file used in this example and try it out yourself]
Example 3: Getting Full-sized Images from Thumbnails
I am sure you have seen something similar when shopping online, or if you happen to run an online store. For product images, thumbnails are one of the most common forms of display. Using thumbnails substantially reduces bandwidth and loading time, making it much easier for people to browse through different products.
There are two ways to extract the full-sized images from the thumbnails using Octoparse.
Option 1 - You can set up a loop click to click through each of the thumbnails, then proceed to extract the full-sized image once loaded.
Option 2 - Most thumbnail URLs follow exactly the same pattern as the corresponding full-sized image URLs, differing only in a number that indicates the image size. It therefore makes sense to extract the thumbnail URL, then replace the thumbnail size number with that of the full-sized counterpart. This can be done easily with Octoparse's built-in data cleansing tool.
Since we've already gone through something similar to Option 1 in Example 2, I will elaborate on Option 2 in this example. We will use a product page on Flipkart.com to demonstrate.
Before we start the work, it's worthwhile to confirm if this tactic can be applied by looking at the image URL for the thumbnail and its full-sized counterpart. So I handpicked one of the thumbnails to check.
Thumbnail URL: https://rukminim1.flixcart.com/image/128/128/jatym4w0/speaker/mobile-tablet-speaker/v/u/7/philips-in-bt40bk-94-original-imafybc9ysphpzhv.jpeg?q=70
Full-size URL: https://rukminim1.flixcart.com/image/416/416/jatym4w0/speaker/mobile-tablet-speaker/v/u/7/philips-in-bt40bk-94-original-imafybc9rqhdna8z.jpeg?q=70
Notice that the two URLs differ in the number that indicates the image size: "128" for the thumbnail and "416" for the full-sized image. This means that as long as we have the thumbnail URLs extracted, we can convert them into full-sized URLs simply by replacing "128" with "416". Let's see it in action.
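For the curious, the same replacement can be sketched in a few lines of Python. This is only an illustration of the idea, not what Octoparse runs internally. The helper name `to_full_size` is made up, and the sketch replaces the whole "/128/128/" path segment rather than the bare number, on the assumption that "128" could also appear elsewhere in a URL.

```python
def to_full_size(thumbnail_url, thumb_size="128", full_size="416"):
    """Swap the size segment of a thumbnail URL for the full-size one.

    Hypothetical helper: it targets "/128/128/" rather than a bare "128"
    so that other parts of the URL are left untouched.
    """
    old = f"/{thumb_size}/{thumb_size}/"
    new = f"/{full_size}/{full_size}/"
    return thumbnail_url.replace(old, new, 1)


thumb = ("https://rukminim1.flixcart.com/image/128/128/jatym4w0/speaker/"
         "mobile-tablet-speaker/v/u/7/"
         "philips-in-bt40bk-94-original-imafybc9ysphpzhv.jpeg?q=70")
full = to_full_size(thumb)
print(full)
```

Only the size segment changes; the rest of the thumbnail URL, including the file name and the "?q=70" query string, is carried over as-is.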
Step 1: Enter the URL
Launch the Octoparse App, start a new task, then input the target URL into the text box.
Step 2: Select the thumbnail image
Click on the first thumbnail image. The Action Tips panel now reads "Element selected. 5 similar buttons found." Bravo! Octoparse recognized the remaining thumbnails automatically.
Select "Select all".
Then, select "Extract the text of the selected elements". This is obviously not what we want, but we can change it later.
Toggle the "Workflow" switch at the upper right corner. Notice we had nothing extracted.
Step 3: Extract URLs of the images

Step 4: Use RegEx to match the URLs
Use the Regular Expression tool to match the image URL from the whole chunk of outer HTML.
Click the "Customize" icon

Click "Add step", then select "Match with regular expression". If you are not familiar with regular expressions, feel free to use the built-in RegEx tool.

The RegEx tool is rather self-explanatory. Input the beginning and the end of the desired data string, then click "Generate" to produce the corresponding regular expression. Click "Match" to see if the desired data can be matched successfully. If you expect more than one match, check "Match all".
Are we done? Close, but not yet. Remember this is only the thumbnail URL and we still need to replace "128" with "416" in order to make them the full-sized image URLs.
Click "Add step" one more time. Select "Replace". Replace "128" with "416". Click "Evaluate". Finally, we have the URL we need.
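If you wonder what the "match, then replace" steps amount to, the following Python sketch mimics them. Everything in it is an assumption made for illustration: the HTML snippet and the example.com URL are invented, and `build_pattern` is a hypothetical stand-in for the "Generate" button, not the expression Octoparse actually produces.

```python
import re


def build_pattern(begin, end):
    """Build a regex capturing the shortest text between two markers.

    Hypothetical stand-in for entering a beginning and an end in the RegEx tool.
    """
    return re.compile(re.escape(begin) + r"(.*?)" + re.escape(end), re.DOTALL)


# Invented outer HTML of a single thumbnail element.
outer_html = ('<li class="thumb"><img src="https://example.com/image/128/128/'
              'abc/item.jpeg?q=70"></li>')

# "Match with regular expression": pull the URL out of the HTML chunk.
pattern = build_pattern('src="', '"')
thumbnail_url = pattern.search(outer_html).group(1)

# "Replace": turn the thumbnail size into the full size.
full_size_url = thumbnail_url.replace("/128/128/", "/416/416/", 1)
print(full_size_url)
```

The two steps are kept separate on purpose, mirroring the two "Add step" actions in the data cleansing tool.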
Step 5: Test run the crawler
In the example above, we had each image URL extracted as an individual row. What if you need to get the URLs extracted all together? This can be done by extracting the outer HTML of all the thumbnails at once. Then, use RegEx to match the individual URLs, replace the size number, and you'll get all the full-sized image URLs in one single row.
1) Load the website and click on one of the thumbnails. Click the "Expand" icon at the lower right corner of the Action Tips panel until the whole thumbnail section is highlighted in green, which basically means they are selected.
2) Select "Extract Outer HTML of the selected element" on the Action Tips panel.
3) Toggle back to workflow mode.
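The "all URLs in one row" variant can be sketched the same way in Python. The thumbnail section below is invented markup with example.com URLs, and `full_size_urls` is a hypothetical helper. The sketch only shows the idea behind "Match all" followed by "Replace", not Octoparse's own implementation.

```python
import re

# Invented outer HTML of a whole thumbnail section.
section_html = """
<ul class="thumbnails">
  <li><img src="https://example.com/image/128/128/abc/one.jpeg?q=70"></li>
  <li><img src="https://example.com/image/128/128/abc/two.jpeg?q=70"></li>
  <li><img src="https://example.com/image/128/128/abc/three.jpeg?q=70"></li>
</ul>
"""


def full_size_urls(html, thumb="128", full="416"):
    """Match every image URL in the HTML, then swap the size segment."""
    urls = re.findall(r'src="([^"]+)"', html)  # the equivalent of "Match all"
    return [u.replace(f"/{thumb}/{thumb}/", f"/{full}/{full}/", 1) for u in urls]


# Join the results so that they sit together in one single row.
single_row = " ".join(full_size_urls(section_html))
print(single_row)
```

A space is used as the separator here; any character that cannot appear in the URLs would work just as well.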
