undefined
Blog > Data Collection > Post

How to Build an Image Crawler Without Coding

Thursday, August 29, 2019

Saving an image from the webpage is straightforward. Simply right-click and select "save image as". But what if you have hundreds or even thousands of images that need to be saved? Will the same trick work? At least not for me!

In this article, I want to show you how to quickly build an image crawler with ZERO codings. Even if you have absolutely no tech background, you should be able to nail this within 30 mins. So for whatever reasons you may need the pictures for, re-blog, re-sell or machine training, the same trick can be extended to literally any websites. Ready? Let's get started


Installations

You will need the following tools:
Octoparse: a coding-free visual web scraping tool
TabSave: Chrome plugin to save images instantly upon providing a list of URLs

 

Prerequisites

It would be best if you are familiar with how Octoparse works in general. Check out Octoparse Scraping 101 if you are new to the tool. 

 

Creating a project

Not all images are created equal! Some images can be fetched from the webpage directly, other images are triggered only by clicking the thumbnails. Well, in this tutorial, I will show you how to deal with each of these scenarios via a few of examples. 

 

Example 1:Fetching Images Directly from Webpage

To demonstrate, we are going to scrape dogs' images from Pixabay.com. To follow along, search for "dogs" on Pixabay.com then you should arrive on this page

1) Click "+ Task" to start a new task under Advanced Mode. Then, input the URL of the target webpage into the text box and click "Save URL". 

enter url

 

You should arrive here:

 

pixabay

 

2) Next, we are going to tell the bot what images to fetch. 

Click on the first image.  The Action Tips panel now reads "Image selected, 100 similar images found". This is great, exactly what we need. Go on to select "Select all", then "Extract image URL in the loop'. 

 

 

example 1

 

3) Of course, we don't just want the images from page 1, but from all pages (or as many pages as needed). 
To do this, scroll down to the bottom of the current page, spot the "next page" button, click on it. 

We obviously want to click the "next page" button many times, so it makes sense to select "Loop click the selected link" from the Action Tips panel.

 

example 1

 

 

Now, just to confirm if everything was set up properly. Toggle the workflow switch on the upper right corner. The finished workflow should look like this,

 

example 1

 

Also, check the data panel and make sure we have the desired data extracted correctly.  

 


3) There's just one more thing to tweak before running the crawler.

While debugging, I happened to notice that the HTML source code is being refreshed dynamically as one scrolls down the webpage. In other words, if the webpage is not scrolled down, we will not be able to get the corresponding image URLs from the source code.  Lucky for us, Octoparse does auto-scroll down easily. 

We will need to add auto-scroll both when the website loads for the first time as well as when it paginates. 

Click on "Go to Webpage" from the workflow. On the right side of the workflow, spot "Advanced options", check "Scroll down to the bottom of the page when finish loading".

Then, decide how many times to scroll as well as at what pace. Here I set scroll times = 40, interval=1 second, scroll way = scroll down for one screen. This basically means Octoparse will scroll down one screen for 40 times with 1 second between each scroll.

I did not come up with this setting randomly but did a bit fine-tunning to make sure this setting works. I also noticed that it was essential to use "Scroll down for one screen" as opposed to "scroll down to the bottom of the page". Mainly because the image URLs we need only get refreshed to the source code gradually. 

 

example 1

 

Apply the same setting to the pagination step. 

Click on "Click to paginate" on the workflow, use the exact same setting for auto-scroll. 

 

 example 1

 

4) That's it. You are done! Isn't this too good to be true? Let's run the crawler and see if it works. 

Click "Start Extraction" from the upper left-hand corner. Pick "local extraction". It basically means you'll be running the crawler on your own computer instead of the Cloud server. [Download the crawler file used in this example and try it out yourself]

 

 

Example 2: Scrape Full-sized Images

Question: What if you need the full-sized images? 

For this example, we'll use the same website:  https://pixabay.com/images/search/dogs/ to demonstrate how you can get the full-sized pictures.

1) Start a new task by clicking on "+ Task" under Advanced mode. 

 

octoparse example 2

 

2) Input the URL of the target webpage into the text box then click "Save URL" to proceed.
3) Unlike the previous example where we could capture the images directly, we'll now need to click into each individual image in order to see/fetch the full-sized image. 

Click on the first image, the Action Tips panel should read "Image selected, 100 similar images found". 

Select "Select all".

 

octoparse example 2

 

Then, "Loop click each image". 

 

octoparse example 2

 

4) Now that we've arrived on the page with the full-sized image, things are a lot easier. 

Click on the full-sized image, then select "Extract the URL of the selected image". 

 

Octoparse example 2

 

As always, check the data panel and make sure we have the desired data extracted correctly.

 

octoparse example 2

 

 

5) Follow the same steps in Example 1 to add pagination steps. 

Click on "Go to the webpage", spot "Next page" button then click on it. Select "Loop clicked the selected link" on the Action Tips panel

 

octoparse example 2

 

The finished workflow should look like this,

 

Octoparse example 2

If it doesn't look the same. Drag it around to move it. 

6) Done! Test run the crawler.  [Download the crawler file used in this example and try it out yourself]

 

 

Example 3: Getting Full-sized Images from Thumbnails

thumbnail

 

I am sure you have seen something similar when you shop online or if you happen to run an online store. For product images, thumbnail images are definitely the most common forms of image display. The use of thumbnails substantially reduces bandwidth and loading time, making it much more friendly for people to browse through different products. 

 

There are two ways to extract the full-sized images from the thumbnails using Octoparse.
Option 1 - You can set up a loop click to click through each of the thumbnails, then proceed to extract the full-sized image once loaded.
Option 2 - As most thumbnail images share exactly the same URL pattern with that of the corresponding full-sized images but only with a different number indicative of the different sizes, it makes sense to extract the thumbnail URL then replace the thumbnail size number to that of the full-sized counterparts. This can be done easily with Octoparse's built-in data cleansing tool. 

 

Since we've already gone through something similar to Option 1 in Example 2, I will elaborate on Option 2 in this example. We will use a product page on Flipcart.com to demonstrate. 

Before we start the work, it's worthwhile to confirm if this tactic can be applied by looking at the image URL for the thumbnail and its full-sized counterpart. So I handpicked one of the thumbnails to check.

 

Thumbnail URL: https://rukminim1.flixcart.com/image/128/128/jatym4w0/speaker/mobile-tablet-speaker/v/u/7/philips-in-bt40bk-94-original-imafybc9ysphpzhv.jpeg?q=70

Full-size URL: https://rukminim1.flixcart.com/image/416/416/jatym4w0/speaker/mobile-tablet-speaker/v/u/7/philips-in-bt40bk-94-original-imafybc9rqhdna8z.jpeg?q=70

Notice the only difference between these two URLs is the number indicative of the image size. "128" for the thumbnail and "416" for the full-sized image. This would mean that as long we have the thumbnail URLs extracted, we can convert them into full-sized URLs simply by replacing "128" with "416". Let's see it in action. 

1) Launch the Octoparse App, start a new task, then input the target URL into the text box. 

 

octoparse example 3

 

2)Click on the first thumbnail image. The Action Tips panel now reads "Element selected. 5 similar buttons found." Bravo! Octoparse recognized the remaining thumbnails automatically. 

Select "Select all". 

 

octoparse example 3

 

Then, select "Extract the text of the selected elements". This is obviously not what we want, but we can change it later. 

 

Toggle the "Workflow"  switch at the upper right corner. Notice we had nothing extracted.

 

 

Well, this is kinda expected as there was apparently no text to be extracted. What we really need to fetch is the image URL hidden in the Html source code. So now we'll need to change the type of data to extract. 
3) Hit the "Customize" icon  (little pencil) at the bottom, click "Define data extracted", check "Extract outer HTML, including source code, text for format and image". Click "OK" to save. 

 

4) Use the Regular Expression tool to match the image URL from the whole chunk of outer HTML. 


Click the "Customize" icon again. This time, click "Refine extracted data". There are a couple of data cleansing steps to add. 

Click "Add step", then select "Match with regular expression". If you are not familiar with the regular expression, feel free to use the Built-in RegEx tool   which I like a lot. 


The RegEx tool is rather self-explanatory. Input the beginning and the end of the desired data string. Click "Generate" then the corresponding regular expression is generated. Click "Match" to see if the desired data can be matched successfully. If you expect to get more than one line matched, check for "Match all". 

 

5) Are we done? Close, but not yet. Remember this is only the thumbnail URL and we still need to replace "128" with "416" in order to make them the full-sized image URLs. 

Click "Add step" one more time. Select "Replace". Replace "128" with "416". Click "Evaluate". Finally, we have the URL we need.

 

 

Check the data extracted. 

 

 

6) Test run the crawler. 


 

 

In the example above, we had each image URL extracted as an individual row. What if you need to get the URLs extracted all together? This can be done by extracting the outer HTML of all the thumbnails at once. Then, use RegEx to match out the individual URLs, replace the size number, then you'll get all the full-sized image URLs fetched into one single row. 

 

1) Load the website and click on one of the thumbnails. Click the "Expand" icon at the lower right corner of the Action Tips panel until the whole thumbnail section is highlighted in green, which basically means they are selected.

 

 

2) Select "Extract Outer HTML of the selected element" on the Action Tips panel.

 

 

 

3) Toggle back to workflow mode. 

Repeat the same data cleansing steps to match out the individual thumbnail URLs.
Check "Match All" this time as we need to match more than one URL from the extracted HTML code.
Again, replace "128" with "416". Lastly, we have all the full-sized image URLs extracted into one single line. [Download the crawler file used in this example and try it out yourself]
Now that we have the image URLs extracted, let's go ahead get the actual image files downloaded using one of my favorite tools, TabSave. There are also other similar image downloaders available on the web and most of them are free. 

 

1) Export the data extracted to Excel or similar formats. 
download data
2) Copy and paste the image URLs to TabSave.
tab save
3) Start to download the files by clicking the download icon at the bottom.

download image
I hope this tutorial gives you a good start on extracting images/data from the web. That said, each website is different. Depending on your target websites, some adjustments may be needed to make it work. Yet, the general concept and workflow can be extended to most websites. If you have any questions about how to set up a crawler within Octoparse, you are always welcome to check the Octoparse Help Center or reach out to support@octoparse.com

Artículo en español: Cómo Construir un Scraper de Imágenes sin Codificación

También puede leer artículos de web scraping en el sitio web oficial

 

Download Octoparse to start web scraping or contact us for any
question about web scraping!

Contact Us Download