Build a Reddit Image Scraper without Coding
Monday, June 07, 2021
It can be time-consuming to find, copy, and paste various images from Reddit. But have you ever thought about building a Reddit image scraper using Octoparse, the powerful web scraping tool? Let's find out how to do it.
What is an Image Scraper and how does it work?
This article explains how to build a Reddit image scraper, but let's start with the idea of an image scraper in general. An image scraper saves you the manual work of copying and pasting images from web pages.
In real-world scenarios, it is not feasible to copy and paste hundreds or thousands of images by hand, whatever the purpose. An image scraper automates the process for you and cuts the time it takes.
Building an image scraper basically takes two steps:
- Scraping image URLs from a particular website and storing them in an Excel file
- Using the URLs to download the images with a Chrome plugin
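For readers curious about what the first step looks like in code, here is a minimal sketch using only Python's standard library. It pulls the `src` of every `<img>` tag out of an HTML string and turns the result into CSV text that Excel can open. The `ImageURLCollector` class, the helper names, and the sample HTML are illustrative assumptions, not anything Octoparse produces, and a real Reddit page loads most images dynamically, so this only shows the idea.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageURLCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return the image URLs found in an HTML string, in document order."""
    collector = ImageURLCollector(base_url)
    collector.feed(html)
    return collector.urls


def urls_to_csv_text(urls):
    """Return CSV text with a header row and one image URL per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["image_url"])
    writer.writerows([url] for url in urls)
    return buffer.getvalue()


# Made-up sample page, used only to demonstrate the helpers.
SAMPLE_HTML = (
    '<div><img src="/a.png">'
    '<img src="https://i.example.com/b.jpg">'
    '<img alt="no source"></div>'
)
```

Calling `extract_image_urls(SAMPLE_HTML, "https://www.example.com/page")` gives a list of two absolute URLs, and writing the output of `urls_to_csv_text` to a `.csv` file gives a one-column spreadsheet of links.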
How to build an Image Scraper?
There are two ways to build an image scraper. If you have a technical background and know a programming language, you can write one yourself. If you don't, there is no need to worry: you can still build your own customized image scraper with a web scraping tool.
- Computer Language (Coding)
If you are comfortable with code, you can build an image scraper in a programming language such as Python, often in just a few lines. Python packages such as BeautifulSoup, Scrapy, and Selenium can be used for this.
- Web Scraping tool (Without Coding)
And if you don't have a technical background, there are plenty of scraping tools out there that can help you scrape images in minutes.
Let me introduce one that has an easy-to-use, easy-to-understand interface: Octoparse.
Octoparse is scraping software with a free plan that lets you scrape up to 10K rows in one go. If that limit is not enough for you, you can upgrade to a plan with no limit for $75 a month. For more information on each plan, please visit this link: https://www.octoparse.com/pricing
Beyond the free plan, the Octoparse community provides a good number of tutorials and articles covering many real-world scenarios.
To complete the image scraper, you will also need a Chrome extension that saves images once you give it a list of URLs. For this tutorial, I am using the Tab Save Chrome plugin.
Walkthrough Example: Build a Reddit Image Scraper with Octoparse
Let us take the example of making a Reddit image scraper with Octoparse. For this tutorial, I am using version 8 of Octoparse.
We will scrape images from this Reddit page: https://www.reddit.com/rising/
To make a Reddit image scraper, just follow these simple steps:
- Copy the link of the page you want to scrape images from. In the left corner of the homepage, click the "+New" button and choose the "Advanced Mode" option from the drop-down, as shown below.
- You will then see a screen with a field for the URL. Paste the copied URL into that field, as shown below, and click "Save" to move ahead.
- Clicking the "Save" button takes you to the next screen, which is divided into three sections, as in the image below. Besides the "Workflow" and "Data preview" sections, the upper right shows the website you want to scrape. You can browse it and select elements manually, just as you would in a browser; to browse, switch on the "Browse" toggle in the upper right corner. As the screenshot below shows, the "Tips" panel lists two options. If you want to select elements from the page manually, choose the second one. Version 8 also offers an "auto-detect" function, where the scraping bot selects data from the page for you. It is a good choice if you are new to the software and want to learn how the workflow works. You can remove or keep attributes in the "Data preview" section later on, as you see fit.
- The steps below follow the "Edit task workflow manually" option.
- Click the very first image and the "Tips" panel will pop up. Then choose the "Select all" option to select all the images listed on the web page.
- Now look at the image below; you will notice some differences. After you select all the images, their links are listed in the "Data preview" section, and the "Tips" panel offers some options to choose from. To extract image URLs, select the first option, "Extract image URLs".
- To see which image a link belongs to, click that link in the "Data preview" section. The matching element is highlighted in the browsing interface, as in the attached screenshot.
- Next, change some settings so that the page scrolls further and loads more images. As it stands, it scrolls only one screen and gets only 3 images. To get more data, you have to adjust the settings shown in the screenshots below.
- Double-click the "Go to Web Page" box to open its settings.
- Check the box to scroll down more. Fill in other values like repeats and wait time as per your requirement.
- Don't forget to update the settings of the "Loop Item" in the workflow.
- If the "Extract data in the loop" option is not already selected, select it in the settings of "Extract Data" in the workflow.
- At the top of the workflow you will see two buttons, "Save" and "Run". Use them to save your task and run the crawler once all the settings have been updated.
- If you are using the free plan, run the task on your device. Cloud services are available only on other plans.
- It will scrape the list of image links for you in just minutes. As you can see below, it scraped 51 links in just 1 min 12 seconds. Pretty cool, isn't it?
- Now, to save the data file to your computer, click "Export Data" and choose the format of your choice.
- This is what the scraped URLs look like in a structured format.
- The next step is to download the images. Since you have all the links in one place, simply copy and paste them into the Tab Save Chrome extension.
- Start downloading the files by clicking the download icon at the bottom.
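If you would rather not install a browser extension, the download step can also be sketched in a few lines of Python. This is only an illustration under some assumptions: the `reddit_images` folder name and both helper functions are hypothetical, the URL list is whatever you exported from Octoparse, and some image hosts may reject plain scripted requests, which this sketch does not handle.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_for(url, index):
    """Build a numbered local file name, keeping the URL's extension if it has one."""
    name = os.path.basename(urlparse(url).path)
    extension = os.path.splitext(name)[1] or ".jpg"  # assumed fallback extension
    return f"image_{index:03d}{extension}"


def download_images(urls, folder="reddit_images"):
    """Download every URL into the folder and return the saved file paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        target = os.path.join(folder, filename_for(url, index))
        # Fetch the image bytes and write them to disk.
        with urlopen(url, timeout=30) as response, open(target, "wb") as out:
            out.write(response.read())
        saved.append(target)
    return saved
```

To use it, read the exported links into a list and call `download_images(links)`. Numbering the files avoids name clashes when two links end in the same file name.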
See how easy that was? By following these steps, you can build your own Reddit image scraper in just a few minutes. So, what are you waiting for? Go and scrape. Make as much use of this software and this tutorial as you can.