Scrape Image URLs from a Website
Wednesday, December 4, 2019
A newer version of this tutorial is available. Check it for the latest steps.
Octoparse offers three ways to scrape image URLs from a website. Which one to use depends on the data format we need.
Format 1: All image URLs from a webpage are extracted into the same row, each in its own column.
To extract each image URL into its own column, repeat the following "click" and "extract" steps for every image:
- Click the wanted image on the web page.
- Select "Extract URL of the selected image" on the "Action Tips".
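For readers who want to picture the Format 1 layout outside Octoparse, here is a minimal Python sketch. It is not part of Octoparse; the function name `to_wide_csv`, the `image_N` column names, and the example.com URLs are assumptions made for illustration.

```python
import csv
import io


def to_wide_csv(urls):
    """Return CSV text with a header row and one data row, one URL per column."""
    # Hypothetical column names: image_1, image_2, ...
    header = [f"image_{i}" for i in range(1, len(urls) + 1)]
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerow(urls)  # all URLs share a single row
    return buf.getvalue()


# Placeholder URLs, not taken from any real page.
sample_urls = [
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
]
print(to_wide_csv(sample_urls))
```

Each extra image adds a column rather than a row, which matches repeating the "click" and "extract" steps once per image.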
Format 2: All the image URLs on the same webpage are exported in one column, each in its own row.
If we build a loop item to scrape all the image URLs on a page, each image URL is extracted into the same column, one URL per row.
- Click an image in the built-in browser
Octoparse will automatically detect all other similar images on the current page. The one you already selected will be highlighted in green while others will be highlighted in red.
- Click "Select all" on the "Action Tips"
- Click "Extract image URLs in a loop"
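The loop in Format 2 can be approximated in plain Python as a rough sketch. This is not how Octoparse works internally; it assumes the images are ordinary `<img>` tags with a `src` attribute, and the names `ImageUrlCollector` and `image_url_rows` are hypothetical.

```python
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Collect the src of every <img> tag as its own single-column row."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # Called for both <img ...> and self-closing <img ... /> tags.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.rows.append([src])  # one URL per row, one column


def image_url_rows(html):
    collector = ImageUrlCollector()
    collector.feed(html)
    collector.close()
    return collector.rows


# Placeholder markup, not taken from any real page.
sample_html = (
    '<div><img src="https://example.com/a.jpg">'
    '<img src="https://example.com/b.jpg"/></div>'
)
for row in image_url_rows(sample_html):
    print(row)
```

Every matching image contributes one row, so the row count grows with the number of images while the column count stays at one.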
Format 3: All the image URLs are exported in one cell by using the RegExp Tool.
If we need all the image URLs of a product or webpage extracted into a single cell, we can use the Octoparse RegExp Tool to pick them out of the webpage's source code.
- Click any spot on the targeted web page.
The selected area will be highlighted in green.
- Enlarge the selected area by clicking the button at the bottom of the "Action Tips".
- Click "Extract inner HTML of the selected element".
- Re-format the extracted data with Octoparse RegExp Tool.
In this page's source code, every image URL starts with "https://" and ends with "jpg", so the RegExp Tool can match all of them.
- Click "Customize data field"
- Select "Refine extracted data"
- Select "Match with regular expression"
- Click "Try RegExp Tool"
- Check "Start with" and "Include start" and enter "https://" in the following box
- Check "End with" and "Include end" and enter "jpg" in the following box
- Click "Generate"
- Check "Match all" and click "Match"
- Click "Apply"
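The "Start with", "End with" and "Match all" settings above can be mimicked with Python's `re` module. This is a sketch under assumptions: the exact expression the RegExp Tool generates is not shown in this tutorial, so the pattern below is a guess at an equivalent, with an added restriction (no whitespace or quote characters inside a match) that is not part of the original steps. The separator used to join the URLs into one cell is also an assumption.

```python
import re

# Assumed equivalent of "start with https://, end with jpg, include both ends".
# The character class keeps a match from running past the end of an attribute.
IMAGE_URL_PATTERN = re.compile(r"""https://[^\s"']*?jpg""")


def match_all_image_urls(inner_html):
    """Return every non-overlapping match, like the "Match all" option."""
    return IMAGE_URL_PATTERN.findall(inner_html)


def to_single_cell(urls, separator="; "):
    """Join all URLs into one string so they fit in a single cell."""
    return separator.join(urls)


# Placeholder markup, not taken from any real page.
sample_html = (
    '<img src="https://example.com/a.jpg">'
    '<a href="https://example.com/page">link</a>'
    '<img src="https://example.com/b.jpg">'
)
print(to_single_cell(match_all_image_urls(sample_html)))
```

Because the match ends at the first "jpg" it finds, a URL that contains "jpg" earlier in its path would be cut short; matching on ".jpg" or a longer ending may be safer on such pages.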
To learn more about the Octoparse RegExp Tool, please refer to the related Octoparse tutorials.
To download the images from the URLs we have scraped, please refer to the article "How to download images from a list of URLs?"
Author: Erika F