URL Extractor: Get URLs from Hyperlinks in A Web Page

Monday, December 27, 2021

This is a quick guide to pulling a list of URLs, or any other list of data on a web page, into Excel using Octoparse. Is this the URL extractor you are looking for? Let’s see.

 

Table of Contents 

URL Extractor / List Extractor

Scrape URLs in a Web Page

Prerequisites

Step-by-step Guide

Use Auto-detection

Octoparse: Boost Your Working Efficiency

 

URL Extractor / List Extractor 

You may not know the term “roundup article,” but you have almost certainly read one, and most likely it contained something you wanted to save for future use.

 

Take this article listing 100 infographic submission sites as an example. If I were an SEO marketer and came across this roundup post one day, my first thought would be:

 

 “Hey, look at this. I can pull these websites’ URLs into a table, and every time I create a new infographic, I can submit it to these websites. This could definitely help boost my website traffic, or at least the number of backlinks.”

 

Yes, this is what a URL extractor can do. I am going to do this with a web scraping tool, Octoparse, in a few seconds.

 

Scrape URLs in a Web Page

This is a simple example of how you can scrape a list of URLs from a web page into Excel. Octoparse can scrape all kinds of structured data from web pages efficiently.

 

If you are looking to scrape data other than URLs, more cases will be introduced in a video later. The video may also help if you find this text tutorial boring.

 

Prerequisites 

 

When you enter the target URL into Octoparse, the web page is rendered in the built-in browser. You can browse it as if you were using Chrome. The one difference is that you can click on elements and build a scraper while you browse.

 

 

 

Step-by-step Guide

      • Enter the target URL into Octoparse
      • Click the first hyperlink in the list
      • Click the second hyperlink in the list
        (The whole list of infographic websites will be selected in green)
      • Click “Extract both text and URL of the link”
        (Now data can be previewed in the table)
      • Click “Create Workflow”
      • Click the blue “Run” button above

 

 

That’s it. After a few clicks, you have built and run your URL extractor and pulled all 100 links into Excel for your use.
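
If you are curious what “Extract both text and URL of the link” amounts to under the hood, here is a minimal Python sketch of the same idea. It is not how Octoparse is implemented; it only uses Python’s standard library, and the sample HTML, the example.com/example.org addresses, and the helper names (LinkExtractor, extract_links, links_to_csv) are all made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (text, url) pairs for every <a href="..."> in a page."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []     # finished (text, url) pairs
        self._href = None   # URL of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links such as "/gallery" into full URLs.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(html, base_url=""):
    """Return a list of (text, url) tuples found in the HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def links_to_csv(links):
    """Render the links as CSV text, a format Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["text", "url"])
    writer.writerows(links)
    return buffer.getvalue()


# Hypothetical stand-in for a roundup page with a list of links.
SAMPLE_HTML = """
<ol>
  <li><a href="https://example.com/submit">Example Infographics</a></li>
  <li><a href="/gallery">Sample Gallery</a></li>
</ol>
"""

links = extract_links(SAMPLE_HTML, base_url="https://example.org/roundup")
print(links_to_csv(links))
```

The sketch collects every hyperlink on the page, not just the ones in the list you care about, which is exactly the part that clicking the first and second item in Octoparse takes care of for you.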

 

Use Auto-detection

If Octoparse does not select the whole list automatically after you click a few pieces of data, you may need another method.

 

You can try Octoparse’s auto-detection feature and let the AI algorithm select the data for you. If that does not work either, the website you are scraping is not an average one: its structure is not recognizable to the bot.

 

  

In this case, you need to amend the XPath to locate the data accurately. Curious about how to write an XPath? Then you are getting on board with web scraping.
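
To give a feel for what an XPath looks like, here is a small sketch in Python. It assumes a made-up, well-formed sample page and uses the standard library’s ElementTree, which understands only a limited subset of XPath, so treat it as an illustration of the idea rather than of what Octoparse generates. The point is the same one as clicking the first and second hyperlink: a path to one item, li[1], is generalized to a path that matches every item, li.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one stray link plus a list of three links.
PAGE = """
<html>
  <body>
    <p><a href="https://example.com/about">About</a></p>
    <ol>
      <li><a href="https://example.com/site-1">Site 1</a></li>
      <li><a href="https://example.com/site-2">Site 2</a></li>
      <li><a href="https://example.com/site-3">Site 3</a></li>
    </ol>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# A path that pins down only the first link in the list.
first = root.findall("./body/ol/li[1]/a")

# Dropping the [1] generalizes the path to every link in the list,
# while still skipping the "About" link outside the <ol>.
all_in_list = root.findall("./body/ol/li/a")

urls = [a.get("href") for a in all_in_list]
print([a.text for a in first])
print(urls)
```

Real web pages are rarely well-formed XML, so this exact code will often fail on live HTML; it is only meant to show why a small change to the path decides whether you get one link or the whole list.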

 

Hey, don’t worry. Just assume your website is well-structured and test it with auto-detection. 

 

You may get more than you expect. The AI algorithm is not omnipotent, but it is powerful enough to cover most types of web pages.

 

In this video, you will also see how powerful auto-detection is and how it helps scrape travel data from Lonely Planet effortlessly.

 

Octoparse: Boost Your Working Efficiency 

If you are a digital marketer and have no idea about web scraping, this is a good chance to learn something new. I am a marketer, and since I got hold of this web scraping tool, I have been collecting data at a rate I could never match manually.

 

That means: 

 

And a no-code web scraping tool is extremely friendly to a marketer, or anyone without coding knowledge who needs data. 

 

Give it a try.

 

Author: Cici

 
