15 Most Frequently Asked Questions About Web Scraping (Q&A)

Friday, February 01, 2019

Web scraping is a widely discussed term, yet it remains a mystery to many professionals. As a web scraping service provider, we have put together answers to some of the most common web scraping questions to help unravel the mystery.

 

1. What is web scraping?

Web scraping, also known as web harvesting or web data extraction, refers to obtaining data available on the World Wide Web, either directly over the Hypertext Transfer Protocol (HTTP) or through a web browser.
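
To make the idea concrete, here is a minimal sketch of the extraction step using only Python's standard library. The sample markup and the choice of `<h2>` as the element holding the data are assumptions made for illustration; a real page would have its own structure, and the HTML string would normally come from an HTTP response.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collects the text of every <h2> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        # Only keep text that appears inside an open <h2> element.
        if self._in_heading:
            self.headings[-1] += data


def extract_headings(html):
    """Return the stripped text of all <h2> elements in the given HTML."""
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


# Hypothetical page content standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h2>Product A</h2><p>$10</p>
  <h2>Product B</h2><p>$12</p>
</body></html>
"""

print(extract_headings(SAMPLE_HTML))  # ['Product A', 'Product B']
```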

Read more: Web Scraping: How It All Started and Will Be

 

2. Is web scraping legal?

Web scraping itself is not illegal; it is just a tool for collecting data more easily. However, it may break the law when you use it to steal non-public information, or when the target website prohibits scraping without prior permission or you disregard the copyright terms that govern the use of its data. We highly recommend reading the website’s Terms and Conditions thoroughly before scraping it.

 

3. What’s the best web scraping tool?

Which scraping tool to choose depends on the nature and complexity of the target website. Any tool that gets you the data quickly and smoothly, at an acceptable cost or for free, is a reasonable choice.

Read more: Best Data Scraping Tools for 2019

 

4. Can I scrape LinkedIn or Facebook?

Unfortunately, both websites disallow automated web crawling in their robots.txt files, and LinkedIn’s legal disputes with companies that have scraped its data have been a hot topic. It is still possible to extract data from the two websites if you scrape only publicly available data and listings.

Read more: Scrape post from LinkedIn

 

5. What is web scraping used for?

Web scraping is about collecting data, so it can be applied in any industry that needs data. It is widely used in market research, price monitoring, human capital optimization, lead generation, and many other fields.

Read more: Data Insight: 54 Industries Using Web Scraping

 

6. Can I extract data from the entire web?

Many people believe web scraping can be used to scrape data from the entire World Wide Web, or at least from hundreds of thousands of websites. This is not feasible in practice. Because websites do not follow a universal page structure, it would be hard for a single web scraper to interact with all pages.

 

7. Is web scraping data mining?

Web scraping and data mining are two different concepts. Web scraping collects raw data, whereas data mining is the process of discovering patterns in large data sets.

Read More: Data Mining (Wiki)

Data Mining Explained With 10 Interesting Stories

 

8. How to avoid being blocked from scraping a website?

Many websites will block you if you scrape them too aggressively. To avoid being blocked, make the scraping process look more like a human browsing the website. For example, adding a delay between requests, using proxies, or varying your scraping patterns can all help.
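
The delay technique mentioned above can be sketched roughly as follows. The function name `fetch_politely`, its parameters, and the 2 to 5 second default range are all assumptions chosen for illustration, not recommended values for any particular website. The `fetch` callable is supplied by the caller, so any HTTP client could be plugged in.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests.

    `fetch` is any callable that takes a URL and returns its content.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # A randomized pause avoids a perfectly regular request rhythm.
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results


# Demonstration with a stub instead of a real HTTP client and very short delays.
pages = fetch_politely(
    ["https://example.com/1", "https://example.com/2"],
    fetch=lambda url: f"<html>content of {url}</html>",
    min_delay=0.01,
    max_delay=0.02,
)
print(len(pages))  # 2
```

Passing `sleep` as a parameter keeps the pacing logic separate from the clock, which makes the function easy to test without waiting.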

Read More: How to Scrape Websites Without Being Blocked?

 

9. Can CAPTCHA be solved during web scraping?

CAPTCHA used to be a nightmare for web scraping, but it can now be handled much more easily. Many web scraping tools can solve CAPTCHAs automatically during the extraction process, and there are many CAPTCHA solvers that can be integrated with scraping systems.

Read More: 5 Things You Need to Know of Bypassing CAPTCHA for Web Scraping

 

10. Can I republish the content extracted via web crawling?

Republishing content requires the owner’s consent. Even if you scrape text content from websites that allow bots, you must still use the data in a way that does not infringe the publisher’s copyright.

 

11. What is the difference between web scraping and web crawling?

Web scraping and web crawling are two related concepts. Web scraping, as mentioned above, is the process of obtaining data from websites; web crawling is the systematic browsing of the World Wide Web, typically for the purpose of web indexing.
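
The difference can be illustrated with a small sketch in which the two steps sit side by side: following links is the crawling part, and pulling the page title out is the scraping part. The site below is a hypothetical in-memory dictionary rather than a real website, and the names `crawl` and `LinkAndTitleParser` are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects the href of every <a> tag and the text of the <title> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first from start_url and return {url: title}."""
    seen = {start_url}
    queue = deque([start_url])
    titles = {}
    while queue and len(titles) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched
        parser = LinkAndTitleParser()
        parser.feed(html)
        titles[url] = parser.title.strip()      # scraping: extract the data
        for href in parser.links:               # crawling: discover new pages
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return titles


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_SITE = {
    "https://example.com/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<title>Page A</title><a href="/">home</a>',
    "https://example.com/b": '<title>Page B</title><a href="/a">A</a>',
}

print(crawl("https://example.com/", FAKE_SITE.get))
```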

Read More: Data crawler

 

12. What is a robots.txt file?

Robots.txt is a text file in which the website owner tells crawlers, bots, or spiders whether and how the website may be scraped. It is important to understand the robots.txt file to avoid being blocked while web scraping.
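
As a rough illustration, Python's standard `urllib.robotparser` module can check a URL against robots.txt rules before a scraper requests it. The rules and the user agent name `my-scraper` below are invented for this example; a real scraper would load the file from the target website (for instance with the parser's `set_url` and `read` methods) instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real website.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text and return a ready-to-query RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)

# Paths outside /private/ are allowed for every user agent.
print(rules.can_fetch("my-scraper", "https://example.com/products/1"))    # True
# Paths under /private/ are disallowed.
print(rules.can_fetch("my-scraper", "https://example.com/private/data"))  # False
# The owner asks bots to wait this many seconds between requests.
print(rules.crawl_delay("my-scraper"))                                    # 5
```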

 

13. Can I scrape data behind a login page?

Yes, you can scrape data behind a login page if you have a working account on the website. After you log in, the scraping process is similar to scraping any other page.
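
One common way to do this is to submit the login form once and keep the resulting session cookies for later requests. The sketch below uses only the standard library and assumes a plain HTML form with fields named `username` and `password`; those field names, the URLs, and the function names are hypothetical, and sites that rely on CSRF tokens or JavaScript-driven logins would need extra steps that are not shown here.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores cookies between requests, like a browser."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    return opener, cookie_jar


def build_login_request(login_url, username, password):
    """Build the POST request that submits the (assumed) login form fields."""
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(
        login_url, data=form.encode("utf-8"), method="POST"
    )


def scrape_behind_login(login_url, page_url, username, password):
    """Log in, then fetch a protected page with the same cookie-aware opener."""
    opener, _ = make_session()
    # Step 1: log in; the session cookie is stored in the opener's cookie jar.
    with opener.open(build_login_request(login_url, username, password), timeout=10):
        pass
    # Step 2: request the protected page; the stored cookie is sent automatically.
    with opener.open(page_url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

A call such as `scrape_behind_login("https://example.com/login", "https://example.com/account", "alice", "secret")` would return the HTML of the protected page, which can then be parsed like any other page.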

Read More: Extract data behind a login

 

14. How do I extract the content from dynamic web pages?

A dynamic website updates its data frequently. For example, there are always new posts on Twitter. Scraping such a website follows the same process as scraping other websites, except that you let the scraper access the website at a set frequency to keep getting the updated data.
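
A minimal sketch of that idea: re-run the same extraction at a fixed interval and keep only the items that have not been seen before. The function name `poll_for_new_items`, its parameters, and the post identifiers are assumptions for illustration; `fetch_items` stands for whatever scraping routine returns the items currently on the page.

```python
import time


def poll_for_new_items(fetch_items, rounds, interval_seconds=60.0, sleep=time.sleep):
    """Call fetch_items() `rounds` times and return items in first-seen order.

    `fetch_items` is any callable returning the items currently on the page.
    `sleep` is injectable so the waiting can be skipped in tests.
    """
    seen = set()
    new_items = []
    for round_number in range(rounds):
        if round_number > 0:
            sleep(interval_seconds)  # wait before visiting the page again
        for item in fetch_items():
            if item not in seen:
                seen.add(item)
                new_items.append(item)
    return new_items


# Simulated snapshots of a feed that gains a new post on each visit.
snapshots = iter([["post-1"], ["post-1", "post-2"], ["post-2", "post-3"]])
print(poll_for_new_items(lambda: next(snapshots), rounds=3, interval_seconds=0))
# ['post-1', 'post-2', 'post-3']
```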

Read More: Scheduled crawlers running in the cloud

 

15. Can a web scraping tool download files from a website directly?

Yes, many scraping tools can download files from a website directly and save them to Dropbox or other servers while scraping text information.
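
For readers writing their own scraper, a file download can be sketched with the standard library as shown below. The function name `download_file`, the fallback filename, and the example URL in the usage note are assumptions; the sketch saves to a local folder only, and uploading to Dropbox or another server would require that service's own client, which is not shown.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_file(url, target_dir):
    """Download the file at `url` into `target_dir` and return the saved path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path segment of the URL as the filename, with a fallback.
    filename = Path(urlparse(url).path).name or "download.bin"
    destination = target_dir / filename
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url, timeout=30) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

For example, `download_file("https://example.com/files/report.pdf", "downloads")` would save the file as `downloads/report.pdf`, assuming that URL exists and is reachable.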

 

Author: Yina Huang (Octoparse Team)

Edit: Ashley Weldon
