ChatGPT is a record breaker. Within a week of its launch it reached 1 million users, and it drew more than 57 million monthly users in its first month. No chatbot has ever attracted this much attention or generated so much discussion. Its impressive performance has made the public wonder how it will change their lives and worry about whether it will replace jobs anytime soon.
How much will ChatGPT change data extraction? Or, to put it more pessimistically, will ChatGPT eliminate the need for web scraping tools? To answer this question, we first need a clear idea of what ChatGPT is and what it can do for web scraping.
Toy or Tool? What Is ChatGPT?
ChatGPT is an AI language model. If you ask how to address it, it will reply that you can call it “ChatGPT” or “AI” and that its pronouns are “it” or “the model”. Talking with it feels very similar to talking with a real person, except that it claims, “I do not have personal preferences or emotions.”
ChatGPT is a chatbot developed by OpenAI and built on top of OpenAI’s GPT-3 family of models. GPT-3, short for Generative Pre-trained Transformer 3, is a state-of-the-art language model capable of generating human-like text. To train it, OpenAI fed the model 300 billion words drawn from 570 GB of plain text, including books, articles, Wikipedia, and posts on the Internet.
OpenAI has done a great job of training the model, which is why ChatGPT is such a success. Almost anyone can find it helpful in daily life and at work. It can help with writing, debugging, and explaining code, as well as with preparing for job interviews. You can even ask it for suggestions on how to get along better with people in the real world. In one instance, a friend of mine was planning to buy a car and looked at a second-hand one last week, but it wasn’t right for her. She asked ChatGPT what she could say to the seller without being rude. The model gave its answer and explained how to respond politely.
ChatGPT’s content creation skills have caused a buzz around the world. Even if you have no idea how to write fiction, a film script, or a song, you can get one by asking ChatGPT several questions and providing some details. The quality, however, might not be all that appealing. Because of this, ChatGPT and the enthusiasm surrounding it have drawn some criticism.
An article in The Atlantic argued that people should treat ChatGPT like a toy, not a tool, because it cannot fully comprehend the complexities of human language and speech. As a result, what ChatGPT writes is OK, but just OK. It can only write basic content, which is somewhat boring.
How Will ChatGPT Affect Web Scraping Tools?
Similarly, ChatGPT shows both strengths and weaknesses in web scraping. It can be a good source of advice, but it is not a tool that can scrape data for you.
The most common use of ChatGPT in web scraping is generating code for data extraction. You can ask for code that targets a specific website URL, and ChatGPT will output lines of code that can be copied and pasted. It will also tell you which library the example relies on. For people who collect data by coding, ChatGPT can save time because they don’t need to write all the code themselves. A user made a similar point under the question “How will ChatGPT affect web scraping?” on Reddit. He believed it would cut down the time spent Googling while scraping websites, but that the benefit shrinks the more experienced the developer is.
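To make this concrete, here is a minimal sketch of the kind of extraction code such a request might return. It is not actual ChatGPT output. It uses only Python's standard library, and the names `LinkTextParser`, `extract_links`, and `fetch`, as well as the sample HTML, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkTextParser(HTMLParser):
    """Collect the text and href of every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (text, href) tuples
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return its links as (text, href) tuples."""
    parser = LinkTextParser()
    parser.feed(html)
    return parser.links


def fetch(url):
    """Download a page as text; the caller supplies the target URL."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical sample markup, so the sketch runs without network access.
sample = '<ul><li><a href="/a">First item</a></li><li><a href="/b">Second item</a></li></ul>'
print(extract_links(sample))
```

Against a live site you would call `extract_links(fetch(url))` with your own URL. Real pages usually need selectors tailored to their markup, which is where generated code most often needs correcting.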
Why not ask ChatGPT to scrape data from the website directly? We asked whether it could scrape a target website and hand us the data. ChatGPT’s response was, unsurprisingly, no. Because it is merely a language model, it can only serve as a guide and help you reach the objective.
Web Scraping Tools are Still Indispensable
ChatGPT, according to its answer, can only provide advice regarding data scraping. Thus, discussing whether it can replace web scraping tools or become an alternative is like spending time discussing if the color red can replace the fruit apple.
In the early days of the Internet, people needed coding skills to grab information from online sources. Now a wide range of web scraping tools makes it easy for anyone to extract data from websites, no matter how much coding experience they have. People no longer need to spend time Googling how to scrape data, correcting code examples, or asking ChatGPT for help.
Many options are available on the market, and almost all of them aim to provide an easy-to-use experience that improves productivity in the workplace. Take Octoparse as an example: it generally lets users scrape data from a variety of websites in four easy steps.
Step 1: Create a new task
Open the Octoparse software on your device, then copy and paste the target URL into the search bar. The built-in browser will begin loading the page.
Step 2: Select the wanted data
Once the page has finished loading, click “Auto-detect webpage data” in the Tips panel. Octoparse will scan the page and highlight any extractable data. You can quickly check whether each field is wanted, locate it on the webpage, and remove unwanted fields in the data preview at the bottom.
In this phase, you don’t need to be familiar with XML or HTML documents. By contrast, you need to know how to examine these documents when you use a Python library. Even if ChatGPT gives you a code example, you still need to take the time to verify that the code is accurate and meets your requirements. Some users have reported that ChatGPT sometimes makes coding mistakes.
Step 3: Create a workflow
After selecting all the desired data fields, click “Create workflow”. A workflow will then show up on the right-hand side of the screen. It’s a flowchart that presents every step of the scraper. Click any action in the chart to preview how it works and check that it behaves as intended.
Step 4: Run the task and export the data
After verifying all the information, click “Run” to start the scraper. Octoparse provides two options for running tasks. One is running on your local device. You must keep the device switched on and working properly throughout the scraping process to ensure everything goes as planned. The other is running on Octoparse’s cloud servers. If you choose this option, your task is sent to the cloud servers, which keep working for you around the clock regardless of whether your own devices are switched on or what state they are in.
After the task has finished running, you can export the data to an Excel, CSV, or JSON file, or even send it to a destination like Google Sheets.
Isn’t it pretty easy to understand? To try Octoparse’s powerful functions, first download and install it on your device. If you don’t have an Octoparse account yet, you’ll need to sign up for a free one to log in. To some extent, Octoparse will give you an experience as remarkable as talking with ChatGPT.
Make Better Use of Octoparse with ChatGPT
Although ChatGPT frankly said it could not play the role of a web scraping tool, its abilities can still make web scraping easier when you use tools to extract data. Here are two situations you might run into when you try to grab information from websites, and how the combination of Octoparse and ChatGPT can help you get data more easily than ever.
The best alternative to XPath tools
The auto-detection function in Octoparse is very easy to use. However, on websites with a more intricate structure, auto-detection may leave out some of the information you require. In such a situation, you’ll need XPath to locate the exact data fields.
ChatGPT can provide you with an XPath once you enter your requirements, even if you have no idea what XPath is. Here’s how to get an XPath in three easy steps, using the IMDb Top 250 Movies page as an example.
Step 1: Find the page you want to scrape and copy its URL
Step 2: Tell ChatGPT which element(s) on this page you want XPath(s) for
Step 3: Copy what ChatGPT outputs
Beyond giving you a result you can copy and use directly, ChatGPT patiently explains each component of the XPath so that you understand it better. ChatGPT can write both relative and absolute XPaths, so it’s best to state which kind you want in your question to get the exact result you need. Then you can customize data fields with the XPath in Octoparse and get the data you want from websites.
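The difference between a relative and an absolute XPath is easier to see on a small example. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The markup is hypothetical and much simpler than the real page, and the class name `titleColumn` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a ranked movie list.
SAMPLE_PAGE = """
<html>
  <body>
    <table>
      <tbody>
        <tr><td class="titleColumn"><a href="/title/a/">Movie A</a></td><td class="rating">9.2</td></tr>
        <tr><td class="titleColumn"><a href="/title/b/">Movie B</a></td><td class="rating">9.1</td></tr>
      </tbody>
    </table>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE_PAGE)  # root is the <html> element

# Relative XPath: match title cells anywhere below the root by attribute.
relative_xpath = ".//td[@class='titleColumn']/a"

# Absolute-style path: every step from the root down. In full XPath this
# would be /html/body/table/tbody/tr/td[1]/a; ElementTree expects the path
# to start below the root element, so the leading /html is dropped.
absolute_style_xpath = "body/table/tbody/tr/td[1]/a"

relative_titles = [a.text for a in root.findall(relative_xpath)]
absolute_titles = [a.text for a in root.findall(absolute_style_xpath)]

print(relative_titles)
print(absolute_titles)
```

Both paths return the same two titles here. The relative one keeps working if the table is moved inside another container, while the absolute one breaks as soon as any step in the chain changes.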
A good helper for regular expressions
Octoparse has a feature that allows users to use regular expressions (Regex) to optimize scrapers. As a sequence of characters that specifies a search pattern in the text, a regular expression can be used to match elements or replace elements within text strings. Its adaptability makes it more potent than you might imagine in the area of data extraction. On Octoparse, you can use Regex to extract specific info and refine extracted data by replacing content, adding a prefix, etc.
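As a rough illustration of these three operations (matching, replacing, and adding a prefix), here is a small Python sketch. It is not Octoparse's own implementation. The raw values, the pattern, and the `clean_price` helper are made up for the example.

```python
import re

# Made-up raw values, as a scraper might capture them from a product list.
raw_values = ["Price: $1,299.00", "Price: $89.50", "Out of stock"]

# Match: a dollar sign followed by digits, optional commas, and two decimals.
PRICE_PATTERN = re.compile(r"\$([\d,]+\.\d{2})")


def clean_price(text, prefix="USD "):
    """Return the price in `text` as e.g. 'USD 1299.00', or None if absent."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    number = re.sub(r",", "", match.group(1))  # replace: drop thousands separators
    return prefix + number                      # add a prefix


cleaned = [clean_price(value) for value in raw_values]
print(cleaned)  # ['USD 1299.00', 'USD 89.50', None]
```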
Although many tools can help you write regular expressions, you usually need a basic understanding of HTML documents and regular expressions to generate one. By contrast, ChatGPT is a friendlier helper that never requires you to know anything about either.
Take the IMDb Top 250 Movies page as an example again. After a single question, the chatbot gave a more comprehensive answer than we had expected. It generated a regular expression and explained the meaning of each component, which helped us understand the expression better.
In addition, it provided a Python application example and information on the format of the collected data.
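The answer itself is not reproduced here. As a sketch of what such a Python example can look like, the snippet below applies a regular expression with named groups to a few hand-typed lines in the shape of a ranked movie list. The sample text is not scraped data, and the pattern and the `parse_entries` helper are assumptions made for illustration.

```python
import re

# Hand-typed sample in the shape "rank. title (year)"; not scraped data.
SAMPLE_TEXT = """\
1. The Shawshank Redemption (1994)
2. The Godfather (1972)
3. The Dark Knight (2008)
"""

# One entry per line: rank, a dot, the title, then the year in parentheses.
ENTRY_PATTERN = re.compile(
    r"^(?P<rank>\d+)\.\s+(?P<title>.+?)\s+\((?P<year>\d{4})\)$",
    re.MULTILINE,
)


def parse_entries(text):
    """Return one dict per matching line, with rank and year as integers."""
    entries = []
    for match in ENTRY_PATTERN.finditer(text):
        entries.append({
            "rank": int(match.group("rank")),
            "title": match.group("title"),
            "year": int(match.group("year")),
        })
    return entries


movies = parse_entries(SAMPLE_TEXT)
for movie in movies:
    print(movie)
```

Each result is a dictionary with `rank`, `title`, and `year` keys, which is the kind of collected-data format worth checking before you rely on a generated expression.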
With ChatGPT, you no longer need to create regular expressions on your own. Just copy the generated expression from ChatGPT and paste it into Octoparse to improve your scraper’s ability to extract specific info and clean data.
ChatGPT is outstanding among chatbots. It shows how far AI has come and invites the public to imagine a world with AI. ChatGPT can lend a hand in many things, including web scraping, but it seems too early to call it a productivity tool. For web scraping itself, users have many great tools to choose from. Octoparse, as one of them, makes web scraping easy for everyone. Check out the articles below to find more possibilities for web scraping.