Sources Scraper: How to Get Coronavirus Data (COVID-19) Using Web Scraping
Thursday, February 13, 2020
Since the outbreak of the new airborne contagious coronavirus, the lives of millions have been affected and news about it has flooded every platform.
In this situation, we think it is necessary to collect real-time data from both official and unofficial sources so that the public can form a balanced understanding of the outbreak based on transparent data sources.
To pull data from these sources, you can take advantage of web scraping tools like Octoparse. We’ve built web scraping templates that extract data from Chinese government reports, so you can stay updated with the latest information. Now let’s take a look at how to use a template to extract live data.
Step 1: Launch Octoparse on your computer and build a scraping task by clicking “Task Template”.
Notice: There are a number of scraping “recipes”, ranging from eCommerce websites to social media channels. These are preformatted crawlers that can be used to extract data from target websites directly. You may check out this article to get a better idea of what a web scraping template is.
Step 2: Under the “Live” category, choose “national healthcare commission”.
You will see two templates. One is for extracting government news and announcements. The other is for the Tencent news website, which is directly connected with China’s central and local Health Commissions. This is so far the quickest way to get live data, including the confirmed cases, recoveries, death toll and fatality rate in each city of China.
Step 3: Click “real-time data 2019-nCov” as we want to collect live data.
There’s no need for configuration. Simply start the extraction and Octoparse will scrape the data automatically. You can export the data in many formats, such as Excel, JSON and CSV, or send it to your own database via API. Here's what the data output looks like in Excel.
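Once the data is exported, it can be processed with a few lines of code. The following is a minimal Python sketch, assuming a CSV export with columns named city, confirmed, recovered and deaths. These column names and the sample rows are invented placeholders for illustration, not the template's actual output or real case counts.

```python
import csv
import io

# Placeholder rows standing in for an exported CSV file. The column names
# and numbers are invented for illustration and are not real case counts.
SAMPLE_EXPORT = """city,confirmed,recovered,deaths
City A,200,50,4
City B,80,20,0
City C,0,0,0
"""


def fatality_rates(csv_text):
    """Return {city: deaths / confirmed} for every row with confirmed cases."""
    rates = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        confirmed = int(row["confirmed"])
        deaths = int(row["deaths"])
        if confirmed == 0:
            continue  # the rate is undefined without confirmed cases
        rates[row["city"]] = deaths / confirmed
    return rates


if __name__ == "__main__":
    for city, rate in fatality_rates(SAMPLE_EXPORT).items():
        print(f"{city}: {rate:.1%}")
```

To run this against a real export, read the file's text and pass it to fatality_rates, adjusting the column names to match the actual header row.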
You can also extract real-time information on social media channels. There are templates covering popular platforms such as Facebook, Twitter, Instagram, and YouTube.
For example, if you want to extract the latest tweets about the virus and see how people are reacting to it, you can use the “latest tweets” template. It’s designed to collect the latest tweets containing the search keyword that you enter. It lets you extract the web page URL, tweet URL, handles, posts, etc.
Now let’s run this template.
Step 1: Open Twitter, type in “coronavirus” and click on the “latest” tab. Copy the URL and paste it into the first parameter.
Step 2: Enter a number into the second parameter.
Twitter uses infinite scrolling, which means we have to set how many times the page is scrolled in order to load the desired number of posts. You can set any number you like from 1 to 10,000. The idea is to get the page fully loaded. For example, if you enter the number 10, the bot will scroll 10 times.
Step 3: Execute the scraper by clicking “save and run” and you'll get the results instantly.
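To see why the scrolling number matters, here is a minimal Python sketch of the general idea behind scraping an infinitely scrolling page. It is not Octoparse's implementation: scroll_and_collect and make_fake_feed are hypothetical names, and the fake feed simply stands in for a page that loads a fixed batch of posts on each scroll.

```python
def scroll_and_collect(load_next_batch, scroll_count):
    """Call load_next_batch() once per scroll and gather the unique posts."""
    seen = set()
    posts = []
    for _ in range(scroll_count):
        batch = load_next_batch()
        if not batch:
            break  # nothing new was loaded, so the feed is exhausted
        for post in batch:
            if post not in seen:
                seen.add(post)
                posts.append(post)
    return posts


def make_fake_feed(total_posts, page_size):
    """Hypothetical stand-in for a page that loads page_size posts per scroll."""
    position = 0

    def load_next_batch():
        nonlocal position
        end = min(position + page_size, total_posts)
        batch = [f"post-{i}" for i in range(position, end)]
        position = end
        return batch

    return load_next_batch


if __name__ == "__main__":
    feed = make_fake_feed(total_posts=45, page_size=20)
    print(len(scroll_and_collect(feed, scroll_count=10)))
```

In this sketch, a scroll count that is too small cuts the collection short, while a count larger than needed costs nothing because the loop stops as soon as a scroll loads no new posts.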
We’ve covered how to use web scraping templates to collect real-time data about coronavirus in this video. If you also want to build your own scraper to extract articles from news portals like the Wall Street Journal, the New York Times, and Reuters, you may check out this video.
This blog post originated from our article How Data Analysis Helps Unveil the Truth of Coronavirus.