How Web Scraping Helps Tsinghua University's Scientific Research
Saturday, June 01, 2019
Tsinghua University works with Octoparse to conduct scientific research in social science and economics. Within the university, the Institute of Economics has joined forces with the School of Social Science and the Institute of Data Science to found a research team called the “iCPI Study Group”, whose goal is to collect and analyze useful data from the internet. With Octoparse’s help in collecting data at scale, the team analyzes that data to establish scientific laws and theories that explain behaviors across society.
Basic introduction to iCPI research
What is iCPI?
iCPI is the Internet-Based Consumer Price Index. Whereas the ordinary CPI tracks the cost of living over time, the iCPI is designed to predict inflation from real-time online data by studying consumer behavior and rapid price fluctuations across billions of online merchant transactions.
With the help of Octoparse, the iCPI research team at Tsinghua University has been able to devote itself to the project and make significant progress. Real-time iCPI analysis lets financial and economic data analysts and academic researchers reference the data easily.
Step One: Defining the variable
According to the National Bureau of Statistics of China (NBS), there are eight major categories in the CPI basket: food, alcohol & tobacco; residence; clothing; transport & communication; health; household durables & services; education & recreation; and a miscellaneous category. Each category also contains subcomponents, such as grains, pork, and vegetables in the food category.
The Tsinghua University research group needs to choose the categories and subcomponents that have the largest online market share and stable selling prices across multiple platforms.
Step Two: Data collection
iCPI research differs from ordinary CPI research. The data cannot be collected with traditional methods such as experiments, participant observation, surveys, or existing resources. The Tsinghua University research team needs a massive amount of online price and commodity information to generate a reliable iCPI pattern that can serve as a reference for macroeconomic decisions. Data collection therefore has to be done through automated web scraping rather than manual copy and paste.
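To make the idea of automated collection concrete, here is a minimal sketch of pulling product names and prices out of a listing page with Python's standard library. It is not how Octoparse works internally. The HTML layout, the class names "product-name" and "product-price", and the "CNY" price format are all assumptions made up for illustration; a real site would need its own selectors.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects (name, price) pairs from a hypothetical product listing page."""

    def __init__(self):
        super().__init__()
        self._field = None   # field the parser is currently inside, if any
        self._current = {}   # partially built record
        self.records = []    # finished (name, price) tuples

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        # Assumed class names; they are not taken from any real website.
        if css_class in ("product-name", "product-price"):
            self._field = css_class

    def handle_data(self, data):
        text = data.strip()
        if self._field is None or not text:
            return
        self._current[self._field] = text
        self._field = None
        if len(self._current) == 2:
            # Assumed price format: "<number> CNY"
            price = float(self._current["product-price"].split()[0])
            self.records.append((self._current["product-name"], price))
            self._current = {}


def extract_prices(html):
    """Parse an HTML string and return a list of (name, price) tuples."""
    parser = PriceParser()
    parser.feed(html)
    return parser.records


SAMPLE_HTML = """
<ul>
  <li><span class="product-name">Rice 5kg</span><span class="product-price">32.50 CNY</span></li>
  <li><span class="product-name">Pork 500g</span><span class="product-price">18.00 CNY</span></li>
</ul>
"""

records = extract_prices(SAMPLE_HTML)
```

Run once per page and per day, a routine like this replaces the manual copy and paste that would otherwise be needed for every product.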
Step Three: Correlation Analysis
Step Four: Indices Calculation
- Daily Index
- Weekly Index
- Monthly Index
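The exact formulas used by the iCPI Study Group are not given here. As a rough illustration only, the sketch below assumes a Jevons-style daily index (the geometric mean of each item's price relative to a base period, scaled to 100) and assumes the weekly and monthly indices are simple averages of the daily values. The item names, prices, and function names are hypothetical.

```python
import math
from collections import defaultdict
from datetime import date


def daily_index(base_prices, day_prices):
    """Jevons index: geometric mean of price relatives, scaled to 100."""
    common = [item for item in base_prices if item in day_prices]
    if not common:
        raise ValueError("no items in common with the base period")
    log_sum = sum(math.log(day_prices[i] / base_prices[i]) for i in common)
    return 100.0 * math.exp(log_sum / len(common))


def average_by(daily, key):
    """Average daily index values over the periods defined by key(date)."""
    groups = defaultdict(list)
    for day, value in daily.items():
        groups[key(day)].append(value)
    return {period: sum(vals) / len(vals) for period, vals in groups.items()}


# Hypothetical base-period prices and three days of scraped prices.
base = {"rice": 30.0, "pork": 20.0}
observed = {
    date(2019, 5, 30): {"rice": 30.0, "pork": 20.0},
    date(2019, 5, 31): {"rice": 33.0, "pork": 20.0},
    date(2019, 6, 1): {"rice": 33.0, "pork": 22.0},
}

daily = {day: daily_index(base, prices) for day, prices in observed.items()}
weekly = average_by(daily, lambda d: tuple(d.isocalendar())[:2])  # (ISO year, ISO week)
monthly = average_by(daily, lambda d: (d.year, d.month))
```

An unweighted geometric mean is only one possible choice; an index weighted by each category's share of the CPI basket would need the weights as an additional input.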
Why choose Octoparse?
The iCPI study group needs a large amount of data to support its research, and obtaining that much data poses several obstacles for team members.
First, it demands strong data analysis skills, including writing code. That can be difficult for research members who lack programming experience.
Second, data collection has to balance quantity, quality, efficiency, and automation. If one scrapes a website too fast, the website may well put a mechanism in place to block further scraping.
As a result, team members need to be able to scrape without coding and avoid being blacklisted at the same time.
Octoparse is web scraping software with anti-blocking features that does not require writing a single line of code. Compared to other web scraping software, Octoparse offers:
1. Task templates: the most prominent feature in Octoparse. There are 45 task templates covering 12 categories, including social media, e-commerce, Google Scholar, and finance.
2. Built-in browser: it visualizes the scraping process by simulating web browsing. As a result, people with no background in web data structures don’t have to deal with masses of HTML code to conduct web scraping.
3. IP rotation: it is necessary to change your IP address frequently if you intend to get a large amount of data from one website, because the website will detect abnormal activity from a single IP address and ban excessive requests from it. IP rotation automatically switches between IP proxies so the scraping doesn’t get interrupted.
4. User-agent switching: the user agent is the name tag a browser sends with its requests to websites. Some websites are picky about which user agents are allowed access. Octoparse offers 9 user agents for users to avoid such a situation.
5. Cookie clearing: speeds up the extraction process.
6. reCAPTCHA: the technique websites use to decide whether requests come from robots or actual human beings. It usually takes one of these forms:
- Choosing images
- Checking a box that says “I’m not a robot”
- Typing in characters or numbers
These can be handled in Octoparse’s built-in browser. The idea is that you can do anything in the built-in browser that you can do in your own, so you can pass the reCAPTCHA yourself.
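For readers curious what IP rotation and user-agent switching look like in code, here is a minimal sketch using Python's standard library. It is not Octoparse's implementation. The proxy addresses, user-agent strings, class name, and one-second delay are placeholders chosen for illustration, and any real use should respect the target site's terms of service.

```python
import itertools
import time
import urllib.request

# Placeholder values: these proxies and user-agent strings are illustrative only.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) ExampleBrowser/1.0",
]


class RotatingClient:
    """Cycles through proxies and user agents, pausing before each request."""

    def __init__(self, proxies, user_agents, min_delay=1.0):
        self._proxies = itertools.cycle(proxies)
        self._agents = itertools.cycle(user_agents)
        self.min_delay = min_delay  # seconds to wait before every request

    def prepare(self, url):
        """Return (request, proxy) for the next request without sending it."""
        proxy = next(self._proxies)
        headers = {"User-Agent": next(self._agents)}
        return urllib.request.Request(url, headers=headers), proxy

    def fetch(self, url, timeout=10):
        """Send one request through the next proxy and return the body bytes."""
        request, proxy = self.prepare(url)
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
        time.sleep(self.min_delay)  # throttle so the site is not hit too fast
        with opener.open(request, timeout=timeout) as response:
            return response.read()
```

Each call to `fetch` uses the next proxy and the next user agent in round-robin order, so consecutive requests do not all appear to come from the same address and browser.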
Author: Ashley Ng
Ashley is a data enthusiast and passionate blogger with hands-on experience in web scraping. She focuses on capturing web data and analyzing it in a way that empowers companies and businesses with actionable insights. Read her blog here to discover practical tips and applications on web data extraction.