How Web Scraping Helps Tsinghua University’s Scientific Research

Tsinghua University works with Octoparse on scientific research in social science and economics. Within the university, the Institute of Economics has joined forces with the School of Social Science and the Institute of Data Science to found a research team called the “iCPI Study Group”, whose goal is to collect and analyze useful data from the internet. With Octoparse collecting data on a large scale, the team analyzes that data to establish scientific laws and theories that explain behavior across society.

Basic introduction to iCPI research

What is iCPI?

iCPI is the Internet-Based Consumer Price Index. Whereas the ordinary CPI tracks the cost of living over time, the iCPI is designed to predict inflation from real-time online data by studying consumer behavior and rapid price fluctuations across billions of online merchant transactions.

Expected result:

With the help of Octoparse, the iCPI research team at Tsinghua University is able to devote itself to the project and has made significant progress. Real-time iCPI analysis lets finance and economics data analysts, as well as academic researchers, reference the data easily.

Research Procedure

Step One: Defining the variable

According to the National Bureau of Statistics of China (NBS), there are eight major categories in the CPI basket: food, alcohol & tobacco; residence; clothing; transport & communication; health; household durables & services; education & recreation; and a miscellaneous category. Each category also contains subcomponents, for example grains, pork, and vegetables in the food category.

The Tsinghua University research group needs to choose the categories and subcomponents that have the largest online market share and a stable selling price across multiple platforms.

Step Two: Data collection

iCPI research differs from ordinary CPI research. The data cannot be collected with traditional methods such as conducting experiments, participant observation, surveys, or drawing on existing resources. The Tsinghua University research team needs a massive amount of online price and commodity information to generate a reliable iCPI pattern as a reference for macroeconomic decisions. Data collection therefore has to be done by automated web scraping rather than manual copy and paste.

Step Three: Correlation Analysis

  • iCPI
  • Categories
  • Subcomponents
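
The article does not say which series are compared in this step. One plausible reading is that the online index is checked against a reference series, such as the official CPI, at the iCPI, category, and subcomponent levels. The sketch below computes a Pearson correlation coefficient under that assumption; the function name and the sample numbers are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Made-up monthly values for one category, for illustration only.
online_food_index = [100.0, 100.8, 101.5, 101.1, 102.3, 103.0]
official_food_cpi = [100.0, 100.5, 101.2, 101.0, 101.9, 102.6]

r = pearson(online_food_index, official_food_cpi)
print(f"correlation: {r:.3f}")
```

A value close to 1 would suggest the online series moves with the reference series. The same function can be applied once per category or subcomponent.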

Step Four: Indices Calculation

  • Daily Index
  • Weekly Index
  • Monthly Index
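
The article does not give the formulas behind these indices. As a rough sketch, assuming a daily index built as the geometric mean of price relatives against a base period (a Jevons-style index), and weekly and monthly indices taken as simple averages of the daily values, the calculation could look like this. The item names, prices, and helper functions are hypothetical.

```python
from collections import defaultdict
from datetime import date
from math import exp, log

def daily_index(prices_today, base_prices):
    """Geometric mean of price relatives against the base period, base = 100."""
    common = [item for item in prices_today if item in base_prices]
    if not common:
        raise ValueError("no items in common with the base period")
    log_sum = sum(log(prices_today[i] / base_prices[i]) for i in common)
    return 100.0 * exp(log_sum / len(common))

def average_by(daily, key):
    """Average the daily index values within each period given by key(day)."""
    groups = defaultdict(list)
    for day, value in daily.items():
        groups[key(day)].append(value)
    return {period: sum(v) / len(v) for period, v in groups.items()}

# Hypothetical scraped prices (item -> price), for illustration only.
base = {"rice_5kg": 40.0, "pork_500g": 20.0}
observed = {
    date(2024, 1, 1): {"rice_5kg": 40.0, "pork_500g": 20.0},
    date(2024, 1, 2): {"rice_5kg": 44.0, "pork_500g": 22.0},
    date(2024, 1, 8): {"rice_5kg": 40.0, "pork_500g": 20.0},
}

daily = {day: daily_index(prices, base) for day, prices in observed.items()}
# Group by (ISO year, ISO week) and by (year, month).
weekly = average_by(daily, lambda d: tuple(d.isocalendar()[:2]))
monthly = average_by(daily, lambda d: (d.year, d.month))
```

A production index would also weight items by expenditure share and handle products that appear or disappear; both are left out here.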

Why choose Octoparse?

The iCPI study group needs a large amount of data to support the study, and research members face several obstacles in obtaining it.

First, collecting the data demands considerable technical skill, including writing code. That can be a difficult task for research members who lack programming skills.

Second, the team has to consider the quantity, quality, efficiency, and automation of the data collection process. If one scrapes a website too fast, the website is very likely to block further scraping.
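
For readers who do write code, the usual way to avoid scraping too fast is to enforce a minimum delay between requests. The class below is a minimal sketch of that idea, not a description of how Octoparse paces its requests. The class name, the interval, and the injectable `clock` and `sleep` parameters are assumptions made for illustration and testing.

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
        return now

def crawl(urls, fetch, throttle):
    """Call fetch(url) for each URL, pausing between calls as needed."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

Calling `crawl(urls, fetch, Throttle(min_interval=2.0))` with a fetch function of your own would leave at least two seconds between consecutive requests.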

As a result, team members need to be able to scrape without coding while avoiding being blacklisted.

Octoparse is web scraping software equipped with anti-blocking features, and it requires no coding. Compared to other web scraping software, Octoparse has:

1. Task templates: the most prominent feature of Octoparse. There are 45 task templates covering 12 categories, including social media, e-commerce, Google Scholar, and finance.

2. Built-in browser: it visualizes the scraping process by simulating web browsing. As a result, people with no background in web data structures don’t have to deal with large amounts of HTML code to scrape the web.

3. IP rotation: you need to change your IP address frequently if you intend to get a large amount of data from one website, because the website can detect abnormal activity coming from a single IP address and ban excessive requests from it. IP rotation automatically switches between IP proxies so the scraping doesn’t get suddenly interrupted.

4. User-agent switching: the user agent is the name tag a browser sends with its requests to websites. Some websites are picky about which user agents are allowed access. Octoparse offers 9 user agents to switch between in that situation.

5. Cookie cleansing: it speeds up the extraction process.

6. reCAPTCHA: reCAPTCHA is a technique websites use to decide whether requests come from robots or actual human beings. It usually takes the form of:

  • Choosing images
  • Checking a box that says “I’m not a robot”
  • Typing in characters/numbers

These can be handled in Octoparse’s built-in browser. The idea is that you can do anything in its built-in browser that you can do in your own. Hence, you can pass the reCAPTCHA.
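
To make items 3 and 4 concrete, the sketch below shows how IP rotation and user-agent switching are commonly done by hand: each request is paired with the next proxy and the next User-Agent string in round-robin order. This is an illustration only, not Octoparse's implementation. The proxy addresses, user-agent strings, and function name are placeholders, and no network request is sent.

```python
from itertools import cycle

# Placeholder values: not real proxies and not Octoparse's own lists.
PROXIES = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]

def request_settings(urls, proxies=PROXIES, user_agents=USER_AGENTS):
    """Pair each URL with the next proxy and User-Agent in round-robin order."""
    if not proxies or not user_agents:
        raise ValueError("need at least one proxy and one user agent")
    proxy_pool = cycle(proxies)
    agent_pool = cycle(user_agents)
    for url in urls:
        yield {
            "url": url,
            "proxy": next(proxy_pool),
            "headers": {"User-Agent": next(agent_pool)},
        }

# Build the settings for four hypothetical product pages.
plan = list(request_settings(f"https://shop.example/item/{i}" for i in range(4)))
for entry in plan:
    print(entry["url"], entry["proxy"], entry["headers"]["User-Agent"])
```

Each dictionary can then be handed to whatever HTTP client is in use. With two proxies and three user agents, the combinations repeat every six requests.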
