Web Scraping Tool and Twitter Data Set Processing

Twitter generates huge volumes of micro-blogs over time due to the constant interactions between users. Tweets contain much valuable info that worth mining, because they are real-time news reported and described by the participants evolved in some events, e.g., protests, pandemics, climate disasters, crimes. Besides as a clue of early event detecting, the collective sentiment measured from tweets can also reflect trends in long-lasting social events such as elections or movement in the stock market. Thus, we could make use of the tweets scraped from Twitter for influenza detection events and forecasting.

In our approach, we present a framework to analyze influenza infection, influenza vaccination and the relationship between infection trends and vaccination shots.

Based on tweets scraped from 2011 to 2015, we use the meta-data, content, spatial information and other related information of tweets to detect ongoing influenza events and forecast potential influenza events.

To start with our analysis, what we need to do is to scrape the data from Twitter about Influenza.

Actually, there are many ways to scrape data from Twitter. As well known, Twitter data provides public API for developers to read and get Twitter data conveniently. The REST API identifies Twitter applications and users using OAuth; Then We can utilize twitter REST APIs to get the most recent and popular tweets, And Twitter4j has been imported to crawl twitter data through twitter REST API.

However, this crawling or scraping process could be tough for people without a related API knowledge base. It’s already difficult for some people to use Twitter public API, let alone scrape by programming using Python or Ruby.

Thus, I’d like to share with you one automated web scraper tool that I once used – Octoparse, a powerful visual Windows web data scraper.

To start scarping with it, you need to download this application on your local desktop. As the figure below shows, you can click-and-drag the blocks in the Workflow Designer pane to customize your own task. Octoparse can simulate users’ browsing behaviors and scrape data in a structured way. It provides many advanced but easy-to-use options so that users can learn how to use it ASAP. After you complete your configuration of the task, you can export data in various formats as you need, like CSV, Excel formats, HTML, TXT, and database (MySQL, SQL Server, and Oracle). For more detailed information about this automated scraper tool, you can visit http://www.octoparse.com/ for your reference.

After completed scraping data from Twitter. We need to further analyze it.

Twitter data used in this project is the flu-related tweets from 2011 to 2013 posted in the United States. It contains 1,033,775 tweets, which all own an attribute named “geo-coded”. This attribute consists with the geo-location information associated with this tweet from which “country”, “state”, “city”, “county”, “longitude” and “latitude” are used in this project.

Every piece of tweet is pre-processed through Rosette Linguistics Platform with its natural language processing technology. As a result, every piece of tweet is transformed into an enriched form before acting as input to the methods described below. In an enriched form, lemma, the dictionary form of a word, is used to substitute the original token in tweet. Thus different forms of a word such as “sneeze”, “sneezes”, “sneezed”, and “sneezing” will be reduced to only one word “sneeze”. Additionally, because only content words are required in then text analysis, we remove those function words, e.g., conjunction, determiner, and conjunction which can convey little information. In addition, stop-words are removed from tweets using NLTK stop words corpus.

Twitter, as a new type of social media, has many unique features, which make it hard to be handled by traditional text process technologies.

One feature of Twitter content is heterogeneity. That is, every tweet comprises various types of entities, e.g., user, hash-tag, link, location and text.

Besides, the language on Twitter is very informal. Many words appear in an abbreviated form, such as “u” short for “you”. Because traditional text analysis technologies mainly focus on formal articles, they are not expected to have the same performance in tweet data.

Similarly, traditional machine learning can’t be directly applied to Twitter data, either. One feature of the tweet is its short length, which is up to 140 words. Thus would lead to a sparsity problem in a bag of words model which traditional learning methods will not have a good performance. Another important attribute is that the term Twitter used to describe particular domain changes dramatically as time goes by. For instance, in 2012, the most common term used in twitter to describe Mexican protests was ‘‘#YoSoy132’’, which is the name of the organization protesting against electoral fraud. However, in early 2013, the most common term for Mexican protests was ‘‘#CNTE’’ because of the large ongoing protest against Mexican education reform. It is hard for a text model designed for formal articles to capture this dramatic change perfectly.

In order to handle heterogeneity and changeable description about an event in Twitter, particular in this flu detection domain, we adopt dynamic query expansion (DQE) [14] algorithm to capture key terms dynamically and then use those captured key terms to search flu-related tweets from large volumes of data set. The basic idea of DQE model is to capture the changing of terms by leveraging an assumption that most significant domain-related terms are also the most common words used in tweets to describe this domain. Though DQE also has a key term set, unlike Lasso methods operating on a static keyword set, it can dynamically adapt its key term set according to the term usage in the current Twitter environment.