Web Scraping Tool & Twitter Data Set Processing
Wednesday, March 15, 2017
Twitter generates huge volumes of microblog posts through the constant interaction between its users. Tweets scraped from Twitter contain a great deal of valuable information worth mining, because they are real-time reports written by people involved in events such as protests, pandemics, climate disasters, and crimes. Besides serving as a cue for early event detection, the collective sentiment measured from tweets can also reflect trends in long-running social events such as elections or movements in the stock market. We can therefore use tweets scraped from Twitter for influenza event detection and forecasting.
In our approach, we present a framework for analyzing influenza infection, influenza vaccination, and the relationship between infection trends and vaccination shots. Based on tweets scraped from 2011 to 2015, we use the metadata, content, spatial information, and other related information of tweets to detect ongoing influenza events and forecast potential ones.
To start the analysis, we first need to scrape influenza-related data from Twitter.
There are many ways to scrape data from Twitter. As is well known, Twitter provides a public API that lets developers read and write Twitter data conveniently. The REST API identifies Twitter applications and users using OAuth. We can use the REST API to get the most recent and popular tweets, and Twitter4J has been used to crawl Twitter data through it. However, this crawling or scraping process can be tough for people without the related API knowledge. Some people already find it difficult to use the Twitter public API, let alone to scrape by programming in Python or Ruby.
Thus, I’d like to share one automated web scraper tool that I once used: Octoparse, a powerful visual web data scraper for Windows. To start scraping with it, you need to download the application to your local desktop. As the figure below shows, you can click and drag the blocks in the Workflow Designer pane to customize your own task. Octoparse can simulate users’ browsing behavior and scrape data in a structured way. It provides many advanced but easy-to-use options, so users can learn to use it quickly. After you finish configuring the task, you can export the data in various formats as needed, such as CSV, Excel, HTML, TXT, and databases (MySQL, SQL Server, and Oracle). For more detailed information about this automated scraper tool, you can visit its website at http://www.octoparse.com/.
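Once the scraped tweets have been exported as CSV, they can be loaded for analysis with a few lines of Python. The following is only a minimal sketch: the sample rows and the column names ("user", "text", "created_at") are assumptions made for illustration, not the actual layout of an Octoparse export.

```python
import csv
import io

# Hypothetical stand-in for an exported CSV file. The column names and
# rows are assumptions for illustration only.
SAMPLE_EXPORT = """user,text,created_at
alice,"Down with the flu, staying home today",2013-01-14
bob,Got my flu shot this morning,2013-01-15
"""


def load_tweets(handle):
    """Read exported rows into a list of dicts keyed by column name."""
    return [dict(row) for row in csv.DictReader(handle)]


# With a real export, open the file instead of using the in-memory sample,
# e.g. open(path, newline="", encoding="utf-8").
tweets = load_tweets(io.StringIO(SAMPLE_EXPORT))
for tweet in tweets:
    print(tweet["created_at"], tweet["text"])
```

Using csv.DictReader keeps quoted fields intact, which matters because tweet text often contains commas.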
Once the data has been scraped from Twitter, it needs further analysis. The Twitter data used in this project consists of flu-related tweets posted in the United States from 2011 to 2013. It contains 1,033,775 tweets, all of which carry an attribute named “geo-coded”. This attribute holds the geo-location information associated with the tweet, from which “country”, “state”, “city”, “county”, “longitude”, and “latitude” are used in this project.
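As a rough illustration, the six location fields could be pulled out of each tweet as shown below. The record layout (a dict with a nested "geo-coded" dict) and the sample values are assumptions; the source only names the attribute and the fields used.

```python
from collections import Counter

# The six location fields used in the project.
GEO_FIELDS = ("country", "state", "city", "county", "longitude", "latitude")


def extract_geo(tweet):
    """Return only the location fields of interest; missing ones become None."""
    geo = tweet.get("geo-coded") or {}
    return {field: geo.get(field) for field in GEO_FIELDS}


# Hypothetical sample records; the nesting is an assumed layout.
sample_tweets = [
    {"text": "flu season is rough",
     "geo-coded": {"country": "United States", "state": "Virginia",
                   "city": "Blacksburg", "county": "Montgomery",
                   "longitude": -80.41, "latitude": 37.23}},
    {"text": "got my flu shot",
     "geo-coded": {"country": "United States", "state": "Virginia",
                   "city": "Richmond"}},
    {"text": "home sick today",
     "geo-coded": {"country": "United States", "state": "Texas"}},
]

locations = [extract_geo(t) for t in sample_tweets]

# Count tweets per state, a typical first step for spatial analysis.
tweets_per_state = Counter(loc["state"] for loc in locations)
print(tweets_per_state)
```

Filling missing fields with None keeps every record the same shape, so later steps do not need to check which keys exist.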
Every tweet is pre-processed with the Rosette Linguistics Platform and its natural language processing technology. As a result, every tweet is transformed into an enriched form before serving as input to the methods described below. In the enriched form, each original token is replaced by its lemma, the dictionary form of the word. Thus different forms of a word such as “sneeze”, “sneezes”, “sneezed”, and “sneezing” are all reduced to the single word “sneeze”. Additionally, because only content words are required in the text analysis, we remove function words, e.g., conjunctions and determiners, which convey little information. Stop words are also removed from tweets using the NLTK stop words corpus.
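The enrichment step can be sketched as follows. This is not the Rosette pipeline: the tiny lemma table and stop-word set below are hand-written stand-ins, assumed only for illustration, so that the example runs without external services or corpus downloads.

```python
import re

# Toy stand-ins (assumptions). In the project, lemmas come from the Rosette
# Linguistics Platform and stop words from the NLTK stop words corpus
# (nltk.corpus.stopwords.words("english")).
LEMMAS = {"sneezes": "sneeze", "sneezed": "sneeze", "sneezing": "sneeze"}
STOP_WORDS = {"a", "again", "and", "he", "i", "she", "the", "then"}


def enrich(text):
    """Lowercase, tokenize, replace tokens by lemmas, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    return [token for token in lemmas if token not in STOP_WORDS]


print(enrich("She sneezed and then sneezes again"))
```

Lemmatizing before filtering means an inflected form is matched against the stop list by its dictionary form, which keeps the two lookups consistent.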
Twitter, as a new type of social media, has many unique features that make it hard to handle with traditional text processing technologies. One feature of Twitter content is heterogeneity: every tweet comprises various types of entities, e.g., user, hashtag, link, location, and text. In addition, the language on Twitter is very informal, and many words appear in abbreviated form, such as “u” for “you”. Because traditional text analysis technologies mainly focus on formal articles, they are not expected to perform as well on tweet data.
Similarly, traditional machine learning cannot be applied directly to Twitter data. One feature of a tweet is its short length, which is limited to 140 characters. This leads to a sparsity problem in a bag-of-words model, on which traditional learning methods do not perform well. Another important attribute is that the terms Twitter users choose to describe a particular domain change dramatically over time. For instance, in 2012, the most common term used on Twitter to describe Mexican protests was “#YoSoy132”, which is the name of the organization protesting against electoral fraud. However, in early 2013, the most common term for Mexican protests was “#CNTE”, because of the large ongoing protest against Mexican education reform. It is hard for a text model designed for formal articles to capture such a dramatic change.
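The sparsity problem is easy to see on a toy example. The sketch below builds a plain bag-of-words matrix for three made-up tweets (the texts are assumptions for illustration) and measures the fraction of zero entries.

```python
def bag_of_words(docs):
    """Build a vocabulary and one term-count vector per document."""
    vocab = sorted({token for doc in docs for token in doc.lower().split()})
    index = {token: i for i, token in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vector = [0] * len(vocab)
        for token in doc.lower().split():
            vector[index[token]] += 1
        vectors.append(vector)
    return vocab, vectors


def sparsity(vectors):
    """Fraction of zero entries across all vectors."""
    total = sum(len(vector) for vector in vectors)
    zeros = sum(vector.count(0) for vector in vectors)
    return zeros / total


# Made-up short texts standing in for tweets.
sample = ["flu shot today", "so sick cant sleep", "fever and cough all night"]
vocab, vectors = bag_of_words(sample)
print(len(vocab), round(sparsity(vectors), 2))
```

Even with three texts, two thirds of the matrix is zero. With a million tweets the vocabulary grows into the tens of thousands while each tweet still holds only a handful of words, so almost every entry is zero.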
In order to handle the heterogeneity and the changing descriptions of an event on Twitter, particularly in this flu detection domain, we adopt the dynamic query expansion (DQE) algorithm to capture key terms dynamically and then use those key terms to search for flu-related tweets in the large data set. The basic idea of the DQE model is to capture changes in terminology by leveraging the assumption that the most significant domain-related terms are also the most common words used in tweets to describe the domain. Although DQE also maintains a key term set, unlike Lasso methods, which operate on a static keyword set, it can dynamically adapt its key term set according to term usage in the current Twitter environment.
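The idea of adapting the key term set can be illustrated with a deliberately simplified sketch. It is not the DQE algorithm itself: it only alternates between selecting tweets that contain a current key term and re-picking the most common terms in those tweets, until the set stops changing. The function name, the top_k and max_rounds parameters, and the sample tweets are all assumptions made for this example.

```python
from collections import Counter


def expand_query(tweets, seed_terms, top_k=3, max_rounds=10):
    """Iteratively adapt a key term set from term usage in matching tweets.

    tweets: list of token lists (already lemmatized, stop words removed).
    Returns the final key term set and the tweets that match it.
    """
    key_terms = set(seed_terms)
    for _ in range(max_rounds):
        # Select tweets sharing at least one term with the current key set.
        matched = [tokens for tokens in tweets if key_terms & set(tokens)]
        # Count in how many matched tweets each term occurs.
        counts = Counter(tok for tokens in matched for tok in set(tokens))
        # Rank by frequency, breaking ties alphabetically for determinism.
        ranked = sorted(counts, key=lambda term: (-counts[term], term))
        new_terms = set(ranked[:top_k])
        if new_terms == key_terms:
            break  # the key term set is stable
        key_terms = new_terms
    matched = [tokens for tokens in tweets if key_terms & set(tokens)]
    return key_terms, matched


# Made-up, pre-processed tweets for illustration.
sample = [
    ["flu", "fever", "home"],
    ["flu", "fever", "cough"],
    ["fever", "cough", "bed"],
    ["cough", "fever", "tamiflu"],
    ["football", "game", "tonight"],
]
terms, flu_tweets = expand_query(sample, seed_terms={"flu"})
print(sorted(terms), len(flu_tweets))
```

Starting from the single seed "flu", the key set grows to include "fever" and "cough", which in turn retrieves two tweets that never mention "flu", while the unrelated football tweet stays out. A static keyword list would have missed those two tweets.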
Author: The Octoparse Team