
Scraping Twitter and Sentiment Analysis using Python

This is a use case of scraping Twitter for sentiment analysis. Let’s start with… Donald Trump.

I am not a big fan of Donald Trump. Frankly, I don’t like him at all. However, he has a knack for creating a sensation: his name fills newspapers and social media all the time. People’s attitudes towards him are dramatic and polarized. The words used to describe him are either highly positive or highly negative, which makes them perfect material for web scraping and sentiment analysis.

The goal of this workshop is to use a web scraping tool to scrape tweets about Donald Trump. Then we conduct a sentiment analysis using Python to find out what the public says about the President. Finally, we visualize the data using Tableau Public.

You Should Keep Reading:

  1. If you don’t know how to scrape content or comments on social media.
  2. And/or if you know Python but don’t know how to use it for sentiment analysis.

Let’s start with scraping using Octoparse. Download the newest version from the official website and finish registration by following the instructions. After you log in, open the built-in Twitter template.

Tweet Data Extracted in the Scraper

  1. Name
  2. Publish time
  3. Content
  4. Image URL
  5. Tweet URL
  6. Numbers of comments, retweets, and likes

Enter “Donald Trump” in the Parameter field to tell the crawler the keyword. It was as simple as it sounds, and I got about 10k tweets. You can scrape as many tweets as you like. After getting the tweets, export the data as a text file and name it “data.txt”.

Sentiment Analysis using Python

Before getting started, make sure you have Python and a text editor installed on your computer. I use Python 2.7 and Notepad++.

Then we use two opinion word lists to analyze the scraped tweets. You can download them from here. These two lists contain positive and negative words (sentiment words) compiled by Minqing Hu and Bing Liu in their research on opinion mining.

The idea here is to take each opinion word from the lists, go back to the tweets, and count how often each opinion word appears there. As a result, we get the opinion words that occur in the tweets together with their counts.

First, create a positive list and a negative list from the two downloaded word lists. They store all the words parsed from the text files.

Then, preprocess the text by stripping out all punctuation, signs, and numbers with the following code:
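One way to build the two lists is sketched below. It assumes each downloaded file is plain text with one word per line and a comment header whose lines start with “;”. The function name load_word_list and the file names in the usage comment are illustrative, not taken from the original code.

```python
def load_word_list(path):
    """Return the opinion words stored one per line in a lexicon file."""
    words = []
    # latin-1 is an assumption: it can decode any byte, so stray accented
    # characters in a lexicon file cannot raise a decoding error.
    with open(path, encoding="latin-1") as f:
        for line in f:
            word = line.strip().lower()
            # Skip blank lines and comment lines that start with ';'.
            if word and not word.startswith(";"):
                words.append(word)
    return words


# Usage (file names are assumptions; use the names of the files you downloaded):
# positive = load_word_list("positive-words.txt")
# negative = load_word_list("negative-words.txt")
```

This sketch targets Python 3, where open() accepts an encoding argument.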

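# process the tweets
import re

with open('data.txt') as f:
    txt = f.read()

# remove punctuation, signs, digits, possessive 's, and stray apostrophes
txt = re.sub(r'[,.()":;!@#$%^&*\d]|\'s|\'', '', txt)

# turn newlines into spaces, collapse double spaces, lowercase, and split into words
word_list = txt.replace('\n', ' ').replace('  ', ' ').lower().split(' ')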
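To try the cleaning step on a single string instead of a file, it can be wrapped in a small function. This is a sketch, not the original code: the name tokenize is hypothetical, and it uses split() with no argument instead of split(' ') so that runs of spaces never produce empty tokens.

```python
import re

# Punctuation, signs, digits, possessive 's, and stray apostrophes.
STRIP_PATTERN = re.compile(r'[,.()":;!@#$%^&*\d]|\'s|\'')


def tokenize(text):
    """Strip punctuation and digits, lowercase, and split on whitespace."""
    cleaned = STRIP_PATTERN.sub('', text)
    # split() with no argument treats any run of spaces or newlines
    # as a single separator.
    return cleaned.lower().split()


print(tokenize("Trump's rally was GREAT!\nSo many lies, 100% wrong."))
# ['trump', 'rally', 'was', 'great', 'so', 'many', 'lies', 'wrong']
```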

As a result, the data consists only of tokenized words, which makes it easier to analyze. Afterward, create three dictionaries: word_count_dict, word_count_positive, and word_count_negative.

Next, fill each dictionary. If an opinion word exists in the data, count it by increasing its word_count_dict value by 1.

After counting, decide whether the word is positive or negative. If it is a positive word, its value in word_count_positive increases by 1; otherwise the positive dictionary stays unchanged. Likewise, word_count_negative either increases its value or stays the same. If the word is in neither the positive nor the negative list, skip it.
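The counting logic described above might look like the following sketch. The three dictionary names come from the article; the function name count_opinion_words and the sample data are assumptions made for illustration.

```python
def count_opinion_words(word_list, positive, negative):
    """Count opinion words overall and split by polarity."""
    # Sets make the membership test fast for long word lists.
    positive_set = set(positive)
    negative_set = set(negative)

    word_count_dict = {}
    word_count_positive = {}
    word_count_negative = {}

    for word in word_list:
        if word in positive_set:
            polarity_counts = word_count_positive
        elif word in negative_set:
            polarity_counts = word_count_negative
        else:
            # Not an opinion word: skip it.
            continue
        word_count_dict[word] = word_count_dict.get(word, 0) + 1
        polarity_counts[word] = polarity_counts.get(word, 0) + 1

    return word_count_dict, word_count_positive, word_count_negative


# Tiny made-up sample, not data from the scraped tweets.
tokens = ["great", "lies", "great", "the", "racist"]
all_counts, pos_counts, neg_counts = count_opinion_words(
    tokens, ["great", "like"], ["lies", "racist"]
)
print(all_counts)  # {'great': 2, 'lies': 1, 'racist': 1}
print(pos_counts)  # {'great': 2}
print(neg_counts)  # {'lies': 1, 'racist': 1}
```

In this sketch, a word that appears in both lists is counted as positive only, because the positive list is checked first.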

Polarity: Positive vs. Negative

As a result, I got 5352 negative words and 3894 positive words. I saved the list under a name of my choice, opened it with Tableau Public, and built a bubble chart. If you don’t know how to use Tableau Public to create a bubble chart, click here.

The use of positive words is narrow: only 404 distinct positive words are used. The most frequent are, for example, “like”, “great”, and “right”. Most word choices are basic and colloquial, like “wow” and “cool”, whereas the use of negative words is much more varied. There are 809 distinct negative words, most of them formal and advanced. The most frequently used are “illegal”, “lies”, and “racist”. Other advanced words such as “delinquent”, “inflammatory”, and “hypocrites” are also present.

The choice of words indicates that the level of education of those who are supportive is lower than that of those who disapprove. Apparently, Donald Trump is not very welcome among Twitter users.
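The saving step is not shown in the article. One possible way to write the counts to a file that a charting tool can open is a CSV with one row per word, as sketched here. The function name save_counts, the column names, and the output file name are all assumptions.

```python
import csv


def save_counts(path, word_count_positive, word_count_negative):
    """Write one row per opinion word: word, count, polarity."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count", "polarity"])
        groups = (
            ("positive", word_count_positive),
            ("negative", word_count_negative),
        )
        for polarity, counts in groups:
            # Most frequent words first; ties are broken alphabetically.
            ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
            for word, count in ordered:
                writer.writerow([word, count, polarity])


# Usage (the file name is an assumption):
# save_counts("opinion_words.csv", word_count_positive, word_count_negative)
```

With this layout, the count column can drive the bubble size and the polarity column the bubble color.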
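Figures like the number of distinct words, the total count, and the most frequent words can be read off a count dictionary. The sketch below shows one way to compute them; the function name summarize is hypothetical and the sample counts are made up, not the article's data.

```python
def summarize(counts, top_n=3):
    """Return (distinct words, total occurrences, top_n most frequent words)."""
    distinct = len(counts)
    total = sum(counts.values())
    # Sort by descending count; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    top_words = [word for word, _ in ordered[:top_n]]
    return distinct, total, top_words


# Made-up counts for illustration only.
sample = {"like": 5, "great": 4, "right": 3, "wow": 1}
print(summarize(sample))  # (4, 13, ['like', 'great', 'right'])
```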

Summary

In this article, we talked about how to scrape tweets from Twitter using Octoparse. We also discussed how to preprocess the text data and analyze positive and negative opinion words expressed on Twitter using Python. The complete version of the code is available here: https://gist.github.com/octoparse/fd9e0006794754edfbdaea86de5b1a51
