
Web Data Crawling & "Bag-Of-Words" for Data Mining

Tuesday, January 19, 2021

Sentiment analysis, also known as opinion mining, is a form of data mining that explores people's preferences and attitudes toward various topics. With the explosion of data across social media platforms such as Twitter and Facebook, more and more of this data can be collected by crawling websites. As a result, data mining is now widely used across industries for many different purposes.

Sentiment analysis starts with crawling high-quality data. There are several ways to crawl a website. Some sites, such as Facebook and Twitter, provide public APIs for accessing their data. However, to collect data that is not available through those public APIs, you have to build your own crawler, either by programming or by using an automated web crawler tool such as Octoparse or Import.io.

We encourage people to crawl websites by programming in Ruby or Python. But people without coding skills, or those who want to save time, can choose a professional automated crawler tool such as Octoparse, which I recommend based on my own experience.


Table of Contents

Data Crawling

Python: “Bag-of-words” Model

Generation of Sentiment Words

Data Evaluation




Note: The guide is based on the old version of Octoparse. Check out our Help Center and Blog to explore the guide for our latest version!

Data Crawling

We now want to run an experiment with the "Bag-of-words" model. To start our analysis, we first need to collect data. As mentioned above, there are several ways to crawl website data. In this post, I suggest Octoparse, an automated web crawler tool, as a convenient way to crawl the data you need.


Octoparse is a Windows desktop scraper tool. It provides many advanced crawling services for extracting data from websites and can export the crawled results to various formats. The UI of Octoparse is user-friendly, as you can see in the figure above. You can build a task simply by pointing and clicking. The tool can simulate users' browsing behavior and offers more advanced options such as AJAX Timeout, ad blocking, proxy service, and cloud service. It can greatly reduce the time you spend collecting data.


After the scraping task is complete, whether done by programming or with a scraper tool like Octoparse, we randomly select 8 books along with their customer reviews.


Then we can score the selected books and visualize the scores in a column chart, as shown in the graph below. The distribution of scores looks reasonable: books with above-average scores have very few 1-star comments, while books with below-average scores show a distinct trend of their own. From the data we crawled, we can conclude that Gone Girl suits our training data set well, while Unbroken does not, since it has almost no 1-star comments.
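As a rough illustration of this check, the sketch below computes the share of each star rating per book. The sample data and the function name star_distribution are hypothetical and only stand in for the crawled reviews.

```python
from collections import Counter

# Hypothetical sample: (book title, star rating) pairs standing in for crawled reviews.
reviews = [
    ("Gone Girl", 5), ("Gone Girl", 1), ("Gone Girl", 3), ("Gone Girl", 1),
    ("Gone Girl", 4), ("Unbroken", 5), ("Unbroken", 5), ("Unbroken", 4),
]


def star_distribution(reviews):
    """Return {title: {star: share of that title's reviews}} for stars 1-5."""
    counts = {}
    for title, stars in reviews:
        counts.setdefault(title, Counter())[stars] += 1
    return {
        title: {star: c[star] / sum(c.values()) for star in range(1, 6)}
        for title, c in counts.items()
    }


distribution = star_distribution(reviews)
for title, shares in distribution.items():
    print(title, {star: round(share, 2) for star, share in shares.items()})
```

A book whose 1-star share is close to zero gives the model almost no negative examples to learn from, which is why such a book is a poor choice for training.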




Python: “Bag-of-words” Model

The "Bag-of-words" model works well for topic classification but is less accurate for sentiment analysis. A sentiment analysis study of movie reviews conducted by Bo Pang and Lillian Lee in 2002 reported an accuracy of only 69%. However, with Naive Bayes, Maximum Entropy, or Support Vector Machines, three common text classifiers, the accuracy rises to around 80%.

We still choose the "Bag-of-words" model because it helps us understand the text content in depth, and because the three commonly used classifiers are also based on it, so it can be seen as an intermediate method.

So far, much of NLP (Natural Language Processing) has been devoted to bag-of-words approaches, mostly applied in statistics-based machine learning. Some people may not be familiar with the concept of “bag-of-words”. So what does it really mean? The object of NLP is natural language text: comments, reviews, corpora, documents, posts, and other discourse can all serve as input to an NLP system. The first step after input is tokenization. In other words, a bag-of-words model tokenizes the input text and then processes those tokens with statistical models.

After crawling the web data with Octoparse, we divide the comments into a “Training Data Set” and a “Testing Data Set”. There are around 40000 comments for “Gone Girl” in total. We use half of them as the “Training Data Set” and the rest as the “Testing Data Set”. In addition, to see how the size of the training data set affects the results, we vary it from 1000 to 20000 comments.
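One way to set up such a split is sketched below. The function name split_half, the fixed random seed, and the step of 1000 between training sizes are assumptions; the text only gives the half-and-half split and the 1000 to 20000 range.

```python
import random


def split_half(comments, seed=0):
    """Shuffle a copy of the comments and split it into two halves."""
    shuffled = list(comments)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Hypothetical stand-in for the roughly 40000 crawled comments.
comments = [f"comment {i}" for i in range(40000)]
training_pool, testing_set = split_half(comments)

# Training subsets of increasing size, all drawn from the training half.
# The step of 1000 is an assumption; the text only gives the range.
training_subsets = {size: training_pool[:size] for size in range(1000, 20001, 1000)}
```

Shuffling before splitting avoids any ordering in the crawled data (for example, newest reviews first) leaking into one of the two sets.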

“Bag-of-words” keeps track of the number of occurrences of each word to build a unigram model of the text, which is then used as the feature for the text classifier. In this model, words are analyzed separately and each is assigned a subjectivity score. If the sum of the scores is lower than a threshold, we conclude that the text is negative; otherwise it is positive. “Bag-of-words” may sound simple, but it is not very accurate because it ignores grammar and word order. To improve it, we can combine the unigram model with a bigram model: that is, we do not split “not”, “no”, “very”, “just”, and similar words from the word next to them. The pseudocode to build a “Bag-of-words” is as below.

list_BOW = []

For each review in the training set:
    Strip the newline character "\n" at the end of the review.
    Place a space before and after each of the following characters: .,()[]:;"  (This prevents sentences like "I like this book.It is engaging" from being interpreted as ["I", "like", "this", "book.It", "is", "engaging"].)
    Tokenize the text by splitting it on spaces.
    Remove tokens which consist of only a space, an empty string, or punctuation marks.
    Append the tokens to list_BOW.

list_BOW now contains all words occurring in the training set.

Place list_BOW in a Python Counter object. This counter now contains all occurring words together with their frequencies. Its entries can be sorted with the most_common() method.
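The steps above could be written in Python roughly as follows. This is a minimal sketch, not the original implementation: the names tokenize, merge_modifiers, and build_bow are hypothetical, and joining a modifier to the word after it is only one possible reading of the unigram-plus-bigram idea.

```python
from collections import Counter
import string

SPLIT_CHARS = '.,()[]:;"'
# Modifiers to keep attached to the following word (taken from the list in the text).
MODIFIERS = {"not", "no", "very", "just"}


def tokenize(review):
    """Split one review into tokens, following the pseudocode steps."""
    text = review.rstrip("\n")
    for char in SPLIT_CHARS:
        text = text.replace(char, f" {char} ")
    tokens = text.split(" ")
    # Drop empty strings and tokens made up only of punctuation.
    return [t for t in tokens if t.strip(string.punctuation + " ")]


def merge_modifiers(tokens):
    """Assumed bigram rule: join a modifier with the word after it ('not engaging')."""
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i].lower() in MODIFIERS and i + 1 < len(tokens):
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_bow(reviews, use_bigrams=False):
    """Collect the tokens of all reviews and count them."""
    list_bow = []
    for review in reviews:
        tokens = tokenize(review)
        list_bow.extend(merge_modifiers(tokens) if use_bigrams else tokens)
    return Counter(list_bow)


# Hypothetical training reviews.
training_reviews = ["I like this book.It is engaging\n", "It is not engaging.\n"]
bow = build_bow(training_reviews)
print(bow.most_common(3))
```

Calling build_bow(training_reviews, use_bigrams=True) keeps "not engaging" as a single entry, so the negation is not lost when the words are counted.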


Generation of Sentiment Words

This raises the question of how to relate the sentiment score of each word to the sentiment score of the whole text. Here, we propose using statistics derived from the training data set to assign a subjectivity score to each word. To do so, we need to count how often each word occurs in each class, which can be done with a Pandas DataFrame as the data container (a dictionary or another data structure would work as well). Code is as below.

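import pandas as pd

# Assumed to be defined elsewhere: training_set, an iterable of dicts with an
# integer 'score' (1-5 stars) and a 'review_text' string.
scores = [1, 2, 3, 4, 5]  # assumed class labels, one column per star rating


def strip_quotations_newline(text):
    # Assumed helper (not shown in the original): drop the trailing newline
    # and any surrounding double quotes.
    return text.rstrip("\n").strip('"')


def expand_around_chars(text, characters):
    # Surround each listed character with spaces so it splits off as its own token.
    for char in characters:
        text = text.replace(char, " " + char + " ")
    return text


def split_text(text):
    text = strip_quotations_newline(text)
    text = expand_around_chars(text, '".,()[]{}:;')
    splitted_text = text.split(" ")
    # Drop empty strings and single-character tokens, including lone punctuation.
    cleaned_text = [x for x in splitted_text if len(x) > 1]
    text_lowercase = [x.lower() for x in cleaned_text]
    return text_lowercase


BOW_df = pd.DataFrame(0, columns=scores, index=[])
words_set = set()
for review in training_set:
    score = review['score']
    text = review['review_text']
    splitted_text = split_text(text)
    for word in splitted_text:
        if word not in words_set:
            # First occurrence of this word: add a row of zero counts.
            words_set.add(word)
            BOW_df.loc[word] = [0, 0, 0, 0, 0]
        # Count the word in the column for this review's star rating.
        BOW_df.loc[word, score] += 1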


The output, shown below, contains the number of occurrences of each word in each class.




By dividing the number of occurrences of each word by the total number of words that occurred, we get the relative occurrence of each word within each class. From this, we can build a list of sentiment words from the training data set and use it to evaluate the comments in the Testing Data Set.
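The sketch below illustrates this normalization and one way to turn it into a classifier. It uses plain dictionaries in place of a DataFrame to stay dependency-free, and the counts are hypothetical. The scoring rule (share in the 4 and 5 star classes minus share in the 1 and 2 star classes) and the threshold of 0 are assumptions made for illustration, since the text does not spell out the exact formula.

```python
# Hypothetical occurrence counts per word for star classes 1..5,
# in the same shape as the table built from the training set.
counts = {
    "great":  [1, 1, 2, 6, 10],
    "boring": [8, 5, 2, 1, 0],
    "book":   [11, 14, 16, 13, 10],
}
STARS = [1, 2, 3, 4, 5]


def relative_frequencies(counts):
    """Divide each count by the total number of word occurrences in its class."""
    totals = [sum(row[i] for row in counts.values()) for i in range(len(STARS))]
    return {
        word: [c / t if t else 0.0 for c, t in zip(row, totals)]
        for word, row in counts.items()
    }


def word_scores(rel_freq):
    """Assumed rule: share in the 4-5 star classes minus share in the 1-2 star classes."""
    return {w: (f[3] + f[4]) - (f[0] + f[1]) for w, f in rel_freq.items()}


def classify(tokens, scores, threshold=0.0):
    """Sum the scores of known words; unknown words contribute nothing."""
    total = sum(scores.get(t, 0.0) for t in tokens)
    return "positive" if total >= threshold else "negative"


rel_freq = relative_frequencies(counts)
sentiment = word_scores(rel_freq)
print(classify(["great", "book"], sentiment))
print(classify(["boring", "book"], sentiment))
```

Words that appear about equally often in every class end up with a score near zero, so they have little influence on the final label.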


Data Evaluation

Using the “Bag-of-words” model, we can determine whether a comment is negative or positive with an accuracy above 60%, treating ‘4’ or ‘5’ stars as positive and ‘1’ or ‘2’ stars as negative, as you can see in the table below.
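A minimal sketch of how such an accuracy figure could be computed is given here. The function names and the sample data are hypothetical, and leaving 3-star reviews out of the evaluation is an assumption, since the text only assigns labels to 1, 2, 4, and 5 stars.

```python
def star_to_label(stars):
    """Map 4-5 stars to 'positive' and 1-2 stars to 'negative'; 3 stars gets no label."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None


def accuracy(star_ratings, predictions):
    """Share of labelled reviews whose predicted sentiment matches the star-based label."""
    pairs = [
        (star_to_label(s), p)
        for s, p in zip(star_ratings, predictions)
        if star_to_label(s) is not None  # assumed: skip 3-star reviews
    ]
    if not pairs:
        return 0.0
    return sum(1 for truth, pred in pairs if truth == pred) / len(pairs)


# Hypothetical test ratings and classifier outputs.
star_ratings = [5, 4, 1, 2, 3, 5]
predictions = ["positive", "negative", "negative", "negative", "positive", "positive"]
result = accuracy(star_ratings, predictions)
print(result)
```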



Author: The Octoparse Team
