How Web Scraping Helps in the News Media

In the digital era, people who work in news media face increasing competitive pressure. Good content brings attention. Attention brings ads. Ads mean cash. Revenue from digital advertising has climbed sharply in recent years. Such profit-oriented practice distorts the definition of good news media. Think about YouTube influencers. There is nothing wrong with being an influencer; in fact, influencers foster freedom of expression. However, if an influencer passes the wrong message to the audience, there are social consequences and backlash.

Today, let’s use web scraping to extract content from news media, and then analyze the speech and language using Python. Finally, we will see how politically slanted coverage can mislead the audience.
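As a rough illustration of the extraction step, here is a minimal sketch that pulls paragraph text out of an article page using only Python's standard library. The HTML snippet, the `ParagraphExtractor` class, and the `extract_paragraphs` function are made up for illustration. A real news page would have to be downloaded first and would likely need site-specific handling.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False  # True while we are inside a <p> element
        self._buffer = []           # text pieces of the current paragraph
        self.paragraphs = []        # finished paragraphs, in page order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the list of paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up sample page standing in for a downloaded article.
SAMPLE_HTML = """
<html><body>
  <h1>Sample headline</h1>
  <p>First paragraph about a <b>5G</b> rollout.</p>
  <div class="ad">Buy now!</div>
  <p>Second   paragraph with
     extra whitespace.</p>
</body></html>
"""

paragraphs = extract_paragraphs(SAMPLE_HTML)
for paragraph in paragraphs:
    print(paragraph)
```

Only text inside paragraph tags is kept, so the headline and the ad block are dropped. That is usually the first cleaning step before any word counting.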

Let’s take 5G as an example. In the beginning, it was just a technology. Then the news media stepped in, and a potentially revolutionary change triggered an international conflict between two superpowers. From the most advanced technology to “threats” and “theft”, how does the news media portray HUAWEI and lead Americans to align uniformly with that framing?

What is 5G and how fast is it?

5G is the next generation of cellular network connections. We all hear that 5G is a lot faster than 4G LTE, but how fast is it really? Let’s be more concrete. 4G LTE uses relatively low frequency bands. In comparison, 5G can use extremely high frequencies, ranging between 30 and 300 GHz. That said, 5G is claimed to support 1,000 more devices per meter than 4G can. Speed-wise, 5G can be 20 times faster than 4G. In the time it takes you to download one movie with 4G, you can download 20 movies with 5G.
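To make the “20 times faster” claim concrete, here is a small back-of-the-envelope calculation in Python. The movie size and the 4G speed are assumed round numbers chosen for illustration, not measurements. Only the 20x ratio comes from the paragraph above.

```python
# Assumed figures for illustration only (not measurements).
MOVIE_SIZE_GB = 4.0    # assumed movie size in gigabytes
SPEED_4G_MBPS = 50.0   # assumed 4G download speed in megabits per second
SPEEDUP_5G = 20        # "20 times faster", as stated in the text


def download_seconds(size_gb, speed_mbps):
    """Time to download size_gb gigabytes at speed_mbps megabits per second."""
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 byte = 8 bits
    return size_megabits / speed_mbps


time_4g = download_seconds(MOVIE_SIZE_GB, SPEED_4G_MBPS)
time_5g = download_seconds(MOVIE_SIZE_GB, SPEED_4G_MBPS * SPEEDUP_5G)

# Number of movies 5G can fetch in the time 4G needs for one.
movies_on_5g = time_4g / time_5g

print(f"4G: {time_4g:.0f} s per movie")
print(f"5G: {time_5g:.0f} s per movie")
print(f"Movies on 5G during one 4G download: {movies_on_5g:.0f}")
```

Whatever movie size and 4G speed you assume, the ratio stays at 20, because both downloads move the same amount of data.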

What does the media say about HUAWEI?

I scraped the news content related to HUAWEI from Reuters and CNN and analyzed the attitude and choice of words to see how biased a news media company can be.

The word diagrams are clustered by the number of occurrences. The most frequently used negative words in Reuters are “fell”, “concerns” and “risk”. In comparison, “concerns”, “risk” and “death” are the most frequently used negative words on CNN. This is understandable, since HUAWEI is depicted and considered a “security threat” to America. When you look closer at the word diagrams and pay attention to the differences, it’s not hard to see that CNN is a bit more biased. Other frequent words in CNN news include “bad”, “fears”, “criminal” and “fraud”. In contrast, Reuters is more neutral. Words used by Reuters include “dispute”, “fall”, “losses”, “lost” and “difficult”.
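The comparison above boils down to counting how often words from a negative-word list appear in each outlet's articles. The sketch below shows one way to do that in Python. The two sample texts, the outlet labels, and the short `NEGATIVE_WORDS` list are invented for illustration. They are not the scraped data or the lexicon behind the diagrams.

```python
import re
from collections import Counter

# Tiny illustrative lexicon; a real analysis would use a full sentiment lexicon.
NEGATIVE_WORDS = {
    "concerns", "risk", "fell", "bad", "fears", "fraud", "dispute", "losses",
}

# Invented placeholder texts, NOT real article content.
SAMPLE_ARTICLES = {
    "outlet_a": (
        "Shares fell amid concerns over risk. "
        "The dispute led to losses, and concerns remain."
    ),
    "outlet_b": (
        "Fears of fraud grow. "
        "Bad news adds to concerns about risk, and fears persist."
    ),
}


def negative_word_counts(text, lexicon=NEGATIVE_WORDS):
    """Count occurrences of lexicon words in text, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word in lexicon)


for outlet, text in SAMPLE_ARTICLES.items():
    counts = negative_word_counts(text)
    print(outlet, counts.most_common(3))
```

Raw counts like these depend on how much text each outlet published, so for a fairer comparison you may want to divide by the total number of words per outlet.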

CNN’s audience pool is much larger than that of Reuters. According to Columbia Journalism Review, after reviewing over 1.23 million articles published and shared, the most frequently shared news sources among both Twitter and Facebook users are the New York Times, CNN, and Breitbart. However, news media that hold a neutral point of view, like Reuters, are shared by only a small fraction of the audience pool in comparison with CNN. As a result, the words used by CNN can make a huge difference in how such an international issue is perceived by a large population. That said, if CNN associates HUAWEI with more critically biased words like “bad” and “criminal”, the American population is likely to shift toward similar attitudes.

Political implication

The chart shows the most frequently shared media sources among Twitter users who retweeted either Trump or Clinton. Among users sharing CNN, more retweet Clinton than Donald Trump. Fox News leans even more distinctly toward Donald Trump. However, news media with a neutral point of view, which hold smaller shares of the audience pool, don’t appear in the chart.

I also looked at Trish Regan, who hosts a prime-time show on Fox Business Network. As a well-known media personality, she was accused of being biased. Prompted by Trish Regan’s attitude of denial, I scraped her tweets about HUAWEI using Octoparse.

I got 800 tweets. The words used are shown below:

Frequently used words such as “wrong”, “brutal”, “hypocrisy” and “bad” are emotional and biased. Some of the noteworthy comments and tweets look like these:

There is nothing wrong with showing emotion and feelings, but it is wrong when a media host becomes an agitator by sharing biased ideas.

Twitter has strict rules regarding posted messages. When I scan through the comments, I can see that a lot of them were deleted due to explicit vulgarity. Considering that it’s impossible to wipe out every insinuating or defamatory message, we should walk away rather than join such a meaningless fight.

What can we do to stop hate speech on social media?

Understand the rationale behind the acts: knowing the language and speech that stir up the fire helps us keep it from spreading.
Counter-speech research: the goal is to identify who perpetrates harmful speech, not to fight back in kind.