How Web Scraping Helps in the News Media

Friday, May 31, 2019

In the digital era, people who work in news media face increasing competitive pressure. Good content brings attention. Attention brings ads. Ads mean cash. Revenue from digital advertising has climbed sharply in recent years. Such profit-oriented practice distorts what it means to be a good news outlet. Think about YouTube influencers. There is nothing wrong with being an influencer; in fact, influencers foster freedom of expression. However, if an influencer passes the wrong message to their audience, there are social consequences and backlash.

Today, let’s use web scraping to extract news content from news media, and then analyze the language using Python. In the end, we will be able to see how politically slanted coverage can mislead the audience.

Let’s take 5G as an example. In the beginning, it was just a technology; then the news media stepped in. A potentially revolutionary change triggered an international conflict between two superpowers. From the most advanced technology to “threats” and “theft”: how does the news media portray HUAWEI, and how does it lead Americans to align uniformly with that framing?

 

What is 5G and how fast is it?

5G is the next generation of cellular network connection. We all hear that 5G is a lot faster than 4G LTE, but how fast is it really? Let’s be more concrete. 4G LTE uses relatively low frequency bands; in comparison, 5G can use much higher frequencies, in the 30-300 GHz range. That said, 5G can support 1,000 more devices within one meter than 4G can. Speed-wise, 5G can be 20 times faster than 4G: in the time it takes to download one movie over 4G, you could download 20 movies over 5G.

 

What does the Media say about HUAWEI?

I scraped the news content related to HUAWEI from Reuters and CNN and analyzed the attitude and choice of words to see how biased a news media company can be. 
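
The post does not show how the article text was pulled out of each page, so here is only a minimal sketch of that extraction step, assuming the article body sits in ordinary `<p>` tags. It uses Python's standard `html.parser` module; the `ParagraphExtractor` class, the `extract_paragraphs` helper and the sample markup are all hypothetical, and real Reuters or CNN pages would need their own selectors.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements of an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False  # True while we are inside a <p>
        self._chunks = []           # text pieces of the current paragraph
        self.paragraphs = []        # finished, whitespace-normalized paragraphs

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        # Text inside nested inline tags (<b>, <a>, ...) is kept as well.
        if self._in_paragraph:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Return the paragraph texts of an HTML document as a list of strings."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical sample markup, not taken from any real news site.
SAMPLE_HTML = """
<html><body>
  <h1>Sample headline</h1>
  <p>Shares <b>fell</b> amid security concerns.</p>
  <div class="ad">Subscribe now</div>
  <p>Analysts see   further risk.</p>
</body></html>
"""

paragraphs = extract_paragraphs(SAMPLE_HTML)
for paragraph in paragraphs:
    print(paragraph)
```

Keeping only paragraph text drops headlines, ad blocks and navigation, which would otherwise pollute the word counts.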

 

These word diagrams are clustered by number of occurrences. The most frequently used negative words in Reuters are “fell”, “concerns” and “risk”. In comparison, “concerns”, “risk” and “death” are the most frequently used negative words on CNN. This is understandable, since HUAWEI is depicted and regarded as a “security threat” to America. When you look closer at the word diagrams and pay attention to the differences, it is not hard to see that CNN is somewhat more biased. Other frequent words in CNN’s coverage include “bad”, “fears”, “criminal” and “fraud”. In contrast, Reuters is more neutral: its other frequent words are “dispute”, “fall”, “losses”, “lost” and “difficult”.
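
A tally like the one behind these diagrams could be sketched roughly as follows. This is not the original analysis code: the lexicon is just the set of negative words quoted in this post, the `negative_word_counts` function is a hypothetical helper, and the two text snippets are made-up stand-ins for scraped articles.

```python
import re
from collections import Counter

# Illustrative lexicon built from the words quoted in this post; the
# original analysis may well have used a different, larger word list.
NEGATIVE_WORDS = {
    "fell", "concerns", "risk", "death", "bad", "fears", "criminal",
    "fraud", "dispute", "fall", "losses", "lost", "difficult",
}


def negative_word_counts(text, lexicon=NEGATIVE_WORDS):
    """Count how often each lexicon word occurs in the text (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token in lexicon)


# Hypothetical snippets standing in for scraped article text.
samples = {
    "outlet_a": "Shares fell as concerns grew. The risk of losses remains, "
                "and concerns persist.",
    "outlet_b": "Fears of fraud: a bad deal and criminal charges raise concerns.",
}

for outlet, text in samples.items():
    # most_common(3) lists the three most frequent negative words per outlet.
    print(outlet, negative_word_counts(text).most_common(3))
```

A plain count like this only shows which words appear; it says nothing about context, so a word such as “fell” in a stock report is counted the same as in an opinion piece.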

 

 

 

 

CNN’s audience pool is much larger than that of Reuters. According to the Columbia Journalism Review, which reviewed over 1.23 million articles published and shared, the most frequently shared news sources among both Twitter and Facebook users are the New York Times, CNN and Breitbart. However, outlets that hold a more neutral point of view, like Reuters, are shared by a small fraction of the audience pool compared with CNN. As a result, the words used by CNN can make a huge difference in how such an international issue is perceived by a large population. That said, if CNN associates HUAWEI with more critical, biased words like “bad” and “criminal”, the American population is likely to shift toward similar attitudes.

Political implication

The chart above shows the most frequently shared media sources for Twitter users who retweeted either Trump or Clinton. Among users sharing CNN, more retweeted Clinton than Donald Trump. Fox News shows an even more distinct tilt toward Donald Trump. However, news media with a neutral point of view, which hold a smaller share of the audience pool, do not appear in the chart.

I also looked at Trish Regan, who hosts a prime-time show on Fox Business Network. As a well-known media figure, she has been accused of being biased. Prompted by Trish Regan’s attitude of denial, I scraped her tweets about HUAWEI using Octoparse:

 

 

 

I got 800 tweets; the words used are shown below:

 

Frequently used words such as “wrong”, “brutal”, “hypocrisy” and “bad” are emotional and biased. Some of the noteworthy comments and tweets look like these:

 

 

There is nothing wrong with showing emotion and feelings, but it is wrong when a social media host becomes an agitator by sharing biased ideas.

Social media platforms open up opportunities for free speech. However, filter bubbles incubate language polarization and hate speech like the examples below:

 

 

Twitter has strict rules regarding posted messages. When I scanned through the comments, I could see that many had been deleted due to explicit vulgarity. Considering that it is impossible to wipe out every insinuating or defamatory message, we should walk away rather than join such a meaningless fight.

What can we do to stop hate speech on social media?

 

  • Understand the rationale behind the acts: knowing which language and speech stir up the fire helps us prevent them from spreading.
  • Counter-speech research: we look for the perpetrators of bad speech rather than fighting against it head-on.

 

 

Author: Ashley Weldon

 


 
