The 2018 World Cup began on June 14th. We were curious about what people were really interested in and what they were saying about the matches, so we scraped and analyzed information on the 612 videos returned by a YouTube search for "World Cup 2018", along with more than six thousand comments.
Here we would like to share the data extraction process and what we found in the extracted data.
What Information Did We Capture?
With the help of Octoparse 7, we configured two crawlers for this project. The main difficulty in scraping YouTube is that both the results page and the video pages use infinite scrolling, so we set Octoparse to scroll down automatically in order to load as much video information as possible.
For each video, we captured the Title, Duration, Publisher, Time, Views, Likes and Dislikes.
With the Point & Click function of Octoparse 7, it’s quite easy to extract the video information we want.
To retrieve the comments on each video, we built a second crawler. The trick here is to add an extra loop item inside the crawler workflow so that each comment is extracted one by one.
For each comment, we scraped only the content and the number of "Likes".
As a result, we could transform the video information and comments from YouTube into structured data sheets.
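To make "structured" concrete, here is a minimal Python sketch of what the two data sheets could look like once exported. The field names mirror the fields listed above. The class names, the `to_csv` helper, the `video_title` column linking a comment to its video, and the sample rows are all assumptions made for illustration; they are not Octoparse output or its export format.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class VideoRecord:
    # One row per video, using the fields captured by the first crawler.
    title: str
    duration: str
    publisher: str
    time: str
    views: int
    likes: int
    dislikes: int


@dataclass
class CommentRecord:
    # One row per comment. video_title is a hypothetical linking column.
    video_title: str
    content: str
    likes: int


def to_csv(records):
    """Serialize a non-empty list of same-typed dataclass records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(records[0])])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Made-up sample rows, not scraped data.
videos = [VideoRecord("Sample match highlights", "2:05", "Sample Channel",
                      "1 week ago", 1000, 40, 10)]
comments = [CommentRecord("Sample match highlights", "What a game!", 3)]

video_csv = to_csv(videos)
comment_csv = to_csv(comments)
```

Keeping videos and comments in separate sheets, as the two crawlers do, avoids repeating the video fields on every comment row.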
What Did We Find From the Extracted Information?
I ran the crawlers last Friday, when only 56 games had been played. The results are as follows.
Ten Most Popular Videos About the 2018 World Cup
It’s quite surprising that the music videos of the official songs are so popular. Among the 10 most popular videos, 4 of them are music videos of the official songs of the 2018 World Cup.
Ten Most Popular 2018 World Cup Matches on YouTube
Popularity is related to the teams playing rather than to the stage of the tournament.
Of the 10 most popular matches, 7 are group-stage matches and 3 are Last 16 matches.
The most popular team is Argentina (3 matches listed), followed by Portugal, Germany and Brazil (2 matches each).
People's Preference for the 10 Most Popular Matches
Next, we explore people’s preferences for the 10 most popular matches, based simply on the "Like" and "Dislike" counts.
Ten Most Disliked 2018 World Cup Matches
This ranking is calculated as the ratio of the number of "Dislikes" to the total number of views. Interestingly, 6 of these matches also appear in the list of the ten most popular matches.
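The ratio described above can be sketched in a few lines of Python. The function names, the dictionary keys, and the sample numbers are hypothetical and only illustrate the calculation; they are not the scraped figures.

```python
def dislike_ratio(dislikes, views):
    """Return dislikes divided by views, or 0.0 when a video has no views."""
    return dislikes / views if views else 0.0


def most_disliked(videos, top_n=10):
    """Sort videos by dislike ratio, highest first, and keep the top_n."""
    return sorted(
        videos,
        key=lambda v: dislike_ratio(v["dislikes"], v["views"]),
        reverse=True,
    )[:top_n]


# Made-up sample rows, not scraped data.
sample = [
    {"title": "Match A", "views": 1000, "dislikes": 10},
    {"title": "Match B", "views": 500, "dislikes": 20},
    {"title": "Match C", "views": 2000, "dislikes": 4},
]

ranked = most_disliked(sample, top_n=2)
```

Dividing by views rather than comparing raw "Dislike" counts keeps a heavily watched match from topping the list simply because more people saw it.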
Sentiment Analysis on 2018 World Cup Matches
After further processing of the scraped comments, we obtained sentiment analysis results for 8 of the 2018 World Cup matches, shown below:
Words Surrounding 2018 World Cup Match Comments
We used the comments on several matches to create word clouds that show the most frequently occurring words.
Sweden vs Switzerland
South Korea vs Germany
Spain vs Russia
England vs Belgium
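A word cloud is driven by word counts, so the underlying step can be sketched in Python as below. This is only an illustration of the counting idea, not the tool used for the clouds in this post. The stop-word list, the tokenizing rule, and the sample comments are assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list. A real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "is", "was", "to", "of",
             "in", "it", "this", "that", "what"}


def word_frequencies(comments, top_n=5):
    """Count how often each non-stop-word appears across all comments."""
    counts = Counter()
    for comment in comments:
        # Lowercase, then keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", comment.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)


# Made-up sample comments, not scraped data.
sample_comments = [
    "What a goal!",
    "The goal was great",
    "Great match, what a goal",
]

top_words = word_frequencies(sample_comments, top_n=2)
```

The more often a word appears in the counts, the larger it would be drawn in the cloud.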
How People Commented on the 2018 World Cup Final: France vs Croatia
Word clouds of the comments on the 2018 World Cup Final: France vs Croatia
Words used most often in the comments
The data is beautiful. Thanks to the rapid development and popularization of data extraction and data analytics tools, we can now collect the information we want and analyze it much faster and more easily.
Author: Surie M. (The Octoparse Team)