By Arafath Hossain
This article is a follow-up study that analyzes YouTube content about Bangladesh, similar to the earlier one on tweets about Bangladesh. In that article I analyzed tweets that contained ‘Bangladesh’ to understand the topics people commonly talked about, the overall sentiment expressed, and whether the sentiment varied with specific topics such as the Rohingya crisis. In this study, I have harvested statistics and viewer comments from YouTube videos that contained “Bangladesh” in their titles. I then applied different text analytics techniques and machine learning models to find out what the videos are about, how viewers interacted with them, which specific issues viewers talked about most within different topic areas, what sentiments they expressed, and how those sentiments changed over time across topics.
Similar to the last article, to put our discussion in context we will treat this study as a random walk on a street called “YouTube” with a celebrity named “Bangladesh”. During our walk, we will record all the videos created about Bangladesh and the comments made on them. Then we will analyze the records, both videos and comments, and try to understand what people talk about most and how they feel about those topics.
Our journey starts with collecting videos posted on YouTube about “Bangladesh”. To do that, YouTube’s publicly available API was used to harvest records of the videos that had ‘Bangladesh’ in their titles. Our search returned a total of 589 videos that were posted in 2017. Let’s start our study by looking at the trend of video posting over the year.
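The harvesting step described above could look roughly like the sketch below. It is an illustration only and not the code used for this study (the technical version linked at the end of the article is written in R). It assumes the `search` method of the YouTube Data API v3 and its documented parameters; the helper names, the placeholder key and the sample items are made up for the example, and no request is actually sent.

```python
from urllib.parse import urlencode

# Public endpoint of the YouTube Data API v3 search method.
SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(api_key, query="Bangladesh", year=2017, page_token=None):
    """Build one search request URL for videos published in the given year."""
    params = {
        "part": "snippet",
        "q": query,
        "type": "video",
        "maxResults": 50,  # the API returns at most 50 results per page
        "publishedAfter": f"{year}-01-01T00:00:00Z",
        "publishedBefore": f"{year + 1}-01-01T00:00:00Z",
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token  # continue from the previous page
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def title_mentions(items, word="bangladesh"):
    """Keep only results whose title contains the word (case-insensitive)."""
    return [item for item in items if word in item["snippet"]["title"].lower()]


# Hypothetical items, shaped like the "items" list of a search response.
sample_items = [
    {"id": {"videoId": "a1"}, "snippet": {"title": "Bangladesh vs India highlights"}},
    {"id": {"videoId": "b2"}, "snippet": {"title": "Street food tour"}},
]
kept = title_mentions(sample_items)

print(build_search_url("YOUR_API_KEY"))  # placeholder key, assumption
print([item["id"]["videoId"] for item in kept])
```

The search query matches more than titles, so the sketch filters titles on the client side to mirror the "Bangladesh in the title" rule.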
From figure-1 (a) we can see a growing trend that started in August, reached its peak in October and then gradually fell again. Apart from that, March and June are the two other months with unusual spikes. Figure-1 (b) shows how this trend appears among the most viewed videos.
The abundance of ‘Light Teal’ and ‘Green’ colors in figure-1 (b) immediately tells us that the majority of the most viewed videos were posted during June and July. Moreover, we can make some more observations from this plot:
- There is a wide dispersion among the videos in terms of their count of views.
- Similar to what we have seen in the last article, ‘Cricket’ is a common topic here too. There are multiple videos about Cricket in this list of top videos.
- But unlike the last analysis, here we see multiple videos about a brothel, which is quite unusual considering the general societal norms in the country.
A manual inspection of the related videos reveals that all three such videos in the top 20 list are about a specific brothel situated in ‘Doulatdia’, a small town in the southern part of Bangladesh. But why did that brothel attract so much attention? The answer can be found by manually inspecting these three videos. Al Jazeera, a Middle Eastern news channel, published a documentary in June, which is the fifth most viewed video in our list of top videos. The other two videos were posted by another channel (channel name: Asia Forever) that does not have a credible track record comparable to Al Jazeera’s. This suggests that the likely intention of these videos was to get more views by banking on the popularity of the Al Jazeera video and the controversial nature of the topic.
Moving back to our discussion about the video topics, figure-2 presents the most frequent words in the video titles, obtained through keyword extraction.
Figure 2 Top 20 most frequently used words
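The word counts behind a chart like figure-2 can be sketched as follows. This is a minimal illustration in Python, not the code used in the study: the three titles and the tiny stop-word list are made up, and the actual keyword extraction may have used different tokenization and filtering rules.

```python
import re
from collections import Counter

# Hypothetical titles; the study worked with 589 harvested titles.
titles = [
    "Bangladesh vs India cricket highlights 2017",
    "Dhaka city tour 2017 | Bangladesh",
    "Rohingya crisis: Bangladesh border 2017",
]

# A tiny illustrative stop-word list, not the one used in the study.
STOP_WORDS = {"vs", "the", "of", "in", "a", "and"}


def top_words(texts, n=20):
    """Count lower-cased word tokens across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


for word, count in top_words(titles, n=5):
    print(word, count)
```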
For obvious reasons, Bangladesh and 2017 are the two most frequent words. The third most frequent word is Dhaka. Interestingly, we see a trend here similar to what we saw in our last analysis of tweets: there is a prevalence of words related to Cricket and Rohingya. Names of our neighboring countries (India and Myanmar) also appear in the list of most frequent words, as they did in our Twitter analysis. Let’s see how these frequent words were used to represent the common topic areas. Figure-3 visualizes the most frequent words in different topic areas for this purpose.
Figure 3 Most frequent words in different topic areas
Figure-3 reveals an unusual abundance of ‘dance’ related words. Looking at the plot of the most frequent video posters in figure-4, we can understand the reason. A channel named ‘Skydance Company’ has made dance related words common through its frequent video postings on this topic. We will see this more clearly when we look at word associations later. But for now, the list of most frequent video posters gives us another insight: Al Jazeera is one of the top 10 most frequent posters of videos about Bangladesh.
Figure 4 Top 10 most frequent video posters
This means that last year Bangladesh got quite a lot of attention from the global news giant Al Jazeera! It would be interesting to see how Bangladesh is represented in these news videos. But since this study is about general perceptions rather than how Bangladesh is portrayed in the global news, we will leave this avenue open for some future endeavor. Now, before moving further into text mining, we will take a look at the numbers of likes, dislikes and comments. We will get to know the most liked videos, the most disliked videos and the videos that got the most traction with viewers through comments.
Interestingly, looking at the lists of the 10 most liked and most disliked videos, we can see some common names! Four out of the ten videos are in both lists. Moreover, we see that both the most loved and the most hated videos were created by non-Bangladeshi people! This gives us the good news that the most hated video was not created by us! The not so good news is that there still seems to be room for improvement in our local content so that it is liked by a mass audience!
Let’s now take a look at how the numbers of likes and dislikes relate to each other. Plotting the like and dislike counts produces the plots shown in figure-6. The plot on the left represents all the videos, and the plot on the right represents all but the videos with extremely high or low counts (lowest and highest quartile).
Figure 6 Relationship between like counts vs dislike counts
From the high density in the bottom left corner, we can clearly see a high degree of skewness in the data, meaning that there is a high level of disparity among the videos in terms of the number of likes and dislikes, similar to what we observed for the number of views. Besides, we can observe a roughly linear relationship between like and dislike counts. Statistically, the correlation coefficient is 0.64, which indicates a moderately strong positive association: videos with more likes tend to have more dislikes as well, and vice versa. (The coefficient measures the strength of the linear relationship, not the share of videos for which this holds.) In simple words, no matter how many likes a video gathers, there is always a chance that it gathers a good number of dislikes along the way!
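To make the meaning of such a coefficient concrete, here is a minimal sketch of how a correlation between like and dislike counts can be computed. It assumes the Pearson coefficient (the article does not name the variant used) and runs on made-up counts, so the printed value is not the 0.64 reported above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical like and dislike counts for six videos.
likes = [120, 4500, 80, 15000, 900, 300]
dislikes = [10, 300, 5, 2500, 400, 20]

r = pearson(likes, dislikes)
print("correlation:", round(r, 2))
print("share of variance explained (r squared):", round(r * r, 2))
```

The value always lies between -1 and 1. Its square, not the coefficient itself, is the quantity that can be read as a proportion, namely the share of variance in one count that a straight-line fit on the other accounts for.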
Let’s also look at viewer traction through comments, in combination with the likes and dislikes. Figure-7 plots the most commented videos, with the ratio of likes to dislikes shown as the color code. We see some names on this list that we have already seen in the charts of most viewed, liked and disliked videos.
Figure 7 Top 10 most commented videos
We can immediately see the prevalence of colors from both extremes (dark blue and very light blue). Here ‘dark blue’ represents a low ratio of likes to dislikes (around 20 likes per dislike) and ‘light blue’ represents a high ratio (around 40-60 likes per dislike). This means that the majority of the most commented videos are either highly liked or, relative to the others, heavily disliked. Videos that didn’t evoke any extreme emotion don’t seem to make a ripple in the comment section.
We will dissect this relationship between likes, dislikes and comments a bit more, and try to understand whether the most liked and the most disliked videos differ in how they generate viewer interaction. In figure-8 we plot the like and dislike counts of videos, colored by their categorized comment counts.
Figure 8 Like counts vs dislike counts
From the abundance of light red in figure-8, we can see that the majority of videos produce 1,000 comments or fewer. Videos generating the most comments, above 3,000 (violet), are the ones with more likes relative to dislikes. In contrast, the videos with both high likes and high dislikes (upper right quadrant) generate more than 1,000 comments (1,000 to 2,000, light green) but not as many as 3,000. The higher correlation between like count and comment count (0.73) compared with the correlation between dislike count and comment count (0.58) also suggests that positive videos, those with more likes, tend to get better traction through comments than negative videos, those with more dislikes.
So far we have analyzed the statistics of the videos and learned what types of videos are most liked or hated and how viewer engagement varies with likes and dislikes. At this stage of our analysis, we will move on to analyzing the comments. We will try to understand the interest areas and sentiments of the viewers from their own words. Because the YouTube API restricts the number of records that can be collected, we will limit our comment analysis to the 10 most viewed videos in each of these topic areas: Dhaka, India, Rohingya, and Cricket.
Looking at figure-9, we see that comments under videos on different topics followed different trends over the year. This reflects how social media activity around any issue spikes and then flattens with time.
Figure 9 Trend of video posting on different topics over the months
For example, the videos posted about the Rohingya topic had the highest traction in September 2017, around the time when the crisis started. But comments about the issue have gradually slowed down since November 2017. We can also look at the trend of comments for individual videos over the lifetime of each video, to understand how the number of comments changes as a video ages.
Figure 10 Comment counts over the lifetime of videos
From the left plot above, we can see that the highest number of comments are left right after videos are posted (where the age of the video is ‘zero’, meaning the video and the comment were posted on the same date); the number then falls sharply and plateaus till the end. The plot on the right excludes extreme numbers so that we can make better sense of the graphs and spot a trend if there is one. There we can see that, on average, comments tend to slow down after the 300th to 350th day from the date the video was posted. The exceptions are the videos about India, where comments were still being posted after the videos were more than 400 days old. So perhaps we, the people of these two neighboring countries, like to talk about each other a lot?
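The "age of video" used in these plots is simply the number of days between a video's posting date and each comment's date. A minimal sketch with made-up dates (the function and variable names are hypothetical):

```python
from collections import Counter
from datetime import date


def comment_ages(video_published, comment_dates):
    """Age of the video, in days, at the time each comment was posted."""
    return [(d - video_published).days for d in comment_dates]


# Hypothetical video and comment dates.
published = date(2017, 9, 1)
comments = [
    date(2017, 9, 1),   # same day as the video, so age zero
    date(2017, 9, 1),
    date(2017, 9, 3),
    date(2018, 7, 15),
]

ages = comment_ages(published, comments)
per_age = Counter(ages)  # number of comments at each video age

for age, count in sorted(per_age.items()):
    print(age, count)
```

Plotting `per_age` with the age on the horizontal axis gives a curve of the kind shown in figure-10.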
Let’s now see how general sentiments were expressed in these different areas. To do that, we will follow a lexicon-based sentiment extraction approach, similar to what was done in our earlier Twitter analysis. To explain briefly, a lexicon is a stock of words that are prelabeled with the sentiment they carry. For example, happy would be labeled as a word carrying positive sentiment, while cry would be labeled as carrying negative sentiment. The lexicon used in this analysis is the NRC Emotion Lexicon (EmoLex), a crowd-sourced lexicon created by Dr. Saif Mohammad, senior research officer at the National Research Council, Canada. The NRC lexicon categorizes words by 8 prototypical emotions (trust, surprise, sadness, joy, fear, disgust, anticipation, and anger) and two sentiments (positive and negative). In our analysis, the comments are broken down into words, and these words are then matched against the lexicon. Eventually, we have a list of words representing different sentiments, which gives us an idea of how different sentiments are present in different topics through the comments.
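The matching step described above can be sketched in a few lines. The example below uses a toy dictionary as a stand-in for the real lexicon: only the positive label for happy and the negative label for cry come from the explanation above, while the other entries and emotion labels are illustrative assumptions and are not copied from EmoLex.

```python
import re
from collections import Counter

# Toy stand-in for a lexicon such as NRC EmoLex: word -> set of labels.
# Entries are illustrative assumptions, not taken from the real lexicon.
TOY_LEXICON = {
    "happy": {"joy", "positive"},
    "cry": {"sadness", "negative"},
    "love": {"joy", "positive"},
    "hate": {"anger", "negative"},
}


def sentiment_counts(comments, lexicon):
    """Split comments into words and tally the labels of every matched word."""
    counts = Counter()
    for comment in comments:
        for word in re.findall(r"[a-z']+", comment.lower()):
            # Words missing from the lexicon contribute nothing.
            counts.update(lexicon.get(word, ()))
    return counts


# Hypothetical comments.
comments = ["I love this, so happy!", "Made me cry.", "I hate the ending"]
result = sentiment_counts(comments, TOY_LEXICON)

for label, count in sorted(result.items()):
    print(label, count)
```

Grouping the same tally by topic and by month would give the kind of series plotted in a sentiment trend chart. Note that plain word matching ignores negation and context ("not happy" still counts as positive), which is a known limitation of lexicon-based approaches.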
Figure 11 Trend of categorized sentiment in different topics across the time
For ease of visualization, positive and negative sentiments have been plotted separately in figure-11. Looking at the left plot above, two interesting trends can be spotted:
- There has been a sudden growth of anger during the months of June and July 2018 around the topic of India.
- On the other hand, after a period of heightened positive sentiment, a spike of negative sentiment about the Rohingya issue is observed over the months of June and July 2018.
The plot on the right mirrors the sentiments expressed in the left plot. There we also see heightened positivity about Dhaka in April 2018.
So, which specific issues did the viewers talk about most in each of these topic areas, and how frequently did they talk about them? To get an idea, we will investigate ‘Network Graphs’ created from the comments: plots showing how different words relate to each other and how frequently they appear together. In this case, we will investigate network graphs built from two parts of speech, nouns and adjectives, taken from the comments under videos related to the topics of ‘Rohingya’ and ‘India’. This should give us an idea about which topics (nouns) are discussed and roughly in which tone (adjectives). With that, we will end today’s analysis.
Figure 12 Network graph of comments on Rohingya and India related videos
On the network plots, red lines show which words are used together, and the opacity of a line shows how frequently the linked words were used together: the darker the line, the more frequent the pairing. From the network plot about the Rohingya topic (figure-12, left side) we see that the most used phrase is ‘criminal muslim’ or ‘muslim criminal’. The other darker red lines show the other most frequently used phrases. In the case of India, the most frequent phrase is ‘bangladesh bengal’ or ‘bengal bangladesh’, followed by the phrase ‘indian time’, and ‘bangladesh’ is the word most frequently associated with other words. These network graphs add some specificity to our analysis of the topics discussed. We see that a lot of the discussion related to Rohingya people stems from their religious and ethnic identity, while in the case of India a lot of the discussion is about Bangladesh, its people and the border between Bangladesh and India.
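The edges and line opacities of such a network graph come down to counting word pairs. The sketch below is a simplified illustration under two assumptions: the comments have already been reduced to nouns and adjectives (part-of-speech tagging is not shown), and two words are linked when they appear next to each other. The study's own pairing rule may differ, and the comments are made up.

```python
from collections import Counter


def pair_counts(token_lists):
    """Count unordered pairs of adjacent words across all comments."""
    counts = Counter()
    for tokens in token_lists:
        for left, right in zip(tokens, tokens[1:]):
            if left != right:
                # Sorting makes 'bengal bangladesh' and 'bangladesh bengal'
                # count as the same edge.
                counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical comments, already reduced to nouns and adjectives.
tagged_comments = [
    ["bangladesh", "bengal", "border"],
    ["bengal", "bangladesh", "people"],
    ["indian", "time"],
]

edges = pair_counts(tagged_comments)
max_count = max(edges.values())

for (word_a, word_b), n in edges.most_common():
    opacity = n / max_count  # darker line for more frequent pairs
    print(word_a, word_b, n, round(opacity, 2))
```

Each counted pair becomes one line in the graph, and the word that takes part in the most pairs ends up as the most connected node.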
So, to conclude, let’s reiterate why we should care about analyzing digital content about a country. Analyzing content on social networking and content sharing platforms gives us an idea of what people collectively say about the country on the internet and how they say it. Such discussion forms part of the digital footprint of that country. This digital footprint, in turn, shapes the impression the country makes on the world. So, studying it through social media content helps us find areas of improvement, such as the strategic use of communications to manage perceptions. For example, it was a great feat by Bangladesh to accommodate around 1,000,000 Rohingya refugees during one of the worst humanitarian crises of recent times. But could Bangladesh spread this great news effectively enough to create a good impression of her? Or take India and Bangladesh, which share a great history of friendship starting with the independence of Bangladesh. Do we still share the warmth of that friendship, or is it slowly turning sour? Though we can’t say anything conclusively, our analysis gives a mixed signal in these areas. But regularly conducting similar studies on a larger scale, on social media and other online platforms, may lead us to an answer. Or, in the context that we set earlier, we should go for more walks with Bangladesh on different online streets such as Facebook, Instagram, global news platforms, search engines and so on, to learn how people perceive her and where to put effort to better manage that perception.
About Writer
ARAFATH HOSSAIN is a graduate student in the MBA and Quality Management & Analytics programs at Illinois State University, United States. Before starting his graduate studies, he worked for three years in the areas of market research and technology product management. He takes a great interest in the applicability of data analytics techniques as a tool for informed decision making. For any question, query or comment, he can be reached at [email protected] or [email protected]. A detailed technical version of this analysis, including R code, is also available here: https://rpubs.com/arafath/YouTubeVideos_Analysis