This was the question I had to investigate for my Big Data Platforms final project using 25M+ tweets.
The tweets, from mid-October through mid-November, were extracted from the Twitter API and were provided for the project through Google Cloud Storage. The project itself was conducted using PySpark, a Python interface for the Apache Spark big data processing framework, on the Google Cloud Platform.
COVID-Related Tweets: Before doing any COVID tweet analysis, I first had to make sure I was working only with COVID tweets. The Twitter API provides a string variable called “text” that contains the text of the tweet. Using the PySpark contains() function, I filtered the dataset down to records whose “text” contained at least one COVID-related term, such as “covid”, “coronavirus”, “pandemic”, or “vaccine”, among others.
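A minimal sketch of this filtering step, written in plain Python so it runs without a Spark cluster. The term list is limited to the four terms named above, and the case-insensitive match is an assumption; the post does not say how casing was handled. In PySpark, the same idea would roughly be a set of `Column.contains()` conditions on the lowercased “text” column combined with `|`.

```python
# Hypothetical term list: the post names only these four "among others".
COVID_TERMS = ("covid", "coronavirus", "pandemic", "vaccine")


def is_covid_tweet(text, terms=COVID_TERMS):
    """Return True if the tweet text mentions any term (case-insensitive, an assumption)."""
    if not text:
        return False  # missing or empty text cannot match
    lowered = text.lower()
    return any(term in lowered for term in terms)


# Illustrative records only, not data from the project.
tweets = [
    {"text": "New COVID vaccine trial results are out"},
    {"text": "Great game last night!"},
    {"text": None},
]
covid_tweets = [t for t in tweets if is_covid_tweet(t["text"])]
```

Substring matching is deliberately loose: “covid” also matches “covid19” and “#covid”, which is usually what a keyword filter like this wants.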
Exploratory Data Analysis: Once I had COVID-only tweets, I explored the dataset, dropping records with missing values for categorical variables and replacing missing values with 0s for numerical variables. The data came as deeply nested JSON with many variables, so I also had to decide which variables to keep for my analysis, which cut down on the amount of data that needed to be processed.
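The cleaning rules above can be sketched as follows, again in plain Python. The field names and the categorical/numerical split are assumptions for illustration; the post does not list which variables were kept. In PySpark, the equivalent steps would typically use `select()` on nested columns, `dropna(subset=...)`, and `fillna(0, subset=...)`.

```python
# Hypothetical mapping from flat column names to paths in the nested tweet JSON.
FIELDS = {
    "text": "text",
    "user_name": "user.name",
    "followers_count": "user.followers_count",
    "retweet_count": "retweet_count",
}
CATEGORICAL = ("text", "user_name")                # drop the record if missing
NUMERIC = ("followers_count", "retweet_count")     # replace missing with 0


def get_nested(record, path):
    """Follow a dotted path such as 'user.name' through nested dicts; None if absent."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


def clean(records):
    """Keep only the selected fields, drop incomplete rows, zero-fill numeric gaps."""
    cleaned = []
    for record in records:
        row = {name: get_nested(record, path) for name, path in FIELDS.items()}
        if any(row[name] is None for name in CATEGORICAL):
            continue
        for name in NUMERIC:
            if row[name] is None:
                row[name] = 0
        cleaned.append(row)
    return cleaned
```

Selecting fields before applying the missing-value rules keeps each row small, which is the same reason the column selection reduced processing cost in the project.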
Influence Score: One variable that was required for the project but not provided by the Twitter API was an “Influence Score”: a score that would help determine which COVID twitterers were the most influential. I had to create this metric myself, so I defined my “Influence Score” as the total retweets for a tweet divided by the total engagements for that tweet, times the total tweet volume per twitterer. The score gave a higher weight to original tweet content and was averaged for each twitterer. I used retweets as the main factor because the more one’s original content is shared, the more users see it.
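One possible reading of this definition, as a hedged sketch: for each tweet, take retweets divided by engagements, multiply by the twitterer's tweet volume, then average over that twitterer's tweets. Treating engagements as retweets + replies + favorites is an assumption, as are the field names, and the extra weighting for original content is not modeled here because the post does not specify it.

```python
from collections import defaultdict


def influence_scores(tweets):
    """Average of (retweets / engagements) * tweet volume, per twitterer.

    Assumes engagements = retweets + replies + favorites; tweets with no
    engagements contribute 0 to avoid dividing by zero.
    """
    by_user = defaultdict(list)
    for tweet in tweets:
        by_user[tweet["user"]].append(tweet)

    scores = {}
    for user, user_tweets in by_user.items():
        volume = len(user_tweets)
        per_tweet = []
        for tweet in user_tweets:
            engagements = tweet["retweets"] + tweet["replies"] + tweet["favorites"]
            ratio = tweet["retweets"] / engagements if engagements else 0.0
            per_tweet.append(ratio * volume)
        scores[user] = sum(per_tweet) / volume
    return scores


# Illustrative records only.
sample = [
    {"user": "a", "retweets": 8, "replies": 1, "favorites": 1},
    {"user": "a", "retweets": 0, "replies": 0, "favorites": 0},
    {"user": "b", "retweets": 1, "replies": 1, "favorites": 2},
]
scores = influence_scores(sample)
```

Under this reading, a twitterer whose engagement is mostly retweets and who tweets often scores highest, which matches the stated rationale that sharing drives reach.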
Organizations: To better understand who the COVID twitterers were, I also had to create a variable indicating what kind of organization a twitterer belonged to or what kind of twitterer they were. Using the information in each twitterer’s “user name” and “user description”, along with the PySpark contains() function, I filtered for terms related to Government, Healthcare, News, and Celebrity to find and classify twitterers. Additionally, I used “follower count” to help classify twitterers as Celebrities or Social Media Influencers, depending on the size of their following. (Note: All twitterers that were categorized were verified; all non-verified twitterers were put under “Other”.)
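The classification rules can be sketched like this. The keyword lists and the follower thresholds are hypothetical placeholders, since the post does not give the actual terms or cutoffs; only the overall logic (verified check, keyword match on name and description, then follower count) follows the description above.

```python
# Hypothetical keyword lists; the actual terms used in the project are not given.
ORG_TERMS = {
    "Government": ("government", "ministry", "senator"),
    "Healthcare": ("health", "hospital", "doctor"),
    "News": ("news", "journalist", "reporter"),
    "Celebrity": ("actor", "singer", "athlete"),
}
# Hypothetical follower thresholds.
CELEBRITY_MIN_FOLLOWERS = 1_000_000
INFLUENCER_MIN_FOLLOWERS = 100_000


def classify(user):
    """Assign a twitterer type from profile text and follower count."""
    if not user.get("verified"):
        return "Other"  # non-verified accounts are never categorized
    profile = f"{user.get('name') or ''} {user.get('description') or ''}".lower()
    for label, terms in ORG_TERMS.items():
        if any(term in profile for term in terms):
            return label
    followers = user.get("followers_count") or 0
    if followers >= CELEBRITY_MIN_FOLLOWERS:
        return "Celebrity"
    if followers >= INFLUENCER_MIN_FOLLOWERS:
        return "Social Media Influencer"
    return "Other"
```

Note that the first matching category wins, so a profile mentioning both “health” and “news” would be labeled Healthcare here; the precedence order is a design choice that any real implementation would need to settle.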
Findings & Analysis
Overall, I found that there were 13.6M+ COVID tweets from roughly 2.7M twitterers. Among these twitterers, the majority did not belong to a credible organization and were non-verified, so not all COVID tweets should be considered a credible source of information.
One trend that I noticed was that the majority of COVID tweets and twitterers came from the countries with the most Twitter users [1].
Are COVID tweets aligned with spikes in cases?
Daily COVID tweet counts had a large spike in mid-October and were consistently lower after that, which is not aligned with world COVID cases [2], which started increasing in mid-October. Beyond that, counts were higher on weekdays, meaning that people tweeted less about COVID on weekends.
All of the twitterer organizations/types showed a similar weekday trend. News accounts had the highest counts and spikes, which fits the fact that they report on pretty much everything COVID-related.
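A small sketch of how daily counts and the weekday/weekend comparison could be computed. It assumes timestamps have already been converted to ISO 8601 strings (a preprocessing assumption), and the sample dates are illustrative only.

```python
from collections import Counter
from datetime import datetime


def daily_counts(timestamps):
    """Count tweets per calendar day from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).date() for ts in timestamps)


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def weekday_weekend_means(counts):
    """Return (mean daily count on Mon-Fri, mean daily count on Sat-Sun)."""
    weekday = [n for day, n in counts.items() if day.weekday() < 5]
    weekend = [n for day, n in counts.items() if day.weekday() >= 5]
    return _mean(weekday), _mean(weekend)


# Illustrative timestamps: three on a Friday, one on a Saturday.
timestamps = [
    "2020-10-16T09:00:00",
    "2020-10-16T12:30:00",
    "2020-10-16T18:45:00",
    "2020-10-17T10:00:00",
]
counts = daily_counts(timestamps)
weekday_mean, weekend_mean = weekday_weekend_means(counts)
```

Comparing the daily series against a case-count series for the same dates would be the natural next step for checking alignment.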
Are twitterers sharing tweets with original content or duplicate tweets?
A text similarity analysis of original tweets within each organization/type shows that unique tweets make up a higher percentage than tweets with duplicate content.
Health, government, and news twitterers had higher percentages of duplicate content than influencers and celebrities, which could be a sign that they are sharing similar COVID content. Social media influencers and celebrities had more original content, as they are not held to the same expectation as credible organizations to share all important COVID information.
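The post does not say which similarity measure was used, so the following is only one way such a duplicate-share figure could be computed: Jaccard similarity over word sets, with a hypothetical threshold of 0.8 for calling two tweets duplicates.

```python
import re


def tokens(text):
    """Lowercased set of word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the two texts' token sets (1.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def duplicate_share(texts, threshold=0.8):
    """Fraction of texts that are near-duplicates of an earlier, distinct text."""
    seen = []          # one representative per distinct text
    duplicates = 0
    for text in texts:
        if any(jaccard(text, earlier) >= threshold for earlier in seen):
            duplicates += 1
        else:
            seen.append(text)
    return duplicates / len(texts) if texts else 0.0


# Illustrative texts only.
texts = [
    "Wear a mask and wash your hands",
    "wear a mask and wash your hands!",
    "Vaccine trial results announced today",
]
share = duplicate_share(texts)
```

This pairwise comparison is quadratic in the number of tweets, so at the scale of millions of tweets an approximate method would be needed; Spark ML's `MinHashLSH` is one option for approximate Jaccard matching.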
Who are the most influential COVID twitterers?
Among the top 50% of influential COVID twitterers – based on my “Influence Score” – I found that twitterers belonging to news organizations and social media influencers not only tweeted the most about COVID but were also retweeted the most. The reach that news accounts have is good, but the fact that influencers have such a high reach, and that there are so many non-verified accounts tweeting about COVID, means that Twitter should pay attention to the COVID content that they share and Twitter users should be cognizant of information that they read.
Can we trust the most influential COVID twitterers?
If we determined the most influential COVID twitterers based on their number of COVID tweets and retweets, we would end up with twitterers that are non-verified, that do not belong to a credible organization, or that are just Twitter bots. However, if we determine the most influential COVID twitterers based on my “Influence Score” within each organization/type, we end up with a more trustworthy bunch of twitterers:
Among news twitterers, The Straits Times and Hindustan Times are aligned with the UK and India being top countries with COVID twitterers. Under health twitterers, we have Liz Szabo, who is a Senior Correspondent at Kaiser Health News, a source we can trust. Even among social media influencers, Dorit Reiss is a professor of public health law.
Unfortunately, the twitterers that one can trust are not viewed as often as our non-verified twitterers who tweet thousands of times and are retweeted hundreds of thousands of times. So what can Twitter do to combat this?
Conclusions & Recommendations
Time series analysis of COVID tweets shows that tweet counts do not align with worldwide COVID spread or risks, so an increase in COVID tweets should not be taken to mean that there are more cases. However, it would be worth it for Twitter to analyze trending COVID tweets by country or location to better understand whether there is a correlation between COVID tweets and COVID spread or risks at a local level.
Text similarity analysis of original content showed that many credible organizations shared original content that is similar to other tweets, so seeing similar tweets should not be a sign of an unreliable source.
The majority of COVID twitterers are non-verified and do not belong to a credible organization, so not all COVID tweets should be considered a reliable source of COVID information. When it comes to determining which COVID tweets to trust, the Influence Score by twitterer type is a better indicator of credible COVID twitterers than original tweet count or retweet count.
In order to address this issue, Twitter could de-prioritize tweets from non-verified twitterers that don’t belong to an established organization in the Twitter algorithm, while prioritizing those of influential and credible twitterers determined by their influence score and background based on their profile information. Additionally, adding badges to tweets by credible COVID twitterers would be helpful for users to know what information to trust.