Exploratory Data Analysis on COVID-19 Tweets

Shilpi Parikh
Published in Analytics Vidhya
4 min read · Nov 22, 2020


During the pandemic, social media platforms were flooded with posts related to COVID-19. In this article, I present an Exploratory Data Analysis of the Kaggle covid19-tweets dataset.

PANDEMIC COVID-19

For this EDA, I used the Kaggle dataset covid19_tweets.csv.

What is EDA?

EDA stands for Exploratory Data Analysis. It refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

In simple words, EDA means trying to understand what a dataset contains before applying any statistical technique to it. Getting to know the data well before drawing conclusions from it is what EDA is about.

Now let us perform EDA on the covid19_tweets dataset…

First, import the required libraries and load the dataset:

Python Libraries
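As a minimal sketch of the loading step, assuming pandas is installed and covid19_tweets.csv sits in the working directory (the helper name load_tweets is hypothetical, not taken from the original notebook):

```python
import pandas as pd


def load_tweets(path="covid19_tweets.csv"):
    """Read the tweets CSV into a pandas DataFrame.

    The default path assumes the Kaggle file was downloaded
    into the current working directory.
    """
    return pd.read_csv(path)
```

Calling df = load_tweets() then gives the DataFrame used in the rest of the analysis.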

Using the .head() and .tail() methods, we can view the first 5 and last 5 rows of the dataset, respectively.

.head() function
.tail() function
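A small sketch of both calls. The rows below are invented stand-ins so the snippet runs without the CSV; with the real data, df would come from pd.read_csv("covid19_tweets.csv").

```python
import pandas as pd

# Invented stand-in rows, not taken from the real dataset.
df = pd.DataFrame({
    "user_name": [f"user_{i}" for i in range(8)],
    "text": [f"tweet number {i}" for i in range(8)],
})

first_rows = df.head()    # first 5 rows by default
last_rows = df.tail()     # last 5 rows by default
first_three = df.head(3)  # pass n to change how many rows are returned

print(first_rows)
print(last_rows)
```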

To get a statistical summary of the DataFrame columns, we use the pandas describe() method. To get the total number of rows and columns of the dataset, we use the .shape attribute, which returns a (rows, columns) tuple.

describe() Function
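The following sketch shows both on an invented stand-in frame. The column names user_followers and text are assumed to match the Kaggle file; the values are made up.

```python
import pandas as pd

# Invented stand-in rows, not taken from the real dataset.
df = pd.DataFrame({
    "user_followers": [10, 250, 3, 4000],
    "text": ["first", "second", "third", "fourth"],
})

# shape is an attribute, so it is accessed without parentheses.
n_rows, n_cols = df.shape

# By default describe() summarises numeric columns only.
numeric_summary = df.describe()

# include="all" also summarises text columns (count, unique, top, freq).
full_summary = df.describe(include="all")

print(n_rows, n_cols)
print(numeric_summary)
```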

We can use different types of graphs such as Barplot, Histogram, Heatmap, Scatterplot, Boxplot and many more for visualization of the dataset.

For this dataset, I visualized the missing values and the number of unique values in each column. Below are the plots:

Heatmap for Missing Values
Barplot for Unique Values
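One way to produce plots like these is sketched below. It uses plain matplotlib rather than whatever plotting library the original notebook used, and the rows are invented stand-ins; only the column names are assumed to match the Kaggle file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in rows, not taken from the real dataset.
df = pd.DataFrame({
    "user_name": ["a", "b", "a", "c"],
    "user_location": ["India", None, "USA", None],
    "hashtags": [None, "['COVID19']", None, None],
})

missing_mask = df.isnull()           # True wherever a value is missing
missing_counts = missing_mask.sum()  # missing values per column
unique_counts = df.nunique()         # distinct non-null values per column

fig, (ax_missing, ax_unique) = plt.subplots(1, 2, figsize=(12, 4))

# Heatmap: one row per tweet, one column per field, bright cell = missing.
ax_missing.imshow(missing_mask.to_numpy().astype(int),
                  aspect="auto", cmap="viridis", interpolation="nearest")
ax_missing.set_xticks(range(len(df.columns)))
ax_missing.set_xticklabels(df.columns, rotation=45, ha="right")
ax_missing.set_title("Missing values")

# Bar plot: number of unique values in each column.
unique_counts.plot(kind="bar", ax=ax_unique)
ax_unique.set_title("Unique values per column")

fig.tight_layout()
# Call plt.show() or fig.savefig("eda.png") to view the figure.
```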

Next, we visualize the users who tweet most frequently about COVID-19:

Barplot of most frequent users
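A plot like this can be built from value_counts(). The sketch below assumes the user column is called user_name and uses invented rows so it runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in rows; "user_name" is the assumed column name.
df = pd.DataFrame({
    "user_name": ["news_bot", "news_bot", "news_bot", "alice", "alice", "bob"],
})

# value_counts() counts tweets per user, sorted from most to least frequent.
top_users = df["user_name"].value_counts().head(2)  # e.g. head(10) on real data

fig, ax = plt.subplots(figsize=(8, 4))
top_users.plot(kind="bar", ax=ax)
ax.set_xlabel("User")
ax.set_ylabel("Number of tweets")
ax.set_title("Most frequent users")
fig.tight_layout()
```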

Then we visualize the locations from which COVID-19 tweets are posted most frequently:

Barplot of most frequent Locations
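Location is a free-text field, so the same place can appear with stray whitespace or be missing entirely. The sketch below trims whitespace before counting; that cleaning step is my assumption, not necessarily what the original notebook does. The column name user_location is assumed and the rows are invented.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in rows; "user_location" is the assumed column name.
df = pd.DataFrame({
    "user_location": ["India", " India", "India ", "United States",
                      "United States", None, "London"],
})

# Trim surrounding whitespace so " India" and "India" count as one place.
locations = df["user_location"].str.strip()

# value_counts() ignores missing values by default.
top_locations = locations.value_counts().head(2)  # e.g. head(10) on real data

fig, ax = plt.subplots(figsize=(8, 4))
# Sort ascending so the most frequent location ends up at the top of the chart.
top_locations.sort_values().plot(kind="barh", ax=ax)
ax.set_xlabel("Number of tweets")
ax.set_title("Most frequent locations")
fig.tight_layout()
```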

To understand what people actually tweet about, we need to know the most common words used in these tweets. To do so, we create a word cloud of the 50 most frequent words. The word clouds are built per location, based on where the tweets were posted, with an individual word cloud for each location. Below are some screenshots of the word cloud of the most frequent words in tweets from India.

Similarly, we can create a word cloud for each location mentioned in the dataset.
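The per-location step can be sketched as follows. The stop-word list, the helper names top_words and render_word_cloud, and the sample rows are all hypothetical. The column names user_location and text are assumed, and the third-party wordcloud package is only needed for the final rendering step.

```python
import re
from collections import Counter

import pandas as pd

# Small hypothetical stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "and", "for", "are", "this", "that", "with", "from"}
URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[a-z]+")


def top_words(texts, n=50):
    """Return the n most frequent words as a {word: count} dict."""
    counts = Counter()
    for text in texts:
        cleaned = URL_PATTERN.sub(" ", str(text).lower())  # drop links
        counts.update(
            word for word in WORD_PATTERN.findall(cleaned)
            if len(word) > 2 and word not in STOPWORDS
        )
    return dict(counts.most_common(n))


def render_word_cloud(frequencies, max_words=50):
    """Build a word cloud image from word counts (needs `pip install wordcloud`)."""
    from wordcloud import WordCloud
    cloud = WordCloud(width=800, height=400, background_color="white",
                      max_words=max_words)
    return cloud.generate_from_frequencies(frequencies)


# Invented stand-in rows, not taken from the real dataset.
df = pd.DataFrame({
    "user_location": ["India", "Mumbai, India", "USA", None],
    "text": [
        "Wear a mask and stay home https://t.co/x",
        "Stay home, stay safe. Mask up!",
        "Vaccine trials are starting",
        "mask mask mask",
    ],
})

# Keep tweets whose location mentions India; missing locations are excluded.
is_india = df["user_location"].str.contains("India", case=False, na=False)
india_words = top_words(df.loc[is_india, "text"])
print(india_words)
```

Passing india_words to render_word_cloud() returns a WordCloud object that can be displayed with plt.imshow(). Repeating the filter with another location name gives the cloud for that location.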

If you want to look at the complete code for this EDA on COVID-19 tweets, refer to the GitHub repository https://github.com/ShilpiParikh/EDA-on-COVID-19-tweets.git.

Thus, by using EDA techniques we can get to know the dataset thoroughly, draw meaningful information from it, and find out whether it contains any flaws.
