Often, on first looking at a dataset, we see a bunch of rows and columns filled with numbers, letters, words, or abbreviations. Understanding this data and extracting as many insights as possible is a smart way to begin the process of model development. In this article, we will learn about EDA, its types, techniques, underlying assumptions, and tools, and we will perform Exploratory Data Analysis on a sample dataset to understand why it’s so important and helpful.
So let’s begin with a brief intro.
What is Exploratory Data Analysis?
According to NIST (National Institute of Standards and Technology, USA), EDA is a non-formal process with no definitive rules and techniques; rather, it is more of a philosophy or attitude about how data analysis should be conducted. Furthermore, the famous mathematician and statistician John W. Tukey, in his book “Exploratory Data Analysis”, describes EDA as a detective’s work. An analyst or a data scientist uses it to establish the assumptions needed for model fitting and hypothesis testing, as well as for handling missing values and transforming variables as necessary.
To simplify further, we can describe EDA as an iterative cycle where you:
1. Generate questions about your data.
2. Search for answers by visualizing, transforming, and modeling your data.
3. Use what you learn to refine your questions and/or generate new questions.
These questions can be:
- What is the typical value or central value that best describes the data?
- How spread out is the data from the typical value?
- What is a good distributional fit for the data?
- Does a certain feature affect the target variable?
- What are the statistically most important features/variables?
- What is the best function for relating a target variable to a set of other variables/ features?
- Does the data have any outliers?
Exploratory Data Analysis vs. Classical Data Analysis
Apart from EDA, there are also other data analysis approaches, Classical Data Analysis being one of the most popular ones. Both Exploratory Data Analysis and Classical Data Analysis start with a problem, followed by collecting the related data that can be used to understand the problem. Both of them end with yielding some inferences about the data. This is where their similarities end. Let us look at the differences now:
| Parameter | Exploratory Data Analysis | Classical Data Analysis |
| --- | --- | --- |
| Model | Does not impose deterministic or probabilistic models on the data. Instead, it allows the data to suggest admissible models that best suit the data. | Imposes deterministic and probabilistic models on the data. |
| Focus | The structure of the data, outliers, and models suggested by the data. | The parameters of the model; generates predicted values from the model. |
| Techniques | Generally graphical, for example, scatter plots, character plots, box plots, histograms, bi-histograms, probability plots, residual plots, and mean plots. | Generally quantitative, for example, ANOVA, t-tests, chi-squared tests, and F-tests. |
| Rigor | Suggestive, insightful, and subjective in nature. | Rigorous, formal, and objective in nature. |
| Data treatment | Uses all of the available data; in this sense, there is no corresponding loss of information. | Condenses data into important characteristics such as location, variation, etc., while filtering out some other important factors such as skewness, tail length, autocorrelation, etc., resulting in loss of information. |
| Assumptions | Makes few or no assumptions, as these techniques use all of the data. | Depends on underlying assumptions such as normality. |

Differences between parameters for Exploratory Data Analysis and Classical Data Analysis
It should be noted that in the real world, we might use elements from both of these approaches along with other ones during data analysis. For example, it is really common to use ANOVA and chi-squared tests to understand the relations between the different features of a dataset while doing EDA.
Univariate analysis vs. multivariate analysis
Often our dataset contains more than one variable, and in such cases, we can do univariate and multivariate analyses to understand our data better.
The term univariate analysis refers to the analysis of one variable and is basically the simplest form to analyze the data. The purpose of the univariate analysis is to understand the distribution of values for a single variable and not to deal with the relationship among the variables in the entire dataset. Summary statistics and frequency distribution plots such as histograms, bar plots, and kernel density plots are some of the common methods to do univariate analysis.
On the other hand, multivariate analysis can take all the variables in the dataset into consideration, which makes it more complicated than univariate analysis. The main purpose of such analysis is to find the relationships among the variables to get a better understanding of the overall data. Usually, any phenomenon in the real world is influenced by multiple factors, which makes multivariate analysis much more realistic. Some of the common methods used in multivariate analysis are regression analysis, principal component analysis, clustering, correlation, and graphical plots such as scatter plots.
Exploratory Data Analysis (EDA) tools
Some of the most common tools used for Exploratory Data Analysis are:
Exploratory Data Analysis (EDA) assumptions
Every measuring procedure includes some underlying assumptions that are presumed to be statistically true. In particular, there are four assumptions that commonly form the basis of all measurement procedures.
1. Data is randomly drawn.
2. The data belongs to a fixed distribution.
3. The distribution has a fixed location.
4. The distribution has a fixed variation.
In simpler words, we want the data to have some underlying structure that we can discover. Otherwise, it will be a complete waste of time trying to make any sense out of the data, which comes across as random noise.
If these four underlying assumptions are true, we will attain probabilistic predictability, which allows us to make probability claims about both the process’s past and future. They are referred to as “statistically in control” processes. Additionally, if the four assumptions are true, the approach can yield reliable conclusions that are reproducible.
But the interpretation of these assumptions might differ across different problem types. So, here we will describe these assumptions for the simplest problem type, i.e., univariate problems. In the univariate system, the response comprises a deterministic (constant) part and a random (error) part, so we can rewrite the above assumptions as:
1. The data points are uncorrelated with one another.
2. The random component has a fixed distribution.
3. The deterministic component consists only of a constant.
4. The random component has a fixed variation.
The univariate model’s universality and significance lie in its ability to extrapolate with ease to more general problems when the deterministic component is not only a constant but rather a function of several variables.
In this article, we will also see how to test these assumptions using some simple EDA techniques, namely the histogram, lag plot, probability plot, and run sequence plot.
Exploratory Data Analysis with a sample tabular dataset
Now before going through the rest of the article, I’ll take an example of a dataset – “120 years of Olympic history: athletes and results”, which is a dataset containing basic data of Olympic athletes and medal results from Athens 1896 to Rio 2016.
The main variables or attributes in this dataset are:
- ID – Unique number for each athlete;
- Name – Athlete’s name;
- Sex – M or F;
- Age – Integer;
- Height – In centimeters;
- Weight – In kilograms;
- Team – Team name;
- NOC – National Olympic Committee 3-letter code;
- Games – Year and season;
- Year – Integer;
- Season – Summer or Winter;
- City – Host city;
- Sport – Sport;
- Event – Event;
- Medal – Gold, Silver, Bronze, or NA.
After storing this data in a pandas dataframe, we can inspect the top 5 rows to get a first feel for the columns and their values.
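A minimal sketch of loading the data and peeking at the first rows. The two CSV rows below are made-up placeholders that only mimic the column layout listed above (they are not real Olympic records), and the file name in the comment is an assumption about how the downloaded file is called.

```python
import io

import pandas as pd

# Made-up placeholder rows that mimic the column layout described above;
# they are NOT real Olympic records.
SAMPLE_CSV = """ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
1,Athlete A,M,24,180,80,Team X,XXX,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,NA
2,Athlete B,F,21,165,55,Team Y,YYY,2012 Summer,2012,Summer,London,Swimming,Swimming Women's 100 metres Freestyle,Gold
"""

# With the real file you would pass its path instead of the in-memory buffer,
# e.g. pd.read_csv("athlete_events.csv") (file name assumed).
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(data.head())   # first 5 rows (only 2 exist in this placeholder)
print(data.shape)    # (number of rows, number of columns)

# By default pandas parses the literal string "NA" as a missing value,
# so athletes without a medal show up as NaN in the Medal column.
print(data["Medal"].isna().sum())
```

Note that missing medals therefore need no special handling at load time; they can be counted or filtered with the usual isna / notna methods.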
As mentioned earlier, it is a good practice in EDA to generate questions about the dataset to understand the data. For instance, with regard to this data, I would like to find out answers to the following questions:
- Which countries produce more gold-winning athletes?
- Do any of the physical features of an athlete, such as height, give an athlete an edge over others?
- Are there any features that are highly correlated and thus can be dropped?
- Is there any kind of bias in the data?
Of course, you can have a completely different set of questions about this data, which might be more relevant to your use case for this dataset. In the upcoming sections, along with going over the concepts, we will try to get answers to the aforementioned questions.
Descriptive statistics summarize the data to make it simpler to comprehend and analyze. Remember that one of the purposes of EDA is to understand variable properties like central value, variance, and skewness, and to suggest possible modeling strategies. Descriptive statistics are divided into two broad categories:
The measure of central tendency
Measures of central tendency are computed to give a “centre” around which the measurements in the data are distributed. We can use the mean, median, or mode to find the central value of the data.
The mean is the most widely used approach for determining the central value. It is calculated by adding all of the data values together and dividing the total by the number of data points.
The value at the exact middle of the dataset is defined as the median. Locate the number in the middle of the data after organizing the values in ascending order. In case there are two numbers in the middle, the median is calculated as the mean of them.
The mode is perhaps the simplest way to find the central value in a dataset. It is equal to the most frequent number, i.e., the number that occurs the highest number of times in the data.
It is to be noted that the mean is best for symmetric distributions without outliers, while the median is useful for skewed distributions or data with outliers. The mode is the least used of the measures of central tendency and is mostly used when dealing with nominal data.
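As a small illustration of the three measures, and of why the median is preferred when outliers are present, here is a sketch on a handful of arbitrary numbers (the values and the added outlier of 100 are chosen purely for illustration):

```python
import pandas as pd

values = pd.Series([6, 8, 7, 10, 8, 4, 9])

mean_value = values.mean()            # sum of values / number of values
median_value = values.median()        # middle of the sorted values 4,6,7,8,8,9,10
mode_value = values.mode().iloc[0]    # most frequent value

# Add one extreme value and recompute: the mean moves a lot, the median does not.
with_outlier = pd.concat([values, pd.Series([100])], ignore_index=True)
mean_with_outlier = with_outlier.mean()
median_with_outlier = with_outlier.median()

print(mean_value, median_value, mode_value)
print(mean_with_outlier, median_with_outlier)
```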
Measure of dispersion
The measure of dispersion describes “data spread”, or how far away the measurements are from the centre. Some of the common measures are:
The range of a particular data set is the difference between its greatest and lowest values. The higher the value of the range, the higher the spread in data.
Percentiles or Quartiles
The numbers that split your data into quarters are called quartiles. Typically, they split the data into four sections based on the positions of the numbers on the number line. A data collection is divided into four quartiles:
- First quartile: The lowest 25% of numbers.
- Second quartile: The next lowest 25% of numbers (up to the median).
- Third quartile: The second highest 25% of numbers (above the median).
- Fourth quartile: The highest 25% of numbers.
Based on the above quartiles, we can also define some additional terms here such as:
- The 25th Percentile is the value which is the end of the first quartile.
- The 50th Percentile is the value which is the end of the second quartile (that is, the median).
- The 75th Percentile is the value which is the end of the third quartile.
- IQR, also known as the interquartile range, is the difference between the 75th and the 25th percentiles. It measures how the middle 50% of the data is spread out around the median.
We can plot percentiles using a box plot, as we will see later in the article.
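The range, quartiles, and interquartile range can be computed directly with pandas. This is a sketch on arbitrary illustrative numbers; note that pandas interpolates linearly between neighbouring values by default, so other tools may report slightly different quartiles for small samples.

```python
import pandas as pd

values = pd.Series([6, 8, 7, 10, 8, 4, 9])

# Range: difference between the greatest and lowest values.
value_range = values.max() - values.min()

# 25th, 50th and 75th percentiles (default linear interpolation).
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

# Interquartile range: spread of the middle 50% of the data.
iqr = q3 - q1

print(value_range, q1, q2, q3, iqr)
```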
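The variance measures the average degree to which each point differs from the mean. For a sample of n data points, it can be calculated using the following formula:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu)^2$$

Where xi is a data point, and μ is the mean calculated for all data points. Dividing by n instead of n − 1 gives the population variance.
For example, the sample variance of the data points 6, 8, 7, 10, 8, 4, 9 is 3.95 (the population variance of the same points is about 3.39).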
The standard deviation tells us how much the data points typically deviate from the mean value. It is equal to the square root of the variance and, because it is built on the mean, it is sensitive to outliers.
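The variance figure quoted above can be checked with pandas. The sketch below relies on the fact that pandas divides by n − 1 by default (ddof=1); passing ddof=0 switches to the population version.

```python
import pandas as pd

values = pd.Series([6, 8, 7, 10, 8, 4, 9])

sample_var = values.var()              # divides by n - 1 (pandas default, ddof=1)
population_var = values.var(ddof=0)    # divides by n
sample_std = values.std()              # square root of the sample variance

print(round(sample_var, 2), round(population_var, 2), round(sample_std, 2))
```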
A deviation from the symmetrical bell curve, or normal distribution, in a collection of data is referred to as skewness. A skewness value greater than 1 or less than -1 indicates a highly skewed distribution. A value between 0.5 and 1 or between -1 and -0.5 indicates a moderately skewed distribution. A value between -0.5 and 0.5 indicates that the distribution is fairly symmetrical. The pandas skew method computes the skewness of all numerical variables.
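A short sketch of the skew method on a toy dataframe. The values are invented for illustration: Height is perfectly symmetric, while Age contains one large value that drags the tail to the right.

```python
import pandas as pd

# Toy values invented for illustration, not taken from the Olympic dataset.
df = pd.DataFrame({
    "Height": [160, 170, 180, 190, 200],   # symmetric around 180
    "Age": [20, 21, 22, 23, 60],           # one large value -> right skew
    "Sex": ["F", "M", "F", "M", "M"],      # non-numeric, excluded below
})

# Keep only numeric columns, then compute the skewness of each.
skewness = df.select_dtypes("number").skew()
print(skewness)
```

Here Height comes out at zero and Age well above 1, which by the rule of thumb above counts as highly skewed.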
The pandas describe method reports most of these statistics, such as min, max, mean, percentile values, and standard deviation, for all numerical variables in the data.
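A minimal sketch of that summary on a toy dataframe (the rows are invented placeholders; with the real data you would call describe on the full dataframe):

```python
import pandas as pd

# Invented placeholder rows with the same kind of columns as the dataset.
data = pd.DataFrame({
    "Age": [22, 24, 26, 28, 30],
    "Height": [160, 170, 175, 180, 190],
    "Weight": [55, 65, 70, 80, 90],
    "Sex": ["F", "M", "F", "M", "M"],
})

# describe() summarizes numeric columns only by default:
# count, mean, std, min, 25%, 50%, 75%, max.
summary = data.describe()
print(summary)
```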
Moving onto the techniques used in Exploratory Data Analysis, they can be broadly classified into graphical and non-graphical techniques, with most of them being graphical. Although non-graphical methods are quantitative and objective, they do not provide a complete picture of the data. Therefore, graphical methods, which are more qualitative and involve some subjective analysis, are also necessary.
A histogram is a graph that illustrates the distribution of the values of a numeric variable (univariate) having continuous values as a series of bars. Each bar normally spans a range of numeric values known as a bin or class, where the height of the bar shows the frequency of data points within the values present in the respective bin.
Using histograms, we can get an idea about the centre of the data, the spread of the data, the skewness of the data, and the presence of outliers.
For example, we can plot the histogram for the numerical variable such as height in the dataset.
From this histogram, we can confirm that the median height of athletes lies around 175 cm, which is also evident from the output of “data.describe” in the last section.
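A histogram like the one described here can be sketched with matplotlib. The heights below are synthetic values drawn from a normal distribution as a stand-in for the Height column (an assumption for illustration, not the real data), and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic heights in cm, standing in for data["Height"].dropna().
heights = pd.Series(rng.normal(loc=175, scale=10, size=1000), name="Height")

fig, ax = plt.subplots()
# ax.hist returns the per-bin counts and the bin edges it used.
counts, bin_edges, _ = ax.hist(heights, bins=20)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Number of athletes")
ax.set_title("Distribution of height")

print(heights.median())
```

The number of bins is a judgment call: too few hide the shape of the distribution, too many make it noisy.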
Normal Probability Plot
In general, a probability plot is a visual tool for determining whether a variable in a dataset approximately follows a given theoretical distribution, such as normal or gamma. It plots the sample data against the quantiles of the specified theoretical distribution, in this case, a normal distribution.
For example, we can plot the Normal Probability Plot for the numerical variable height in the dataset.
As we can see, the histogram is a bit skewed, thus there is a slight curve in the normal probability plot. We can perform techniques such as power transform, which will make the probability distribution of this variable more Gaussian or Normal.
Using the histogram and the probability plot, we can test for one of the EDA assumptions, i.e., the fixed distribution of data. For example, if the normal probability plot is linear, the underlying distribution is fixed and normal. Also, as histograms are used to represent the distribution of data, a bell-shaped histogram implies that the underlying distribution is symmetric and perhaps normal.
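The linearity of a normal probability plot can also be summarized numerically. The sketch below uses scipy.stats.probplot on two synthetic samples (both invented for illustration): one drawn from a normal distribution and one from a skewed exponential distribution. The correlation coefficient r of the fitted line is close to 1 when the points fall on a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic samples, used only to contrast a normal and a skewed variable.
normal_sample = rng.normal(loc=175, scale=10, size=500)
skewed_sample = rng.exponential(scale=10, size=500)

# probplot returns the plotted points and a least-squares line fit
# (slope, intercept, r). Pass plot=ax to also draw it on a matplotlib axis.
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(r_normal, r_skewed)
```

A visibly lower r for the skewed sample corresponds to the curve one would see in its probability plot.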
Kernel Density Estimation or KDE plot
The Kernel Density Estimation plot depicts an estimate of the probability density function of a continuous numeric variable and can be considered analogous to a smoothed histogram. We can use this plot for univariate as well as multivariate data.
For example, we can plot the KDE Plot for a numerical variable such as height in this dataset. So here we plot KDE for gold medal-winning athletes in basketball and swimming sports.
The y-value is an estimate of the probability density for the corresponding value on the x-axis, which is the height variable, so the area under the curve between 175 cm and 180 cm gives the probability of the height of an Olympic athlete being between 175 cm and 180 cm.
The KDE plots suggest that gold medalists in basketball are concentrated at greater heights, so being tall appears to be an advantage in basketball, whereas height seems to be a relatively small factor when it comes to winning gold in swimming.
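To make the "area under the curve" reading concrete, here is a sketch using scipy's gaussian_kde on synthetic heights (normally distributed values invented for illustration, not the real athletes). The integrate_box_1d method returns the area under the estimated density between two x-values.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic heights in cm, standing in for the Height column.
heights = rng.normal(loc=175, scale=10, size=2000)

kde = gaussian_kde(heights)          # bandwidth chosen by scipy's default rule

# Density values on a grid: this is the curve a KDE plot draws.
grid = np.linspace(140, 210, 200)
density = kde(grid)

# Area under the curve between 175 cm and 180 cm, i.e. the estimated
# probability that a height falls in that interval.
prob_175_180 = kde.integrate_box_1d(175, 180)
# Nearly all of the probability mass lies in a wide enough interval.
prob_total = kde.integrate_box_1d(100, 250)

print(prob_175_180, prob_total)
```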
A pie chart is a circular statistical graphic which is used to illustrate the distribution of a categorical variable. The pie is divided into slices, with each slice representing each category in the data. For the above dataset, we can describe the share of gold medals among the top 10 countries using a pie chart as this:
Through this pie chart, we can see that the USA, Russia, and Germany are the leading countries in the Olympics.
A bar chart, sometimes known as a bar graph, is a type of chart or graph that displays a categorical variable using rectangular bars with heights proportionate to the values they represent. The bars can be plotted either horizontally or vertically.
For this dataset, we can plot the number of gold medals won by the top 20 countries as follows.
It is obvious that we will need a pretty big pie chart to display this information. Instead, we can use a bar chart as it looks more visually pleasing and easy to understand.
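The counts behind such a pie or bar chart come from a simple filter and value_counts. This sketch uses placeholder country codes (AAA, BBB, CCC) and invented medals rather than real results.

```python
import pandas as pd

# Invented rows; the NOC codes are placeholders, not real countries.
data = pd.DataFrame({
    "NOC": ["AAA", "AAA", "AAA", "BBB", "BBB", "CCC", "CCC"],
    "Medal": ["Gold", "Gold", "Silver", "Gold", "Bronze", None, "Silver"],
})

# Keep gold-medal rows, count them per country, and keep the top 20.
gold_counts = data.loc[data["Medal"] == "Gold", "NOC"].value_counts().head(20)
print(gold_counts)

# With matplotlib installed, gold_counts.plot.bar() draws the bar chart
# and gold_counts.head(10).plot.pie() the corresponding pie chart.
```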
Stacked bar chart
A stacked bar chart is an extension to a simple bar chart where we can represent more than one variable. Each bar is further divided into segments where each segment represents a category. The height of the bar in the stacked bar chart is determined by the combined height of the variables.
We can now show the number of gold, silver, and bronze won by the leading 20 countries as follows:
So, as we can see in the stacked graph above, the USA is still leading in the number of gold medals as well as the total number of medals won. When we compare Italy and France, although France has won more medals in total, Italy has slightly more gold medalists. Thus, this plot gives us more granular information that we could otherwise easily miss.
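The table behind a stacked bar chart is a cross-tabulation of country against medal type. The sketch below uses placeholder country codes and invented medals, arranged so that one country has more golds while the other has more medals overall.

```python
import pandas as pd

# Invented rows; AAA and BBB are placeholder codes, not real countries.
data = pd.DataFrame({
    "NOC": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB"],
    "Medal": ["Gold", "Gold", "Silver", "Gold", "Silver", "Bronze", "Bronze"],
})

# Rows: countries, columns: medal types, cells: counts.
medal_table = pd.crosstab(data["NOC"], data["Medal"])
# Put the columns in podium order (missing medal types would be filled with 0).
medal_table = medal_table.reindex(columns=["Gold", "Silver", "Bronze"], fill_value=0)
totals = medal_table.sum(axis=1)

print(medal_table)
print(totals)

# With matplotlib installed, medal_table.plot.bar(stacked=True) draws the chart.
```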
A line chart or a curve chart is similar to a bar chart, but instead of bars, it shows information as a collection of data points that are connected by a line in a certain pattern. Line charts have an advantage – it’s easier to see small changes on line graphs than on bar graphs, and the line represents the overall trend very clearly.
As mentioned, the line plot is an excellent choice for describing certain trends, such as an increase in women athletes competing over the last years.
From the above line plot, we can see a sharp rise in the number of women participating in the Olympics after 1980.
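The series behind such a line plot can be built with a group-by. In this sketch the rows are invented placeholders; counting unique IDs (rather than rows) is a design choice so that an athlete entered in several events in the same year is counted once.

```python
import pandas as pd

# Invented rows: athlete 1 competes in two events in 1980.
data = pd.DataFrame({
    "ID": [1, 1, 2, 3, 4, 5],
    "Sex": ["F", "F", "M", "F", "F", "M"],
    "Year": [1980, 1980, 1980, 1984, 1984, 1984],
})

# Number of distinct women athletes per year.
women_per_year = data[data["Sex"] == "F"].groupby("Year")["ID"].nunique()
print(women_per_year)

# With matplotlib installed, women_per_year.plot() draws the line chart.
```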
Run Sequence plot
If we plot a line graph between the values of a variable and a dummy index, we get a run sequence plot. It is important as we can test for the fixed location and fixed variation assumptions made while conducting Exploratory Data Analysis.
If the run sequence plot is flat and non-drifting, the fixed-location assumption holds, whereas if the run sequence plot has a vertical spread which is about the same over the entire plot, then the fixed-variation assumption holds.
So we used this plot to check if the variable height in the dataset has fixed location and fixed variation and, as we can see, the graph appears to be non-drifting and flat, with a uniform vertical spread over the entire plot, so both these assumptions hold true for this variable.
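A run sequence plot is just the values plotted against their row index. The sketch below draws one for synthetic heights (invented for illustration) and adds a crude numeric companion to the visual check: comparing the mean and standard deviation of the first and second halves. That half-split comparison is an illustrative shortcut, not a formal test.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic heights in cm, standing in for the Height column.
heights = pd.Series(rng.normal(loc=175, scale=10, size=1000))

# Run sequence plot: value on the y-axis, dummy index on the x-axis.
fig, ax = plt.subplots()
ax.plot(heights.index, heights.to_numpy(), linewidth=0.5)
ax.set_xlabel("Index")
ax.set_ylabel("Height (cm)")

# Rough check of fixed location and fixed variation.
first_half, second_half = heights.iloc[:500], heights.iloc[500:]
location_shift = abs(first_half.mean() - second_half.mean())
spread_ratio = first_half.std() / second_half.std()

print(location_shift, spread_ratio)
```

A small location shift and a spread ratio near 1 are what a flat, evenly spread run sequence plot looks like in numbers.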
An area chart is similar to a line chart, except that the area between the x-axis and the line is filled in with colour or shading. The use cases of line charts and area plots are almost similar.
For our dataset, we can use an area plot to compare the gold medals won by men and women over the years.
As a consequence of more female athletes since 1980, we can also see a spike in the number of gold medals won by women. This is an important observation, as based on the data before 1980, we can wrongfully conclude that a male athlete has a higher chance of winning gold as compared to a female athlete. Hence, we can say that there is a bias known as prejudice bias present in this dataset.
A box plot, also called a box and whisker plot, shows the distribution of data for a continuous variable. It usually displays the five-number summary, i.e., minimum, first quartile, median, third quartile, and maximum for a dataset. A box is drawn from the first quartile to the third quartile, and the median of data is represented by a vertical line drawn through the box. Additionally, a box plot can be used as a visual tool for verifying normality or for identifying possible outliers.
A box plot also contains whiskers which are the lines that extend away from the box. For a more general case, as mentioned above, the boundary of the lower whisker is the minimum value of the data, while the boundary of the upper whisker is its maximum value.
In cases when we also want to find outliers, we use a variation of the box plot where the whiskers extend up to 1.5 times the Interquartile Range (IQR) beyond the box’s top and bottom. The Interquartile Range (IQR) is the distance between the upper (Q3) and lower (Q1) quartiles and is calculated by subtracting Q1 from Q3. The data points that fall outside of the end of the whiskers are referred to as outliers and are represented by dots.
In our dataset, we can plot box plots for our numeric variables such as height, age, and weight.
So, from the above box plots, we can get a good idea regarding the distribution of the height, weight, and age variables. We can also see how the weight and age features have a lot of outliers, predominantly at the higher end.
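The outlier rule that the box plot applies visually can be written out in a few lines. This sketch uses invented weights with one obviously extreme value; the 1.5 multiplier is the conventional choice described above.

```python
import pandas as pd

# Invented weights in kg; 150 is deliberately extreme.
weights = pd.Series([50, 55, 60, 62, 65, 70, 75, 150], name="Weight")

q1 = weights.quantile(0.25)
q3 = weights.quantile(0.75)
iqr = q3 - q1

# Whisker limits: 1.5 * IQR beyond the box on each side.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the dots drawn on the box plot.
outliers = weights[(weights < lower_fence) | (weights > upper_fence)]
print(lower_fence, upper_fence, outliers.tolist())
```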
In most cases, scatter plots are used to examine correlations between two continuous variables in a dataset. The values of the two variables are represented by the horizontal and vertical axes, and their cartesian coordinates correspond to the value for a single data point.
In our dataset, we can try to find the relation between height and weight variables as follows:
To move one step further, we can add one more categorical variable, such as the sex of an athlete, into the comparison as follows:
From the scatter plot above, we can conclude that the majority of male athletes have an advantage over female athletes when it comes to height and weight. Also, we cannot miss the fact that as the weight increases, the height of an athlete also increases, which may be an indication of the overall fitness of an athlete.
A lag plot is a special kind of scatter plot in which the X-axis and Y-axis both represent the same data points, but there is a difference in index or time units. The difference between these time units is called lag.
Let Y(i) be the value assumed by a variable/feature at index i or time step i (for time series data), then the lag plot contains the following axes:
Vertical axis: Y(i) for all i, starting from 0 to n.
Horizontal axis: Y(i-k) for all i, where k is the lag value and is 1 by default.
The randomness assumption is the most critical but least tested, and we can check for it using a lag plot. If the data is random, the points on the graph will be dispersed both horizontally and vertically quite equally, indicating no pattern. On the other hand, a graph with a form or trend (such as a linear pattern) shows that the data is not purely random.
We can plot the lag plot for the height variable of our dataset as follows:
Here the data seems to be completely random, and there appears to be no pattern present. Hence the data also fulfills the randomness assumption.
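pandas ships a lag_plot helper, and the lag-1 autocorrelation gives a numeric counterpart to the picture. The sketch below contrasts two synthetic series (both invented for illustration): independent random draws and a random walk, whose values depend strongly on the previous one.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import lag_plot

rng = np.random.default_rng(0)
# Independent draws: no structure expected in the lag plot.
random_series = pd.Series(rng.normal(loc=175, scale=10, size=500))
# Random walk: each value is the previous one plus noise.
walk_series = pd.Series(rng.normal(loc=0, scale=1, size=500)).cumsum()

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
lag_plot(random_series, lag=1, ax=axes[0])   # structureless cloud
lag_plot(walk_series, lag=1, ax=axes[1])     # points hug the diagonal

# Lag-1 autocorrelation: near 0 for random data, near 1 for the walk.
random_autocorr = random_series.autocorr(lag=1)
walk_autocorr = walk_series.autocorr(lag=1)
print(random_autocorr, walk_autocorr)
```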
A pair plot is a data visualization that shows pairwise associations between various variables of a dataset in a grid so that we may more easily see how they relate to one another. The diagonal of the grid can represent a histogram or KDE, as shown in the following example in which we compare the height, weight, and age variables of the dataset.
In this plot, we can try to find if any two features are correlated. As we can see, there appears to be no clear relation between age and height or age and weight. As seen earlier, there seems to be a correlation between weight and height, which is not surprising at all. An interesting thing to check will be if we can drop any of these features without losing much information.
A heatmap is a two-dimensional matrix representation of data where each cell is represented by a colour. Usually, during EDA, we use this visualization to plot the correlations among all the numerical variables in the dataset.
Let us try to find such relationships among a few variables of our dataset.
Correlation is a statistical term which measures the degree to which two variables move in coordination with one another. If the two variables move in the same direction, they are said to have a positive correlation; if they move in opposite directions, they have a negative correlation. Also, if the two variables have no relation, then the correlation value is near zero, as it is between height and age in our example.
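A correlation heatmap is the output of the pandas corr method drawn as a coloured grid. The data in this sketch is synthetic and constructed by assumption so that weight depends on height while age is independent of both; it illustrates the mechanics only and says nothing about the real athletes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
height = rng.normal(loc=175, scale=10, size=n)
# Synthetic relationship (an assumption): weight rises with height plus noise.
weight = 0.9 * height - 85 + rng.normal(loc=0, scale=6, size=n)
age = rng.normal(loc=25, scale=4, size=n)   # independent of the other two

data = pd.DataFrame({"Height": height, "Weight": weight, "Age": age})
corr = data.corr()   # pairwise Pearson correlations

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax)

print(corr.round(2))
```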
So now I have the answers to my questions, but some of these answers lead to a new set of questions –
- We now know that the USA has the most medals in the Olympics, but it will be interesting to know which other countries are lagging behind and why.
- We found out some factors like athlete height can be advantageous when it comes to basketball, so it would make sense to add more tall athletes to the basketball teams.
- We now also know that there is a chance that we can drop either the weight or height feature without losing much information about the data.
- Also, it is clear that the data is biased, and if we use this data to train any model, it may produce a model biased against female athletes.
To answer subsequent questions, you can do EDA in a more granular and detailed way and find some more interesting things about this data.
Even though EDA is mostly centred around graphical techniques, it includes certain quantitative approaches. Most of the quantitative techniques fall into two broad categories:
1. Interval estimation
2. Hypothesis testing
In this section, we are going to cover them briefly. I would like to point to this resource if you want to read about these techniques in depth.
The concept of interval estimate is used to create a range of values within which a variable is expected to fall. The confidence interval is a good example of this.
- The confidence interval expresses how far the observed estimate can plausibly lie from the real value of the quantity being estimated.
- An N% confidence interval for some parameter p is an interval with a lower bound (LB) and an upper bound (UB), constructed so that it is expected to contain p with probability N%, i.e., LB <= p <= UB.
- The confidence interval is a way to show what the uncertainty is with a certain statistic.
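As a worked sketch, a 95% confidence interval for a mean can be computed from the sample mean, the standard error, and a critical value of Student's t distribution. The numbers are arbitrary illustrative values, and the t-based interval assumes the data are roughly normally distributed.

```python
import numpy as np
from scipy import stats

values = np.array([6, 8, 7, 10, 8, 4, 9])
n = len(values)

mean = values.mean()
std_error = values.std(ddof=1) / np.sqrt(n)   # sample std / sqrt(n)

confidence = 0.95
# Two-sided critical value of Student's t with n - 1 degrees of freedom.
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)

margin = t_crit * std_error
lower, upper = mean - margin, mean + margin   # LB and UB of the interval
print(round(lower, 2), round(upper, 2))
```

The interval is wide here because the sample is tiny; with more observations the standard error, and therefore the margin, shrinks.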
A statistical hypothesis is a statement that is considered to be true until there is substantial evidence to the contrary. Hypothesis testing is widely used in many disciplines to determine whether a proposition is true or false.
Rejecting a hypothesis means that the data provide substantial evidence against it. Failing to reject a hypothesis, however, does not imply that it is true; it only means that we lack evidence to believe otherwise. As a result, hypothesis tests are defined in terms of two competing statements: the null hypothesis, which is retained by default, and the alternative hypothesis, which is adopted only if the null hypothesis is rejected.
Hypothesis testing is a multi-step process consisting of the following:
- Null hypothesis: This is the statement that is assumed to be true.
- Alternative hypothesis: This is the statement that will be accepted if the null hypothesis is rejected.
- Test statistic: The test determines if the observed data fall outside of the null hypothesis’s expected range of values. The type of data will determine which statistical test is used.
- Significance level: The significance level is a figure that the researcher specifies in advance as the threshold for statistical significance. It is the highest risk of getting a false positive conclusion that you are ready to tolerate.
- The critical value: The critical value marks the boundary of the critical region, which encompasses those values of the test statistic that lead to a rejection of the null hypothesis.
- The decision: The null hypothesis is rejected or not rejected based on the relationship between the test statistic and the critical value.
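The steps above can be sketched with a two-sample t-test. Everything here is an assumption made for illustration: the two groups are synthetic heights with deliberately different means, Welch's t-test is one reasonable choice of test statistic among several, and the decision is made through the p-value, which is equivalent to comparing the test statistic with the critical value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic heights (cm) for two hypothetical groups of athletes.
group_a = rng.normal(loc=195, scale=8, size=200)
group_b = rng.normal(loc=185, scale=8, size=200)

# Null hypothesis: both groups have the same mean height.
# Alternative hypothesis: the mean heights differ.
alpha = 0.05   # significance level, fixed before looking at the data

# Welch's t-test (does not assume equal variances in the two groups).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Decision: reject the null hypothesis when the p-value is below alpha.
reject_null = p_value < alpha
print(t_stat, p_value, reject_null)
```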
I hope this article gave you a good idea about some core concepts behind Exploratory Data Analysis. Although this article describes numerous EDA techniques, especially graphical ones, there are a lot more out there, and which ones to use depends on the dataset and your requirements. As mentioned earlier in this article, EDA is like a detective’s work and is mostly subjective, so you are free to raise as many questions as possible about your data and find their answers using EDA.