
Top Tools for Data Exploration and Visualization With Their Pros and Cons

When you are working on a data science project or looking for insights to shape your plans, two key steps cannot be avoided: data exploration and data visualization.

Data exploration is an integral part of EDA (Exploratory Data Analysis). Whatever you decide to do in the later phases (creating or selecting a machine learning model, or summarizing your findings) will depend on the assumptions you make in the exploration phase. It isn’t a single step: during exploration we check data distributions, look for correlations, and find outliers and missing values, among other things.
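As a minimal sketch of the checks listed above, assuming pandas and NumPy are installed and using a small made-up DataFrame (the column names `age` and `income` and all values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one obvious outlier
df = pd.DataFrame({
    "age": [23, 25, 31, 35, 41, 44, 52, 58],
    "income": [28, 31, 40, np.nan, 55, 58, 70, 300],
})

summary = df.describe()        # distribution: count, mean, std, quartiles
missing = df.isna().sum()      # number of missing values per column
correlation = df.corr()        # pairwise Pearson correlation

# Flag outliers in one column with the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
```

The IQR rule is only one of several possible outlier definitions; it is used here because it matches what a box plot draws.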

Data Visualizations aren’t part of any specific phase in a data analytics project. We can use visuals to represent the data at any point in our project. Data visualization is nothing but a mapping between tables or graphs and data (inputs or outputs). Data visualization can be done in two forms – tabular and graphical. 

We need visualization as a visual summary of the data, because it’s easier to understand for identifying relations and patterns. Many visuals are used in the data exploration phase to find outliers, correlation between features, etc. We also use charts and graphs to check the performance of models or while categorizing or clustering the data. 

Choosing a correct chart to communicate your findings about data is also important. Using a line chart instead of a scatter chart might not make sense. There are some basic and widely used charts which we use or see in our day-to-day work – in data science and otherwise:

  1. Line chart
  2. Bar chart
  3. Histogram
  4. Box plot
  5. Scatter plot
  6. Heatmap

To make accurate assumptions, we need good tools to explore and visualize the data. There are many tools and libraries available, and since it’s nearly impossible to remember them all, it can be confusing to decide which one to use. The aim of this article is to list the most widely used tools for data exploration and visualization, show how each one works, and weigh up their pros and cons.




List of data exploration and visualization tools

1. Matplotlib

Matplotlib was introduced to imitate the graphics supported by MATLAB, but in a simpler form. Over the years, many features have been added to the library. On top of that, many visualization libraries and tools are built on Matplotlib, adding new, interactive, and attractive visuals.

To learn more about Matplotlib, let’s load a dataset and see how some of its functions work:

#Load the dataset
 
import pandas as pd
netflix_df = pd.read_csv('netflix_titles.csv')
netflix_df.head(2)

We have type of content, title, date added, and other information. But what do we want to do with this information? We could find how many shows and movies are on Netflix (according to the dataset), or we could see which country has produced more content. 

#Import matplotlib's pyplot interface
import matplotlib.pyplot as plt

#Find the count of shows and movies
counts = netflix_df["type"].value_counts()
plt.bar(counts.index, counts.values)
plt.show()

In the above code, we imported Matplotlib’s pyplot module as plt. Each pyplot function makes some change to a figure: creating a figure, creating a plotting area, plotting some lines, adding labels to the plot, etc. We then used plt to draw a bar chart and display it.

One thing to remember is that we have to call plt.show() every time a new plot is created. In a Jupyter notebook, you can avoid this repetitive task by running the command below after importing matplotlib.

%matplotlib inline

There’s a lot you can do beyond just creating a simple bar chart. You could provide x and y labels, or you could give different colors to the bars according to their values. You have the choice to change markers, line styles and widths, add or alter text, legend, and annotations, change the limits and layout of your plots, and much more.
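To make those options concrete, here is a small sketch of a customized bar chart. It assumes only Matplotlib is installed; the labels and counts are made-up illustrative values, not figures from the Netflix dataset.

```python
import matplotlib.pyplot as plt

# Hypothetical counts, for illustration only
labels = ["Movie", "TV Show"]
values = [60, 40]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(labels, values, color=["tab:red", "tab:blue"])  # one colour per bar

# Axis labels, title and limits
ax.set_xlabel("Type of content")
ax.set_ylabel("Number of titles")
ax.set_title("Titles by type")
ax.set_ylim(0, 80)

# Annotate each bar with its height
for bar in bars:
    ax.annotate(
        str(bar.get_height()),
        xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
        ha="center",
        va="bottom",
    )
# Call plt.show() to display the figure outside a notebook
```

Working with the `fig, ax` pair rather than bare `plt` calls is a design choice: it keeps every customization attached to an explicit axes object, which helps once a figure has more than one plot.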

We can use Matplotlib to find anomalies in the data too. Let’s try to create a customized plot.

import pandas as pd
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2,
# so this example needs an older scikit-learn release
from sklearn.datasets import load_boston

boston = load_boston()
#create the dataframe
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)

fig = plt.figure(figsize=(10, 7))
# Creating axes instance
ax = fig.add_axes([0, 0, 1, 1])
ax.set_xlabel('Distance')
# Creating plot
bp = ax.boxplot(boston_df['DIS'])
plt.title("Customized box plot")
# show plot
plt.show()

Because this package is so flexible, it can be a bit tricky to choose between options, or even to remember them, when you start working with it. Luckily, the documentation contains real-life examples, details of each plot’s arguments, and all the other information we need. Don’t feel overwhelmed, just remember that there can be more than one solution to a problem. 

Now that we have some idea what Matplotlib is, let’s discuss the pros and cons, and which tools integrate with it.

Advantages

  • Fast and efficient, built on NumPy.
  • Gives you full control over your graph and plot, you can make a number of alterations to make your visuals more understandable.
  • Large community and cross-platform support, it’s an open-source library.
  • Several high-quality plots and graphs.

Disadvantages

  • Limited interactivity; plots are static by default.
  • A lot of repetitive code is needed when you make customized plots.
  • Full control means you have to specify every step of a plot yourself through Matplotlib function calls, which can be time-consuming.

Matplotlib integrations 

A lot of popular Python visualization libraries are built on Matplotlib. For example, seaborn uses matplotlib to display the plot once the figure is created. Not just this, but many tools have also integrated with Matplotlib. Neptune.ai is one of them. 

Achievement

The first image of a black hole was produced using NumPy and Matplotlib. Matplotlib is also used in sports for data analysis. 

2. Scikit Learn

Scikit learn started as a Google Summer of Code project by David Cournapeau. Later, in 2010, INRIA took it to another level and released a beta version of the library. Scikit learn has come a long way, and it’s now one of the most useful and robust machine learning libraries. It’s built in Python on top of NumPy, SciPy and Matplotlib. 

It doesn’t focus on one aspect of a data science project. Instead, it provides a vast collection of efficient tools for data cleaning, curation, modelling, etc. 

It has tools for:

  1. Classification
  2. Regression
  3. Clustering
  4. Dimensionality Reduction
  5. Model Selection
  6. Preprocessing

Where do data exploration and visualization fit? Scikit Learn has a collection of tools to meet exploratory data analysis requirements: discovering problems in the raw data and fixing them by transforming it.

If you’re looking for datasets to experiment on, Scikit learn has a dataset module which has some popular dataset collections. You can load a dataset as below, and you won’t have to download it on a local machine. 

 from sklearn.datasets import load_iris
 data = load_iris()

Scikit learn plays an important role when it comes to pre-processing, i.e. cleaning and curating. Assume you have a few missing values in your dataset. There are two ways to handle them:

  1. Drop all those rows/columns with missing values,
  2. Impute some values. 

Dropping rows/columns is not always a good choice, so we impute values – zeroes, average/mean, etc. 

Let’s have a look at how to do this using scikit’s impute module.

#Create a dataframe (15 values arranged as 5 rows x 3 columns)
import numpy as np
import pandas as pd
X = pd.DataFrame(
    np.array([1, 2, 3, np.nan, np.nan, np.nan, -7,
              0, 50, 111, 1, -1, np.nan, 0, np.nan]).reshape((5, 3)))
X.columns = ['feature1', 'feature2', 'feature3']

#Impute values when null found
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit_transform(X)

Above, we used the SimpleImputer class to create an imputer that replaces null values with the column mean. Scikit learn has functions or modules for almost every preprocessing task, and few other tools make imputation this simple. 

When it comes to feature scaling or normalizing distributions, Scikit learn has classes available in the preprocessing module: StandardScaler, MinMaxScaler, etc. It has modules for feature engineering as well. Scikit learn’s estimators expect numeric data, so you will need to convert categorical variables to numeric ones to explore the data. 
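A short sketch of scaling and of converting a categorical column to numbers, assuming scikit-learn and pandas are installed. The DataFrame and its column names (`height_cm`, `colour`) are made up for illustration, and one-hot encoding is just one of several possible encodings.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Hypothetical data: one numeric and one categorical column
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "colour": ["red", "blue", "red", "green", "blue"],
})

# Standardise to zero mean and unit variance
standardised = StandardScaler().fit_transform(df[["height_cm"]])

# Rescale to the [0, 1] range
rescaled = MinMaxScaler().fit_transform(df[["height_cm"]])

# One-hot encode the categorical column: one 0/1 column per category.
# The encoder returns a sparse matrix by default; toarray() makes it dense.
encoded = OneHotEncoder().fit_transform(df[["colour"]]).toarray()
```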

While scikit learn leads in data exploration, it has minimal use for data visualization. The visual modules are only for visualizing metrics such as the confusion matrix, detection error tradeoff curve, ROC curve, or precision-recall curve. In the next example, we’ll see how to use one of these visualization functions.

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names

# Create training and test data sets
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

from sklearn.linear_model import LogisticRegression
# ConfusionMatrixDisplay replaces plot_confusion_matrix,
# which was removed in scikit-learn 1.2
from sklearn.metrics import ConfusionMatrixDisplay

# Fit a logistic regression model (C=1 is the default regularisation strength)
lr = LogisticRegression(C=1, random_state=0)
lr.fit(X_train, y_train)

# Plot the confusion matrix for the test set
ConfusionMatrixDisplay.from_estimator(lr, X_test, y_test,
                                      display_labels=class_names,
                                      cmap=plt.cm.Blues)
plt.show()

Even though Scikit learn has some visualization modules, its support for visualizing regression problems is limited. Still, it’s one of the most effective and easily adaptable data mining tools.

Advantages

  • Open-source.
  • Strong community for support.
  • Efficient, high-performance data exploration utilities readily available for use.
  • Scikit learn APIs can be used to integrate its tools into different platforms.
  • Provides pipeline utility that can be used to automate machine learning workflows.
  • Easy to use, it’s a whole package and relies on a small number of libraries.

Disadvantages

  • Scikit learn works only with numeric data, so you will have to encode categorical data.
  • It has low flexibility: when using a function, you can’t alter anything beyond the parameters it provides.
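The pipeline utility mentioned in the advantages can be sketched as follows. This is a minimal example under stated assumptions: scikit-learn and NumPy are installed, the toy arrays are invented, and the step names (`impute`, `scale`, `classify`) are arbitrary labels chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two missing values
X = np.array([
    [1.0, 20.0], [2.0, np.nan], [3.0, 24.0],
    [8.0, 60.0], [9.0, np.nan], [10.0, 66.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Each step's output feeds the next; the last step is the model
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])

model.fit(X, y)                  # fits all three steps in order
predictions = model.predict(X)   # applies the same preprocessing, then predicts
```

Bundling preprocessing with the model means the imputation means and scaling statistics are learned from the training data only and then reused unchanged at prediction time.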

👉 See Neptune’s integration with Scikit Learn.

3. Plotly

The previous two tools don’t focus on interactive visualization. Both are Python libraries that produce mostly static plots, which limits flexibility in terms of visuals. 

Plotly develops online data analytics and visualization tools. It offers graphics and analytics tools for different platforms and frameworks like Python, R, and MATLAB. It has a data visualization library plotly.js, an open-source JS library for creating graphs. To let Python use its utilities, plotly.py has been built on top of it.

It supports 40+ unique chart types covering statistical, financial, geographic, scientific, and 3D use cases. It uses D3.js, HTML and CSS, which makes it easy to integrate interactive features like zooming in and out, or mouse hover. 

Let’s check out how we can introduce interactivity in the plots using plotly.

#Install plotly (run in a terminal, or prefix with ! in a notebook)
pip install plotly==4.14.3

#Load the iris dataset
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df.columns  = ['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width']

#Data distribution check - Histogram
import plotly.graph_objs as go
data = [go.Histogram(x=iris.data[:,0])]
layout = go.Layout(
    title='Iris Dataset - Sepal.Length',
    xaxis=dict(title='Sepal.Length'),
    yaxis=dict(title='Count'))
fig = go.Figure(data=data, layout=layout)
fig

You can see above that plotly’s plot lets you save the image, zoom-in and out, autoscale and more. You can also see that, after the mouse hover, we can see the x and y axis values. 

Let’s draw some more plots using plotly to understand how it can help end users.

To understand the relationship between variables we need a scatter plot, but it can be difficult to read the plot when we have many data points. The mouse hover function can help to read the data without making too much effort.

data = [go.Scatter(x = iris_df["Sepal.Length"],y = iris_df["Sepal.Width"],mode = 'markers')]
layout = go.Layout(title='Iris Dataset - Sepal.Length vs Sepal.Width', xaxis=dict(title='Sepal.Length'), yaxis=dict(title='Sepal.Width'))
fig = go.Figure(data=data, layout=layout)
fig

If you want your charts to be interactive, attractive, and readable, plotly is the answer.

Advantages

  • You can build interactive, JavaScript-powered plots without knowing JavaScript.
  • Plotly lets you share the plots publicly without even sharing your code.
  • Simple syntax: almost all plots use the same sequence of parameters.
  • You don’t need any technical knowledge to use plotly; you can use the GUI to create visuals.
  • Provides 3D plots with multiple interactive tools.

Disadvantages

  • Layout definition becomes complex as we try to create complex plots.
  • Unlike the other tools here, it limits the number of API calls per day, depending on the tool.
  • Charts shared publicly are visible to everyone, which is a benefit for some users but a problem for others.

👉 See Neptune’s integration with Plotly.

4. Seaborn

Matplotlib is a base for many tools, and Seaborn is one of them. In Seaborn, you can create attractive charts with minimal effort. It has high-level functions for common statistical plots to make them informative and attractive.

It integrates closely with pandas, and accepts inputs as pandas data structures. Seaborn doesn’t reimplement the plots from scratch; it wraps Matplotlib’s functions so that we can create plots with a minimum of parameters.

Seaborn has collected some common plots from Matplotlib and categorized them: relational (relplot), distributional (displot), and categorical (catplot).

  1. relplot – scatterplot, lineplot
  2. displot – histplot, kdeplot, ecdfplot, rugplot
  3. catplot – stripplot, swarmplot, boxplot, violinplot, pointplot, barplot

What was the need to categorize plots if we could just use them directly? Here’s the twist! Seaborn lets you call the individual plots directly, which is called axes-level plotting. These functions, like histplot() and lineplot(), are self-contained and act as drop-in replacements for Matplotlib plots, with some additions such as automatic axis labels and legends. When you want to combine plots, or customize further (for example, by splitting the data across several panels), you’ll need the category-level functions: this is figure-level plotting.

Let’s try some of the plots to see how easy seaborn is.

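#Load the data set
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
breast_cancer_df = pd.read_csv("data.csv")

#create heatmap of the correlations between numeric columns
plt.figure(figsize= (10,10), dpi=100)
sns.heatmap(breast_cancer_df.corr(numeric_only=True))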

Just two lines to create a heatmap! Now we will try some plots which we’ve already tried above with other tools.

#Count plot
plt.figure(figsize=(8,5))
ax = sns.countplot(x="diagnosis", data=breast_cancer_df)
plt.show()

We just created a count plot without counting anything, much unlike Matplotlib. 

The library is not limited to the plots mentioned above. It also has functions such as jointplot, pairplot, and regplot that help create customized and statistical plots with minimal coding.
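For instance, a regression plot takes a single call. The sketch below assumes seaborn, pandas and NumPy are installed; the data is synthetic (a noisy straight line) and the column names `x` and `y` are placeholders.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: y is roughly 2 * x plus random noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(0, 2, 50)})

# One call draws the scatter points, the fitted regression line
# and a 95% confidence band around it
ax = sns.regplot(data=df, x="x", y="y", ci=95)
ax.set_title("regplot sketch")
```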

Advantages

  • You can easily customize plots.
  • Default approach is much more visually appealing than Matplotlib.
  • Has some built-in plots that Matplotlib doesn’t: facet and regression. For regression, with one function you can create a regression line, confidence interval and a scatter plot.
  • Seaborn works well with pandas data structures compared to matplotlib.
  • Plots are easy to create, and it’s much easier to get insights from multiple graphs.

Disadvantages

  • No interactive plots.
  • Automates the creation of multiple figures, which sometimes leads to OOM (out of memory) issues.

5. Pandas

Pandas is one of the most popular Python libraries for data analysis and manipulation. It started off as a tool for quantitative analysis of financial data, which is why it’s very popular in time series use cases. 

Most data scientists and analysts work with tabular data formats like .csv, .xlsx, etc. Pandas provides SQL-like commands that make it easier to load, process and analyze the data. It supports two types of data structure: Series and DataFrame. Both can hold different data types. A Series is a one-dimensional indexed array; a DataFrame is a two-dimensional, table-like data structure, and is popular when dealing with real-life data.

Let’s see how series and dataframe can be defined, and unlock some of the features.

#creating a series from dataframe
ser1=pd.Series(breast_cancer_df['area_mean'])
ser1.head()

You can perform almost all operations and use all the functions we will be discussing further with pandas series also. You can also provide indexing to your series.

data = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
data

Also, you can pass dictionary data (key value object), and it can be converted into series too. 
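A minimal sketch of the dictionary case, assuming only pandas; the fruit names and prices are made-up values.

```python
import pandas as pd

# Dictionary keys become the index, values become the data
prices = pd.Series({"apple": 1.2, "banana": 0.5, "cherry": 3.0})

banana_price = prices["banana"]    # label-based lookup
expensive = prices[prices > 1.0]   # boolean filtering keeps the index labels
```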

#Describe the dataframe - take peek inside the data
breast_cancer_df.describe()

With one line of code, we were able to have a look at the data. That’s the power of pandas.

Say we want to create a subset of main dataframe, that also can be done with few lines of code.

subset_df=breast_cancer_df[["id", "diagnosis"]]
subset_df
#select data by column and position
 
print("print data for one column id: ",breast_cancer_df["id"])
print("print all the data for one row: ",breast_cancer_df.iloc[3])

Let’s see how pandas handles missing data. First, check which columns have missing values.

import numpy as np
data = {'Col1': [1,2,3,4,5,np.nan,6,7,np.nan,np.nan,8,9,10,np.nan],
        'Col2': ['a','b',np.nan,np.nan,'c','d','e',np.nan,np.nan,'f','g',np.nan,'h','i']
        }
df = pd.DataFrame(data,columns=['Col1','Col2'])
df.info()

The Non-Null Count column shows how many non-null values each column has. You can drop the rows with null values or impute some values.
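Both options can be sketched in a few lines. The example assumes pandas and NumPy and uses a shortened, made-up version of the frame above; filling text gaps with the placeholder string "unknown" is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 2, np.nan, 4],
    "Col2": ["a", np.nan, "c", "d"],
})

null_counts = df.isna().sum()   # number of nulls per column

# Option 1: drop every row that contains at least one null
dropped = df.dropna()

# Option 2: impute, here with the column mean for numbers
# and a placeholder label for text
imputed = df.fillna({"Col1": df["Col1"].mean(), "Col2": "unknown"})
```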

We can also do statistical calculations with pandas, like calculating the mean, median, and so on. String values are handled differently: there are many string functions available, like converting to lower/upper case, extracting substrings, replacing strings, and using regular expressions for pattern matching, but we won’t go into that level of detail here.

Pandas provides functions for viewing data (head or tail), creating subsets, searching and sorting, finding correlation between variables, handling missing data, reshaping – joining, merging, and more.
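A few of these operations, written next to the SQL clauses they resemble. This is a sketch assuming only pandas; the `orders` and `customers` tables and their contents are invented for the example.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [20.0, 35.0, 15.0, 50.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cid"],
})

# JOIN: attach the customer name to each order
joined = orders.merge(customers, on="customer_id", how="left")

# WHERE + ORDER BY: orders above 18, largest first
large = joined[joined["amount"] > 18].sort_values("amount", ascending=False)

# GROUP BY: total spend per customer
totals = joined.groupby("name")["amount"].sum()
```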

Not just this, but pandas also has visualization tools. It only does basic plots, but they’re easy to use. Pandas plotting is built on Matplotlib, so you call the plot method directly on a DataFrame or Series, and outside a notebook you add plt.show() to display the plot. 

breast_cancer_df[['area_mean','radius_mean','perimeter_mean']].plot.box()

The plot above identifies the outliers with a single line of code. You can also alter plot properties such as colors, labels, and more.

corr = breast_cancer_df[['area_mean','radius_mean','perimeter_mean']].corr()
#format(precision=2) replaces the deprecated set_precision(2)
corr.style.background_gradient(cmap='coolwarm').format(precision=2)

The two charts above were easy to create, but imagine we want a bar chart for the breast cancer data showing the count of each type of diagnosis. We’d first need to compute the counts, and only then could we plot the bar graph. Pandas doesn’t provide customized plots. To use a plot of your choice, you first have to manipulate the data, and then feed the appropriate data into the plot function.
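The two-step approach described here might look like the sketch below. It assumes pandas and Matplotlib are installed; the short `diagnosis` series is a made-up stand-in for the real column of the breast cancer data.

```python
import pandas as pd

# Hypothetical diagnosis labels standing in for the real column
diagnosis = pd.Series(["M", "B", "B", "M", "B", "B"], name="diagnosis")

counts = diagnosis.value_counts()   # step 1: aggregate the data
ax = counts.plot.bar()              # step 2: plot the aggregated result
ax.set_xlabel("diagnosis")
ax.set_ylabel("count")
```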

Advantages

  • Readable representation of data.
  • Extensive file format compatibility.
  • An extensive set of features available, like SQL format to join, merge and filter the data.
  • Efficient at handling large datasets.
  • Supports common visualization graphs and plots.

Disadvantages

  • Poor compatibility with 3D data.
  • Consumes more memory compared to NumPy.
  • Indexing is slower in series objects.

👉 See Neptune’s integration with Pandas.

6. D3.js

D3.js is a JavaScript library for creating dynamic and interactive visualizations in web browsers. It uses HTML, CSS and SVG to create visual representations of data. D3 stands for Data-Driven Documents, and it was created by Mike Bostock. It’s one of the best tools for data visualization for online analytics, as it manipulates the DOM by combining visual components with a data-driven approach.

We can use the Django or Flask web frameworks to create a website. This way, we can take advantage of Python’s simplicity and D3’s amazing plot collection. Python will work as a backend system, and D3 can integrate with HTML, CSS and SVG for the frontend. If your requirement is to create a dashboard, you can simply use the data that you want to analyze and use D3.js to display it.
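On the Python side, much of the backend's job is to hand the frontend data as JSON. The sketch below shows only that step, assuming pandas is installed; the DataFrame is invented, and the Flask or Django view that would return `payload` to the browser is deliberately left out.

```python
import json

import pandas as pd

# Hypothetical data the dashboard should display
df = pd.DataFrame({
    "category": ["A", "B", "C"],
    "value": [10, 25, 17],
})

# One JSON object per row, e.g. {"category": "A", "value": 10}.
# An array of objects like this is a convenient shape for D3 data joins.
payload = df.to_json(orient="records")

# Parse it back to check what the frontend would receive
records = json.loads(payload)
```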

Explaining an example of a website, webpage or dashboard with D3 code here would be a bit difficult, but let’s look at what D3 has to offer.

For one thing, relationships or network flows can be visualized in an aesthetically pleasing circular layout with a chord diagram.

Another chart stacks negative categories to the left and positive categories to the right.

A third lets you visualize a hierarchy, with the sizes adjusting as you change the depth. You can find the source code here.

D3 has a large collection of plots and it will be rare that you will have to code from scratch. You can pick any plot, and make the changes you want. Though there’s no question that you will have to write lots of code, more code means more flexibility to change.

Advantages

  • D3 is flexible, it doesn’t provide specific features and gives you full control on creating your choice of visualization.
  • Efficient, can handle large datasets.
  • D3 is built around data-driven documents, which makes it well suited to data visualization.
  • It comes with around 200k visuals.

Disadvantages

  • It’s best suited to online analytics.
  • It can be time-consuming to generate a D3 visualization.
  • It has a steep learning curve as the syntax is complex.
  • Not designed for use in notebooks; it’s focused on web-based analytics.

7. Bokeh

Bokeh is a Python data visualization library that lets users generate interactive charts and plots. It’s similar to plotly, because both libraries let you create JavaScript-powered charts and plots without writing any JS code. Like plotly and D3.js, Bokeh supports active interactions such as zooming, panning, selecting, and saving the plot.

Bokeh comes with two different interfaces/layers, which lets developers combine them based on their need and how much time they want to spend coding. Let’s find out the difference between these interfaces and their usage through some examples.

bokeh.models

This provides a low-level interface for developers. Charts can be configured by setting values for various properties. This way developers can manipulate the properties as they require. 

from bokeh.models import HoverTool
from bokeh.plotting import figure, show

#mouse-hover 
hover = HoverTool(
        tooltips=[
            ("(x,y)", "($x, $y)"),
        ]
    )
#step1 - create a plot using figure
#(width/height replace plot_width/plot_height, which Bokeh 3 no longer accepts)
p = figure(width=400, height=400, tools=[hover])
#step2 - add triangle markers with size, color
p.scatter([5, 3, 3, 1, 10], [6, 7, 2, 4, 5], marker="triangle", size=[10, 15, 20, 25, 30], color="blue")
#show the plot 
show(p)

bokeh.plotting

In this interface, you’ll have the freedom to create plots by combining visual elements (circle, triangle, line, etc.) and adding interaction tools (zooming, panning, etc.). The interaction elements are added with the help of bokeh.models. 

from bokeh.io import output_notebook, show
from bokeh.plotting import figure #import figure to create plot object  
output_notebook() #Output mode

#step1 - create a plot using figure
#(width/height replace plot_width/plot_height, which Bokeh 3 no longer accepts)
p = figure(width=400, height=400)
#step2 - add triangle markers with size, color
p.scatter([5, 3, 3, 1, 10], [6, 7, 2, 4, 5], marker="triangle", size=[10, 15, 20, 25, 30], color="blue")
#show the plot 
show(p)

There was one more interface called bokeh.charts. It had pre-built visuals like line chart, bar chart, area plot, heatmap, but it has been deprecated.

In many ways Bokeh can be a good choice for data visualization, as it gives you Matplotlib’s simplicity and an option to make your charts more interactive.

Advantages

  • It gives a choice of low-level interface, where a developer/analyst will have more flexibility to alter plots.
  • Lets you convert charts and plots of Matplotlib, ggplot.py and seaborn.
  • Interactive plots. 
  • Plots can be exported to PNG and SVG file format.
  • Bokeh produces outputs in different formats – html, notebook, and server.

Disadvantages

  • Provides limited interactivity options.
  • Doesn’t have a large support community yet, and is going through lots of development.
  • Doesn’t have 3D graphic functionalities.
  • You have to define the output mode before you create any plot, i.e. notebook, server, or web browser mode.

👉 See Neptune’s integration with Bokeh.

8. Altair

Altair is a declarative data visualization library. It’s built on Vega-Lite, which lets you create visualizations for data analysis by defining properties in JSON format. You won’t be writing any JSON declarations yourself, though, only Python. Altair converts the inputs into dictionary format for Vega-Lite.

It’s basically a Python interface for Vega-Lite. Altair supports data transformation within the chart definition. 

Altair provides built-in chart types: bar chart, line chart, area chart, histogram, scatter plot and more. Let’s draw some plots to see how Altair can help us explore data through visuals.

import altair as alt
import pandas as pd
 
#create dataframe or load data from a dataset
source = pd.DataFrame({
    'a': ['Col1', 'Col2', 'Col3','Col4', 'Col5', 'Col6'],
    'b': [28, 55, 43, 50, 30, 99]
})
 
#define altair chart
alt.Chart(source).mark_bar().encode(
    x='a',
    y='b'
)

You can see Altair gives you options to save the image, view the source (data), and edit the chart in the Vega editor. When you open the chart in the editor, you’ll see the specification generated from your code.

Your Python code will be translated into JSON format to let you play around with it in vega. Altair has more to offer than just simple charts, it lets you combine two charts and create dependencies between them. 

Advantages

  • Simple and easy to use because it’s built on top of vega lite visualization grammar.
  • Minimal code is required to produce effective and appealing visualization.
  • Gives you an option to edit graphs in vega lite.
  • Lets you focus on understanding the data rather than struggling with displaying it.

Disadvantages

  • Provides interactive charts, but with less interactivity than most of the other tools.
  • Doesn’t support 3D visualization.

👉 See Neptune’s integration with Altair.

9. YellowBrick

YellowBrick is a machine learning visualization library with two primary dependencies: Scikit learn and Matplotlib. It’s highly focused on feature engineering, and evaluating ML model performance. It has the following visualization capabilities:

  1. Feature Visualizers – Outliers, Data distribution, Dimension reduction, Rank features
  2. Target Visualizers – Feature correlation, Class Balance in training data 
  3. Regression Visualizers – Residual plot, prediction check, parameter selection
  4. Classification Visualizers – ROC, AUC, confusion matrix
  5. Clustering Visualizers – Elbow method, Distance map, Silhouette
  6. Model selection – Cross validation, Learning curve, Feature importance, Feature elimination
  7. Text Modeling Visualizers – Token frequency, Corpus distribution, Dispersion plot
  8. Visualizers for Non-scikit – missing values, scatter plot 

This list can help you identify which plot/utility should be used for what kind of requirement. To understand more about YellowBrick, let’s look at some examples.

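from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
#FeatureImportances lives in yellowbrick.model_selection in Yellowbrick 1.x
from yellowbrick.model_selection import FeatureImportances

#Sample data: the iris dataset stands in for X_sample, y_sample
X_sample, y_sample = load_iris(return_X_y=True)

clf = DecisionTreeClassifier()
viz = FeatureImportances(clf)
viz.fit(X_sample, y_sample)
#show() replaces the deprecated poof()
viz.show()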

YellowBrick combines data exploration before, during and after data modelling. That makes it a data exploration tool in the truest sense. 

Advantages

  • It makes many jobs easier, like feature selection, hyper parameter tuning, or model scoring.
  • With the help of Yellowbrick, data scientists can evaluate their model quickly and easily.
  • One of the few tools in this list built specifically for model visualization.

Disadvantages

  • Doesn’t support interactive visualization. 
  • Doesn’t support 3D plots.

10. Folium

Folium is a Python library for visualizing geospatial data, and a wrapper around Leaflet.js, an open-source JS library for interactive maps. Folium combines Python’s data wrangling strengths with the mapping features of Leaflet.js. 

The library uses tilesets from OpenStreetMap, MapBox, and the Cloudmade API. You can customize the map by adding tile layers, plotting markers, and showing directions. With the help of plugins, Folium can really help developers create customized maps easily.

Visualizing geospatial data on maps can help you understand the data better. You get a visual representation of location data points, which makes them easy to relate to the real world. For example, showing the number of cases of an illness on a map by country, state and city makes the information easier to take in. 

Let’s draw our first map with Folium and see how easy it can be.

import folium
from folium.plugins import MarkerCluster
m = folium.Map(location=[28.7041, 77.1025], zoom_start=10)
popup = "Delhi"
marker = folium.Marker([28.7041, 77.1025], popup=popup)
m.add_child(marker)
m

By just inputting latitude and longitude, we were able to draw a map and mark a location on it. Next, let’s add tile layers so that the map can be viewed in different styles.

import folium
from branca.element import Figure
from folium.plugins import MarkerCluster

popup = "Delhi"
fig=Figure(width=500,height=300)
m = folium.Map(location=[28.7041, 77.1025])
fig.add_child(m)
#Note: Stamen tiles are now hosted by Stadia Maps, so recent Folium
#releases may need an explicit tile URL and attribution for these layers
folium.TileLayer('Stamen Terrain').add_to(m)
folium.TileLayer('Stamen Toner').add_to(m)
folium.TileLayer('Stamen Water Color').add_to(m)
folium.LayerControl().add_to(m)
m

Folium saves developers the hassle of using Google Maps, placing markers and showing directions on them. In Folium, you can just import a few libraries, draw a map and focus on inputting and understanding the data.

11. Tableau

Tableau is one of the best data visualization tools. Organizing, managing, visualizing, and understanding data is extremely easy. It has easy drag-and-drop functionality, but also tools that can help discover patterns and find insights in data. 


With Tableau, you can create a dashboard, which is nothing but a collection of different visuals in one place. A dashboard is like a storyboard, where you can include multiple plots, use a variety of layouts and formats, and easily enable filters to select specific data. For example, you can create a dashboard to check the performance of a brand’s marketing campaign.

Integrating with different types of data sources in Python can take lots of coding and effort, but with a business intelligence tool like Tableau, that will be a one-click job. It has many data connectors like Amazon Athena, Redshift, Google Analytics, Salesforce, and more. 

It’s a business intelligence tool with limited support for curating data, but it lets the analyst use Python or R. With a script, the analyst can feed clean data to Tableau and create better visuals. To connect Python with Tableau, you can check out this blog on Tableau’s website.

Here’s a featured example of a Tableau dashboard, doesn’t it look like a newspaper clip?


Advantages

  • Tableau can easily handle large datasets and still provide faster computations.
  • It has a wide range of plots and graphs.
  • It’s efficient, your plot is often just a few clicks away.
  • Lets you incorporate Python to perform complex tasks and improve visualizations.
  • Supports a wide variety of data sources.
  • Has both web and desktop versions.

Disadvantages

  • The desktop version can be expensive.
  • Tableau’s web version is public, which can raise some security concerns.
  • It can be a challenge when you’re dealing with data requested via HTTP, like XML or JSON.

Conclusion

There are many tools and libraries in the market, and we choose them based on our requirements, capabilities, and budgets. Throughout this article, I discussed some of the best tools for data exploration and visualization. Each of these tools is the best in its own way, and each has its own systems and structures to dig deeper into the data and make sense of it.

Data exploration is important for business, management, and data analysts. Without exploration, you will often find yourself in blind spots. So, before you make any big decision, it’s a good idea to analyze what can happen, or what has been happening in the past. In other words, visualize your data to make better decisions.

