Data visualization
There are multiple libraries for data visualization in Python.
In this lecture, we will focus on the most popular and widely used one, matplotlib, plus seaborn, which is built on top of it.
While these libraries are general-purpose, others like plotly, geopandas and Bokeh may come in handy for specific tasks.
matplotlib
matplotlib is a plotting library that produces figures in a variety of hardcopy formats and interactive environments.
It offers many kinds of heavily customizable plots: line plots, bar plots, stacked bar plots, scatter plots, histograms and more.
matplotlib can handle categorical data, timestamps and other data types.
Import convention
Core plotting functions live in the pyplot submodule, which is conventionally imported as plt.
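The conventional import looks like this:

```python
import matplotlib.pyplot as plt  # plt is the conventional alias for pyplot
```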
Jupyter magic
IPython (Interactive Python, the shell that powers Jupyter kernels and offers support for interactive data visualization) provides magic commands that can be triggered with %.
From the docs, about the inline backend (enabled with the %matplotlib inline magic):
With this backend, the output of plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell that produced it.
The resulting plots will then also be stored in the notebook document.
Anatomy of a figure
Let’s begin by inspecting the anatomy of a figure to better understand the name of each element and what we are doing.
Each of these elements is called an Artist.
There are Artists for the axes, for the labels, for the plots, etc.
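A Figure represents the figure as a whole.
An Axes is the region of the image with the data space. An Axes contains two (or three, for 3D plots) Axis objects.
Axis objects are the number-line-like objects of an Axes: they set the limits and generate the ticks and tick labels.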
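As a minimal sketch of how these pieces relate, the snippet below creates a Figure with one Axes and inspects the objects described above. The variable names fig and ax are just the usual convention, not something required by matplotlib.

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

fig, ax = plt.subplots()            # one Figure holding one Axes

print(isinstance(fig, Artist))      # True: the Figure itself is an Artist
print(isinstance(ax, Artist))       # True: so is the Axes
print(ax.xaxis, ax.yaxis)           # the two Axis objects of a 2D Axes
```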
Create a new figure
We can create a new figure with the .figure() function of pyplot.
This step is not mandatory and, if you don’t instantiate a new figure, one will be created with the default parameters.
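A minimal sketch of explicit figure creation follows. The figsize and dpi values are assumptions, chosen so that the figure measures 3000x3000 pixels (10 inches times 300 dots per inch on each side).

```python
import matplotlib.pyplot as plt

# figsize is in inches; dpi is dots (pixels) per inch
fig = plt.figure(figsize=(10, 10), dpi=300)

# the size in pixels is inches * dpi
width_px, height_px = fig.get_size_inches() * fig.dpi
print(width_px, height_px)
```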
<Figure size 3000x3000 with 0 Axes>
Figure with a single Axes
The .subplots() function creates, in a single call, a figure and a set of subplots.
You can provide the number of rows and columns of the subplot grid.
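For instance, here is a sketch with a single Axes and with a 2x3 grid (the grid size is an arbitrary choice for illustration):

```python
import matplotlib.pyplot as plt

# default: 1 row and 1 column, so a single Axes is returned
fig, ax = plt.subplots()

# a 2 x 3 grid: axes is a NumPy array of Axes objects
fig2, axes = plt.subplots(nrows=2, ncols=3)
print(axes.shape)  # (2, 3)
```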
Plotting examples
After getting familiar with the names and with figure creation, let’s move to the actual plotting.
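A first plot could look like the sketch below. The cosine function and the sampling range are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-np.pi, np.pi, 256)   # assumed sample points

fig, ax = plt.subplots()
lines = ax.plot(x, np.cos(x))         # .plot returns a list of Line2D objects
```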
Multiple plots on the same Axes
How can we add a second plot with the sin function?
We just plot on the same Axes multiple times.
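A sketch of two curves on the same Axes, assuming the first curve is a cosine over an arbitrary range:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-np.pi, np.pi, 256)   # assumed sample points

fig, ax = plt.subplots()
ax.plot(x, np.cos(x))   # first curve
ax.plot(x, np.sin(x))   # second curve, drawn on the same Axes
```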
Setting the ticks
Ticks are placed automatically and this automagical placement usually works very well.
However, if you want to set the ticks yourself, you can:
- Set them manually (i.e., provide the values where ticks should be placed)
- Set a Locator
  - Locators define the placement of ticks according to some rule.
  - The default placement is performed by the AutoLocator() Locator.
Setting ticks manually
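A minimal sketch of manual tick placement; the tick positions are arbitrary choices that suit a cosine curve.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-np.pi, np.pi, 256)
fig, ax = plt.subplots()
ax.plot(x, np.cos(x))

# place the ticks exactly at the given values
ax.set_xticks([-np.pi, -np.pi / 2, 0, np.pi / 2, np.pi])
ax.set_yticks([-1, 0, 1])
```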
Setting a Locator instead
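A sketch using MultipleLocator from matplotlib.ticker, which places a tick at every multiple of a base value. The choice of this Locator and of the base is an assumption; other Locators follow the same pattern.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import MultipleLocator

x = np.linspace(-np.pi, np.pi, 256)
fig, ax = plt.subplots()
ax.plot(x, np.cos(x))

# one major tick at every multiple of pi/2 on the x Axis
ax.xaxis.set_major_locator(MultipleLocator(base=np.pi / 2))
```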
Setting axis limits
You may need to set the limits of an Axis.
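For example, with the .set_xlim() and .set_ylim() methods of the Axes (the limit values here are arbitrary):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-np.pi, np.pi, 256)
fig, ax = plt.subplots()
ax.plot(x, np.cos(x))

ax.set_xlim(-4, 4)        # limits of the x Axis
ax.set_ylim(-1.5, 1.5)    # limits of the y Axis
```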
Adding a legend
You can easily add a legend to your plot by setting a label for each plotted series and calling the .legend() method.
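A small sketch, with label strings chosen for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-np.pi, np.pi, 256)
fig, ax = plt.subplots()
ax.plot(x, np.cos(x), label="cos(x)")   # the label is what the legend shows
ax.plot(x, np.sin(x), label="sin(x)")
legend = ax.legend()                    # collects the labels set above
```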
Setting the labels
We can set the Axis labels and Figure title as follows.
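A sketch of the relevant calls; the label and title strings are placeholders. Note that .set_title() titles a single Axes, while .suptitle() titles the whole Figure.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-np.pi, np.pi, 256)
fig, ax = plt.subplots()
ax.plot(x, np.cos(x))

ax.set_xlabel("x")                           # label of the x Axis
ax.set_ylabel("cos(x)")                      # label of the y Axis
ax.set_title("Axes title")                   # title of this Axes
title = fig.suptitle("Figure title")         # title of the whole Figure
```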
Plot types
Line plots
A line plot is a type of chart that displays a series of data points called "markers" connected by straight line segments. [Wiki]
The associated method is .plot and it is highly customizable.
For an extensive list of properties, markers and styles, visit the documentation!
As an example, you can configure line color, width and markers.
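A sketch of a customized line; the specific color, width and marker are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-np.pi, np.pi, 256)
fig, ax = plt.subplots()
(line,) = ax.plot(
    x, np.cos(x),
    color="tab:red",    # line color
    linewidth=2.5,      # line width in points
    linestyle="--",     # dashed line
    marker="o",         # circular markers
    markevery=16,       # draw a marker every 16th point only
)
```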
You can also plot only the markers.
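One way to do it, as a sketch, is to disable the line style and keep the marker:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-np.pi, np.pi, 32)
fig, ax = plt.subplots()
# linestyle="none" hides the connecting segments, leaving only the markers
(line,) = ax.plot(x, np.cos(x), linestyle="none", marker="o")
```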
Scatter plots
A scatter plot is a type of plot that displays values as markers.
Scatter plots are often used to display two (or more) variables encoded as x and y coordinates, but also as color and size of the markers.
This kind of plot is used to visually inspect the data and find, for instance, relations between the variables.
You can create scatter plots in matplotlib with the .scatter method.
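A sketch on randomly generated data (the data and the styling values are made up). The c and s parameters encode two extra variables as marker color and size.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)        # made-up data for illustration
x = rng.normal(size=50)
y = rng.normal(size=50)
colors = rng.random(50)               # third variable, shown as color
sizes = 200 * rng.random(50)          # fourth variable, shown as marker area

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=colors, s=sizes, alpha=0.6, cmap="viridis")
fig.colorbar(points, ax=ax)           # color scale for the third variable
```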
This kind of plot is highly customizable.
Barplots
Barplots represent categorical data with rectangular bars.
They can be plotted vertically or horizontally and the height (or length) of each bar depends on the values.
Barplots and horizontal barplots can be created with the .bar and .barh methods respectively.
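A sketch with ten made-up categories, drawn both vertically and horizontally:

```python
import matplotlib.pyplot as plt
import numpy as np

labels = [str(i) for i in range(10)]   # made-up category names
values = np.arange(1, 11)              # made-up values

fig, (ax_v, ax_h) = plt.subplots(ncols=2)
vbars = ax_v.bar(labels, values)       # vertical bars: values map to heights
hbars = ax_h.barh(labels, values)      # horizontal bars: values map to widths
```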
Stacked barplots
Stacked barplots are barplots that stack multiple values of the same category together.
The height (or length!) of the resulting bar shows the combined result.
Vertical stacked barplots
You just use the .bar method and provide the sum of the previous groups as the bottom (offset) parameter.
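A sketch with two made-up groups; the second group starts where the first one ends.

```python
import matplotlib.pyplot as plt
import numpy as np

categories = ["A", "B", "C"]       # made-up data
first = np.array([3, 5, 2])
second = np.array([4, 1, 6])

fig, ax = plt.subplots()
ax.bar(categories, first, label="first")
# bottom=first lifts each bar of the second group on top of the first one
top = ax.bar(categories, second, bottom=first, label="second")
ax.legend()
```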
Horizontal stacked barplots
You just use the .barh method and provide the sum of the previous groups as the left (offset) parameter.
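The horizontal counterpart, again on made-up data:

```python
import matplotlib.pyplot as plt
import numpy as np

categories = ["A", "B", "C"]       # made-up data
first = np.array([3, 5, 2])
second = np.array([4, 1, 6])

fig, ax = plt.subplots()
ax.barh(categories, first, label="first")
# left=first shifts each bar of the second group to the end of the first one
right = ax.barh(categories, second, left=first, label="second")
ax.legend()
```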
Seaborn
Seaborn is a library for making statistical graphics in Python.
It has built-in functions to show relationships between variables and to visualize univariate and bivariate distributions. It also provides estimators and linear regression models.
It has advanced support for categorical data.
Seaborn comes with nice built-in themes and with better default colors, chosen to improve readability. It also has advanced functions that simplify the construction of a plot (e.g., grids, legends).
Seaborn is built on top of matplotlib and is closely integrated with pandas.
Import convention
Seaborn is conventionally imported as sns.
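The conventional import looks like this:

```python
import seaborn as sns  # sns is the conventional alias for seaborn
```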
Themes
Seaborn comes with nice themes that affect even your matplotlib plots. The default can be set with
.set_theme(context='notebook', style='darkgrid', palette='deep', font='sans-serif', font_scale=1, …)
The context parameter affects the scale of the figure elements and is meant to make switching between contexts (paper, poster, etc.) easy.
The style parameter affects some aesthetic elements like colors of the axes and of the grid.
The palette parameter affects the color palette.
The other parameters are self-explanatory.
You can also set the parameters above individually.
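A sketch that applies the default theme and then draws a plain matplotlib plot, which picks up the seaborn look. The plotted curve is an arbitrary example.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# these are the documented defaults, written out explicitly
sns.set_theme(context="notebook", style="darkgrid", palette="deep")

x = np.linspace(-np.pi, np.pi, 256)
fig, ax = plt.subplots()
ax.plot(x, np.cos(x))   # a plain matplotlib call, styled by the theme
```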
Context
You can set the context using the .set_context() function.
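A sketch that switches to the "talk" context (an arbitrary choice among the available ones) and compares the font sizes that two contexts would apply:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_context("talk")   # scales fonts and line widths for slides

# plotting_context returns the parameters a context would set, without applying them
paper = sns.plotting_context("paper")
poster = sns.plotting_context("poster")
print(paper["font.size"], poster["font.size"])
```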
Styles
You can set the style using the .set_style() function.
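A sketch using the "whitegrid" style (an arbitrary choice); the plotted curve is only an example.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set_style("whitegrid")   # white background with a grid

x = np.linspace(-np.pi, np.pi, 256)
fig, ax = plt.subplots()
ax.plot(x, np.cos(x))

# axes_style returns the parameters of a style without applying it
ticks_style = sns.axes_style("ticks")
```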
Color palette
You can set the color palette using the .set_palette() function (and visualize palettes with .color_palette()).
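A sketch with the "colorblind" palette, picked here only as an example. The sns.palplot() helper draws the colors as a row of swatches.

```python
import seaborn as sns

sns.set_palette("colorblind")                # default colors for later plots

palette = sns.color_palette("colorblind")    # list of RGB tuples
sns.palplot(palette)                         # draw the colors as swatches
print(len(palette))                          # 10
```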
Removing the spines
You can also remove the spines (the border lines of the Axes) using .despine().
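A minimal sketch; by default .despine() hides the top and right spines.

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

sns.despine(ax=ax)   # removes the top and right spines of this Axes
```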
Structured multi-plot grids: FacetGrid
When visualizing data, you may need to plot multiple instances of the same plot on different subsets of your dataset.
For this purpose, seaborn provides FacetGrids, which are basically grids of Axes.
FacetGrid can have up to three dimensions: row, col and hue.
We will not discuss how to create FacetGrids manually as many functions automatically create them.
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Plotting with seaborn
Seaborn has a number of very versatile plotting functions.
We will only focus on a few that work as a “wrapper” for the basic ones.
Another nice thing is that seaborn can create multiple Axes automagically through FacetGrid, depending on the row and col parameters, which refer to columns of your DataFrame.
Relations
How can we plot relations with seaborn?
The relplot() function provides access to several different axes-level functions that show the relationship between two variables with semantic mappings of subsets.
seaborn.relplot(data=None, x=None, y=None, hue=None, size=None, style=None, row=None, col=None, palette=None, sizes=None, markers=None, dashes=None, legend='auto', kind='scatter', …)
The kind parameter selects the underlying axes-level function to use:
- scatterplot() (with kind="scatter"; the default)
- lineplot() (with kind="line")
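A sketch of relplot on a tiny DataFrame that mimics the columns of the tips dataset. The last three rows are made up so that the col parameter has two values to split on.

```python
import pandas as pd
import seaborn as sns

# small sample in the shape of the tips dataset (partly made up)
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 12.50, 30.10, 18.20],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 2.00, 4.50, 2.75],
    "smoker": ["No", "No", "No", "No", "No", "Yes", "Yes", "Yes"],
    "time": ["Dinner"] * 5 + ["Lunch"] * 3,
})

# one scatter plot per value of "time", colored by "smoker"
g = sns.relplot(data=df, x="total_bill", y="tip",
                hue="smoker", col="time", kind="scatter")
print(g.axes.shape)  # (1, 2): one row, two columns of Axes
```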
Distributions
Distributions can be plotted with .displot()
seaborn.displot(data=None, x=None, y=None, hue=None, row=None, col=None, weights=None, kind='hist', rug=False, log_scale=None, legend=True, palette=None, color=None, …)
It can plot histograms, kernel density estimates (KDE) or empirical (cumulative) distribution function (ECDF). The KDE and rug plot (showing the individual observations) can also be added to the plot.
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
Catplot
The catplot function provides several functions that show the relationship between a numerical and one or more categorical variables.
seaborn.catplot(data=None, x=None, y=None, hue=None, row=None, col=None, col_wrap=None, estimator=<function mean>, ci=95, n_boot=1000, units=None, kind='strip', …)
The kind parameter selects the underlying axes-level function to use. There are categorical:
- scatterplots (stripplot with kind="strip", swarmplot with kind="swarm")
- distribution plots (boxplot with kind="box", violinplot with kind="violin", boxenplot with kind="boxen")
- estimate plots (pointplot with kind="point", barplot with kind="bar", countplot with kind="count")
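A sketch of catplot with kind="box" on made-up data that borrows two column names from the tips dataset:

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)   # made-up bills for three days
df = pd.DataFrame({
    "day": np.repeat(["Fri", "Sat", "Sun"], 30),
    "total_bill": np.concatenate([rng.normal(17, 4, 30),
                                  rng.normal(20, 5, 30),
                                  rng.normal(21, 5, 30)]),
})

# one box per category; changing kind switches the underlying function
g = sns.catplot(data=df, x="day", y="total_bill", kind="box")
```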
Regressions
lmplot provides an easy way to fit regression models and plot them.
It is intended as a convenient interface to fit regression models across conditional subsets of a dataset.
seaborn.lmplot(*, x=None, y=None, data=None, hue=None, col=None, row=None, palette=None, col_wrap=None, x_estimator=None, x_bins=None, x_ci='ci', scatter=True, fit_reg=True, ci=95, n_boot=1000, units=None, seed=None, order=1, logistic=False, lowess=False, robust=False, logx=False, x_partial=None, y_partial=None, truncate=True, x_jitter=None, y_jitter=None, scatter_kws=None, line_kws=None, size=None)
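A sketch of lmplot on made-up data with a roughly linear relationship. The column names x, y and group are hypothetical; the hue parameter fits one regression line per group.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)   # made-up, roughly linear data
x = rng.uniform(0, 10, 60)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(0, 1, 60),
    "group": ["a", "b"] * 30,
})

# scatter plot plus a fitted regression line for each group
g = sns.lmplot(data=df, x="x", y="y", hue="group")
```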
Heatmap and clustermap
A heatmap is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions. [Wiki]
The color may vary in hue or intensity. When the rows and columns are reordered to reveal clusters in the data, the plot is called a clustered heatmap (clustermap).
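A sketch of a heatmap on a small random matrix (the data and the colormap are arbitrary choices):

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)   # made-up 4 x 5 matrix
data = pd.DataFrame(rng.random((4, 5)), columns=list("ABCDE"))

# each cell is colored by its value; annot=True also writes the value
ax = sns.heatmap(data, annot=True, cmap="viridis")

# sns.clustermap(data) would additionally reorder rows and columns
# by hierarchical clustering (it relies on SciPy)
```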
Exercise
Given the CSV file from the previous lecture, load it and perform some statistical analysis.
Load the CSV file.
| Number | Year | Album | Artist | Genre | Subgenre |
|---|---|---|---|---|---|
| 1 | 1967 | Sgt. Pepper's Lonely Hearts Club Band | The Beatles | Rock | Rock & Roll, Psychedelic Rock |
| 2 | 1966 | Pet Sounds | The Beach Boys | Rock | Pop Rock, Psychedelic Rock |
| 3 | 1966 | Revolver | The Beatles | Rock | Psychedelic Rock, Pop Rock |
| 4 | 1965 | Highway 61 Revisited | Bob Dylan | Rock | Folk Rock, Blues Rock |
| 5 | 1965 | Rubber Soul | The Beatles | Rock, Pop | Pop Rock |
Plot the histogram of the number of albums in the chart for each year.
Show the unique values of the Genre column.
[Rock, Rock, Pop, Funk / Soul, Rock, Blues, Jazz, ..., Electronic, Funk / Soul, Rock, Funk / Soul, Blues, Rock,ÊPop, Electronic, Rock, Funk / Soul, Blues, Pop, Rock, Reggae, Latin]
Length: 63
Categories (63, object): [Rock, Rock, Pop, Funk / Soul, Rock, Blues, ..., Rock, Funk / Soul, Blues, Rock,ÊPop, Electronic, Rock, Funk / Soul, Blues, Pop, Rock, Reggae, Latin]
Since many rows list several genres at once, which produces too many distinct values, keep just the first genre of each row.
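One possible way to do this with pandas string methods is sketched below on a short made-up Series; with the real data you would apply the same expression to the Genre column.

```python
import pandas as pd

# made-up sample of comma-separated genre strings
genres = pd.Series(["Rock", "Rock, Pop", "Funk / Soul", "Rock, Blues"])

# split on commas, keep the first piece, drop surrounding whitespace
main_genre = genres.str.split(",").str[0].str.strip()
print(main_genre.unique())
```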
array(['Rock', 'Funk / Soul', 'Jazz', 'Blues', 'Pop', 'Folk', 'Classical',
'Reggae', 'Hip Hop', 'Electronic', 'Latin'], dtype=object)
Set the Genre as a categorical variable.
[Rock, Funk / Soul, Jazz, Blues, Pop, ..., Classical, Reggae, Hip Hop, Electronic, Latin]
Length: 11
Categories (11, object): [Rock, Funk / Soul, Jazz, Blues, ..., Reggae, Hip Hop, Electronic, Latin]
Now let’s plot a scatterplot of the position (Number, used as index) as a function of the Year. As a third variable, we are also interested in the main genre of the album.
Reset the index to restore the Number column.
And compute the correlation between the Number and Year columns.
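A sketch of both steps on a five-row stand-in for the dataset. With so few rows the resulting coefficient is only illustrative and will differ from the one computed on the full data.

```python
import pandas as pd

# five-row stand-in, with Number as the index
df = pd.DataFrame(
    {"Year": [1967, 1966, 1966, 1965, 1965]},
    index=pd.Index([1, 2, 3, 4, 5], name="Number"),
)

df = df.reset_index()                    # Number becomes a regular column again
corr = df[["Number", "Year"]].corr()     # pairwise Pearson correlation
print(corr)
```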
| | Number | Year |
|---|---|---|
| Number | 1.000000 | 0.325667 |
| Year | 0.325667 | 1.000000 |
What about showing the correlation via a plot?
The heatmap above shows the correlation between the two variables.
Yet, we don’t like the colors and we know the matrix is symmetrical, so we would like to show only the lower triangle.
Let’s try to improve the plot.
What about the color palette?
While matplotlib has many palettes, not all of them are actually good for visualization.
Seaborn tries to overcome this limitation by providing a nice interface for the HSLuv (formerly known as HUSL) color system, which works better with human vision as it minimizes the variation of intensity of colors.
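Two seaborn functions that work in this color space are sketched below. The hue values 220 and 20 and the number of colors are arbitrary choices.

```python
import seaborn as sns

# evenly spaced hues in the HUSL (HSLuv) space
palette = sns.husl_palette(8)

# diverging colormap between two HUSL hues, handy for correlation matrices
cmap = sns.diverging_palette(220, 20, as_cmap=True)
```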
Let’s generate a mask to filter out some values from the plot.
We are interested in displaying just the lower (or the upper) triangle of the matrix.
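A sketch of one way to build such a mask with NumPy and pass it to the heatmap. The correlation matrix is typed in by hand here, using the values computed earlier; cells where the mask is True are hidden.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# correlation matrix typed in by hand for illustration
corr = pd.DataFrame([[1.0, 0.325667], [0.325667, 1.0]],
                    index=["Number", "Year"], columns=["Number", "Year"])

# True strictly above the diagonal (k=1), False elsewhere
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
print(mask)

# masked (True) cells are not drawn, so only the lower triangle remains
ax = sns.heatmap(corr, mask=mask, annot=True)
```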
array([[False, True],
[False, False]])
At this point we want to group the DataFrame by the main genre to compute the average position in the chart.
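A sketch of the grouping step on a tiny made-up DataFrame that reuses the column names Number and MainGenre:

```python
import pandas as pd

# made-up rows; only the column names follow the exercise
df = pd.DataFrame({
    "Number": [1, 2, 3, 4],
    "MainGenre": ["Rock", "Rock", "Jazz", "Jazz"],
})

# average chart position for each main genre
avg_position = df.groupby("MainGenre")["Number"].mean()
print(avg_position)
```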
MainGenre
Blues 243.22
Classical 45.00
Electronic 307.13
Folk 262.23
Funk / Soul 208.08
Hip Hop 301.41
Jazz 187.42
Latin 107.00
Pop 145.00
Reggae 191.86
Rock 250.39
Name: Number, dtype: float64