# Data visualization

There are multiple libraries for data visualization in *python*.

In this lecture, we will focus on the most popular and widely used one, *matplotlib*, plus _seaborn_, which is built on top of it.

While these libraries are general-purpose, others such as *plotly*, *geopandas* and *Bokeh* may come in handy for specific tasks.

## _matplotlib_

_matplotlib_ is a __plotting library__ that produces figures in a variety of hardcopy formats and interactive environments.

It offers many kinds of __heavily customizable__ plots: __line plots__, __bar plots__, __stacked bar plots__, __scatter plots__, __histograms__ and more.

_matplotlib_ can handle categorical data, timestamps and other data types.

### Import convention
The core plotting functions live in the _.pyplot_ submodule, which is conventionally imported as *plt*.

In [None]:
import matplotlib.pyplot as plt

#### Jupyter magic
_IPython_ (*Interactive Python*, the shell that powers Jupyter kernels and offers support for interactive data visualization) provides _magic_ commands that are triggered with the `%` prefix.

From the docs:

    With the following backend, the output of plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell that produced it.
    The resulting plots will then also be stored in the notebook document.

In [None]:
%matplotlib inline

## Anatomy of a figure
<img src="anatomy.png" style="float: right; height: 100%; max-width:60%; margin-left: 3rem;" >

Let's begin by inspecting the anatomy of a figure to better understand the name of each element and what we are doing.

Each of these elements is called an ___Artist___.

There are *Artist*s for the axes, for the labels, for the plots, etc.

A ___Figure___ represents the figure as whole.

___Axes___ is the region of the image that contains the data space. An _Axes_ contains two (or three) *Axis* objects.

___Axis___ objects are the individual axes (x, y and possibly z) of an _Axes_; they handle the ticks and tick labels.
<!-- ![](anatomy.png) -->

In [None]:
plt.rcParams['figure.figsize'] = [6, 4]
plt.rcParams['figure.dpi'] = 150
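
To make the names above more concrete, here is a minimal sketch that inspects the objects of a freshly created figure. The variable names (`anatomy_fig`, `anatomy_ax`) are only illustrative.

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist
from matplotlib.axis import XAxis, YAxis

# One Figure holding a single Axes
anatomy_fig, anatomy_ax = plt.subplots()

# The Figure keeps a list of the Axes it contains
print(anatomy_fig.axes == [anatomy_ax])

# Each Axes owns its Axis objects
print(isinstance(anatomy_ax.xaxis, XAxis), isinstance(anatomy_ax.yaxis, YAxis))

# Figure, Axes and Axis are all Artists
print(all(isinstance(obj, Artist) for obj in (anatomy_fig, anatomy_ax, anatomy_ax.xaxis)))
```

Note the difference between `anatomy_ax` (an _Axes_, the whole plotting region) and `anatomy_ax.xaxis` (an _Axis_, a single axis of that region).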

## Create a new figure
We can create a new figure with the *.figure()* function.

This step is not mandatory and, if you don't instantiate a new figure, one will be created with the default parameters.

In [None]:
# an empty figure with no Axes
fig = plt.figure(figsize=(10,10), dpi=300)

#### Figure with a single Axes
The _\.subplots()_ function creates, in a single call, a figure and a set of subplots.

You can provide the number of rows and columns in the plot.

In [None]:
# a figure with a single Axes
fig, ax = plt.subplots()

In [None]:
# a figure with a 2x2 grid of Axes
fig, axs = plt.subplots(2, 2)
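
When more than one row and column are requested, the second returned value is a NumPy array of _Axes_ that can be indexed by row and column. The following is a small sketch with illustrative names (`grid_fig`, `grid_axs`) and made-up data.

```python
import numpy as np
import matplotlib.pyplot as plt

grid_fig, grid_axs = plt.subplots(2, 2)

t = np.linspace(0, 1, 50)

# Pick a single Axes by (row, column)
grid_axs[0, 0].plot(t, t)
grid_axs[1, 1].plot(t, t ** 2)

# Iterate over all the Axes of the grid
for i, axes in enumerate(grid_axs.flat):
    axes.set_title(f"Axes {i}")

print(grid_axs.shape)
```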

### Plotting examples
After getting familiar with the names and with figure creation, let's move to the actual plotting.

In [None]:
import numpy as np

X = np.linspace(-np.pi, np.pi, 128)
C = np.cos(X)

fig, ax = plt.subplots()
ax.plot(X, C)

### Multiple plots on the same _Axes_
How can we add a second plot with the _sin_ function?

We just plot on the same _Axes_ multiple times.

In [None]:
S = np.sin(X)

ax.plot(X, S)

display(fig)

### Setting the _ticks_
_Ticks_ are placed automatically, and this _automagical_ placement usually works very well.

However, if you want to set up the *ticks* yourself, you can:

- Set them up manually (i.e., providing the values where the ticks should be placed)
- Set up a ***Locator***
    - ***Locators*** define the placement of ticks according to some rule.
    - The default placement is performed by the ***AutoLocator()*** *Locator*.

Setting _ticks_ manually

In [None]:
# Setting up on the axis
ax.xaxis.set_ticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])
ax.yaxis.set_ticks([-1, 0, +1])

# # Setting up on the plot
# plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])
# plt.yticks([-1, 0, +1])

display(fig)

Setting a Locator instead

In [None]:
from matplotlib.ticker import LinearLocator

ax.xaxis.set_major_locator(LinearLocator())

display(fig)
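
As another example of a rule-based placement, the sketch below uses `MultipleLocator` to put a tick at every multiple of π/2. It builds its own figure, and the names (`loc_fig`, `loc_ax`, `loc_ticks`) are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

loc_x = np.linspace(-np.pi, np.pi, 128)

loc_fig, loc_ax = plt.subplots()
loc_ax.plot(loc_x, np.cos(loc_x))

# Place a major tick at every multiple of pi/2
loc_ax.xaxis.set_major_locator(MultipleLocator(base=np.pi / 2))

# Tick positions computed by the Locator
loc_ticks = loc_ax.get_xticks()
print(loc_ticks)
```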

### Setting up axis limits
You may need to set the limits of an _Axis_.

In [None]:
# plt.xlim(X.min() * 1.1, X.max() * 1.1)
# plt.ylim(C.min() * 1.1, C.max() * 1.1)

ax.set_xlim(X.min() * 2, X.max() * 2)
ax.set_ylim(C.min() * 2, C.max() * 2)

display(fig)

### Adding a legend
You can easily add a legend to your plot by setting a label for each plotted line and calling the *.legend()* method.

In [None]:
fig, ax = plt.subplots()

ax.plot(X, C, label="cos")
ax.plot(X, S, label="sin")

ax.legend(loc='best')
# fig.legend()

# plt.plot(X, C, label="cos")
# plt.plot(X, S, label="sin")
# plt.legend()

# ax.legend()

display(fig)

### Setting the labels
We can set the _Axis_ labels and Figure title as follows.

In [None]:
# plt.xlabel('x label')
# plt.ylabel('y label')
# plt.title("Simple Plot")

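# Note: ax.xaxis.set_label() only sets the Artist label (used by legends),
# not the text shown next to the axis
ax.set_xlabel('x label')
ax.set_ylabel('y label')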
ax.set_title("Simple plot")
display(fig)
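
If you want to check what has been set, each setter has a matching getter. Here is a minimal sketch on a new figure (the names `label_fig` and `label_ax` are illustrative).

```python
import matplotlib.pyplot as plt

label_fig, label_ax = plt.subplots()

label_ax.set_xlabel("x label")
label_ax.set_ylabel("y label")
label_ax.set_title("Simple plot")

# Read the values back with the corresponding getters
print(label_ax.get_xlabel(), label_ax.get_ylabel(), label_ax.get_title())
```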

### Plot types

#### Line plots
    Line plot is a type of chart that displays a series of data points called "markers" connected by straight line segments. [Wiki]
    
The associated method is *\.plot* and it is highly customizable.

For an extensive list of properties, markers and styles, visit the documentation!

As an example, you can configure line color, width and markers.

In [None]:
plt.plot(X, C, linewidth=5, color="red")
plt.plot(X, S, marker="D", markersize=0.1, color="green")

You can also plot only the markers.

In [None]:
indices = np.random.choice(list(range(X.shape[0])), size=64)
plt.plot(X[indices], (S/2)[indices], marker="o", linewidth=0, color="green", markersize=10)

#### Scatter plots
A scatter plot is a type of plot that displays values as markers.

Scatter plots are often used to display two (or more) variables encoded as *x* and _y_ coordinates, but also as color and size of the markers.

This kind of plot is used to visually inspect the data and find, for instance, relations between the variables.

You can create scatter plots in matplotlib with the _.scatter_ method.    

In [None]:
plt.scatter(np.random.rand(1, 20), np.random.rand(1, 20))
plt.scatter(np.random.rand(1, 20), np.random.rand(1, 20))

This kind of plot is highly customizable.

In [None]:
plt.scatter(S, C, s=75, c=X, alpha=.5)
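
To show how color and size can encode additional variables, the sketch below draws random points whose color follows `x + y` and whose size is random, and adds a colorbar to read the color scale. The data and all the names are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sc_x = rng.random(50)
sc_y = rng.random(50)
sc_value = sc_x + sc_y                # third variable, encoded as color
sc_size = 200 * rng.random(50) + 10   # fourth variable, encoded as marker size

sc_fig, sc_ax = plt.subplots()
points = sc_ax.scatter(sc_x, sc_y, c=sc_value, s=sc_size, cmap="viridis", alpha=0.7)

# The colorbar maps the colors back to the values of the third variable
sc_fig.colorbar(points, ax=sc_ax, label="x + y")
```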

#### Barplots

_Barplots_ represent categorical data with rectangular bars.

They can be plotted vertically or horizontally and the height (or length) of each bar depends on the values.

_Barplots_ and _horizontal barplots_ can be created with the _\.bar_ and _\.barh_ methods respectively.

In [None]:
teams = list("abcdefghil")
match1 = np.random.randint(1, 10, size=10)

fig, ax = plt.subplots(1, 2)

ax[0].bar(teams, match1)
ax[1].barh(teams, match1)

#### Stacked barplots
Stacked barplots are barplots that stack multiple values of the same category together.

The height (or length!) of the resulting bar shows the combined result.

##### Vertical stacked barplots
You just use the _.bar_ method and provide the sum of the previous groups as the _bottom_ (offset) parameter.

In [None]:
match2 = np.random.randint(1, 10, size=10)

p1 = plt.bar(teams, match1, label="Match 1")
p2 = plt.bar(teams, match2, label="Match 2", bottom=match1)

plt.legend()
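
With more than two groups, the offset is the running sum of all the groups already drawn. A possible sketch, with made-up data and illustrative names, keeps that running sum in a `stack_bottom` array:

```python
import numpy as np
import matplotlib.pyplot as plt

stack_teams = list("abcde")
rng = np.random.default_rng(0)

# Three made-up matches, one score per team
stack_matches = {
    f"Match {i + 1}": rng.integers(1, 10, size=len(stack_teams))
    for i in range(3)
}

stack_fig, stack_ax = plt.subplots()

# Running sum of the groups already drawn, used as the offset of the next one
stack_bottom = np.zeros(len(stack_teams))
for name, scores in stack_matches.items():
    stack_ax.bar(stack_teams, scores, bottom=stack_bottom, label=name)
    stack_bottom = stack_bottom + scores

stack_ax.legend()
```

The same loop works for horizontal bars by replacing `bar` with `barh` and `bottom` with `left`.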

##### Horizontal stacked barplots
You just use the _.barh_ method and provide the sum of the previous groups as the _left_ (offset) parameter.

In [None]:
p1 = plt.barh(teams, match1, label="Match 1")
p2 = plt.barh(teams, match2, label="Match 2", left=match1)

plt.legend()

## _Seaborn_
_Seaborn_ is a library for making __statistical graphics__ in Python.

It has built-in functions to show relationships between variables and to visualize univariate and bivariate distributions. It also provides estimators and linear regression models.

It has advanced support for categorical data.

_Seaborn_ comes with nice built-in themes to improve your plots and with better default colors, designed to improve readability.
It also has advanced functions to simplify the construction of the plot (e.g., grids, legends).

_Seaborn_ is built on top of _matplotlib_ and is closely integrated with *pandas*.

### Import convention
_Seaborn_ is conventionally imported as *sns*.

In [None]:
import seaborn as sns

### Themes
_Seaborn_ comes with nice themes that affect even your _matplotlib_ plots.
The default can be set with

> _.set_theme(context='notebook', style='darkgrid', palette='deep', font='sans-serif', font_scale=1, ...)_

The _context_ parameter affects the scale of the figure's elements and is meant to make it easy to switch between different contexts (paper, poster, etc.).

The _style_ parameter affects some aesthetic elements like colors of the axes and of the grid.

The _palette_ parameter affects the color palette.

The other parameters are self-explanatory.

You can also set the parameters above individually.

In [None]:
sns.set_theme()
# sns.reset_orig()

In [None]:
plt.plot(X, C, label="cos")

#### Context
You can set the context using the _.set\_context()_ function. Each call overrides the previous one, so only the last call in the cell below takes effect.

In [None]:
sns.set_context("poster")
sns.set_context("paper")
sns.set_context("talk")
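
To see what a context actually changes, you can inspect the parameters it would set without applying them. The sketch below uses `sns.plotting_context()` and prints two of the returned values for each built-in context.

```python
import seaborn as sns

# Compare font size and line width across the built-in contexts
for ctx_name in ("paper", "notebook", "talk", "poster"):
    ctx_params = sns.plotting_context(ctx_name)
    print(ctx_name, ctx_params["font.size"], ctx_params["lines.linewidth"])
```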

In [None]:
plt.plot(X, C, label="cos")

#### Styles
You can set the style using the *.set\_style()* function.

In [None]:
# sns.set_style("white")
# sns.set_style("whitegrid")
sns.set_style("dark")
sns.set_style("darkgrid")
sns.set_style("ticks")

In [None]:
plt.plot(X, C, label="cos")

#### Color palette
You can set the color palette using the *.set\_palette()* function (and visualize palettes with *.color\_palette()*).

In [None]:
# sns.set_palette("flare")
# sns.set_palette("pastel")
# sns.set_palette("dark")
sns.set_palette("Dark2")
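
A palette is just a list of RGB colors, so it can also be inspected or reused directly. Here is a small sketch; the variable name `pal` is illustrative.

```python
import seaborn as sns

# Take the first four colors of the "Dark2" palette
pal = sns.color_palette("Dark2", n_colors=4)

# Each color is an (r, g, b) tuple; as_hex() gives the hexadecimal codes
print(pal.as_hex())
```

In a notebook, leaving `pal` as the last expression of a cell displays the colors as swatches.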

In [None]:
plt.plot(X, C, label="cos")

#### Removing the spines
You can also remove the spines (the border lines of the _Axes_) using *.despine()*.

In [None]:
plt.plot(X, C, label="cos")

sns.despine(left=True, top=True)

In [None]:
plt.rcParams['figure.figsize'] = [6, 4]
plt.rcParams['figure.dpi'] = 150

### Structured multi-plot grids: _FacetGrid_

When visualizing data, you may need to plot multiple instances of the same plot on different subsets of your dataset.

For this purpose, _seaborn_ provides ***FacetGrid***s, which are basically grids of *Axes*.

*FacetGrid* can have up to three dimensions: *row*, *col* and *hue*.

We will not discuss how to create *FacetGrid*s manually as many functions automatically create them.

In [None]:
# Datasets: _seaborn_ has some built in datasets for testing
tips = sns.load_dataset("tips")

tips.head()

In [None]:
# Example of manual creation
g = sns.FacetGrid(tips, col="day", hue="sex")
g.map(sns.barplot, "sex", "total_bill", order=["Male", "Female"])

### Plotting with _seaborn_

_Seaborn_ has a number of very versatile plotting functions.

We will only focus on a few that work as a "wrapper" for the basic ones.

Another nice thing is that they can create multiple _Axes_ automagically through *FacetGrid*, depending on the _row_ and _col_ parameters, which refer to columns of your *DataFrame*.

#### Relations
How can we plot relations with *seaborn*?

    The relplot() function provides access to several different axes-level functions that show the relationship between two variables with semantic mappings of subsets.

> seaborn.relplot(data=None, x=None, y=None, hue=None, size=None, style=None, row=None, col=None, palette=None, sizes=None, markers=None, dashes=None, legend='auto', kind='scatter', ...)

The _kind_ parameter selects the underlying axes-level function to use:

- scatterplot() (with kind="scatter"; the default)
- lineplot() (with kind="line")

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", data=tips)
# hue_order=["No", "Yes"],

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", style="smoker", data=tips);

In [None]:
sns.relplot(x="total_bill", y="tip", hue="smoker", style="time", data=tips);

#### Distributions
Distributions can be plotted with *.displot()*:

> _seaborn.displot(data=None, x=None, y=None, hue=None, row=None, col=None, weights=None, kind='hist', rug=False, log_scale=None, legend=True, palette=None, color=None, ...)_

It can plot histograms, kernel density estimates (KDE) or empirical cumulative distribution functions (ECDF).
A KDE curve and a rug plot (showing the individual observations) can also be added to the plot.

In [None]:
sns.displot(tips, x="size")

In [None]:
penguins = sns.load_dataset("penguins")

display(penguins.head())

In [None]:
sns.displot(data=penguins, x="flipper_length_mm", hue="species", multiple="stack")

In [None]:
sns.displot(data=penguins, x="flipper_length_mm", hue="species", multiple="stack", kind="kde")

In [None]:
sns.displot(data=penguins, kind='hist', x="flipper_length_mm", kde=True, rug=True)
# sns.displot(data=penguins, kind='kde', x="flipper_length_mm", rug=True)

In [None]:
sns.displot(data=penguins, x="flipper_length_mm", hue="species", col="species")

#### _Catplot_
The *catplot* function provides access to several axes-level functions that show the relationship between a numerical variable and one or more categorical variables.

> seaborn.catplot(data=None, x=None, y=None, hue=None, row=None, col=None, col_wrap=None, estimator=<function mean>, ci=95, n_boot=1000, units=None, kind='strip', ...)

The _kind_ parameter selects the underlying axes-level function to use.
There are categorical:

- scatterplots (*stripplot* with *kind*="strip", *swarmplot* with kind="swarm")
- distribution plots (*boxplot* with *kind*="box", *violinplot* with kind="violin", *boxenplot* with kind="boxen")
- estimate plots (*pointplot* with *kind*="point", *barplot* with kind="bar", *countplot* with kind="count")


In [None]:
exercise = sns.load_dataset("exercise")

In [None]:
g = sns.catplot(x="time", y="pulse", hue="kind", data=exercise)

In [None]:
g = sns.catplot(x="time", y="pulse", hue="kind", data=exercise, kind="violin")

In [None]:
g = sns.catplot(x="time", y="pulse", hue="kind", data=exercise, kind="point")
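
As a sketch of one of the distribution plots listed above, the cell below uses `kind="box"` on made-up data with three groups (the column names `group` and `score` are illustrative).

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Made-up scores for three groups with increasing mean
cat_df = pd.DataFrame({
    "group": np.repeat(["low", "medium", "high"], 20),
    "score": np.concatenate([rng.normal(loc=mean, size=20) for mean in (0, 1, 2)]),
})

# kind="box" delegates the drawing to boxplot()
cat_grid = sns.catplot(data=cat_df, x="group", y="score", kind="box")
```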

#### Regressions
_lmplot_ provides an easy way to fit regression models and plot them.

It is intended as a convenient interface to fit regression models across conditional subsets of a dataset.

> seaborn.lmplot(*, x=None, y=None, data=None, hue=None, col=None, row=None, palette=None, col_wrap=None, x_estimator=None, x_bins=None, x_ci='ci', scatter=True, fit_reg=True, ci=95, n_boot=1000, units=None, seed=None, order=1, logistic=False, lowess=False, robust=False, logx=False, x_partial=None, y_partial=None, truncate=True, x_jitter=None, y_jitter=None, scatter_kws=None, line_kws=None, size=None)

In [None]:
g = sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips)

In [None]:
g = sns.lmplot(x="size", y="total_bill", hue="day", col="day", data=tips, height=6, aspect=.4, x_jitter=.1)

#### _Heatmap_ and _clustermap_

> *Heatmap* is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions. [Wiki]

The color may vary in hue or intensity. If the rows and columns are reordered to reveal clusters in the data, the plot is called a *clustered heatmap*.

In [None]:
flights = sns.load_dataset("flights")
# Keyword arguments are required by recent pandas versions
flights = flights.pivot(index="month", columns="year", values="passengers")

sns.heatmap(flights)
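
The pivot step turns long-form data (one row per observation) into the matrix that the heatmap expects. The following sketch shows it on a tiny made-up table, so the reshaping is easy to follow; all the names are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Long-form data: one row per (row, col) cell
long_df = pd.DataFrame({
    "row": ["r1", "r1", "r2", "r2"],
    "col": ["c1", "c2", "c1", "c2"],
    "value": [1, 2, 3, 4],
})

# Wide-form matrix: rows x columns
wide_df = long_df.pivot(index="row", columns="col", values="value")
print(wide_df)

# annot=True writes the value inside each cell
heat_fig, heat_ax = plt.subplots()
sns.heatmap(wide_df, annot=True, ax=heat_ax)
```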

In [None]:
iris = sns.load_dataset("iris")

species = iris.pop("species")

g = sns.clustermap(iris)

## Exercise
Given the CSV file from the previous lecture, load it and perform some statistical analysis.

Load the CSV file.

In [None]:
import pandas as pd

file = "albumlist.csv"

df = pd.read_csv(file,
                 encoding="ISO-8859-15",
                 index_col="Number"
                )

display(df.head())

Plot the histogram of the number of albums in the chart for each year.

In [None]:
sns.histplot(data=df, x="Year")

Show the unique values of the _Genre_ column.

In [None]:
df["Genre"].unique()

Since there are a bit too many sub-genres for each row, keep just the first one.

In [None]:
df["MainGenre"] = df["Genre"].apply(lambda x: x.strip().split(",")[0].strip())

In [None]:
df["MainGenre"].unique()
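
As an alternative to `apply` with a lambda, the same extraction can be sketched with the vectorized `.str` accessor. The series below is made up for illustration, and the sketch assumes there are no missing values.

```python
import pandas as pd

# Made-up genre strings, similar in shape to the Genre column
genres = pd.Series(["Rock, Pop", "Jazz", " Funk / Soul, Blues "])

# Split on commas, keep the first element, strip the surrounding whitespace
main_genres = genres.str.split(",").str[0].str.strip()

print(main_genres.tolist())
```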

Set _Genre_ and _MainGenre_ as categorical variables.

In [None]:
df["Genre"] = df["Genre"].astype("category")
df["MainGenre"] = df["MainGenre"].astype("category")

display(df["MainGenre"].unique())

Now let's draw a _scatterplot_ of the position (_Number_, used as index) as a function of the *Year*.
As a third variable, we are also interested in the main genre of the album.

In [None]:
# Number is the index of the DataFrame, so we pass the index directly
sns.relplot(kind="scatter", data=df, x="Year", y=df.index, hue="MainGenre")

In [None]:
sns.displot(df, x="Year", y="Number", kind="hist")

Reset the index to restore the _Number_ column.

In [None]:
df.reset_index(drop=False, inplace=True)

And compute the correlation between the _Number_ and _Year_ columns.

In [None]:
corr = df[["Number", "Year"]].corr()

display(corr)

What about showing the correlation via a plot?

In [None]:
sns.heatmap(corr, center=0, vmin=-1, vmax=1, square=True, linewidths=.5)

The heatmap above shows the correlation between the two variables.

Yet, we don't like the colors and we know the matrix is symmetrical, so we would like to show only the lower triangle.

Let's try to improve the plot.

What about the color palette?

While _matplotlib_ has many palettes, not all of them are actually good for visualization.

_Seaborn_ tries to overcome this limitation by providing a nice interface for the *HSLuv* (formerly known as *HUSL*) color system, which works better with human vision as it minimizes the variation of intensity of colors.

In [None]:
# Built into matplotlib
sns.palplot(sns.color_palette("coolwarm", n_colors=9))

In [None]:
# HUSL system palette
sns.palplot(sns.diverging_palette(240, 10, n=9))

In [None]:
# Generate a custom diverging colormap
cmap = sns.diverging_palette(240, 10, as_cmap=True)
# cmap = sns.color_palette("coolwarm", as_cmap=True)

Let's generate a mask to filter out some values from the plot.

We are interested in displaying just the lower (or the upper) triangle of the matrix.

In [None]:
# Generate a mask for the upper triangle, excluding the diagonal
mask = np.triu(
    np.ones_like(corr, dtype=bool),
    k=1
)
display(mask)
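
Since our correlation matrix is only 2x2, the effect of `k=1` is easier to see on a slightly larger example. Here is a sketch on a made-up 3x3 matrix (the names `small` and `upper_mask` are illustrative).

```python
import numpy as np

small = np.arange(9).reshape(3, 3)

# True above the main diagonal, False on the diagonal and below it
upper_mask = np.triu(np.ones_like(small, dtype=bool), k=1)

print(upper_mask)
```

The cells where the mask is `True` are the ones the heatmap will hide.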

In [None]:
sns.heatmap(corr, cmap=cmap, mask=mask, center=0, vmin=-1, vmax=1, square=True, linewidths=.5)

At this point we want to group the _DataFrame_ by the main genre to compute the average position in the chart.

In [None]:
grouped_by_genre = df.groupby("MainGenre")
mean_position = grouped_by_genre["Number"].mean()
mean_position = mean_position.round(2)

display(mean_position)
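
To see what the grouping computes, here is the same operation sketched on a tiny made-up chart. The column names match the ones used above, but the values are invented. Sorting the result makes the following bar plot easier to read.

```python
import pandas as pd

# Made-up chart with five albums
chart = pd.DataFrame({
    "MainGenre": ["Rock", "Rock", "Jazz", "Pop", "Jazz"],
    "Number": [1, 4, 2, 5, 6],
})

# Average position per genre, from best (lowest) to worst
genre_mean = chart.groupby("MainGenre")["Number"].mean().sort_values()

print(genre_mean)
```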

In [None]:
sns.barplot(x=mean_position, y=mean_position.index)

---

## References
- [_IPython_](https://ipython.org/)
- [Matplotlib Usage Guide](https://matplotlib.org/tutorials/introductory/usage.html)
- [SciPy Lectures: Plotting](https://scipy-lectures.org/intro/matplotlib/index.html)
- [Seaborn tutorial](https://seaborn.pydata.org/tutorial.html)
- [Python Graph Gallery](https://python-graph-gallery.com/)
- [Matplotlib: gallery](https://matplotlib.org/gallery/index.html)
- [Seaborn: gallery](https://seaborn.pydata.org/examples/index.html)