Pandas

Introduction

Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools.

These structures heavily rely on NumPy and its arrays and they are suitable for:

  • Tabular data with heterogeneously-typed columns
  • Ordered and unordered time series data
  • Arbitrary matrix data

Among others, pandas can read data from Excel spreadsheets, CSV or TSV files, or even from SQL databases.

Pandas is conventionally imported as pd:

import pandas as pd

Data structures

The main data structures of Pandas are the Series and the DataFrame.

Series

The Series is just a one-dimensional labeled array of any data type.

Series behaves as a dict-like object. The axis labels are collectively referred to as the index.

The index is used to get and set values and to align the values when operations between Series are performed. In this case, the series do not need to have the same length.

Indexes need not be unique, but operations that require uniqueness will raise an exception.

Note that operations such as slicing will also slice the index.

Under the hood, a Series is an extended ndarray, which makes it a valid argument to most NumPy functions but also makes it fixed-size.

A Series may also have a name, which is useful, for instance, to identify it.
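
A minimal sketch of these behaviors (the labels, values and names below are made up for illustration):

s1 = pd.Series({'a': 1, 'b': 2, 'c': 3}, name="first")
s2 = pd.Series({'b': 10, 'c': 20, 'd': 30}, name="second")

# Dict-like access by label
print(s1['b'])

# Operations align on the index; labels present in only one Series give NaN
print(s1 + s2)

# Slicing a Series also slices its index (label slices include both endpoints)
print(s1['a':'b'])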

Series creation

You just need some data, and optionally an index, to create a Series.

If no index is provided at creation, one will be created having contiguous integer values starting from zero (known as RangeIndex).

Note that indexes can be longer than the actual data size and that missing values will be marked as NaN.

Data can be provided in many ways. Let’s see some with a few examples.

From a scalar
series = pd.Series(data=4)
print("Series with default index:")
display(series)
Series with default index:

0    4
dtype: int64
From a list or ndarray
series = pd.Series(list(range(5)), index=['a', 'b', 'c', 'd', 'e'])
display(series)
a    0
b    1
c    2
d    3
e    4
dtype: int64
import numpy as np

series = pd.Series(np.fromiter(range(5), dtype=int), index=['a', 'b', 'c', 'd', 'e'])
display(series)
a    0
b    1
c    2
d    3
e    4
dtype: int64
From a dictionary

The Series index and values will be the keys and values of the dictionary.

# Create a dictionary with zip
keys = ['a', 'b', 'c', 'd', 'e']
values = list(range(5))
dictionary = dict(zip(keys, values))
print("Dict:", dictionary)

series = pd.Series(dictionary)
print("Series:")
display(series)

Dict: {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
Series:

a    0
b    1
c    2
d    3
e    4
dtype: int64

Note that missing values are supported!

# Remove key 'c'
del dictionary["c"]
print("Dict:", dictionary)

# Pass the dictionary and the full keys as arguments
series = pd.Series(dictionary, index=keys)
print("Series:")
display(series)

Dict: {'a': 0, 'b': 1, 'd': 3, 'e': 4}
Series:

a    0.0
b    1.0
c    NaN
d    3.0
e    4.0
dtype: float64

DataFrame

The DataFrame is a two-dimensional labeled data structure that resembles a spreadsheet, a SQL table or a dictionary of Series.

The row labels are collectively referred to as the index and the columns can potentially be of different types.

Note that while rows cannot be efficiently added or removed (because of the underlying ndarray implementation), columns can be added or removed at any time.

Arithmetic operations between DataFrames align on both row and column labels.

DataFrame creation

Just like Series, DataFrames can be created in many ways.

One way is to use the DataFrame(data=None, index=None, columns=None, dtype=None, copy=False) constructor.

If the data argument is an iterable of iterables, each entry is used to fill a row.

The index (row labels) and columns (column labels) arguments can be used both for naming rows and columns and for filtering (when data is a dict of dicts). In the latter case, any data not matching the provided labels will be excluded from the final DataFrame.

From a list of Series, ndarrays, lists or tuples
rows = []
for i in range(4):
    rows.append(list(range(5)))
    np.random.shuffle(rows[i])
    
print("Rows:\n", rows)

dataframe = pd.DataFrame(data=rows)
print("DataFrame from list of lists:")
display(dataframe)

Rows:
 [[1, 3, 4, 0, 2], [1, 2, 0, 3, 4], [3, 2, 1, 0, 4], [1, 0, 3, 2, 4]]
DataFrame from list of lists:

   0  1  2  3  4
0  1  3  4  0  2
1  1  2  0  3  4
2  3  2  1  0  4
3  1  0  3  2  4

Let’s provide the index and column names.

dataframe = pd.DataFrame(rows,
                         index=["A", "B", "C", "D"],
                         columns=["one", "two", "three", "four", "five"]
                        )
# print("DataFrame from list of lists with indices and labels:")
display(dataframe)

   one  two  three  four  five
A    1    3      4     0     2
B    1    2      0     3     4
C    3    2      1     0     4
D    1    0      3     2     4
From a dict of Series or dicts

Let’s init a DataFrame from dictionaries

# Init two dictionaries
keys = ['a', 'b', 'c', 'd', 'e']
dictionary1 = dict(zip(keys, list(range(5))))
dictionary2 = dict(zip(keys, list(range(5, 10))))

# Init a column dictionary
columns = {
    "one": dictionary1,
    "two": dictionary2
}

# From a dictionary of dictionaries
dataframe = pd.DataFrame(columns)

What if we provide all the column names in a specific order?

# We can provide column order
dataframe = pd.DataFrame(data=columns, columns=["two", "one"])
display(dataframe)

   two  one
a    5    0
b    6    1
c    7    2
d    8    3
e    9    4

Let’s try to filter the rows and the columns.

# We can provide only the rows we're interested in.
dataframe = pd.DataFrame(data=columns, index=["a", "b"])
display(dataframe)

   one  two
a    0    5
b    1    6
# We can provide only the columns we're interested in.
dataframe = pd.DataFrame(data=columns, columns=["two"])
display(dataframe)

   two
a    5
b    6
c    7
d    8
e    9

What about a dictionary of Series?

for key, value in columns.items():
    columns[key] = pd.Series(value)
    
dataframe = pd.DataFrame(columns)
display(dataframe)

   one  two
a    0    5
b    1    6
c    2    7
d    3    8
e    4    9
From other sources

Pandas has connectors and helpers to read data from many sources.

For instance, you can read CSV or Excel files, SQL databases, and more.

Check the docs for more.
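
A minimal sketch of the reading helpers (the file and table names here are purely hypothetical):

# Hypothetical file and table names, for illustration only
csv_df = pd.read_csv("albums.csv")
excel_df = pd.read_excel("albums.xlsx", sheet_name="Sheet1")

# Reading from SQL requires a connection object (e.g. from sqlite3 or SQLAlchemy)
import sqlite3
with sqlite3.connect("albums.db") as conn:
    sql_df = pd.read_sql("SELECT * FROM albums", conn)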

DataFrame selection

How can we select data from DataFrame’s columns and/or rows?

Columns selection

You can access the columns of a DataFrame by using the indexing ([]) operator.

Column selection can be performed by label. If you provide:

  • a single label, it will return the Series of that column
  • multiple labels, it will return a sub-DataFrame with specified columns
# Reinit the dataframe
rows = []
for i in range(4):
    rows.append(list(range(5)))
    np.random.shuffle(rows[i])
    
dataframe = pd.DataFrame(rows,
                         index=["A", "B", "C", "D"],
                         columns=["one", "two", "three", "four", "five"]
                        )

display(dataframe)

   one  two  three  four  five
A    0    2      3     1     4
B    1    3      2     0     4
C    4    0      2     3     1
D    2    4      0     3     1

Get the column ‘one’

display(dataframe["one"])

print("What type is the result?", type(dataframe["one"]))
A    0
B    1
C    4
D    2
Name: one, dtype: int64


What type is the result? <class 'pandas.core.series.Series'>

Get the columns ‘one’ and ’two’

display(dataframe[["one", "two"]])

print("What type is the result?", type(dataframe[["one", "two"]]))

   one  two
A    0    2
B    1    3
C    4    0
D    2    4
What type is the result? <class 'pandas.core.frame.DataFrame'>

Rows selection

Row selection can be performed using the .loc[] property.

Again, row selection can be performed by (index) label. If you provide:

  • a single index, it will return the Series of that row
  • multiple indexes, it will return a sub-DataFrame with specified rows

You can also specify what columns to get as a second positional argument.

Single index label

Let’s get the ‘A’ row…

display(dataframe.loc["A"])
one      0
two      2
three    3
four     1
five     4
Name: A, dtype: int64

… and filter the columns

display("Indexing: 'A' row and get column 'two'", dataframe.loc["A", "two"])
display("Indexing: 'A' row and get columns ['one', 'two']", dataframe.loc["A", ["one", "two"]])
"Indexing: 'A' row and get column 'two'"

2

"Indexing: 'A' row and get columns ['one', 'two']"

one    0
two    2
Name: A, dtype: int64

List of index labels

Let’s get the ‘A’ and ‘B’ rows…

display(dataframe.loc[["A", "B"]])

   one  two  three  four  five
A    0    2      3     1     4
B    1    3      2     0     4

Slicing on index labels

DataFrames support slicing! Note that, unlike standard Python slicing, label-based slices include both endpoints.

Let’s fetch the rows from ‘A’ to ‘C’.

display(dataframe.loc["A":"C"])

   one  two  three  four  five
A    0    2      3     1     4
B    1    3      2     0     4
C    4    0      2     3     1

Boolean indexing

DataFrames also support boolean indexing.

Filtering with a boolean DataFrame

Let’s begin with filtering every entry independently.

# Get a boolean mask, just like we would using NumPy
indexing_df = dataframe > 1

print("Boolean DataFrame > 1")
display(indexing_df)

print("What type is the filtering index?", type(indexing_df))
display(indexing_df.dtypes)

Boolean DataFrame > 1

     one    two  three   four   five
A  False   True   True  False   True
B  False   True   True  False   True
C   True  False   True   True  False
D   True   True  False   True  False
What type is the filtering index? <class 'pandas.core.frame.DataFrame'>

one      bool
two      bool
three    bool
four     bool
five     bool
dtype: object
print("Boolean indexing with the DataFrame")
display(dataframe[indexing_df])

Boolean indexing with the DataFrame

   one  two  three  four  five
A  NaN  2.0    3.0   NaN   4.0
B  NaN  3.0    2.0   NaN   4.0
C  4.0  NaN    2.0   3.0   NaN
D  2.0  4.0    NaN   3.0   NaN
Filtering with a boolean Series

Let’s filter the DataFrame depending on the values of a given column.

# Get a boolean mask, just like we would using NumPy
indexing_series = dataframe["four"] > 1

print("Boolean Series > 1")
display(indexing_series)

print("What type is the filtering index?", type(indexing_series))
display(indexing_series.dtype)

Boolean Series > 1

A    False
B    False
C     True
D     True
Name: four, dtype: bool

What type is the filtering index? <class 'pandas.core.series.Series'>

dtype('bool')
print("Boolean indexing with the Series")
display(dataframe[indexing_series])

Boolean indexing with the Series

   one  two  three  four  five
C    4    0      2     3     1
D    2    4      0     3     1
Filter the rows

We can also filter the rows according to some condition. For instance:

# Let's get the rows where column 'two' is > 0
indexing_series = dataframe.loc[:, "two"] > 0
display(indexing_series)

print("What type is the filtering index?", type(indexing_series))
display(indexing_series.dtype)
A     True
B     True
C    False
D     True
Name: two, dtype: bool


What type is the filtering index? <class 'pandas.core.series.Series'>



dtype('bool')
print("Boolean indexing with the Series")
display(dataframe[indexing_series])

Boolean indexing with the Series

   one  two  three  four  five
A    0    2      3     1     4
B    1    3      2     0     4
D    2    4      0     3     1

Position

You can also access rows by integer index (position) using the .iloc attribute.

Integer positions are 0-based and can change when the DataFrame is reordered, so the label-based selection methods above should generally be preferred.

As a trivial example, if we want to get the second entry we can…

print("DataFrame at position 1:")
display(dataframe.iloc[1])

DataFrame at position 1:

one      1
two      3
three    2
four     0
five     4
Name: B, dtype: int64
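
The .iloc indexer also accepts position slices and lists of positions, for rows and columns at once. A quick sketch on the same dataframe:

# Rows at positions 1 and 2, columns at positions 0 and 2.
# Note that, unlike label slicing with .loc, the stop position is excluded.
display(dataframe.iloc[1:3, [0, 2]])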

DataFrame manipulation

After this overview of data creation and selection, let's introduce some data manipulation.

Show index and columns

How can we show index and column names?

DataFrames have the .index and .columns attributes that may also be set.

print("Index:", dataframe.index)
print("Columns:", dataframe.columns)

Index: Index(['A', 'B', 'C', 'D'], dtype='object')
Columns: Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
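
Setting them is a plain assignment. A quick sketch that renames the columns and then restores the original labels:

original_columns = dataframe.columns
dataframe.columns = ["c1", "c2", "c3", "c4", "c5"]
print(dataframe.columns)

# Restore the original labels so the following examples keep working
dataframe.columns = original_columns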

Get ndarray

What if we want to get the underlying ndarray?

We can use the .to_numpy() method

display(dataframe.to_numpy())
array([[0, 2, 3, 1, 4],
       [1, 3, 2, 0, 4],
       [4, 0, 2, 3, 1],
       [2, 4, 0, 3, 1]])

Arithmetic operations

Arithmetic operations on DataFrames (and Series, too) are as simple as writing the expression.

Remember that values will be aligned by indices, not position!

Let’s sum two columns (Series)

dataframe["two"] + dataframe["one"]
A    2
B    4
C    4
D    6
dtype: int64

Let’s sum two full DataFrames

dataframe + dataframe

   one  two  three  four  five
A    0    4      6     2     8
B    2    6      4     0     8
C    8    0      4     6     2
D    4    8      0     6     2
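
To see the alignment at work, here is a small sketch with a second, made-up frame that only partially overlaps with dataframe: every non-matching cell becomes NaN.

other = pd.DataFrame({"one": [10, 20], "six": [1, 2]}, index=["A", "Z"])

# Only the ('A', 'one') cell exists in both frames; every other cell is NaN
display(dataframe + other)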

Add and remove columns

We can add a column by simply assigning to a new label.

dataframe["six"] = dataframe["two"] + dataframe["one"]
display(dataframe)

   one  two  three  four  five  six
A    0    2      3     1     4    2
B    1    3      2     0     4    4
C    4    0      2     3     1    4
D    2    4      0     3     1    6

Column removal can be performed in two ways

# del dataframe["six"]

dataframe.drop('six', axis=1, inplace=True)

display(dataframe)

   one  two  three  four  five
A    0    2      3     1     4
B    1    3      2     0     4
C    4    0      2     3     1
D    2    4      0     3     1

Sort by an axis

DataFrames can be easily sorted by the index or by column labels or values.

Sort by labels

print("Sort by index labels")
display(dataframe.sort_index(axis=0, ascending=False, inplace=False))

print("Sort by column labels")
display(dataframe.sort_index(axis=1, ascending=False, inplace=False))

Sort by index labels

   one  two  three  four  five
D    2    4      0     3     1
C    4    0      2     3     1
B    1    3      2     0     4
A    0    2      3     1     4
Sort by column labels

   two  three  one  four  five
A    2      3    0     1     4
B    3      2    1     0     4
C    0      2    4     3     1
D    4      0    2     3     1

Sort by values

print("Sorted by 'one' column values")
display(dataframe.sort_values("one", ascending=False, inplace=False))
print("Sorted by ['two', 'one'] column values")
display(dataframe.sort_values(['two', 'one'], ascending=False, inplace=False))

Sorted by ‘one’ column values

   one  two  three  four  five
C    4    0      2     3     1
D    2    4      0     3     1
B    1    3      2     0     4
A    0    2      3     1     4
Sorted by ['two', 'one'] column values

   one  two  three  four  five
D    2    4      0     3     1
B    1    3      2     0     4
A    0    2      3     1     4
C    4    0      2     3     1

Reindex the DataFrame

Reindexing the DataFrame consists of changing the index.

Set another column as index

You can set, for instance, another column as the new index.

If you set the drop parameter to True, the columns used as the new index will be removed.

display(dataframe.set_index("five", drop=True))

      one  two  three  four
five
4       0    2      3     1
4       1    3      2     0
1       4    0      2     3
1       2    4      0     3

Change the index labels

You can also change the index labels using the .reindex method.

This operation will place NA/NaN in locations having no value in the previous index.

display(dataframe.reindex(["G", "B", "C", "D"]))

   one  two  three  four  five
G  NaN  NaN    NaN   NaN   NaN
B  1.0  3.0    2.0   0.0   4.0
C  4.0  0.0    2.0   3.0   1.0
D  2.0  4.0    0.0   3.0   1.0

Reset the index

You can also reset the index column by replacing it with a RangeIndex.

If you set the drop parameter to True, the old index will be removed from the DataFrame.

display(dataframe.reset_index(drop=False))

  index  one  two  three  four  five
0     A    0    2      3     1     4
1     B    1    3      2     0     4
2     C    4    0      2     3     1
3     D    2    4      0     3     1

Query the DataFrame

You can query the DataFrame using the .query method.

The parameter is a query string that will be evaluated as a boolean expression.

You can refer to:

  • Columns, even ones with spaces in the name, by surrounding them with backticks (`)
  • Variables in the environment (prefixed with @)

For instance:

dataframe.query("four > 3")

Empty DataFrame
Columns: [one, two, three, four, five]
Index: []
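
The query above matches no rows, since the maximum of column four is 3. Here is a small sketch of the other two features, using a made-up frame and variable:

threshold = 2
spaced_df = pd.DataFrame({"col one": [1, 2, 3], "col two": [4, 5, 6]})

# Backtick a column name containing spaces, prefix a Python variable with @
display(spaced_df.query("`col one` >= @threshold"))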

Merge DataFrames

DataFrames can be merged by concatenating them along some axis (index or columns), with optional set logic (union or intersection) on the other axis.

There are many functions and methods for this kind of operation:

  • concat: this function is the most general
  • .merge: this method performs a SQL-style join (see the sketch after this list)
  • .append: this method appends the provided rows to the end of the caller object
  • .join: this method joins the columns of two DataFrames
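
As a quick taste of the SQL-style approach, here is a minimal sketch of .merge on two small, made-up frames:

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [4, 5, 6]})

# Inner join on 'key': only 'b' and 'c' appear in both frames
display(left.merge(right, on="key", how="inner"))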

Let’s focus on the concat function.

Let's split the rows of a DataFrame and then recreate the original by concatenating the pieces.

pieces = [dataframe[:2], dataframe[2:4]]

for i, df in enumerate(pieces, start=1):
    print("Piece {}".format(i))
    display(df)
    
print("Concatenated pieces:")
display(pd.concat(pieces))

Piece 1

   one  two  three  four  five
A    0    2      3     1     4
B    1    3      2     0     4
Piece 2

   one  two  three  four  five
C    4    0      2     3     1
D    2    4      0     3     1
Concatenated pieces:

   one  two  three  four  five
A    0    2      3     1     4
B    1    3      2     0     4
C    4    0      2     3     1
D    2    4      0     3     1

Now let's split the columns of a DataFrame and recreate the original by concatenating the pieces along the column axis.

pieces = [dataframe[["one", "two"]], dataframe[["three", "four"]]]

for i, df in enumerate(pieces, start=1):
    print("Piece {}".format(i))
    display(df)
    
print("Concatenated pieces:")
display(pd.concat(pieces, axis=1))

Piece 1

   one  two
A    0    2
B    1    3
C    4    0
D    2    4
Piece 2

   three  four
A      3     1
B      2     0
C      2     3
D      0     3
Concatenated pieces:

   one  two  three  four
A    0    2      3     1
B    1    3      2     0
C    4    0      2     3
D    2    4      0     3

Group By

Group By implements the split-apply-combine pattern in pandas.

The full chain consists of the following steps:

  • Splitting the data into groups using some criteria
  • Applying some function to each group independently
    • Aggregation : compute some summary statistics for each group
    • Transformation : perform some group-specific computations and return a like-indexed object
    • Filtration : discard some groups, according to a group-wise computation that evaluates to True or False
  • Combining the results into a data structure

Introduction by example

Let’s consider the following DataFrame.

df = pd.DataFrame({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26.]
})

display(df)

   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0

Split

As a first step, we want to split the data and group our DataFrame using the column values.

groups = df.groupby(['Animal'])

for animal, group_df in groups:
    print(animal, "group")
    display(group_df)

Falcon group

   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
Parrot group

   Animal  Max Speed
2  Parrot       24.0
3  Parrot       26.0

Aggregate and combine

Apply an aggregation by computing the mean of each group; the combined result is a new DataFrame with one row per group.

groups.mean()

        Max Speed
Animal
Falcon      375.0
Parrot       25.0
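
The transformation and filtration steps mentioned above can be sketched on the same groups; the centering and the speed threshold are just illustrative choices:

# Transformation: a like-indexed result (each speed minus its group mean)
display(groups["Max Speed"].transform(lambda x: x - x.mean()))

# Filtration: keep only the groups whose mean speed exceeds 100
display(groups.filter(lambda g: g["Max Speed"].mean() > 100))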

Pivoting

Pivoting consists of reshaping a DataFrame, organizing it around some given index and/or column values.

This is useful, for instance, when we need some specific representation for our data.

Example

Let’s try to clarify the concept with an example.

Given another DataFrame.

df = pd.DataFrame({
    'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
    'baz': [1, 2, 3, 4, 5, 6],
    'zoo': ['x', 'y', 'z', 'q', 'w', 't']
})

df

   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t

We would like to pivot it and have:

  • the values of the “foo” column as the new index
  • the values of the “bar” column as the new column labels
  • the values of the “baz” column as the new values
df.pivot(index='foo', columns='bar', values='baz')

bar  A  B  C
foo
one  1  2  3
two  4  5  6

We can also use two or more columns as values.

df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])

    baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t

Native dtypes

While dtypes are inherited from NumPy, some new ones are introduced.

Specifically, the main ones are time series and categoricals.

Time Series

Time series types allow you to parse time information from various formats, manipulate time zones, resample, perform date and time arithmetic, and more.

Create a time index

One way to create a time index is via the date_range function.

date_range(start=None, end=None, periods=None, freq=None, tz=None, …):

  • start: first date
  • end: last date
  • periods: number of periods to generate
  • freq: frequency of periods
print("DateRange starting 1/1/2020 with 6M frequency and 10 periods")
date_range = pd.date_range('1/1/2020', periods=8, freq='6M')
display(date_range)

DateRange starting 1/1/2020 with 6M frequency and 8 periods

DatetimeIndex(['2020-01-31', '2020-07-31', '2021-01-31', '2021-07-31',
               '2022-01-31', '2022-07-31', '2023-01-31', '2023-07-31'],
              dtype='datetime64[ns]', freq='6M')

values = np.random.randint(0, 20, len(date_range))
time_series = pd.Series(values, index=date_range)
print("Time Series:")
display(time_series)

Time Series:

2020-01-31     3
2020-07-31     9
2021-01-31     0
2021-07-31    12
2022-01-31    18
2022-07-31    11
2023-01-31    10
2023-07-31     9
Freq: 6M, dtype: int64
# Create a DataFrame from the time series
time_dataframe = pd.DataFrame(time_series, columns=["value"])
display(time_dataframe)

            value
2020-01-31      3
2020-07-31      9
2021-01-31      0
2021-07-31     12
2022-01-31     18
2022-07-31     11
2023-01-31     10
2023-07-31      9

Resample the time series

You can resample a time series to change its frequency.

For instance, let's reduce the granularity to one year.

display(time_series.to_period("Y"))
2020     3
2020     9
2021     0
2021    12
2022    18
2022    11
2023    10
2023     9
Freq: A-DEC, dtype: int64
display(time_dataframe.to_period("Y"))

      value
2020      3
2020      9
2021      0
2021     12
2022     18
2022     11
2023     10
2023      9

Basic aggregation

What if we want to aggregate the values over the year?

print("Aggregate by sum to 1Y Frequency:")
display(time_dataframe.to_period("Y").groupby(axis=0, level=0).sum())

# display(time_series.resample('1Y').sum())

Aggregate by sum to 1Y Frequency:

      value
2020     12
2021     12
2022     29
2023     19
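
The commented line above hints at a more direct route: .resample groups by period and aggregates in one step. A sketch (frequency aliases may differ slightly across pandas versions):

# Resample the original time series by year and sum the values
# (recent pandas versions prefer the 'YE' alias over '1Y')
display(time_series.resample("1Y").sum())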

Assignment via time index

Since the dates form the index, we can locate (and assign) entries by date.

time_dataframe.loc["2020-01-31", "two"] = 10
display(time_dataframe)

            value   two
2020-01-31      3  10.0
2020-07-31      9   NaN
2021-01-31      0   NaN
2021-07-31     12   NaN
2022-01-31     18   NaN
2022-07-31     11   NaN
2023-01-31     10   NaN
2023-07-31      9   NaN

Categoricals

Categoricals correspond to categorical variables in statistics.

A categorical variable takes on a limited, and usually fixed, number of possible values, with some intrinsic order possible.

As an example, consider the following DataFrame.

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']
})

Transform a column to Category

We can define the categories by assigning the type “category” to the column.

df["grade"] = df["raw_grade"].astype("category")
display(df)

   id raw_grade grade
0   1         a     a
1   2         b     b
2   3         b     b
3   4         a     a
4   5         a     a
5   6         e     e

Rename the categories

You can rename the categories very easily by setting the .cat.categories attribute.

df["grade"].cat.categories = ["very good", "good", "very bad"]
display(df["grade"])
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad]

Add new categories

You can also add new categories.

df["grade"].cat.set_categories(["very good", "good", "medium", "bad", "very bad"], inplace=True)

Sort by

Sort the DataFrame according to the intrinsic category order.

display(df.sort_values(by="grade"))

   id raw_grade      grade
0   1         a  very good
3   4         a  very good
4   5         a  very good
1   2         b       good
2   3         b       good
5   6         e   very bad
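
The order used for sorting is the category order defined above. If you need an explicitly ordered categorical from the start, one option is to build it with pd.Categorical; a sketch with made-up values:

sizes = pd.Categorical(["small", "large", "medium", "small"],
                       categories=["small", "medium", "large"],
                       ordered=True)
size_series = pd.Series(sizes)

# Ordered categoricals support comparisons and min/max
print(size_series.min(), size_series.max())
display(size_series.sort_values())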

Exercise

Given the CSV file from the previous lecture, load it and perform some statistical analysis.

import matplotlib.pyplot as plt
%matplotlib inline
file = "albumlist.csv"

df = pd.read_csv(file,
                 encoding="ISO-8859-15",
                 index_col="Number"
                )

display(df.head())

        Year                                  Album          Artist      Genre                       Subgenre
Number
1       1967  Sgt. Pepper's Lonely Hearts Club Band     The Beatles       Rock  Rock & Roll, Psychedelic Rock
2       1966                             Pet Sounds  The Beach Boys       Rock     Pop Rock, Psychedelic Rock
3       1966                               Revolver     The Beatles       Rock     Psychedelic Rock, Pop Rock
4       1965                   Highway 61 Revisited       Bob Dylan       Rock          Folk Rock, Blues Rock
5       1965                            Rubber Soul     The Beatles  Rock, Pop                       Pop Rock

Which year appears most often in the chart?

display(df["Year"].value_counts().head())
1970    26
1972    24
1973    23
1969    22
1968    21
Name: Year, dtype: int64

What are the unique genres in the chart?

df["Genre"].unique()
array(['Rock', 'Rock, Pop', 'Funk / Soul', 'Rock, Blues', 'Jazz',
       'Jazz, Rock, Blues, Folk, World, & Country', 'Funk / Soul, Pop',
       'Blues', 'Pop', 'Rock, Folk, World, & Country',
       'Folk, World, & Country', 'Classical, Stage & Screen', 'Reggae',
       'Hip Hop', 'Jazz, Funk / Soul', 'Rock, Funk / Soul, Pop',
       'Electronic, Rock',
       'Jazz, Rock, Funk / Soul, Folk, World, & Country',
       'Jazz, Rock, Funk / Soul, Pop, Folk, World, & Country',
       'Funk / Soul, Stage & Screen',
       'Electronic, Rock, Funk / Soul, Stage & Screen',
       'Rock, Funk / Soul', 'Rock, Reggae', 'Jazz, Pop',
       'Funk / Soul, Folk, World, & Country', 'Latin, Funk / Soul',
       'Funk / Soul, Blues',
       'Reggae,ÊPop,ÊFolk, World, & Country,ÊStage & Screen',
       'Electronic,ÊStage & Screen', 'Jazz, Rock, Funk / Soul, Blues',
       'Jazz, Rock', 'Rock, Latin, Funk / Soul', 'Electronic, Rock, Pop',
       'Hip Hop, Rock, Funk / Soul', 'Electronic, Pop',
       'Rock, Blues, Pop', 'Electronic, Rock, Funk / Soul, Pop',
       'Rock, Funk / Soul, Folk, World, & Country', 'Rock,ÊBlues',
       'Rock, Pop, Folk, World, & Country', 'Rock, Latin',
       'Rock, Stage & Screen', 'Rock, Blues, Folk, World, & Country',
       'Electronic', 'Electronic, Funk / Soul, Pop',
       'Pop, Folk, World, & Country', 'Electronic, Hip Hop, Pop',
       'Blues, Folk, World, & Country',
       'Electronic, Hip Hop, Funk / Soul, Pop',
       'Rock, Funk / Soul, Blues, Pop, Folk, World, & Country',
       'Jazz, Pop, Folk, World, & Country', 'Jazz, Rock, Pop',
       'Hip Hop, Funk / Soul', 'Hip Hop, Rock',
       'Electronic, Hip Hop, Funk / Soul',
       'Funk / Soul,ÊFolk, World, & Country',
       'Electronic, Hip Hop, Reggae, Pop', 'Electronic, Reggae',
       'Electronic, Funk / Soul', 'Rock, Funk / Soul, Blues', 'Rock,ÊPop',
       'Electronic, Rock, Funk / Soul, Blues, Pop', 'Rock, Reggae, Latin'],
      dtype=object)

Let’s keep only the main (first) genre of each album.

df["Genre"] = df["Genre"].apply(lambda x: x.strip().split(",")[0].strip())
df["Genre"].unique()
array(['Rock', 'Funk / Soul', 'Jazz', 'Blues', 'Pop', 'Folk', 'Classical',
       'Reggae', 'Hip Hop', 'Electronic', 'Latin'], dtype=object)

Set the genre as a category

df["Genre"] = df["Genre"].astype("category")

Any correlation between year and number?

df.reset_index(drop=False, inplace=True)
df[["Number", "Year"]].corr()

NumberYear
Number1.0000000.325667
Year0.3256671.000000

Which genre has the best average chart position?

grouped_by_genre = df.groupby("Genre")

display(grouped_by_genre)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe911cc4d90>
mean_position = grouped_by_genre["Number"].mean()

mean_position
Genre
Blues          243.222222
Classical       45.000000
Electronic     307.133333
Folk           262.230769
Funk / Soul    208.078431
Hip Hop        301.411765
Jazz           187.421053
Latin          107.000000
Pop            145.000000
Reggae         191.857143
Rock           250.393082
Name: Number, dtype: float64

What are the bands that have the most albums in the chart?

# As a first step, let's filter the columns
album_df = df[["Artist", "Album"]]

# Then, group by the artist
bands = album_df.groupby("Artist", axis=0)

# Now we count how many occurrences there are for each and sort by that value
sorted_bands = bands.count().sort_values("Album", ascending=False)

display(sorted_bands.head())

                    Album
Artist
The Beatles            10
Bob Dylan              10
The Rolling Stones     10
Bruce Springsteen       8
The Who                 7