Pandas
Introduction
Pandas is a python library that provides high-performance, easy-to-use data structures and data analysis tools.
These structures heavily rely on NumPy and its arrays and they are suitable for:
- Tabular data with heterogeneously-typed columns
- Ordered and unordered time series data
- Arbitrary matrix data
Among others, pandas can read data from Excel spreadsheets, CSV or TSV files of even from SQL databases.
Pandas is conventionally imported as pd
|
|
Data structures
The main data structures of Pandas are the Series and the DataFrame.
Series
The Series is just a one-dimensional labeled array of any data type.
Series behaves as a dict-like object. The axis labels are collectively referred to as the index.
The index is used to get and set values and to align the values when operations between Series are performed. In this case, the series do not need to have the same length.
Indexes may not be unique, but operations that require unicity will raise an exception.
Note that operations such as slicing will also slice the index.
Under the hood, Series are an extended ndarray, which makes it a valid argument to most NumPy functions, but also makes it fixed size.
Series may have a name. This is useful, for instance, to identify them.
Series creation
You just need some data or the shape of the Series to create one.
If no index is provided at creation, one will be created having contiguous integer values starting from zero (known as RangeIndex).
Note that indexes can be longer than the actual data size and that missing values will be marked as NaN.
Data can be provided in many ways. Let’s see some with a few examples.
From a scalar
|
|
Series with default index:
0 4
dtype: int64
From a list or ndarray
|
|
a 0
b 1
c 2
d 3
e 4
dtype: int64
|
|
a 0
b 1
c 2
d 3
e 4
dtype: int64
From a dictionary
Series index and values will be the key-values pair of the dictionary.
|
|
Dict: {‘a’: 0, ‘b’: 1, ‘c’: 2, ’d’: 3, ’e’: 4} Series:
a 0
b 1
c 2
d 3
e 4
dtype: int64
Note that missing values are supported!
|
|
Dict: {‘a’: 0, ‘b’: 1, ’d’: 3, ’e’: 4} Series:
a 0.0
b 1.0
c NaN
d 3.0
e 4.0
dtype: float64
DataFrame
The DataFrame is a two-dimensional labeled data structure that resembles a spreadsheet, a SQL table or a dictionary of Series.
The row labels are collectively referred to as the index and the columns can potentially be of different types.
Note that while rows cannot be added or removed (because of the ndarray implementation), columns can be added\removed any time.
Arithmetic operations between DataFrames align on both row and column labels.
DataFrame creation
Just like Series, DataFrames can be created in many ways.
One way is to use the DataFrame(data=None, index=None, columns=None, dtype=None, copy=False) class.
If data argument is an iterable of iterables, each entry is used to fill the rows.
The index (row labels) or columns (column labels) arguments can be used for both naming the rows and columns and filtering (if data is a dict of dicts). In the latter case, the final DataFrame will not include all data not matching.
From a list of Series, ndarrays, lists or tuples
|
|
Rows: [[1, 3, 4, 0, 2], [1, 2, 0, 3, 4], [3, 2, 1, 0, 4], [1, 0, 3, 2, 4]] DataFrame from list of lists:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 1 | 3 | 4 | 0 | 2 |
1 | 1 | 2 | 0 | 3 | 4 |
2 | 3 | 2 | 1 | 0 | 4 |
3 | 1 | 0 | 3 | 2 | 4 |
Let’s provide the index and column names.
|
|
one | two | three | four | five | |
---|---|---|---|---|---|
A | 1 | 3 | 4 | 0 | 2 |
B | 1 | 2 | 0 | 3 | 4 |
C | 3 | 2 | 1 | 0 | 4 |
D | 1 | 0 | 3 | 2 | 4 |
From a dict of Series or dicts
Let’s init a DataFrame from dictionaries
|
|
What if we provide all the column names with some order?
|
|
two | one | |
---|---|---|
a | 5 | 0 |
b | 6 | 1 |
c | 7 | 2 |
d | 8 | 3 |
e | 9 | 4 |
Let’s try to filter the rows and the columns.
|
|
one | two | |
---|---|---|
a | 0 | 5 |
b | 1 | 6 |
|
|
two | |
---|---|
a | 5 |
b | 6 |
c | 7 |
d | 8 |
e | 9 |
What about a dictionary of Series?
|
|
one | two | |
---|---|---|
a | 0 | 5 |
b | 1 | 6 |
c | 2 | 7 |
d | 3 | 8 |
e | 4 | 9 |
From other sources
Pandas has connectors and helpers to read data from many sources.
For instance, you can read CSV or Excel files, SQL databases, and more.
Check the docs for more.
DataFrame selection
How can we select data from DataFrame’s columns and/or rows?
Columns selection
You can access the columns of a DataFrame by using the indexing ([]) operator.
Column selection can be performed by label. If you provide:
- a single label, it will return the Series of that column
- multiple labels, it will return a sub-DataFrame with specified columns
|
|
one | two | three | four | five | |
---|---|---|---|---|---|
A | 0 | 2 | 3 | 1 | 4 |
B | 1 | 3 | 2 | 0 | 4 |
C | 4 | 0 | 2 | 3 | 1 |
D | 2 | 4 | 0 | 3 | 1 |
Get the column ‘one’
|
|
A 0
B 1
C 4
D 2
Name: one, dtype: int64
What type is the result? <class 'pandas.core.series.Series'>
Get the columns ‘one’ and ’two’
|
|
one | two | |
---|---|---|
A | 0 | 2 |
B | 1 | 3 |
C | 4 | 0 |
D | 2 | 4 |
What type is the result? <class 'pandas.core.frame.DataFrame'>
Rows selection
Row selection can be performed using the .loc[] property.
Again, row selection can be performed by (index) label. If you provide:
- a single index, it will return the Series of that row
- multiple indexes, it will return a sub-DataFrame with specified rows
You can also specify what columns to get as a second positional argument.
Single index label
Let’s get the ‘A’ row…
|
|
one 0
two 2
three 3
four 1
five 4
Name: A, dtype: int64
… and filter the columns
|
|
"Indexing: 'A' row and get column 'two'"
2
"Indexing: 'A' row and get columns ['one', 'two']"
one 0
two 2
Name: A, dtype: int64
List of index labels
Let’s get the ‘A’ and ‘B’ rows…
|
|
one | two | three | four | five | |
---|---|---|---|---|---|
A | 0 | 2 | 3 | 1 | 4 |
B | 1 | 3 | 2 | 0 | 4 |
Slicing on index labels
DataFrames support slicing!
Let’s fetch the rows from ‘A’ to ‘C’.
|
|
one | two | three | four | five | |
---|---|---|---|---|---|
A | 0 | 2 | 3 | 1 | 4 |
B | 1 | 3 | 2 | 0 | 4 |
C | 4 | 0 | 2 | 3 | 1 |
Boolean indexing
DataFrames also support boolean indexing.
Filtering with a boolean DataFrame
Let’s begin with filtering every entry independently.
|
|
Boolean DataFrame > 1
one | two | three | four | five | |
---|---|---|---|---|---|
A | False | True | True | False | True |
B | False | True | True | False | True |
C | True | False | True | True | False |
D | True | True | False | True | False |
What type is the filtering index? <class 'pandas.core.frame.DataFrame'>
one bool
two bool
three bool
four bool
five bool
dtype: object
|
|
Boolean indexing with the DataFrame
one | two | three | four | five | |
---|---|---|---|---|---|
A | NaN | 2.0 | 3.0 | NaN | 4.0 |
B | NaN | 3.0 | 2.0 | NaN | 4.0 |
C | 4.0 | NaN | 2.0 | 3.0 | NaN |
D | 2.0 | 4.0 | NaN | 3.0 | NaN |
Filtering with a boolean Series
Let’s filter the DataFrame depending on the values of a given column.
|
|
Boolean Series > 1
A False
B False
C True
D True
Name: four, dtype: bool
What type is the filtering index? <class 'pandas.core.series.Series'>
dtype('bool')
|
|
Boolean indexing with the Series
one | two | three | four | five | |
---|---|---|---|---|---|
C | 4 | 0 | 2 | 3 | 1 |
D | 2 | 4 | 0 | 3 | 1 |
Filter the rows
We can also filter the rows according to some condition. For instance:
|
|
A True
B True
C False
D True
Name: two, dtype: bool
What type is the filtering index? <class 'pandas.core.series.Series'>
dtype('bool')
|
|
Boolean indexing with the Series
one | two | three | four | five | |
---|---|---|---|---|---|
A | 0 | 2 | 3 | 1 | 4 |
B | 1 | 3 | 2 | 0 | 4 |
D | 2 | 4 | 0 | 3 | 1 |
Position
You can also access rows by integer index (position) using the .iloc attribute.
Integer indexes are 0-based and may be tricky. Other selection methods should be preferred.
As a trivial example, if we want to get the second entry we can…
|
|
DataFrame at position 1:
one 1
two 3
three 2
four 0
five 4
Name: B, dtype: int64
DataFrame manupulation
After an overview on data creation and selection, let’s introduce some data manipulation.
Show index and columns
How can we show index and column names?
DataFrames have the .index and .columns attributes that may also be set.
|
|
Index: Index([‘A’, ‘B’, ‘C’, ‘D’], dtype=‘object’) Columns: Index([‘one’, ’two’, ’three’, ‘four’, ‘five’], dtype=‘object’)
Get ndarray
What if we want to get the underlying ndarray?
We can use the .to_numpy() method
|
|
array([[0, 2, 3, 1, 4],
[1, 3, 2, 0, 4],
[4, 0, 2, 3, 1],
[2, 4, 0, 3, 1]])
Arithmetic operations
Arithmetic operations on DataFrames (and Series also) are as simple as writing the equation.
Remember that values will be aligned by indices, not position!
Let’s sum two columns (Series)
|
|
A 2
B 4
C 4
D 6
dtype: int64
Let’s sum two full DataFrames
|
|
one | two | three | four | five | |
---|---|---|---|---|---|
A | 0 | 4 | 6 | 2 | 8 |
B | 2 | 6 | 4 | 0 | 8 |
C | 8 | 0 | 4 | 6 | 2 |
D | 4 | 8 | 0 | 6 | 2 |
Add and remove columns
We can add columns by just defining a new one.
|
|
one | two | three | four | five | six | |
---|---|---|---|---|---|---|
A | 0 | 2 | 3 | 1 | 4 | 2 |
B | 1 | 3 | 2 | 0 | 4 | 4 |
C | 4 | 0 | 2 | 3 | 1 | 4 |
D | 2 | 4 | 0 | 3 | 1 | 6 |
Column removal can be performed in two ways
|
|
one | two | three | four | five | |
---|---|---|---|---|---|
A | 0 | 2 | 3 | 1 | 4 |
B | 1 | 3 | 2 | 0 | 4 |
C | 4 | 0 | 2 | 3 | 1 |
D | 2 | 4 | 0 | 3 | 1 |
Sort by an axis
DataFrames can be easily sorted by the index or by column labels or values.
Sort by labels
|
|
Sort by index labels
one | two | three | four | five | |
---|---|---|---|---|---|
D | 2 | 4 | 0 | 3 | 1 |
C | 4 | 0 | 2 | 3 | 1 |
B | 1 | 3 | 2 | 0 | 4 |
A | 0 | 2 | 3 | 1 | 4 |
Sort by column labels
two | three | one | four | five | |
---|---|---|---|---|---|
A | 2 | 3 | 0 | 1 | 4 |
B | 3 | 2 | 1 | 0 | 4 |
C | 0 | 2 | 4 | 3 | 1 |
D | 4 | 0 | 2 | 3 | 1 |
Sort by values
|
|
Sorted by ‘one’ column values
one | two | three | four | five | |
---|---|---|---|---|---|
C | 4 | 0 | 2 | 3 | 1 |
D | 2 | 4 | 0 | 3 | 1 |
B | 1 | 3 | 2 | 0 | 4 |
A | 0 | 2 | 3 | 1 | 4 |
Sorted by ['two', 'one'] column values
one | two | three | four | five | |
---|---|---|---|---|---|
D | 2 | 4 | 0 | 3 | 1 |
B | 1 | 3 | 2 | 0 | 4 |
A | 0 | 2 | 3 | 1 | 4 |
C | 4 | 0 | 2 | 3 | 1 |
Reindex the DataFrame
Reindexing the DataFrame consists of changing the index.
Set another column as index
You can set, for instance, another column as the new index.
If you set the drop parameter to True, the columns used as the new index will be removed.
|
|
one | two | three | four | |
---|---|---|---|---|
five | ||||
4 | 0 | 2 | 3 | 1 |
4 | 1 | 3 | 2 | 0 |
1 | 4 | 0 | 2 | 3 |
1 | 2 | 4 | 0 | 3 |
Change the index labels
You can also change the index labels using the .reindex method.
This operation will place NA/NaN in locations having no value in the previous index.
|
|
one | two | three | four | five | |
---|---|---|---|---|---|
G | NaN | NaN | NaN | NaN | NaN |
B | 1.0 | 3.0 | 2.0 | 0.0 | 4.0 |
C | 4.0 | 0.0 | 2.0 | 3.0 | 1.0 |
D | 2.0 | 4.0 | 0.0 | 3.0 | 1.0 |
Reset the index
You can also reset the index column by replacing it with a RangeIndex.
If you set the drop parameter to True, old index will be removed from the DataFrame.
|
|
index | one | two | three | four | five | |
---|---|---|---|---|---|---|
0 | A | 0 | 2 | 3 | 1 | 4 |
1 | B | 1 | 3 | 2 | 0 | 4 |
2 | C | 4 | 0 | 2 | 3 | 1 |
3 | D | 2 | 4 | 0 | 3 | 1 |
Query the DataFrame
You can query the DataFrame using the .query method.
The parameter is a query string that will be evaluated as a boolean expression.
You can refer:
- Columns, even with spaces by surrounding them with backticks (`)
- Environment variables (prefixed with @)
For instance:
|
|
one | two | three | four | five |
---|
Merge DataFrames
DataFrames can be merged by concatenating them along some axis (index or columns), performing union or intersection of them.
There are many functions and methods for this kind of operation:
- concat: this function is the most general
- .merge: this method performs a SQL-style join
- .append: this method appends the provided rows to the end of the caller object
- .join: this method joins the columns of two DataFrames
Let’s focus on the concat function.
After splitting the rows of a DataFrame, recreate the original by concatenating them.
|
|
Piece 1
one | two | three | four | five | |
---|---|---|---|---|---|
A | 0 | 2 | 3 | 1 | 4 |
B | 1 | 3 | 2 | 0 | 4 |
Piece 2
one | two | three | four | five | |
---|---|---|---|---|---|
C | 4 | 0 | 2 | 3 | 1 |
D | 2 | 4 | 0 | 3 | 1 |
Concatenated pieces:
one | two | three | four | five | |
---|---|---|---|---|---|
A | 0 | 2 | 3 | 1 | 4 |
B | 1 | 3 | 2 | 0 | 4 |
C | 4 | 0 | 2 | 3 | 1 |
D | 2 | 4 | 0 | 3 | 1 |
After splitting the columns of a DataFrame, recreate the original by concatenating them.
|
|
Piece 1
one | two | |
---|---|---|
A | 0 | 2 |
B | 1 | 3 |
C | 4 | 0 |
D | 2 | 4 |
Piece 2
three | four | |
---|---|---|
A | 3 | 1 |
B | 2 | 0 |
C | 2 | 3 |
D | 0 | 3 |
Concatenated pieces:
one | two | three | four | |
---|---|---|---|---|
A | 0 | 2 | 3 | 1 |
B | 1 | 3 | 2 | 0 |
C | 4 | 0 | 2 | 3 |
D | 2 | 4 | 0 | 3 |
Group By
Group By is meant to split-apply-combine data in pandas.
The full chain consists of the following steps:
- Splitting the data into groups using some criteria
- Applying some function to each group independently
- Aggregation : compute some summary statistics for each group
- Transformation : perform some group-specific computations and return a like-indexed object
- Filtration : discard some groups, according to a group-wise computation that evaluates True or False
- Combining the results into a data structure
Introduction by example
Let’s consider the following DataFrame.
|
|
Animal | Max Speed | |
---|---|---|
0 | Falcon | 380.0 |
1 | Falcon | 370.0 |
2 | Parrot | 24.0 |
3 | Parrot | 26.0 |
Split
As a first step, we want to split the data and group our DataFrame using the column values.
|
|
Falcon group
Animal | Max Speed | |
---|---|---|
0 | Falcon | 380.0 |
1 | Falcon | 370.0 |
Parrot group
Animal | Max Speed | |
---|---|---|
2 | Parrot | 24.0 |
3 | Parrot | 26.0 |
Aggregate and combine
Apply an aggregation by computing the mean value and get the new DataFrame as the concatenation of the results.
|
|
Max Speed | |
---|---|
Animal | |
Falcon | 375.0 |
Parrot | 25.0 |
Pivoting
Pivoting consists in reshaping a DataFrame and organizing it using some given index and/or column values.
This is useful, for instance, when we need some specific representation for our data.
Example
Let’s try to clarify the concept with an example.
Given another DataFrame.
|
|
foo | bar | baz | zoo | |
---|---|---|---|---|
0 | one | A | 1 | x |
1 | one | B | 2 | y |
2 | one | C | 3 | z |
3 | two | A | 4 | q |
4 | two | B | 5 | w |
5 | two | C | 6 | t |
We would like to pivot it and have:
- the values of the “foo” column as new index
- the values of the “bar” column as new columns labels
- the values of “baz” values as new values
|
|
bar | A | B | C |
---|---|---|---|
foo | |||
one | 1 | 2 | 3 |
two | 4 | 5 | 6 |
We can also use two or mode columns as values.
|
|
baz | zoo | |||||
---|---|---|---|---|---|---|
bar | A | B | C | A | B | C |
foo | ||||||
one | 1 | 2 | 3 | x | y | z |
two | 4 | 5 | 6 | q | w | t |
Native dtypes
While dtypes are inherited from NumPy, new some are introduced.
Specifically, the main ones are:
Time Series
Time series types allow to parse time information from various formats, manipulate them and time-zones, resampling, perform date and time arithmetic, etc…
Create a time index
One way to create a time index is via the date_range function.
date_range(start=None, end=None, periods=None, freq=None, tz=None, …):
- start: first date
- end: last date
- periods: number of periods to generate
- freq: frequency of periods
|
|
DateRange starting 1/1/2020 with 6M frequency and 10 periods
DatetimeIndex([‘2020-01-31’, ‘2020-07-31’, ‘2021-01-31’, ‘2021-07-31’, ‘2022-01-31’, ‘2022-07-31’, ‘2023-01-31’, ‘2023-07-31’], dtype=‘datetime64[ns]’, freq=‘6M’)
|
|
Time Series:
2020-01-31 3
2020-07-31 9
2021-01-31 0
2021-07-31 12
2022-01-31 18
2022-07-31 11
2023-01-31 10
2023-07-31 9
Freq: 6M, dtype: int64
|
|
value | |
---|---|
2020-01-31 | 3 |
2020-07-31 | 9 |
2021-01-31 | 0 |
2021-07-31 | 12 |
2022-01-31 | 18 |
2022-07-31 | 11 |
2023-01-31 | 10 |
2023-07-31 | 9 |
Resample the time series
You can resample a time series to change the frequency.
For instance, reduce the accuracy to 1Y.
|
|
2020 3
2020 9
2021 0
2021 12
2022 18
2022 11
2023 10
2023 9
Freq: A-DEC, dtype: int64
|
|
value | |
---|---|
2020 | 3 |
2020 | 9 |
2021 | 0 |
2021 | 12 |
2022 | 18 |
2022 | 11 |
2023 | 10 |
2023 | 9 |
Basic aggregation
What if we want to aggregate the values over the year?
|
|
Aggregate by sum to 1Y Frequency:
value | |
---|---|
2020 | 12 |
2021 | 12 |
2022 | 29 |
2023 | 19 |
Assignation via time index
Being the Time Series the index, we can locate entries via dates.
|
|
value | two | |
---|---|---|
2020-01-31 | 3 | 10.0 |
2020-07-31 | 9 | NaN |
2021-01-31 | 0 | NaN |
2021-07-31 | 12 | NaN |
2022-01-31 | 18 | NaN |
2022-07-31 | 11 | NaN |
2023-01-31 | 10 | NaN |
2023-07-31 | 9 | NaN |
Categoricals
Categoricals correspond to categorical variables in statistics.
A categorical variable takes on a limited, and usually fixed, number of possible values, with some intrinsic order possible.
As an example, consider the following DataFrame.
|
|
Transform a column to Category
We can define the categories by assigning the type “category” to the column.
|
|
id | raw_grade | grade | |
---|---|---|---|
0 | 1 | a | a |
1 | 2 | b | b |
2 | 3 | b | b |
3 | 4 | a | a |
4 | 5 | a | a |
5 | 6 | e | e |
Rename the categories
You can rename the categories very easyly by setting the .cat.categories attribute.
|
|
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad]
Add new categories
You can also add new categories.
|
|
Sort by
Sort the DataFrame with the intrinsic category order
|
|
id | raw_grade | grade | |
---|---|---|---|
0 | 1 | a | very good |
3 | 4 | a | very good |
4 | 5 | a | very good |
1 | 2 | b | good |
2 | 3 | b | good |
5 | 6 | e | very bad |
Exercise
Given the CSV file from the previous lecture, load it and perform some statistical analysis.
|
|
|
|
Year | Album | Artist | Genre | Subgenre | |
---|---|---|---|---|---|
Number | |||||
1 | 1967 | Sgt. Pepper's Lonely Hearts Club Band | The Beatles | Rock | Rock & Roll, Psychedelic Rock |
2 | 1966 | Pet Sounds | The Beach Boys | Rock | Pop Rock, Psychedelic Rock |
3 | 1966 | Revolver | The Beatles | Rock | Psychedelic Rock, Pop Rock |
4 | 1965 | Highway 61 Revisited | Bob Dylan | Rock | Folk Rock, Blues Rock |
5 | 1965 | Rubber Soul | The Beatles | Rock, Pop | Pop Rock |
What is the most present year in the chart?
|
|
1970 26
1972 24
1973 23
1969 22
1968 21
Name: Year, dtype: int64
What are the unique genres in the chart?
|
|
array(['Rock', 'Rock, Pop', 'Funk / Soul', 'Rock, Blues', 'Jazz',
'Jazz, Rock, Blues, Folk, World, & Country', 'Funk / Soul, Pop',
'Blues', 'Pop', 'Rock, Folk, World, & Country',
'Folk, World, & Country', 'Classical, Stage & Screen', 'Reggae',
'Hip Hop', 'Jazz, Funk / Soul', 'Rock, Funk / Soul, Pop',
'Electronic, Rock',
'Jazz, Rock, Funk / Soul, Folk, World, & Country',
'Jazz, Rock, Funk / Soul, Pop, Folk, World, & Country',
'Funk / Soul, Stage & Screen',
'Electronic, Rock, Funk / Soul, Stage & Screen',
'Rock, Funk / Soul', 'Rock, Reggae', 'Jazz, Pop',
'Funk / Soul, Folk, World, & Country', 'Latin, Funk / Soul',
'Funk / Soul, Blues',
'Reggae,ÊPop,ÊFolk, World, & Country,ÊStage & Screen',
'Electronic,ÊStage & Screen', 'Jazz, Rock, Funk / Soul, Blues',
'Jazz, Rock', 'Rock, Latin, Funk / Soul', 'Electronic, Rock, Pop',
'Hip Hop, Rock, Funk / Soul', 'Electronic, Pop',
'Rock, Blues, Pop', 'Electronic, Rock, Funk / Soul, Pop',
'Rock, Funk / Soul, Folk, World, & Country', 'Rock,ÊBlues',
'Rock, Pop, Folk, World, & Country', 'Rock, Latin',
'Rock, Stage & Screen', 'Rock, Blues, Folk, World, & Country',
'Electronic', 'Electronic, Funk / Soul, Pop',
'Pop, Folk, World, & Country', 'Electronic, Hip Hop, Pop',
'Blues, Folk, World, & Country',
'Electronic, Hip Hop, Funk / Soul, Pop',
'Rock, Funk / Soul, Blues, Pop, Folk, World, & Country',
'Jazz, Pop, Folk, World, & Country', 'Jazz, Rock, Pop',
'Hip Hop, Funk / Soul', 'Hip Hop, Rock',
'Electronic, Hip Hop, Funk / Soul',
'Funk / Soul,ÊFolk, World, & Country',
'Electronic, Hip Hop, Reggae, Pop', 'Electronic, Reggae',
'Electronic, Funk / Soul', 'Rock, Funk / Soul, Blues', 'Rock,ÊPop',
'Electronic, Rock, Funk / Soul, Blues, Pop', 'Rock, Reggae, Latin'],
dtype=object)
Let’s filter the genres by main
|
|
|
|
array(['Rock', 'Funk / Soul', 'Jazz', 'Blues', 'Pop', 'Folk', 'Classical',
'Reggae', 'Hip Hop', 'Electronic', 'Latin'], dtype=object)
Set the genre as a category
|
|
Any correlation between year and number?
|
|
|
|
Number | Year | |
---|---|---|
Number | 1.000000 | 0.325667 |
Year | 0.325667 | 1.000000 |
What is the best rated genre?
|
|
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe911cc4d90>
|
|
Genre
Blues 243.222222
Classical 45.000000
Electronic 307.133333
Folk 262.230769
Funk / Soul 208.078431
Hip Hop 301.411765
Jazz 187.421053
Latin 107.000000
Pop 145.000000
Reggae 191.857143
Rock 250.393082
Name: Number, dtype: float64
What are the bands that have the most albums in the chart?
|
|
Album | |
---|---|
Artist | |
The Beatles | 10 |
Bob Dylan | 10 |
The Rolling Stones | 10 |
Bruce Springsteen | 8 |
The Who | 7 |