Construct a SQLite3 Database of Spotify Songs in 3rd normal form (3NF)

Database Normalization

Database normalization is performed for two main reasons - reduce redundancy and prevent inconsistencies on insert/update/delete.

First Normal Form (1NF)

  1. Table has a primary key (unique, not null column that identifies each row)
  2. No repeating groups of columns
  3. Each cell contains a single value

Second Normal Form (2NF)

  1. All columns in each row depend fully on candidate keys.

Third Normal Form (3NF)

  1. No transitive dependencies between non-candidate columns.

Construct a SQLite Database of Spotify Songs Data in 3NF

The Spotify Songs dataset contains 32833 observations and 23 columns, including information of playlists, artists, albums and scores of each track.

First, let's get the access to the dataset

import pandas as pd
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv"
df = pd.read_csv(url)

Obviously, this dataset has a primary key: track_id and every cell only has a single value. The first normal form has already been satisfied.

However, the track_album_name only depends on track_album_id and play_list_name only depends on playlist_id, which violates the 2NF. To satisfy the second normal form, let's create seperate tables containing variables fully dependent on albums and playlist.

album = df.loc[:,["track_album_id","track_album_name","track_album_release_date"]]
album = album.drop_duplicates(subset='track_album_id')
playlist = df.loc[:,["playlist_id","playlist_name","playlist_genre","playlist_subgenre"]]
playlist = playlist.drop_duplicates(subset='playlist_id')

There are many different ways to connect python and sqlite3. Here, I will use my favorite method: Ipython sql magic

%load_ext sql

The following command create a new sqlite3 database called spotify.db.

%sql sqlite:///data/spotify.db

Next, we convert a pandas DataFrame to a table

%%sql 

PERSIST track
PERSIST playlist
PERSIST album
PERSIST scores

Example Query of retrieving data

One of the most important manipulation of tables is join. There are many different ways to join tables. Here, I will introduce three different methods:

  • INNER JOIN
  • INTERSECT
  • EQUIJOIN

First, let's apply INNER JOIN to find the names of all playlists that contains instrumentals.

select distinct playlist_name
from (scores s inner join
track t on s.track_id=t.track_id) inner join
playlist p on p.playlist_id=t.playlist_id
where s.instrumentalness > 0.5;

If you want to join track table and playlist table, you can also use EQUIJOIN.

select distinct playlist_name
from playlist, track
where playlist.playlist_id = track.playlist_id

Or use INTERSECT

select playlist_name
from playlist
where palylist_id IN (select distinct playlist_id
from playlist
INTERSECT
select distinct playlist_id
from track)

References

  1. https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md
  2. https://github.com/cliburn/bios-823-2020