# Reproducible Data Analysis in Jupyter

Jupyter notebooks provide a useful environment for interactive exploration of data. A common question I get, though, is how you can progress from this nonlinear, interactive, trial-and-error style of exploration to a more linear and reproducible analysis based on organized, packaged, and tested code. This series of videos presents a case study in how I personally approach reproducible data analysis within the Jupyter notebook.

Each video is approximately 5-8 minutes long; the videos are available in a YouTube Playlist. Alternatively, you can find the videos below, along with brief descriptions and links to relevant resources.

In [1]:
# Quick utility to embed the videos below
from IPython.display import YouTubeVideo

def embed_video(index, playlist='PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ'):
    return YouTubeVideo('', index=index - 1, list=playlist, width=600, height=350)


## Part 1: Loading and Visualizing Data

In this video, I introduce the dataset and use the Jupyter notebook to download and visualize it.
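A minimal sketch of what a download step like this could look like, using only the standard library. The function name `get_data`, its parameters, and the choice to skip the download when a local copy exists are assumptions made for illustration; the actual code is developed in the video.

```python
import os
from urllib.request import urlretrieve


def get_data(url, filename, force_download=False):
    """Download `url` to `filename` unless a local copy already exists.

    Both arguments are placeholders: the real dataset URL and file name
    are not given in this post.
    """
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)
    return filename
```

Called as `get_data(some_url, "data.csv")`, this fetches the file once and reuses the local copy on later runs; passing `force_download=True` refreshes it.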

In [2]:
embed_video(1)


## Part 2: Further Data Exploration

In this video, I do some slightly more sophisticated visualization with the data, using matplotlib and pandas.
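As a rough illustration of this kind of pandas and matplotlib exploration, the sketch below pivots an hourly time series so that each day becomes one line on the plot. The data here is synthetic, and the column name `count` is a placeholder rather than anything taken from the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import numpy as np
import pandas as pd

# Synthetic hourly counts standing in for the real dataset (an assumption).
rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=24 * 28, freq=pd.Timedelta(hours=1))
hours = index.hour.to_numpy()

# Two daily peaks plus noise, loosely resembling commuter traffic.
signal = (50
          + 40 * np.exp(-((hours - 8) ** 2) / 4)
          + 40 * np.exp(-((hours - 17) ** 2) / 4))
data = pd.DataFrame({"count": signal + rng.normal(0, 5, len(index))}, index=index)

# One row per hour of the day, one column per calendar date.
pivoted = data.pivot_table("count", index=hours, columns=index.date)

# Overlay every day as a faint line to show the typical daily pattern.
ax = pivoted.plot(legend=False, alpha=0.3)
ax.set_xlabel("hour of day")
ax.set_ylabel("count")
```

Overlaying many semi-transparent lines is one simple way to see whether days share a common shape or split into distinct groups.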

In [3]:
embed_video(2)


## Part 3: Version Control with Git & GitHub

In this video, I set up a repository on GitHub and commit the notebook into version control.

In [4]:
embed_video(3)


## Part 4: Working with Data and GitHub

In [5]:
embed_video(4)


## Part 5: Creating a Python Package

In this video, I move the data download utility into its own separate package.

In [6]:
embed_video(5)


## Part 6: Unit Testing with PyTest
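For readers unfamiliar with the style, here is a small sketch of what pytest-style tests for a data-loading utility might look like. The `parse_counts` function and the `Date` and `Total` column names are hypothetical stand-ins, not the code from the video; pytest collects functions named `test_*` from test files and reports any failing `assert`.

```python
import csv
import io
from datetime import datetime


def parse_counts(text):
    """Hypothetical stand-in for a package's data loader.

    Parses CSV text with `Date` and `Total` columns into (datetime, int) rows.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [
        (datetime.strptime(row["Date"], "%Y-%m-%d %H:%M"), int(row["Total"]))
        for row in reader
    ]


SAMPLE = "Date,Total\n2020-01-01 00:00,4\n2020-01-01 01:00,7\n"


def test_parse_counts_types():
    # Every row should carry a parsed timestamp and an integer count.
    rows = parse_counts(SAMPLE)
    assert all(isinstance(when, datetime) for when, _ in rows)
    assert all(isinstance(total, int) for _, total in rows)


def test_parse_counts_values():
    # The first row should match the first line of the sample exactly.
    rows = parse_counts(SAMPLE)
    assert len(rows) == 2
    assert rows[0] == (datetime(2020, 1, 1, 0, 0), 4)
```

Saved in a file such as `test_data.py`, these would run with the `pytest` command; small checks like this make later refactoring much less risky.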

In [7]:
embed_video(6)


## Part 7: Refactoring for Speed

In this video, I refactor the data download function to be a bit faster.

In [8]:
embed_video(7)


## Part 8: Debugging a Broken Function

In this video, I discover that my refactoring has caused a bug. I debug it and fix it.

In [9]:
embed_video(8)


## Part 8.5: Finding and Fixing a scikit-learn Bug

In this video, I discover a bug in the scikit-learn codebase and go through the process of submitting a GitHub pull request that fixes it.

In [10]:
embed_video(9)


## Part 9: Further Data Exploration: PCA and GMM

In this video, I apply unsupervised learning techniques to the data to explore what we can learn from it.
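To give a sense of how PCA and a Gaussian mixture model fit together, here is a short sketch using scikit-learn on synthetic data. The two made-up daily shapes, the number of components, and all variable names are assumptions for illustration only; the analysis in the video works with the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the real data: 200 "days" of 24 hourly values,
# drawn from two different daily shapes (an assumption for illustration).
rng = np.random.default_rng(0)
hours = np.arange(24)
commute = (40 * np.exp(-((hours - 8) ** 2) / 4)
           + 40 * np.exp(-((hours - 17) ** 2) / 4))
leisure = 30 * np.exp(-((hours - 13) ** 2) / 18)
X = np.vstack([
    commute + rng.normal(0, 3, (100, 24)),
    leisure + rng.normal(0, 3, (100, 24)),
])

# Project each 24-dimensional day onto its two leading principal components.
X2 = PCA(n_components=2).fit_transform(X)

# Fit a two-component Gaussian mixture in the reduced space and label each day.
gmm = GaussianMixture(n_components=2, random_state=0)
labels = gmm.fit_predict(X2)
```

Reducing the dimensionality first makes the clusters easy to plot, and the resulting labels can then be compared against other attributes of each day, such as the day of the week.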

In [11]:
embed_video(10)


## Part 10: Cleaning Up the Notebook

In this video, I clean up the unsupervised learning analysis to make it more reproducible and presentable.

In [12]:
embed_video(11)


This post was composed within an IPython notebook; you can view a static version here or download the full source here.