Home | Pythonic Perambulations

A D3 Viewer for Matplotlib Visualizations

Thu 19 December 2013

I've spent a lot of time recently attempting to push the boundaries of tools for interactive data exploration within the IPython notebook. I have worked on animations, including an HTML5 embedding and a Javascript Viewer. I have worked on javascript/python kernel interaction and static javascript widgets. But I would say that the holy grail of interactive data visualization in the IPython notebook is, as I've mentioned previously, a truly interactive in-browser matplotlib display.

There are many people pushing in this direction in the Python world. Bokeh and Plotly are new visualization packages which have built Python APIs from scratch. The demos are beautiful and impressive, and the APIs are clean and intuitive. But, because matplotlib is so well-established in the Python world, it would be nice to be able to continue using it even in the age of browser-based visualization. To this end, some of the matplotlib core devs have been working on a WebGL viewer for matplotlib figures. I've seen a working demo, and it's very cool, but last I heard it still has a long way to go.

I've been wondering for a while whether it might be possible to create a solution using D3.js. D3 (short for data-driven documents) is a framework which facilitates the easy creation and manipulation of groups of HTML objects. Combined with the native SVG support of modern web browsers, it provides an extremely powerful and flexible low-level interface to creating interactive graphics on the web. I've long wondered what it would take to write a D3 backend or frontend for matplotlib, but I'd never experimented with the idea. It was a couple weeks ago at Seattle Beer && Code meetup that I chatted with some expert Javascript hackers who pointed out where I might start.

I started to try things out, and over the course of a few late nights, came up with a first attempt at a partial interactive D3 viewer for matplotlib images: the result is the mpld3 package, available on my GitHub page.

matplotlib mpld3

Static Interactive Widgets for IPython Notebooks

Thu 05 December 2013

The inspiration of my previous kernel density estimation post was a blog post by Michael Lerner, who used my JSAnimation tools to do a nice interactive demo of the relationship between kernel density estimation and histograms.

This was on the heels of Brian Granger's excellent PyData NYC Keynote where he live-demoed the brand new IPython interactive tools. This new functionality is very cool. Wes McKinney remarked that day on Twitter that "IPython's interact machinery is going to be a huge deal". I completely agree: the ability to quickly generate interactive widgets to explore your data is going to change the way a lot of people do their daily scientific computing work.

But there's one issue with the new widget framework as it currently stands: unless you're connected to an IPython kernel (i.e. actually running IPython to view your notebook), the widgets are useless. Don't get me wrong: they're incredibly cool when you're actually interacting with data. But the bread-and-butter of this blog and many others is static notebook views: for this purpose, widgets with callbacks to the Python kernel aren't so helpful.

This is where ipywidgets comes in.

matplotlib javascript

Kernel Density Estimation in Python

Sun 01 December 2013

Last week Michael Lerner posted a nice explanation of the relationship between histograms and kernel density estimation (KDE). I've made some attempts in this direction before (both in the scikit-learn documentation and in our upcoming textbook), but Michael's use of interactive javascript widgets makes the relationship extremely intuitive. I had been planning to write a similar post on the theory behind KDE and why it's useful, but Michael took care of that part. Instead, I'm going to focus here on comparing the actual implementations of KDE currently available in Python. If you're unsure what kernel density estimation is, read Michael's post and then come back here.

There are several options available for computing kernel density estimates in Python. The question of the optimal KDE implementation for any situation, however, is not entirely straightforward, and depends a lot on what your particular goals are. Here are the four KDE implementations I'm aware of in the SciPy/Scikits stack:

In SciPy: gaussian_kde.
In Statsmodels: KDEUnivariate and KDEMultivariate (See an example here).
In Scikit-learn: KernelDensity (See further examples here).

Each has advantages and disadvantages, and each has its area of applicability. I'll start with a table summarizing the strengths and weaknesses of each, before discussing each feature in more detail and running some simple benchmarks to gauge their computational cost:

tutorial benchmarks

The Big Data Brain Drain: Why Science is in Trouble

Sat 26 October 2013

Regardless of what you might think of the ubiquity of the "Big Data" meme, it's clear that the growing size of datasets is changing the way we approach the world around us. This is true in fields from industry to government to media to academia and virtually everywhere in between. Our increasing abilities to gather, process, visualize, and learn from large datasets is helping to push the boundaries of our knowledge.

But where scientific research is concerned, this recently accelerated shift to data-centric science has a dark side, which boils down to this: the skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry. While academia, with typical inertia, gradually shifts to accommodate this, the rest of the world has already begun to embrace and reward these skills to a much greater degree. The unfortunate result is that some of the most promising upcoming researchers are finding no place for themselves in the academic community, while the for-profit world of industry stands by with deep pockets and open arms.

academia opinion

Understanding the FFT Algorithm

Wed 28 August 2013

The Fast Fourier Transform (FFT) is one of the most important algorithms in signal processing and data analysis. I've used it for years, but having no formal computer science background, It occurred to me this week that I've never thought to ask how the FFT computes the discrete Fourier transform so quickly. I dusted off an old algorithms book and looked into it, and enjoyed reading about the deceptively simple computational trick that JW Cooley and John Tukey outlined in their classic 1965 paper introducing the subject.

The goal of this post is to dive into the Cooley-Tukey FFT algorithm, explaining the symmetries that lead to it, and to show some straightforward Python implementations putting the theory into practice. My hope is that this exploration will give data scientists like myself a more complete picture of what's going on in the background of the algorithms we use.

fft tutorial

Conway's Game of Life in Python

Wed 07 August 2013

In 1970 the British Mathematician John Conway created his "Game of Life" -- a set of rules that mimics the chaotic yet patterned growth of a colony of biological organisms. The "game" takes place on a two-dimensional grid consisting of "living" and "dead" cells, and the rules to step from generation to generation are simple:

Overpopulation: if a living cell is surrounded by more than three living cells, it dies.
Stasis: if a living cell is surrounded by two or three living cells, it survives.
Underpopulation: if a living cell is surrounded by fewer than two living cells, it dies.
Reproduction: if a dead cell is surrounded by exactly three cells, it becomes a live cell.

By enforcing these rules in sequential steps, beautiful and unexpected patterns can appear.

I was thinking about classic problems that could be used to demonstrate the effectiveness of Python for computing and visualizing dynamic phenomena, and thought back to a high school course I took where we had an assignment to implement a Game Of Life computation in C++. If only I'd had access to IPython and associated tools back then, my homework assignment would have been a whole lot easier!

Here I'll use Python and NumPy to compute generational steps for the game of life, and use my JSAnimation package to animate the results.

matplotlib animation tutorial

XKCD Plots have Landed in Matplotlib!

Wed 10 July 2013

One of the most consistently popular posts on this blog has been my XKCDify post, where I followed in the footsteps of others to write a little hack for xkcd-style plotting in matplotlib. In it, I mentioned the Sketch Path Filter pull request that would eventually supersede my ugly little hack.

Well, "eventually" has finally come. Observe:

inline

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

plt.plot(sin(linspace(0, 10))) plt.title('Whoo Hoo!!!')

<matplotlib.text.Text at 0x2fade10>

The plt.xkcd() function enables some rcParam settings which can automatically convert any matplotlib plot into XKCD style. You can peruse the matplotlib xkcd gallery here for inspiration, or read on where I'll show off some of my favorite of the possibilities.

matplotlib xkcd

Numba vs. Cython: Take 2

Sat 15 June 2013

Last summer I wrote a post comparing the performance of Numba and Cython for optimizing array-based computation. Since posting, the page has received thousands of hits, and resulted in a number of interesting discussions. But in the meantime, the Numba package has come a long way both in its interface and its performance.

Here I want to revisit those timing comparisons with a more recent Numba release, using the newer and more convenient autojit syntax, and also add in a few additional benchmarks for completeness. I've also written this post entirely within an IPython notebook, so it can be easily downloaded and modified.

As before, I'll use a pairwise distance function. This will take an array representing M points in N dimensions, and return the M x M matrix of pairwise distances. This is a nice test function for a few reasons. First of all, it's a very clean and well-defined test. Second of all, it illustrates the kind of array-based operation that is common in statistics, datamining, and machine learning. Third, it is a function that results in large memory consumption if the standard numpy broadcasting approach is used (it requires a temporary array containing M * M * N elements), making it a good candidate for an alternate approach.

benchmarks numba cython

IPython Notebook: Javascript/Python Bi-directional Communication

Sat 01 June 2013

I've been working with javascript and the IPython notebook recently, and found myself in need of a way to pass data back and forth between the Javascript runtime and the IPython kernel. There's a bit of information about this floating around on various mailing lists and forums, but no real organized tutorial on the subject. Partly this is because the tools are relatively specialized, and partly it's because the functionality I'll outline here is planned to be obsolete in the 2.0 release of IPython.

Nevertheless, I thought folks might be interested to hear what I've learned. What follows are a few basic examples of moving data back and forth between the IPython kernel and the browser's javascript. Note that if you're viewing this statically (i.e. on the blog or on nbviewer) then the javascript calls below will not work: to see the code in action, you'll need to download the notebook and open it in IPython.

ipython jupyter

A Simple Animation: The Magic Triangle

Tue 28 May 2013

I've been spending a lot of time lately playing with animations in IPython notebook. Here's my latest one - a slightly puzzling rearrangement of shapes into two different triangles:

Where does the extra block go? Look close and you might be able to figure it out.

matplotlib animation