Pythonic Perambulations

Musings and ramblings through the world of Python and beyond

Frequentism and Bayesianism: A Practical Introduction

One of the first things a scientist hears about statistics is that there are two different approaches: frequentism and Bayesianism. Despite their importance, many scientific researchers never have the opportunity to learn the distinctions between them and the different practical approaches that result. The purpose of this post is to synthesize the philosophical and pragmatic aspects of the frequentist and Bayesian approaches, so that scientists like myself might be better prepared to understand the types of data analysis people do.

I'll start by addressing the philosophical distinctions between the views, and from there move to a discussion of how these ideas are applied in practice, with some Python code snippets demonstrating the difference between the approaches.
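
As a tiny preview of the flavor of those snippets, here is a minimal sketch (my own illustration, not an example from the post itself) of estimating a mean both ways; for Gaussian data with a flat prior, the two approaches land on essentially the same numbers:

import numpy as np

np.random.seed(0)
data = np.random.normal(50, 5, size=100)   # hypothetical repeated measurements
sigma = data.std(ddof=1)

# Frequentist: point estimate of the mean, with its standard error
mean, stderr = data.mean(), sigma / np.sqrt(len(data))
print("frequentist: {0:.2f} +/- {1:.2f}".format(mean, stderr))

# Bayesian: posterior for the true mean evaluated on a grid, assuming a
# flat prior and a Gaussian likelihood with the width estimated above
mu = np.linspace(mean - 4 * stderr, mean + 4 * stderr, 1000)
log_post = -0.5 * ((data[:, None] - mu) ** 2).sum(0) / sigma ** 2
post = np.exp(log_post - log_post.max())
post /= post.sum() * (mu[1] - mu[0])   # normalize to a probability density
print("posterior mean: {0:.2f}".format((mu * post).sum() * (mu[1] - mu[0])))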

D3 Plugins: Truly Interactive Matplotlib In Your Browser

It's been a few weeks since I introduced mpld3, a toolkit for visualizing matplotlib graphics in-browser via d3, and a lot of progress has been made. I've added a lot of features, and there have also been dozens of pull requests from other users which have made the package much more complete.

One thing I recognized early was the potential to use mpld3 to add interactive bells and whistles to matplotlib plots. Yesterday, at the #HackAAS event at the American Astronomical Society meeting, I had a solid chunk of time to work this out. The result is what I'll describe and demonstrate below: a general plugin framework for matplotlib plots which, given some d3 skills, can be used to modify your matplotlib plots in almost any imaginable way.
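
To give a flavor of the plugin framework from the user's side, here is a minimal sketch using one of the built-in plugins; the names here reflect the current mpld3 API and may differ in detail from the version described in this post:

import matplotlib.pyplot as plt
import mpld3
from mpld3 import plugins

# Build an ordinary matplotlib figure...
fig, ax = plt.subplots()
points = ax.plot([1, 2, 3, 4], [3, 1, 4, 2], 'o', markersize=10)

# ...then attach a built-in plugin: hovering over a point shows its label
labels = ['first', 'second', 'third', 'fourth']
plugins.connect(fig, plugins.PointLabelTooltip(points[0], labels=labels))

mpld3.display(fig)  # render the interactive D3 version in the notebook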

A D3 Viewer for Matplotlib Visualizations

I've spent a lot of time recently attempting to push the boundaries of tools for interactive data exploration within the IPython notebook. I have worked on animations, including an HTML5 embedding and a Javascript Viewer. I have worked on javascript/python kernel interaction and static javascript widgets. But I would say that the holy grail of interactive data visualization in the IPython notebook is, as I've mentioned previously, a truly interactive in-browser matplotlib display.

There are many people pushing in this direction in the Python world. Bokeh and Plotly are new visualization packages which have built Python APIs from scratch. The demos are beautiful and impressive, and the APIs are clean and intuitive. But, because matplotlib is so well-established in the Python world, it would be nice to be able to continue using it even in the age of browser-based visualization. To this end, some of the matplotlib core devs have been working on a WebGL viewer for matplotlib figures. I've seen a working demo, and it's very cool, but last I heard it still has a long way to go.

I've been wondering for a while whether it might be possible to create a solution using D3.js. D3 (short for Data-Driven Documents) is a framework that facilitates the creation and manipulation of groups of HTML objects. Combined with the native SVG support of modern web browsers, it provides an extremely powerful and flexible low-level interface for creating interactive graphics on the web. I've long wondered what it would take to write a D3 backend or frontend for matplotlib, but I'd never experimented with the idea. It was a couple of weeks ago at the Seattle Beer && Code meetup that I chatted with some expert Javascript hackers who pointed out where I might start.

I started to try things out, and over the course of a few late nights, came up with a first attempt at a partial interactive D3 viewer for matplotlib images: the result is the mpld3 package, available on my GitHub page.
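
The intended workflow is simple: build a figure with matplotlib as usual, then hand it to mpld3 to render via D3. A minimal sketch, using the package's current function names (which may have changed since this early version):

import matplotlib.pyplot as plt
import numpy as np
import mpld3

# Create a standard matplotlib figure
fig, ax = plt.subplots()
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), lw=2)
ax.set_title('rendered with D3 rather than a static PNG')

mpld3.display(fig)  # in the IPython notebook: show the D3 version inline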

Static Interactive Widgets for IPython Notebooks

The inspiration for my previous kernel density estimation post was a blog post by Michael Lerner, who used my JSAnimation tools to build a nice interactive demo of the relationship between kernel density estimation and histograms.

This was on the heels of Brian Granger's excellent PyData NYC keynote, where he live-demoed the brand new IPython interactive tools. This new functionality is very cool. Wes McKinney remarked that day on Twitter that "IPython's interact machinery is going to be a huge deal". I completely agree: the ability to quickly generate interactive widgets to explore your data is going to change the way a lot of people do their daily scientific computing work.

But there's one issue with the new widget framework as it currently stands: unless you're connected to an IPython kernel (i.e. actually running IPython to view your notebook), the widgets are useless. Don't get me wrong: they're incredibly cool when you're actually interacting with data. But the bread-and-butter of this blog and many others is static notebook views: for this purpose, widgets with callbacks to the Python kernel aren't so helpful.
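
For context, a kernel-backed widget looks roughly like this minimal sketch (written with today's import path, which happens to share the ipywidgets name with the separate static-widget package introduced below); the callback re-runs in the live Python kernel on every slider move, which is exactly what a static HTML view cannot do:

import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact  # modern import path, an assumption here

def plot_sine(frequency=1.0):
    # Re-executed in the Python kernel on every slider change
    x = np.linspace(0, 10, 500)
    plt.plot(x, np.sin(frequency * x))
    plt.show()

interact(plot_sine, frequency=(0.5, 5.0))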

This is where ipywidgets comes in.

Kernel Density Estimation in Python

Last week Michael Lerner posted a nice explanation of the relationship between histograms and kernel density estimation (KDE). I've made some attempts in this direction before (both in the scikit-learn documentation and in our upcoming textbook), but Michael's use of interactive javascript widgets makes the relationship extremely intuitive. I had been planning to write a similar post on the theory behind KDE and why it's useful, but Michael took care of that part. Instead, I'm going to focus here on comparing the actual implementations of KDE currently available in Python. If you're unsure what kernel density estimation is, read Michael's post and then come back here.

There are several options available for computing kernel density estimates in Python. The question of the optimal KDE implementation for any situation, however, is not entirely straightforward, and depends a lot on what your particular goals are. Here are the four KDE implementations I'm aware of in the SciPy/Scikits stack:

  • gaussian_kde, in scipy.stats
  • KDEUnivariate, in statsmodels.nonparametric
  • KDEMultivariate, in statsmodels.nonparametric
  • KernelDensity, in sklearn.neighbors

Each has advantages and disadvantages, and each has its area of applicability. I'll start with a table summarizing the strengths and weaknesses of each, before discussing each feature in more detail and running some simple benchmarks to gauge their computational cost.
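
For a first taste of the API differences, here is a minimal sketch of two of these implementations (the bandwidth value is illustrative, not tuned):

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

np.random.seed(0)
x = np.random.normal(size=100)        # one-dimensional sample
grid = np.linspace(-4, 4, 200)

# scipy: bandwidth set by a rule of thumb; evaluation by calling the object
kde_scipy = gaussian_kde(x)
dens_scipy = kde_scipy(grid)

# scikit-learn: bandwidth is explicit; inputs are 2D; output is log-density
kde_skl = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(x[:, None])
dens_skl = np.exp(kde_skl.score_samples(grid[:, None]))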

The Big Data Brain Drain: Why Science is in Trouble

Regardless of what you might think of the ubiquity of the "Big Data" meme, it's clear that the growing size of datasets is changing the way we approach the world around us. This is true in fields from industry to government to media to academia and virtually everywhere in between. Our increasing ability to gather, process, visualize, and learn from large datasets is helping to push the boundaries of our knowledge.

But where scientific research is concerned, this recently accelerated shift to data-centric science has a dark side, which boils down to this: the skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry. While academia, with typical inertia, gradually shifts to accommodate this, the rest of the world has already begun to embrace and reward these skills to a much greater degree. The unfortunate result is that some of the most promising upcoming researchers are finding no place for themselves in the academic community, while the for-profit world of industry stands by with deep pockets and open arms.

Understanding the FFT Algorithm

The Fast Fourier Transform (FFT) is one of the most important algorithms in signal processing and data analysis. I've used it for years, but because I have no formal computer science background, it occurred to me this week that I've never thought to ask how the FFT computes the discrete Fourier transform so quickly. I dusted off an old algorithms book and looked into it, and enjoyed reading about the deceptively simple computational trick that J.W. Cooley and John Tukey outlined in their classic 1965 paper introducing the subject.

The goal of this post is to dive into the Cooley-Tukey FFT algorithm, explaining the symmetries that lead to it, and to show some straightforward Python implementations putting the theory into practice. My hope is that this exploration will give data scientists like myself a more complete picture of what's going on in the background of the algorithms we use.
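
As a preview of where the post is headed, here's a minimal sketch of the radix-2 Cooley-Tukey recursion (a simplified illustration, assuming the input length is a power of two): the length-N DFT is split into DFTs of the even- and odd-indexed samples, which are recombined using "twiddle factors":

import numpy as np

def fft_v1(x):
    """A sketch of the radix-2 Cooley-Tukey recursion; assumes len(x) is a power of 2."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    even, odd = fft_v1(x[0::2]), fft_v1(x[1::2])
    # Twiddle factors exploit the symmetry X[k + N/2] = E[k] - w^k O[k]
    w = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([even + w * odd, even - w * odd])

# Sanity check against numpy's built-in FFT
x = np.random.random(1024)
print(np.allclose(fft_v1(x), np.fft.fft(x)))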

The Game of Life in Python

In 1970 the British mathematician John Conway created his "Game of Life" -- a set of rules that mimics the chaotic yet patterned growth of a colony of biological organisms. The "game" takes place on a two-dimensional grid consisting of "living" and "dead" cells, and the rules to step from generation to generation are simple:

  • Overpopulation: if a living cell is surrounded by more than three living cells, it dies.
  • Stasis: if a living cell is surrounded by two or three living cells, it survives.
  • Underpopulation: if a living cell is surrounded by fewer than two living cells, it dies.
  • Reproduction: if a dead cell is surrounded by exactly three living cells, it becomes a live cell.

By applying these rules in sequential steps, beautiful and unexpected patterns can emerge.

I was thinking about classic problems that could be used to demonstrate the effectiveness of Python for computing and visualizing dynamic phenomena, and thought back to a high school course I took where we had an assignment to implement a Game of Life computation in C++. If only I'd had access to IPython and associated tools back then, my homework assignment would have been a whole lot easier!

Here I'll use Python and NumPy to compute generational steps for the game of life, and use my JSAnimation package to animate the results.
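
As a taste of the NumPy approach, here is a minimal sketch of a single generational step, assuming the grid is a 2D array of ones and zeros with wrap-around (periodic) boundaries (the animation machinery is left out):

import numpy as np

def life_step(X):
    """Apply one Game of Life step to a 2D array of 0s and 1s (periodic boundaries)."""
    # Count each cell's live neighbors by summing the eight shifted copies of the grid
    neighbors = sum(np.roll(np.roll(X, i, axis=0), j, axis=1)
                    for i in (-1, 0, 1) for j in (-1, 0, 1)
                    if (i, j) != (0, 0))
    # Stasis and reproduction: live cells with 2-3 neighbors survive;
    # dead cells with exactly 3 living neighbors become alive
    return ((neighbors == 3) | ((X == 1) & (neighbors == 2))).astype(int)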

XKCD Plots in Matplotlib: Going the Whole Way

One of the most consistently popular posts on this blog has been my XKCDify post, where I followed in the footsteps of others to write a little hack for xkcd-style plotting in matplotlib. In it, I mentioned the Sketch Path Filter pull request that would eventually supersede my ugly little hack.

Well, "eventually" has finally come. Observe:

In [1]:
%pylab inline

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

In [2]:
plt.xkcd()  # Yes...
plt.plot(sin(linspace(0, 10)))
plt.title('Whoo Hoo!!!')
Out[2]:
<matplotlib.text.Text at 0x2fade10>

The plt.xkcd() function enables a set of rcParam settings that automatically convert any matplotlib plot into XKCD style. You can peruse the matplotlib xkcd gallery here for inspiration, or read on where I'll show off some of my favorites among the possibilities.
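
One aside worth knowing: because these are just rcParams, the style can be scoped rather than global. In recent matplotlib versions plt.xkcd() also works as a context manager (a detail worth checking against your installed version), along these lines:

import numpy as np
import matplotlib.pyplot as plt

# Scope the xkcd style to a single figure, leaving global rcParams untouched
with plt.xkcd():
    plt.plot(np.sin(np.linspace(0, 10)))
    plt.title('Scoped XKCD style')
plt.show()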

Numba vs. Cython: Take 2

Last summer I wrote a post comparing the performance of Numba and Cython for optimizing array-based computation. Since posting, the page has received thousands of hits, and resulted in a number of interesting discussions. But in the meantime, the Numba package has come a long way both in its interface and its performance.

Here I want to revisit those timing comparisons with a more recent Numba release, using the newer and more convenient autojit syntax, and also add in a few additional benchmarks for completeness. I've also written this post entirely within an IPython notebook, so it can be easily downloaded and modified.

As before, I'll use a pairwise distance function. This will take an array representing M points in N dimensions, and return the M x M matrix of pairwise distances. This is a nice test function for a few reasons. First, it's a very clean and well-defined test. Second, it illustrates the kind of array-based operation that is common in statistics, data mining, and machine learning. Third, it is a function that results in large memory consumption if the standard numpy broadcasting approach is used (it requires a temporary array containing M * M * N elements), making it a good candidate for an alternate approach.
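
To make the setup concrete, here are two illustrative baselines (not necessarily the exact benchmarked code): a naive pure-Python version of the function, and the numpy broadcasting one-liner whose M x M x N temporary array motivates the alternatives:

import numpy as np

def pairwise_python(X):
    """Naive nested-loop pairwise Euclidean distances: (M, N) array -> (M, M) matrix."""
    M, N = X.shape
    D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

def pairwise_numpy(X):
    """Broadcasting version: fast to write, but builds an (M, M, N) temporary array."""
    return np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))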