Frequentism and Bayesianism II: When Results Differ

In a previous post I gave a brief practical introduction to frequentism and Bayesianism as they relate to the analysis of scientific data. In it, I discussed the fundamental philosophical difference between frequentism and Bayesianism, and showed several simple problems where the two approaches give basically the same results.

While it is easy to show that the two approaches are often equivalent for simple problems, it is also true that they can diverge greatly for more complicated problems. I've found that in practice, this divergence makes itself most clear in two different situations:

  1. The handling of nuisance parameters
  2. The subtle (and often overlooked) difference between frequentist confidence intervals and Bayesian credible intervals

The second point is a bit more philosophical and in-depth, and I'm going to save it for a later post and focus here on the first point: the difference between frequentist and Bayesian treatment of nuisance parameters. Though I tried my best to stay impartial in the previous post, here you'll start to see my leanings toward the Bayesian approach. Consider this a warmup for when I get around to addressing point number 2: that will likely get downright polemical.

Why Python is Slow: Looking Under the Hood

An Introduction to the Python Buffer Protocol

This is a bit of a niche topic, but I figured there might be one or two people out there who would find this useful (including my future self)... today I managed to implement a simple Python object which exposes the buffer protocol. If that means nothing to you, you may want to stop reading and instead browse this gallery of puppy gifs.

But if you're the kind of person who becomes mildly excited at the words Python buffer protocol, I hope this short post will help you in your quest...

Frequentism and Bayesianism: A Practical Introduction

One of the first things a scientist hears about statistics is that there is are two different approaches: frequentism and Bayesianism. Despite their importance, many scientific researchers never have opportunity to learn the distinctions between them and the different practical approaches that result. The purpose of this post is to synthesize the philosophical and pragmatic aspects of the frequentist and Bayesian approaches, so that scientists like myself might be better prepared to understand the types of data analysis people do.

I'll start by addressing the philosophical distinctions between the views, and from there move to discussion of how these ideas are applied in practice, with some Python code snippets demonstrating the difference between the approaches.

D3 Plugins: Truly Interactive Matplotlib In Your Browser

It's been a few weeks since I introduced mpld3, a toolkit for visualizing matplotlib graphics in-browser via d3, and a lot of progress has been made. I've added a lot of features, and there have also been dozens of pull requests from other users which have made the package much more complete.

One thing I recognized early was the potential to use mpld3 to add interactive bells and whistles to matplotlib plots. Yesterday, at the #HackAAS event at the American Astronomical Society meeting, I had a solid chunk of time to work this out. The result is what I'll describe and demonstrate below: a general plugin framework for matplotlib plots, which, given some d3 skills, can be utilized to modify your matplotlib plots in almost any imaginable way.

A D3 Viewer for Matplotlib Visualizations

I've spent a lot of time recently attempting to push the boundaries of tools for interactive data exploration within the IPython notebook. I have worked on animations, including an HTML5 embedding and a Javascript Viewer. I have worked on javascript/python kernel interaction and static javascript widgets. But I would say that the holy grail of interactive data visualization in the IPython notebook is, as I've mentioned previously, a truly interactive in-browser matplotlib display.

There are many people pushing in this direction in the Python world. Bokeh and Plotly are new visualization packages which have built Python APIs from scratch. The demos are beautiful and impressive, and the APIs are clean and intuitive. But, because matplotlib is so well-established in the Python world, it would be nice to be able to continue using it even in the age of browser-based visualization. To this end, some of the matplotlib core devs have been working on a WebGL viewer for matplotlib figures. I've seen a working demo, and it's very cool, but last I heard it still has a long way to go.

I've been wondering for a while whether it might be possible to create a solution using D3.js. D3 (short for data-driven documents) is a framework which facilitates the easy creation and manipulation of groups of HTML objects. Combined with the native SVG support of modern web browsers, it provides an extremely powerful and flexible low-level interface to creating interactive graphics on the web. I've long wondered what it would take to write a D3 backend or frontend for matplotlib, but I'd never experimented with the idea. It was a couple weeks ago at Seattle Beer && Code meetup that I chatted with some expert Javascript hackers who pointed out where I might start.

I started to try things out, and over the course of a few late nights, came up with a first attempt at a partial interactive D3 viewer for matplotlib images: the result is the mpld3 package, available on my GitHub page.

Static Interactive Widgets for IPython Notebooks

The inspiration of my previous kernel density estimation post was a blog post by Michael Lerner, who used my JSAnimation tools to do a nice interactive demo of the relationship between kernel density estimation and histograms.

This was on the heels of Brian Granger's excellent PyData NYC Keynote where he live-demoed the brand new IPython interactive tools. This new functionality is very cool. Wes McKinney remarked that day on Twitter that "IPython's interact machinery is going to be a huge deal". I completely agree: the ability to quickly generate interactive widgets to explore your data is going to change the way a lot of people do their daily scientific computing work.

But there's one issue with the new widget framework as it currently stands: unless you're connected to an IPython kernel (i.e. actually running IPython to view your notebook), the widgets are useless. Don't get me wrong: they're incredibly cool when you're actually interacting with data. But the bread-and-butter of this blog and many others is static notebook views: for this purpose, widgets with callbacks to the Python kernel aren't so helpful.

This is where ipywidgets comes in.

Kernel Density Estimation in Python

Last week Michael Lerner posted a nice explanation of the relationship between histograms and kernel density estimation (KDE). I've made some attempts in this direction before (both in the scikit-learn documentation and in our upcoming textbook), but Michael's use of interactive javascript widgets makes the relationship extremely intuitive. I had been planning to write a similar post on the theory behind KDE and why it's useful, but Michael took care of that part. Instead, I'm going to focus here on comparing the actual implementations of KDE currently available in Python. If you're unsure what kernel density estimation is, read Michael's post and then come back here.

There are several options available for computing kernel density estimates in Python. The question of the optimal KDE implementation for any situation, however, is not entirely straightforward, and depends a lot on what your particular goals are. Here are the four KDE implementations I'm aware of in the SciPy/Scikits stack:

Each has advantages and disadvantages, and each has its area of applicability. I'll start with a table summarizing the strengths and weaknesses of each, before discussing each feature in more detail and running some simple benchmarks to gauge their computational cost:

The Big Data Brain Drain: Why Science is in Trouble

Regardless of what you might think of the ubiquity of the "Big Data" meme, it's clear that the growing size of datasets is changing the way we approach the world around us. This is true in fields from industry to government to media to academia and virtually everywhere in between. Our increasing abilities to gather, process, visualize, and learn from large datasets is helping to push the boundaries of our knowledge.

But where scientific research is concerned, this recently accelerated shift to data-centric science has a dark side, which boils down to this: the skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry. While academia, with typical inertia, gradually shifts to accommodate this, the rest of the world has already begun to embrace and reward these skills to a much greater degree. The unfortunate result is that some of the most promising upcoming researchers are finding no place for themselves in the academic community, while the for-profit world of industry stands by with deep pockets and open arms.

Understanding the FFT Algorithm

The Fast Fourier Transform (FFT) is one of the most important algorithms in signal processing and data analysis. I've used it for years, but having no formal computer science background, It occurred to me this week that I've never thought to ask how the FFT computes the discrete Fourier transform so quickly. I dusted off an old algorithms book and looked into it, and enjoyed reading about the deceptively simple computational trick that JW Cooley and John Tukey outlined in their classic 1965 paper introducing the subject.

The goal of this post is to dive into the Cooley-Tukey FFT algorithm, explaining the symmetries that lead to it, and to show some straightforward Python implementations putting the theory into practice. My hope is that this exploration will give data scientists like myself a more complete picture of what's going on in the background of the algorithms we use.