Pythonic Perambulations

Musings and ramblings through the world of Python and beyond

Learning Seattle's Work Habits from Bicycle Counts (Updated!)

Last year I wrote a blog post examining trends in Seattle bicycling and how they relate to weather, daylight, day of the week, and other factors.

Here I want to revisit the same data from a different perspective: rather than making assumptions in order to build models that might describe the data, I'll instead wipe the slate clean and ask what information we can extract from the data themselves, without reliance on any model assumptions. In other words, where the previous post examined the data using a supervised machine learning approach for data modeling, this post will examine the data using an unsupervised learning approach for data exploration.

Along the way, we'll see some examples of importing, transforming, visualizing, and analyzing data in the Python language, using mostly Pandas, Matplotlib, and Scikit-learn. We will also see some real-world examples of the use of unsupervised machine learning algorithms, such as Principal Component Analysis and Gaussian Mixture Models, in exploring and extracting meaning from data.

To spoil the punchline (and perhaps whet your appetite), what we will find is that from an analysis of bicycle counts alone, we can make some definite statements about the aggregate work habits of Seattleites who commute by bicycle.
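
As a preview of the kind of pipeline involved, here is a rough sketch of the approach: pivot hourly counts into one row per day, reduce the dimensionality with PCA, and look for clusters with a Gaussian mixture. The file name and column names below are placeholders of my own, not the post's actual code.

# Sketch (assumed file/column names): cluster daily riding patterns.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# hypothetical CSV of hourly bicycle counts with a datetime 'Date' index
counts = pd.read_csv('fremont_counts.csv', index_col='Date', parse_dates=True)

# one row per day, one column per hour of the day
daily = counts['Total'].groupby([counts.index.date, counts.index.hour]).sum().unstack()

# project each day's 24-hour profile onto two principal components,
# then fit a two-component Gaussian mixture to label the days
X = PCA(n_components=2).fit_transform(daily.fillna(0).values)
labels = GaussianMixture(n_components=2).fit(X).predict(X)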

The Model Complexity Myth

An oft-repeated rule of thumb in any sort of statistical model fitting is "you can't fit a model with more parameters than data points". This idea appears to be as widespread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than data points, and this is much more than mere trivia with which you can impress the nerdiest of your friends: as I will show here, this fact can prove to be very useful in real-world scientific applications.

A model with more parameters than data points is known as an under-determined system, and it's a common misconception that such a model cannot be solved under any circumstances. In this post I will consider this misconception, which I call the model complexity myth. I'll start by showing where the myth holds true, first from an intuitive point of view, and then from a more mathematically-heavy point of view. I'll build from this mathematical treatment and discuss how underdetermined models may be addressed from a frequentist standpoint, and then from a Bayesian standpoint. (If you're unclear about the general differences between frequentist and Bayesian approaches, I might suggest reading my posts on the subject). Finally, I'll discuss some practical examples of where such an underdetermined model can be useful, and demonstrate one of these examples: quantitatively accounting for measurement biases in scientific data.
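
To make that claim concrete before diving in, here is a small toy sketch of my own (not an example from the post): a 31-parameter polynomial fit to 10 data points, made well-posed by adding an L2 (ridge) regularization penalty.

# Toy illustration: more free parameters than data points, solved via ridge regression.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.sort(rng.rand(10))                     # 10 data points
y = np.sin(10 * x) + 0.1 * rng.randn(10)

X = PolynomialFeatures(degree=30).fit_transform(x[:, None])   # 31 features
model = Ridge(alpha=1e-3).fit(X, y)           # the penalty makes the solution unique
print(model.coef_.shape)                      # (31,): more parameters than points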

Fast Lomb-Scargle Periodograms in Python

(Image source: astroML)

The Lomb-Scargle periodogram (named for Lomb (1976) and Scargle (1982)) is a classic method for finding periodicity in irregularly-sampled data. It is in many ways analogous to the more familiar Fourier Power Spectral Density (PSD) often used for detecting periodicity in regularly-sampled data.

Despite the importance of this method, until recently there have not been any (in my opinion) solid implementations of the algorithm available for easy use in Python. That has changed with the introduction of the gatspy package, which I recently released. In this post, I will compare several available Python implementations of the Lomb-Scargle periodogram, and discuss some of the considerations required when using it to analyze data.

To cut to the chase, I'd recommend using the gatspy package for Lomb-Scargle periodograms in Python, and particularly its gatspy.periodic.LombScargleFast algorithm which implements an efficient pure-Python version of Press & Rybicki's \(O[N\log N]\) periodogram. Below, I'll dive into the reasons for this recommendation.
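
To give a sense of what this looks like in practice, here is a minimal sketch of computing a periodogram with gatspy; the specific calls (LombScargleFast, fit, periodogram_auto) are from my recollection of the package and should be checked against its documentation.

# Sketch: Lomb-Scargle periodogram of an irregularly-sampled noisy sinusoid.
import numpy as np
from gatspy.periodic import LombScargleFast

rng = np.random.RandomState(42)
t = 100 * rng.rand(300)                          # irregular sample times
y = np.sin(2 * np.pi * t / 2.5) + 0.3 * rng.randn(300)
dy = 0.3 * np.ones_like(y)                       # measurement uncertainties

model = LombScargleFast().fit(t, y, dy)
periods, power = model.periodogram_auto(nyquist_factor=100)
print(periods[np.argmax(power)])                 # should be near the true 2.5-unit period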

Optimizing Python in the Real World: NumPy, Numba, and the NUFFT

Donald Knuth famously quipped that "premature optimization is the root of all evil." The reasons are straightforward: optimized code tends to be much more difficult to read and debug than simpler implementations of the same algorithm, and optimizing too early leads to greater costs down the road. In the Python world, there is another cost to optimization: optimized code is often written in a compiled language like Fortran or C, and this creates barriers to its development, use, and deployment.

Too often, tutorials about optimizing Python use trivial or toy examples which may not map well to the real world. I've certainly been guilty of this myself. Here, I'm going to take a different route: in this post I will outline the process of understanding, implementing, and optimizing a non-trivial algorithm in Python, in this case the Non-uniform Fast Fourier Transform (NUFFT). Along the way, we'll dig into the process of optimizing Python code, and see how a relatively straightforward pure Python implementation, with a little help from Numba, can be made to nearly match the performance of a highly-optimized Fortran implementation of the same algorithm.
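
As a flavor of the kind of speedup involved, here is a minimal sketch of my own, a toy pairwise-distance loop rather than the actual NUFFT code, showing how Numba's JIT compiler can turn plain Python loops into fast machine code.

# Toy Numba example: nested Python loops compiled to machine code with @njit.
import numpy as np
from numba import njit

@njit
def pairwise_dist(X):
    n, d = X.shape
    D = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                s += diff * diff
            D[i, j] = np.sqrt(s)
    return D

X = np.random.rand(500, 3)
D = pairwise_dist(X)   # the first call triggers compilation; later calls run at near-C speed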

The Hipster Effect: An IPython Interactive Exploration

This week I started seeing references all over the internet to this paper: The Hipster Effect: When Anticonformists All Look The Same. It describes a simple mathematical model of conformity and non-conformity among a mutually interacting population, and finds some interesting results: namely, conformity among a population of self-conscious non-conformists is similar to a phase transition in a time-delayed thermodynamic system. In other words, with enough hipsters around responding to delayed fashion trends, a plethora of facial hair and fixed-gear bikes is a natural result.

Also naturally, upon reading the paper I wanted to try to reproduce the work. The paper solves the problem analytically for a continuous system and shows the precise values of certain phase transitions within the long-term limit of the postulated system. Though such theoretical derivations are useful, I often find it more intuitive to simulate systems like this in a more approximate manner to gain hands-on understanding. By the end of this notebook, we'll be using IPython's incredible interactive widgets to explore how the inputs to this model affect the results.
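
As a taste of the interactive-widget pattern, here is a toy sketch of my own (not the paper's actual model): a crude delayed anti-conformity dynamic hooked up to a slider. The post's era used IPython.html.widgets; modern code imports the same functionality from ipywidgets.

# Toy sketch: explore a delayed anti-conformity dynamic with an interactive slider.
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

def plot_trend(delay=5, n_steps=200):
    """Agents react against the majority opinion they observed `delay` steps ago."""
    rng = np.random.default_rng(0)
    state = rng.choice([-1, 1], size=500)
    history = [state.mean()]
    for t in range(n_steps):
        observed = history[max(0, t - delay)]        # stale information
        flip = rng.random(state.size) < 0.1          # 10% of agents update each step
        state[flip] = -1 if observed > 0 else 1      # move against the observed majority
        history.append(state.mean())
    plt.plot(history)
    plt.ylim(-1.05, 1.05)
    plt.xlabel('time step')
    plt.ylabel('mean opinion')

interact(plot_trend, delay=(0, 50), n_steps=(50, 500));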

How Bad Is Your Colormap?

(Or, Why People Hate Jet – and You Should Too)

I made a little code snippet that I find helpful, and you might too:

import numpy as np
import matplotlib.pyplot as plt


def grayify_cmap(cmap):
    """Return a grayscale version of the colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))

    # convert RGBA to perceived grayscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]

    return cmap.from_list(cmap.name + "_grayscale", colors, cmap.N)

This function gives you a luminance-correct grayscale version of any matplotlib colormap. I've found this useful for quickly checking how my plots might appear if printed in black and white, but I think it's probably even more useful for stoking the flame of the internet's general rant against jet.
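
For example, to eyeball a colormap next to its grayscale equivalent (a usage sketch of my own, relying on the function above):

# Compare jet with its luminance-only counterpart side by side.
import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(50, 50)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.imshow(data, cmap=plt.cm.jet)
ax1.set_title('jet')
ax2.imshow(data, cmap=grayify_cmap('jet'))
ax2.set_title('grayified jet')
plt.show()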

Hacking Academia: Data Science and the University

A reflection on our SciFoo breakout session, where we discussed issues of data science within academia.

Almost a year ago, I wrote a post I called the Big Data Brain Drain, lamenting the ways that academia is neglecting the skills of modern data-intensive research, and in doing so is driving away many of the men and women who are perhaps best equipped to enable progress in these fields. This seemed to strike a chord with a wide range of people, and has led me to some incredible opportunities for conversation and collaboration on the subject. One of those conversations took place at the recent SciFoo conference, and this article is my way of recording some reflections on that conversation.

SciFoo is an annual gathering of several hundred scientists, writers, and thinkers sponsored by Digital Science, Nature, O'Reilly Media & Google. SciFoo brings together an incredibly eclectic group of people: I met philosophers, futurists, alien hunters, quantum physicists, mammoth cloners, magazine editors, science funders, astrophysicists, musicians, mycologists, mesmerists, and many, many more: the list could go on and on. The conference is about as unstructured as it can be: the organizers simply provide food, drink, and a venue for conversation, and attendees put together breakout discussions on nearly any imaginable topic. If you ever get the chance to go, my advice is to drop everything else and attend. It was one of the most quirky and intellectually stimulating weekends I've ever spent.

Frequentism and Bayesianism IV: How to be a Bayesian in Python

I've been spending a lot of time recently writing about frequentism and Bayesianism.

Here I want to back away from the philosophical debate and turn to more practical issues: in particular, demonstrating how you can apply these Bayesian ideas in Python. The workhorse of modern Bayesianism is Markov Chain Monte Carlo (MCMC), a class of algorithms used to efficiently sample posterior distributions.

Below I'll explore three mature Python packages for performing Bayesian analysis via MCMC:

  • emcee: the MCMC Hammer
  • pymc: Bayesian Statistical Modeling in Python
  • pystan: The Python Interface to Stan

I won't be concerned so much with speed benchmarks between the three as with a comparison of their respective APIs. This post is not meant to be a tutorial in any of the three; each of them is well documented, and the links above include introductory tutorials for that purpose. Rather, what I want to do here is a side-by-side comparison which will give a feel for how each package is used. I'll propose a single, relatively non-trivial test problem and show the implementation and results of this problem using all three packages. Hopefully by seeing the three approaches side-by-side, you can choose which package might be best suited for your particular application.
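
To give a flavor of what one of these interfaces looks like, here is a minimal sketch using emcee on a toy one-dimensional Gaussian posterior; this is my own illustration using emcee 3's API, not the post's test problem.

# Sample a 1D standard-normal "posterior" with emcee's EnsembleSampler.
import numpy as np
import emcee

def log_posterior(theta):
    return -0.5 * np.sum(theta ** 2)          # unnormalized log of a unit Gaussian

ndim, nwalkers, nsteps = 1, 20, 1000
p0 = np.random.randn(nwalkers, ndim)          # starting positions for each walker
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(p0, nsteps)

samples = sampler.get_chain(discard=100, flat=True)   # drop burn-in, flatten walkers
print(samples.mean(), samples.std())          # should be close to 0 and 1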

Frequentism and Bayesianism III: Confidence, Credibility, and why Frequentism and Science do not Mix

In Douglas Adams' classic Hitchhiker's Guide to the Galaxy, hyper-intelligent pan-dimensional beings build a computer named Deep Thought in order to calculate "the Answer to the Ultimate Question of Life, the Universe, and Everything". After seven and a half million years spinning its hyper-dimensional gears, before an excited crowd, Deep Thought finally outputs the answer:

42

The disappointed technicians, who trained a lifetime for this moment, are stupefied. They probe Deep Thought for more information, and after some back-and-forth, the computer responds: "once you do know what the question actually is, you'll know what the answer means."

An answer does you no good if you don't know the question.

I find this story to be an apt metaphor for statistics as it is sometimes used in the scientific literature. When trying to estimate the value of an unknown parameter, the frequentist approach generally relies on a confidence interval (CI), while the Bayesian approach relies on a credible region (CR). While these concepts sound and look very similar, their subtle difference can be extremely important, as they answer essentially different questions.
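
In schematic terms (paraphrasing the standard definitions): a frequentist 95% confidence interval is a recipe \([\theta_{lo}(D),\, \theta_{hi}(D)]\) constructed so that

\[ P\bigl(\theta_{lo}(D) \le \theta_{\rm true} \le \theta_{hi}(D)\bigr) = 0.95, \]

where the probability is over many hypothetical repetitions of the data \(D\) with \(\theta_{\rm true}\) held fixed, while a Bayesian 95% credible region satisfies

\[ P\bigl(\theta_{lo} \le \theta_{\rm true} \le \theta_{hi} \mid D_{\rm obs}\bigr) = 0.95, \]

where the probability is over the parameter itself, conditioned on the one dataset actually observed.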

Like the poor souls hoping for enlightenment in Douglas Adams' universe, scientists often turn the crank of frequentism hoping for useful answers, but in the process overlook the fact that in science, frequentism is generally answering the wrong question. This is far from simple philosophical navel-gazing: as I'll show, it can have real consequences for the conclusions we draw from observed data.