# Optimizing Python in the Real World: NumPy, Numba, and the NUFFT

Donald Knuth famously quipped that "premature optimization is the root of all evil." The reasons are straightforward: optimized code tends to be much more difficult to read and debug than simpler implementations of the same algorithm, and optimizing too early leads to greater costs down the road. In the Python world, there is another cost to optimization: optimized code often is written in a compiled language like Fortran or C, and this leads to barriers to its development, use, and deployment.

Too often, tutorials about optimizing Python use trivial or toy examples which may not map well to the real world. I've certainly been guilty of this myself. Here, I'm going to take a different route: in this post I will outline the process of understanding, implementing, and optimizing a non-trivial algorithm in Python, in this case the Non-uniform Fast Fourier Transform (NUFFT). Along the way, we'll dig into the process of optimizing Python code, and see how a relatively straightforward pure Python implementation, with a little help from Numba, can be made to nearly match the performance of a highly-optimized Fortran implementation of the same algorithm.

## Why a Python Implementation?¶

First, I want to answer the inevitable question: why spend the time to make a Python implementation of an algorithm that's already out there in Fortran? The reason is that I've found in my research and teaching that pure-Python implementations of algorithms are far more valuable than C or Fortran implementations, even if they might be a bit slower. This is for a number of reasons:

• Pure-Python code is easier to read, understand, and contribute to. Good Python implementations are much higher-level than C or Fortran, and abstract-away loop indices, bit twiddling, workspace arrays, and other sources of code clutter. A typical student reading good Python code can immediately understand and modify the algorithm, while the same student would be lost trying to understand typical optimized Fortran code.

• Pure-python packages are much easier to install than Python-wrapped C or Fortran code. This is especially true on non-Linux systems. Fortran in particular can require some installation prerequisites that are non-trivial for many users. In practice, I've seen people give up on better tools when there is an installation barrier for those tools.

• Pure-python code often works for many data types. Because of the way it is written, pure Python code is often automatically applicable to single or double precision, and perhaps even to extensions to complex numbers. For compiled packages, supporting and compiling for all possible types can be a burden.

• Pure-python is easier to use at scale. Because it does not require complicated installation, pure Python packages can be much easier to install on cloud VMs and/or shared clusters for computation at scale. If you can easily pip-install a pure-Python package on a VM, then services like AWS and TravisCI are much easier to set up.

Certainly code speed will overcome these considerations if the performance gap is great enough, but I've found that for many applications a pure Python package, cleverly designed and optimized, can be made fast enough that these larger considerations win-out. The challenge is making the Python fast. We'll explore this below.

## Background: The Non-Uniform Fast Fourier Transform¶

The Fast Fourier Transform (FFT) is perhaps the most important and fundamental of modern numerical algorithms. It provides a fast, $O[N\log N]$ method of computing the discrete Fourier transform: $$Y_k^\pm = \sum_{n=0}^{N-1} y_n e^{\pm i k n / N}$$ You can read more about the FFT in my previous post on the subject.

One important limitation of the FFT is that it requires that input data be evenly-spaced: that is, we can think of the values $y_n$ as samples of a function $y_n = y(x_n)$ where $x_n = x_0 + n\Delta x$ is a regular grid of points. But what about when your grid is not uniform? That is, what if you want to compute this result: $$Y_k^\pm = \sum_{j=1}^N y(x_j) e^{\pm i k x_j}$$ where $y(x)$ is evaluated at an arbitrary set of points $x_j$? In this case, the FFT is no longer directly applicable, and you're stuck using a much slower $O[N^2]$ direct summation.

Stuck, that is, until the NUFFT came along.

The NUFFT is an algorithm which converts the non-uniform transform into an approximate uniform transform, not with error-prone interpolation, but instead using a clever "gridding" operation motivated by the convolution theorem. If you'd like to read about the algorithm in detail, the Courant Institute's NUFFT page has a nice set of resources.

Below we'll take a look at implementing this algorithm in Python.

## Direct Non-Uniform Fourier Transform¶

When developing optimized code, it is important to start with something easy to make sure you're on the right track. Here we'll start with a straightforward direct version of the non-uniform Fourier transform. We'll allow non-uniform inputs $x_j$, but compute the output on a grid of $M$ evenly-spaced frequencies in the range $-M/2 \le f/\delta f < M/2$. This is what the NUFFT group calls the Type-1 NUFFT.

First we'll implement nufftfreqs(), which returns the frequency grid for a given $M$, and nudft() which computes the non-uniform discrete Fourier transform using a slow direct method. The arguments for the latter include iflag, which is a positive or negative number indicating the desired sign of the exponent:

In :
from __future__ import print_function, division
import numpy as np

def nufftfreqs(M, df=1):
"""Compute the frequency range used in nufft for M frequency bins"""
return df * np.arange(-(M // 2), M - (M // 2))

def nudft(x, y, M, df=1.0, iflag=1):
"""Non-Uniform Direct Fourier Transform"""
sign = -1 if iflag < 0 else 1
return (1 / len(x)) * np.dot(y, np.exp(sign * 1j * nufftfreqs(M, df) * x[:, np.newaxis]))


Again, I can't emphasize this enough: when writing fast code, start with a slow-and-simple version of the code which you know gives the correct result, and then optimize from there.

## Comparing to the Fortran NUFFT¶

We can double-check that this is producing the desired result by comparing to the Fortran NUFFT implementation, using Python wrappers written by Dan Foreman-Mackey, available at http://github.com/dfm/python-nufft/:

In :
# Install nufft from http://github.com/dfm/python-nufft/
from nufft import nufft1 as nufft_fortran

x = 100 * np.random.random(1000)
y = np.sin(x)

Y1 = nudft(x, y, 1000)
Y2 = nufft_fortran(x, y, 1000)

np.allclose(Y1, Y2)

Out:
True

The results match! A quick check shows that, as we might expect, the Fortran algorithm is orders of magnitude faster:

In :
%timeit nudft(x, y, 1000)
%timeit nufft_fortran(x, y, 1000)

10 loops, best of 3: 47.9 ms per loop
1000 loops, best of 3: 265 µs per loop


On top of this, for $N$ points and $N$ frequencies, the Fortran NUFFT will scale as $O[N\log N]$, while our simple implementation will scale as $O[N^2]$, making the difference even greater as $N$ increases! Let's see if we can do better.

## NUFFT with Python¶

Here we'll attempt a pure-Python version of the fast, FFT-based NUFFT. We'll follow the basics of the algorithm presented on the NUFFT page, using NumPy broadcasting tricks to push loops into the compiled layer of NumPy. For later convenience, we'll start by defining a utility to compute the grid parameters as detailed in the NUFFT paper.

In :
def _compute_grid_params(M, eps):
# Choose Msp & tau from eps following Dutt & Rokhlin (1993)
if eps <= 1E-33 or eps >= 1E-1:
raise ValueError("eps = {0:.0e}; must satisfy "
"1e-33 < eps < 1e-1.".format(eps))
ratio = 2 if eps > 1E-11 else 3
Msp = int(-np.log(eps) / (np.pi * (ratio - 1) / (ratio - 0.5)) + 0.5)
Mr = max(ratio * M, 2 * Msp)
lambda_ = Msp / (ratio * (ratio - 0.5))
tau = np.pi * lambda_ / M ** 2
return Msp, Mr, tau

def nufft_python(x, c, M, df=1.0, eps=1E-15, iflag=1):
"""Fast Non-Uniform Fourier Transform with Python"""
Msp, Mr, tau = _compute_grid_params(M, eps)
N = len(x)

# Construct the convolved grid
ftau = np.zeros(Mr, dtype=c.dtype)
Mr = ftau.shape
hx = 2 * np.pi / Mr
mm = np.arange(-Msp, Msp)
for i in range(N):
xi = (x[i] * df) % (2 * np.pi)
m = 1 + int(xi // hx)
spread = np.exp(-0.25 * (xi - hx * (m + mm)) ** 2 / tau)
ftau[(m + mm) % Mr] += c[i] * spread

# Compute the FFT on the convolved grid
if iflag < 0:
Ftau = (1 / Mr) * np.fft.fft(ftau)
else:
Ftau = np.fft.ifft(ftau)
Ftau = np.concatenate([Ftau[-(M//2):], Ftau[:M//2 + M % 2]])

# Deconvolve the grid using convolution theorem
k = nufftfreqs(M)
return (1 / N) * np.sqrt(np.pi / tau) * np.exp(tau * k ** 2) * Ftau


Let's compare this to the previous results. For convenience, we'll define a single routine which validates the results and times the execution:

In :
from time import time

def test_nufft(nufft_func, M=1000, Mtime=100000):
# Test vs the direct method
print(30 * '-')
name = {'nufft1':'nufft_fortran'}.get(nufft_func.__name__,
nufft_func.__name__)
print("testing {0}".format(name))
rng = np.random.RandomState(0)
x = 100 * rng.rand(M + 1)
y = np.sin(x)
for df in [1, 2.0]:
for iflag in [1, -1]:
F1 = nudft(x, y, M, df=df, iflag=iflag)
F2 = nufft_func(x, y, M, df=df, iflag=iflag)
assert np.allclose(F1, F2)
print("- Results match the DFT")

# Time the nufft function
x = 100 * rng.rand(Mtime)
y = np.sin(x)
times = []
for i in range(5):
t0 = time()
F = nufft_func(x, y, Mtime)
t1 = time()
times.append(t1 - t0)
print("- Execution time (M={0}): {1:.2g} sec".format(Mtime, np.median(times)))

In :
test_nufft(nufft_python)
test_nufft(nufft_fortran)

------------------------------
testing nufft_python
- Results match the DFT
- Execution time (M=100000): 1.6 sec
------------------------------
testing nufft_fortran
- Results match the DFT
- Execution time (M=100000): 0.043 sec


The good news is that our Python implementation works; the bad news is that it remains several orders of magnitude slower than the Fortran result!

Let's make it faster.

### Making Code Faster: Line Profiling¶

We know that our Python function is slow, but we'd like to determine where this speed bottleneck lies. One convenient way to do this is with the line_profiler utility, a Python/IPython addon which can be installed using

## Conclusion¶

I hope you've enjoyed this exploration of how to write fast numerical code in pure Python. As you think about writing efficient implementations of useful algorithms, I invite you to consider the points I listed above: in particular, how difficult will it be for your users to install, read, modify, and contribute to your code? In the long run, this may be much more important than shaving a few milliseconds off the execution time. Writing a fast implementation of a useful algorithm is an excellent and useful pursuit, but we should be careful to not forget the costs that come along with such optimization.

If you're interested in using the pure-Python NUFFT implementation, I've adapted much of the above code in a repository at http://github.com/jakevdp/nufftpy/. It contains a packaged and unit-tested version of some of the above code, as well as a (hopefully) growing compendium of related routines.

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.