Memoryview Benchmarks

There was recently a thread on cython-users which caught my eye. It has to do with memoryviews, a new way of working with memory buffers in cython.

I've been thinking recently about how to do fast and flexible memory buffer access in cython. I contributed the BallTree implementation for nearest neighbors searching in scikit-learn, and have been actively thinking about how to make it faster and more flexible, including adding the ability to specify distance metrics other than euclidean and minkowski.

In order to accomplish this, I'd like to have a set of distance metric functions which take two vectors and compute a distance. There would be many functions with similar call signatures which could then be plugged into code that iterates over a set of vectors and computes the appropriate distances.
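
For concreteness, a second metric with the same call signature might look something like the following sketch (hypothetical; only the euclidean metric is actually implemented and benchmarked below):

# hypothetical sketch -- a second metric with the same call signature
import numpy as np

def manhattan_distance(x1, x2):
    # same interface as euclidean_distance below: two vectors in, one float out
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    return np.sum(np.abs(x1 - x2))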

Pure Python Version

In pure python, the implementation described above might look something like this:

# memview_bench_v1.py
import numpy as np

def euclidean_distance(x1, x2):
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    return np.sqrt(np.sum((x1 - x2) ** 2))
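
As a quick sanity check, the function gives the expected answer for a 3-4-5 right triangle:

# quick sanity check (run after importing euclidean_distance from the module above)
print(euclidean_distance([0, 0], [3, 4]))   # prints 5.0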

This looks promising. Let's create a function based on this which will compute the pairwise distance between all points in a matrix (this is similar to pairwise_distances in scikit-learn or pdist in scipy). The simple form of the function might look like this:

# memview_bench_v1 (continued)

def pairwise(X, metric=euclidean_distance):
    X = np.asarray(X)

    n_samples, n_dim = X.shape

    D = np.empty((n_samples, n_samples))

    for i in range(n_samples):
        for j in range(n_samples):
            D[i, j] = metric(X[i], X[j])

    return D

We could exploit symmetry to reduce the number of computations required, but we'll skip that step for now: this simple version of the function will give us a good benchmark for comparison with alternatives below. Using the timeit magic in ipython, we can learn how fast this implementation is:

In [1]: import numpy as np

In [2]: from memview_bench_v1 import pairwise

In [3]: X = np.random.random((500, 3))

In [4]: timeit pairwise(X)
1 loops, best of 3: 6.51 s per loop

It takes nearly seven seconds to compute 250,000 distances. This is much too slow.
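
As an aside, the symmetry-exploiting variant mentioned above might look something like this sketch (it roughly halves the number of metric calls, but it isn't used in any of the benchmarks below):

# hypothetical sketch -- exploiting the symmetry d(x, y) == d(y, x)
def pairwise_symmetric(X, metric=euclidean_distance):
    X = np.asarray(X)
    n_samples, n_dim = X.shape
    D = np.zeros((n_samples, n_samples))    # the diagonal stays zero: d(x, x) == 0
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            D[i, j] = metric(X[i], X[j])
            D[j, i] = D[i, j]               # fill the mirror entry for free
    return D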

Cython Speedup

Perhaps we can speed this up using cython type declarations. Before typed memoryviews were added in cython 0.16, the way to quickly index numpy arrays in cython was through the numpy-specific buffer syntax, adding type information to each array that specifies its data type, its dimensionality, and its memory order:

# memview_bench_v2.pyx
import numpy as np

cimport numpy as np
from libc.math cimport sqrt
cimport cython

# define a function pointer to a metric
ctypedef double (*metric_ptr)(np.ndarray, np.ndarray)

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double euclidean_distance(np.ndarray[double, ndim=1, mode='c'] x1,
                               np.ndarray[double, ndim=1, mode='c'] x2):
    cdef double tmp, d
    cdef np.intp_t i, N

    d = 0
    N = x1.shape[0]
    # assume x2 has the same shape as x1.  This could be dangerous!

    for i in range(N):
        tmp = x1[i] - x2[i]
        d += tmp * tmp

    return sqrt(d)


@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise(np.ndarray[double, ndim=2, mode='c'] X not None,
             metric = 'euclidean'):
    cdef metric_ptr dist_func
    if metric == 'euclidean':
        dist_func = &euclidean_distance
    else:
        raise ValueError("unrecognized metric")

    cdef np.intp_t i, j, n_samples
    n_samples = X.shape[0]

    cdef np.ndarray[double, ndim=2, mode='c'] D = np.empty((n_samples,
                                                            n_samples))
    for i in range(n_samples):
        for j in range(n_samples):
            D[i, j] = dist_func(X[i], X[j])

    return D

Notice that we're essentially running the same code, except we've added type declarations to speed up the function calls and loops. The mode='c' argument in the np.ndarray type declaration says that the array is contiguous in memory and C-ordered.
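
Once the module is compiled (using the setup.py below), the effect of this declaration is easy to see: passing an array that is not C-contiguous should fail with a ValueError when cython tries to acquire the buffer. A sketch of what that looks like:

# sketch: the mode='c' declaration is enforced when the buffer is acquired
import numpy as np
from memview_bench_v2 import pairwise

X = np.random.random((500, 3))
pairwise(X)                          # fine: a freshly-created array is C-contiguous

try:
    pairwise(np.asfortranarray(X))   # a Fortran-ordered copy is not C-contiguous...
except ValueError as err:
    print(err)                       # ...so the buffer acquisition should fail

pairwise(np.ascontiguousarray(np.asfortranarray(X)))   # an explicit C-ordered copy works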

For reference, this can be compiled in-place by running python setup.py build_ext --inplace with the following setup.py file:

# setup.py

import os
import numpy

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

module = 'memview_bench_v2'

setup(cmdclass = {'build_ext': build_ext},
      name=module,
      version='1.0',
      ext_modules=[Extension(module,
                             [module + ".pyx"])],
      include_dirs=[numpy.get_include(),
                    os.path.join(numpy.get_include(), 'numpy')]
      )

We'll time the resulting function on the same sized array as we did previously:

In [1]: import numpy as np

In [2]: from memview_bench_v2 import pairwise

In [3]: X = np.random.random((500, 3))

In [4]: timeit pairwise(X)
1 loops, best of 3: 668 ms per loop

That's a factor of 10 speedup over the pure python version! It turns out, though, that we can do better. In particular, the slicing operation when we call X[i] and X[j] must generate a new numpy array each time, which leads to a lot of python overhead in reference counting, etc. This is the reason that the cython team introduced typed memoryviews in cython v0.16.
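
One way to convince yourself of this slicing overhead: every indexing operation like X[i] builds a brand-new ndarray object (a view on the same underlying data), so each call pays the python object-creation and reference-counting cost:

In [5]: X[0] is X[0]      # each slice is a freshly-created view object
Out[5]: False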

Typed Memoryviews

The equivalent of the above code using typed memoryviews looks like this:

# memview_bench_v3.pyx
import numpy as np

cimport numpy as np
from libc.math cimport sqrt
cimport cython

# define a function pointer to a metric
ctypedef double (*metric_ptr)(double[::1], double[::1])

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double euclidean_distance(double[::1] x1,
                               double[::1] x2):
    cdef double tmp, d
    cdef np.intp_t i, N

    d = 0
    N = x1.shape[0]
    # assume x2 has the same shape as x1.  This could be dangerous!

    for i in range(N):
        tmp = x1[i] - x2[i]
        d += tmp * tmp

    return sqrt(d)


@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise(double[:, ::1] X not None,
             metric = 'euclidean'):
    cdef metric_ptr dist_func
    if metric == 'euclidean':
        dist_func = &euclidean_distance
    else:
        raise ValueError("unrecognized metric")

    cdef np.intp_t i, j, n_samples
    n_samples = X.shape[0]

    cdef double[:, ::1] D = np.empty((n_samples, n_samples))

    for i in range(n_samples):
        for j in range(n_samples):
            D[i, j] = dist_func(X[i], X[j])

    return D

The only change is that instead of using the np.ndarray[...] type specifier, we use the typed memoryview specifier double[:, ::1]. The two slots declare a two-dimensional buffer, and the ::1 in the last position indicates that the final axis is contiguous, so the array as a whole is contiguous and C-ordered.
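
For reference, a few other typed memoryview declarations (a sketch; none of these are used in the benchmark) look like this:

cdef double[:, :] A        # 2D buffer with arbitrary strides (no contiguity assumed)
cdef double[::1, :] B      # 2D buffer, Fortran-contiguous (first axis contiguous)
cdef double[::1] v         # 1D contiguous buffer

We time the results and see the following: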

In [1]: import numpy as np

In [2]: from memview_bench_v3 import pairwise

In [3]: X = np.random.random((500, 3))

In [4]: timeit pairwise(X)
10 loops, best of 3: 22 ms per loop

This gives another factor of 30 improvement over the previous version, simply by switching to typed memoryviews rather than the numpy interface. Still, our function is creating memoryview objects each time we slice the array. We can determine how much overhead this is generating by using raw C pointers instead. It is not as clean, but it should be very fast:

Raw Pointers

The fundamental benchmark for this sort of operation should be working directly with the pointers themselves. While this is not a very "pythonic" way of doing things, it does lead to very fast code, as we will see:

# memview_bench_v4.pyx
import numpy as np

cimport numpy as np
from libc.math cimport sqrt
cimport cython

# define a function pointer to a metric
ctypedef double (*metric_ptr)(double*, double*, int)

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double euclidean_distance(double* x1,
                               double* x2,
                               int N):
    cdef double tmp, d
    cdef np.intp_t i

    d = 0

    for i in range(N):
        tmp = x1[i] - x2[i]
        d += tmp * tmp

    return sqrt(d)


@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise(double[:, ::1] X not None,
             metric = 'euclidean'):
    cdef metric_ptr dist_func
    if metric == 'euclidean':
        dist_func = &euclidean_distance
    else:
        raise ValueError("unrecognized metric")

    cdef np.intp_t i, j, n_samples, n_dim
    n_samples = X.shape[0]
    n_dim = X.shape[1]

    cdef double[:, ::1] D = np.empty((n_samples, n_samples))

    cdef double* Dptr = &D[0, 0]
    cdef double* Xptr = &X[0, 0]

    for i in range(n_samples):
        for j in range(n_samples):
            Dptr[i * n_samples + j] = dist_func(Xptr + i * n_dim,
                                                Xptr + j * n_dim,
                                                n_dim)

    return D

Instead of passing around slices of arrays, we've accessed the raw memory buffer using C pointer syntax. This is not as easy to read, and can lead to glibc errors or segmentation faults if we're not careful. Testing this implementation, we find that it is extremely fast:

In [1]: import numpy as np

In [2]: from memview_bench_v4 import pairwise

In [3]: X = np.random.random((500, 3))

In [4]: timeit pairwise(X)
100 loops, best of 3: 2.47 ms per loop

This is nearly another factor of 10 faster than the memoryview benchmark above! Essentially, what this is telling us is that creating a memoryview slice takes about 0.02 / 500,000 = 40 nanoseconds on our machine: the roughly 20 millisecond gap between the two benchmarks, divided by the 500,000 slices we create (two for each of the 250,000 pairs). This is extremely fast, but because we're performing this operation half a million times, the cost of those allocations is significant compared to the rest of our computation. If our vectors were, say, length 1000, this cost may not be a significant percentage of the total cost.

So what are we left with? Do we need to use raw pointers in all circumstances when working with collections of small vectors? Perhaps not.

A Faster Implementation with Memoryviews

The creation of memoryview slices, though extremely fast, is causing a problem simply because we're creating so many of them. Here is an alternative which uses no raw pointers, yet matches their speed:

# memview_bench_v5.pyx
import numpy as np

cimport numpy as np
from libc.math cimport sqrt
cimport cython

# define a function pointer to a metric
ctypedef double (*metric_ptr)(double[:, ::1], np.intp_t, np.intp_t)

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double euclidean_distance(double[:, ::1] X,
                               np.intp_t i1, np.intp_t i2):
    cdef double tmp, d
    cdef np.intp_t j

    d = 0

    for j in range(X.shape[1]):
        tmp = X[i1, j] - X[i2, j]
        d += tmp * tmp

    return sqrt(d)


@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise(double[:, ::1] X not None,
             metric = 'euclidean'):
    cdef metric_ptr dist_func
    if metric == 'euclidean':
        dist_func = &euclidean_distance
    else:
        raise ValueError("unrecognized metric")

    cdef np.intp_t i, j, n_samples, n_dim
    n_samples = X.shape[0]
    n_dim = X.shape[1]

    cdef double[:, ::1] D = np.empty((n_samples, n_samples))

    for i in range(n_samples):
        for j in range(n_samples):
            D[i, j] = dist_func(X, i, j)

    return D

Timing this implementation we find the following:

In [1]: import numpy as np

In [2]: from memview_bench_v5 import pairwise

In [3]: X = np.random.random((500, 3))

In [4]: timeit pairwise(X)
100 loops, best of 3: 2.45 ms per loop

Just as fast as using raw pointers, but much cleaner and easier to read.

Summary

Here are the timing results we've seen above:

  • Python + numpy: 6510 ms
  • Cython + numpy: 668 ms
  • Cython + memviews (slicing): 22 ms
  • Cython + raw pointers: 2.47 ms
  • Cython + memviews (no slicing): 2.45 ms

So what have we learned here? First of all, typed memoryviews are fast. Blazing fast. If used correctly, they can be comparable to raw pointers, but are much cleaner and easier to debug. For example, in the last version, if we ran into a memory error we could simply turn on bounds-checking and quickly find the source of the problem. Slicing with memoryviews is also fast, but should be used carefully if your operation time on each slice is comparable to the cost of building the slice.
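
For instance, re-enabling the checks is just a matter of removing (or flipping) the decorators on the suspect function; a minimal, hypothetical sketch:

# debug_check.pyx -- hypothetical sketch of turning the checks back on
cimport cython

@cython.boundscheck(True)    # out-of-bounds access raises IndexError instead of touching bad memory
@cython.wraparound(True)     # negative indices behave like numpy's (and are checked)
def unsafe_read(double[::1] x, int i):
    return x[i]              # e.g. x[x.shape[0]] now raises IndexError rather than segfaulting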

The moral of the story? Use typed memoryviews. It will lead to fast cython code which is cleaner, more readable, and more easily debuggable than any other alternative.

Update: All of the above scripts are now available as an ipython notebook: memview_bench.ipynb. For information on how to view this file, see the IPython page. Alternatively, you can view this notebook (but not modify it) using the nbviewer here.

Thanks to Dave for the tip.
