Pythonic Perambulations

Musings and ramblings through the world of Python and beyond

Learning Seattle's Work Habits from Bicycle Counts (Updated!)

Update, July 25, 2015: I included some new plots suggested by my colleague Ariel Rokem. Scroll to the end!

Last year I wrote a blog post examining trends in Seattle bicycling and how they relate to weather, daylight, day of the week, and other factors.

Here I want to revisit the same data from a different perspective: rather than making assumptions in order to build models that might describe the data, I'll instead wipe the slate clean and ask what information we can extract from the data themselves, without reliance on any model assumptions. In other words, where the previous post examined the data using a supervised machine learning approach for data modeling, this post will examine the data using an unsupervised learning approach for data exploration.

Along the way, we'll see some examples of importing, transforming, visualizing, and analyzing data in the Python language, using mostly Pandas, Matplotlib, and Scikit-learn. We will also see some real-world examples of the use of unsupervised machine learning algorithms, such as Principal Component Analysis and Gaussian Mixture Models, in exploring and extracting meaning from data.

To spoil the punchline (and perhaps whet your appetite) what we will find is that from analysis of bicycle counts alone, we can make some definite statements about the aggregate work habits of Seattleites who commute by bicycle.

The Data

The data we will use here are the hourly bicycle counts on Seattle's Fremont Bridge. These data come from an automated bicycle counter, installed in late 2012, which has inductive sensors under the sidewalks on either side of the bridge. The daily or hourly bicycle counts can be downloaded from http://data.seattle.gov/; here is the direct link to the hourly dataset. To download the data directly, you can uncomment the following curl command:

In [1]:
# !curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD

Once this is downloaded, we can use the pandas library to load the data:

In [2]:
import pandas as pd
data = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
data.head()
Out[2]:
Fremont Bridge West Sidewalk Fremont Bridge East Sidewalk
Date
2012-10-03 00:00:00 4 9
2012-10-03 01:00:00 4 6
2012-10-03 02:00:00 1 1
2012-10-03 03:00:00 2 3
2012-10-03 04:00:00 6 1

We'll do some quick data cleaning: we'll rename the columns to the shorter "West" and "East", set any missing values to zero, and add a "Total" column:

In [3]:
data.columns = ['West', 'East']
data.fillna(0, inplace=True)
data['Total'] = data.eval('East + West')

We can get a better idea of the dataset as a whole through a simple visualization; for example, we can resample the data to see the weekly trend in trips over the nearly three-year period:

In [4]:
# first some standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # plot styling
import numpy as np

data.resample('W', how='sum').plot()
plt.ylabel('weekly trips');

The counts show both a strong seasonal variation, as well as a local structure that can be partially accounted for by temperature, time of year, precipitation, and other factors.

Extracting Knowledge from the Data

From here, we could do a variety of other visualizations based on our intuition about what might affect bicycle counts. For example, we could look at the effect of the days of the week, the effect of the weather, and other factors that I explored previously. But we could also proceed by letting the dataset speak for itself, and use unsupervised machine learning techniques (that is, machine learning without reference to data labels) to learn what the data have to tell us.

We will consider each day in the dataset as its own separate entity (or sample, in usual machine learning parlance). For each day, we have 48 observations: two observations (east and west sidewalk sensors) for each of the 24 hour-long periods. By examining the days in light of these observations and doing some careful analysis, we should be able to extract meaningful quantitative statements from the data themselves, without the need to lean on any other assumptions.

Transforming the Data

The first step in this approach is to transform our data; essentially we will want a two-dimensional matrix, where each row of the matrix corresponds to a day, and each column of the matrix corresponds to one of the 48 observations. We can arrange the data this way using the pivot_table() function in Pandas. We want the "East" and "West" column values, indexed by date, and separated by hour of the day. Any missing values we will fill with zero:

In [5]:
pivoted = data.pivot_table(['East', 'West'],
                           index=data.index.date,
                           columns=data.index.hour,
                           fill_value=0)
pivoted.head()
Out[5]:
East ... West
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
2012-10-03 9 6 1 3 1 10 50 95 146 104 ... 77 72 133 192 122 59 29 25 24 5
2012-10-04 11 0 6 3 1 11 51 89 134 94 ... 63 73 114 154 137 57 27 31 25 11
2012-10-05 7 4 3 2 2 7 37 101 119 81 ... 63 80 120 144 107 42 27 11 10 16
2012-10-06 7 5 2 2 1 2 15 16 47 55 ... 89 115 107 107 41 40 25 18 14 15
2012-10-07 5 5 1 2 2 3 8 12 26 36 ... 126 122 132 118 68 26 19 12 9 5

5 rows × 48 columns

Next we extract the raw values and put them in a matrix:

In [6]:
X = pivoted.values
X.shape
Out[6]:
(1001, 48)

Our data consists of just over 1000 days, each with the aforementioned 48 measurements.

Visualizing the Data

We can think of this data now as representing 1001 distinct objects which live in a 48-dimensional space: the value of each dimension is the number of bicycle trips measured on a particular side of the bridge at a particular hour. Visualizing 48-dimensional data is quite difficult, so instead we will use a standard dimensionality reduction technique to project this to a more manageable size.

The technique we'll use is Principal Component Analysis (PCA), a fast linear projection which rotates the data such that the projection preserves the maximum variance. We can ask for components preserving 90% of the variance as follows:

In [7]:
from sklearn.decomposition import PCA
Xpca = PCA(0.9).fit_transform(X)
Xpca.shape
Out[7]:
(1001, 2)

The output has two dimensions, which means that these two projected components describe at least 90% of the total variance in the dataset. While 48-dimensional data is difficult to plot, we certainly know how to plot two-dimensional data: we'll do a simple scatter plot, and for reference we'll color each point according to the total number of trips taken that day:

In [8]:
total_trips = X.sum(1)
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=total_trips,
            cmap='cubehelix')
plt.colorbar(label='total trips');

We see that the days lie in two quite distinct groups, and that the total number of trips increases along the length of each projected cluster. Further, the two groups begin to be less distinguishable when the number of trips during the day is very small.

I find this extremely interesting: from the raw data, we can determine that there are basically two primary types of days for Seattle bicyclists. Let's model these clusters and try to figure out what these types-of-day are.

Unsupervised Clustering

When you have groups of data you'd like to automatically separate, but no previously-determined labels for the groups, the type of algorithm you are looking at is a clustering algorithm. There are a number of clustering algorithms out there, but for nicely-defined oval-shaped blobs like we see above, Gaussian Mixture Models are a very good choice. We can compute the Gaussian Mixture Model of the data using, again, scikit-learn, and quickly plot the predicted labels for the points:

In [9]:
from sklearn.mixture import GMM
gmm = GMM(2, covariance_type='full', random_state=0)
gmm.fit(Xpca)
cluster_label = gmm.predict(Xpca)
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=cluster_label);

This clustering seems to have done the job, and separated the two groups we are interested in. Let's join these inferred cluster labels to the initial dataset:

In [10]:
pivoted['Cluster'] = cluster_label
data = data.join(pivoted['Cluster'], on=data.index.date)
data.head()
Out[10]:
West East Total Cluster
Date
2012-10-03 00:00:00 4 9 13 0
2012-10-03 01:00:00 4 6 10 0
2012-10-03 02:00:00 1 1 2 0
2012-10-03 03:00:00 2 3 5 0
2012-10-03 04:00:00 6 1 7 0

Now we can find the average trend by cluster and time using a GroupBy within this updated dataset

In [11]:
by_hour = data.groupby(['Cluster', data.index.time]).mean()
by_hour.head()
Out[11]:
West East Total
Cluster
0 00:00:00 5.312139 6.213873 11.526012
01:00:00 2.713873 2.969653 5.683526
02:00:00 2.294798 1.732659 4.027457
03:00:00 1.570809 1.426301 2.997110
04:00:00 4.179191 2.650289 6.829480

Finally, we can plot the average hourly trend among the days within each cluster:

In [12]:
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
hourly_ticks = 4 * 60 * 60 * np.arange(6)

for i in range(2):
    by_hour.ix[i].plot(ax=ax[i], xticks=hourly_ticks)
    ax[i].set_title('Cluster {0}'.format(i))
    ax[i].set_ylabel('average hourly trips')

These plots give us some insight into the interpretation of the two clusters: the first cluster shows a sharp bimodal traffic pattern, while the second shows a wide unimodal pattern.

In the bimodal cluster, we see a peak around 8:00am which is dominated by cyclists on the west sidewalk, and another peak around 5:00pm which is dominated by cyclists on the east sidewalk. This is very clearly a commute pattern, with the majority of cyclists riding toward downtown Seattle in the morning, and away from downtown Seattle in the evening.

In the unimodal cluster, we see fairly steady traffic in each direction beginning early in the morning and going until late at night, with a peak around 2:00 in the afternoon. This is very clearly a recreational pattern of use, with people out riding through the entire day.

I find this is fascinating: from simple unsupervised dimensionality reduction and clustering, we've discovered two distinct classes of days in the data, and found that these classes have very intuitive explanations.

Seattle's Work Habits

Let's go one step deeper and figure out what we can learn about people (well, bicycle commuters) in Seattle from just this hourly commute data. As a rough approximation, you might guess that these two classes of data might be largely reflective of workdays in the first cluster, and non-work days in the second. We can check this intuition by re-plotting our projected data, except labeling them by day of the week:

In [13]:
dayofweek = pd.to_datetime(pivoted.index).dayofweek
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=dayofweek,
            cmap=plt.cm.get_cmap('jet', 7))
cb = plt.colorbar(ticks=range(7))
cb.set_ticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.clim(-0.5, 6.5);

We see that the weekday/weekend intuition holds, but only to a point: in particular, it is clear that there are a handful of weekdays which follow the typical weekend pattern! Further, it's interesting to note that Fridays tend to be pulled closer to weekend days in this plot, though as a whole they still fall solidly in the work-day cluster.

Let's take a closer look at the "special" weekdays that fall in the "wrong" cluster. We start by constructing a dataset listing the cluster id and the day of the week for each of the dates in our dataset:

In [14]:
results = pd.DataFrame({'cluster': cluster_label,
                        'is_weekend': (dayofweek > 4),
                        'weekday': pivoted.index.map(lambda x: x.strftime('%a'))},
                       index=pivoted.index)
results.head()
Out[14]:
cluster is_weekend weekday
2012-10-03 0 False Wed
2012-10-04 0 False Thu
2012-10-05 0 False Fri
2012-10-06 1 True Sat
2012-10-07 1 True Sun

First, let's see how many weekend days fall in the first, commute-oriented cluster

In [15]:
weekend_workdays = results.query('cluster == 0 and is_weekend')
len(weekend_workdays)
Out[15]:
0

zero! Apparently, there is not a single weekend during the year where Seattle cyclists as a whole decide to go to work.

Similarly, we can see how many weekdays fall in the second, recreation-oriented cluster:

In [16]:
midweek_holidays = results.query('cluster == 1 and not is_weekend')
len(midweek_holidays)
Out[16]:
23

There were 23 weekdays over the past several years in which Seattle cyclists as a whole did not go to work. To label these, let's load the US Federal holiday calendar available in Pandas:

In [17]:
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays('2012', '2016', return_name=True)
holidays.head()
Out[17]:
2012-01-02                 New Years Day
2012-01-16    Dr. Martin Luther King Jr.
2012-02-20                Presidents Day
2012-05-28                   MemorialDay
2012-07-04                      July 4th
dtype: object

Just for completeness, we will add to the list the day before and day after each of these holidays:

In [18]:
holidays_all = pd.concat([holidays,
                         "Day Before " + holidays.shift(-1, 'D'),
                         "Day After " + holidays.shift(1, 'D')])
holidays_all = holidays_all.sort_index()
holidays_all.head()
Out[18]:
2012-01-01                 Day Before New Years Day
2012-01-02                            New Years Day
2012-01-03                  Day After New Years Day
2012-01-15    Day Before Dr. Martin Luther King Jr.
2012-01-16               Dr. Martin Luther King Jr.
dtype: object

Note that these are observed holidays, which is why New Years Day 2012 falls on January 2nd. With this ready to go, we can compute the complete list of non-weekend days on which Seattle bicycle commuters as a whole chose to stay home from work:

In [19]:
holidays_all.name = 'name'  # required for join
joined = midweek_holidays.join(holidays_all)
set(joined['name'])
Out[19]:
{'Christmas',
 'Day After Christmas',
 'Day After Thanksgiving',
 'Day Before Christmas',
 'July 4th',
 'Labor Day',
 'MemorialDay',
 'New Years Day',
 'Thanksgiving'}

On the other side of things, here are the Federally recognized holidays where Seattle bicycle commuters chose to go to work anyway:

In [20]:
set(holidays) - set(joined.name)
Out[20]:
{'Columbus Day',
 'Dr. Martin Luther King Jr.',
 'Presidents Day',
 'Veterans Day'}

Update: What's up with Fridays?

A colleague of mine, Ariel Rokem, saw the first version of this post and noticed something interesting. For the most part, Fridays tend to lie on the upper side of the weekday cluster, closer in this parameter space to the typical weekend pattern. This pattern holds nearly universally for Fridays, all except for three strange outliers which lie far on the other side of the cluster.

We can see these more clearly if we highlight the Friday points in the plot:

In [21]:
fridays = (dayofweek == 4)
plt.scatter(Xpca[:, 0], Xpca[:, 1], c='gray', alpha=0.2)
plt.scatter(Xpca[fridays, 0], Xpca[fridays, 1], c='yellow');

The yellow points in the bottom-left of the plot are unique – they're far different than other Fridays, and they even stand-out in comparison to the other work days! Let's see what they represent:

In [22]:
weird_fridays = pivoted.index[fridays & (Xpca[:, 0] < -600)]
weird_fridays
Out[22]:
Index([2013-05-17, 2014-05-16, 2015-05-15], dtype='object')

All three of these outlying Fridays fall in the middle of May. Curious!

Let's quickly visualize the daily stats for these, along with the mean trend over all days. We can arrange the data this way with a pivot table operation:

In [23]:
all_days = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
all_days.loc[:, weird_fridays].plot();
all_days.mean(1).plot(color='gray', lw=5, alpha=0.3,
                      xticks=hourly_ticks);

Apparently these three strange Fridays are days with extreme amounts of bicycle commuting. But what makes them so special?

After some poking-around on the internet, the answer becomes clear: we've discovered Seattle's annual bike to work day. Mystery solved!

Summary

We have seen here that by taking a close look at raw bicycle counts and using some basic visualization and unsupervised machine learning, we can make some very definite statements about the overall work habits of people in Seattle who bicycle to work across the Fremont bridge. In summary, this is what we have learned:

  • Seattle cyclists, as a whole, tend to take off work New Year's Day, Memorial Day, Independence Day, Labor Day, and the days surrounding Thanksgiving and Christmas.
  • Seattle cyclists, as a whole, tend to head to the office on the more minor US holidays: Columbus Day, Martin Luther King Jr. Day, Presidents Day, and Veterans Day.
  • Seattle cyclists, as a whole, would never, ever, be caught at work on a weekend.

Thanks for reading!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

Comments