To get data for “a day in the life of Metrorail,” I needed a way to filter the GTFS (General Transit Feed Specification) data for a single day.
The large size of WMATA’s GTFS data can make it hard to work with. The largest file is stop_times.txt, at 1,451,295 lines: one for each scheduled stop on each trip, plus the header. My project to filter only Metrorail data reduced that to 80,186 lines (see Filtering Metrorail Data from WMATA GTFS). The size can be reduced further by keeping trips for only a single day.
Each trip in trips.txt points to its calendar information through the service_id field. The GTFS specification recommends keeping the regular service schedule in calendar.txt, but for some reason WMATA uses the alternate method of listing every service date in calendar_dates.txt, a file normally used only for exceptions. This file’s data looks like this:
1,20120804,1
1,20120811,1
1,20120818,1
2,20120226,1
2,20120304,1
The columns are service_id, date, and exception_type. In Metro’s data, exception_type is always 1, meaning “service has been added.” The date range covers the next six months; I assume Metro updates this file daily or weekly. The service_id ranges from 1 to 6. To find trips for a certain date, you have to see which services are listed for that date. For example, 20120227 (Monday, February 27, 2012) is listed twice:
3,20120227,1
5,20120227,1
Thus, I need to filter trips.txt for rows where service_id is 3 or 5 to see all the trips for that day (thereby excluding trips that don’t run on that day).
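The lookup described above can be sketched in a few lines of Python. This is only an illustration, not the PHP program used for this project. It assumes the files carry the standard GTFS header names (service_id, date, exception_type, trip_id), and the route and trip IDs in the sample are made up.

```python
import csv
import io


def service_ids_for_date(calendar_dates_file, date):
    """Return the service_ids added (exception_type 1) on a YYYYMMDD date."""
    reader = csv.DictReader(calendar_dates_file)
    return {
        row["service_id"]
        for row in reader
        if row["date"] == date and row["exception_type"] == "1"
    }


def trips_for_services(trips_file, service_ids):
    """Return the rows of trips.txt whose service_id is in service_ids."""
    reader = csv.DictReader(trips_file)
    return [row for row in reader if row["service_id"] in service_ids]


# Tiny in-memory samples standing in for the real files.
# The route_id and trip_id values are hypothetical.
calendar_dates = io.StringIO(
    "service_id,date,exception_type\n"
    "2,20120226,1\n"
    "3,20120227,1\n"
    "5,20120227,1\n"
)
trips = io.StringIO(
    "route_id,service_id,trip_id\n"
    "RED,3,t1\n"
    "RED,2,t2\n"
    "BLUE,5,t3\n"
)

monday_services = service_ids_for_date(calendar_dates, "20120227")
monday_trips = trips_for_services(trips, monday_services)
print(sorted(monday_services))
print([trip["trip_id"] for trip in monday_trips])
```

With real data, the two StringIO objects would be replaced by open file handles for calendar_dates.txt and trips.txt.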
But what do the numbers mean? I counted how often each ID is used in 3 files (railtrips is my filtered version of trips, showing only Metrorail trips).
[Table: occurrences of each service_id, by file]
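Counting how often each service_id appears in a file is a short tally. The sketch below is one possible way to do it in Python, assuming the file has a service_id column in its header. The sample rows are the calendar_dates.txt lines quoted earlier, not the full file.

```python
import csv
import io
from collections import Counter


def count_service_ids(gtfs_file):
    """Tally how many rows of a GTFS file use each service_id."""
    return Counter(row["service_id"] for row in csv.DictReader(gtfs_file))


# The same function works on any file with a service_id column,
# such as calendar_dates.txt or trips.txt.
sample = io.StringIO(
    "service_id,date,exception_type\n"
    "1,20120804,1\n"
    "1,20120811,1\n"
    "1,20120818,1\n"
    "2,20120226,1\n"
    "2,20120304,1\n"
)

counts = count_service_ids(sample)
for service_id, n in sorted(counts.items()):
    print(service_id, n)
```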
I had to look at the pattern of dates to see what the codes are used for. Here’s what they seem to mean:
1: Saturdays (through 8/18)
2: Sundays (through 8/19)
3: Weekdays (M T W T F) (through 8/21)
4: Fridays (through 8/17)
5: Mondays – Thursdays (through 8/21)
6: Weekdays (M T W T F) (6/26 through 8/21)
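Rather than reading the date pattern by eye, the days of the week and date range for each service_id can be derived from calendar_dates.txt. The following Python sketch shows one way to do that, again assuming standard GTFS header names. It uses only the five sample rows quoted earlier, so it covers only services 1 and 2.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def summarize_services(calendar_dates_file):
    """Map each service_id to (days of week used, first date, last date)."""
    dates_by_service = defaultdict(list)
    for row in csv.DictReader(calendar_dates_file):
        day = datetime.strptime(row["date"], "%Y%m%d").date()
        dates_by_service[row["service_id"]].append(day)

    summary = {}
    for service_id, dates in dates_by_service.items():
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        weekdays = sorted({d.weekday() for d in dates})
        summary[service_id] = (
            [DAY_NAMES[w] for w in weekdays],
            min(dates),
            max(dates),
        )
    return summary


sample = io.StringIO(
    "service_id,date,exception_type\n"
    "1,20120804,1\n"
    "1,20120811,1\n"
    "1,20120818,1\n"
    "2,20120226,1\n"
    "2,20120304,1\n"
)

summary = summarize_services(sample)
for service_id, (days, first, last) in sorted(summary.items()):
    print(service_id, days, first.isoformat(), last.isoformat())
```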
The 11 trips with service_id 6, scheduled to begin June 26, are on the Red Line, from Glenmont to Grosvenor. Oddly, there are no additional return trips, but I assume the GTFS data will change once again in the coming months.
I used a PHP program to further filter my Metrorail GTFS data, creating a new version of stop_times.txt that contains trip data for a “typical Monday,” that is, any Monday between now and June 25, 2012:
- railmondaystop_times.txt 1.5MB (railstop_times.txt was 3.5MB; stop_times.txt was 62.4MB)
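The PHP program itself is not shown here. As a rough Python equivalent of the filtering step, the sketch below copies only the stop_times rows whose trip_id belongs to the chosen day's trips. It reads one row at a time, so the large input file never has to fit in memory. The function name, trip IDs, and stop IDs are hypothetical.

```python
import csv
import io


def filter_stop_times(stop_times_in, stop_times_out, trip_ids):
    """Copy rows whose trip_id is in trip_ids; return the number of rows kept."""
    reader = csv.DictReader(stop_times_in)
    writer = csv.DictWriter(
        stop_times_out, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:  # streams the input instead of loading it all
        if row["trip_id"] in trip_ids:
            writer.writerow(row)
            kept += 1
    return kept


# Hypothetical sample rows standing in for railstop_times.txt.
source = io.StringIO(
    "trip_id,arrival_time,departure_time,stop_id,stop_sequence\n"
    "t1,06:00:00,06:00:30,A,1\n"
    "t2,06:05:00,06:05:30,A,1\n"
    "t3,06:10:00,06:10:30,B,1\n"
)
out = io.StringIO()

# trip_ids would come from trips.txt filtered to service_id 3 or 5.
kept = filter_stop_times(source, out, {"t1", "t3"})
print(kept)
print(out.getvalue())
```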
This gives me a data set that can be used to model a typical day in the life of Metro in the Washington, DC region. The chart below is an example of data culled from this set, showing weekday (Monday – Thursday, technically) arrivals at L’Enfant Plaza, by hour.
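An hourly arrival count like this can be tallied from the filtered stop_times file. The Python sketch below is an illustration under several assumptions: the stop_id "LENFANT" is a placeholder rather than WMATA's real identifier, a station may be represented by more than one stop_id (hence the set), and the sample rows are invented. GTFS allows times past 24:00:00 for trips that run after midnight on the same service day, so the hour is kept as written.

```python
import csv
import io
from collections import Counter


def arrivals_by_hour(stop_times_file, stop_ids):
    """Count arrivals per hour at any of the given stop_ids."""
    counts = Counter()
    for row in csv.DictReader(stop_times_file):
        if row["stop_id"] in stop_ids:
            # arrival_time is HH:MM:SS; HH may be 24 or more after midnight.
            hour = int(row["arrival_time"].strip().split(":")[0])
            counts[hour] += 1
    return counts


# Invented rows; "LENFANT" is a hypothetical stop_id.
sample = io.StringIO(
    "trip_id,arrival_time,departure_time,stop_id,stop_sequence\n"
    "t1,07:58:00,07:58:30,LENFANT,5\n"
    "t2,08:03:00,08:03:30,LENFANT,5\n"
    "t3,08:15:00,08:15:30,OTHER,2\n"
    "t4,24:10:00,24:10:30,LENFANT,9\n"
)

hourly = arrivals_by_hour(sample, {"LENFANT"})
for hour in sorted(hourly):
    print(hour, hourly[hour])
```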