WMATA’s Google Transit Feed Specification (GTFS) data includes Metrobus, Metrorail, and the DC Circulator. If you want to analyze only Metrorail data, the task is daunting, because the sheer volume of bus data overwhelms the files. So, I wanted to create smaller versions of the relevant files just for the rail system.
WMATA released its GTFS data in March 2009, and it is available on the agency’s Developer Resources page. Once you sign the license agreement, you can immediately download a 16MB zip file containing 9 text files in “comma-separated values” (CSV) format.
My goal was to get stop information (time and place) for each trip on Metrorail. The time information is kept in the GTFS file stop_times.txt, while location information is kept in stops.txt. They are linked by the stop_id field. But neither file has data that tells you if the stop is for a bus or a subway train. That information is kept in routes.txt, which doesn’t directly connect to stops.txt or stop_times.txt. Instead, trips.txt links the routes together with the stops. Phew!
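To make the chain of links concrete, here is a minimal Python sketch that follows one stop_times row through trips to routes, and through stop_id to stops. Route 261 and stop 1305 appear in the excerpts later in this post; the trips entry tying trip 24703 to route 261 is a made-up illustration, and the dictionaries are only stand-ins for the real files.

```python
# Tiny in-memory stand-ins for the four GTFS files, keyed by their linking fields.
# The trips entry (trip 24703 -> route 261) is hypothetical, for illustration only.
routes = {"261": {"route_long_name": "Metrorail Blue Line", "route_type": "1"}}
trips = {"24703": {"route_id": "261"}}
stops = {"1305": {"stop_name": "CAPITOL HEIGHTS METRO STATION"}}
stop_times = [{"trip_id": "24703", "stop_id": "1305", "arrival_time": "07:02:06"}]


def describe(stop_time):
    """Follow stop_times -> trips -> routes, and stop_times -> stops, for one row."""
    trip = trips[stop_time["trip_id"]]      # stop_times links to trips by trip_id
    route = routes[trip["route_id"]]        # trips links to routes by route_id
    stop = stops[stop_time["stop_id"]]      # stop_times links to stops by stop_id
    return (
        stop["stop_name"],
        stop_time["arrival_time"],
        route["route_long_name"],
        route["route_type"],
    )


summary = describe(stop_times[0])
print(summary)
```

The filtering described below runs the same chain in the other direction: start from the rail routes, then narrow trips, stop times, and stops in turn.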
In order to filter for only rail data, I needed to know the route_id codes for the five Metro lines. It was easy to browse through routes.txt to find them:
260,2,"R5","",3,"",
261,2,"Blue","Metrorail Blue Line",1,"",0d7bba
262,2,"","Richmond Highway Express Bus",3,"",
263,2,"Green","Metrorail Green Line",1,"",009d57
264,2,"Orange","Metrorail Orange Line",1,"",f89038
265,2,"Red","Metrorail Red Line",1,"",e94333
266,2,"Yellow","Metrorail Yellow Line",1,"",fde310
267,2,"S1","",3,"",
The order of fields is: route_id, agency_id, route_short_name, route_long_name, route_type, route_url, route_color.
It seems odd that the Richmond Highway Express Bus is stuck in the middle. And I was surprised to discover that sometime between September 2011 and today WMATA changed the route IDs, and only by a value of 1.
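Since the IDs can shift between feed releases, an alternative to reading them off by eye is to select on the route_type column, which is 1 for the rail lines and 3 for the buses in the excerpt above (in the GTFS spec, 1 means subway or metro and 3 means bus). The following Python sketch assumes the file starts with a header row naming the fields in the order listed above; the sample rows are copied from the excerpt.

```python
import csv
import io

# Rows copied from the routes.txt excerpt; the header line is an assumption
# based on the field order described in the post.
ROUTES_SAMPLE = '''route_id,agency_id,route_short_name,route_long_name,route_type,route_url,route_color
260,2,"R5","",3,"",
261,2,"Blue","Metrorail Blue Line",1,"",0d7bba
262,2,"","Richmond Highway Express Bus",3,"",
263,2,"Green","Metrorail Green Line",1,"",009d57
264,2,"Orange","Metrorail Orange Line",1,"",f89038
265,2,"Red","Metrorail Red Line",1,"",e94333
266,2,"Yellow","Metrorail Yellow Line",1,"",fde310
267,2,"S1","",3,"",
'''

RAIL_ROUTE_TYPE = "1"  # GTFS route_type: 1 = subway/metro, 3 = bus


def rail_route_ids(routes_file):
    """Return the set of route_id values whose route_type marks a rail line."""
    reader = csv.DictReader(routes_file)
    return {row["route_id"] for row in reader if row["route_type"] == RAIL_ROUTE_TYPE}


rail_ids = rail_route_ids(io.StringIO(ROUTES_SAMPLE))
print(sorted(rail_ids))  # ['261', '263', '264', '265', '266']
```

With the real feed, the `io.StringIO(...)` argument would be replaced by an open file handle for routes.txt.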
With those five values, filtering trips.txt is easy: keep only the lines where route_id is 261, 263, 264, 265, or 266. I wrote a program in PHP to do my file I/O.
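The PHP program is not reproduced here. As a rough illustration of the same idea, and not that program, here is a minimal Python sketch that copies the header plus every row whose route_id is in the rail set. The trips.txt rows are made up for the example; only the column names follow the GTFS spec, and `filter_csv` is a hypothetical helper name.

```python
import csv
import io

RAIL_ROUTE_IDS = {"261", "263", "264", "265", "266"}  # taken from routes.txt

# Made-up trips.txt rows for illustration; only the column names follow GTFS.
TRIPS_SAMPLE = '''route_id,service_id,trip_id
261,1,24703
262,1,90001
265,1,24810
'''


def filter_csv(src, dst, column, keep):
    """Copy the header and each row whose `column` value is in `keep`.

    Returns the number of data rows written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    index = header.index(column)
    writer.writerow(header)
    kept = 0
    for row in reader:
        # Skip blank or short lines rather than failing on them.
        if len(row) > index and row[index] in keep:
            writer.writerow(row)
            kept += 1
    return kept


rail_trips = io.StringIO()
kept = filter_csv(io.StringIO(TRIPS_SAMPLE), rail_trips, "route_id", RAIL_ROUTE_IDS)
print(kept)  # 2
print(rail_trips.getvalue())
```

Note that `csv.writer` quotes fields only where needed, so the quoting in the output may differ from the input even though the values are the same.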
From there, I filtered stop_times.txt to include only lines where trip_id had a match in my revised trips.txt file.
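Because stop_times.txt is by far the largest file, it makes sense to stream it row by row instead of loading it whole. The Python sketch below does that and collects the stop_id values it keeps, which is what the next step needs. The rows for trip 24703 come from the sample shown later in this post; the row for trip 31000 is made up, and the header names beyond the first four columns are guesses based on the GTFS spec.

```python
import csv
import io

rail_trip_ids = {"24703"}  # in practice: the trip_id column of the filtered trips file

# Column names after stop_id are assumptions; the 31000 row is a made-up bus trip.
STOP_TIMES_SAMPLE = '''trip_id,arrival_time,departure_time,stop_id,stop_sequence,pickup_type,drop_off_type,shape_dist_traveled
24703,06:54:00,06:54:00,4697,1,0,0,0.0000
24703,06:56:54,06:56:54,4664,2,0,0,1.3634
31000,06:55:00,06:55:00,7001,1,0,0,0.0000
24703,07:02:06,07:02:06,1305,4,0,0,3.8714
'''


def filter_stop_times(src, dst, trip_ids):
    """Stream src, write rows for the wanted trips to dst, and return their stop_ids."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    stop_ids = set()
    for row in reader:  # one row in memory at a time
        if row["trip_id"] in trip_ids:
            writer.writerow(row)
            stop_ids.add(row["stop_id"])
    return stop_ids


rail_stop_times = io.StringIO()
stop_ids = filter_stop_times(io.StringIO(STOP_TIMES_SAMPLE), rail_stop_times, rail_trip_ids)
print(sorted(stop_ids))
```

Keeping the trip IDs in a set makes each membership test cheap, which matters when the input has millions of lines.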
Finally, stops.txt was filtered to include only lines where stop_id had a match in my revised stop_times.txt file. This file was small enough to verify with a quick visual inspection, since it had lines for only the familiar 86 Metro stations, looking like the snippet below:
308,SHAW METRO STATION,,38.914546,-77.021927,69
999,CHEVERLY METRO STATION,,38.916552,-76.915104,5
1305,CAPITOL HEIGHTS METRO STATION,,38.889571,-76.913313,5
1418,U STREET METRO STATION,,38.917015,-77.029169,70
2124,LANDOVER METRO STATION,,38.933994,-76.890005,5
Ah, if only Metro could step into the new millennium and stop using all-uppercase letters.
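The final filtering step on stops.txt, plus a crude fix for the capitalization, might be sketched in Python as follows. The three station rows are from the snippet above; the fourth row is a made-up bus stop, and the header names are assumptions based on the GTFS spec and the shape of the data.

```python
import csv
import io

# In practice: the stop_id values seen in the filtered stop_times file.
rail_stop_ids = {"308", "999", "1305", "1418", "2124"}

# Header names are assumptions; stop 5000 is a made-up bus stop.
STOPS_SAMPLE = '''stop_id,stop_name,stop_desc,stop_lat,stop_lon,zone_id
308,SHAW METRO STATION,,38.914546,-77.021927,69
999,CHEVERLY METRO STATION,,38.916552,-76.915104,5
5000,EXAMPLE ST + SAMPLE AVE,,38.900000,-77.000000,1
1418,U STREET METRO STATION,,38.917015,-77.029169,70
'''


def rail_stops(src, stop_ids):
    """Return the stops.txt rows (as dicts) whose stop_id is in stop_ids."""
    return [row for row in csv.DictReader(src) if row["stop_id"] in stop_ids]


stations = rail_stops(io.StringIO(STOPS_SAMPLE), rail_stop_ids)
names = [row["stop_name"].title() for row in stations]  # cosmetic only
print(len(stations))  # 3 here; the full feed should give 86
print(names)
```

`str.title()` is a blunt instrument: it simply capitalizes the first letter of each word, so names with unusual capitalization would still need a hand-edited exception list.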
The stop_times.txt file has its own baffling annoyances. See the sample below. The first 3 fields are trip_id, arrival_time, and departure_time.
24703,06:54:00,06:54:00,4697,1,0,0,0.0000
24703,06:56:54,06:56:54,4664,2,0,0,1.3634
24703,07:00:00,07:00:00,13107,3,0,0,2.7739
24703,07:02:06,07:02:06,1305,4,0,0,3.8714
24703,07:04:42,07:04:42,4613,5,0,0,5.1831
For every stop, the arrival time is exactly the same as the departure time! It’s as if they just run trains up and down the lines without bothering to stop to pick up and drop off passengers. I knew that WMATA’s Trip Planner used the same time for arrival and departure times, but I assumed that was because their application reasonably rounds time to the nearest minute.
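One way to check this claim across a whole file, instead of eyeballing a sample, is to compute the dwell time at each stop. The Python sketch below does that for the five sample rows and also prints the running time between consecutive stops, which shows the feed carries seconds and is not rounded to the minute. Only the first three column names come from this post; the rest of the header is a guess based on the GTFS spec, and the code assumes the rows belong to one trip and are already in stop order.

```python
import csv
import io

# Rows copied from the stop_times.txt sample; column names after
# departure_time are assumptions.
STOP_TIMES_SAMPLE = '''trip_id,arrival_time,departure_time,stop_id,stop_sequence,pickup_type,drop_off_type,shape_dist_traveled
24703,06:54:00,06:54:00,4697,1,0,0,0.0000
24703,06:56:54,06:56:54,4664,2,0,0,1.3634
24703,07:00:00,07:00:00,13107,3,0,0,2.7739
24703,07:02:06,07:02:06,1305,4,0,0,3.8714
24703,07:04:42,07:04:42,4613,5,0,0,5.1831
'''


def seconds(hms):
    """Convert HH:MM:SS to seconds. GTFS hours may exceed 23, so parse by hand."""
    hours, minutes, secs = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + secs


rows = list(csv.DictReader(io.StringIO(STOP_TIMES_SAMPLE)))

# Seconds spent standing at each stop (departure minus arrival).
dwell = [seconds(r["departure_time"]) - seconds(r["arrival_time"]) for r in rows]

# Seconds between leaving one stop and arriving at the next.
hops = [
    seconds(nxt["arrival_time"]) - seconds(cur["departure_time"])
    for cur, nxt in zip(rows, rows[1:])
]

print(dwell)  # [0, 0, 0, 0, 0]
print(hops)   # [174, 186, 126, 156]
```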
You can copy the revised data and see if it’s easier to use:
- railtrips.txt 0.1MB, down from 1.3MB
- railstops.txt 8KB, down from 688KB
- railstop_times.txt 3.5MB, down from 62.4MB
For an example of using the stops.txt data, see Mapping Metro’s 11,485 Bus Stops.