Data analysis protocol (draft)
- Current academic work describes 3 "selection" metrics and 2 "contract compliance" metrics.
- The current tool implementation contains 6 metrics. An additional metric will be implemented in the near future. Depending on data availability, two of the metrics can be adjusted to account for unique sensors.
Obtaining academic metrics using the tool
This section explains how the metrics described in the academic work can be obtained using the tool.
Selection metrics
Total number of datapoints (T)
This metric represents the total number of datapoints provided by the feed in the time interval that has been selected. At the moment, this metric is not directly available but implementing it will be trivial. The total number of datapoints will be shown above the metric selection and will be updated whenever a new date range, feed, or network is chosen.
Number of cells with data (C)
This metric represents the total number of mile-hour blocks which contain at least k datapoints. This metric is not currently implemented directly. However, adding a display of it to the current space-time coverage chart will be trivial. Below the chart, a text field will allow the user to set the value of k, after which the percentage of mile-hour blocks containing ≥k datapoints will be calculated and displayed.
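As a rough illustration of the calculation, here is a minimal sketch. It assumes each datapoint has already been reduced to a (mile, hour) pair; the function names and the `total_cells` parameter are hypothetical and do not come from the tool.

```python
from collections import Counter

def cells_with_data(points, k):
    """Count mile-hour blocks holding at least k datapoints.

    points: iterable of (mile, hour) pairs (an assumed layout).
    """
    counts = Counter(points)
    return sum(1 for n in counts.values() if n >= k)

def percent_cells_with_data(points, k, total_cells):
    """Percentage of all mile-hour blocks that hold at least k datapoints."""
    return 100.0 * cells_with_data(points, k) / total_cells

# Made-up example: three blocks contain data, out of a 2-mile by 2-hour grid.
points = [(0, 8), (0, 8), (0, 8), (1, 8), (1, 9), (1, 9)]
c = cells_with_data(points, k=2)
pct = percent_cells_with_data(points, k=2, total_cells=4)
```

The total number of blocks is passed in explicitly because blocks with no data at all never appear in the feed.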
Weighted number of cells with data (W)
This is similar to C above but uses a weighted average instead of a binary indicator function. Implementation difficulty is effectively the same as for C, though the exact weighting function needs to be determined first.
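Since the weighting function is still undetermined, the sketch below only shows the shape of the calculation as a weighted count. The `saturating_weight` function is a purely hypothetical placeholder, not the weighting from the academic work, and the (mile, hour) input layout is likewise assumed.

```python
from collections import Counter

def saturating_weight(n, k):
    # Hypothetical placeholder: grows linearly with the count, capped at 1 once n >= k.
    return min(n / k, 1.0)

def binary_weight(n, k):
    # The threshold used by the C metric: a cell counts fully or not at all.
    return 1.0 if n >= k else 0.0

def weighted_cells_with_data(points, k, weight=saturating_weight):
    """Sum a per-cell weight over every mile-hour block that contains data."""
    counts = Counter(points)
    return sum(weight(n, k) for n in counts.values())

points = [(0, 8), (0, 8), (0, 8), (1, 8), (1, 9), (1, 9)]
w = weighted_cells_with_data(points, k=2)
c = weighted_cells_with_data(points, k=2, weight=binary_weight)
```

Passing the weight as a parameter means C falls out as the special case with a binary weight, so the two metrics could share one code path.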
Contract compliance metrics
Spatial coverage (s)
This metric represents the total number of datapoints in a given mile block summed over all hour blocks in the chosen time period. In effect, this is the tool's current space-time metric summed over time.
Obtaining the data itself will be trivial. However, it seems to me that the best way to visualize this information would be to display it on a map. This will take a bit more time but could conceivably be completed by the end of the week (assuming our systems cooperate…).
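The summation itself could look something like the following sketch. It assumes the space-time counts are available as a dictionary keyed by (mile, hour); that layout and the function name are illustrative only.

```python
from collections import defaultdict

def spatial_coverage(grid):
    """Sum a {(mile, hour): count} grid over hours, giving {mile: count}."""
    per_mile = defaultdict(int)
    for (mile, _hour), count in grid.items():
        per_mile[mile] += count
    return dict(per_mile)

# Made-up counts for two mile blocks.
grid = {(0, 8): 3, (0, 9): 5, (1, 8): 1}
s = spatial_coverage(grid)
```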
Temporal coverage (t)
This metric represents the total number of datapoints in the feed at a particular time block (i.e. hour). This metric is currently fully implemented with the tool's Time coverage metric. The user would pick the feed of interest, select the desired time interval, and then click on the Time coverage radio button. A graph of the temporal coverage is then presented. The user can hover over the graph to view individual metric values.
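For reference, the bucketing behind this graph can be sketched as below. This is not the tool's code; it assumes each datapoint carries a `datetime` timestamp, and the sample values are made up.

```python
from collections import Counter
from datetime import datetime

def temporal_coverage(timestamps):
    """Count datapoints per hour block, keyed by the start of the hour."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )

timestamps = [
    datetime(2000, 1, 1, 8, 5),
    datetime(2000, 1, 1, 8, 40),
    datetime(2000, 1, 1, 9, 15),
]
t = temporal_coverage(timestamps)
```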
Description of tool metrics
The current version of the tool provides a number of graphs beyond those currently described in the academic context which would also be useful for evaluating the data. They are described in detail below.
Speed
This graph shows the average speed value in the given mile-hour block in miles per hour. This allows the user to sanity-check the data provided by the vendor to make sure that the speed values the vendor has provided are reasonable. For example, if we see that the average speed on an urban freeway during the morning rush hour is 80 mph, we can suspect that there is a problem with the data and can potentially further investigate where the data error lies.
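A sanity check of this kind could be automated along the following lines. The (mile, hour, speed) tuple layout, the function names, and the 75 mph cutoff are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

def average_speeds(points):
    """Average speed (mph) per mile-hour block from (mile, hour, speed_mph) tuples."""
    by_block = defaultdict(list)
    for mile, hour, speed in points:
        by_block[(mile, hour)].append(speed)
    return {block: mean(speeds) for block, speeds in by_block.items()}

def suspicious_blocks(averages, max_plausible_mph):
    """Blocks whose average speed exceeds a caller-chosen plausibility limit."""
    return sorted(b for b, avg in averages.items() if avg > max_plausible_mph)

# Made-up readings; the cutoff below is an arbitrary example value.
points = [(0, 8, 30.0), (0, 8, 40.0), (1, 8, 85.0), (1, 8, 95.0)]
avg = average_speeds(points)
flagged = suspicious_blocks(avg, max_plausible_mph=75.0)
```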
Map-matching error
The map-matching error represents the distance from the original lat-long position of the raw data point to the nearest roadway segment in the network. Only the subset of data located within a reasonable number of miles of the network's bounding box is used in this metric. The blue line represents the percentage of points which were successfully map-matched to a roadway segment ≤ 30 meters away from the raw data point. The green line shows the average distance (in meters) to the nearest roadway for the successfully map-matched datapoints; the shaded area represents one-half of a standard deviation on either side of the average.
This metric can be useful in evaluating the positioning error of the devices used by the vendor to report speeds. If a high percentage of datapoints can't be map-matched, it is possible that the vendor is supplying data that is not for the roadway in question or that the positioning sensors used are of low quality.
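The three quantities plotted on this chart can be sketched as follows, assuming the distance from each raw point to its nearest roadway segment has already been computed. The function name is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev

MATCH_RADIUS_M = 30.0  # the 30-meter threshold for a successful match

def map_matching_summary(distances_m):
    """Summarize distances (meters) from raw points to the nearest roadway.

    Returns (percent matched, mean matched distance, half-SD band).
    Assumes at least one point falls within the match radius.
    """
    matched = [d for d in distances_m if d <= MATCH_RADIUS_M]
    pct_matched = 100.0 * len(matched) / len(distances_m)
    avg = mean(matched)
    half_sd = pstdev(matched) / 2  # population SD: an assumption
    return pct_matched, avg, (avg - half_sd, avg + half_sd)

# Made-up distances; the last point is too far away to be matched.
distances = [5.0, 15.0, 25.0, 120.0]
pct, avg, band = map_matching_summary(distances)
```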
Transmission delay
This chart displays the time difference between when the data point is recorded and when the data point has been received by the vendor (in seconds). A large transmission delay can be indicative of high network latencies between the sensor and the vendor servers, the vendor servers and the CCIT servers, or a bottleneck/delay on the CCIT servers themselves. In the case of the first two causes, high transmission delay makes the vendor largely unsuitable for real-time traffic management and monitoring purposes.
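The underlying calculation is a simple difference of two timestamps. A minimal sketch, assuming each datapoint carries a recorded and a received time in epoch seconds (the field layout is hypothetical):

```python
from statistics import mean

def transmission_delays(points):
    """Delay in seconds for each (recorded_at, received_at) pair."""
    return [received - recorded for recorded, received in points]

# Made-up timestamps in epoch seconds; the third point arrives much later.
points = [(1000.0, 1002.0), (1010.0, 1013.0), (1020.0, 1090.0)]
delays = transmission_delays(points)
avg_delay = mean(delays)
```

A single late point can dominate the average, so a median or a percentile may be worth considering alongside it.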
Space-time coverage
This chart shows the total number of datapoints recorded for the given mile-hour block. It is similar to the academic C metric but continues to distinguish between mile-hour blocks with different counts instead of using a single threshold. This metric makes it possible to evaluate the consistency of data volumes of the vendor throughout the day, while also making the next metric possible.
Penetration rate
This chart shows the estimated penetration rate of the vendor's data; that is, what percentage of the total occupancy of the freeway is made up of devices which supply data to the vendor. The methodology is as follows: first, all the valid PeMS stations along the route are retrieved. Each station's flows (which are reported at 30-second intervals) are summed across all of the lanes; these flows are then summed at hourly intervals. Then, the PeMS stations are bucketed into one-mile segments. The data from all the PeMS stations in a mile-hour block (which has already been summed for that hour) is then averaged to get the total occupancy in the mile-hour block. The number of unique devices in the data feed in the same mile-hour block is then divided by this total occupancy to get the penetration rate percentage.
Note: concerns have been raised by Anthony about labeling this as "penetration rate". Work continues on providing a better theoretical basis for this metric.
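The methodology described for this metric can be sketched as follows. The input layouts, function names, and sample numbers are assumptions made for illustration and do not reflect the actual PeMS data format. The variable `vehicles` corresponds to what the text calls total occupancy.

```python
from collections import defaultdict
from statistics import mean

def hourly_station_totals(readings):
    """Sum 30-second lane flows into one total per (station, hour).

    readings: iterable of (station_id, hour, lane_flows), where lane_flows
    lists the flow in each lane for one 30-second interval (assumed layout).
    """
    totals = defaultdict(float)
    for station_id, hour, lane_flows in readings:
        totals[(station_id, hour)] += sum(lane_flows)
    return totals

def penetration_rate(readings, station_mile, device_points):
    """Return {(mile, hour): percent}.

    station_mile: {station_id: mile block the station falls in}
    device_points: iterable of (mile, hour, device_id) from the vendor feed
    """
    # Bucket stations into mile-hour blocks.
    by_block = defaultdict(list)
    for (station_id, hour), total in hourly_station_totals(readings).items():
        by_block[(station_mile[station_id], hour)].append(total)

    # Collect the unique devices seen in each mile-hour block.
    devices = defaultdict(set)
    for mile, hour, device_id in device_points:
        devices[(mile, hour)].add(device_id)

    rates = {}
    for block, totals in by_block.items():
        vehicles = mean(totals)  # average over the stations in the block
        if vehicles > 0:
            rates[block] = 100.0 * len(devices[block]) / vehicles
    return rates

# Made-up example: two stations in mile 0, five unique devices in the feed.
readings = [("A", 8, [10, 10]), ("A", 8, [15, 15]), ("B", 8, [20, 30])]
station_mile = {"A": 0, "B": 0}
device_points = [(0, 8, d) for d in ["d1", "d1", "d2", "d3", "d4", "d5"]]
rates = penetration_rate(readings, station_mile, device_points)
```

Blocks with no valid stations are simply omitted, since no rate can be estimated for them.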
Time coverage
This has already been described in the Temporal coverage (t) academic metric above.
Unique device toggle
For the space-time coverage chart described earlier, the Include only single point per device checkbox excludes duplicate points from the same sensor in each mile-hour block. That is, for any mile-hour block, only the first datapoint from a particular device is counted and included in the data that's then presented in graphical form.
Note that this feature is data-dependent, since it requires the presence of a unique device/sensor identifier for each datapoint. The RFP, as published, makes such a field optional, rather than mandatory.
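The effect of the toggle on the counts can be illustrated with the sketch below. It assumes each datapoint is a (mile, hour, device_id) tuple with the identifier present; the function name and parameter are hypothetical.

```python
from collections import Counter

def space_time_counts(points, unique_devices=False):
    """Count datapoints per mile-hour block from (mile, hour, device_id) tuples.

    With unique_devices=True, each device contributes at most one point per
    block. For counting purposes this matches keeping only the first point.
    """
    if unique_devices:
        points = set(points)  # collapses repeats of the same device in a block
    return Counter((mile, hour) for mile, hour, _device in points)

# Made-up feed: device "a" reports twice in the same block.
points = [(0, 8, "a"), (0, 8, "a"), (0, 8, "b"), (1, 8, "a")]
all_counts = space_time_counts(points)
unique_counts = space_time_counts(points, unique_devices=True)
```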
Future metric: sampling rate
This metric displays the average time period between consecutive samples in the raw feed. The way in which this data will be displayed is currently undefined. It seems reasonable to extend the unique device toggle to this metric.
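One possible way to compute this is sketched below, with the per-device option standing in for the unique device toggle. The (timestamp, device_id) layout and all names are assumptions, since the metric has not yet been specified in detail.

```python
from collections import defaultdict
from statistics import mean

def average_sampling_interval(points, per_device=False):
    """Average gap in seconds between consecutive samples.

    points: (timestamp_seconds, device_id) pairs (assumed layout).
    With per_device=True, gaps are measured within each device's own
    samples; otherwise across the feed as a whole.
    """
    groups = defaultdict(list)
    for ts, device_id in points:
        groups[device_id if per_device else None].append(ts)
    gaps = []
    for stamps in groups.values():
        stamps.sort()
        gaps.extend(later - earlier for earlier, later in zip(stamps, stamps[1:]))
    return mean(gaps) if gaps else None

# Made-up feed with two devices, each reporting every 30 seconds.
points = [(0, "a"), (10, "b"), (30, "a"), (40, "b")]
feed_gap = average_sampling_interval(points)
device_gap = average_sampling_interval(points, per_device=True)
```

The two views can differ considerably: a feed with many devices looks densely sampled overall even when each individual device reports rarely.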
Future functionality: feed overlays
The tool should have the ability to overlay data for multiple feeds/vendors for the map-matching error, transmission delay, and time coverage metrics. A way to directly compare the amount of space-time coverage and penetration is also needed. One option is to generate area charts for each of the mile-hour blocks (resulting in the same grid as with the absolute value chart), showing the proportion of total data from each vendor. Another option would be to provide the same chart as before (for individual feeds) with the absolute color replaced by a relative one, where the color value would represent the percentage of all data provided by this particular vendor (the unique device toggle would still function as before).
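Both options rely on the same underlying quantity: each vendor's share of the data in a mile-hour block. A minimal sketch of that calculation, assuming per-vendor space-time counts are available as nested dictionaries (the layout and vendor names are made up):

```python
def vendor_shares(counts_by_vendor):
    """Convert {vendor: {(mile, hour): count}} into {(mile, hour): {vendor: percent}}."""
    totals = {}
    for counts in counts_by_vendor.values():
        for block, n in counts.items():
            totals[block] = totals.get(block, 0) + n

    shares = {}
    for block, total in totals.items():
        if total == 0:
            continue  # no data from any vendor, so no share can be computed
        shares[block] = {
            vendor: 100.0 * counts.get(block, 0) / total
            for vendor, counts in counts_by_vendor.items()
        }
    return shares

counts_by_vendor = {
    "vendor_a": {(0, 8): 30, (1, 8): 10},
    "vendor_b": {(0, 8): 10},
}
shares = vendor_shares(counts_by_vendor)
```

Feeding in counts produced with the unique device toggle enabled would give shares of unique devices rather than of raw datapoints.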