Data quality tool

Revision as of 00:37, 13 November 2010 by Lensovet

With the commoditization of hardware and the associated decrease in infrastructure costs, the number of ways to collect the data needed to drive traffic monitoring has increased significantly. However, most traffic measurement methods have drawbacks related to the technology they use. For example, data collection via smartphones has extremely low infrastructure costs and can provide extremely precise and granular data, but suffers from extremely low penetration. Inductive loop detectors, on the other hand, are point-based measuring tools that cannot always be used for travel time estimates, can carry high maintenance costs, and are not always accurate; at the same time, they have extremely high penetration rates (in locations where they are installed).

This diversity of strengths and weaknesses makes it evident that, at least at this point in time, it is necessary to combine different data sources into a single "traffic picture" whose data is of higher quality than that of any individual source. A key component of such a system is the ability to evaluate the quality of both the individual data streams and their combinations, in order to determine what improvements, if any, the combined system offers and whether certain parameters or combinations provide better results. The goal of this project is to devise such a data quality assessment system so that these evaluations and analyses can be made.

General system description

The system, as currently envisioned, will be a web-based portal which will allow users to evaluate the quality of various data feeds through any modern, standards-compliant browser with an internet connection to CCIT servers. The interface will be primarily visual, allowing users to compare a number of metrics (described in more detail below) visually as well as numerically. At the current time, a comparison of at least one and at most two data sources will be possible, though the system will be designed in such a way that the latter restriction will not be permanent and could be lifted in the future.

In the spirit of modern, user-friendly web design, the system should be responsive and visually appealing. Exporting data for sharing via email or other means should be easy. The tool should be useful "as is" or "out of the box" but still offer enough customization for users to tailor it to their specific needs or application.

Note that in general, the CCIT system contains two types of data feeds: travel time distributions (such as the FasTrak travel time system) and point-based speed/flow/density estimates. In the short term, the system will be designed with the latter group of feeds in mind. However, it should be architected so that travel time feeds can be added later without much additional effort.

Evaluation metrics

The system described above is of little use without a description of the metrics which it will display for analysis. This list of metrics is not exhaustive, and the system should be designed so that adding a new metric requires little work beyond coding the metric calculation itself. Because evaluations of point-based data are the first priority, the metrics below apply only to such sources.
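One way to keep the addition of new metrics cheap is a small registry in which each metric is a single function. The following is only a minimal sketch of that idea, not a description of the actual CCIT code; the names `METRICS`, `metric`, `evaluate_all`, and the `gps_error_m` field are hypothetical.

```python
from typing import Callable, Dict, List

# Assumption: each data point is a plain dict of field name -> numeric value.
Point = Dict[str, float]
Metric = Callable[[List[Point]], float]

# Hypothetical registry mapping a display name to its metric function.
METRICS: Dict[str, Metric] = {}


def metric(name: str) -> Callable[[Metric], Metric]:
    """Decorator that registers a metric function under a display name."""
    def register(fn: Metric) -> Metric:
        METRICS[name] = fn
        return fn
    return register


@metric("point_count")
def point_count(points: List[Point]) -> float:
    return float(len(points))


@metric("mean_gps_error_m")
def mean_gps_error(points: List[Point]) -> float:
    errors = [p["gps_error_m"] for p in points]
    return sum(errors) / len(errors) if errors else 0.0


def evaluate_all(points: List[Point]) -> Dict[str, float]:
    """Run every registered metric over the same set of points."""
    return {name: fn(points) for name, fn in METRICS.items()}
```

Under this arrangement, a new metric is one decorated function; the display and export code would only ever iterate over the registry.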

The data quality assessment tool should provide an easy-to-use interface to specify "correct" or benchmark values for each of the metrics, as the tolerable amount of, for example, GPS error depends on the specific application for which the data is being considered. Also, any feed available to the system should be usable as a benchmark for any metric (as long as it has the data to calculate it).
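As a rough illustration of the benchmark idea, the sketch below compares the mean of a metric against either a fixed user-supplied value or the same metric computed from another feed. The function name, the use of the mean as the summary statistic, and the tolerance parameter are all assumptions made for the example.

```python
from statistics import mean
from typing import Sequence, Tuple, Union

Benchmark = Union[float, Sequence[float]]


def compare_to_benchmark(values: Sequence[float],
                         benchmark: Benchmark,
                         tolerance: float) -> Tuple[float, bool]:
    """Return (difference, within_tolerance) for a metric against a benchmark.

    `benchmark` is either a fixed number chosen by the user or the metric
    values of another feed, in which case its mean is used as the target.
    """
    observed = mean(values)
    if isinstance(benchmark, (list, tuple)):
        target = mean(benchmark)
    else:
        target = float(benchmark)
    difference = observed - target
    return difference, abs(difference) <= tolerance
```

For example, `compare_to_benchmark(gps_errors, 5.0, 3.0)` would check a feed's mean GPS error against a fixed 5 m target, while passing a second feed's errors as `benchmark` would use that feed as the reference instead.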

Data-level metrics

  • Distribution of GPS errors (as reported by the recording device)
  • Distribution of map-matching errors (as determined by the map-matching (MM) algorithms)
  • Density of data (total number of points per link)
  • Frequency of new data receipt (total per link)
  • Data transmission delay (time difference between data recording and data storage on server; 2-step delay for TeleNav only: device→TeleNav server→CCIT server)
  • Distribution of point location distance from link end (for city locations with traffic lights, provides the ability to flag cases where most data points are not near the link ends, where vehicles would be expected to wait at lights)
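Two of the metrics in this list reduce to simple arithmetic, sketched below. This is an illustration under stated assumptions: timestamps are in seconds, the optional relay timestamp stands in for the intermediate TeleNav server hop, "link end" is taken to mean the downstream end of the link, and the 30 m window is an arbitrary placeholder rather than a recommended value.

```python
from typing import Dict, Optional, Sequence


def transmission_delays(recorded_at: float,
                        stored_at: float,
                        relayed_at: Optional[float] = None) -> Dict[str, float]:
    """Delay in seconds between recording and storage on the server.

    If a relay timestamp is available (the two-step case), the total
    delay is also split into its two hops.
    """
    delays = {"total": stored_at - recorded_at}
    if relayed_at is not None:
        delays["device_to_relay"] = relayed_at - recorded_at
        delays["relay_to_server"] = stored_at - relayed_at
    return delays


def fraction_near_link_end(offsets_m: Sequence[float],
                           link_length_m: float,
                           window_m: float = 30.0) -> float:
    """Share of points lying within window_m of the downstream link end.

    offsets_m holds each point's distance from the start of the link.
    """
    if not offsets_m:
        return 0.0
    near = sum(1 for d in offsets_m if link_length_m - d <= window_m)
    return near / len(offsets_m)
```

A low value of `fraction_near_link_end` on a signalized city link could then be used to raise the flag described in the last bullet.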

All of these metrics should be filterable by time intervals, feed, device model (if available), location (specified as a network, polygon, or set of specific links), and unique device. This would allow for the analysis of derived metrics such as "density of data per unique device" or "distribution of point location from link end for the city of SF."
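The filtering described above, and a derived metric such as "density of data per unique device", could look roughly like the following. The `Point` record and its field names are hypothetical, and only a few of the listed filters (feed, set of links, time interval) are shown.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Set


@dataclass(frozen=True)
class Point:
    feed: str
    device_id: str
    link_id: int
    timestamp: float  # seconds since epoch (assumed)


def filter_points(points: Iterable[Point],
                  feed: Optional[str] = None,
                  links: Optional[Set[int]] = None,
                  start: Optional[float] = None,
                  end: Optional[float] = None) -> Iterator[Point]:
    """Yield points matching every filter that is set (None means no filter)."""
    for p in points:
        if feed is not None and p.feed != feed:
            continue
        if links is not None and p.link_id not in links:
            continue
        if start is not None and p.timestamp < start:
            continue
        if end is not None and p.timestamp >= end:
            continue
        yield p


def density_per_link(points: Iterable[Point]) -> Counter:
    """Total number of points observed on each link."""
    return Counter(p.link_id for p in points)


def density_per_device(points: Iterable[Point]) -> Dict[int, float]:
    """Points per link divided by the number of unique devices on that link."""
    counts: Counter = Counter()
    devices: Dict[int, Set[str]] = {}
    for p in points:
        counts[p.link_id] += 1
        devices.setdefault(p.link_id, set()).add(p.device_id)
    return {link: counts[link] / len(devices[link]) for link in counts}
```

Because the filters are applied before any metric is computed, the same metric functions serve both the raw and the derived views.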

Application-level metrics

These metrics will allow users to see the differences between feeds at the application level, i.e. when they are used as input to some model, rather than on the raw data alone. The addition of new metrics at this level is more constrained, as it requires that the application being benchmarked be (a) capable of accepting different feeds as inputs and (b) easily instrumented to calculate the metrics of interest. Two main metrics are envisioned for the system at this point:

  • Value of information from adding an individual feed into a model
  • Value of information from adding multiple feeds into a model

The value of information is defined as the degree to which adding the feed or feeds brings the distribution of travel times produced by some model closer to a previously obtained "ground truth" distribution. The difference between the two metrics is straightforward: the former evaluates the value of information from adding a single feed to some baseline model, whereas the latter does the same for more than one feed. Note that these are different metrics, because the improvement in the model output is neither linear nor monotonically additive in the positive direction. As a result, certain feed combinations could actually worsen model performance, while feeds which provide dramatic improvements individually may not add much when other feeds are added at the same time.
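The definition above does not fix a particular way to measure how closely two distributions resemble each other. As one possible sketch, the code below uses the 1-Wasserstein (earth mover's) distance between equal-size samples of travel times and defines the value of information as the reduction in that distance relative to the baseline model. The choice of distance, the function names, and the sample numbers are assumptions for illustration only.

```python
from typing import Sequence


def wasserstein_1(a: Sequence[float], b: Sequence[float]) -> float:
    """1-Wasserstein distance between two equal-size samples.

    For equal-size samples this is the mean absolute gap between the
    sorted values.
    """
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def value_of_information(baseline: Sequence[float],
                         augmented: Sequence[float],
                         truth: Sequence[float]) -> float:
    """Positive if the augmented model is closer to ground truth than the
    baseline, negative if adding the feed(s) made the model worse."""
    return wasserstein_1(baseline, truth) - wasserstein_1(augmented, truth)


# Made-up travel times in seconds, purely to illustrate non-additivity.
truth = [60.0, 70.0, 80.0, 90.0]
baseline = [80.0, 90.0, 100.0, 110.0]
with_a = [65.0, 75.0, 85.0, 95.0]     # model output with feed A added
with_b = [70.0, 80.0, 90.0, 100.0]    # model output with feed B added
with_ab = [62.0, 72.0, 82.0, 92.0]    # model output with both feeds added

voi_a = value_of_information(baseline, with_a, truth)
voi_b = value_of_information(baseline, with_b, truth)
voi_ab = value_of_information(baseline, with_ab, truth)
```

With these illustrative numbers, the combined value (`voi_ab`) is smaller than the sum of the individual values (`voi_a + voi_b`), which is why the single-feed and multiple-feed cases are treated as separate metrics.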