Changes

Data quality tool

2,935 bytes added, 23:08, 18 November 2010

updates reflecting discussions with seb and samitha

The system, as currently envisioned, will be a web-based portal which will allow users to evaluate the quality of various data feeds through any modern, standards-compliant browser with an internet connection to CCIT servers. The interface will be primarily visual, allowing users to compare a number of metrics (described in more detail below) visually as well as numerically. At the current time, users will be able to examine a single data source or compare two data sources, though the system will be designed so that the two-source limit is not permanent and can be lifted in the future.

In the spirit of modern and user-friendly web design paradigms, the system should be responsive and visually appealing. Exporting generated graphs for sharing via email or other means should be easy. The tool should be useful "as is" or "out of the box" but still allow enough customization for users to tailor it to their specific needs or application.

More details on the user interface and program layout will be specified in a separate document. The system will be targeted primarily at CCIT researchers familiar with the different data feeds and models which are available. However, the user interface will be designed with less technical users in mind, so that (for example) it can be demoed live to transportation professionals.

Note that in general, the CCIT system contains two types of data feeds: link-based and point-based. In the short term, the system will be designed with the latter group of feeds in mind. However, it should be architected in such a way that link-based feeds can be added without much additional effort.

===Distinction between DAT and FeedGenerator===

Note that the system described here can be called the "Data Analysis Tool," or DAT for short. Its only purpose is to ''accept any number of feeds as inputs and provide an interface to compare the data that these feeds contain directly''. Its purpose is '''not''' to take two vastly different feeds and generate the necessary modifications to make them compatible with each other for comparison. Such functionality is beyond the scope of the work for the DAT component specifically, even though it will most likely be necessary for DAT itself to work properly.

As a result, a required component of the data quality assessment ''framework'' (but not DAT itself) is the ''FeedGenerator'' module. This component is responsible for creating the necessary output filters for the feeds which already exist in the system so that different feeds can be compared correctly in the DAT. The FeedGenerator module will also be responsible for making sure that consistent metadata exists across different feeds so that comparisons can take place.
The minimal metadata requirements for a feed to be usable by DAT can be broken up into feed-level and datapoint-level requirements.
;Feed-level requirements
*For processed feeds
**Feed processing sequence – a description of the modifications that have been made to the input data to arrive at the data currently produced by the feed
**Inputs used for feed
**Model parameters
**Typical input-output error profile
**Generic feed type (model-based, statistical, historical, real-time, or some combination of these)
*For raw feeds
**Sensor type – if the feed contains readings from a single sensor type
:Additionally, raw feeds may specify characteristics of the sensor networks, if applicable. This, however, is not required.
;Datapoint-level requirements
*Recorded time
*Received time
*Sensor ID
*Sensor location
*Sensor type [for multi-sensor feeds]
*GPS device error [if applicable]
*Measured value (speed/count/etc)

Users who want to compare two incompatible feeds will be provided with an interface to the FeedGenerator through which they can ask the system administrator to create a proper output filter for the feed(s) in question. A direct "instant filter creation" mechanism will not be present in the system, at least initially, due to the management complexities that such a mechanism would introduce.
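
As a rough illustration, the datapoint-level requirements above could be modeled as a small record type with a check for missing fields. This is only a sketch: the class name <code>DataPoint</code>, the field names, the units, and the <code>missing_fields</code> helper are hypothetical and are not part of the DAT or FeedGenerator design.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class DataPoint:
    """Hypothetical record mirroring the datapoint-level requirements."""
    recorded_time: Optional[datetime]                # when the sensor took the reading
    received_time: Optional[datetime]                # when the server stored it
    sensor_id: Optional[str]
    sensor_location: Optional[Tuple[float, float]]   # assumed to be (lat, lon)
    measured_value: Optional[float]                  # speed, count, etc.
    sensor_type: Optional[str] = None                # only for multi-sensor feeds
    gps_error_m: Optional[float] = None              # only if applicable; metres assumed


# Fields every datapoint must carry; the remaining two are conditional.
REQUIRED = ("recorded_time", "received_time", "sensor_id",
            "sensor_location", "measured_value")


def missing_fields(point: DataPoint, multi_sensor_feed: bool = False) -> List[str]:
    """Return the names of required fields that are unset on this datapoint."""
    required = list(REQUIRED)
    if multi_sensor_feed:
        required.append("sensor_type")
    return [name for name in required if getattr(point, name) is None]
```

A check of this kind would let the FeedGenerator reject or repair a feed before it reaches the DAT, rather than having the DAT discover gaps during a comparison.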

==Evaluation metrics==

All metrics, whether direct or calculated, will generally be generated on-the-fly for each request. As such, the system does not require the use of a database for storing calculated data. If system usage is high enough that database load or performance become a problem, we should look into using [http://www.memcached.org memcached] for storing the calculation results. Since the results are transient, storing them permanently in a database does not make much sense.
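
A minimal sketch of the caching idea, assuming a simple in-process dictionary with a time-to-live as a stand-in for memcached. The names <code>TransientCache</code> and <code>get_or_compute</code>, the key format, and the 300-second default lifetime are illustrative assumptions, not part of the design.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TransientCache:
    """In-memory stand-in for a memcached-style store: entries expire after ttl seconds."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, recomputing it if absent or expired."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = compute()
        self._store[key] = (now, value)
        return value
```

Because nothing is persisted, dropping the cache only costs a recomputation, which matches the transient nature of the results.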

Initially, the system will not allow users to flag specific, problematic data points or feeds, due to the added overhead of having to store this data in a manageable and useful fashion. If users want to, in effect, create a new output filter based on their analyses, the tool will provide a textual summary of the user's current filtering algorithm so that a corresponding output filter may be created by the system administrator.

===Data-level metrics===

*Distribution of map-matching errors (as determined by MM mapmatching algorithms)

*Data transmission delay (time difference between data recording and data storage on server; 2-step delay for TeleNav only: device→TeleNav server→CCIT server)

*Sampling rate †

*Space coverage †

*Time coverage †

*Penetration rate †

*Distribution of measured values

–––––<br/>

† – at this time, it is not entirely clear if this data will be generated on-the-fly by the DAT or by the FeedGenerator (see above for the distinction). This separation should become evident during implementation.

All of these metrics should be filterable by time intervals, feed, device model (if available), location (specified as a network, polygon, or set of specific links), and unique device. This would allow for the analysis of derived metrics such as "density of data per unique device" or "distribution of point location from link end for the city of SF."
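
To make the filtering idea concrete, the following sketch filters a list of readings by time interval, feed, and device, then derives the transmission delay and a per-device data count. The dictionary keys (<code>feed</code>, <code>device</code>, <code>recorded</code>, <code>received</code>) and the function names are hypothetical; the real feeds may expose this data differently, and location filtering is left out.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, List, Optional


def filter_points(points: Iterable[dict], start: datetime, end: datetime,
                  feed: Optional[str] = None,
                  device: Optional[str] = None) -> List[dict]:
    """Keep points recorded in [start, end), optionally restricted to one feed or device."""
    return [
        p for p in points
        if start <= p["recorded"] < end
        and (feed is None or p["feed"] == feed)
        and (device is None or p["device"] == device)
    ]


def transmission_delays(points: Iterable[dict]) -> List[float]:
    """Seconds between recording on the device and storage on the server."""
    return [(p["received"] - p["recorded"]).total_seconds() for p in points]


def points_per_device(points: Iterable[dict]) -> Dict[str, int]:
    """Number of data points contributed by each unique device."""
    return dict(Counter(p["device"] for p in points))
```

Keeping the filter separate from the metric functions means any metric can be computed over any filtered subset, which is how derived metrics such as "density of data per unique device" would fall out.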

===Application-level comparison===

This functionality will allow users to directly compare the output of some model when different combinations of input feeds are used with it. At this time, the comparison between outputs will be purely visual, though specific metrics or analytical/numerical comparison methodologies can be added at a later time.

The system will be designed as follows: for a chosen model (e.g. the highway model), the user will be able to specify which input feeds should be used and for what time period the model will be run. The user will provide a contact email address as well. When the user submits the request, their task will be placed into a queue of model runs. When the model finishes computation, the user will be sent an email containing a link at which they may view the model's results. These results will be viewable by other users as well. As a result, if the user wants to compare the output of the highway model on two different sets of input feeds, they may either:
*Find two existing output sets and compare their outputs
*Take an existing output set, request the generation of a second output set, and then compare the two sets once generation of the second set is complete
*Make two requests for the generation of two new output sets and then compare them once the computation of both is complete

Since this will allow for the easy generation of an arbitrary number of output sets, these generated output sets will be automatically deleted after a set number of days, initially 7.
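
The request workflow above could be sketched roughly as follows. This is an assumption-laden illustration: <code>ModelRunRequest</code>, <code>RunQueue</code>, and the in-memory storage are hypothetical, email delivery and the model run itself are omitted, and only the 7-day retention period comes from the text.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Deque, Dict, List, Tuple

RETENTION = timedelta(days=7)  # initial retention period stated in the text


@dataclass(frozen=True)
class ModelRunRequest:
    model: str                 # e.g. "highway"
    feeds: Tuple[str, ...]     # input feeds to use
    start: datetime            # time period the model is run over
    end: datetime
    email: str                 # where the results link would be sent


class RunQueue:
    """First-in, first-out queue of model runs plus a store of finished output sets."""

    def __init__(self) -> None:
        self.pending: Deque[ModelRunRequest] = deque()
        self.finished: Dict[int, Tuple[ModelRunRequest, datetime]] = {}
        self._next_id = 1

    def submit(self, request: ModelRunRequest) -> None:
        """Place a new request at the back of the queue."""
        self.pending.append(request)

    def complete_next(self, now: datetime) -> int:
        """Mark the oldest pending run as finished and return its output-set id."""
        request = self.pending.popleft()
        output_id = self._next_id
        self._next_id += 1
        self.finished[output_id] = (request, now)
        return output_id

    def purge_expired(self, now: datetime) -> List[int]:
        """Delete output sets older than RETENTION and return their ids."""
        expired = [i for i, (_, done) in self.finished.items()
                   if now - done > RETENTION]
        for i in expired:
            del self.finished[i]
        return expired
```

Since finished output sets are stored by id and visible to everyone, each of the three comparison paths listed above reduces to looking up two ids, submitting new requests first where needed.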

==Future directions==

Lensovet
