Techniques

  1. System Architecture:

    As stated in the project overview, the CLARO system involves two processes: (1) data capturing and transformation, and (2) relational processing under uncertainty. The figure below gives an overview of the architecture of CLARO, an uncertainty-aware stream system. The T operators transform raw data streams into queryable data streams with quantified uncertainty. The A and J operators are examples of relational operators, namely aggregates and joins, respectively. These operators process data modelled by continuous random variables. The final results can be characterized by confidence regions or by other statistics such as the mean and variance.

    [Figure: system architecture]

  2. Data Capturing and Transformation:

    Since the raw streams may not present data in a format suitable for query processing and can be highly noisy, this project employs probabilistic models of the underlying data generation process and machine learning techniques to efficiently transform raw data into a desired representation with an uncertainty metric. The following figure shows a graphical model built for the RFID application.

    [Figure: graphical model for the RFID application]

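
    As a rough illustration of turning noisy raw readings into a value with an uncertainty metric, the sketch below applies a discrete Bayes update to RFID read / no-read events and then summarizes the resulting belief by its mean and variance. It is not CLARO's graphical model: the per-location read probabilities, the location names, and both function names are assumptions made only for this example.

```python
def posterior_over_locations(prior, read_prob, observations):
    """Bayes update of a discrete location belief from read / no-read events.

    prior:        dict location -> prior probability
    read_prob:    dict location -> P(tag is read | tag at location),
                  a hypothetical sensor model with values strictly in (0, 1)
    observations: iterable of booleans (True = the tag was read in that epoch)
    """
    belief = dict(prior)
    for was_read in observations:
        # Multiply each hypothesis by the likelihood of the observation.
        for loc in belief:
            p = read_prob[loc]
            belief[loc] *= p if was_read else (1.0 - p)
        # Renormalize so the belief stays a probability distribution.
        total = sum(belief.values())
        belief = {loc: w / total for loc, w in belief.items()}
    return belief


def mean_and_variance(belief, coordinate):
    """Summarize a belief over locations as the mean and variance of a coordinate."""
    mean = sum(p * coordinate[loc] for loc, p in belief.items())
    var = sum(p * (coordinate[loc] - mean) ** 2 for loc, p in belief.items())
    return mean, var


# Hypothetical setup: three shelf positions (in metres) along one aisle.
prior = {"shelf_1": 1 / 3, "shelf_2": 1 / 3, "shelf_3": 1 / 3}
read_prob = {"shelf_1": 0.8, "shelf_2": 0.4, "shelf_3": 0.1}
position = {"shelf_1": 1.0, "shelf_2": 2.0, "shelf_3": 3.0}

belief = posterior_over_locations(prior, read_prob, [True, True, False, True])
estimate, uncertainty = mean_and_variance(belief, position)
```

    The output tuple would carry `estimate` together with `uncertainty`, rather than a single cleaned value, so that downstream operators can propagate the uncertainty.
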
  3. Relational Processing under Uncertainty:

    To efficiently quantify the result uncertainty of a query operator, CLARO explores techniques from probability and statistical theory to reduce the statistics that data streams need to carry and to expedite the computation of result distributions using approximation. Examples of the techniques applied are characteristic functions and regression.
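
    To show why characteristic functions are convenient here, the sketch below uses the standard fact that the characteristic function of a sum of independent random variables is the product of their individual characteristic functions. A SUM aggregate over uncertain tuples can therefore be represented without convolving densities. The helper names and the numerical step `h` are assumptions for this example, not part of CLARO.

```python
import cmath


def gaussian_cf(mu, sigma2):
    """Characteristic function of a Gaussian with mean mu and variance sigma2."""
    return lambda t: cmath.exp(1j * t * mu - 0.5 * sigma2 * t * t)


def uniform_cf(a, b):
    """Characteristic function of a uniform distribution on [a, b]."""
    def cf(t):
        if t == 0:
            return 1.0 + 0j
        return (cmath.exp(1j * t * b) - cmath.exp(1j * t * a)) / (1j * t * (b - a))
    return cf


def sum_cf(cfs):
    """CF of the sum of independent variables: the product of their CFs."""
    def cf(t):
        value = 1.0 + 0j
        for phi in cfs:
            value *= phi(t)
        return value
    return cf


def mean_and_variance_from_cf(cf, h=1e-2):
    """Approximate mean and variance by finite differences of log cf at t = 0."""
    log_plus = cmath.log(cf(h))
    log_minus = cmath.log(cf(-h))
    mean = ((log_plus - log_minus) / (2 * h)).imag
    # log cf(0) = 0, so the second central difference needs only these two terms.
    variance = -((log_plus + log_minus) / (h * h)).real
    return mean, variance


# Three uncertain tuples feeding a SUM aggregate (illustrative values).
tuple_cfs = [gaussian_cf(1.0, 0.5), gaussian_cf(2.0, 1.5), uniform_cf(0.0, 2.0)]
result_cf = sum_cf(tuple_cfs)
result_mean, result_variance = mean_and_variance_from_cf(result_cf)
```

    The recovered statistics should be close to the exact mean 1 + 2 + 1 = 4 and the exact variance 0.5 + 1.5 + 1/3, up to the finite-difference error.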

  4. Probabilistic Threshold Query Optimization:

    Given input data with uncertainty, users want to retrieve query answers of high confidence, reflected by high existence probabilities of these answers. Probabilistic threshold queries return the tuples whose existence probabilities pass a user-specified threshold. We optimize threshold query processing for continuous uncertain data by (i) expediting selections by reducing the dimensionality of integration and using fast filters, (ii) expediting joins using new indexes on uncertain data, and (iii) optimizing a query plan using a dynamic, per-tuple approach.
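
    A minimal sketch of a probabilistic threshold selection follows, assuming each tuple carries one Gaussian-distributed attribute and the predicate is a range `low <= X <= high`. The existence probability is the probability mass inside the range. A cheap Chebyshev bound stands in for a "fast filter" that can discard a tuple before the exact probability is computed. The filter and all names are illustrative assumptions, not the filters or indexes used in CLARO.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def existence_probability(mu, sigma, low, high):
    """P(low <= X <= high) for X ~ N(mu, sigma^2)."""
    return gaussian_cdf(high, mu, sigma) - gaussian_cdf(low, mu, sigma)


def chebyshev_upper_bound(mu, sigma, low, high):
    """Upper bound on P(low <= X <= high) using only mean and variance."""
    if low <= mu <= high:
        return 1.0  # the bound is uninformative when the mean lies in the range
    distance = low - mu if mu < low else mu - high
    return min(1.0, (sigma / distance) ** 2)


def threshold_select(tuples, low, high, threshold):
    """Return (tuple_id, probability) for tuples whose probability >= threshold."""
    results = []
    for tuple_id, mu, sigma in tuples:
        # Fast filter: skip the exact computation if the bound already fails.
        if chebyshev_upper_bound(mu, sigma, low, high) < threshold:
            continue
        p = existence_probability(mu, sigma, low, high)
        if p >= threshold:
            results.append((tuple_id, p))
    return results


# Illustrative input: (tuple id, mean, standard deviation).
stream = [("t1", 5.0, 1.0), ("t2", 20.0, 1.0), ("t3", 7.5, 1.0)]
answers = threshold_select(stream, low=4.0, high=7.0, threshold=0.6)
```

    With these values, `t2` is discarded by the filter alone, `t3` passes the filter but fails the exact test, and only `t1` is returned.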



Last Update: April 24, 2013