Managing High-Volume Uncertain Data Streams
Thanh Tran, Yanlei Diao, Charles Sutton, and Anna Liu. Supporting User-Defined Functions on Uncertain Data. To Appear in VLDB, 2013.
Abstract: Uncertain data management has become crucial in many sensing and scientific applications. As user-defined functions (UDFs) become widely used in these applications, an important task is to capture result uncertainty for queries that evaluate UDFs on uncertain data. In this work, we provide a general framework for supporting UDFs on uncertain data. Specifically, we propose a learning approach based on Gaussian processes (GPs) to compute approximate output distributions of a UDF when evaluated on uncertain input, with guaranteed error bounds. We also devise an online algorithm to compute such output distributions, which employs a suite of optimizations to improve accuracy and performance. Our evaluation using both real-world and synthetic functions shows that our proposed GP approach can outperform the state-of-the-art sampling approach with up to two orders of magnitude improvement for a variety of UDFs.
Thanh Tran, Liping Peng, Yanlei Diao, Andrew McGregor, and Anna Liu. CLARO: Modelling and Processing Uncertain Data Streams. In VLDB Journal, 2012.
Abstract: Uncertain data streams, where data are incomplete and imprecise, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the \claro\ system that supports stream processing for uncertain data naturally captured using continuous random variables. \claro\ employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for relational operators by exploring statistical theory and approximation. We also consider query planning for complex queries given an accuracy requirement. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements, and outperform state-of-the-art sampling methods.
Liping Peng, Yanlei Diao, and Anna Liu. Optimizing Probabilistic Query Processing on Continuous Uncertain Data. In Proceedings of PVLDB 2011.
Abstract: Uncertain data management is becoming increasingly important in many applications, in particular, in scientific databases and data stream systems. Uncertain data in these new environments is naturally modeled by continuous random variables. An important class of queries uses complex selection and join predicates and requires query answers to be returned if their existence probabilities pass a threshold. In this work, we optimize threshold query processing for continuous uncertain data by (i) expediting joins using new indexes on uncertain data, (ii) expediting selections by reducing dimensionality of integration and using faster filters, and (iii) optimizing a query plan using a dynamic, per-tuple based approach. Evaluation results using real-world data and benchmark queries show the accuracy and efficiency of our techniques and significant performance gains over a state-of-the-art threshold query optimizer.
Thanh Tran, Andrew McGregor, Yanlei Diao, Liping Peng, and Anna Liu. Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations, In Proceedings of PVLDB 2010.
Abstract: Uncertain data streams are increasingly common in real-world deployments and monitoring applications require the evaluation of complex queries on such streams. In this paper, we consider complex queries involving conditioning (e.g., selections and group by's) and aggregation operations on uncertain data streams. To characterize the uncertainty of answers to these queries, one generally has to compute the full probability distribution of each operation used in the query. Computing distributions of aggregates given conditioned tuple distributions is a hard, unsolved problem. Our work employs a new evaluation framework that includes a general data model, approximation metrics, and approximate representations. Within this framework we design fast data-stream algorithms, both deterministic and randomized, for returning approximate distributions with bounded errors as answers to those complex queries. Our experimental results demonstrate the accuracy and efficiency of our approximation techniques and offer insights into the strengths and limitations of deterministic and randomized algorithms.
Thanh Tran, Liping Peng, Boduo Li, Yanlei Diao, and Anna Liu. PODS: A New Model and Processing Algorithms for Uncertain Data Streams, In Proceedings of SIGMOD 2010.
Abstract: Uncertain data streams, where data is incomplete, imprecise, and even misleading, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the PODS system that supports stream processing for uncertain data naturally captured using continuous random variables. PODS employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for complex relational operators, i.e., aggregates and joins, by exploring advanced statistical theory and approximation. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements, and significantly outperform a state-of-the-art sampling method. A case study further shows that our techniques can enable a tornado detection system (for the first time) to produce detection results at stream speed and with much improved quality.
Yanlei Diao, Boduo Li, Anna Liu, Liping Peng, Charles Sutton, Thanh Tran, and Michael Zink. Capturing Data Uncertainty in High-Volume Stream Processing, In Proceedings of CIDR 2009.
Abstract: We present the design and development of a data stream system that captures data uncertainty from data collection to query processing to final result generation. Our system focuses on data that is naturally modeled as continuous random variables such as many types of sensor data. To provide an end-to-end solution, our system employs probabilistic modeling and inference to generate uncertainty description for raw data, and then a suite of statistical techniques to capture changes of uncertainty as data propagates through query operators. To cope with high-volume streams, we explore advanced approximation techniques for both space and time efficiency. We are currently working with a group of scientists to evaluate our system using traces collected from real-world applications for hazardous weather monitoring and for object tracking and monitoring.
Thanh Tran, Charles Sutton, Richard Cocci, Yanming Nie, Yanlei Diao, and Prashant Shenoy. Probabilistic Inference over RFID Streams in Mobile Environments, In Proceedings of ICDE 2009. (tech report version)
Abstract: Recent innovations in RFID technology are enabling large-scale cost-effective deployments in retail, healthcare, pharmaceuticals and supply chain management. The advent of mobile or handheld readers adds significant new challenges to RFID stream processing due to the inherent reader mobility, increased noise, and incomplete data. In this paper, we address the problem of translating noisy, incomplete raw streams from mobile RFID readers into clean, precise event streams with location information. Specifically we propose a probabilistic model to capture the mobility of the reader, object dynamics, and noisy readings. Our model can self-calibrate by automatically estimating key parameters from observed data. Based on this model, we employ a sampling-based technique called particle filtering to infer clean, precise information about object locations from raw streams from mobile RFID readers. Since inference based on standard particle filtering is neither scalable nor efficient in our settings, we propose three enhancements -- particle factorization, spatial indexing, and belief compression -- for scalable inference over large numbers of objects and high- volume streams. Our experiments show that our approach can offer 49% error reduction over a state-of-the-art data cleaning approach such as SMURF while also being scalable and efficient.
Last Update: April 24, 2013