Uncertainty Management in Large Database Systems and Scientific Applications
Wenzhao Liu, Yanlei Diao, and Anna Liu. An Analysis of Query-Agnostic Sampling for Interactive Data Exploration. Communications in Statistics – Theory and Methods. Accepted, July 2017.
Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. AIDE: An Active Learning-based Approach for Interactive Data Exploration. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2016.
Abstract: In this paper, we argue that database systems be augmented with an automated data exploration service that methodically steers users through the data in a meaningful way. Such an automated system is crucial for deriving insights from complex datasets found in many big data applications such as scientific and healthcare applications as well as for reducing the human effort of data exploration. Towards this end, we present AIDE, an Automatic Interactive Data Exploration framework that assists users in discovering new interesting data patterns and eliminate expensive ad-hoc exploratory queries. AIDE relies on a seamless integration of classification algorithms and data management optimization techniques that collectively strive to accurately learn the user interests based on his relevance feedback on strategically collected samples. We present a number of exploration techniques as well as optimizations that minimize the number of samples presented to the user while offering interactive performance. AIDE can deliver highly accurate query predictions for very common conjunctive queries with small user effort while, given a reasonable number of samples, it can predict with high accuracy complex disjunctive queries. It provides interactive performance as it limits the user wait time per iteration of exploration to less than a few seconds.
Yanlei Diao, Kyriaki Dimitriadou, Zhan Li, Wenzhao Liu, Olga Papaemmanouil, Kemi Peng, Liping Peng. AIDE: An Automatic User Navigation System for Interactive Data Exploration. PVLDB 8(12): 1964-1975 (2015)
Abstract: Data analysts often engage in data exploration tasks to discover interesting data patterns, without knowing exactly what they are look- ing for. Such exploration tasks can be very labor-intensive because they often require the user to review many results of ad-hoc queries and adjust the predicates of subsequent queries to balance the trade-off between collecting all interesting information and reducing the size of returned data. In this demonstration we introduce AIDE, a system that automates these exploration tasks. AIDE steers the user towards interesting data areas based on her relevance feedback on database samples, aiming to achieve the goal of identifying all database objects that match the user interest with high efficiency. In our demonstration, conference attendees will see AIDE in action for a variety of exploration tasks on real-world datasets.
Liping Peng and Yanlei Diao. Supporting Data Uncertainty in Array Databases. In Proceedings of SIGMOD 2015.
Abstract: Uncertain data management has become crucial to scientific applications. Recently, array databases have gained popularity for scientific data processing due to performance benefits. In this paper, we address uncertain data management in array databases, which may involve both value uncertainty within individual tuples and position uncertainty regarding where a tuple should belong in an array given uncertain dimension attributes. Our work defines the formal semantics of array operations under both value and position uncertainty. To address the new challenge raised by position uncertainty, we propose a suite of storage and evaluation strategies for array operations, with a focus on a new scheme that bounds the overhead of querying by strategically treating tuples with large variances via replication in storage. Results from real datasets show that for common workloads, our best-performing techniques outperform alternative methods based on state-of-the-art indexes by 1.7x to 4.3x for the Subarray operation and 1 to 2 orders of magnitude for Structure-Join, at only a small storage cost.
Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. Explore-by-Example: An Automatic Query Steering Framework for Interactive Data Exploration. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 517-528, 2014.
Abstract: Interactive Data Exploration (IDE) is a key ingredient of a diverse set of discovery-oriented applications, including ones from scientific computing and evidence-based medicine. In these applications, data discovery is a highly ad hoc interactive process where users execute numerous exploration queries using varying predi- cates aiming to balance the trade-off between collecting all relevant information and reducing the size of returned data. Therefore, there is a strong need to support these human-in-the-loop applications by assisting their navigation in the data to find interesting objects. In this paper, we introduce AIDE, an Automatic Interactive Data Exploration framework, that iteratively steers the user towards in- teresting data areas and “predicts” a query that retrieves his objects of interest. Our approach leverages relevance feedback on database samples to model user interests and strategically collects more samples to refine the model while minimizing the user effort. AIDE integrates machine learning and data management techniques to pro- vide effective data exploration results (matching the user’s interests with high accuracy) as well as high interactive performance. It de- livers highly accurate query predictions for very common conjunctive queries with very small user effort while, given a reasonable number of samples, it can predict with high accuracy complex conjunctive queries. Furthermore, it provides interactive performance by limiting the user wait time per iteration to less than a few sec- onds in average. Our user study indicates that AIDE is a practical exploration framework as it significantly reduces the user effort and the total exploration time compared with the current state-of-the-art approach of manual exploration.
Thanh Tran, Yanlei Diao, Charles Sutton, and Anna Liu. Supporting User-Defined Functions on Uncertain Data. In Proceedings of VLDB 2013.
Abstract: Uncertain data management has become crucial in many sensing and scientific applications. As user-defined functions (UDFs) become widely used in these applications, an important task is to capture result uncertainty for queries that evaluate UDFs on uncertain data. In this work, we provide a general framework for supporting UDFs on uncertain data. Specifically, we propose a learning approach based on Gaussian processes (GPs) to compute approximate output distributions of a UDF when evaluated on uncertain input, with guaranteed error bounds. We also devise an online algorithm to compute such output distributions, which employs a suite of optimizations to improve accuracy and performance. Our evaluation using both real-world and synthetic functions shows that our proposed GP approach can outperform the state-of-the-art sampling approach with up to two orders of magnitude improvement for a variety of UDFs.
Thanh Tran, Liping Peng, Yanlei Diao, Andrew McGregor, and Anna Liu. CLARO: Modelling and Processing Uncertain Data Streams. In VLDB Journal, 2012.
Abstract: Uncertain data streams, where data are incomplete and imprecise, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the \claro\ system that supports stream processing for uncertain data naturally captured using continuous random variables. \claro\ employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for relational operators by exploring statistical theory and approximation. We also consider query planning for complex queries given an accuracy requirement. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements, and outperform state-of-the-art sampling methods.
Liping Peng, Yanlei Diao, and Anna Liu. Optimizing Probabilistic Query Processing on Continuous Uncertain Data. In Proceedings of PVLDB 2011.
Abstract: Uncertain data management is becoming increasingly important in many applications, in particular, in scientific databases and data stream systems. Uncertain data in these new environments is naturally modeled by continuous random variables. An important class of queries uses complex selection and join predicates and requires query answers to be returned if their existence probabilities pass a threshold. In this work, we optimize threshold query processing for continuous uncertain data by (i) expediting joins using new indexes on uncertain data, (ii) expediting selections by reducing dimensionality of integration and using faster filters, and (iii) optimizing a query plan using a dynamic, per-tuple based approach. Evaluation results using real-world data and benchmark queries show the accuracy and efficiency of our techniques and significant performance gains over a state-of-the-art threshold query optimizer.
Thanh Tran, Andrew McGregor, Yanlei Diao, Liping Peng, and Anna Liu. Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations, In Proceedings of PVLDB 2010.
Abstract: Uncertain data streams are increasingly common in real-world deployments and monitoring applications require the evaluation of complex queries on such streams. In this paper, we consider complex queries involving conditioning (e.g., selections and group by's) and aggregation operations on uncertain data streams. To characterize the uncertainty of answers to these queries, one generally has to compute the full probability distribution of each operation used in the query. Computing distributions of aggregates given conditioned tuple distributions is a hard, unsolved problem. Our work employs a new evaluation framework that includes a general data model, approximation metrics, and approximate representations. Within this framework we design fast data-stream algorithms, both deterministic and randomized, for returning approximate distributions with bounded errors as answers to those complex queries. Our experimental results demonstrate the accuracy and efficiency of our approximation techniques and offer insights into the strengths and limitations of deterministic and randomized algorithms.
Thanh Tran, Liping Peng, Boduo Li, Yanlei Diao, and Anna Liu. PODS: A New Model and Processing Algorithms for Uncertain Data Streams, In Proceedings of SIGMOD 2010.
Abstract: Uncertain data streams, where data is incomplete, imprecise, and even misleading, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the PODS system that supports stream processing for uncertain data naturally captured using continuous random variables. PODS employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for complex relational operators, i.e., aggregates and joins, by exploring advanced statistical theory and approximation. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements, and significantly outperform a state-of-the-art sampling method. A case study further shows that our techniques can enable a tornado detection system (for the first time) to produce detection results at stream speed and with much improved quality.
Yanlei Diao, Boduo Li, Anna Liu, Liping Peng, Charles Sutton, Thanh Tran, and Michael Zink. Capturing Data Uncertainty in High-Volume Stream Processing, In Proceedings of CIDR 2009.
Abstract: We present the design and development of a data stream system that captures data uncertainty from data collection to query processing to final result generation. Our system focuses on data that is naturally modeled as continuous random variables such as many types of sensor data. To provide an end-to-end solution, our system employs probabilistic modeling and inference to generate uncertainty description for raw data, and then a suite of statistical techniques to capture changes of uncertainty as data propagates through query operators. To cope with high-volume streams, we explore advanced approximation techniques for both space and time efficiency. We are currently working with a group of scientists to evaluate our system using traces collected from real-world applications for hazardous weather monitoring and for object tracking and monitoring.
Thanh Tran, Charles Sutton, Richard Cocci, Yanming Nie, Yanlei Diao, and Prashant Shenoy. Probabilistic Inference over RFID Streams in Mobile Environments, In Proceedings of ICDE 2009. (tech report version)
Abstract: Recent innovations in RFID technology are enabling large-scale cost-effective deployments in retail, healthcare, pharmaceuticals and supply chain management. The advent of mobile or handheld readers adds significant new challenges to RFID stream processing due to the inherent reader mobility, increased noise, and incomplete data. In this paper, we address the problem of translating noisy, incomplete raw streams from mobile RFID readers into clean, precise event streams with location information. Specifically we propose a probabilistic model to capture the mobility of the reader, object dynamics, and noisy readings. Our model can self-calibrate by automatically estimating key parameters from observed data. Based on this model, we employ a sampling-based technique called particle filtering to infer clean, precise information about object locations from raw streams from mobile RFID readers. Since inference based on standard particle filtering is neither scalable nor efficient in our settings, we propose three enhancements -- particle factorization, spatial indexing, and belief compression -- for scalable inference over large numbers of objects and high- volume streams. Our experiments show that our approach can offer 49% error reduction over a state-of-the-art data cleaning approach such as SMURF while also being scalable and efficient.
Last Update: November 2017