Abstract:
Aggregation is a commonly used but time-consuming operation in database systems. Relative Compared to exact query, it is often more attractive to return an approximate result with the required error bound to user in a much faster response time. However, we find that none of the previous methods can process approximate aggregation on massive data with arbitrary accuracy and high efficiency. A novel algorithm PAA is proposed to efficiently process approximate aggregation with an arbitrary confidence interval. The data space of dimensional attributes is divided into multiple hypercubes of the same cube size. Each partition maintains the tuples whose dimensional attributes fall into the corresponding hypercube. A random sample RS is pre-constructed on table. PAA consists of two stages. If the approximate result obtained by RS in stage 1 does not satisfy the confidence interval, it is required to retrieve more random tuples from partition set IPS whose hypercubes overlap with search region in stage 2. The novelty of PAA lies in how to retrieve random tuples from IPS when the exact number of tuples satisfying predicate in each partition is unknown and how to reduce random I/O cost of retrieval operation as much as possible. The experimental results show that PAA obtains up to two orders of magnitude speedup compared with the existing methods.