Abstract:
Join is one of the most important operations in relational database, and is also important in data stream management system. In group-bys which construct StreamCube, join will be done before them, and join between data stream and huge dimension tables (such as IPaddress table) would consume limited power of CPU and capacity of memory. Generally, a huge dimension table must be partitioned into small tables and each partition table is loaded into memory in turn that causes frequent disk IO. To avoid this shortage, it compress huge dimension tables losslessly by taking characters of dimension tables and their join-key layer into account and finding join-key redundancies in those tables. So, one dimension table with n concept columns is compressed into n ranged join-key dimension tables (RJ-DT) by reducing join-key redundancies and using decomposed of storage model of column-store. Each RJ-DT is composed of start and end columns and several concept columns. However, a new issue that non-equijoin called range join between data stream and RJ-DT is brought out. Then, it proposes a multi-join algorithm of huge dimension table, called multi dynamic index nested-loop join (MDI-NL), which implements non-equijoin efficiently, also supports multi-join. MDI-NL constructs RB+Tree index of each RJ-DT before join. In join operation, it dispatches index dynamically referring to demand of group-by which get the exact smallest index and makes MDI-NL more powerful. Through theoretical analysis and extensive experiments, it is found that MDI-NL outperforms other join algorithms by an order of magnitude for huge dimension table join and has a strong practicability.