高级检索

    一种基于动态平衡树的在线索引快速构建方法

    A Fast On-Line Index Construction Method Based on Dynamic Balancing Tree

    • 摘要: 倒排索引的构建可以通过离线方式高效地完成,但是仅当整个数据集索引完毕后方可提供检索服务.在线索引可以在构建倒排索引的同时提供检索服务,新加入的文档即刻可供检索.提出了一种基于动态平衡树的在线索引更新策略,利用动态平衡树控制索引合并过程,使索引合并总是在大小相近的子索引之间进行,以减少索引合并代价,同时可以调节索引和检索之间的性能平衡.该方法提供了一个基于合并的在线索引更新框架, 与已有方法相比具有更好的通用性、更高的性能和更好的规模可扩展性.在由4000万张网页构成的270 GB Web数据集上运行的实验表明,该方法在实际系统中是高效的,将索引更新的性能提高了92.28%,而检索性能仅下降4.79%,大幅度降低了在线索引构建的代价.

       

      Abstract: Examined in this paper is the issues of on-line maintenance for growing text collections. This approach builds inverted index to reflect changes in the collection it describes, supporting efficient document insertions and deletions. A fast on-line index construction method based on dynamic balancing tree is proposed. This method takes a merge-based approach and aims to keep a good performance during indexing and query processing by always merging indices with similar sizes. This is achieved by using a tree in which each node corresponds to one sub-index. The placing of sub-indexes in the tree depends on the sizes of the sub-indexes, so that each merging takes place among similarly sized sub-indexes. The method is suited very well not only for growing text collections, but also for dynamic text collections, and is a uniform framework for merge-based index maintenance strategies. It is more general, faster and scales better than previous methods, and can be adapted to suit different balances of insertion and querying operations. Using experiments on large scale Web data, the efficiency, flexibility and scalability of this method are demonstrated in practice, showing that on-line index construction for growing text collections can be performed efficiently and almost as fast as for static text collections.

       

    /

    返回文章
    返回