高级检索

    连接位Minwise Hash算法的研究

    • 摘要: 在信息检索中,Minwise Hash算法用于估计集合的相似度.b位Minwise Hash则通过存储Hash值的b位来估计相似度,从而节省了存储空间和计算时间.基于b位Minwise Hash的理论框架提出了连接位Minwise Hash算法,给出了连接位的相似度无偏估计和存储因子.通过理论证明了连接位Minwise Hash算法不需要损失很大的精度却可以成倍地减少比对的次数,提升了算法的性能.理论分析和实验验证了此方法的有效性.

       

      Abstract: Minwise Hashing has become a standard technique for estimating the similarity of the collection (e.g., resemblance) with applications in information retrieval. While traditional Minwise hashing methods store each hashed value using 64 bits, storing only the lowest b bits of each (Minwise) hashed value (e.g., b=1 or 2). The b-bit Minwise hashing algorithm can gain substantial advantages in terms of computational efficiency and storage space. Based on the b-bit Minwise hashing theory, a connected bit Minwise hashing algorithm is proposed. The unbiased estimator of the resemblance and storage factor of connected bit Minwise hashing are provided theoretically. It could be theoretically proved that the efficiency of similarity estimation is improved by the connected bit Minwise hashing algorithm since the number of comparisons is greatly reduced without significant loss of accuracy. Several key parameters (e.g., precision, recall and efficiency) are analyzed, and the availability of several estimators for connected bit Minwise hashing is analyzed. Theoretical analysis and experimental results demonstrate the effectiveness of this method.

       

    /

    返回文章
    返回