Abstract:
Minwise Hashing has become a standard technique for estimating the similarity of the collection (e.g., resemblance) with applications in information retrieval. While traditional Minwise hashing methods store each hashed value using 64 bits, storing only the lowest b bits of each (Minwise) hashed value (e.g., b=1 or 2). The b-bit Minwise hashing algorithm can gain substantial advantages in terms of computational efficiency and storage space. Based on the b-bit Minwise hashing theory, a connected bit Minwise hashing algorithm is proposed. The unbiased estimator of the resemblance and storage factor of connected bit Minwise hashing are provided theoretically. It could be theoretically proved that the efficiency of similarity estimation is improved by the connected bit Minwise hashing algorithm since the number of comparisons is greatly reduced without significant loss of accuracy. Several key parameters (e.g., precision, recall and efficiency) are analyzed, and the availability of several estimators for connected bit Minwise hashing is analyzed. Theoretical analysis and experimental results demonstrate the effectiveness of this method.