Alexandria Digital Research Library

Efficient Similarity Search with Cache-Conscious Data Traversal

Tang, Xun
Degree Grantor:
University of California, Santa Barbara. Computer Science
Degree Supervisor:
Tao Yang
Place of Publication:
[Santa Barbara, Calif.]
University of California, Santa Barbara
Creation Date:
Issued Date:
Computer Science
Information retrieval
Tree-based ensemble model
All-pairs similarity search
Cache locality optimization
Load balance
Search result ranking
Dissertations, Academic and Online resources
Ph.D.--University of California, Santa Barbara, 2015

Similarity search identifies sets of similar objects and is important for many data-intensive applications. Examples of such applications include near-duplicate detection and clustering, collaborative filtering for similarity-based recommendations, search query suggestion, and data cleaning. Conducting similarity search is time-consuming, especially when a massive amount of data is involved and all pairs must be compared. Previous work has used comparison filtering, inverted indexing, and parallel accumulation of partial intermediate results to expedite its execution. However, shuffling intermediate results can incur significant communication overhead as data scales up.
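To make the inverted-index approach mentioned above concrete, the following is a minimal sketch of all-pairs cosine similarity search with score accumulation. It is not the dissertation's implementation; the function name, data layout (documents as sparse feature-to-weight dicts), and threshold parameter are illustrative assumptions.

```python
from collections import defaultdict
import math

def cosine_all_pairs(docs, threshold):
    """docs: list of {feature: weight} dicts.
    Returns (i, j, sim) triples with i < j and cosine similarity >= threshold."""
    # Pre-compute vector norms for cosine normalization.
    norms = [math.sqrt(sum(w * w for w in d.values())) for d in docs]
    # Inverted index: feature -> list of (doc_id, weight).
    index = defaultdict(list)
    results = []
    for i, d in enumerate(docs):
        # Accumulate partial dot products against all previously indexed docs;
        # only documents sharing at least one feature are ever touched.
        scores = defaultdict(float)
        for f, w in d.items():
            for j, wj in index[f]:
                scores[j] += w * wj
        for j, dot in scores.items():
            sim = dot / (norms[i] * norms[j])
            if sim >= threshold:
                results.append((j, i, sim))
        # Index the current document so later documents can match it.
        for f, w in d.items():
            index[f].append((i, w))
    return results
```

Because documents that share no features never meet in any posting list, the comparison cost depends on feature overlap rather than on the full quadratic number of pairs.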

We have developed a fast two-stage partition-based approach for all-pairs similarity search which incorporates static partitioning, optimized load balancing, and cache-conscious data traversal. Static partitioning places dissimilar documents into different groups to eliminate unnecessary comparison between their content. To overcome the challenges introduced by the skewed distribution of data partition sizes and irregular dissimilarity relationships in large datasets, we conduct computation load balancing for partitioned similarity search, supported by a competitiveness analysis. These techniques can improve performance by one to two orders of magnitude by reducing unnecessary I/O and data communication and improving load balance. We also discuss how to further accelerate similarity search by incorporating incremental computing and approximation methods such as Locality Sensitive Hashing.
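As an illustration of one approximation method named above, here is a hedged sketch of random-hyperplane Locality Sensitive Hashing for cosine similarity. The signature scheme, bucket structure, and parameter choices are generic textbook LSH, not the dissertation's design.

```python
import random

def lsh_signature(vec, hyperplanes):
    """Bit signature: the sign of the dot product with each random hyperplane.
    vec and each hyperplane are equal-length dense lists of floats."""
    return tuple(
        1 if sum(v * h for v, h in zip(vec, hp)) >= 0 else 0
        for hp in hyperplanes
    )

def lsh_buckets(vectors, num_planes=16, seed=0):
    """Group vectors by signature. Only vectors sharing a bucket become
    candidate pairs, shrinking the all-pairs comparison space."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                   for _ in range(num_planes)]
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(lsh_signature(v, hyperplanes), []).append(i)
    return buckets
```

Vectors pointing in similar directions agree on most hyperplane signs, so they tend to collide in the same bucket; the similarity search then runs only within buckets, trading a small recall loss for a large reduction in comparisons.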

Because of data sparsity and irregularity, accessing feature vectors in memory for runtime comparison incurs significant overhead in the modern memory hierarchy. We have designed and implemented cache-conscious algorithms to improve runtime efficiency in similarity search. The idea of optimizing data layout and traversal patterns is also applied to runtime search result ranking with multi-tree ensemble models.
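The traversal idea can be sketched as loop tiling: a block of document data is reused by every query before the next block is visited, so each block stays resident in cache across queries. This is an illustrative stand-in, not the dissertation's algorithm, and Python cannot expose the cache effect itself; only the access order is shown.

```python
def blocked_scores(queries, docs, block_size=4):
    """Return scores[q][d] = dot(queries[q], docs[d]), visiting docs in
    contiguous blocks. All vectors are equal-length lists of floats."""
    scores = [[0.0] * len(docs) for _ in queries]
    for start in range(0, len(docs), block_size):
        block = docs[start:start + block_size]   # block stays "hot" in cache
        for qi, q in enumerate(queries):         # every query reuses the block
            for bi, d in enumerate(block):
                scores[qi][start + bi] = sum(a * b for a, b in zip(q, d))
    return scores
```

In a naive query-major traversal, each query streams over all documents and evicts them before the next query arrives; the blocked order amortizes each memory load over all queries.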

Physical Description:
1 online resource (163 pages)
UCSB electronic theses and dissertations
Catalog System Number:
In Copyright
Copyright Holder:
Xun Tang
File Description
Access: Public access
Tang_ucsb_0035D_12538.pdf (Portable Document Format)