Alexandria Digital Research Library

Efficient Similarity Search with Cache-Conscious Data Traversal

Author:
Tang, Xun
Degree Grantor:
University of California, Santa Barbara. Computer Science
Degree Supervisor:
Tao Yang
Place of Publication:
[Santa Barbara, Calif.]
Publisher:
University of California, Santa Barbara
Creation Date:
2015
Issued Date:
2015
Topics:
Computer Science
Keywords:
Information retrieval
Tree-based ensemble model
All-pairs similarity search
Cache locality optimization
Load balance
Search result ranking
Genres:
Dissertations, Academic and Online resources
Dissertation:
Ph.D.--University of California, Santa Barbara, 2015
Description:

Similarity search is important for many data-intensive applications that need to identify sets of similar objects. Examples of such applications include near-duplicate detection and clustering, collaborative filtering for similarity-based recommendations, search query suggestion, and data cleaning. Conducting similarity search is time-consuming, especially when a massive amount of data is involved and all pairs must be compared. Previous work has used comparison filtering, inverted indexing, and parallel accumulation of partial intermediate results to expedite its execution. However, shuffling intermediate results can incur significant communication overhead as data scales up.

We have developed a fast two-stage partition-based approach for all-pairs similarity search which incorporates static partitioning, optimized load balancing, and cache-conscious data traversal. Static partitioning places dissimilar documents into different groups to eliminate unnecessary comparison of their content. To overcome the challenges introduced by the skewed distribution of partition sizes and the irregular dissimilarity relationships in large datasets, we balance the computation load for partitioned similarity search and provide a competitiveness analysis. These techniques can improve performance by one to two orders of magnitude by reducing unnecessary I/O and data communication and improving load balance. We also discuss how to further accelerate similarity search by incorporating incremental computing and approximation methods such as Locality Sensitive Hashing.
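The partition-based idea above can be illustrated with a minimal sketch (not the dissertation's implementation; the function names, dict-based sparse vectors, and the precomputed `dissimilar` partition-pair set are all illustrative assumptions): documents are statically assigned to partitions, and any pair of documents whose partitions are known to be mutually dissimilar is skipped without ever touching their feature vectors.

```python
from itertools import combinations
import math

def cosine(u, v):
    # Sparse vectors represented as dicts {feature: weight} (illustrative).
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def partitioned_apss(docs, labels, threshold, dissimilar):
    """All-pairs similarity search over statically partitioned documents.

    labels[i] is the partition of document i; `dissimilar` is a set of
    partition pairs (p, q) with p < q whose documents are known to fall
    below `threshold`, so comparisons across them are skipped entirely.
    """
    results = []
    for i, j in combinations(range(len(docs)), 2):
        p, q = labels[i], labels[j]
        if p != q and (min(p, q), max(p, q)) in dissimilar:
            continue  # statically filtered: no such pair can reach threshold
        s = cosine(docs[i], docs[j])
        if s >= threshold:
            results.append((i, j, s))
    return results
```

In this sketch the savings come purely from the `continue` branch: for skewed real datasets the dissertation additionally balances the per-partition comparison load across workers, which the sketch does not model.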

Because of data sparsity and irregularity, accessing feature vectors in memory for runtime comparison incurs significant overhead in the modern memory hierarchy. We have designed and implemented cache-conscious algorithms to improve runtime efficiency in similarity search. The idea of optimizing data layout and traversal patterns is also applied to the runtime search result ranking problem with multi-tree ensemble models.

Physical Description:
1 online resource (163 pages)
Format:
Text
Collection(s):
UCSB electronic theses and dissertations
ARK:
ark:/48907/f33f4msg
ISBN:
9781321700671
Catalog System Number:
990045119700203776
Rights:
In Copyright
Copyright Holder:
Xun Tang
File Description
Access: Public access
Tang_ucsb_0035D_12538.pdf (Portable Document Format)