Alexandria Digital Research Library

Efficient Parallel Optimizations for All Pairs Similarity Search

Alabduljalil, Maha Ahmed
Degree Supervisor:
Tao Yang
Place of Publication:
[Santa Barbara, Calif.]
University of California, Santa Barbara
Creation Date:
Issued Date:
Engineering, System Science and Computer Science
Memory Optimization
Load Balancing
Parallel Optimizations
Similarity Search
Online resources and Dissertations, Academic
Degree Grantor:
University of California, Santa Barbara. Computer Science
Ph.D.--University of California, Santa Barbara, 2014

All Pairs Similarity Search (APSS), a computationally intensive process, is used in many search and mining applications such as data clustering, near-duplicate detection, anti-spamming, and collaborative filtering for recommendation. APSS is time consuming because its time complexity is quadratic to the dataset size, and thus it does not scale well for big data. The research in this thesis aims at speeding up APSS using exact, efficient and parallel techniques including data partitioning, load balancing and communication optimizations.

The first part of this thesis targets APSS with the cosine similarity metric. We present a two-stage APSS using a fast partitioning, load assignment algorithm. Given a large number of documents, our static partitioning places dissimilar documents into different groups to eliminate unnecessary comparison between their content. Next, due to the symmetry nature of the comparison, we use a circular load assignment to remove the redundant comparisons and reduce the I/O and communication overhead. Our evaluation shows that the proposed two-stage APSS is one order of magnitude faster than the previous work and reduces up to 70% of the total comparisons. To speed up intensive APSS computation in modern CPUs with memory hierarchy, we also describe cache-conscious algorithms using indexing with efficient data traversal and layout using index splitting and coalescing for better temporal locality. Our analysis shows the improved code is 2.74x as fast as the baseline.

In the second part of the thesis we discuss an offline duplicate removal system and techniques to speed up the near-duplicate detection of text documents. The system employs a multidimensional mapping to partition the dataset and balance the load across multiple machines. This is further extended to incremental duplicate clustering for applications with continuous data update.

Physical Description:
1 online resource (138 pages)
UCSB electronic theses and dissertations
Catalog System Number:
Inc.icon only.dark In Copyright
Copyright Holder:
Maha Alabduljalil
Access: This item is restricted to on-campus access only. Please check our FAQs or contact UCSB Library staff if you need additional assistance.