Alexandria Digital Research Library

Efficient Parallel Optimizations for All Pairs Similarity Search

Author:
Alabduljalil, Maha Ahmed
Degree Supervisor:
Tao Yang
Place of Publication:
[Santa Barbara, Calif.]
Publisher:
University of California, Santa Barbara
Creation Date:
2014
Issued Date:
2014
Topics:
Engineering, System Science and Computer Science
Keywords:
Memory Optimization
Load Balancing
Filtering
Parallel Optimizations
Partitioning
Similarity Search
Genres:
Online resources and Dissertations, Academic
Degree Grantor:
University of California, Santa Barbara. Computer Science
Dissertation:
Ph.D.--University of California, Santa Barbara, 2014
Description:

All Pairs Similarity Search (APSS), a computationally intensive process, is used in many search and mining applications such as data clustering, near-duplicate detection, anti-spamming, and collaborative filtering for recommendation. APSS is time consuming because its time complexity is quadratic to the dataset size, and thus it does not scale well for big data. The research in this thesis aims at speeding up APSS using exact, efficient and parallel techniques including data partitioning, load balancing and communication optimizations.

The first part of this thesis targets APSS with the cosine similarity metric. We present a two-stage APSS using a fast partitioning, load assignment algorithm. Given a large number of documents, our static partitioning places dissimilar documents into different groups to eliminate unnecessary comparison between their content. Next, due to the symmetry nature of the comparison, we use a circular load assignment to remove the redundant comparisons and reduce the I/O and communication overhead. Our evaluation shows that the proposed two-stage APSS is one order of magnitude faster than the previous work and reduces up to 70% of the total comparisons. To speed up intensive APSS computation in modern CPUs with memory hierarchy, we also describe cache-conscious algorithms using indexing with efficient data traversal and layout using index splitting and coalescing for better temporal locality. Our analysis shows the improved code is 2.74x as fast as the baseline.

In the second part of the thesis we discuss an offline duplicate removal system and techniques to speed up the near-duplicate detection of text documents. The system employs a multidimensional mapping to partition the dataset and balance the load across multiple machines. This is further extended to incremental duplicate clustering for applications with continuous data update.

Physical Description:
1 online resource (138 pages)
Format:
Text
Collection(s):
UCSB electronic theses and dissertations
ARK:
ark:/48907/f3ft8j50
ISBN:
9781321201284
Catalog System Number:
990045115570203776
Rights:
Inc.icon only.dark In Copyright
Copyright Holder:
Maha Alabduljalil
Access: This item is restricted to on-campus access only. Please check our FAQs or contact UCSB Library staff if you need additional assistance.