Efficient Parallel Optimizations for All Pairs Similarity Search

Author:

Alabduljalil, Maha Ahmed

Degree Grantor:

University of California, Santa Barbara. Computer Science

Degree Supervisor:

Tao Yang

Place of Publication:

[Santa Barbara, Calif.]

Publisher:

University of California, Santa Barbara

Creation Date:

2014

Issued Date:

2014

Topics:

Computer Science and Engineering, System Science

Keywords:

Load Balancing,
Parallel Optimizations,
Similarity Search,
Partitioning,
Filtering, and
Memory Optimization

Genres:

Online resources and Dissertations, Academic

Dissertation:

Ph.D.--University of California, Santa Barbara, 2014

Description:

All Pairs Similarity Search (APSS), a computationally intensive process, is used in many search and mining applications such as data clustering, near-duplicate detection, anti-spamming, and collaborative filtering for recommendation. APSS is time consuming because its time complexity is quadratic to the dataset size, and thus it does not scale well for big data. The research in this thesis aims at speeding up APSS using exact, efficient and parallel techniques including data partitioning, load balancing and communication optimizations.

The first part of this thesis targets APSS with the cosine similarity metric. We present a two-stage APSS using a fast partitioning, load assignment algorithm. Given a large number of documents, our static partitioning places dissimilar documents into different groups to eliminate unnecessary comparison between their content. Next, due to the symmetry nature of the comparison, we use a circular load assignment to remove the redundant comparisons and reduce the I/O and communication overhead. Our evaluation shows that the proposed two-stage APSS is one order of magnitude faster than the previous work and reduces up to 70% of the total comparisons. To speed up intensive APSS computation in modern CPUs with memory hierarchy, we also describe cache-conscious algorithms using indexing with efficient data traversal and layout using index splitting and coalescing for better temporal locality. Our analysis shows the improved code is 2.74x as fast as the baseline.

In the second part of the thesis we discuss an offline duplicate removal system and techniques to speed up the near-duplicate detection of text documents. The system employs a multidimensional mapping to partition the dataset and balance the load across multiple machines. This is further extended to incremental duplicate clustering for applications with continuous data update.

Physical Description: