Some of the material in is restricted to members of the community. By logging in, you may be able to gain additional access to certain collections or items. If you have questions about access or logging in, please use the form on the Contact Page.
Similarity join has been widely studied and used in various scientific and commercial applications. Given two datasets, similarity join finds all pairs of similar objects (one from each input dataset), subject to a distance metric and a ranking threshold. Typical applications of similarity join operators include knowledge discovery, data mining tasks, and nearest neighbor search in spatial databases GIS systems. With the amount of data to be processed growing at an ever-increasing rate, efficient computation of similarity joins becomes a challenging and imperative task. Although researchers have devoted extensive efforts on this subject, a number of problems still remain open. In particular, efficiency and scalability pose significant challenges and have seriously limited the applications of similarity join operators. To that end, a tempting choice is to leverage the power of parallel computations from a cluster of commodity machines, e.g., the MapReduce computation framework that offers opportunities to handle large data sets with efficiency, scalability and ease of use. Unfortunately, most existing work assumed a centralized, sequential setting, and cannot be applied to parallel and distributed systems. On the other hand, other existing work studied similarity joins in traditional parallel environments, such as in the MPI programming model, which are also not applicable in a shared-nothing cluster setting that MapReduce adopts. That said, focusing on the most popular type of similarity join operators, known as theknearest neighbor (kNN) join, this thesis has investigated the challenges of how to efficiently executekNN joins both in low-dimensional and high-dimensional spaces for large data sets in MapReduce. We present solutions for both exact and approximatekNN joins. Our constructions for the approximatekNN joins are particularly interesting, where we trade accuracy for efficiency and scalability in a principal fashion.