The revolution in next-generation DNA sequencing technologies is driving explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms used for genomics analysis. Various big data and machine learning technologies have been explored to mine complex, large-scale genomics data. In this dissertation, we first survey some of the existing scalable approaches for genomic analysis and identify their limitations. We then investigate the still-unsolved challenges faced by computational biologists in large-scale genomic analysis. Specifically, for MapReduce-based bioinformatics analysis tools, Hadoop exposes a large number of parameters that control the behavior of a MapReduce job, and the unique characteristics of these tools make all the existing guidelines inapplicable. In metagenomics, the intrinsic complexity and massive quantity of metagenomic data create tremendous challenges for recovering microbial genomes. When applying NLP technologies to genome analysis, the enormous k-mer size and the low-frequency k-mers caused by sequencing errors pose significant challenges for k-mer embedding. To overcome these problems, this dissertation introduces three countermeasures. First, we extract the key parameters from the large space of MapReduce parameters and present an exemplary case for tuning MapReduce-based bioinformatics analysis tools based on their unique characteristics. Second, we design and implement SpaRC, a scalable sequence clustering tool built on Apache Spark, which partitions reads based on their molecules of origin to enable downstream assembly optimization in metagenomics. SpaRC achieves high clustering accuracy and scales near-linearly with the data size and the number of computing nodes. Lastly, we leverage Locality Sensitive Hashing (LSH) to overcome the two challenges faced by k-mer embedding and design LSHvec.
With LSHvec, a DNA sequence can be represented as a dense low-dimensional vector. The trained sequence vectors capture the rich characteristics of DNA sequences and can be fed to machine learning models for a wide variety of applications in genomics analysis. We compare our approaches with existing solutions, and the experiments demonstrate that they achieve state-of-the-art results. We open-source our implementations of SpaRC and LSHvec to facilitate comparison with future work and to inspire further research in genomic analysis.