De novo genomic sequencing, the process of determining the sequence of a genome that has not previously been elucidated, presents unique challenges, especially for larger genomes. Modern high-throughput sequencing technologies address the problem of covering the entire genome in a reasonable time by fragmenting the genome into portions that can be examined in a massively parallel fashion. While these technologies have saved considerable time and cost in the chemical process of determining the sequence of a genome, they produce sets of many tens of millions of sequence fragments called reads, each typically on the order of just 100 to 300 bases long. Assembling these reads into a genomic sequence is computationally very demanding.
A variety of assembly software packages are readily available for this purpose. In this project, a set of genomic assemblers was selected for examination and tested with an Illumina data set for the grape species Vitis romanetii. Experimental runs with this data set were performed to evaluate the run time required as well as the contiguity, completeness, and accuracy of the resulting assemblies. Different approaches to quality-control preprocessing of the sequence data were also explored and evaluated. Based on the results, the program MaSuRCA, run with data that has not been preprocessed for quality control, is strongly recommended. The second-highest recommendation is ABySS with data preprocessed via QuorUM error correction.
In the course of these tests, it was also hoped that at least the beginnings of a draft genome for V. romanetii would be produced. The assemblies that came closest to publication quality were produced by MaSuRCA. Examination of these with the assessment software BUSCO suggests that the best of them may well be approaching publishable quality.
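Contiguity, one of the evaluation criteria named above, is commonly summarized by the N50 statistic: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. The abstract does not state which contiguity metric was used in this work, so the following is only a minimal sketch of N50 as an illustrative assumption; the function name and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (illustrative sketch)."""
    # Sort contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    # N50 is the length of the contig at which the cumulative length
    # first reaches half of the total assembly length.
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, not taken from the thesis data.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))
```

A higher N50 indicates that more of the assembly lies in long contigs, but it says nothing about correctness, which is why completeness and accuracy are assessed separately.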
Department, Program, or Center
Thomas H. Gosnell School of Life Sciences (COS)
Michael V. Osier
Julie A. Thomas
Olsen, Lars J., "Functional Comparison of Current Software Tools for Genomic Assembly from High Throughput Sequencing Data" (2019). Thesis. Rochester Institute of Technology. Accessed from
RIT – Main Campus