Abstract

Capillary electrophoresis (CE)-based Sanger sequencing made it possible to extract and interpret the genetic information of any given biological system. Although it was a major breakthrough in biology, its limitations in speed, throughput, scalability and resolution paved the way for a new technology, Next-Generation Sequencing (NGS). NGS has provided considerable insight into the genomes, transcriptomes and epigenomes of many species on Earth. Over time, a large amount of data has been generated with NGS, and new methods have been developed, each with its own merits and demerits. Some of the most popular are Illumina sequencing, 454 pyrosequencing, SOLiD sequencing and Ion Torrent semiconductor sequencing. The data generated by these methods are stored in databases, and one of the most important of these is the Sequence Read Archive (SRA) hosted by the National Center for Biotechnology Information (NCBI).

The sequencing data are downloaded from the Sequence Read Archive through a web interface and converted into the required format using the SRA toolkit provided by NCBI. Using the OS architecture of the SRA toolkit, data stored in '.sra' format are converted into tab-delimited text and saved to a text file with a '.txt' extension. These files contain a great deal of redundant information, and only particular fields are required for analysis. To reduce the redundant information and obtain only the desired data, an algorithm was developed that operates through a User Interface (UI) in which the user selects the data needed for analysis. This reduces computational time and improves accuracy and memory efficiency. The User Interface is named SRADE (Sequence Read Archive Data Extractor).
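The field-selection step described above can be illustrated with a minimal sketch. This is not SRADE's actual implementation, which the abstract does not describe; it assumes the tab-delimited text has a header row, and the column names (READ, QUALITY, READ_LEN, SPOT_ID) and the helper name extract_columns are hypothetical.

```python
import csv
import io


def extract_columns(tsv_text, wanted):
    """Keep only the columns named in `wanted` from tab-delimited text.

    Assumes the first line is a header row (an assumption for this sketch).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in wanted if name not in header]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # Build one small dict per row containing only the requested fields.
    return [{name: row[name] for name in wanted} for row in reader]


# Hypothetical sample resembling a converted '.txt' file.
sample = (
    "READ\tQUALITY\tREAD_LEN\tSPOT_ID\n"
    "ACGT\t30 31 32 33\t4\t1\n"
    "GGCA\t20 21 22 23\t4\t2\n"
)
selected = extract_columns(sample, ["READ", "QUALITY"])
for row in selected:
    print(row["READ"], row["QUALITY"])
```

Dropping unneeded columns as early as possible is what keeps later steps cheaper in time and memory; for very large files, the same idea could be applied line by line rather than on text held in memory.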

The data obtained from the SRA files include the sequence reads, the quality of the reads, their positions and their lengths, all of which can be used for mapping. Both the information obtained and the quality of the reads may differ between sequencing methods. The quality of results from multiple runs of the same sequencing method, as well as from different sequencing methods, is therefore compared in order to identify the differences, determine the best method for sequencing genes, and find a cost-effective way to distinguish reads with high quality scores from those with low quality scores. For this comparison, data from four runs of the "whole exome sequencing of 1000 Genomes project of Illumina" are considered, along with the "1000 Genomes whole exome project of Illumina and AB_SOLiD".
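As a rough illustration of the kind of comparison described above, the sketch below summarizes per-read quality for each run. It is not the method used in the thesis; it assumes quality values are available as space-separated integer scores per read, and the threshold of 20, the run labels and the function names are hypothetical.

```python
from statistics import mean


def mean_read_quality(quality_field):
    """Average the space-separated integer quality scores of one read."""
    scores = [int(s) for s in quality_field.split()]
    return mean(scores)


def summarize_run(quality_fields, threshold=20):
    """Summarize one run: read count, overall mean, and fraction of
    reads whose mean quality is at or above `threshold` (hypothetical cutoff)."""
    means = [mean_read_quality(q) for q in quality_fields]
    high = sum(1 for m in means if m >= threshold)
    return {
        "reads": len(means),
        "mean_quality": mean(means),
        "high_fraction": high / len(means),
    }


# Hypothetical quality fields for two runs.
runs = {
    "run_A": ["30 32 34", "10 12 14"],
    "run_B": ["25 25 25", "35 35 35"],
}
summary = {name: summarize_run(fields) for name, fields in runs.items()}
for name, stats in summary.items():
    print(name, stats)
```

The same summary can be computed for runs of one sequencing method or for runs from different methods, so the resulting figures can be compared side by side.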

Library of Congress Subject Headings

Nucleotide sequence--Data processing; Gene mapping; User interfaces (Computer systems)--Design

Publication Date

5-1-2014

Document Type

Thesis

Student Type

Graduate

Degree Name

Bioinformatics (MS)

Department, Program, or Center

Thomas H. Gosnell School of Life Sciences (COS)

Advisor

Gary R. Skuse

Comments

Physical copy available from RIT's Wallace Library at QH441.2 .K68 2014

Campus

RIT – Main Campus

Plan Codes

BIOINFO-MS
