The FLiCK framework; enabling rapid development and performance benchmarking of compression applications for genetic data files

Alessandro Aiezza II

Physical copy available from RIT's Wallace Library at QA76.9.D33 A53 2016

Abstract

High-throughput sequencing (HTS) technologies are rapidly replacing seminal techniques in genetic analysis (e.g., RNA-Sequencing experiments replacing microarrays). As HTS becomes more readily available and less expensive, the volume of data produced by these technologies will continue to increase dramatically. These data can be found in publicly available databases such as the Sequence Read Archive (SRA) [1], but are also often stored locally by the research institutions that produce them. Wherever the data are kept, once an individual file has been processed and analyzed, it will need to be archived for potential reanalysis in the future. This trend has created demand for more effective retrieval and long-term storage of these data through compression. This demand has been met by a wide variety of existing utilities. However, as novel genetic compression algorithms continue to be developed, their implementation in a particular programming language can become an obstacle that prevents these potentially impactful algorithms from being widely adopted by the scientific community. Here, several prominent compression applications used for genetic data, such as Quip [2] and Gzip [3], are investigated and quantitatively compared. A framework, FLiCK, is developed with the aim of expediting the development and implementation of compression algorithms. Additionally, an exploratory implementation of a compression algorithm facilitated by the FLiCK framework is demonstrated. The results of this empirical study suggest that the FLiCK framework is an effective tool that significantly improves programming throughput for compression algorithm implementation. A sample algorithm implemented within FLiCK outperformed conventional tools on a subset of data taken from the SRA.