Abstract

The computational demands of multivariate clustering grow rapidly, so processing large data sets, such as those found in flow cytometry, is very time consuming on a single CPU. Fortunately, these techniques lend themselves naturally to large-scale parallel processing. To address the computational demands, graphics processing units, specifically NVIDIA's CUDA framework and Tesla architecture, were investigated as a low-cost, high-performance platform for a number of clustering algorithms. C-means and Expectation Maximization with Gaussian mixture models were implemented using the CUDA framework. The implementations use a hybrid of CUDA, OpenMP, and MPI to scale to many GPUs on multiple nodes in a high-performance computing environment. This framework is envisioned as part of a larger cloud-based workflow service in which biologists can apply multiple algorithms and parameter sweeps to their data sets and quickly receive a thorough set of results that can be further analyzed by experts. Improvements over previous GPU-accelerated implementations range from 1.42x to 21x for C-means and from 3.72x to 5.65x for the Gaussian mixture model on non-trivial data sets. On a single NVIDIA GTX 260, speedups with flow cytometry files average 90x for C-means and 74x for the Gaussian mixture model, compared to optimized C code running on a single core of a modern Intel CPU. On the TeraGrid Lincoln high-performance cluster at NCSA with 128 Tesla C1060 GPUs, C-means achieves 42% parallel efficiency and a CPU speedup of 4794x. The Gaussian mixture model achieves 72% parallel efficiency and a CPU speedup of 6286x.
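
The cluster figures above can be related to one another with a little arithmetic. The following is a minimal sketch, not taken from the thesis: it assumes parallel efficiency means the speedup over a single GPU divided by the number of GPUs, and the function name `gpu_scaling` is hypothetical.

```python
def gpu_scaling(parallel_efficiency: float, num_gpus: int, cpu_speedup: float):
    """Return (speedup over one GPU, implied per-GPU speedup over one CPU core).

    Assumption (not stated in the abstract): parallel efficiency is defined as
    (speedup over a single GPU) / (number of GPUs).
    """
    # Speedup of the full cluster relative to a single GPU.
    multi_gpu_speedup = parallel_efficiency * num_gpus
    # Single-GPU speedup over the CPU baseline implied by the cluster figure.
    per_gpu_cpu_speedup = cpu_speedup / multi_gpu_speedup
    return multi_gpu_speedup, per_gpu_cpu_speedup


# Figures quoted in the abstract for 128 Tesla C1060 GPUs.
results = {
    "C-means": gpu_scaling(0.42, 128, 4794),
    "Gaussian mixture model": gpu_scaling(0.72, 128, 6286),
}

for name, (scaling, per_gpu) in results.items():
    print(f"{name}: ~{scaling:.1f}x over one GPU, "
          f"~{per_gpu:.1f}x per GPU over one CPU core")
```

Under that assumption, the implied per-GPU speedups (roughly 89x and 68x) are of the same order as the single-GPU averages quoted for the GTX 260. The hardware and data sets differ, so this is only a rough consistency check.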

Library of Congress Subject Headings

Flow cytometry--Data processing; Cluster analysis; Multivariate analysis; Parallel processing (Electronic computers); Graphics processing units--Programming

Publication Date

5-1-2010

Document Type

Thesis

Department, Program, or Center

Computer Engineering (KGCOE)

Advisor

Shaaban, Muhammad

Comments

Note: imported from RIT's Digital Media Library, running on DSpace, to RIT Scholar Works. A physical copy is available from RIT's Wallace Library at QH585.5.F56 P36 2010.

Recommended Citation

Pangborn, Andrew D., "Scalable data clustering using GPUs" (2010). Thesis. Rochester Institute of Technology. Accessed from https://repository.rit.edu/theses/5464

Campus

RIT – Main Campus
