The increasing use of computing resources in our daily lives leads to data generation at an astonishing rate. The computing industry is being repeatedly questioned for its ability to accommodate the unpredictable growth rate of data. It has encouraged the development of cluster based storage systems. Hadoop is a popular open source framework known for its massive cluster based storage. Hadoop is widely used in the computer industry because of its scalability, reliability and low cost of implementation. The data storage of the Hadoop cluster is managed by a user level distributed file system. To provide a scalable storage on the cluster, the file system metadata is decoupled and is managed by a centralized namespace server known as NameNode. Compute Nodes are primarily responsible for the data storage and processing. In this work, we analyze the limitations of Hadoop such as single point of access of the file system and fault tolerance of the cluster. The entire namespace of the Hadoop cluster is stored on a single centralized server which restricts the growth and data storage capacity. The efficiency and scalability of the cluster depends heavily on the performance of the single NameNode. Based on thorough investigation of Hadoop limitations, this thesis proposes a new architecture based on distributed metadata storage. The solution involves three layered architecture of Hadoop, first two layers for the metadata storage and a third layer storing actual data. The solution allows the Hadoop cluster to scale up further with the use of multiple NameNodes. The evaluation demonstrates effectiveness of the design by comparing its performance with the default Hadoop implementation.
Library of Congress Subject Headings
Electronic data processing--Distributed processing; Open source software
Department, Program, or Center
Computer Science (GCCIS)
Talwalkar, Anup, "HadoopT - breaking the scalability limits of Hadoop" (2011). Thesis. Rochester Institute of Technology. Accessed from
RIT – Main Campus