Abstract

Grouping objects that are described by attributes, or clustering, is a central notion in data mining. Similarity or relationships between the attributes themselves are equally important but relatively unexplored. Such groups of attributes are also known as directories, concept hierarchies, or topics, depending on the underlying data domain. The similarities between the two problems of grouping objects and grouping attributes might suggest that traditional clustering techniques are applicable. This thesis argues that traditional clustering techniques fail to adequately capture the solution we seek. It also explores domain-independent techniques for grouping attributes. The notion of similarity between attributes, and therefore clustering in categorical datasets, has not received adequate attention. This issue has seen renewed interest in the knowledge discovery community, spurred on by the requirements of personalization of information and online search technology. The problem is broken down into (a) quantification of this notion of similarity and (b) the subsequent formation of groups, placing sufficiently similar attributes in the same group based on metrics that we will attempt to derive. Both aspects of the problem are carefully studied. The thesis also analyzes existing domain-independent approaches to building distance measures, proposing and analyzing several such measures for quantifying similarity, thereby providing a foundation for future work in grouping relevant attributes. The theoretical results are supported by experiments carried out on a variety of datasets from the text-mining, web-mining, social network, and transaction analysis domains. The results indicate that traditional clustering solutions are inadequate within this problem framework. They also suggest a direction for the development of distance measures for the quantification of the concept of similarity between categorical attributes.

Library of Congress Subject Headings

Data mining; Cluster analysis; Information organization

Publication Date

2005

Document Type

Thesis

Department, Program, or Center

Computer Science (GCCIS)

Advisor

Teredesai, Ankur - Chair

Advisor/Committee Member

Hemaspaandra, Edith

Advisor/Committee Member

Gaborski, Roger

Comments

Note: imported from RIT's Digital Media Library running on DSpace to RIT Scholar Works. Physical copy available through RIT's Wallace Library at: QA76.9.D343 D39 2004

Campus

RIT – Main Campus
