Abstract

The goal of authorship attribution is to find a set of unconscious writing characteristics, or style features, that distinguish text written by one person from text written by another. Once these features are found, they can be used to pair a text with the individual who wrote it. It is now well accepted that authors develop distinct and unconscious writing features. Over one thousand stylometric features (style markers) have been proposed in a variety of research disciplines [44], but none of that research has looked at the syntactic structure of the text. I conjecture that the distinct writing features of an author are not limited to the features already studied, but also include syntactic features.

To support this hypothesis, I ran experiments using two open source parsing programs and analyzed the results to see whether the features these programs produce are sufficient to determine the most probable author of a text. Parsing programs are designed to determine syntactic structures in natural language. They take a text or writing sample and produce output showing the grammatical relationships between the words in the text. They therefore provide a means to test the hypothesis that an author's syntactic use of words provides enough identifying characteristics to differentiate that author from others. Using two open source natural language parsing programs, the Link Grammar Parser and Collins' Parser, this research tested whether an author's sentence structure is unique enough to provide a means of recognizing the probable author of a text. Initial data was collected on a pool of test authors. Sample texts by each author were run through both parsers. The output of each parser was analyzed using two multivariate analysis methods: discriminant analysis and k-means clustering.

My results suggest that syntactic sentence structure may be a viable basis for authorship attribution. The Link Grammar Parser shows promise as a way to augment existing authorship attribution methods. Collins' Parser provided even better results, which should be solid enough to stand on their own as a new and viable alternative to existing methods. Collins' Parser also provided new predictors that might improve current authorship attribution methods. For example, elements and phrases with wh- words and the length of noun phrases are highly correlated with authorship in this study.

Library of Congress Subject Headings

Authorship; Natural language processing (Computer science); Parsing (Computer grammar)

Publication Date

2003

Document Type

Thesis

Student Type

Graduate

Degree Name

Computer Science (MS)

Department, Program, or Center

Computer Science (GCCIS)

Advisor

Edith Hemaspaandra

Advisor/Committee Member

Myroslava Dzikovska

Advisor/Committee Member

Carol Marchetti

Comments

Physical copy available from RIT's Wallace Library at QA76.9.N38 M33 2003

Campus

RIT – Main Campus
