Abstract

The wealth of texts publicly available online for analysis is ever increasing. Much work in computational linguistics focuses on syntactic, contextual, morphological, and phonetic analysis of written documents, vocal recordings, or texts on the internet. Twitter messages present a unique challenge for computational linguistic analysis because of their constrained size. The 140-character limit often prompts users to abbreviate words and phrases. Additionally, because Twitter is an informal writing medium, messages are not expected to adhere to grammatically or orthographically standard English. As such, Twitter messages are noisy and do not necessarily conform to the standard writing conventions of linguistic corpora, often requiring special pre-processing before advanced analysis can be done.
In computational linguistics, there is interest in determining latent attributes of an author. Attributes such as author gender can be determined with some success from many sources, using various methods, such as analysis of shallow linguistic patterns or topic. Author age is more difficult to determine, but previous research has been somewhat successful at classifying age as a binary variable (e.g. over or under 30), a ternary variable, or even a continuous variable using various techniques. Twitter messages present a difficult problem for latent user attribute analysis because of the pre-processing necessary for many computational linguistic analysis tasks. An added logistical challenge is that very few latent attributes are explicitly defined by users on Twitter. Twitter messages are part of an enormous data set, but the data set must be independently annotated for latent writer attributes not defined through the Twitter API before any classification on such attributes can be done. The classification problem itself is a further challenge because of the restrictions on tweet length.
Previous work has shown that word and phrase abbreviation patterns used on Twitter can be indicative of some latent user attributes, such as geographic region or the Twitter client (iPhone, Android, Twitter website, etc.) used to make posts. Language change has generally been posited as being driven by women. This study explores whether there are age-related patterns, or changes in those patterns over time, evident in Twitter posts from a variety of English authors. This work presents a growable data set annotated by Twitter users themselves for age and other useful attributes. The study also presents an extension of prior work on Twitter abbreviation patterns, showing that word and phrase abbreviation patterns can be used toward determining user age. Notable results include classification accuracy of up to 83%, which was 63% above the relative majority class baseline (ZeroR in Weka), when classifying user ages into 6 equally sized age bins using a multilayer perceptron network classifier.

Library of Congress Subject Headings

Twitter--Research; Computational linguistics; Abbreviations--Data processing; Language and languages--Age differences; Text processing (Computer science)

Publication Date

5-20-2013

Document Type

Thesis

Department, Program, or Center

Computer Science (GCCIS)

Advisor

Alm, Cecilia

Advisor/Committee Member

Bailey, Reynold

Comments

Note: imported from RIT’s Digital Media Library running on DSpace to RIT Scholar Works. Physical copy available through RIT’s Wallace Library at: HM743.T95 M67 2013

Campus

RIT – Main Campus

Plan Codes

COMPSCI-MS
