Abstract

Production Data Quality (PDQ) is a specialized pattern classifier whose main purpose is to assess independently the data quality of a production classifier. It accomplishes this by producing a high quality Truth from the source input, and then using the Truth to identify errors in the production classifier's output data. Previous studies have shown close agreement between PDQ processing outcomes and a particular mathematical model of the system. In this study we describe and analyze an expanded model that addresses the potential tradeoff between Truth error and manual processing in PDQ, with an eye towards informing operational decisions about precision and efficiency. Using statistical data from the 2010 Census PDQ system as input, we examine the predictions of the new model in order to understand their potential usefulness. The outcomes show strong agreement between two methods for estimating Projected Truth error rate, supporting the validity of both methods as well as the existing static model. In addition, the new Projector model gives tight bounds on the projected manual processing rate and reveals a characteristic relationship between Projected Truth error and projected manual processing. We explore a practical application of this model for tuning PDQ, and we find an opportunity to achieve a 60% efficiency increase for the selected sample, while maintaining an acceptable degree of precision.

Library of Congress Subject Headings

Computer software--Testing; Computer software--Reliability; Debugging in computer science; Classification--Data processing; Pattern recognition systems

Publication Date

2011

Document Type

Thesis

Department, Program, or Center

Computer Science (GCCIS)

Advisor

Gaborski, Roger

Comments

Note: imported from RIT’s Digital Media Library running on DSpace to RIT Scholar Works. Physical copy available through RIT’s Wallace Library, call number QA76.76.T48 B35 2011.

Campus

RIT – Main Campus
