We present a document understanding system in which the arrangement of lines of text and block separators within a document are modeled by stochastic context free grammars. A grammar corresponds to a document genre; our system may be adapted to a new genre simply by replacing the input grammar. The system incorporates an optical character recognition system that outputs characters, their positions and font sizes. These features are combined to form a document representation of lines of text and separators. Lines of text are labeled as tokens using regular expression matching. The maximum likelihood parse of this stream of tokens and separators yields a functional labeling of the document lines. We describe business card and business letter applications.

Date of creation, presentation, or exhibit



©2005 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Document Type

Conference Proceeding

Department, Program, or Center

Computer Science (GCCIS)


RIT – Main Campus