In this thesis, I investigate how linguists can effectively prepare Interlinear Glossed Text (IGT) data for use with the AGGREGATION grammar inference system, particularly under constraints such as limited time, sparse annotations, and variable corpus quality. AGGREGATION aims to automate the creation of precision grammars in the Head-driven Phrase Structure Grammar (HPSG) framework from IGT, but its output quality depends heavily on input structure and annotation consistency. To explore this, I develop a modeling framework that evaluates how structural and annotation-based features (such as affix ambiguity, type-to-stem ratio, and POS tag source) affect grammar quality across 75,000 grammar runs on 25 datasets. I use both linear mixed-effects models and XGBoost to identify predictors of four key metrics: coverage, ambiguity, morphological complexity, and inference time. Results show that smaller, structurally coherent datasets often outperform larger, noisier ones. Manual POS tags improve coverage and generalization but increase ambiguity, while automatic tags yield cleaner grammars with lower parse success. A case study on Meitei highlights how annotation quality interacts with language-specific features. This work offers practical guidance for preparing IGT data for grammar generation and proposes future improvements to AGGREGATION, including support for structure-aware sampling and multi-version grammar comparison.
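The analysis described above regresses grammar-quality metrics on dataset features. As a minimal sketch of that shape of analysis, the following uses ordinary least squares on synthetic data in place of the thesis's mixed-effects and XGBoost models; all feature names, values, and effect sizes here are illustrative assumptions, not results from the thesis.

```python
import numpy as np

# Simplified stand-in for the thesis's modeling framework: predict a
# grammar-quality metric (coverage) from dataset-level features.
rng = np.random.default_rng(0)
n = 200  # hypothetical number of grammar runs

# Synthetic features (names are illustrative, not the thesis's actual set):
affix_ambiguity = rng.uniform(0, 1, n)   # fraction of ambiguous affixes
type_stem_ratio = rng.uniform(1, 5, n)   # word types per stem
manual_pos = rng.integers(0, 2, n)       # 1 = manually assigned POS tags

# Synthetic response, built so that manual POS tags help coverage and
# affix ambiguity hurts it (mirroring the direction of effects reported):
coverage = (0.6 - 0.2 * affix_ambiguity + 0.05 * manual_pos
            + rng.normal(0, 0.02, n))

# Fit coefficients by least squares (a real analysis would add random
# effects per dataset, as in a linear mixed-effects model).
X = np.column_stack([np.ones(n), affix_ambiguity, type_stem_ratio, manual_pos])
coef, *_ = np.linalg.lstsq(X, coverage, rcond=None)
print(dict(zip(["intercept", "affix_ambiguity", "type_stem_ratio", "manual_pos"],
               coef.round(3))))
```

In a full analysis, the random-effects structure (e.g. per-dataset intercepts) and a gradient-boosted model would be fit with dedicated libraries; this sketch only shows how feature effects on a quality metric can be estimated and signed.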