Abstract
This dissertation investigates how engaging with stakeholder groups, namely natural language processing (NLP) practitioners and language communities, can contribute to the development of documentation toolkits that are more responsive to the needs of these groups. The development process follows value sensitive design, conducting a series of investigations to learn what these groups need and how iterative improvements to technology can help address those needs. Building from the data statements for NLP Version 1 schema proposed in Bender and Friedman (2018), Dr. Emily M. Bender, Dr. Batya Friedman, and I conduct an empirical investigation and a technical investigation to develop the data statements Version 2 schema by engaging with NLP professionals. To learn about the needs of indigenous and deaf communities with respect to collaborating with researchers, I conduct a retrospective technical investigation in which I analyze ethical guidelines and licenses for the values frequently expressed in these communities’ stated expectations for research collaborations. I then conduct a technical investigation to meld the data statements Version 2 schema, aspects of datasheets for datasets (Gebru et al., 2021), and the results of the retrospective technical investigation into a single toolkit. Rather than documenting existing datasets, the Collaborative Discussions for the Documentation and Design of Linguistic Archival Resources (C3DAR) toolkit is designed to facilitate collaborative partnerships between communities and researchers working to develop language datasets. I conclude with possible future investigations, focusing on community researchers as key stakeholders, and considerations for uptake.