Abstract

Language shows up everywhere. It's in the digital content we circulate online, and it's in our conversations with each other. It's also in the training data and generations of language models, which are increasingly integrated into our everyday lives. Language is powerful because it embeds social identities and beliefs: it expresses who we are, and shapes our understanding of each other. Thus, language is not only a window for understanding society, but also an instrument for defining it. The doctoral research presented in this thesis focuses on computational analyses of language and society. It covers several empirical studies of language, used to inform human-centered language model development and facilitate data-driven studies of people.

This thesis is structured into two parts. In the first, I examine language by and for people. The work here incorporates a sociolinguistic perspective, emphasizing how the social identities of communities may relate to language differences. I map how language varies in communities at scale, and measure whose language is prioritized in the early stages of model development. In the second part, I examine language about people. There, I show how text analysis can characterize discussions and depictions of people. The studies in this part span the social dimensions of gender and race, and demonstrate how methods ranging from word representations to topic modeling can help make sense of socially significant language patterns. Altogether, this thesis ties together multiple ways in which people and language may intersect, and uses computational text analysis to benefit both social scientific inquiry and NLP methodology.

Details

1010268
Literature indexing term
Title
Modeling Language as Social and Cultural Data
Author
Number of pages
167
Publication year
2025
Degree date
2025
School code
0028
Source
DAI-A 87/1(E), Dissertation Abstracts International
ISBN
9798288863738
Committee member
Bleaman, Isaac L.; Jurafsky, Dan; Salehi, Niloufar
University/institution
University of California, Berkeley
Department
Information Management & Systems
University location
United States -- California
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32122966
ProQuest document ID
3232734592
Document URL
https://www.proquest.com/dissertations-theses/modeling-language-as-social-cultural-data/docview/3232734592/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic