This dissertation develops a general framework for modeling user behavior from transactional data, using decentralized finance (DeFi) as a motivating case study. DeFi protocols such as Aave offer a rare opportunity to study large-scale, real-world financial behavior through publicly available transaction-level data. However, this data is complex: it is high-dimensional, heterogeneous, and irregularly timestamped. To address the resulting modeling challenges, this dissertation explores a range of methods, including clustering, survival analysis, transformer-based representation learning, and code generation with large language models (LLMs).
We begin by characterizing user behaviors in Aave using quarterly address-level summaries and unsupervised clustering, revealing dominant behavioral archetypes and their evolution over time. Building on this, we introduce a novel survival analysis framework tailored to DeFi, modeling event timing from raw transaction sequences and uncovering patterns in loan repayments, liquidations, and platform usage.
These insights motivate the creation of FinSurvival, a benchmark suite of 16 large-scale survival prediction and classification tasks. FinSurvival is the first publicly available benchmark of its kind in finance, and we show that standard machine learning methods often outperform deep learning models in this high-censoring environment.
To explore learned representations, we develop Large Transaction Models (LTMs). These transformer-based models generate embeddings of transaction sequences. We evaluate the effectiveness of these embeddings on FinSurvival tasks, demonstrating that they can improve performance in classification settings relative to both raw and hand-engineered features.
Finally, we present a benchmark for evaluating the ability of LLMs to generate code for analyzing transaction data, demonstrating the feasibility of natural-language-driven automation of data querying and transformation.
Collectively, this work contributes new methods, datasets, and evaluation tools for behavioral modeling with transaction data. It highlights the challenges of modeling such data, the tradeoffs between hand-engineered and learned representations, and the promise of AI models for analyzing it.
