
Abstract

Software development is a complex and essential process that underpins nearly every facet of modern life, from everyday applications to critical systems in healthcare, finance, transportation, and more. To support this process, artificial intelligence (AI)-powered assistants have emerged as valuable tools for tasks such as code generation, program repair, and reverse engineering. While existing AI assistants show initial promise, they often struggle to provide accurate support to developers, frequently producing syntactically invalid or semantically incorrect code.

This dissertation proposes a solution to improve the accuracy, reliability, and practicality of AI assistants in supporting software development tasks: enhancing AI techniques with software domain knowledge. Software domain knowledge encompasses a wide range of aspects, including programming language syntax, project-specific context, and developers’ working practices. Four general strategies are explored for injecting domain knowledge into AI models: (1) training on large-scale datasets that implicitly encode domain patterns, (2) designing model architectures that mirror programming language structure, (3) developing learning objectives that enforce syntax and semantics, and (4) applying domain-guided constraints during inference to steer model behavior.

This dissertation presents five research contributions across three critical tasks (program repair, code generation, and reverse engineering) to demonstrate the effectiveness of these strategies. The first two contributions, CURE and KNOD, introduce novel neural architectures that incorporate syntactic and semantic constraints through pre-training and tree-based decoding to reduce invalid code generation in automated program repair. A large-scale empirical study further quantifies the performance gains of large language models (LLMs) in fixing bugs by leveraging pre-trained knowledge and domain-specific fine-tuning. In the domain of code generation, the LeDex pipeline enables LLMs to self-debug their own outputs via a combination of synthetic data, fine-tuning, and reinforcement learning, improving their ability to iteratively refine and correct code. Finally, Nova advances AI understanding of low-level binary code by introducing a hierarchical attention mechanism and a contrastive learning objective to overcome the challenges posed by assembly code's sparsity and compiler variability.

Together, these works show that incorporating domain knowledge can significantly improve the trustworthiness, correctness, and applicability of AI assistants in real-world development settings. This dissertation lays a foundation for building intelligent, domain-aware development tools and outlines key research directions, including repository-level reasoning, unified lifecycle assistance, and efficient model deployment, for the next generation of AI-powered software engineering.

Details

1010268
Business indexing term
Title
Toward Accurate and Practical AI Assistants in Software Development
Author
Number of pages
187
Publication year
2025
Degree date
2025
School code
0183
Source
DAI-B 87/1(E), Dissertation Abstracts International
ISBN
9798290635286
Advisor
Committee member
Zhang, Tianyi; Goldwasser, Dan; Zhang, Xiangyu
University/institution
Purdue University
University location
United States -- Indiana
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32124037
ProQuest document ID
3257520075
Document URL
https://www.proquest.com/dissertations-theses/toward-accurate-practical-ai-assistants-software/docview/3257520075/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic