This thesis examines the accuracy and usefulness of large language models (LLMs) as intelligent tax advisors for individual tax preparation in the United States. The study assesses the performance of LLMs from OpenAI, Anthropic, and DeepSeek, particularly when employing a Retrieval-Augmented Generation (RAG) pipeline that grounds responses in reputable tax sources. The study focuses on common tax-related inquiries about income reporting, credits, deductions, and special tax treatments relevant to IRS Form 1040.
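To make the RAG approach concrete, the sketch below illustrates the basic retrieve-then-prompt pattern. It is a minimal illustration, not the study's actual pipeline: the corpus snippets are invented placeholders (not real IRS text), the keyword-overlap retriever stands in for a production vector search, and the final prompt would be sent to an LLM rather than printed.

```python
import re

# Hypothetical corpus of tax-source snippets (illustrative placeholders,
# not actual IRS guidance).
CORPUS = [
    "The standard deduction for single filers reduces taxable income.",
    "The Earned Income Tax Credit (EITC) is a refundable credit for eligible workers.",
    "Form 1040 is the standard U.S. individual income tax return.",
]

def tokens(text: str) -> set[str]:
    """Lowercase and split text into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank snippets by word overlap with the question; return the top k.
    A real system would use embedding similarity instead of word overlap."""
    q = tokens(question)
    ranked = sorted(corpus, key=lambda s: len(q & tokens(s)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Prepend the retrieved passages so the LLM answers grounded in them."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

question = "What is the standard deduction?"
prompt = build_prompt(question, retrieve(question, CORPUS))
print(prompt)
```

The design point the study relies on is the grounding step: because the model is instructed to answer from retrieved authoritative passages, its responses can be checked against those sources rather than against whatever the base model memorized.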
The GEval evaluation framework, which provides standardized measures of the factual accuracy of generated answers, was used to verify model accuracy. Additionally, VITA tax preparers from Southeastern Louisiana University offered qualitative feedback: these practitioners tested the models on real tax situations to evaluate their usability, readability, and potential integration into tax assistance services for taxpayers.
The results show that although retrieval-augmented LLMs can generate responses that are, overall, accurate and informative, limitations remain, especially in edge cases and intricate filing scenarios. The study also demonstrates the potential of RAG-based LLMs to support taxpayer education and tax preparers, while underscoring the need for further validation, regulation, and prudent deployment in public tax assistance programs.