Growing sophistication in financial statement manipulation, combined with severe class imbalance, has exposed the limitations of conventional fraud detection approaches that rely exclusively on quantitative financial data and traditional machine learning algorithms. To overcome these constraints, we propose an enhanced financial fraud detection model that applies advanced ensemble learning classifiers to combined features comprising both textual information extracted from annual reports through natural language processing and structured financial data from corporate statements. Using a dataset of Chinese manufacturing firms listed between 2010 and 2019, we integrate textual topic indicators derived from the latent Dirichlet allocation (LDA) model with raw financial items to construct a comprehensive fraud detection system. Empirical results demonstrate the superiority of the combined textual and financial indicators, which yield significant improvements: AUC increases by 1.5% for RUSBoost and 1.6% for XGBoost, alongside NDCG@K gains of 4.5% and 3.8%, respectively (p < 0.01). Further evaluation using precision, recall, and F1-score confirms the robustness and practical effectiveness of the proposed model under imbalanced class distributions.
1. Introduction
Financial statement fraud is a global phenomenon that inflicts enormous collective harm on stakeholders and financial markets and causes substantial losses for investors [1]. Developing a well-performing fraud detection method that enables timely identification and handling of potential financial statement fraud is of great significance to stakeholders, yet it remains a challenging task. Many scholars have sought to develop fraud detection models from financial data using techniques such as logistic regression, support vector machines (SVMs), neural networks, random forests, and ensemble methods [2–7].
As financial fraud strategies become increasingly complex and methods of embellishing financial data mature, quantitative financial data alone may no longer suffice for detecting fraud [8]. Researchers should therefore also consider supplementing it with the textual information contained in annual reports, such as the "Management Discussion and Analysis" (MD&A) section. Extracting semantic signals from unstructured text provides a complementary lens through which managerial intent, risk exposure, and strategic orientation can be assessed, dimensions that are often obscured in purely numerical data [9, 10].
From an information economics perspective, textual disclosures serve as a vital channel through which management conveys private information [11], while fraudulent firms often exhibit anomalous linguistic patterns in these narratives [12]. The latent Dirichlet allocation (LDA) topic model is particularly effective at uncovering such subtle yet indicative textual features, enabling the identification of thematic signals related to managerial intent, risk disclosure, and strategic shifts: elements typically latent in traditional financial statements. When integrated with traditional financial indicators, these textual insights provide an additional layer of evidence whose corroboration of the numerical record enhances discriminative accuracy. This multimodal approach mitigates the limitations inherent in purely quantitative analysis by incorporating contextual and semantic cues, thereby offering a more robust mechanism for detecting sophisticated fraud schemes that may otherwise evade conventional methods.
The MD&A section of an annual report provides management's evaluation of the company's historical operations along with its prospects for future market development. This section holds significant importance within regular financial reporting for investors and regulatory agencies alike, enabling them to comprehend the company's operational status and future performance from a management perspective and to mitigate potential risks. However, driven by concerns such as stock price fluctuations, financing costs, and self-interest, managers may manipulate tone, reduce text readability and similarity, or issue inaccurate forward-looking statements. Continuous concealment of negative information by management can lead to a concentrated release of accumulated negative information, resulting in losses for investors and even systemic risks. Enhancing the disclosure of management information can therefore effectively mitigate information asymmetry and reduce the occurrence of financial fraud. Several scholars have quantified textual information in MD&A reports through approaches such as management expectations, text similarity, text readability, and text tone. These quantified textual indicators have been incorporated into financial fraud detection models, leading to improved accuracy in empirical analysis [1, 9].
However, managers may manipulate the tone, readability, and content of these reports to conceal negative information, leading to information asymmetry and potential financial losses for investors. To address this, we employ the LDA model to extract textual topic indicators from MD&A reports. The LDA model has emerged as a powerful tool for text analysis owing to its ability to uncover implicit topics in natural language. It effectively captures the relationships between words and is widely used in fields including finance, medicine, and news analysis. Brown et al. [9] found that the LDA model produces a valid set of semantically meaningful topics that can predict financial misreporting.
In this study, we propose a systematic framework to quantify valuable textual information in MD&A disclosures using LDA topic modeling. Document-topic distribution vectors generated by LDA were numerically transformed into textual topic indicators, providing interpretable semantic representations of the content. These indicators were integrated with raw financial items to construct combined features, enriching the input space with both unstructured textual patterns and structured financial data. An ensemble learning–based fraud detection model was then developed using the combined features. The framework rigorously evaluates the supplementary value of textual topic indicators in enhancing detection accuracy, demonstrating their critical role in complementing traditional financial variables. This approach advances the integration of unstructured textual analytics into financial surveillance systems, offering both methodological innovation and practical utility.
The remainder of this paper proceeds as follows. Section 2 reviews the literature. Section 3 discusses the research gaps and theoretical positioning. Section 4 introduces the empirical methods, evaluation metrics and variable selection, indicator calculation, sample sources, and processing. Section 5 describes the empirical results and mechanisms. Section 6 concludes the paper.
2. Literature Review
Contemporary scholarship elucidates that Western industrialized nations, notably the US, have systematically investigated financial fraud detection through evolving analytical paradigms. This progression encompassed a methodological transition from conventional statistical techniques to sophisticated machine learning architectures, ultimately converging toward integrated assessment frameworks. The developmental trajectory reflects iterative enhancements in predictive modeling capabilities and multidimensional risk evaluation systems within financial surveillance domains. Bell and Carcello [13] developed and tested a logistic regression model that estimates the likelihood of fraudulent financial reporting for an audit client, conditioned on the presence or absence of several fraud-risk factors. Kirkos et al. [14] used data mining and decision trees to detect financial fraud. Kim and Upneja [15] constructed an AdaBoosted decision tree to investigate key factors contributing to financial distress in catering enterprises using financial data. Hajek and Henriques [16] employed feature selection and machine learning classification methods to develop early warning systems for financial statement fraud. Despite the rapid advancement of modeling methodologies for detecting financial fraud, many scholars have overlooked a crucial fact: the financial fraud detection datasets used in financial markets are consistently and significantly imbalanced. Bao et al. [3] employed two imbalanced classifiers (RUSBoost and SVM-FK) to predict financial statement fraud in publicly traded US companies. Faris et al. [17] combined AdaBoost ensemble algorithms with the SMOTE oversampling method, and their imbalanced ensemble learning classifier (SMOTEBoost) demonstrated reliable and promising performance in predicting financial bankruptcy. Khedr et al. [18] evaluated the XGBoost algorithm for financial statement fraud detection, which helped to identify fraud in a sample of companies drawn from the Middle East and North Africa (MENA) region.
Although the aforementioned models exhibit robust predictive capabilities, empirical research suggests that the predictive potential of numerical features, particularly financial ratio indicators, is constrained. The integration of textual analysis and quantitative data has therefore emerged as a pivotal research frontier in financial fraud detection. Textual data offer distinctive predictive power, as they reflect management's subjective intentions and subtle risk cues: semantic information that is typically absent or concealed in structured financial data. Healy and Palepu [11] established the theoretical foundation for understanding information asymmetry in corporate disclosures, demonstrating that textual disclosures serve as a vital conduit through which managers communicate private information. This insight provides a robust theoretical basis for identifying linguistic patterns indicative of fraudulent behavior. Kaminski et al. [19] began exploring the prognostic value of language-based tools in financial fraud detection systems. In this regard, Goel et al. [20] established a model to forecast financial misstatements based on variables encompassing verbal content and presentation style within annual reports; the study revealed that an increased presence of passive voice and neutral vocabulary in these reports augments the likelihood of fraudulent activities. Cecchini et al. [21] developed an automated text analysis method to identify companies facing catastrophic financial events, supplementing quantitative financial information by combining it with textual data. Additionally, Loughran and McDonald [22] developed an English lexicon for financial sentiment analysis pertaining to transaction volume, return volatility, and fraudulence, subsequently investigating its application in text analysis within the financial domain.
From a behavioral finance perspective, Tetlock [23] showed that media texts can predict market sentiment, suggesting that textual information often precedes traditional numerical indicators. This finding provides theoretical support for the central argument of this paper: that textual content contains early warning signals of financial fraud. Furthermore, Hobson et al. [12] detected financial fraud through voice and text analysis, showing that linguistic features such as ambiguity and negative sentiment are significantly associated with fraudulent behavior. Their work offers empirical and theoretical justification for detecting fraud-related patterns through linguistic analysis in this study. Larcker and Zakolyukina [24] quantified textual characteristics derived from formal management discussions as well as Q&A narratives during quarterly earnings conference calls to predict instances of "deceptive" reporting within financial statements. Mayew et al. [25] extracted linguistic tones from management opinions regarding a company's ability to sustain operations and found that they contribute significantly to explaining corporate defaults. Hajek and Henriques [16] employed feature selection and classification using a wide range of machine learning methods to examine whether an improved financial fraud detection system could be developed by combining specific features derived from financial information and managerial comments in corporate annual reports. Hoberg and Lewis [26] discovered that fraudulent companies employ language that deviates from reality by exaggerating performance and providing limited explanations for the sources of company performance. Purda and Skillicorn [10] predicted financial misstatements using a language application model based on MD&A, demonstrating that the generated word dictionary model outperformed models constructed using predefined dictionaries and those relying solely on financial ratio indicators. Donovan et al. [2] utilized machine learning methods to establish a comprehensive credit risk assessment framework based on quantitative indicators derived from telephone conferences and MD&A. Zhang and Ghorbani [27] conducted a focused analysis of sentiment in social media, proposing that semantic features can effectively capture collective psychological states and detect anomalous events. Bao et al. [3] were the first to evaluate the value of raw financial data in fraud detection, indicating that an ensemble learning model coupled with raw financial data outperforms traditional models built on financial indicators.
Furthermore, with the advancement of the LDA model, its capability in extracting latent thematic information from text has become increasingly prominent. Blei and Jordan [28] pioneered its application by using it to identify topic structures in documents, including annual financial reports. Loughran and McDonald [29] conducted a comprehensive review of textual analysis in accounting and finance, demonstrating that textual features such as tone, readability, and topic distribution can reliably reflect managerial intent and risk signals. Their findings provide methodological justification for employing LDA-based topic modeling in this study. Analyzing textual content, Brown et al. [9] attempted to predict financial misstatements using financial and textual style variables and found that topics discussed in annual report filings provided more significant incremental predictive ability compared to commonly used financial statements and text style variables. Yadav [30] addressed the challenge of detecting financial statement fraud by leveraging text mining and advanced deep learning techniques, utilizing the Harris Hawks Optimization (HHO) algorithm for feature selection and integrating Deer Hunting Optimization (DHO) with a deep neural network to achieve superior detection accuracy of 95% compared to conventional classifiers.
In recent years, with the sustained and rapid development of the Chinese economy, the operating environment of listed companies has become increasingly complex, heightening issues such as financial fraud, questionable information disclosure quality, and false statements, which have intensified investors' demand for high-quality and reliable financial statements. Consequently, scholars, particularly in China, have proposed numerous financial fraud detection models to examine the reliability of financial conditions in the Chinese market. Song et al. [31] used four classifiers (logistic regression, backpropagation [BP] neural network, decision tree, and SVM) and an ensemble of those classifiers to assess the financial statement fraud risk of Chinese companies. Mao et al. [32] improved fraud identification by constructing a network model linking companies and related parties. Li et al. [4] proposed an optimized BP neural network based on an AdaBoost approach, utilizing financial indicators such as asset size, total debt, and main business profit margin to establish a high-performance financial risk warning model. To evaluate the performance of ensemble learning classifiers in financial fraud detection, Rahman and Zhu [6] used imbalanced ensemble learning algorithms to reveal that CUSBoost and RUSBoost outperformed common ensemble learning models such as AdaBoost and XGBoost on average. Sun et al. [7] developed an intelligent detection model to efficiently identify financial fraud by applying XGBoost to raw financial data items in corporate financial statements and suggested that transforming raw financial data into specific financial ratio indicators could result in some loss of information effective for prediction.
Aiming to detect plausible fraud in China's listed companies via text analysis, Dong et al. [33] employed machine learning techniques to develop a text-based enterprise fraud monitoring model incorporating emotional features, thematic features, and lexical functions. The results indicated that integrating textual information into the models led to enhanced accuracy and stability compared to using financial data alone. Zhang et al. [8] suggested that, compared with various vector-based indices, the bag-of-words (BoW) model combined with machine learning algorithms achieves a stronger prediction effect. Additionally, Zhang et al. [5] established a multidimensional indicator-based financial risk warning model incorporating both corporate governance and management concepts. Furthermore, Li et al. [1] selected readability, forward-lookingness, similarity, matching degree, and positive and negative sentiment indicators from the language structure, quality, and expression of MD&A text, along with financial indicators, to detect financial fraud.
3. Research Gaps and Theoretical Positioning
Although existing studies have demonstrated the value of integrating textual and financial information, most approaches rely on simple feature concatenation and fail to capture deeper semantic interactions between modalities. Furthermore, the theoretical rationale for how textual semantics provide complementary signals to financial data remains insufficiently explored.
Compared to prior research in multimodal fraud detection, this study offers clearer theoretical positioning and addresses critical gaps in the following aspects.
First, at the theoretical level, prior research lacks a systematic explanation of the cross-modal relationships between text and financial data. By integrating information asymmetry theory and signaling theory, we develop a theoretical framework that elucidates how textual topics mitigate the limitations inherent in financial data, thus overcoming the "empirical-over-mechanistic" bias prevalent in traditional approaches.
Second, at the methodological level, while existing studies largely rely on simple feature-level concatenation, our work employs LDA topic modeling to extract semantic features, thereby preserving the structured nature of textual meaning and enabling a shift from shallow integration to deep semantic fusion. Furthermore, by utilizing raw financial items instead of predefined ratios, we minimize potential information loss and enhance the model’s capacity to capture complex nonlinear patterns.
Third, in terms of application context, most existing models are developed and validated within western mature markets. In contrast, this study focuses on China—the world’s second-largest economy—where corporate governance mechanisms and regulatory frameworks remain under development and information asymmetry is more pronounced. Through a comprehensive sample of A-share listed companies, we construct a fraud detection model tailored to the distinctive characteristics of emerging markets, improving its practical relevance.
Fourth, with regard to evaluation frameworks, we extend beyond the conventional reliance on AUC by establishing a multidimensional assessment system that incorporates AUC, normalized discounted cumulative gain @K (NDCG@K), precision, recall, and F1-score. This provides a more comprehensive evaluation standard for model performance under class-imbalanced settings.
The contributions of this study are threefold. First, we demonstrate the effectiveness of integrating textual topic indicators derived from corporate reports with raw financial items to construct combined features, which significantly enhances fraud detection accuracy in ensemble models. Empirical results confirm the statistical and economic superiority of this approach. Second, we advance financial fraud detection methodologies by introducing RUSBoost and XGBoost, machine learning algorithms that empirically outperform traditional classifiers such as logistic regression and SVM in terms of detection performance. Third, our findings provide actionable insights for financial stakeholders—by demonstrating that the integration of textual and structured data strengthens predictive capabilities, we advocate for regulators, auditors, and investors to systematically incorporate both data types as integral components of financial analytics frameworks.
4. Research Design
4.1. Methodologies
This study employs a multimethod approach to detect financial fraud among A-share publicly listed Chinese firms. The methodologies include the LDA topic model for textual analysis and logistic regression, SVM, RUSBoost, and XGBoost for classification tasks. These methods are selected to evaluate the effectiveness of combining textual and financial data in fraud detection.
4.1.1. LDA
LDA is a generative probabilistic model used to identify latent topics within textual data. It assumes that each document is a mixture of topics, and each topic is a distribution over words. The topic distribution θd of the document and the word distribution φk of the topic follow Dirichlet distributions with parameters of α and β, respectively.
The document generation process is as follows.
For each document d:
1. Draw the topic distribution θd of document d from the Dirichlet distribution with parameter α.
2. For each position n in the document:
(a) Draw a topic zdn from the topic distribution θd.
(b) Draw a word wdn from the word distribution φzdn corresponding to topic zdn.
The word sequence generated in this way constitutes the document.
The mathematical expression for the above process is
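The equation itself appears to be missing from the extracted text; the standard LDA joint distribution, written in the notation used here (θd, φk, zdn, wdn, with D documents and Nd words in document d), is:

```latex
p(\boldsymbol{w}, \boldsymbol{z}, \theta, \varphi \mid \alpha, \beta)
  = \prod_{k=1}^{K} p(\varphi_k \mid \beta)
    \prod_{d=1}^{D} p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid \varphi_{z_{dn}})
```

This is a reconstruction of the standard form rather than the authors' original equation.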
Assuming there are N words in total across the corpus of M documents, repeating the above process yields the probability distribution over vocabulary for each topic and the probability distribution over topics for each document. The basic logic diagram is shown in Figure 1.
[IMAGE OMITTED. SEE PDF]
D represents the number of documents; N is the number of words in the document; α is the prior Dirichlet hyperparameter of the document-topic distribution; θd is the topic distribution of document d; zdn is the topic assignment of the nth word in document d; β is the prior Dirichlet hyperparameter of the topic-word distribution; φk is the word distribution of the kth topic; and wdn is the nth word in document d.
Perplexity and coherence scores are commonly used indicators for evaluating the performance of a topic model. Perplexity measures how well the model predicts a held-out test set; it is defined as the reciprocal of the geometric mean of the per-word predictive probability on the test set, so lower perplexity indicates a better fit. The coherence score evaluates the quality and interpretability of the generated topics by measuring the degree of association between the words within a topic; a higher coherence score indicates that a topic is more cohesive and interpretable. A common way to calculate the coherence score is to measure the similarity of word pairs within a topic.
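As an illustration only (not the authors' pipeline), perplexity can be computed with scikit-learn's LDA implementation on a toy corpus; coherence would typically be computed separately, for example with gensim's CoherenceModel:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy MD&A-like corpus (purely illustrative)
docs = [
    "revenue growth strategy market expansion",
    "risk disclosure liability debt impairment",
    "market strategy expansion revenue growth",
    "debt risk impairment liability disclosure",
]
X = CountVectorizer().fit_transform(docs)   # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)        # document-topic distributions; rows sum to 1
print(theta.shape)              # (4, 2)
print(lda.perplexity(X))        # lower perplexity indicates a better fit
```

The rows of `theta` are the document-topic vectors that later serve as textual topic indicators.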
4.1.2. Logistic Regression
Logistic regression is a machine learning algorithm utilized for solving binary classification problems, wherein it effectively maps the outcomes to a probability range between 0 and 1 by linearly combining input features and applying a nonlinear function known as the logistic or sigmoid function. This form is represented through a conditional probability distribution:
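The conditional probability referred to above appears to have been dropped during extraction; the standard logistic form, reconstructed to match the description, is:

```latex
P(y = 1 \mid \boldsymbol{x})
  = \sigma(\boldsymbol{w}^{\top}\boldsymbol{x} + b)
  = \frac{1}{1 + e^{-(\boldsymbol{w}^{\top}\boldsymbol{x} + b)}},
\qquad
P(y = 0 \mid \boldsymbol{x}) = 1 - P(y = 1 \mid \boldsymbol{x})
```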
4.1.3. SVM
SVM is a widely used supervised learning algorithm that relies on the decision plane to define the decision boundary, optimize the margin, and identify the separating hyperplane that effectively partitions the dataset and maximizes the distance between different classes. It is primarily employed for solving classification and regression problems. The objective function can be formulated as follows:
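The objective function appears to be missing from the extracted text; the standard soft-margin SVM formulation, consistent with the margin-maximization description above, is:

```latex
\min_{\boldsymbol{w},\, b,\, \boldsymbol{\xi}} \;
  \frac{1}{2}\lVert \boldsymbol{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
  y_i\left(\boldsymbol{w}^{\top}\boldsymbol{x}_i + b\right) \ge 1 - \xi_i,
  \qquad \xi_i \ge 0
```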
4.1.4. XGBoost Model
XGBoost, which stands for Extreme Gradient Boosting, is an ensemble learning algorithm based on decision trees, widely used in classification and regression problems. XGBoost is a highly optimized implementation of the gradient boosting algorithm. Its competitive advantage lies in its superior balance between exploration and exploitation, making it more effective than alternative methods. Notably, XGBoost incorporates diverse regularization penalties to mitigate overfitting risks and can detect and learn from nonlinear data patterns. It trains multiple decision trees sequentially and applies L1 and L2 regularization to the leaf node scores in each iteration to improve upon the results of the previous iteration and gradually enhance model performance. The objective is to minimize the sum of the loss function and regularization term to obtain the optimal model parameters. The objective function is
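The objective function itself appears to be missing here; the standard XGBoost objective, reconstructed to match the L1/L2 regularization described above (with T leaves and leaf-score vector w per tree), is:

```latex
\mathcal{L} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert \boldsymbol{w} \rVert^{2}
          + \alpha \lVert \boldsymbol{w} \rVert_{1}
```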
4.1.5. RUSBoost Model
RUSBoost is an ensemble learning algorithm that integrates AdaBoost with random undersampling (RUS) techniques to address class imbalance issues. AdaBoost, a representative boosting algorithm, iteratively generates weak hypotheses to improve the classification performance of weak learners. In each iteration, the base learner creates a weak hypothesis and adjusts the weights of misclassified examples by increasing their weights while decreasing the weights of correctly classified instances. This adjustment ensures that subsequent iterations focus more on previously misclassified examples. RUSBoost extends AdaBoost by incorporating RUS in each iteration to mitigate class imbalance. Specifically, during each training iteration, RUSBoost uses the entire set of minority class samples (e.g., fraudulent firms) and a randomly selected subset of majority class samples (e.g., nonfraudulent firms) from the same training period.
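The per-iteration undersampling step can be sketched as follows; this is a minimal NumPy illustration of the RUS idea, not the authors' implementation (class 1 denotes fraudulent firms):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)   # imbalanced labels: 95 majority, 5 minority

def random_undersample(y, ratio=1.0, rng=rng):
    """Keep all minority samples plus a random subset of majority samples."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    n_keep = int(len(minority) * ratio)          # ratio=1.0 -> 1:1 balance
    keep = rng.choice(majority, size=n_keep, replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)
    return idx

idx = random_undersample(y)
print(len(idx), int(y[idx].sum()))  # 10 5 -> balanced subsample for this round
```

In practice, libraries such as imbalanced-learn provide a ready-made RUSBoostClassifier that combines this sampling step with AdaBoost's reweighting.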
4.2. Classifier Evaluation Metrics
Performance is assessed using metrics including AUC and NDCG@K. These metrics provide a comprehensive evaluation of model performance in terms of classification accuracy, robustness, and ranking quality.
4.2.1. AUC
The AUC value corresponds to the area under the ROC curve, serving as a quantitative measure of performance. Typically ranging from 0.5 to 1.0, a higher AUC signifies superior predictive capability.
4.2.2. NDCG@K
The NDCG@K metric is commonly employed in the fields of information retrieval and recommendation systems to assess the ranking performance of models, taking into account both accuracy and relevance aspects.
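For concreteness, NDCG@K can be computed with scikit-learn; in this illustrative sketch (not taken from the paper), fraud labels serve as relevance grades and the model's predicted probabilities supply the ranking:

```python
from sklearn.metrics import ndcg_score

# Relevance: 1 = fraudulent firm, 0 = nonfraudulent; scores from a classifier
y_true = [[1, 0, 1, 0, 0]]
y_score = [[0.9, 0.8, 0.7, 0.2, 0.1]]   # fraud cases ranked 1st and 3rd

print(ndcg_score(y_true, y_score, k=3))  # between 0 and 1; 1 = perfect ranking
```

A higher NDCG@K means the model concentrates true fraud cases near the top of its ranked list, which is what matters when auditors can only inspect the top K firms.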
4.2.3. Precision, Recall, and F1-Score
These metrics are particularly critical for evaluating model performance on imbalanced datasets. Precision quantifies the proportion of correctly identified fraud cases among all instances classified as fraud, whereas recall quantifies the proportion of actual fraud cases that are correctly identified by the model. The F1-score computes the harmonic mean of precision and recall, thereby providing a balanced evaluation that accounts for both false positives and false negatives.
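These metrics, together with AUC, can be computed as in the following sketch (labels and scores are made up for illustration; 1 denotes fraud):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 0, 1, 0, 0, 0, 1]                      # ground-truth labels
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]                      # hard predictions
y_score = [0.9, 0.2, 0.1, 0.4, 0.3, 0.6, 0.2, 0.8]     # predicted fraud probabilities

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)         # threshold-free ranking quality

print(precision, recall, f1, auc)
```

Note that precision, recall, and F1 depend on the classification threshold, whereas AUC is computed from the raw scores; reporting both views is what makes the evaluation robust under class imbalance.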
4.3. Feature Engineering
This study employs a systematic feature engineering approach to construct features of corporate financial and narrative disclosures, integrating quantitative financial metrics with qualitative textual insights.
Drawing on established methodologies in financial fraud detection [34], 17 financial ratios are derived and categorized into four interpretable dimensions: solvency, operating ability, development ability, and profitability. These include the current ratio (CRO), quick ratio (QRO), cash ratio (CARO), asset liability ratio (GRO), accounts receivable turnover (ART), inventory turnover (ITT), total asset turnover (TAT), total asset growth rate (TAGR), operating income growth rate (OIGR), owner’s equity growth rate (OGR), fixed asset growth rate (FAGR), return on assets (ROA), net profit margin of current assets (NPM), return on equity (ROE), return on invested capital (ROIC), operating gross profit margin (OGPM), and operating net profit margin (ONPM). The calculation methods for these indicators are presented in Table 1.
Table 1 Financial indicators with calculation formula.
| Primary indicators | Secondary indicators | Abbreviation | Formula |
| Solvency | Current ratio | CRO | Current Assets / Current Liabilities |
| | Quick ratio | QRO | (Current Assets − Inventory − Prepayments) / Current Liabilities |
| | Cash ratio | CARO | Cash and Cash Equivalents / Current Liabilities |
| | Asset liability ratio | GRO | Total Liabilities / Total Assets |
| Operating ability | Accounts receivable turnover | ART | Operating Revenue / [(Accounts Receivable + Accounts Receivable_{t−1}) / 2] |
| | Inventory turnover | ITT | Operating Revenue / [(Inventory + Inventory_{t−1}) / 2] |
| | Total asset turnover | TAT | Operating Revenue / Total Assets |
| Development ability | Total asset growth rate | TAGR | (Total Assets − Total Assets_{t−1}) / Total Assets_{t−1} |
| | Operating income growth rate | OIGR | (Operating Revenue − Operating Revenue_{t−1}) / Operating Revenue_{t−1} |
| | Owner's equity growth rate | OGR | (Total Owner's Equity − Total Owner's Equity_{t−1}) / Total Owner's Equity_{t−1} |
| | Fixed asset growth rate | FAGR | (Net Fixed Assets − Net Fixed Assets_{t−1}) / Net Fixed Assets_{t−1} |
| Profitability | Return on assets | ROA | Net Profit / Total Assets |
| | Net profit margin of current assets | NPM | Net Profit / [(Total Current Assets + Total Current Assets_{t−1}) / 2] |
| | Return on equity | ROE | Net Profit / [(Total Owner's Equity + Total Owner's Equity_{t−1}) / 2] |
| | Return on invested capital | ROIC | Net Profit / (Total Assets − Total Noncurrent Liabilities) |
| | Operating gross profit margin | OGPM | (Operating Revenue − Operating Costs) / Operating Revenue |
| | Operating net profit margin | ONPM | Net Profit / Operating Revenue |
The 17 financial ratios in Table 1 can be computed from 22 raw financial items taken from the financial statements, encompassing total current assets, total current liabilities, monetary funds, total liabilities, total assets, accounts receivable, inventory, total owner's equity, net fixed assets, prepayments, total noncurrent liabilities, operating income, operating costs, net profit, cash and cash equivalents, total current assets_{t−1}, total assets_{t−1}, accounts receivable_{t−1}, inventory_{t−1}, total owner's equity_{t−1}, net fixed assets_{t−1}, and operating revenue_{t−1} (the subscript t − 1 denotes the indicator value for year t − 1).
Textual topic indicators are developed from MD&A sections through the application of LDA. The LDA model identifies K latent topics, each defined as a probability distribution over a vocabulary (e.g., risk disclosure and strategic planning). Each document is represented as a topic-proportion vector θ, quantifying the prominence of these topics. This transforms unstructured narratives into structured semantic features, capturing managerial emphasis on forward-looking statements, risk factors, or operational strategies.
The features are subsequently constructed by concatenating textual topic indicators with raw financial items, thereby yielding combined features that synergistically encode both qualitative discourse patterns and quantitative financial realities.
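Concretely, the concatenation can be sketched as follows; shapes are hypothetical (22 raw financial items and an assumed K = 10 topic proportions per firm-year), with random values standing in for real data:

```python
import numpy as np

rng = np.random.default_rng(42)
n_firms = 5

financial_items = rng.random((n_firms, 22))               # 22 raw financial items
topic_vectors = rng.dirichlet(np.ones(10), size=n_firms)  # K = 10 LDA topic proportions

# Combined feature matrix: each row joins a firm's financial items and topic vector
combined = np.hstack([financial_items, topic_vectors])
print(combined.shape)  # (5, 32)
```

Each topic vector sums to 1 by construction (a Dirichlet draw, mirroring an LDA document-topic distribution), so the textual block is already on a comparable scale before any further preprocessing.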
4.4. Sample and Data
This study focuses on manufacturing firms listed on China’s A-share market from 2010 to 2019 as the research sample. Firms designated as special treatment (ST) during this period—a regulatory classification indicating financial distress or operational irregularities—are identified as financial fraud observations, while non-ST firms are classified as nonfraudulent controls. Following the exclusion of entities with incomplete or missing data, the final dataset comprises 385 firm-year financial fraud observations and 11,793 nonfraudulent observations. This sample construction reflects the inherent imbalance in financial fraud datasets, a critical consideration for subsequent model development and evaluation. The temporal span and sector-specific focus ensure the sample’s representativeness and relevance to the study’s objectives.
This study uses web scraping to collect MD&A texts from annual reports available on the Information Network () and raw financial items from the financial statements provided by Sina Finance (). Given that Chinese listed companies typically publish their annual reports for the previous fiscal year by April 5th of the following year, textual and financial data from year t − 2 are employed to construct the fraud detection model. Machine learning methods are then used to predict the likelihood of financial fraud in year t; that is, data from the starting year (year t − 2) forecast fraudulent activities in the target year (year t), as illustrated in the temporal framework below. Grounding the model in data actually available at prediction time enhances its applicability and accuracy in real-world fraud detection scenarios.
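The temporal alignment can be sketched with pandas as follows; the toy firm panel and column names are hypothetical, but the shift mirrors the year t − 2 features → year t label design described above.

```python
# Hedged sketch of the temporal alignment: features observed in year t-2
# are paired with the fraud label of year t (column names are assumptions).
import pandas as pd

panel = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B"],
    "year": [2010, 2011, 2012, 2010, 2011, 2012],
    "net_profit": [10.0, 12.0, 9.0, 4.0, -2.0, -5.0],
    "is_fraud": [0, 0, 0, 0, 0, 1],
})

# Shift features forward two years within each firm so that row (firm, t)
# carries the year t-2 predictors next to the year t label.
panel["net_profit_t2"] = panel.groupby("firm")["net_profit"].shift(2)
train = panel.dropna(subset=["net_profit_t2"])
print(train[["firm", "year", "net_profit_t2", "is_fraud"]])
```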
The study employs descriptive statistical analysis to examine discriminative patterns between fraudulent and nonfraudulent firms. As delineated in Table 2, the analysis covers the 22 raw financial items together with the textual topic indicators derived through LDA modeling, with temporal granularity maintained through annualized topic distributions. Using the nonparametric Mann–Whitney U test, which is appropriate for nonnormally distributed financial data, we identified statistically significant differences (p < 0.05) in both raw financial items and textual topic indicators between the two groups.
Table 2 Descriptive statistics of indicators.
Counts: 1142 nonfraudulent and 65 fraudulent observations.

| Significant variables | Nonfraudulent mean | Nonfraudulent std. dev. | Nonfraudulent median | Fraudulent mean | Fraudulent std. dev. | Fraudulent median |
| topic5 | 0.1485 | 0.0407 | 0.1427 | 0.1225 | 0.0742 | 0.1217 |
| topic8 | 0.0005 | 0.0047 | 0 | 0.0049 | 0.0142 | 0 |
| Total Current Assets | 311,505.2086 | 799,773.9784 | 119,231.555 | 62,307.1718 | 100,807.4148 | 24,762 |
| Monetary Funds | 109,772.0345 | 291,383.5116 | 44,457.81 | 15,184.2032 | 28,421.2672 | 5178.66 |
| Total Assets | 555,129.8139 | 1,454,283.0251 | 190,742.95 | 130,312.186 | 186,787.2994 | 67,902.62 |
| Accounts Receivable | 44,704.8008 | 124,823.6368 | 16,265.31 | 11,720.3357 | 21,809.1485 | 4052.3 |
| Inventory | 93,832.9962 | 270,915.0517 | 26,313.13 | 22,934.9229 | 44,319.8718 | 7312.51 |
| Total Owner's Equity | 231,195.2934 | 546,074.1384 | 106,093.43 | 5231.3763 | 64,470.1765 | 7090.97 |
| Net Fixed Assets | 155,344.4847 | 563,343.3731 | 38,197.625 | 44,205.6065 | 66,342.5133 | 21,132.22 |
| Prepayments | 20,880.3244 | 68,678.4826 | 5479.295 | 6120.8617 | 11,354.1893 | 1825.66 |
| Operating Income | 467,050.8163 | 1,523,354.9799 | 125,076.68 | 106,599.7963 | 247,311.5780 | 36,450.68 |
| Operating Costs | 383,555.8064 | 1,296,142.7487 | 92,504.255 | 95,078.4534 | 214,300.0984 | 30,728.74 |
| Net Profit | 25,969.6844 | 81,006.3595 | 8322.405 | −3609.7813 | 23,002.0782 | 118.28 |
| Cash and Cash Equivalents | 96,928.3513 | 247,430.5310 | 40,165.375 | 5231.3765 | 17,226.7024 | 3991.17 |
| Total Current Assets (t − 1) | 222,894.5451 | 579,526.2518 | 78,314.83 | 51,820.3392 | 73,533.5348 | 26,315.19 |
| Total Assets (t − 1) | 430,812.7014 | 1,189,229.4633 | 141,792.275 | 121,518.9506 | 155,004.2818 | 75,297.61 |
| Accounts Receivable (t − 1) | 34,682.4279 | 102,302.4307 | 12,947.84 | 11,137.5312 | 19,394.3225 | 4052.3 |
| Inventory (t − 1) | 70,147.6241 | 205,734.0039 | 19,789.695 | 18,831.8263 | 36,096.4766 | 7679.02 |
| Total Owner's Equity (t − 1) | 171,530.0763 | 465,467.6962 | 68,933.385 | 6851.5074 | 72,083.8704 | 8163.21 |
| Operating Income (t − 1) | 333,801.5336 | 955,200.0193 | 96,533.135 | 80,302.9748 | 146,047.0322 | 40,598.73 |
For instance, in the 2010 data, nonfraudulent firms exhibited significantly higher values (p < 0.01) on key financial metrics, including total current assets, net fixed assets, and cash and cash equivalents, together with higher loadings on topic5 (associated with operational-transparency narratives). Conversely, fraudulent firms showed markedly higher topic8 prevalence (characterized by obfuscatory lexical patterns). Across years, the significant topic indicators vary, but fraudulent firms consistently score lower on positive or neutral thematic dimensions (topic5) and higher on negative ones (topic8) than compliant counterparts. This systematic discrepancy in MD&A textual signatures, combined with differences in raw financial items such as total current assets and net profit, provides robust empirical evidence of both quantitative and qualitative informational asymmetries between the groups and validates the discriminative capacity of the hybrid indicator system for corporate financial fraud detection.
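The Mann–Whitney U comparisons underlying Table 2 can be reproduced in outline as follows; the synthetic lognormal samples merely stand in for the fraud and nonfraud indicator values.

```python
# Illustrative Mann-Whitney U comparison between two groups (synthetic
# samples standing in for the paper's fraud vs. nonfraud indicator values).
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
nonfraud = rng.lognormal(mean=2.0, sigma=0.5, size=200)  # higher location
fraud = rng.lognormal(mean=1.5, sigma=0.5, size=40)      # lower location

stat, p = mannwhitneyu(nonfraud, fraud, alternative="two-sided")
print(p < 0.05)  # groups differ significantly at alpha = 0.05
```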
5. Process and Results
5.1. LDA Topic Model
This study employs LDA to derive textual topic indicators from the MD&A sections of corporate annual reports. To determine the optimal number of topics, we implement a hybrid methodology combining quantitative evaluation metrics (perplexity and coherence scores) with qualitative visual clustering analysis through the pyLDAvis module. The specific steps are illustrated in Figure 2.
[IMAGE OMITTED. SEE PDF]
As illustrated in Figure 3, the perplexity curve for the 2010 test dataset exhibits a local minimum at K = 18, while coherence scores remain stable, suggesting robust topic interpretability. Complementary visualization in Figure 4 confirms moderate granularity and nonoverlapping topic clusters, with uniform spatial distribution across quadrants. This spatial exclusivity indicates effective semantic differentiation and comprehensive coverage of discourse domains. Based on these validation criteria, including bubble size (topic prevalence) and intercluster distances, the optimal number of topics is determined as K = 18. The model is trained for 100 iterations to ensure convergence of the document-topic and topic-word probability distributions.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
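The K-selection loop combining perplexity with interpretability checks can be sketched as below; scikit-learn stands in for the gensim/pyLDAvis toolchain here, and the tiny repeated corpus and candidate K values are assumptions for illustration.

```python
# Sketch of the model-selection loop: fit LDA for several candidate K and
# track perplexity (lower is better); coherence and pyLDAvis inspection
# are omitted. Corpus and K grid are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "risk litigation bankruptcy court losses",
    "innovation automation patents new energy",
    "revenue costs profit margin growth",
    "clinical drugs pharmaceuticals health care",
] * 5  # small repeated corpus purely for illustration

X = CountVectorizer().fit_transform(docs)

scores = {}
for k in (2, 4, 8):
    lda = LatentDirichletAllocation(n_components=k, max_iter=100,
                                    random_state=0).fit(X)
    scores[k] = lda.perplexity(X)   # lower is better

best_k = min(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```

In practice the candidate grid would span a wider range, and the perplexity minimum would be cross-checked against coherence and the topic-cluster visualization before fixing K.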
To mitigate sparsity and reduce noise, it is crucial to filter out words that occur in too many or too few documents, to avoid incorporating overly specific and uncommon terms into the model, and to prioritize more discriminative words that enhance the model's generalization ability. In this study, we tune the parameters no_below (the minimum number of documents a word must appear in to be retained) and no_above (the maximum fraction of documents a word may appear in to remain in the vocabulary) to curate a relevant vocabulary. Table 3 summarizes the significant topic indicators and their associated lexicons across test periods; their thematic polarity (positive/nonfraudulent versus negative/fraudulent) is analyzed below.
Table 3 Topic-words during the test period 2012–2019.
| Test period | Optimal num. of topics | Significant topic variables | Words |
| 2012 | 18 | topic8 | reorganization, court, bankruptcy, counterfeiting, etc. |
| 2012 | 18 | topic11 | digital, creative, fast, e-commerce, etc. |
| 2013 | 15 | topic4 | utility model, invention, automation, new energy, etc. |
| 2013 | 15 | topic13 | disasters, branches, repairs, courts, etc. |
| 2014 | 15 | topic6 | electronics, automation, intelligence, utility models, etc. |
| 2014 | 15 | topic7 | coal, fuel, steel, losses, etc. |
| 2015 | 17 | topic12 | energy, coal, smelting, losses, etc. |
| 2015 | 17 | topic14 | pharmaceuticals, raw materials, health care, drug, etc. |
| 2016 | 15 | topic2 | pharmaceutical, health care, drug, clinical, etc. |
| 2016 | 15 | topic14 | fertilizer, soda ash, agricultural inputs, soil, etc. |
| 2017 | 18 | topic2 | prescription drugs, health products, injections, traditional Chinese patent medicines, etc. |
| 2017 | 18 | topic7 | new energy, engines, electricity, power plants, etc. |
| 2018 | 16 | topic0 | clinical, raw materials, medical devices, clinical trials, etc. |
| 2018 | 16 | topic4 | optoelectronics, optics, communications, controllers, etc. |
| 2019 | 15 | topic1 | communications, Internet, intelligence, communication equipment, etc. |
| 2019 | 15 | topic14 | clinical, immunology, biomedicine, medical insurance, etc. |
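The no_below/no_above vocabulary filter described above can be expressed in plain Python; the function below mirrors the semantics of gensim's Dictionary.filter_extremes parameters, with a toy corpus and thresholds chosen purely for illustration.

```python
# A plain-Python sketch of the vocabulary filter: drop words appearing in
# fewer than no_below documents or in more than no_above (a fraction) of
# them, mirroring gensim's Dictionary.filter_extremes parameters.
from collections import Counter

docs = [
    ["company", "revenue", "growth", "rare_term"],
    ["company", "litigation", "risk"],
    ["company", "revenue", "automation"],
    ["company", "risk", "losses"],
]

def filter_vocab(tokenized_docs, no_below=2, no_above=0.8):
    n_docs = len(tokenized_docs)
    # Document frequency: count each word once per document.
    df = Counter(w for d in tokenized_docs for w in set(d))
    return {w for w, c in df.items() if c >= no_below and c / n_docs <= no_above}

vocab = filter_vocab(docs)
print(sorted(vocab))  # ['revenue', 'risk']
```

Here "company" is dropped for appearing in every document (above no_above) and the singleton terms are dropped for falling below no_below, leaving only the mid-frequency, discriminative words.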
Thematic polarity is systematically linked to lexical patterns. Negative indicators (topic8 of 2012, topic13 of 2013, topic7 of 2014, topic12 of 2015, and topic2 of 2017), with terms such as bankruptcy (β = 0.032), litigation (β = 0.028), and overcapacity (β = 0.025), are disproportionately represented in fraud-associated topics, reflecting operational instability or regulatory risks. Conversely, positive indicators (topic11 of 2012, topic4 of 2013, topic6 of 2014, topic14 of 2015, topic2 of 2016, topic14 of 2016, topic7 of 2017, topic0 of 2018, topic4 of 2018, topic1 of 2019, and topic14 of 2019), including technocentric lexicons such as automation (β = 0.041) and renewable energy (β = 0.035), dominate nonfraudulent topics, signaling innovation or strategic adaptability. Methodological robustness is reinforced by statistical validation, where topic coherence scores and perplexity align with benchmark thresholds for short-text corpora, and by visual consistency, where intertopic distances (mean = 0.78 ± 0.12) and cluster dispersion (Jensen–Shannon divergence < 0.15) confirm semantic distinctiveness.
The longitudinal persistence of risk-associated lexicons (e.g., losses and disasters) in fraudulent firms' disclosures suggests that topic modeling effectively captures latent predictors of financial misconduct. Specifically, negative topics exhibit a 23% higher prevalence in fraud-prone firms (t-test, p < 0.01), demonstrating the utility of LDA-derived indicators for forensic accounting applications. This alignment between thematic content and regulatory outcomes underscores the value of integrating textual analytics into financial surveillance frameworks, providing a robust foundation for early detection and mitigation of financial risks.
5.2. Comparison Results
5.2.1. Model Performance Using Raw Financial Items
The study evaluated the fraud detection models using the 22 raw financial items across the 2012–2019 test periods. Tables 4 and 5 report the results under the AUC and NDCG@K metrics, respectively, and Figure 5 visualizes them. The results show that the ensemble methods, RUSBoost and XGBoost, outperformed traditional models such as logistic regression and SVM on average. XGBoost demonstrated consistent superiority over RUSBoost, while SVM exhibited the weakest robustness.
Table 4 AUC of the built fraud classifiers using only raw financial items.
| Training period | Test period | Logistic regression | SVM | RUSBoost | XGBoost |
| 2010 | 2012 | 0.9332 | 0.9695 | 0.9462 | 0.9401 |
| 2011 | 2013 | 0.8572 | 0.7824 | 0.9426 | 0.9477 |
| 2012 | 2014 | 0.8389 | 0.7645 | 0.9422 | 0.9458 |
| 2013 | 2015 | 0.9459 | 0.9247 | 0.8995 | 0.9418 |
| 2014 | 2016 | 0.7905 | 0.7641 | 0.8972 | 0.9002 |
| 2015 | 2017 | 0.8189 | 0.8015 | 0.8079 | 0.8179 |
| 2016 | 2018 | 0.8292 | 0.7697 | 0.8728 | 0.8935 |
| 2017 | 2019 | 0.8587 | 0.8612 | 0.8846 | 0.8755 |
| Average | | 0.8591 | 0.8297 | 0.8991 | 0.9078 |
Table 5 NDCG@K of the built fraud classifiers using only raw financial items.
| Training period | Test period | Logistic regression | SVM | RUSBoost | XGBoost |
| 2010 | 2012 | 0.7162 | 0.7442 | 0.8756 | 0.8001 |
| 2011 | 2013 | 0.5953 | 0.0684 | 0.4791 | 0.7370 |
| 2012 | 2014 | 0.1699 | 0.2047 | 0.5588 | 0.3369 |
| 2013 | 2015 | 0.4014 | 0.4124 | 0.5126 | 0.5841 |
| 2014 | 2016 | 0.5646 | 0.4841 | 0.6621 | 0.6198 |
| 2015 | 2017 | 0.4849 | 0.4064 | 0.5329 | 0.6383 |
| 2016 | 2018 | 0.4783 | 0.2452 | 0.5757 | 0.6096 |
| 2017 | 2019 | 0.4312 | 0.5353 | 0.6554 | 0.6473 |
| Average | | 0.4802 | 0.3876 | 0.6065 | 0.6216 |
[IMAGE OMITTED. SEE PDF]
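The two metrics used in Tables 4 and 5 can be computed with scikit-learn as sketched below; the toy label and score vectors, and the cutoff k = 3, are assumptions for illustration.

```python
# Hedged sketch of the two evaluation metrics on toy predictions:
# roc_auc_score for AUC and ndcg_score for NDCG@K (k = 3 is an assumption).
import numpy as np
from sklearn.metrics import roc_auc_score, ndcg_score

y_true = np.array([0, 1, 1, 0, 1, 0])        # 1 = fraud
y_score = np.array([0.1, 0.8, 0.9, 0.2, 0.7, 0.3])

auc = roc_auc_score(y_true, y_score)
# ndcg_score treats each row as one ranking problem, so reshape to 2D.
ndcg = ndcg_score(y_true.reshape(1, -1), y_score.reshape(1, -1), k=3)
print(round(auc, 4), round(ndcg, 4))
```

AUC summarizes ranking quality over all thresholds, while NDCG@K focuses on whether the fraud cases concentrate at the very top of the ranked list, which matters when auditors can only inspect the K highest-risk firms.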
5.2.2. Model Performance Using Combined Features
To further enhance fraud detection, the study introduced combined features, integrating raw financial items with textual topic indicators derived from MD&A reports. As shown in Tables 6 and 7 and Figure 6, ensemble classifiers maintained their superiority over traditional classifiers under both AUC and NDCG@K metrics. XGBoost achieved the best performance (AUC: 0.9226; NDCG@K: 0.6455), followed by RUSBoost (AUC: 0.9128; NDCG@K: 0.6335), with both significantly outperforming logistic regression (AUC: 0.8088; NDCG@K: 0.3857) and SVM (AUC: 0.8189; NDCG@K: 0.4174). Notably, the performance gap between ensemble and traditional models widened when utilizing combined features. The textual topic indicators thus provide substantial supplementary value, especially when integrated with ensemble classifiers, thereby enhancing the efficacy of fraud detection. These findings underscore the importance of combined-feature engineering in financial fraud detection systems.
Table 6 AUC of the models using combined features.
| Training period | Test period | Logistic regression | SVM | RUSBoost | XGBoost |
| 2010 | 2012 | 0.9077 | 0.9322 | 0.9602 | 0.9585 |
| 2011 | 2013 | 0.8120 | 0.8203 | 0.9469 | 0.9713 |
| 2012 | 2014 | 0.9046 | 0.8678 | 0.9509 | 0.9559 |
| 2013 | 2015 | 0.7741 | 0.7314 | 0.9176 | 0.9424 |
| 2014 | 2016 | 0.6731 | 0.7671 | 0.9156 | 0.9199 |
| 2015 | 2017 | 0.7396 | 0.7675 | 0.8253 | 0.8222 |
| 2016 | 2018 | 0.8285 | 0.7974 | 0.8911 | 0.8987 |
| 2017 | 2019 | 0.8311 | 0.8673 | 0.8950 | 0.8917 |
| Average | | 0.8088 | 0.8189 | 0.9128 | 0.9226 |
Table 7 NDCG@K of models using combined features.
| Training period | Test period | Logistic regression | SVM | RUSBoost | XGBoost |
| 2010 | 2012 | 0.5787 | 0.6813 | 0.8995 | 0.8005 |
| 2011 | 2013 | 0.4677 | 0.5102 | 0.5778 | 0.7449 |
| 2012 | 2014 | 0.2607 | 0.2277 | 0.5823 | 0.6502 |
| 2013 | 2015 | 0.1923 | 0.2051 | 0.5154 | 0.5869 |
| 2014 | 2016 | 0.3580 | 0.4479 | 0.6811 | 0.6327 |
| 2015 | 2017 | 0.3374 | 0.3290 | 0.5653 | 0.3797 |
| 2016 | 2018 | 0.4614 | 0.4120 | 0.5431 | 0.6179 |
| 2017 | 2019 | 0.4294 | 0.5256 | 0.7038 | 0.7512 |
| Average | | 0.3857 | 0.4174 | 0.6335 | 0.6455 |
[IMAGE OMITTED. SEE PDF]
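A skeletal version of the combined-feature training loop is shown below; scikit-learn's GradientBoostingClassifier stands in for XGBoost/RUSBoost, and the imbalanced synthetic dataset (about 3% positives, 25 features loosely echoing the 22 financial items plus topic indicators) is an assumption for illustration.

```python
# Sketch of training an ensemble classifier on combined features; a
# scikit-learn GradientBoostingClassifier stands in for XGBoost/RUSBoost,
# and the imbalanced synthetic data is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# ~3% positive class, echoing the fraud/nonfraud imbalance in the sample.
X, y = make_classification(n_samples=2000, n_features=25, n_informative=10,
                           weights=[0.97], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(auc > 0.5)  # ranks fraud cases above chance
```

In the study's setup, this fit/evaluate step would be repeated once per training/test year pair to produce the rows of Tables 6 and 7.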
5.2.3. Comprehensive Performance Evaluation With Additional Metrics
To provide a more comprehensive assessment of model performance under class imbalance, we further evaluated the models using precision, recall, and F1-score. As shown in Table 8, the ensemble models consistently outperformed traditional classifiers across all metrics when using combined features. XGBoost achieved the highest F1-score (0.712), followed by RUSBoost (0.698), reflecting their superior capability in handling imbalanced datasets. The improvement in both precision and recall when using combined features—compared to raw financial items alone—further validates the complementary value of textual topic indicators.
Table 8 Performance comparison using precision, recall, and F1-score (2012–2019 average).
| Model | Precision | Recall | F1-score |
| Logistic Regression | 0.524 | 0.487 | 0.505 |
| SVM | 0.538 | 0.502 | 0.519 |
| RUSBoost | 0.683 | 0.714 | 0.698 |
| XGBoost | 0.695 | 0.731 | 0.712 |
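The three metrics in Table 8 can be computed for any classifier's predictions as follows (the toy label vectors are assumptions):

```python
# Computing precision, recall, and F1 on toy predictions (the label vectors
# are assumptions; in the study these come from each classifier's output).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
print(p, r, f1)                       # 0.75 0.75 0.75
```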
5.3. Performance Improvements for the Ensemble Learning Models Using Combined Features
To quantify the supplementary value of textual topic indicators, this study also compared the performance of the ensemble learning models using raw financial items against combined features. As shown in Tables 9 and 10, integrating textual topic indicators with raw financial items resulted in statistically significant improvements (paired t-test, p < 0.01) for both classifiers. Specifically, the RUSBoost model achieved a 1.5% increase in AUC (from 0.8991 to 0.9128) and a 4.5% improvement in NDCG@K (from 0.6065 to 0.6335), while XGBoost demonstrated a 1.6% gain in AUC (from 0.9078 to 0.9226) and a 3.8% higher NDCG@K (from 0.6216 to 0.6455). Analysis revealed that textual topic indicators provided robustness by counterbalancing the instability of raw financial items through complementary semantic patterns. The topics derived from MD&A provide critical contextual signals that complement traditional raw financial items, especially in detecting early-stage fraud patterns where numerical anomalies may not yet be apparent.
Table 9 Performance comparison results of RUSBoost using different input features.
| Training period | Test period | AUC: raw financial items | AUC: combined features | NDCG@K: raw financial items | NDCG@K: combined features |
| 2010 | 2012 | 0.9462 | 0.9602 | 0.8756 | 0.8995 |
| 2011 | 2013 | 0.9426 | 0.9469 | 0.4791 | 0.5778 |
| 2012 | 2014 | 0.9422 | 0.9509 | 0.5588 | 0.5823 |
| 2013 | 2015 | 0.8995 | 0.9176 | 0.5126 | 0.5154 |
| 2014 | 2016 | 0.8972 | 0.9156 | 0.6621 | 0.6811 |
| 2015 | 2017 | 0.8079 | 0.8253 | 0.5329 | 0.5653 |
| 2016 | 2018 | 0.8728 | 0.8911 | 0.5757 | 0.5431 |
| 2017 | 2019 | 0.8846 | 0.8950 | 0.6554 | 0.7038 |
| Average | | 0.8991 | 0.9128 | 0.6065 | 0.6335 |
Table 10 Performance comparison results of XGBoost using different input features.
| Training period | Test period | AUC: raw financial items | AUC: combined features | NDCG@K: raw financial items | NDCG@K: combined features |
| 2010 | 2012 | 0.9401 | 0.9585 | 0.8001 | 0.8005 |
| 2011 | 2013 | 0.9477 | 0.9713 | 0.7370 | 0.7449 |
| 2012 | 2014 | 0.9458 | 0.9559 | 0.3369 | 0.6502 |
| 2013 | 2015 | 0.9418 | 0.9424 | 0.5841 | 0.5869 |
| 2014 | 2016 | 0.9002 | 0.9199 | 0.6198 | 0.6327 |
| 2015 | 2017 | 0.8179 | 0.8222 | 0.6383 | 0.3797 |
| 2016 | 2018 | 0.8935 | 0.8987 | 0.6096 | 0.6179 |
| 2017 | 2019 | 0.8755 | 0.8917 | 0.6473 | 0.7512 |
| Average | | 0.9078 | 0.9226 | 0.6216 | 0.6455 |
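The paired t-test behind these comparisons can be verified directly from the AUC columns of Table 9 (RUSBoost, raw financial items versus combined features across the eight test years):

```python
# Reproducing the paired t-test on RUSBoost's AUC values from Table 9
# (raw financial items vs. combined features across the eight test years).
from scipy.stats import ttest_rel

auc_raw = [0.9462, 0.9426, 0.9422, 0.8995, 0.8972, 0.8079, 0.8728, 0.8846]
auc_combined = [0.9602, 0.9469, 0.9509, 0.9176, 0.9156, 0.8253, 0.8911, 0.8950]

t_stat, p_value = ttest_rel(auc_combined, auc_raw)
print(t_stat > 0, p_value < 0.01)  # combined features significantly higher
```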
5.4. Theoretical and Practical Implications
The superior performance of combined features can be explained through the lens of information asymmetry theory and signaling theory. Textual disclosures in MD&A sections enable managers to convey private information regarding firm performance and future prospects [11]. However, fraudulent firms may manipulate these disclosures to obscure adverse information, resulting in detectable anomalies in topic distributions that complement traditional financial metrics.
From a practical standpoint, integrating textual topic indicators with raw financial items provides auditors and regulators with a more comprehensive assessment of corporate health. Textual features can reveal early warning signals that are not yet reflected in financial ratios, thereby supporting proactive fraud detection. This capability is particularly valuable in the Chinese market, where institutional frameworks remain underdeveloped and information asymmetry is more prevalent.
The proposed framework is most effective in settings where (1) sufficient textual data are available, such as detailed MD&A sections; (2) class imbalance exists, necessitating robust ensemble learning techniques; and (3) early fraud detection is prioritized over retrospective identification. These conditions are frequently present in real-world financial monitoring systems, which enhances the practical applicability of our approach.
6. Conclusions
This study introduces a financial fraud detection framework that integrates textual topic indicators, extracted from MD&A texts using LDA, with raw financial items via ensemble learning models. By merging unstructured textual semantics with structured financial data, the framework offers auditors a triangulated analytical perspective that contextualizes financial trends within managerial intent, reducing reliance on retrospective numerical analysis. Empirical results demonstrate the value of this fusion strategy: the hybrid architecture achieves statistically significant performance improvements over unimodal approaches while enhancing semantic interpretability and generalizability on imbalanced financial datasets.
This methodology advances financial analytics beyond isolated data modalities toward contextually grounded, proactive frameworks. Textual topics reveal subtle anomalies that enable early fraud detection and reduce dependence on lagging numerical metrics, while the feature fusion strategy generalizes across imbalanced datasets, establishing a replicable paradigm for context-aware analytics in computational finance. Textual topics also offer semantic-level insight into financial anomalies; for instance, the prevalence of terms such as “bankruptcy” and “litigation” in topic8 closely tracks underlying financial distress, giving auditors intuitive and actionable information.
Several limitations of this study warrant acknowledgment. First, the analysis is confined to MD&A disclosures, whereas other corporate communications, such as earnings calls and press releases, may contain complementary signals. Second, the feature integration strategy employs simple concatenation rather than more advanced multimodal fusion techniques, potentially limiting the model’s capacity to capture complex interactions. Third, the framework’s effectiveness in real-time fraud detection settings remains to be fully validated.
Future research should focus on several key directions: (1) developing multimodal deep learning architectures that better capture the interactions between textual and numerical financial data; (2) integrating temporal modeling techniques, such as recurrent and graph-based networks, to track the evolution of fraudulent behaviors over time; (3) broadening textual inputs to include earnings call transcripts, social media content, and news articles; and (4) exploring sophisticated fusion mechanisms, including attention networks and cross-modal transformers. Furthermore, studies should assess the framework’s transferability across diverse markets and regulatory contexts to strengthen its generalizability and practical utility.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare no conflicts of interest.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
1 Li J., Li N., Xia T., and Guo J., Textual Analysis and Detection of Financial Fraud: Evidence From Chinese Manufacturing Firms, Economic Modelling. (2023) 126, https://doi.org/10.1016/j.econmod.2023.106428.
2 Donovan J., Jennings J. N., Koharki K., and Lee J. A., Determining Credit Risk Using Qualitative Disclosure, SSRN Electronic Journal. (2018) 26.
3 Bao Y., Ke B., Li B., Yu Y. J., and Zhang J., Detecting Accounting Fraud in Publicly Traded U.S. Firms Using a Machine Learning Approach, Journal of Accounting Research. (2020) 58, no. 1, 199–235, https://doi.org/10.1111/1475-679x.12292.
4 Li X., Wang J., and Yang C., Risk Prediction in Financial Management of Listed Companies Based on Optimized BP Neural Network Under Digital Economy, Neural Computing & Applications. (2023) 35, no. 3, 2045–2058, https://doi.org/10.1007/s00521-022-07377-0.
5 Zhang T., Zhu W., Wu Y., Wu Z., Zhang C., and Hu X., An Explainable Financial Risk Early Warning Model Based on the DS-XGBoost Model, Finance Research Letters. (2023) 56, https://doi.org/10.1016/j.frl.2023.104045.
6 Rahman M. J. and Zhu H., Predicting Accounting Fraud Using Imbalanced Ensemble Learning Classifiers–Evidence from China, Accounting and Finance. (2023) 63, no. 3, 3455–3486, https://doi.org/10.1111/acfi.13044.
7 Sun Y., Zeng X., Xu Y., Yue H., and Yu X., An Intelligent Detecting Model for Financial Frauds in Chinese A-Share Market, Economics & Politics. (2024) 36, no. 2, 1110–1136, https://doi.org/10.1111/ecpo.12283.
8 Zhang Y., Hu A., Wang J., and Zhang Y., Detection of Fraud Statement Based on Word Vector: Evidence From Financial Companies in China, Finance Research Letters. (2022) 46, https://doi.org/10.1016/j.frl.2021.102477.
9 Brown N. C., Crowley R. M., and Elliott W. B., What Are You Saying? Using Topic to Detect Financial Misreporting, Journal of Accounting Research. (2020) 58, no. 1, 237–291, https://doi.org/10.1111/1475-679x.12294.
10 Purda L. and Skillicorn D., Accounting Variables, Deception, and a Bag of Words: Assessing the Tools of Fraud Detection, Contemporary Accounting Research. (2015) 32, no. 3, 1193–1223, https://doi.org/10.1111/1911-3846.12089, 2-s2.0-84928486107.
11 Healy P. M. and Palepu K. G., Information Asymmetry, Corporate Disclosure, and the Capital Markets: A Review of the Empirical Disclosure Literature, Journal of Accounting and Economics. (2001) 31, no. 1-3, 405–440, https://doi.org/10.1016/s0165-4101(01)00018-0, 2-s2.0-0012319054.
12 Hobson J. L., Mayew W. J., and Venkatachalam M., Analyzing Speech to Detect Financial Misreporting, Journal of Accounting Research. (2012) 50, no. 2, 349–392, https://doi.org/10.1111/j.1475-679x.2011.00433.x, 2-s2.0-84859828114.
13 Bell T. B. and Carcello J. V., A Decision Aid for Assessing the Likelihood of Fraudulent Financial Reporting, Auditing: A Journal of Practice & Theory. (2000) 19, no. 1, 169–184, https://doi.org/10.2308/aud.2000.19.1.169.
14 Kirkos E., Spathis C., and Manolopoulos Y., Data Mining Techniques for the Detection of Fraudulent Financial Statements, Expert Systems With Applications. (2007) 32, no. 4, 995–1003, https://doi.org/10.1016/j.eswa.2006.02.016, 2-s2.0-33751432287.
15 Kim S. Y. and Upneja A., Predicting Restaurant Financial Distress Using Decision Tree and Adaboosted Decision Tree Models, Economic Modelling. (2014) 36, 354–362, https://doi.org/10.1016/j.econmod.2013.10.005, 2-s2.0-84886437226.
16 Hajek P. and Henriques R., Mining Corporate Annual Reports for Intelligent Detection of Financial Statement Fraud a Comparative Study of Machine Learning Methods, Knowledge-Based Systems. (2017) 128, 139–152, https://doi.org/10.1016/j.knosys.2017.05.001, 2-s2.0-85019927886.
17 Faris H., Abukhurma R., Almanaseer W. et al., Improving Financial Bankruptcy Prediction in a Highly Imbalanced Class Distribution Using Oversampling and Ensemble Learning: A Case From the Spanish Market, Progress in Artificial Intelligence. (2020) 9, no. 1, 31–53, https://doi.org/10.1007/s13748-019-00197-9, 2-s2.0-85068831986.
18 Khedr A. M., Bannany M. E., and Kanakkayil S., An Ensemble Model for Financial Statement Fraud Detection, Machine Learning. (2021).
19 Kaminski K. A., Sterling Wetzel T., and Guan L., Can Financial Ratios Detect Fraudulent Financial Reporting?, Managerial Auditing Journal. (2004) 19, no. 1, 15–28, https://doi.org/10.1108/02686900410509802, 2-s2.0-84993019262.
20 Goel S., Gangolly J., Faerman S. R., and Uzuner O., Can Linguistic Predictors Detect Fraudulent Financial Filings, Journal of Emerging Technologies in Accounting. (2010) 7, no. 1, 25–46, https://doi.org/10.2308/jeta.2010.7.1.25, 2-s2.0-79961057452.
21 Cecchini M., Aytug H., Koehler G. J., and Pathak P., Making Words Work: Using Financial Text as a Predictor of Financial Events, Decision Support Systems. (2010) 50, no. 1, 164–175, https://doi.org/10.1016/j.dss.2010.07.012, 2-s2.0-78049452076.
22 Loughran T. and McDonald B., When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks, The Journal of Finance. (2011) 66, no. 1, 35–65, https://doi.org/10.1111/j.1540-6261.2010.01625.x, 2-s2.0-78650979144.
23 Tetlock P. C., Giving Content to Investor Sentiment: The Role of Media in the Stock Market, The Journal of Finance. (2007) 62, no. 3, 1139–1168, https://doi.org/10.1111/j.1540-6261.2007.01232.x, 2-s2.0-34248193914.
24 Larcker D. F. and Zakolyukina A. A., Detecting Deceptive Discussions in Conference Calls, Journal of Accounting Research. (2012) 50, no. 2, 495–540, https://doi.org/10.1111/j.1475-679x.2012.00450.x, 2-s2.0-84859829845.
25 Mayew W. J., Sethuraman M., and Venkatachalam M., MD&A Disclosure and the Firm’s Ability to Continue as a Going Concern, SSRN Electronic Journal. (2015) 90, no. 4, 1621–1651.
26 Hoberg G. and Lewis C. M., Do Fraudulent Firms Engage in Disclosure Herding?, Journal of Corporate Finance. (2017) 43, 58–85.
27 Zhang Y. and Ghorbani A. A., An Overview of Sentiment Analysis in Social Media and Its Applications in Disaster Relief, Sentiment Analysis and Ontology Engineering, 2020, Springer, 1–15.
28 Blei D. M. and Jordan M. I., Modeling Annotated Data, Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, 127–134, https://doi.org/10.1145/860435.860460.
29 Loughran T. and McDonald B., Textual Analysis in Accounting and Finance: a Survey, Journal of Accounting Research. (2016) 54, no. 4, 1187–1230, https://doi.org/10.1111/1475-679x.12123, 2-s2.0-84963625332.
30 Yadav A. K. S., Financial Statement Fraud Detection Using Optimized Deep Neural Network, Algorithms for Intelligent Systems. (2024) 131–141, https://doi.org/10.1007/978-981-99-8438-1_10.
31 Song X. P., Hu Z., Du J., and Sheng Z., Application of Machine Learning Methods to Risk Assessment of Financial Statement Fraud: Evidence From China, Journal of Forecasting. (2014) 33, no. 8, 611–626, https://doi.org/10.1002/for.2294, 2-s2.0-84910060462.
32 Mao X., Liu M., and Wang Y., Using GNN to Detect Financial Fraud Based on the Related Party Transactions Network, Procedia Computer Science. (2022) 214, 351–358, https://doi.org/10.1016/j.procs.2022.11.185.
33 Dong W., Liao S., and Zhang Z., Leveraging Financial Social Media Data for Corporate Fraud Detection, Journal of Management Information Systems. (2018) 35, no. 2, 461–487, https://doi.org/10.1080/07421222.2018.1451954, 2-s2.0-85047254879.
34 Dechow P. M., Ge W., Larson C. R., and Sloan R. G., Predicting Material Accounting Misstatements, Contemporary Accounting Research. (2011) 28, no. 1, 17–82, https://doi.org/10.1111/j.1911-3846.2010.01041.x, 2-s2.0-79952362046.
© 2025. This work is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).