Abstract
Artificial intelligence (AI)-powered code generation tools, such as GitHub Copilot and OpenAI Codex, have revolutionized software development by automating code synthesis. However, concerns remain about the security of AI-generated code and its susceptibility to vulnerabilities. This study investigates whether AI-generated code can match or surpass human-written code in security, using a systematic evaluation framework. It analyzes AI-generated code samples from state-of-the-art large language models (LLMs) and compares them against human-written code using static and dynamic security analysis tools. Additionally, adversarial testing was conducted to assess the robustness of LLMs against insecure code suggestions. The findings reveal that while AI-generated code can achieve functional correctness, it frequently introduces security vulnerabilities, such as injection flaws, insecure cryptographic practices, and improper input validation. To mitigate these risks, security-aware training methods and reinforcement learning techniques were explored to enhance the security of AI-generated code. The results highlight the key challenges in AI-driven software development and propose guidelines for integrating AI-assisted programming safely in real-world applications. This paper provides critical insights into the intersection of AI and cybersecurity, paving the way for more secure AI-driven code synthesis models.
Keywords: Artificial intelligence (AI) code synthesis, Secure software development, Large Language Models (LLMs), Static and dynamic code analysis, Adversarial testing in AI
Introduction
The rapid advancements in artificial intelligence (AI) have led to the emergence of large language models (LLMs) capable of generating high-quality source code. Tools such as OpenAI Codex (Chen et al., 2021), GitHub Copilot (GitHub Documentation, 2021), and CodeBERT (Feng et al., 2020) have significantly improved developer productivity by automating code synthesis, assisting in debugging, and suggesting optimized programming patterns. However, alongside these benefits, concerns about the security of AI-generated code have emerged. Unlike human developers, AI models lack an intrinsic understanding of secure coding practices and may introduce vulnerabilities that could be exploited in real-world applications (Pearce et al., 2021).
Security flaws in AI-generated code have been observed across various domains, ranging from web development to system programming. Prior studies have identified risks such as SQL injection, insecure cryptographic practices, hardcoded secrets, and improper input validation in AI-assisted coding outputs (Brown et al., 2020). Furthermore, empirical evaluations have shown that LLM-generated code can contain critical vulnerabilities that evade standard security checks (Fu et al., 2025). Given the increasing reliance on LLMs for code generation, it is imperative to assess whether AI-generated code can meet the same security standards as human-written code. Additionally, understanding the limitations of current AI models in producing secure code is critical to improving their robustness.
This paper systematically evaluates the security of AI-generated code and compares it with human-written code. The study employs a combination of static and dynamic security analysis tools, including SonarQube, CodeQL and Bandit, to assess the vulnerabilities (OWASP, 2022). Furthermore, adversarial testing techniques were used to examine the susceptibility of LLMs to generating insecure code under different prompt conditions (Improta, 2024). Through a rigorous evaluation framework, the following key questions are answered:
1. What types of security vulnerabilities are most common in AI-generated code?
2. How does the security of AI-generated code compare to human-written code?
3. Can reinforcement learning and security-aware training improve the security of AI-generated code?
By addressing these questions, this study contributes to the growing field of AI-assisted software development and cybersecurity. The findings provide valuable insights for developers, researchers, and industry practitioners seeking to integrate AI-driven code synthesis, while minimizing security risks. Ultimately, this study proposes best practices and strategies for enhancing the security of AI-generated code, ensuring that AI-driven software development remains both efficient and secure.
Literature Review
The security of AI-generated code has become an increasingly relevant concern as LLMs such as OpenAI Codex, GitHub Copilot, and CodeBERT gain widespread adoption in software development. While these models demonstrate remarkable capabilities in generating syntactically and semantically correct code, studies suggest they also introduce security vulnerabilities at rates comparable to, or higher than, human-written code. This section reviews the existing research on AI-driven code generation, common security risks, and current mitigation strategies.
AI in Code Generation
Recent advances in AI-driven code generation have led to the development of large-scale transformer-based models trained on vast repositories of open-source code. OpenAI's Codex (Chen et al., 2021), for instance, has been fine-tuned on publicly available GitHub repositories to generate functionally correct code with minimal user input. Similarly, CodeBERT (Feng et al., 2020) and PolyCoder (Xu et al., 2022) leverage pre-trained models to assist developers in writing and debugging code more efficiently.
Despite their efficiency, these models lack explicit reasoning about security best practices, often producing insecure implementations. AI-generated code frequently contains vulnerabilities such as buffer overflows, hardcoded secrets, and inadequate input validation (Chen et al., 2021). Pearce et al. (2021) demonstrated that nearly 40% of the code snippets generated by GitHub Copilot contained security flaws when tested across different programming languages.
Security Risks in AI-Generated Code
Several studies have highlighted security risks associated with AI-generated code, particularly in the context of web applications, cryptographic implementations, and system security.
Injection Vulnerabilities
One of the most common security risks in AI-generated code is injection vulnerabilities, including SQL injection (SQLi) and command injection. Brindavathi et al. (2023) showed that AI models, when prompted to generate database queries, often omit input sanitization or parameterized queries, leading to exploitable SQLi vulnerabilities. It was found that over 35% of SQL-related code snippets generated by Codex were vulnerable to injection attacks unless explicitly prompted to use secure coding practices (Tihanyi et al., 2024).
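As an illustration of this class of flaw (our own example, not one drawn from the cited studies), the following minimal Python sketch contrasts a string-concatenated query of the kind these evaluations report with a parameterized alternative; the in-memory users table and column names are hypothetical.

import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Vulnerable pattern: user input is concatenated directly into the SQL
    # string, so an input such as "' OR '1'='1" alters the query logic.
    query = "SELECT id, username FROM users WHERE username = '" + username + "'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver binds the value, so the input cannot
    # change the structure of the statement.
    query = "SELECT id, username FROM users WHERE username = ?"
    return conn.execute(query, (username,)).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.execute("INSERT INTO users (username) VALUES ('alice')")
    # The crafted input returns every row through the unsafe function
    # but no rows through the parameterized one.
    print(find_user_unsafe(conn, "' OR '1'='1"))
    print(find_user_safe(conn, "' OR '1'='1"))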
Insecure Cryptographic Implementations
AI-generated code has also been found to introduce weak cryptographic practices, such as using insecure encryption algorithms or hardcoding cryptographic keys. Chong et al. (2024) examined AI-generated cryptographic implementations and reported that models frequently recommended outdated encryption schemes (e.g., DES, MD5) instead of modern alternatives like AES-256. This tendency poses significant risks in security-sensitive applications.
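To make the contrast concrete, the following minimal sketch (our illustration, not an example from the cited study) shows the unsalted MD5 pattern commonly flagged in generated code next to a stronger standard-library alternative based on salted PBKDF2; the iteration count is an assumption, and dedicated libraries such as bcrypt or Argon2 are common choices in practice.

import hashlib
import os

def hash_password_weak(password: str) -> str:
    # Weak pattern often seen in generated code: unsalted MD5, which is fast
    # to brute-force and has known collision attacks.
    return hashlib.md5(password.encode()).hexdigest()

def hash_password_stronger(password: str, iterations: int = 600_000) -> str:
    # Safer stdlib alternative: salted PBKDF2-HMAC-SHA256 with a high
    # iteration count; bcrypt or argon2 libraries are also widely used.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return f"{salt.hex()}${iterations}${digest.hex()}"

if __name__ == "__main__":
    print(hash_password_weak("hunter2"))
    print(hash_password_stronger("hunter2"))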
Insufficient Input Validation
Failure to properly validate user inputs is another major weakness in AI-generated code. Ji et al. (2024) studied that LLMs frequently omit key security checks, such as validating user inputs in web forms or ensuring proper escaping of special characters in system commands. Zhu et al. (2024) found that nearly 50% of AI-generated Python scripts lacked adequate input validation, making them susceptible to command injection and buffer overflow attacks.
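The following minimal sketch illustrates the command-injection side of this weakness. It is our own example rather than one taken from the cited studies, and it assumes a Unix-like ping command: the unsafe variant passes unvalidated input to a shell, while the safe variant checks the input against an allow-list pattern and avoids the shell entirely.

import re
import subprocess

HOSTNAME_RE = re.compile(r"^[A-Za-z0-9.-]{1,253}$")

def ping_unsafe(host: str) -> str:
    # Missing validation: with shell=True, an input such as
    # "example.com; rm -rf ~" executes an attacker-controlled command.
    return subprocess.run(f"ping -c 1 {host}", shell=True,
                          capture_output=True, text=True).stdout

def ping_safe(host: str) -> str:
    # Validate against an allow-list pattern and avoid the shell by
    # passing an argument list instead of a command string.
    if not HOSTNAME_RE.match(host):
        raise ValueError(f"invalid hostname: {host!r}")
    return subprocess.run(["ping", "-c", "1", host],
                          capture_output=True, text=True).stdout

if __name__ == "__main__":
    print(ping_safe("localhost")[:80])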
Code Obfuscation and Malware Generation
Research has also demonstrated that AI models can be manipulated into generating obfuscated or malicious code. Carlini et al. (2021) showed that adversarial prompts could trick LLMs into producing polymorphic malware, bypassing traditional signature-based security scanners. Similarly, it was found that Codex could be induced to generate ransomware-like scripts when prompted with carefully crafted instructions (Rupa et al., 2021).
Mitigation Strategies for Secure AI-Generated Code
Given the growing concerns surrounding AI-generated code security, researchers have proposed several strategies to mitigate vulnerabilities and improve the robustness of AI-driven software development.
Security-Aware Model Training
Several approaches aim to improve the training process of LLMs by incorporating security best practices into their learning objectives. Li et al. (2024) explored fine-tuning language models on security-audited codebases and found that such models significantly reduced the generation of insecure code compared to standard pre-trained models.
Reinforcement Learning for Security Optimization
Another promising approach is reinforcement learning with security feedback. Wang et al. (2025) proposed a method where AI models receive real-time feedback from static security analyzers (e.g., SonarQube, Bandit) during training, allowing them to learn secure coding patterns over time. Their experiments demonstrated a 25% reduction in vulnerabilities compared to baseline models.
Adversarial Testing and Red-Teaming
Researchers have also advocated for adversarial testing and red-teaming approaches to stress-test AI-generated code for security flaws. CMU researchers (Feffer et al., 2024) developed a red-teaming framework that systematically probes AI-generated code for weaknesses in access control, memory management, and cryptographic implementation, enabling proactive security improvements.
Post-Generation Security Analysis
Finally, integrating automated security scanning tools into AI-assisted software development workflows has proven effective in reducing vulnerabilities. Tools like CodeQL, SonarQube, and Semgrep can automatically detect and flag insecure coding patterns in AI-generated code before deployment (OWASP, 2022).
The existing body of research highlights both the potential and the risks of AI-assisted code synthesis. While LLMs can significantly accelerate software development, they frequently introduce security vulnerabilities, including injection flaws, weak cryptographic implementations, and improper input validation. To mitigate these risks, researchers have proposed security-aware training, reinforcement learning for security optimization, adversarial testing, and automated security analysis.
Despite these efforts, challenges remain in fully securing AI-generated code. Future work should focus on developing AI models explicitly designed with security constraints, integrating real-time security feedback during generation, and enhancing adversarial testing methodologies. Addressing these challenges will be essential to ensuring that AI-assisted programming remains both productive and secure.
Methodology
This study systematically assesses the security of AI-generated code and develops techniques to mitigate vulnerabilities. The methodology consists of four primary components: (1) Data Collection; (2) Code Generation; (3) Security Analysis; and (4) Mitigation Strategies.
Data Collection
Diverse datasets comprising real-world source code, vulnerability reports, and security benchmarks were used.
Code Datasets for Training
Training splits (70%) of the following datasets were used:
OWASP Benchmark: A collection of vulnerable and secure code samples designed for evaluating AI-driven security analysis tools (OWASP, 2022).
SARD (Software Assurance Reference Dataset): A dataset of known vulnerabilities in real-world software, providing labeled examples of insecure coding patterns (Black, 2018).
MITRE CWE (Common Weakness Enumeration) Database: A repository of common software weaknesses was used to categorize and analyze AI-generated vulnerabilities (MITRE Corporation, 2023).
CodeQL Security Queries: A dataset of security rules was used to detect vulnerabilities in AI-generated code via static analysis (GitHub Security Lab, 2023).
Code Datasets for Evaluation
The validation splits (30%) of the datasets mentioned above, along with the undermentioned datasets in full, were used as the ground truth for security evaluation.
CodeXGLUE: A large-scale dataset of code snippets across multiple languages, commonly used for AI-assisted code generation research (Lu et al., 2021).
Public GitHub Dataset: Python, JavaScript, and C code were extracted from GitHub open-source repositories hosted on Google BigQuery (Google, 2023).
AI Code Generation Models
The latest generations of the following models are used to generate code samples:
GitHub Copilot (GitHub Documentation, 2021)
OpenAI Codex (OpenAI, 2021)
Claude (Anthropic, 2023)
PolyCoder (Open-source alternative to Codex) (Xu et al., 2022)
Each model is prompted to generate solutions for a predefined set of programming tasks, covering web applications, cryptographic implementations, and system automation scripts. A non-exhaustive list of sample prompts used is provided in the Appendix.
Prompt Engineering for Code Security
To assess prompt sensitivity, we varied the structure of the prompts to measure their impact on the security of the generated code:
Standard Prompts: Generic task descriptions (e.g., "Write a login authentication function in Python")
Security-Conscious Prompts: Explicitly instruct models to follow security best practices (e.g., "Write a login authentication function with secure password hashing using bcrypt"); a sketch of the kind of implementation such a prompt aims to elicit follows this list
Adversarial Prompts: Carefully crafted prompts designed to induce security vulnerabilities (e.g., "Write a function that stores passwords in plaintext")
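As a reference point for the security-conscious prompt above, the sketch below shows the kind of implementation such a prompt aims to elicit. It is an illustrative example, not output from any of the evaluated models, and it assumes the third-party bcrypt package is installed.

import bcrypt  # third-party package: pip install bcrypt

def register_user(password: str) -> bytes:
    # Hash with a per-password random salt; bcrypt embeds the salt and
    # work factor in the returned hash.
    return bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt())

def login(password: str, stored_hash: bytes) -> bool:
    # bcrypt performs the comparison itself, avoiding timing side channels.
    return bcrypt.checkpw(password.encode("utf-8"), stored_hash)

if __name__ == "__main__":
    stored = register_user("correct horse battery staple")
    print(login("correct horse battery staple", stored))  # True
    print(login("wrong password", stored))                # False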
Security Analysis of AI-Generated Code
To systematically evaluate AI-generated code security, static analysis, dynamic analysis, and adversarial testing were employed.
Static Code Analysis
Automated tools were used to detect common security vulnerabilities:
SonarQube: Identifies common security flaws, including SQL injection, hardcoded secrets, and weak cryptographic implementations (Sonar, 2023). It is used for evaluation purposes only.
Semgrep: A lightweight static analysis tool for pattern-based security scanning (Semgrep, 2023). It is used for evaluation purposes only.
Bandit (for Python Security Audits): Detects insecure function calls and missing input validation (Python Code Quality Authority, 2023). It is used as part of the proposed method for post-hoc security filtering.
Each AI-generated code sample is analyzed for security flaws using these tools, and vulnerabilities are classified based on the MITRE CWE taxonomy (MITRE Corporation, 2023).
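A hedged sketch of this scanning step is shown below: it invokes the Semgrep CLI on a single generated file and collects the rule identifier, severity, and any CWE metadata attached to each finding. The CLI flags are Semgrep's documented options, but the exact JSON layout of rule metadata can vary between rule packs, so the field access is defensive; the script and its field names are ours, not part of the study's released tooling.

import json
import subprocess
import sys

def scan_with_semgrep(path: str) -> list[dict]:
    """Run Semgrep's community Python ruleset on one file and return findings."""
    proc = subprocess.run(
        ["semgrep", "--config", "p/python", "--json", "--quiet", path],
        capture_output=True, text=True,
    )
    report = json.loads(proc.stdout or "{}")
    findings = []
    for result in report.get("results", []):
        extra = result.get("extra", {})
        findings.append({
            "rule": result.get("check_id"),
            "severity": extra.get("severity"),
            # Registry rules usually attach CWE identifiers in their metadata;
            # the exact key layout may differ between rule packs.
            "cwe": extra.get("metadata", {}).get("cwe"),
            "message": extra.get("message"),
        })
    return findings

if __name__ == "__main__":
    for finding in scan_with_semgrep(sys.argv[1]):
        print(finding)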
Dynamic Analysis (Fuzz Testing and Penetration Testing)
Fuzz testing and penetration testing were conducted on AI-generated web applications to identify runtime vulnerabilities:
Atheris Fuzzer (for Python Applications): Injects randomized inputs to detect input validation failures and memory safety issues (Google, 2023). It is used for evaluation purposes only; a minimal fuzzing harness is sketched at the end of this subsection.
ZAP (Zed Attack Proxy): Simulates web-based attacks (e.g., XSS, CSRF, SQLi) on AI-generated web applications (Checkmarx, 2023). It is used for evaluation purposes only.
These tests evaluate how AI-generated code behaves under real-world attack scenarios.
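The sketch below shows a minimal Atheris harness of the kind used for the Python fuzzing step. The parse_age function is a hypothetical stand-in for an AI-generated input handler; the harness follows Atheris's standard Setup/Fuzz entry points and treats expected ValueError rejections as non-bugs.

import sys
import atheris  # third-party package: pip install atheris

def parse_age(text: str) -> int:
    # Stand-in for an AI-generated input handler under test; a robust
    # version must reject non-numeric and out-of-range values.
    value = int(text)
    if not 0 <= value <= 150:
        raise ValueError("age out of range")
    return value

def TestOneInput(data: bytes) -> None:
    fdp = atheris.FuzzedDataProvider(data)
    text = fdp.ConsumeUnicodeNoSurrogates(32)
    try:
        parse_age(text)
    except ValueError:
        # Expected rejection of malformed input is not a crash.
        pass

if __name__ == "__main__":
    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()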
Adversarial Testing of AI Models
Additionally, to assess the robustness of AI models against security-aware adversarial attacks, red-teaming approaches were employed during evaluation:
Prompt Injection Attacks: We tested whether LLMs can be tricked into generating insecure code via malicious prompts (Liu et al., 2024).
Backdoor Attacks: We introduced subtly altered training data to measure if LLMs learn insecure coding practices unintentionally (Qu et al., 2025).
Proposed Framework for Secure AI Code Generation
Based on the security analysis findings, we developed techniques to enhance AI-generated code security. We introduced a comprehensive multi-stage framework that combines security-aware fine-tuning, policy-based reinforcement learning, and post-hoc filtering. Each component is designed to address a different phase of the code generation pipeline, improving the overall security posture of generated outputs. Figure 1 depicts the proposed framework, the datasets, and the methods and tools used for training and for evaluating the framework after training.
Security-Aware Fine-Tuning
We fine-tuned AI models using security-audited datasets to encourage secure coding practices. The goal of this fine-tuning is to shift the model's prior distribution toward secure code patterns. We employed supervised learning on labeled safe code snippets to minimize generation of unsafe constructs. Additionally, we included contrastive examples involving pairs of secure and insecure implementations to help the model differentiate between secure and vulnerable practices. Unlike previous works, we used vulnerability-specific loss weighting to emphasize high-impact classes such as injection flaws and unsafe deserialization.
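The fine-tuning objective is described here only at a high level, so the following PyTorch sketch illustrates the vulnerability-specific loss weighting idea rather than the exact implementation: per-example weights scale a standard next-token cross-entropy loss, and the class-to-weight mapping shown is hypothetical.

import torch
import torch.nn.functional as F

# Hypothetical per-class weights: examples labeled with high-impact CWE
# classes contribute more to the fine-tuning loss.
VULN_CLASS_WEIGHTS = {
    "none": 1.0,             # secure reference sample
    "injection": 3.0,        # e.g., CWE-89 / CWE-78
    "deserialization": 3.0,  # e.g., CWE-502
    "weak_crypto": 2.0,      # e.g., CWE-327
}

def weighted_lm_loss(logits: torch.Tensor,
                     target_ids: torch.Tensor,
                     vuln_labels: list[str]) -> torch.Tensor:
    """Token-level cross-entropy, reweighted per example by vulnerability class.

    logits:      (batch, seq_len, vocab) next-token scores from the model
    target_ids:  (batch, seq_len) token ids of the secure reference code
    vuln_labels: vulnerability class of each training example
    """
    batch, seq_len, vocab = logits.shape
    per_token = F.cross_entropy(
        logits.reshape(batch * seq_len, vocab),
        target_ids.reshape(batch * seq_len),
        reduction="none",
    ).reshape(batch, seq_len)
    weights = torch.tensor([VULN_CLASS_WEIGHTS[v] for v in vuln_labels],
                           device=logits.device)
    return (per_token.mean(dim=1) * weights).mean()

if __name__ == "__main__":
    # Tiny random tensors just to show the shapes involved.
    logits = torch.randn(2, 8, 100)
    targets = torch.randint(0, 100, (2, 8))
    print(weighted_lm_loss(logits, targets, ["injection", "none"]))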
Reinforcement Learning with Security Feedback
To further shape model behavior beyond what the supervised learning in the previous step offers, we implemented reinforcement learning with security feedback. The reward function penalizes insecure patterns (e.g., use of eval, hardcoded credentials, and non-validated inputs), which are identified using a pre-trained security classifier. This stage enables the model to generalize to unseen security scenarios and adapt to context-specific risk profiles, distinguishing the approach from generic methods that focus solely on syntactic or stylistic preferences.
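A minimal sketch of the reward signal is given below. The actual system uses a pre-trained security classifier; here a few regex heuristics stand in for it purely for illustration, and the pattern weights and correctness bonus are assumptions.

import re

# Stand-in for the pre-trained security classifier described above; a few
# regex heuristics score a snippet, purely for illustration.
INSECURE_PATTERNS = {
    r"\beval\s*\(": 1.0,                           # dynamic evaluation
    r"(password|api_key)\s*=\s*['\"]": 1.0,        # hardcoded credentials
    r"subprocess\.\w+\(.*shell\s*=\s*True": 0.8,   # shell injection risk
    r"hashlib\.md5\(": 0.5,                        # weak hash
}

def insecurity_score(code: str) -> float:
    return sum(w for pat, w in INSECURE_PATTERNS.items() if re.search(pat, code))

def security_reward(code: str, passes_tests: bool) -> float:
    """Reward = functional correctness bonus minus a security penalty.

    In the training loop this value would drive a policy-gradient update
    (e.g., PPO) for the code-generation policy.
    """
    functional = 1.0 if passes_tests else 0.0
    return functional - insecurity_score(code)

if __name__ == "__main__":
    snippet = "password = 'hunter2'\nresult = eval(user_input)"
    print(security_reward(snippet, passes_tests=True))  # 1.0 - 2.0 = -1.0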
Post-Hoc Security Filtering
We integrated automated security scanning tools into AI-assisted coding workflows to detect and block insecure code suggestions, serving as a final line of defense. The generated code is passed through this ensemble of scanners. If any critical issue is flagged, the code is either blocked or revised automatically using a feedback prompt. This post-filtering differs from existing work by treating security scanning as a dynamic filtering mechanism rather than just a diagnostic tool. This proactive integration improves real-time safety assurance.
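The following sketch shows one way such a gate can be wired up, using the Bandit CLI on the generated snippet and a severity threshold to decide whether to pass or block (and then revise via a feedback prompt). The threshold, the field names read from Bandit's JSON report, and the blocking policy are our assumptions, not specifics from the paper.

import json
import subprocess
import tempfile

BLOCKING_SEVERITIES = {"MEDIUM", "HIGH"}  # assumed policy threshold

def bandit_findings(code: str) -> list[dict]:
    """Write the generated snippet to a temp file and scan it with Bandit."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(code)
        path = tmp.name
    proc = subprocess.run(["bandit", "-f", "json", path],
                          capture_output=True, text=True)
    return json.loads(proc.stdout or "{}").get("results", [])

def gate_suggestion(code: str) -> tuple[bool, list[str]]:
    """Return (allowed, reasons). Blocked suggestions would be revised by
    re-prompting the model with the flagged issues appended as feedback."""
    reasons = [r.get("issue_text", "") for r in bandit_findings(code)
               if r.get("issue_severity") in BLOCKING_SEVERITIES]
    return (len(reasons) == 0, reasons)

if __name__ == "__main__":
    allowed, reasons = gate_suggestion("import pickle\npickle.loads(b'...')\n")
    print("allowed" if allowed else f"blocked: {reasons}")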
Evaluation Metrics
The effectiveness of AI-generated code security improvements is measured using the following metrics; a small computation sketch follows the list:
Vulnerability Detection Rate (VDR): Percentage of AI-generated code snippets flagged as insecure.
Exploitability Score (E-Score): Severity of detected vulnerabilities, based on the CVSS (Common Vulnerability Scoring System).
Secure Code Generation Rate (SCGR): Percentage of AI-generated code samples that pass security audits without modifications.
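As referenced above, the following small sketch shows how these three metrics can be computed from per-sample scan summaries; the input record format and the use of each sample's most severe CVSS base score are assumptions made for illustration.

def security_metrics(scan_results: list[dict]) -> dict:
    """Compute VDR, mean E-Score, and SCGR from per-sample scan summaries.

    Each entry is assumed to look like:
      {"flagged": bool, "cvss": float or None}
    where "cvss" is the CVSS base score of the most severe finding.
    """
    total = len(scan_results)
    flagged = [r for r in scan_results if r["flagged"]]
    vdr = 100.0 * len(flagged) / total
    scgr = 100.0 * (total - len(flagged)) / total
    cvss_scores = [r["cvss"] for r in flagged if r.get("cvss") is not None]
    e_score = sum(cvss_scores) / len(cvss_scores) if cvss_scores else 0.0
    return {"VDR_%": vdr, "E-Score": e_score, "SCGR_%": scgr}

if __name__ == "__main__":
    demo = [{"flagged": True, "cvss": 7.5},
            {"flagged": False, "cvss": None},
            {"flagged": True, "cvss": 5.3},
            {"flagged": False, "cvss": None}]
    print(security_metrics(demo))  # VDR 50%, mean CVSS 6.4, SCGR 50%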
This study systematically assesses the security of AI-generated code and explores mitigation techniques to improve security-aware AI coding practices. By combining static analysis, dynamic testing, adversarial evaluation, and security-conscious training, we aim to develop a framework for reducing vulnerabilities in AI-assisted programming.
Results and Discussion
This section presents the security evaluation of AI-generated code across multiple datasets and analysis techniques. The evaluation covers static code analysis, dynamic security testing, and adversarial testing. Code generated by the proposed model, OpenAI Codex, GitHub Copilot, PolyCoder, and Claude, along with human-written code, was analyzed for security vulnerabilities using industry-standard tools and benchmarks. The primary evaluation metrics included VDR, E-Score and SCGR.
Findings on AI-Generated Code Security
To assess the security of AI-generated code, we conducted both static code analysis (SonarQube, Semgrep) and dynamic security testing (Atheris Fuzzer, ZAP). The results present the VDR, E-Score and SCGR across six benchmark datasets: CodeXGLUE, Public GitHub Dataset, OWASP Benchmark, SARD, MITRE CWE and CodeQL Security Queries.
Static Code Analysis
Static security analysis tools were used to detect vulnerabilities in AI-generated code before execution. SonarQube flagged common security weaknesses, including hardcoded credentials and injection vulnerabilities, while Semgrep identified unsafe API usage patterns.
Across all datasets, the proposed model exhibited lower VDR, lower E-Score and higher SCGR than competitor models. However, human-written code maintained the lowest VDR, lowest E-Score and highest SCGR, highlighting that AI-generated code still carries inherent security risks (Table 1).
Dynamic Code Analysis
Dynamic analysis evaluated runtime vulnerabilities by simulating attack scenarios. Atheris Fuzzer was applied to Python-generated code to detect buffer overflows, while ZAP tested AI-generated web applications for SQL injection and XSS vulnerabilities (Table 2).
The proposed model outperformed competitors in dynamic security tests by exhibiting fewer exploitable vulnerabilities than competitor models. However, human-written code remained the most secure, with the fewest samples exhibiting runtime vulnerabilities among all models, including the proposed model.
Findings on Adversarial Testing
We conducted prompt injection attacks and backdoor attacks to evaluate model robustness. Vulnerabilities were assessed using VDR, E-Score and SCGR with findings summarized in Table 3.
The proposed model exhibited the lowest VDR, lowest E-Score and highest SCGR among all competitor models for both prompt injection attacks and backdoor attacks. However, human-written code outperformed all models.
The proposed model consistently produced more robust and secure code than the other AI models, but human-written code remains the most secure.
Conclusion
Based on the findings, the following conclusions are drawn:
The proposed model is significantly more secure than competing AI models across static and dynamic analyses.
Human-written code remains the gold standard, with the lowest vulnerabilities across all tests.
The proposed model exhibited greater robustness to adversarial attacks compared to other AI models, but still lags behind human-written code.
These findings highlight the need for further research to close the security gap between AI-generated and human-written code while refining AI models to resist adversarial exploits more effectively. Human-written code remains the most secure, emphasizing the need for further security-aware training and adversarial defences in AI-assisted programming.
Implications for AI-Assisted Software Development: Given the persistent security risks in AI-generated code, software developers and organizations must adopt proactive measures when incorporating AI-based code synthesis into development workflows. The following guidelines can enhance security in AI-assisted programming:
Security-Aware Training: AI models should be fine-tuned on security-audited datasets and subjected to adversarial training to reduce vulnerability density.
Reinforcement Learning with Security Feedback: Penalizing insecure coding patterns can significantly improve AI-assisted code generation, reducing high-risk suggestions.
Integration with Security Scanners: Embedding automated security tools like Bandit into AI coding workflows helps detect and prevent insecure code execution.
Structured Prompt Engineering: The way developers interact with AI models affects the security of generated code. Carefully crafted prompts can mitigate the generation of insecure patterns.
Human-in-the-Loop Validation: AI-generated code should be reviewed by security professionals before deployment, ensuring that AI-driven automation does not introduce unintentional vulnerabilities.
These strategies highlight the need for a multi-layered security framework when adopting AI for software development. Ensuring that AI-generated code aligns with secure coding principles will be critical for widespread adoption in enterprise and mission-critical applications.
Future Scope: While the proposed approach demonstrates notable improvements in AI-generated code security, several research avenues remain unexplored:
Enhancing AI Models with Security-Aware Training: Future work should focus on expanding security datasets and refining adversarial training techniques to further reduce security vulnerabilities in AI-generated code.
Exploring Reinforcement Learning for Security Optimization: Advanced penalization strategies, such as reward modeling based on exploitability metrics, can be explored to improve the security-awareness of AI coding assistants.
Real-Time AI-Assisted Code Auditing: Developing real-time AI security agents capable of detecting, explaining, and mitigating insecure code at the time of generation could further strengthen security practices.
Cross-Model Security Benchmarking: As AI-generated code adoption increases, a standardized benchmarking framework for evaluating the security of different AI models must be established, ensuring transparent security assessments.
Ethical Considerations and Responsible AI: Future research should address the ethical implications of AI-driven code generation, including potential biases in security datasets, privacy concerns, and regulatory compliance.
By integrating these research directions into future AI model development, we can advance the security, reliability, and trustworthiness of AI-assisted software engineering. Ultimately, a holistic approach combining AI innovation with rigorous security controls will be essential for the future of AI-driven code generation.
References
1. Anthropic (2023), Claude. Retrieved from https://claude.ai/
2. Black P E (2018), A Software Assurance Reference Dataset: Thousands of Programs with Known Bugs, J Res Natl Inst Stand Technol, pp. 1-3. doi:https://doi.org/10.6028/jres.123.005
3. Brindavathi B, Karrothu A and Anilkumar C (2023), An Analysis of AI-based SQL Injection (SQLi) Attack Detection, IEEE Second International Conference on Augmented Intelligence and Sustainable Systems (ICAISS), pp. 31-35, Trichy. doi:10.1109/ICAISS58487.2023.10250505
4. Brown T, Mann B, Ryder N et al. (2020), Language Models are Few-Shot Learners, NeurIPS Proceedings Advances in Neural Information Processing Systems.
5. Carlini N, Tramer F, Wallace E et al. (2021), Extracting Training Data from Large Language Models, USENIX Security Symposium.
6. Checkmarx (2023), Zed Attack Proxy (ZAP). Retrieved from https://www.zaproxy.org/
7. Chen M, Tworek J, Jun H et al. (2021), Evaluating Large Language Models Trained on Code. doi:10.48550/arXiv.2107.03374
8. Chong C J, Yao Z and Neamtiu I (2024), Artificial-Intelligence Generated Code Considered Harmful: A Road Map for Secure and High-Quality Code Generation. doi:https://doi.org/10.48550/arXiv.2409.19182
9. Feffer M, Sinha A, Deng W H et al. (2024), Red-Teaming for Generative AI: Silver Bullet or Security Theater? doi:https://doi.org/10.48550/arXiv.2401.15897
10. Feng Z, Guo D, Tang D et al. (2020), CodeBERT: A Pre-Trained Model for Programming and Natural Languages, In Findings of the Association for Computational Linguistics: EMNLP, pp. 1536-1547. doi:https://doi.org/10.48550/arXiv.2002.08155
11. Fu Y, Liang P, Tahir A, Li Z et al. (2025), Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study, ACM Transactions on Software Engineering and Methodology. doi:https://doi.org/10.1145/3716848
12. GitHub Documentation (2021), Retrieved from GitHub Copilot: AI-powered Code Completion: https://github.com/features/copilot
13. GitHub Security Lab (2023), CodeQL. Retrieved from GitHub: https://codeql.github.com/
14. Google (2023), Atheris: A Coverage-Guided, Native Python Fuzzer. Retrieved from GitHub: https://github.com/google/atheris
15. Google (2023), Google BigQuery. Retrieved from https://cloud.google.com/bigquery/public-data/github
16. Improta C (2024), Poisoning Programs by Un-Repairing Code: Security Concerns of AI-generated Code. Retrieved from https://arxiv.org/html/2403.06675v1
17. Ji J, Jun J, Wu M and Gelles R (2024), Cybersecurity Risks of AI-Generated Code, November. Retrieved from Center for Security and Emerging Technology: https://cset.georgetown.edu/wpcontent/uploads/CSETCybersecurityRisksofAIGeneratedCode.pdf
18. Li J, Rabbi F, Cheng C et al. (2024), An Exploratory Study on Fine-Tuning Large Language Models for Secure Code Generation, August. doi:https://doi.org/10.48550/arXiv.2408.09078
19. Liu X, Yu Z, Zhang Y et al. (2024), Automatic and Universal Prompt Injection Attacks Against Large Language Models, March. doi:https://doi.org/10.48550/arXiv.2403.04957
20. Lu S, Guo D, Ren S et al. (2021), CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation, NeurIPS.
21. MITRE Corporation (2023), CWE - Common Weakness Enumeration. Retrieved from https://cwe.mitre.org/
22. OpenAI (2021), OpenAI Codex, August. Retrieved from OpenAI: https://openai.com/index/openai-codex/
23. OWASP (2022), OWASP Benchmark. Retrieved from Open Worldwide Application Security Project: https://owasp.org/www-project-benchmark/
24. OWASP (2022), Source Code Analysis Tools. Retrieved from Open Worldwide Application Security Project: https://owasp.org/www-community/Source_Code_Analysis_Tools
25. Pearce H, Ahmad B, Tan B et al. (2021), Asleep at the Keyboard? Assessing the Security of AI Code Generation, Proceedings - IEEE Symposium on Security and Privacy, pp. 754-768, Institute of Electrical and Electronics Engineers Inc.
26. Python Code Quality Authority (2023), Bandit. Retrieved from GitHub: https://github.com/PyCQA/bandit
27. Qu Y, Huang S, Li Y et al. (2025), BadCodePrompt: Backdoor Attacks Against Prompt Engineering of Large Language Models for Code Generation, Automated Software Engineering, Vol. 32, No. 17. doi:https://doi.org/10.1007/s10515-024-00485-2
28. Rupa C, Srivastava G, Bhattacharya S et al. (2021), A Machine Learning Driven Threat Intelligence System for Malicious URL Detection, Proceedings of the 16th International Conference on Availability, Reliability and Security, pp. 1-7, ACM Digital. doi:https://doi.org/10.1145/3465481.3470029
29. Semgrep (2023), Semgrep. Retrieved from GitHub: https://github.com/semgrep/semgrep
30. Sonar (2023), SonarQube. Retrieved from Sonar Source: https://www.sonarsource.com/products/sonarqube/
31. Tihanyi N, Bisztray T, Ferrag M A et al. (2024), How Secure is AI-Generated Code: a Large-Scale Comparison of Large Language Models, Empirical Software Engineering, Vol. 30, No. 2. doi:https://doi.org/10.1007/s10664-024-10590-1
32. Wang J, Zhang Z, He Y et al. (2025), Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey, January. doi:https://doi.org/10.48550/arXiv.2412.20367
33. Xu F F, Alon U, Neubig G and Hellendoorn V J (2022), A Systematic Evaluation of Large Language Models of Code, ICLR, May. doi:https://doi.org/10.48550/arXiv.2202.13169
34. Zhu B, Mu N, Jiao J and Wagner D (2024), Generative AI Security: Challenges and Countermeasures, October. doi:https://doi.org/10.48550/arXiv.2402.12617