Introduction
Large Language Models (LLMs) have transformed code generation and programming assistance1, with DeepSeek excelling in particular at logical reasoning and programming tasks2. However, several critical challenges persist in AI-driven programming. First, precisely designing a query for optimal code generation remains difficult3: although models produce syntactically correct code, outcomes depend heavily on the input prompt, which complicates systematic evaluation and improvement. Second, steering models to produce code that meets specific criteria remains problematic4,5; current practice relies on manual prompt crafting and misses opportunities for automated refinement. Third, the lack of robust training methods for incremental improvement of LLMs makes optimization difficult6–8. Recent research addresses these issues through manual refinement and template-based methods9–11, prompt engineering, and chain-of-thought prompting12–14, but these provide static rather than adaptive, feedback-driven solutions. Existing query enhancement strategies rely on predetermined rules, failing to capture the intricate relationship between natural language and code15,16 and lacking systematic refinement based on the quality of the generated code17–19. To address these challenges, we propose a reinforcement learning framework for query optimization with DeepSeek, using LoRA fine-tuning of Qwen to dynamically refine queries based on the similarity between generated and reference code12,20–27. Through empirical analysis on DS100028, we demonstrate significant quality improvements on programming tasks2,29–33, and our extensive evaluation in programming contexts establishes foundations for future research34–39. The main contributions of this paper are:
We propose an RL-based method to refine queries for DeepSeek code generation, learning from the results of generated code.
We use a dual-model design: a learnable refiner (Qwen+LoRA) and a fixed generator (DeepSeek), with LoRA applied to attention projections for efficiency.
We introduce a multi-aspect reward that combines text similarity (BLEU/ROUGE-L/F1/Overlap) and execution signals (unit tests, syntax penalty) to reflect practical code quality.
Large language models for code generation and query enhancement
LLMs have advanced code generation from natural language descriptions to executable code1–3,6, but performance still depends heavily on query quality5. Earlier work improves queries via pseudo-relevance feedback, knowledge-based methods, and semantic parsing24,40–42, and addresses the vocabulary mismatch problem17,18,26. Recent LLM techniques add dense retrieval, query rewriting, knowledge enhancement, and explanations32,38,43,44, with Self-RAG, CoT, and few-shot prompting as strong baselines12,15,16,20. We follow this line of work but make the refiner parametric and trainable with RL.
Parameter-efficient fine-tuning (PEFT) with LoRA
PEFT reduces computation while preserving performance20–23. LoRA adds a low-rank update to frozen weights,
W' = W + ΔW = W + BA,   B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, r ≪ min(d, k)    (1)
reducing trainable parameters and avoiding catastrophic forgetting26,27,29–33. Previous work shows that LoRA maintains general language ability while adapting to code35–39,41,42,44–47. We apply LoRA to the Qwen attention projections for efficient query refinement13,14,34,40,43,48–57.
Generalized preference automation (GPRO) for automated prompt optimization
GPRO frames prompt optimization as RL over prompts1–12,15–33,35–37. Our work differs in three ways: (1) a dual-model pipeline (learnable refiner + fixed generator), (2) LoRA-targeted attention for efficiency, and (3) the option to integrate execution-aware signals alongside text metrics. This design adapts queries from outcome feedback rather than from template engineering alone.
Reinforcement learning for LLM optimization
RL improves LLMs beyond supervised learning through preference modeling and policy optimization1,2,4,5,7–14,28,34,38–57. We adopt a lightweight REINFORCE setup with a baseline and entropy regularization for stability, training only the refiner while keeping the generator fixed, which makes the method practical for code generation scenarios.
Methods
The DeepSeek-Chat and Qwen models were used in this study as the fixed generator and the parametric refiner, respectively. Their use was limited to generating code from queries and refining queries based on rewards. They were not involved in the conceptualization, writing, or analysis of the manuscript.
Figure 1 [Images not available. See PDF.]
Overview of our approach.
Figure 2 [Images not available. See PDF.]
Query enhancement example showing original and enhanced queries alongside their corresponding code.
In this section, we introduce our reinforcement learning approach for query enhancement: we formalize the task, define the reward, and detail LoRA-based training on Qwen. Figure 1 outlines the pipeline: Qwen (with LoRA) refines the original query q into an enhanced query q̂, DeepSeek generates code c, we compare c with the reference code c* to compute the reward R, and we update the refiner. We use the following notation inline: q (original query), q̂ (enhanced query), c (generated code), c* (reference code), and R (reward); LoRA applies a low-rank update ΔW = BA to the attention projection weights. Figure 2 shows a concrete case: the refined query yields more precise code, evaluated against the ground truth via ROUGE-L.
Problem formulation
We model query enhancement as a reinforcement learning problem. Given a natural-language query q, the refiner produces an enhanced query q̂. Let D = {(q_i, c*_i)} be a set of queries with reference code. We define a refiner f_θ : q ↦ q̂ and a fixed generator g : q̂ ↦ c. The goal is to learn θ so that c is close to c* under the reward R:
θ* = arg max_θ E_{(q, c*) ∼ D} [ R(g(f_θ(q)), c*) ]    (2)
We implement f_θ with Qwen (full fine-tuning or LoRA) and train it by RL. We use REINFORCE with a learned baseline b(q) and entropy regularization. Rewards are normalized per batch to reduce variance, and we optimize by gradient descent with clip norm 1.0.
Query enhancement
We study two ways to refine queries: full fine-tuning and LoRA. Full fine-tuning updates all weights W, but it is computationally heavy and may forget pretraining knowledge. LoRA is efficient:
W' = W + ΔW = W + BA    (3)
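The update in Eq. (3) can be sketched numerically. The following is a minimal illustration in plain Python; the matrix sizes, rank, and alpha value are illustrative assumptions, not the paper's actual dimensions.

```python
# A minimal numeric sketch of the LoRA update in Eq. (3), using plain
# Python lists of rows; sizes and hyperparameters here are illustrative.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_update(W, A, B, alpha):
    """W' = W + (alpha / r) * B @ A, with W frozen and only A, B trained."""
    r = len(A)                      # rank = number of rows of A
    delta = matmul(B, A)            # low-rank update B @ A
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

# Frozen 4x4 weight; rank-2 adapters: B is 4x2, A is 2x4.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.1] * 4 for _ in range(2)]
B = [[0.0] * 2 for _ in range(4)]   # B initialised to zero, as in LoRA
W_prime = lora_update(W, A, B, alpha=32)

# With B = 0 the update is a no-op at initialisation: W' == W.
print(W_prime == W)                 # True
```

Only A and B are trained: for a d×k projection this is r(d+k) parameters instead of dk, e.g. roughly 0.4% of the original count for d = k = 4096 and r = 8.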
where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, and r ≪ min(d, k). We apply LoRA to the attention projections. The refiner uses a simple template p(q) to transform q into q̂. We compare our RL-based refiner with two strong baselines: Chain-of-Thought (CoT)12 and Retrieval-Augmented Generation (RAG)58. CoT modifies the prompt with hand-crafted reasoning steps; RAG augments the prompt with the top-k retrieved snippets. Our method learns a parametric refiner (Qwen+LoRA) from outcome rewards, while the generator (DeepSeek) is fixed. For fairness, all methods use the same generator, decoding settings, and metrics (CSS, precision, recall, F1).
Table 1
Query enhancement methods.
Method | Learnable | Retrieval | Execution-aware | Change |
|---|---|---|---|---|
CoT12 | – | – | – | Prompt |
RAG58 | – | ✓ | – | Prompt+Context |
RL4QE | ✓ | – | ✓ | Refiner |
Table 1 contrasts CoT and RAG, which modify prompts heuristically or through the retrieved context (nonlearnable, text only), with RL4QE, which trains a parametric refiner using outcome rewards and execution signals. Here, execution-aware denotes the use of execution outcomes (e.g., syntax / runtime errors, timeouts) as reward or guidance during training/evaluation. RL4QE integrates these signals and guides DeepSeek via LoRA on attention projections.
Reward function
We define a scalar reward. For RQ1, we isolate the text-only rewards by running four separate settings (one per metric): Overlap, ROUGE-L, BLEU-4, and F1.
R = λ · R_text + (1 − λ) · R_exec    (4)
where λ ∈ [0, 1] balances the text reward R_text against the execution-aware reward R_exec.
Overlap (token-set coverage of the reference):
Overlap(c, c*) = |T(c) ∩ T(c*)| / |T(c*)|    (5)
where T(·) denotes the token set.
ROUGE-L (LCS-based F-measure; higher is better):
P_lcs = LCS(c, c*) / |c|,  R_lcs = LCS(c, c*) / |c*|,  ROUGE-L = 2 · P_lcs · R_lcs / (P_lcs + R_lcs)    (6)
BLEU-4 (smoothed 1–4-gram precision with brevity penalty):
BLEU-4 = BP · exp( (1/4) Σ_{n=1}^{4} log p_n )    (7)
where p_n is the smoothed n-gram precision and BP the brevity penalty.
F1 (token-level overlap, order-agnostic):
Precision = |T(c) ∩ T(c*)| / |T(c)|    (8)
Recall = |T(c) ∩ T(c*)| / |T(c*)|    (9)
F1 = 2 · Precision · Recall / (Precision + Recall)    (10)
The text reward for a run is one of the above:
R_text ∈ {Overlap, ROUGE-L, BLEU-4, F1}    (11)
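The four text rewards can be sketched as follows. This is a hedged reference implementation: tokenisation here is simple whitespace splitting and the BLEU smoothing is add-one, whereas the paper's exact tokenizer and smoothing scheme may differ.

```python
# Hedged sketches of the four text rewards; whitespace tokenisation and
# add-one smoothing are assumptions, not the paper's exact choices.
from collections import Counter
import math

def overlap(cand, ref):
    """Token-set coverage of the reference (Eq. 5)."""
    c, r = set(cand.split()), set(ref.split())
    return len(c & r) / len(r) if r else 0.0

def token_f1(cand, ref):
    """Order-agnostic token-level F1 (Eqs. 8-10)."""
    c, r = Counter(cand.split()), Counter(ref.split())
    tp = sum((c & r).values())
    if tp == 0:
        return 0.0
    p, q = tp / sum(c.values()), tp / sum(r.values())
    return 2 * p * q / (p + q)

def rouge_l(cand, ref):
    """LCS-based F-measure (Eq. 6), via dynamic programming."""
    x, y = cand.split(), ref.split()
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xt in enumerate(x):
        for j, yt in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xt == yt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(x), lcs / len(y)
    return 2 * p * r / (p + r)

def bleu4(cand, ref):
    """Smoothed 1-4-gram precision with brevity penalty (Eq. 7)."""
    c, r = cand.split(), ref.split()
    if not c:
        return 0.0
    log_p = 0.0
    for n in range(1, 5):
        cn = Counter(tuple(c[i:i + n]) for i in range(len(c) - n + 1))
        rn = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        match, total = sum((cn & rn).values()), max(sum(cn.values()), 1)
        log_p += math.log((match + 1) / (total + 1)) / 4   # add-one smoothing
    bp = 1.0 if len(c) >= len(r) else math.exp(1 - len(r) / len(c))
    return bp * math.exp(log_p)

code = "df = df.groupby('k').sum()"
print(round(rouge_l(code, code), 2))   # identical strings score 1.0
```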
The execution-aware reward (not used in RQ1) combines unit-test signals:
R_exec = pass_rate − penalty    (12)
penalty = γ_syn · 1[syntax error] + γ_to · 1[timeout]    (13)
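A minimal sketch of this execution-aware signal is shown below, assuming a pass-rate term and a hard syntax penalty; the actual weighting and sandboxing in the released harness may differ, and the penalty value here is illustrative.

```python
# A hedged sketch of Eqs. (12)-(13): unit-test pass rate minus a syntax
# penalty. The penalty weight is an illustrative assumption.

def execution_reward(code, tests, syntax_penalty=1.0):
    """Return pass rate in [0, 1], or a negative penalty if code cannot parse."""
    try:
        compile(code, "<generated>", "exec")
    except SyntaxError:
        return -syntax_penalty          # hard penalty for unparseable code
    env = {}
    try:
        exec(code, env)                  # NOTE: sandbox this in practice
    except Exception:
        return 0.0
    passed = 0
    for test in tests:
        try:
            exec(test, dict(env))        # run each unit test on a copy
            passed += 1
        except Exception:
            pass
    return passed / len(tests) if tests else 0.0

good = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]
print(execution_reward(good, tests))            # 1.0
print(execution_reward("def add(a b)", tests))  # -1.0 (syntax error)
```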
In RQ1 we set λ = 1 and compare the four text rewards (one per run). In other experiments we may include execution feedback with λ < 1.
Reinforcement learning
We train the refiner with REINFORCE. The pipeline is q → (Qwen f_θ) → q̂ → (DeepSeek g) → c → R → update of θ. For efficiency, we separate evaluation (without gradients) from training (with gradients): in evaluation, we generate q̂ and c and compute R with gradient tracking disabled.
We use a learned baseline b(q) and entropy regularization:
L(θ) = −E[ (R − b(q)) · log π_θ(q̂ | q) ] − β · H(π_θ)    (14)
Rewards are normalized per batch to reduce variance. We optimize by gradient descent with clip norm 1.0 and optional gradient accumulation:
θ ← θ − η · clip(∇_θ L(θ), 1.0)    (15)
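The loss and normalisation above can be checked numerically. The sketch below uses plain Python (no autograd) to compute the batch loss value; the entropy coefficient and example numbers are illustrative assumptions.

```python
# A numeric sketch of Eqs. (14)-(15): per-batch reward normalisation, a
# learned baseline, and entropy regularisation. Beta and the sample
# values are illustrative, not the paper's settings.
import math

def normalize(rewards, eps=1e-8):
    """Per-batch reward normalisation to reduce gradient variance."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (math.sqrt(var) + eps) for r in rewards]

def reinforce_loss(log_probs, rewards, baselines, entropies, beta=0.01):
    """L = -(R - b(q)) * log pi(q_hat|q) - beta * H(pi), averaged over batch."""
    rs = normalize(rewards)
    terms = [-(r - b) * lp - beta * h
             for lp, r, b, h in zip(log_probs, rs, baselines, entropies)]
    return sum(terms) / len(terms)

log_probs = [-2.3, -1.7, -3.1]    # log pi(q_hat | q) for each sample
rewards   = [0.42, 0.18, 0.55]    # e.g. BLEU-4 of generated vs. reference code
baselines = [0.3, 0.3, 0.3]       # learned baseline b(q); constant here
entropies = [1.2, 0.9, 1.5]
loss = reinforce_loss(log_probs, rewards, baselines, entropies)
print(round(loss, 4))
```

In a real training step this scalar would be backpropagated through the refiner only, with gradients clipped to norm 1.0 before the optimizer update.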
LoRA fine-tuning updates only the target attention projections, reducing memory. We keep the generator g fixed. The best checkpoint is selected by the validation reward.
Algorithm
We train a refiner (Qwen) with rewards from the code generated by DeepSeek.
Algorithm 1 summarizes the training process: an evaluation pass collects rewards without gradients, after which the refiner is updated via REINFORCE with entropy regularization, using gradient accumulation over G steps and a clip norm of 1.0. The best checkpoint is selected by the average reward on the validation set. The iterative cycle involves the refiner f_θ (Qwen), the fixed generator g (DeepSeek), and the reward R, with LoRA applied to the attention projections for efficiency.
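The overall loop can be outlined with stub components. In this sketch the refiner, generator, and reward are placeholders standing in for Qwen+LoRA, DeepSeek, and BLEU-4 respectively; only the checkpoint-selection structure mirrors Algorithm 1.

```python
# A hedged outline of Algorithm 1 with stub components; the three
# functions below are placeholders, not the paper's models.
import random

def refiner(query):                       # stands in for Qwen + LoRA
    return "Write Python code to " + query

def generator(enhanced_query):            # stands in for the fixed DeepSeek
    return "# code for: " + enhanced_query

def reward(code, reference):              # stands in for BLEU-4 etc.
    return random.random()

def train(queries, references, epochs=3, seed=0):
    random.seed(seed)
    best = (-1.0, None)
    for epoch in range(epochs):
        # 1) evaluation pass (no gradients): collect rewards
        rewards = [reward(generator(refiner(q)), c)
                   for q, c in zip(queries, references)]
        # 2) training pass (with gradients): REINFORCE update of the
        #    refiner only -- omitted here; the generator stays frozen
        avg = sum(rewards) / len(rewards)
        if avg > best[0]:                 # checkpoint selection by
            best = (avg, epoch)           # average validation reward
    return best

best_reward, best_epoch = train(["sort a list"], ["sorted(xs)"])
print(best_epoch in range(3))             # True
```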
Results
Research questions
We carried out a comprehensive series of experiments using the DS1000 dataset to assess the effectiveness of our reinforcement learning-driven query enhancement method for code generation, focusing on three critical research questions.
RQ1: Which reward function design performs best for query enhancement?
RQ2: How does LoRA fine-tuning compare to full fine-tuning?
RQ3: How does our approach perform on different foundation models?
Experimental setting
Table 2
Experimental configuration and dataset information.
Category | Component | Specification |
|---|---|---|
Models | Query Enhancement (Base) | Qwen-7B |
LoRA Configuration | Rank r = 8, α = 32, dropout rate 0.1 | |
Alternative Models | Qwen | |
Fine-tuning Methods | LoRA, Full fine-tuning | |
Code Generation | DeepSeek-Chat via API | |
Hardware | GPU | NVIDIA RTX 3090 (24GB VRAM) |
Memory | 128GB DDR4 RAM | |
Dataset | Source | DS-1000 code generation tasks |
Training/Testing/Validation | 600/200/200 samples | |
Libraries | NumPy, Pandas, PyTorch, TensorFlow, Matplotlib | |
Reward Functions | Overlap | Token overlap between generated and reference code |
ROUGE-L | Longest common subsequence F-measure | |
BLEU | n-gram precision with smoothing | |
F1 | Token-based precision and recall | |
Code Metrics | Syntax, completeness, and correctness | |
Training Parameters | Learning Rate | 1e-4 (Adam) |
Gradient Accumulation | Configurable steps (default: 1) | |
Training Epochs | Early stopping applied | |
Evaluation Metrics | CSS | Code semantic similarity |
Precision | Generation accuracy | |
Recall | Reference coverage | |
F1 Score | Harmonic mean of precision/recall |
We designed experiments on DS-1000 (800 train, 200 test) to answer three research questions. RQ1 compares reward functions (Overlap, ROUGE-L, BLEU-4, F1); RQ2 contrasts LoRA with full fine-tuning; RQ3 evaluates multiple model architectures: Qwen-1.8B-Chat, Qwen1.5-0.5B/1.8B/4B-Chat, Qwen2-0.5B/1.5B-Instruct, and Qwen2.5-0.5B/1.5B/3B-Instruct. All runs use NVIDIA 3090 GPUs, Adam (lr = 1e-4), gradient accumulation (4 steps), and 10 epochs. Table 2 details the complete configuration.
Evaluation metrics
We use the following metrics to report model performance. The Code Similarity Score (CSS) measures the sequence correspondence between the generated code c and the reference c*:
CSS(c, c*) = m(T(c), T(c*))    (16)
Token sequences are obtained by a simple tokenizer:
T(c) = tokenize(c)    (17)
The sequence match ratio m normalizes the longest common (contiguous) token matches:
m(X, Y) = 2M / (|X| + |Y|)    (18)
where M is the total length of the matching contiguous token blocks.
Token-level metrics assess overlap regardless of order:
Precision = |T(c) ∩ T(c*)| / |T(c)|    (19)
Recall = |T(c) ∩ T(c*)| / |T(c*)|    (20)
F1 = 2 · Precision · Recall / (Precision + Recall)    (21)
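One plausible reading of the CSS definition uses Python's difflib, whose ratio() is exactly 2M/(|X|+|Y|) summed over matching contiguous blocks; the tokenizer below is an assumption, since the paper's exact tokenizer is not shown.

```python
# A hedged sketch of CSS (Eqs. 16-18) via difflib.SequenceMatcher; the
# regex tokenizer is an assumption standing in for the paper's tokenizer.
import difflib
import re

def tokenize(code):
    """Simple tokenizer: identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def css(generated, reference):
    """Code Similarity Score from contiguous matching token blocks."""
    return difflib.SequenceMatcher(
        None, tokenize(generated), tokenize(reference)).ratio()

ref = "result = np.sum(arr, axis=0)"
gen = "result = np.sum(arr, axis=1)"
print(css(ref, ref))          # 1.0 for identical code
print(0 < css(gen, ref) < 1)  # True: a one-token change lowers the score
```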
All metrics are in [0, 1]. CSS reflects structural similarity via longest contiguous matches, while Precision/Recall/F1 capture exact token overlap.
Reproducibility
We ensure reproducibility with fixed seeds for Python/NumPy/PyTorch and deterministic flags where available; scripts and configs are released. All reported tables (e.g., Tables 4, 5, 6) use a disclosed fixed seed in captions, and multi-seed runs can be reproduced with the provided configurations. Execution-based evaluation follows a sandboxed harness (2 s per test, safe handling of syntax errors and timeouts) and reports the unit-test pass rate. For external benchmarking, we use the MBPP-lite split under the same generator, decoding setup, and metrics (CSS, Precision, Recall, F1), with the complete harness and configurations released for verification.
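The sandboxed harness described above can be sketched with the standard library: each test runs in a subprocess with a 2-second timeout. This is an illustrative sketch, not the released harness, whose details may differ.

```python
# A hedged sketch of the sandboxed execution harness: subprocess
# isolation with a 2-second timeout and safe syntax-error handling.
import subprocess
import sys

def run_test(code, test, timeout=2.0):
    """Return 'pass', 'fail', 'timeout', or 'syntax_error' for one test."""
    try:
        compile(code, "<generated>", "exec")
    except SyntaxError:
        return "syntax_error"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code + "\n" + test],
            capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "pass" if proc.returncode == 0 else "fail"

code = "def double(x):\n    return 2 * x"
print(run_test(code, "assert double(3) == 6"))   # pass
print(run_test(code, "assert double(3) == 7"))   # fail
print(run_test("while True: pass", "pass"))      # timeout
```

Running each test in its own interpreter keeps infinite loops and crashes in generated code from taking down the evaluation process.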
Experimental analysis
RQ1: Which reward function design performs best for query enhancement?
Figure 3 [Images not available. See PDF.]
Training curves of Qwen-7B with LoRA (r = 8, α = 32) under the Token Overlap metric.
Figure 4 [Images not available. See PDF.]
Training curves of Qwen-7B with LoRA (r = 8, α = 32) under the ROUGE-L score.
Figure 5 [Images not available. See PDF.]
Training curves of Qwen-7B with LoRA (r = 8, α = 32) under the BLEU-4 score.
Figure 6 [Images not available. See PDF.]
Training curves of Qwen-7B with LoRA (r = 8, α = 32) under the F1 score.
Table 3
LoRA hyperparameter configuration for Qwen-7B.
Parameter | Value |
|---|---|
LoRA rank (r) | 8 (recommended for 7B models) |
LoRA alpha (α) | 32 (scaling factor α/r) |
Dropout rate | 0.1 (applied to LoRA weights) |
Bias training | Disabled (none) |
Target modules | Attention projections |
Precision | FP16 (half precision) |
Quantization | 8-bit (LLM.int8() scheme) |
We evaluated four text reward metrics, Overlap, ROUGE-L, BLEU-4, and F1, using the LoRA configuration given in Table 3. As the training curves in Figs. 3, 4, 5, and 6 show, BLEU-4 consistently produces the most stable and highest overall rewards. For the Qwen-7B model, trained on a dataset of 100 samples, the F1 score slightly exceeds BLEU-4; however, for the smaller Qwen-1.8B model, trained on a more limited dataset of 10 samples, BLEU-4 is more robust. We therefore propose BLEU-4 as the default reward metric, with F1 as a viable alternative for applications involving larger datasets and models.
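The Table 3 configuration can be expressed with the Hugging Face peft library. The sketch below is illustrative rather than the released training code: the target module names and the 8-bit loading flag are assumptions that vary across Qwen versions and peft releases.

```python
# A hedged sketch of the Table 3 LoRA configuration using peft; the
# target module names below are assumed Qwen-style attention projection
# names, and API details may vary across library versions.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", load_in_8bit=True)
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                    # rank, recommended for 7B models
    lora_alpha=32,          # scaling factor alpha
    lora_dropout=0.1,       # applied to the LoRA weights
    bias="none",            # bias training disabled
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
)
model = get_peft_model(model, config)
model.print_trainable_parameters()   # only the adapter weights are trainable
```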
RQ2: How does LoRA fine-tuning compare to full fine-tuning?
Table 4
Performance comparison of reward metrics on Qwen models with LoRA.
Reward metric | Model | CSS | Precision | Recall | F1 |
|---|---|---|---|---|---|
Overlap | Qwen-7B (r=8, =1.0) | – | – | – | – |
Overlap | Qwen-1.8B (r=8, =0.1) | 0.0812 | 0.1173 | 0.1859 | 0.1352 |
ROUGE-L | Qwen-7B | – | – | – | – |
ROUGE-L | Qwen-1.8B | 0.0695 | 0.0855 | 0.1950 | 0.1116 |
BLEU-4 | Qwen-7B | 0.0965 | 0.1250 | 0.2770 | 0.1530 |
BLEU-4 | Qwen-1.8B | 0.1398 | 0.1471 | 0.5608 | 0.2120 |
F1 | Qwen-7B | 0.1016 | 0.1220 | 0.2830 | 0.1510 |
F1 | Qwen-1.8B | 0.0625 | 0.1107 | 0.2197 | 0.1404 |
Table 5
Full fine-tuning vs. LoRA on Qwen-1.8B under the BLEU-based reward function.
Method | CSS | Precision | Recall | F1 |
|---|---|---|---|---|
Full fine-tuning (Qwen-1.8B) | 0.1078 | 0.1644 | 0.2028 | 0.1635 |
LoRA (Qwen-1.8B, rank below 8) | 0.0850 | 0.1565 | 0.2054 | 0.1634 |
LoRA (Qwen-1.8B, r = 8) | 0.1398 | 0.1471 | 0.5608 | 0.2120 |
LoRA (Qwen-1.8B, rank above 8) | 0.1091 | 0.1376 | 0.5046 | 0.1988 |
We compare LoRA with full fine-tuning using the results in Tables 4 and 5. Table 4 (LoRA only) shows that, on Qwen-1.8B, BLEU-4 produces the strongest overall metrics (CSS 0.1398, Precision 0.1471, Recall 0.5608, F1 0.2120). Table 5 directly contrasts the methods under the BLEU-based reward: LoRA with r = 8 surpasses full fine-tuning in CSS (0.1398 vs. 0.1078), Recall (0.5608 vs. 0.2028), and F1 (0.2120 vs. 0.1635), while full fine-tuning has the highest precision (0.1644). The lower rank is essentially on par with full fine-tuning in F1 (0.1634 vs. 0.1635), and the higher rank improves over full fine-tuning but underperforms r = 8 (F1 0.1988 vs. 0.2120), indicating diminishing returns beyond r = 8.
RQ3: How does our approach perform on different foundation models?
Table 6
Comparative analysis of foundation models with LoRA (r = 8, α = 32).
Model variant | Precision | Recall | F1 | CSS |
|---|---|---|---|---|
Qwen-1.8B-Chat | 0.0969 | 0.2677 | 0.1121 | 0.0799 |
Qwen1.5-0.5B-Chat | 0.0972 | 0.1908 | 0.1166 | 0.0660 |
Qwen1.5-1.8B-Chat | 0.2301 | 0.6590 | 0.3264 | 0.2154 |
Qwen1.5-4B-Chat | 0.1692 | 0.3365 | 0.2091 | 0.1125 |
Qwen2-0.5B-Instruct | 0.1778 | 0.7240 | 0.2704 | 0.1204 |
Qwen2-1.5B-Instruct | 0.1769 | 0.7699 | 0.2778 | 0.1455 |
Qwen2.5-0.5B-Instruct | 0.2041 | 0.7490 | 0.3070 | 0.1949 |
Qwen2.5-1.5B-Instruct | 0.1501 | 0.7492 | 0.2408 | 0.1152 |
Qwen2.5-3B-Instruct | 0.2250 | 0.7088 | 0.3193 | 0.1956 |
We evaluated our approach on all foundation models with LoRA (r = 8, α = 32). Table 6 shows that architecture matters more than size: Qwen1.5-1.8B-Chat achieves the best balance (CSS 0.2154, F1 0.3264), outperforming larger models (e.g., Qwen1.5-4B-Chat and Qwen2.5-3B-Instruct). The Qwen2/2.5 variants produce high recall (up to 0.7699 for Qwen2-1.5B-Instruct) but less balanced precision and CSS, while the Qwen1.5 series is more consistent, especially the 1.8B model (all metrics in Table 6 are scaled to [0, 1]).
Discussion
Reinforcement learning improves code generation by refining queries42,44,45, helping to close the gap between natural language and effective prompts41,46,47. It turns vague inputs into precise technical prompts48–50, building on established prompting and feedback patterns43,51. A multi-aspect reward is beneficial52–54, while LoRA maintains efficiency with performance comparable to full fine-tuning40,55,56. Differences between model families highlight the importance of code-pretrained foundations13,57. Recent advances in RL offer practical extensions: attention-prioritized replay for sample efficiency59 and weighted mean-field Q-learning for stable aggregation of enhanced queries60. Empirically, rank r = 8 balances capacity, stability, and compute: lower ranks underfit the attention projections, while higher ranks increase parameters and variance, yielding diminishing returns. Limitations include the domain and horizon coverage of DS10005,7,8 and sensitivity to the generator10,11,28. Human feedback can further improve efficiency and reward design9,14,25. Future work will explore more efficient RL16,19,20, hybrid strategies22–24, broader domains26,27,29, personalization30–32, and interactive systems33,35,36 to make AI assistance more accessible across skill levels12,37–39. In summary, we introduced RL4QE, a practical framework that learns to refine queries with a parametric refiner (Qwen+LoRA) while keeping the code generator (DeepSeek) fixed. Across DS-1000 experiments, BLEU-4 emerges as a strong default reward (with F1 competitive at larger scale), and LoRA with r = 8 outperforms full fine-tuning on most metrics while using far fewer trainable parameters. The approach transfers across foundation models and is easy to integrate (LoRA on the attention projections). We release code, seeds, and harnesses to support reproducibility and external verification.
Acknowledgements
This work was supported by Natural Science Project of Guangdong University of Science and Technology (GKY-2025BSQDK-3).
Author contributions
Dawei Yuan conceived the research idea, designed the methodology, and drafted the manuscript. Guojun Liang contributed to the processing and analysis of the data. Tingting Li assisted with the development of the methodology and the review of the manuscript. Suping Liu supervised the project and revised the manuscript. All authors read and approved the final manuscript.
Data availability
The datasets and code used in this study are publicly available at: https://github.com/davidyuan666/RL4QE and https://doi.org/10.6084/m9.figshare.28767299.v2. The DS1000 used for the evaluation can be accessed from the original repository.
Declarations
Competing interests
The authors declare no competing interests.
References
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Dale, R. GPT-3: What’s it good for? Nat. Lang. Eng.; 2021; 27.
2. Ni, A; Yin, P; Zhao, Y; Riddell, M; Feng, T; Shen, R; Yin, S; Liu, Y; Yavuz, S; Xiong, C et al. L2ceval: Evaluating language-to-code generation capabilities of large language models. Trans. Assoc. Comput. Ling.; 2024; 12, pp. 1311-1329.
3. Coello, CEA; Alimam, MN; Kouatly, R. Effectiveness of chatgpt in coding: A comparative analysis of popular large language models. Digital; 2024; 4,
4. Guo, J; Bao, W; Wang, J; Ma, Y; Gao, X; Xiao, G; Liu, A; Dong, J; Liu, X; Wenjun, W. A comprehensive evaluation framework for deep model robustness. Pattern Recogn.; 2023; 137, [DOI: https://dx.doi.org/10.1016/j.patcog.2023.109308] 109308.
5. Liu, A., Huang, T., Liu, X., Xu, Y., Ma, Y., Chen, X., Maybank, S.J. & Tao, D. Spatiotemporal attacks for embodied agents. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 122–138, (2020).
6. Ramler, R., Moser, M., Fischer, L., Nissl, M. & Heinzl, R. Industrial experience report on ai-assisted coding in professional software development. In Proceedings of the 1st International Workshop on Large Language Models for Code, pages 1–7, (2024).
7. Liu, A., Wang, J., Liu, X., Cao, B., Zhang, C. & Yu, H. Bias-based universal adversarial patch attack for automatic check-out. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16, pages 395–410, (2020).
8. Liu, A; Liu, X; Hang, Yu; Zhang, C; Liu, Q; Tao, D. Training robust deep neural networks via adversarial noise propagation. IEEE Trans. Image Process.; 2021; 30, pp. 5769-5781. [DOI: https://dx.doi.org/10.1109/TIP.2021.3082317] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34161231]
9. Zhang, C; Liu, A; Liu, X; Yitao, X; Hang, Yu; Ma, Y; Li, T. Interpreting and improving adversarial robustness of deep neural networks with neuron sensitivity. IEEE Trans. Image Process.; 2020; 30, pp. 1291-1304. [DOI: https://dx.doi.org/10.1109/TIP.2020.3042083] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33290221]
10. Liu, A., Guo, J., Wang, J., Liang, S., Tao, R., Zhou, W., Liu, C., Liu, X. & Tao, D. X-Adv: Physical adversarial object attacks against x-ray prohibited item detection. In 32nd USENIX Security Symposium (USENIX Security 23), pp. 3781–3798, (2023).
11. Liu, A; Tang, S; Chen, X; Huang, L; Qin, H; Liu, X; Tao, D. Towards defending multiple ℓp-norm bounded adversarial perturbations via gated batch normalization. Int. J. Comput. Vis.; 2024; 132.
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
We present a reinforcement learning framework that enhances natural language queries to improve DeepSeek code generation. A parametric refiner (Qwen with LoRA) is trained via REINFORCE while the generator remains fixed, using a scalar reward that can combine text similarity (BLEU-4, ROUGE-L, F1, Overlap) with execution signals (unit tests, syntax/timeout penalties). On the DS1000 benchmark (800 train / 200 test), RL4QE improves code similarity by 34.3%. Ablations show that BLEU-4 is the most reliable text reward overall (with F1 competitive at larger scale), and LoRA with rank r = 8 outperforms full fine-tuning on most metrics while training far fewer parameters.
Details
1 School of Computer Science, Guangdong University of Science and Technology, 523083, Dongguan, China (ROR: https://ror.org/054fysp39) (GRID: grid.472284.f)
2 School of Information Technology, Halmstad University, 30118, Halmstad, Sweden (ROR: https://ror.org/03h0qfp10) (GRID: grid.73638.39) (ISNI: 0000 0000 9852 2034)