Introduction
Large Language Models (LLMs) have transformed code generation and programming assistance1, with DeepSeek excelling in particular at logical reasoning and programming tasks2. However, several critical challenges persist in AI-driven programming. First, precisely designing a query for optimal code generation remains difficult3: although models produce syntactically correct code, outcomes depend heavily on the input prompt, which complicates systematic evaluation and improvement. Second, steering models to produce code that meets specific criteria remains problematic4,5; current practice relies on manual prompt crafting and misses opportunities for automated refinement. Third, the lack of robust training methods for incremental improvement of LLMs makes optimization difficult6–8. Recent research addresses these issues through manual refinement and template-based methods9–11, prompt engineering, and chain-of-thought prompting12–14, but these provide static rather than adaptive, feedback-driven solutions. Existing query enhancement strategies rely on predetermined rules, failing to capture the intricate relationship between natural language and code15,16 and lacking systematic refinement based on the quality of the generated code17–19. To address these challenges, we propose a reinforcement learning framework for query optimization with DeepSeek, using LoRA fine-tuning of Qwen to dynamically refine queries based on the similarity between generated and reference code12,20–27. Through empirical analysis on DS100028, we demonstrate significant quality improvements on programming tasks2,29–33, and our extensive evaluation in programming contexts establishes foundations for future research34–39. The main contributions of this paper are:
We propose an RL-based method to refine queries for DeepSeek code generation, learning from the results of generated code.
We use a dual-model design: a learnable refiner (Qwen+LoRA) and a fixed generator (DeepSeek), with LoRA applied to attention projections for efficiency.
We introduce a multi-aspect reward that combines text similarity (BLEU/ROUGE-L/F1/Overlap) and execution signals (unit tests, syntax penalty) to reflect practical code quality.
Large language models for code generation and query enhancement
LLMs have advanced code generation from natural language descriptions to executable code1–3,6, but performance still depends heavily on query quality5. Earlier work improves queries via pseudo-relevance feedback, knowledge-based methods, and semantic parsing24,40–42, and addresses the vocabulary mismatch problem17,18,26. Recent LLM techniques add dense retrieval, query rewriting, knowledge enhancement, and explanations32,38,43,44, with Self-RAG, CoT, and few-shot prompting as strong baselines12,15,16,20. We follow this line of work but make the refiner parametric and trainable with RL.
Parameter-efficient fine-tuning (PEFT) with LoRA
PEFT reduces computation while preserving performance20–23. LoRA adds a low-rank update to frozen weights,
W' = W + ΔW = W + BA,   B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, r ≪ min(d, k)    (1)
reducing trainable parameters and avoiding catastrophic forgetting26,27,29–33. Previous work shows that LoRA maintains general language ability while adapting to code35–39,41,42,44–47. We apply LoRA to the Qwen attention projections for efficient query refinement13,14,34,40,43,48–57.
Generalized preference automation (GPRO) for automated prompt optimization
GPRO frames prompt optimization as RL over prompts1–12,15–33,35–37. Our work differs in three ways: (1) a dual-model pipeline (learnable refiner + fixed generator), (2) LoRA-targeted attention for efficiency, and (3) the option to integrate execution-aware signals alongside text metrics. This design adapts queries from outcome feedback rather than from template engineering alone.
Reinforcement learning for LLM optimization
RL improves LLMs beyond supervised learning through preference modeling and policy optimization1,2,4,5,7–14,28,34,38–57. We adopt a lightweight REINFORCE setup with a baseline and entropy regularization for stability, training only the refiner while keeping the generator fixed, which makes the method practical for code generation scenarios.
Methods
The DeepSeek-Chat and Qwen models were used in this study as the fixed generator and the parametric refiner, respectively. Their use was limited to generating code from queries and refining queries based on rewards. They were not involved in the conceptualization, writing, or analysis of the manuscript.
Figure 1 [Images not available. See PDF.]
Overview of our approach.
Figure 2 [Images not available. See PDF.]
Query enhancement example showing original and enhanced queries alongside their corresponding code.
In this section, we introduce our reinforcement learning approach for query enhancement: we formalize the task, define the reward, and detail LoRA-based training on Qwen. Figure 1 outlines the pipeline: Qwen (with LoRA) refines the original query q into an enhanced query q̂, DeepSeek generates code c, we compare c with the reference code c* to compute the reward R, and we update the refiner. We use the following notation inline: q (original query), q̂ (enhanced query), c (generated code), c* (reference code), and R (reward); LoRA applies a low-rank update ΔW = BA to the attention projection weights. Figure 2 shows a concrete case: the refined query yields more precise code, evaluated against the ground truth via ROUGE-L.
Problem formulation
We model query enhancement as a reinforcement learning problem. Given a natural-language query q, the refiner produces an enhanced query q̂. Let D = {(q_i, c*_i)} be a set of queries with reference code. We define a refiner f_θ : q ↦ q̂ and a fixed generator g : q̂ ↦ c. The goal is to learn θ so that c is close to c* under the reward R:
θ* = arg max_θ E_{(q, c*) ∼ D} [ R(g(f_θ(q)), c*) ]    (2)
We implement f_θ with Qwen (full fine-tuning or LoRA) and train it by RL. We use REINFORCE with a learned baseline b(q) and entropy regularization. Rewards are normalized per batch to reduce variance, and we optimize by gradient descent with clip norm 1.0.
Query enhancement
We study two ways to refine queries: full fine-tuning and LoRA. Full fine-tuning updates all weights W, but it is computationally heavy and may forget pretraining knowledge. LoRA is efficient:
W' = W + ΔW = W + BA    (3)
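The update in Eq. (3) can be sketched numerically. The following is a minimal illustration in plain Python; the matrix sizes, rank, and alpha value are illustrative assumptions, not the paper's actual dimensions.

```python
# A minimal numeric sketch of the LoRA update in Eq. (3), using plain
# Python lists of rows; sizes and hyperparameters here are illustrative.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_update(W, A, B, alpha):
    """W' = W + (alpha / r) * B @ A, with W frozen and only A, B trained."""
    r = len(A)                      # rank = number of rows of A
    delta = matmul(B, A)            # low-rank update B @ A
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

# Frozen 4x4 weight; rank-2 adapters: B is 4x2, A is 2x4.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.1] * 4 for _ in range(2)]
B = [[0.0] * 2 for _ in range(4)]   # B initialised to zero, as in LoRA
W_prime = lora_update(W, A, B, alpha=32)

# With B = 0 the update is a no-op at initialisation: W' == W.
print(W_prime == W)                 # True
```

Only A and B are trained: for a d×k projection this is r(d+k) parameters instead of dk, e.g. roughly 0.4% of the original count for d = k = 4096 and r = 8.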
where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, and r ≪ min(d, k). We apply LoRA to the attention projections. The refiner uses a simple template p(q) to transform q into q̂. We compare our RL-based refiner with two strong baselines: Chain-of-Thought (CoT)12 and Retrieval-Augmented Generation (RAG)58. CoT modifies the prompt with hand-crafted reasoning steps; RAG augments the prompt with the top-k retrieved snippets. Our method learns a parametric refiner (Qwen+LoRA) from outcome rewards, while the generator (DeepSeek) is fixed. For fairness, all methods use the same generator, decoding settings, and metrics (CSS, precision, recall, F1).
Table 1
Query enhancement methods.
Method | Learnable | Retrieval | Execution-aware | Change |
|---|---|---|---|---|
CoT12 | – | – | – | Prompt |
RAG58 | – | ✓ | – | Prompt+Context |
RL4QE | ✓ | – | ✓ | Refiner |
Table 1 contrasts CoT and RAG, which modify prompts heuristically or through the retrieved context (nonlearnable, text only), with RL4QE, which trains a parametric refiner using outcome rewards and execution signals. Here, execution-aware denotes the use of execution outcomes (e.g., syntax / runtime errors, timeouts) as reward or guidance during training/evaluation. RL4QE integrates these signals and guides DeepSeek via LoRA on attention projections.
Reward function
We define a scalar reward. For RQ1, we isolate the text-only rewards by running four separate settings (one per metric): Overlap, ROUGE-L, BLEU-4, and F1.
R = λ · R_text + (1 − λ) · R_exec    (4)
where λ ∈ [0, 1] balances the text reward R_text against the execution-aware reward R_exec.
Overlap (token-set coverage of the reference):
Overlap(c, c*) = |T(c) ∩ T(c*)| / |T(c*)|    (5)
where T(·) denotes the token set.
ROUGE-L (LCS-based F-measure; higher is better):
P_lcs = LCS(c, c*) / |c|,  R_lcs = LCS(c, c*) / |c*|,  ROUGE-L = 2 · P_lcs · R_lcs / (P_lcs + R_lcs)    (6)
BLEU-4 (smoothed 1–4-gram precision with brevity penalty):
BLEU-4 = BP · exp( (1/4) Σ_{n=1}^{4} log p_n )    (7)
where p_n is the smoothed n-gram precision and BP the brevity penalty.
F1 (token-level overlap, order-agnostic):
Precision = |T(c) ∩ T(c*)| / |T(c)|    (8)
Recall = |T(c) ∩ T(c*)| / |T(c*)|    (9)
F1 = 2 · Precision · Recall / (Precision + Recall)    (10)
The text reward for a run is one of the above:
R_text ∈ {Overlap, ROUGE-L, BLEU-4, F1}    (11)
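The four text rewards can be sketched as follows. This is a hedged reference implementation: tokenisation here is simple whitespace splitting and the BLEU smoothing is add-one, whereas the paper's exact tokenizer and smoothing scheme may differ.

```python
# Hedged sketches of the four text rewards; whitespace tokenisation and
# add-one smoothing are assumptions, not the paper's exact choices.
from collections import Counter
import math

def overlap(cand, ref):
    """Token-set coverage of the reference (Eq. 5)."""
    c, r = set(cand.split()), set(ref.split())
    return len(c & r) / len(r) if r else 0.0

def token_f1(cand, ref):
    """Order-agnostic token-level F1 (Eqs. 8-10)."""
    c, r = Counter(cand.split()), Counter(ref.split())
    tp = sum((c & r).values())
    if tp == 0:
        return 0.0
    p, q = tp / sum(c.values()), tp / sum(r.values())
    return 2 * p * q / (p + q)

def rouge_l(cand, ref):
    """LCS-based F-measure (Eq. 6), via dynamic programming."""
    x, y = cand.split(), ref.split()
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xt in enumerate(x):
        for j, yt in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xt == yt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(x), lcs / len(y)
    return 2 * p * r / (p + r)

def bleu4(cand, ref):
    """Smoothed 1-4-gram precision with brevity penalty (Eq. 7)."""
    c, r = cand.split(), ref.split()
    if not c:
        return 0.0
    log_p = 0.0
    for n in range(1, 5):
        cn = Counter(tuple(c[i:i + n]) for i in range(len(c) - n + 1))
        rn = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        match, total = sum((cn & rn).values()), max(sum(cn.values()), 1)
        log_p += math.log((match + 1) / (total + 1)) / 4   # add-one smoothing
    bp = 1.0 if len(c) >= len(r) else math.exp(1 - len(r) / len(c))
    return bp * math.exp(log_p)

code = "df = df.groupby('k').sum()"
print(round(rouge_l(code, code), 2))   # identical strings score 1.0
```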
The execution-aware reward (not used in RQ1) combines unit-test signals:
R_exec = pass_rate − penalty    (12)
penalty = γ_syn · 1[syntax error] + γ_to · 1[timeout]    (13)
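A minimal sketch of this execution-aware signal is shown below, assuming a pass-rate term and a hard syntax penalty; the actual weighting and sandboxing in the released harness may differ, and the penalty value here is illustrative.

```python
# A hedged sketch of Eqs. (12)-(13): unit-test pass rate minus a syntax
# penalty. The penalty weight is an illustrative assumption.

def execution_reward(code, tests, syntax_penalty=1.0):
    """Return pass rate in [0, 1], or a negative penalty if code cannot parse."""
    try:
        compile(code, "<generated>", "exec")
    except SyntaxError:
        return -syntax_penalty          # hard penalty for unparseable code
    env = {}
    try:
        exec(code, env)                  # NOTE: sandbox this in practice
    except Exception:
        return 0.0
    passed = 0
    for test in tests:
        try:
            exec(test, dict(env))        # run each unit test on a copy
            passed += 1
        except Exception:
            pass
    return passed / len(tests) if tests else 0.0

good = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]
print(execution_reward(good, tests))            # 1.0
print(execution_reward("def add(a b)", tests))  # -1.0 (syntax error)
```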
In RQ1 we set λ = 1 and compare the four text rewards (one per run). In other experiments we may include execution feedback with λ < 1.
Reinforcement learning
We train the refiner with REINFORCE. The pipeline is q → (Qwen f_θ) → q̂ → (DeepSeek g) → c → R → update of θ. For efficiency, we separate evaluation (without gradients) from training (with gradients): in evaluation, we generate q̂ and c and compute R with gradient tracking disabled.
We use a learned baseline b(q) and entropy regularization:
L(θ) = −E[ (R − b(q)) · log π_θ(q̂ | q) ] − β · H(π_θ)    (14)
Rewards are normalized per batch to reduce variance. We optimize by gradient descent with clip norm 1.0 and optional gradient accumulation:
θ ← θ − η · clip(∇_θ L(θ), 1.0)    (15)
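The loss and normalisation above can be checked numerically. The sketch below uses plain Python (no autograd) to compute the batch loss value; the entropy coefficient and example numbers are illustrative assumptions.

```python
# A numeric sketch of Eqs. (14)-(15): per-batch reward normalisation, a
# learned baseline, and entropy regularisation. Beta and the sample
# values are illustrative, not the paper's settings.
import math

def normalize(rewards, eps=1e-8):
    """Per-batch reward normalisation to reduce gradient variance."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (math.sqrt(var) + eps) for r in rewards]

def reinforce_loss(log_probs, rewards, baselines, entropies, beta=0.01):
    """L = -(R - b(q)) * log pi(q_hat|q) - beta * H(pi), averaged over batch."""
    rs = normalize(rewards)
    terms = [-(r - b) * lp - beta * h
             for lp, r, b, h in zip(log_probs, rs, baselines, entropies)]
    return sum(terms) / len(terms)

log_probs = [-2.3, -1.7, -3.1]    # log pi(q_hat | q) for each sample
rewards   = [0.42, 0.18, 0.55]    # e.g. BLEU-4 of generated vs. reference code
baselines = [0.3, 0.3, 0.3]       # learned baseline b(q); constant here
entropies = [1.2, 0.9, 1.5]
loss = reinforce_loss(log_probs, rewards, baselines, entropies)
print(round(loss, 4))
```

In a real training step this scalar would be backpropagated through the refiner only, with gradients clipped to norm 1.0 before the optimizer update.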
LoRA fine-tuning updates only the target attention projections, reducing memory. We keep the generator g fixed. The best checkpoint is selected by the validation reward.
Algorithm
We train a refiner (Qwen) with rewards from the code generated by DeepSeek.
Algorithm 1 summarizes the training process: an evaluation pass collects rewards without gradients, after which the refiner is updated via REINFORCE with entropy regularization, using gradient accumulation over G steps and a clip norm of 1.0. The best checkpoint is selected by the average reward on the validation set. The iterative cycle involves the refiner f_θ (Qwen), the fixed generator g (DeepSeek), and the reward R, with LoRA applied to the attention projections for efficiency.
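The overall loop can be outlined with stub components. In this sketch the refiner, generator, and reward are placeholders standing in for Qwen+LoRA, DeepSeek, and BLEU-4 respectively; only the checkpoint-selection structure mirrors Algorithm 1.

```python
# A hedged outline of Algorithm 1 with stub components; the three
# functions below are placeholders, not the paper's models.
import random

def refiner(query):                       # stands in for Qwen + LoRA
    return "Write Python code to " + query

def generator(enhanced_query):            # stands in for the fixed DeepSeek
    return "# code for: " + enhanced_query

def reward(code, reference):              # stands in for BLEU-4 etc.
    return random.random()

def train(queries, references, epochs=3, seed=0):
    random.seed(seed)
    best = (-1.0, None)
    for epoch in range(epochs):
        # 1) evaluation pass (no gradients): collect rewards
        rewards = [reward(generator(refiner(q)), c)
                   for q, c in zip(queries, references)]
        # 2) training pass (with gradients): REINFORCE update of the
        #    refiner only -- omitted here; the generator stays frozen
        avg = sum(rewards) / len(rewards)
        if avg > best[0]:                 # checkpoint selection by
            best = (avg, epoch)           # average validation reward
    return best

best_reward, best_epoch = train(["sort a list"], ["sorted(xs)"])
print(best_epoch in range(3))             # True
```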
Results
Research questions
We carried out a comprehensive series of experiments using the DS1000 dataset to assess the effectiveness of our reinforcement learning-driven query enhancement method for code generation, focusing on three critical research questions.
RQ1: Which reward function design performs best for query enhancement?
RQ2: How does LoRA fine-tuning compare to full fine-tuning?
RQ3: How does our approach perform on different foundation models?
Experimental setting
Table 2
Experimental configuration and dataset information.
Category | Component | Specification |
|---|---|---|
Models | Query Enhancement (Base) | Qwen-7B |
LoRA Configuration | Rank r = 8, α = 32, dropout rate 0.1 | |
Alternative Models | Qwen | |
Fine-tuning Methods | LoRA, Full fine-tuning | |
Code Generation | DeepSeek-Chat via API | |
Hardware | GPU | NVIDIA RTX 3090 (24GB VRAM) |
Memory | 128GB DDR4 RAM | |
Dataset | Source | DS-1000 code generation tasks |
Training/Testing/Validation | 600/200/200 samples | |
Libraries | NumPy, Pandas, PyTorch, TensorFlow, Matplotlib | |
Reward Functions | Overlap | Token overlap between generated and reference code |
ROUGE-L | Longest common subsequence F-measure | |
BLEU | n-gram precision with smoothing | |
F1 | Token-based precision and recall | |
Code Metrics | Syntax, completeness, and correctness | |
Training Parameters | Learning Rate | 1e-4 (Adam) |
Gradient Accumulation | Configurable steps (default: 1) | |
Training Epochs | Early stopping applied | |
Evaluation Metrics | CSS | Code semantic similarity |
Precision | Generation accuracy | |
Recall | Reference coverage | |
F1 Score | Harmonic mean of precision/recall |
We designed experiments on DS-1000 (800 train, 200 test) to answer three research questions. RQ1 compares reward functions (Overlap, ROUGE-L, BLEU-4, F1); RQ2 contrasts LoRA with full fine-tuning; RQ3 evaluates multiple model architectures: Qwen-1.8B-Chat, Qwen1.5-0.5B/1.8B/4B-Chat, Qwen2-0.5B/1.5B-Instruct, and Qwen2.5-0.5B/1.5B/3B-Instruct. All runs use NVIDIA 3090 GPUs, Adam (lr = 1e-4), gradient accumulation (4 steps), and 10 epochs. Table 2 details the complete configuration.
Evaluation metrics
We use the following metrics to report model performance. The Code Similarity Score (CSS) measures the sequence correspondence between the generated code c and the reference c*:
CSS(c, c*) = m(T(c), T(c*))    (16)
Token sequences are obtained by a simple tokenizer:
T(c) = tokenize(c)    (17)
The sequence match ratio m normalizes the longest common (contiguous) token matches:
m(X, Y) = 2M / (|X| + |Y|)    (18)
where M is the total length of the matching contiguous token blocks.
Token-level metrics assess overlap regardless of order:
Precision = |T(c) ∩ T(c*)| / |T(c)|    (19)
Recall = |T(c) ∩ T(c*)| / |T(c*)|    (20)
F1 = 2 · Precision · Recall / (Precision + Recall)    (21)
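One plausible reading of the CSS definition uses Python's difflib, whose ratio() is exactly 2M/(|X|+|Y|) summed over matching contiguous blocks; the tokenizer below is an assumption, since the paper's exact tokenizer is not shown.

```python
# A hedged sketch of CSS (Eqs. 16-18) via difflib.SequenceMatcher; the
# regex tokenizer is an assumption standing in for the paper's tokenizer.
import difflib
import re

def tokenize(code):
    """Simple tokenizer: identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def css(generated, reference):
    """Code Similarity Score from contiguous matching token blocks."""
    return difflib.SequenceMatcher(
        None, tokenize(generated), tokenize(reference)).ratio()

ref = "result = np.sum(arr, axis=0)"
gen = "result = np.sum(arr, axis=1)"
print(css(ref, ref))          # 1.0 for identical code
print(0 < css(gen, ref) < 1)  # True: a one-token change lowers the score
```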
All metrics are in [0, 1]. CSS reflects structural similarity via longest contiguous matches, while Precision/Recall/F1 capture exact token overlap.
Reproducibility
We ensure reproducibility with fixed seeds for Python/NumPy/PyTorch and deterministic flags where available; scripts and configs are released. All reported tables (e.g., Tables 4, 5, 6) use a disclosed fixed seed in captions, and multi-seed runs can be reproduced with the provided configurations. Execution-based evaluation follows a sandboxed harness (2 s per test, safe handling of syntax errors and timeouts) and reports the unit-test pass rate. For external benchmarking, we use the MBPP-lite split under the same generator, decoding setup, and metrics (CSS, Precision, Recall, F1), with the complete harness and configurations released for verification.
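The sandboxed harness described above can be sketched with the standard library: each test runs in a subprocess with a 2-second timeout. This is an illustrative sketch, not the released harness, whose details may differ.

```python
# A hedged sketch of the sandboxed execution harness: subprocess
# isolation with a 2-second timeout and safe syntax-error handling.
import subprocess
import sys

def run_test(code, test, timeout=2.0):
    """Return 'pass', 'fail', 'timeout', or 'syntax_error' for one test."""
    try:
        compile(code, "<generated>", "exec")
    except SyntaxError:
        return "syntax_error"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code + "\n" + test],
            capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "pass" if proc.returncode == 0 else "fail"

code = "def double(x):\n    return 2 * x"
print(run_test(code, "assert double(3) == 6"))   # pass
print(run_test(code, "assert double(3) == 7"))   # fail
print(run_test("while True: pass", "pass"))      # timeout
```

Running each test in its own interpreter keeps infinite loops and crashes in generated code from taking down the evaluation process.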
Experimental analysis
RQ1: Which reward function design performs best for query enhancement?
Figure 3 [Images not available. See PDF.]
Training curves of Qwen-7B with LoRA (r = 8, α = 32) under the Token Overlap metric.
Figure 4 [Images not available. See PDF.]
Training curves of Qwen-7B with LoRA (r = 8, α = 32) under the ROUGE-L score.
Figure 5 [Images not available. See PDF.]
Training curves of Qwen-7B with LoRA (r = 8, α = 32) under the BLEU-4 score.
Figure 6 [Images not available. See PDF.]
Training curves of Qwen-7B with LoRA (r = 8, α = 32) under the F1 score.
Table 3
LoRA hyperparameter configuration for Qwen-7B.
Parameter | Value |
|---|---|
LoRA rank (r) | 8 (recommended for 7B models) |
LoRA alpha (α) | 32 (scaling factor α/r) |
Dropout rate | 0.1 (applied to LoRA weights) |
Bias training | Disabled (none) |
Target modules | Attention projections |
Precision | FP16 (half precision) |
Quantization | 8-bit (LLM.int8() scheme) |
We evaluated four text reward metrics, Overlap, ROUGE-L, BLEU-4, and F1, using the LoRA configuration given in Table 3. As the training curves in Figs. 3, 4, 5, and 6 show, BLEU-4 consistently produces the most stable and highest overall rewards. For the Qwen-7B model, trained on a dataset of 100 samples, the F1 score slightly exceeds BLEU-4; however, for the smaller Qwen-1.8B model, trained on a more limited dataset of 10 samples, BLEU-4 is more robust. We therefore propose BLEU-4 as the default reward metric, with F1 as a viable alternative for applications involving larger datasets and models.
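The Table 3 configuration can be expressed with the Hugging Face peft library. The sketch below is illustrative rather than the released training code: the target module names and the 8-bit loading flag are assumptions that vary across Qwen versions and peft releases.

```python
# A hedged sketch of the Table 3 LoRA configuration using peft; the
# target module names below are assumed Qwen-style attention projection
# names, and API details may vary across library versions.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", load_in_8bit=True)
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                    # rank, recommended for 7B models
    lora_alpha=32,          # scaling factor alpha
    lora_dropout=0.1,       # applied to the LoRA weights
    bias="none",            # bias training disabled
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
)
model = get_peft_model(model, config)
model.print_trainable_parameters()   # only the adapter weights are trainable
```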
RQ2: How does LoRA fine-tuning compare to full fine-tuning?
Table 4
Performance comparison of reward metrics on Qwen models with LoRA.
Reward metric | Model | CSS | Precision | Recall | F1 |
|---|---|---|---|---|---|
Overlap | Qwen-7B (r=8, =1.0) | – | – | – | – |
Overlap | Qwen-1.8B (r=8, =0.1) | 0.0812 | 0.1173 | 0.1859 | 0.1352 |
ROUGE-L | Qwen-7B | – | – | – | – |
ROUGE-L | Qwen-1.8B | 0.0695 | 0.0855 | 0.1950 | 0.1116 |
BLEU-4 | Qwen-7B | 0.0965 | 0.1250 | 0.2770 | 0.1530 |
BLEU-4 | Qwen-1.8B | 0.1398 | 0.1471 | 0.5608 | 0.2120 |
F1 | Qwen-7B | 0.1016 | 0.1220 | 0.2830 | 0.1510 |
F1 | Qwen-1.8B | 0.0625 | 0.1107 | 0.2197 | 0.1404 |
Table 5
Full fine-tuning vs. LoRA on Qwen-1.8B under the BLEU-based reward function.
Method | CSS | Precision | Recall | F1 |
|---|---|---|---|---|
Full fine-tuning (Qwen-1.8B) | 0.1078 | 0.1644 | 0.2028 | 0.1635 |
LoRA (Qwen-1.8B, rank below 8) | 0.0850 | 0.1565 | 0.2054 | 0.1634 |
LoRA (Qwen-1.8B, r = 8) | 0.1398 | 0.1471 | 0.5608 | 0.2120 |
LoRA (Qwen-1.8B, rank above 8) | 0.1091 | 0.1376 | 0.5046 | 0.1988 |
We compare LoRA with full fine-tuning using the results in Tables 4 and 5. Table 4 (LoRA only) shows that, on Qwen-1.8B, BLEU-4 produces the strongest overall metrics (CSS 0.1398, Precision 0.1471, Recall 0.5608, F1 0.2120). Table 5 directly contrasts the methods under the BLEU-based reward: LoRA with r = 8 surpasses full fine-tuning in CSS (0.1398 vs. 0.1078), Recall (0.5608 vs. 0.2028), and F1 (0.2120 vs. 0.1635), while full fine-tuning has the highest precision (0.1644). The lower rank is essentially on par with full fine-tuning in F1 (0.1634 vs. 0.1635), and the higher rank improves over full fine-tuning but underperforms r = 8 (F1 0.1988 vs. 0.2120), indicating diminishing returns beyond r = 8.
RQ3: How does our approach perform on different foundation models?
Table 6
Comparative analysis of foundation models with LoRA (r = 8, α = 32).
Model variant | Precision | Recall | F1 | CSS |
|---|---|---|---|---|
Qwen-1.8B-Chat | 0.0969 | 0.2677 | 0.1121 | 0.0799 |
Qwen1.5-0.5B-Chat | 0.0972 | 0.1908 | 0.1166 | 0.0660 |
Qwen1.5-1.8B-Chat | 0.2301 | 0.6590 | 0.3264 | 0.2154 |
Qwen1.5-4B-Chat | 0.1692 | 0.3365 | 0.2091 | 0.1125 |
Qwen2-0.5B-Instruct | 0.1778 | 0.7240 | 0.2704 | 0.1204 |
Qwen2-1.5B-Instruct | 0.1769 | 0.7699 | 0.2778 | 0.1455 |
Qwen2.5-0.5B-Instruct | 0.2041 | 0.7490 | 0.3070 | 0.1949 |
Qwen2.5-1.5B-Instruct | 0.1501 | 0.7492 | 0.2408 | 0.1152 |
Qwen2.5-3B-Instruct | 0.2250 | 0.7088 | 0.3193 | 0.1956 |
We evaluated our approach on all foundation models with LoRA (r = 8, α = 32). Table 6 shows that architecture matters more than size: Qwen1.5-1.8B-Chat achieves the best balance (CSS 0.2154, F1 0.3264), outperforming larger models (e.g., Qwen1.5-4B-Chat and Qwen2.5-3B-Instruct). The Qwen2/2.5 variants produce high recall (up to 0.7699 for Qwen2-1.5B-Instruct) but less balanced precision and CSS, while the Qwen1.5 series is more consistent, especially the 1.8B model (all metrics in Table 6 are scaled to [0, 1]).
Discussion
Reinforcement learning improves code generation by refining queries42,44,45, helping to close the gap between natural language and effective prompts41,46,47. It turns vague inputs into precise technical prompts48–50, building on established prompting and feedback patterns43,51. A multi-aspect reward is beneficial52–54, while LoRA maintains efficiency with performance comparable to full fine-tuning40,55,56. Differences between model families highlight the importance of code-pretrained foundations13,57. Recent advances in RL offer practical extensions: attention-prioritized replay for sample efficiency59 and weighted mean-field Q-learning for stable aggregation of enhanced queries60. Empirically, rank r = 8 balances capacity, stability, and compute: lower ranks underfit the attention projections, while higher ranks increase parameters and variance, yielding diminishing returns. Limitations include the domain and horizon coverage of DS10005,7,8 and sensitivity to the generator10,11,28. Human feedback can further improve efficiency and reward design9,14,25. Future work will explore more efficient RL16,19,20, hybrid strategies22–24, broader domains26,27,29, personalization30–32, and interactive systems33,35,36 to make AI assistance more accessible across skill levels12,37–39. In summary, we introduced RL4QE, a practical framework that learns to refine queries with a parametric refiner (Qwen+LoRA) while keeping the code generator (DeepSeek) fixed. Across DS-1000 experiments, BLEU-4 emerges as a strong default reward (with F1 competitive at larger scale), and LoRA with r = 8 outperforms full fine-tuning on most metrics while using far fewer trainable parameters. The approach transfers across foundation models and is easy to integrate (LoRA on the attention projections). We release code, seeds, and harnesses to support reproducibility and external verification.
Acknowledgements
This work was supported by Natural Science Project of Guangdong University of Science and Technology (GKY-2025BSQDK-3).
Author contributions
Dawei Yuan conceived the research idea, designed the methodology, and drafted the manuscript. Guojun Liang contributed to the processing and analysis of the data. Tingting Li assisted with the development of the methodology and the review of the manuscript. Suping Liu supervised the project and revised the manuscript. All authors read and approved the final manuscript.
Data availability
The datasets and code used in this study are publicly available at: https://github.com/davidyuan666/RL4QE and https://doi.org/10.6084/m9.figshare.28767299.v2. The DS1000 used for the evaluation can be accessed from the original repository.
Declarations
Competing interests
The authors declare no competing interests.
References
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Dale, R. GPT-3: What’s it good for? Nat. Lang. Eng.; 2021; 27.
2. Ni, A; Yin, P; Zhao, Y; Riddell, M; Feng, T; Shen, R; Yin, S; Liu, Y; Yavuz, S; Xiong, C et al. L2ceval: Evaluating language-to-code generation capabilities of large language models. Trans. Assoc. Comput. Ling.; 2024; 12, pp. 1311-1329.
3. Coello, CEA; Alimam, MN; Kouatly, R. Effectiveness of chatgpt in coding: A comparative analysis of popular large language models. Digital; 2024; 4,
4. Guo, J; Bao, W; Wang, J; Ma, Y; Gao, X; Xiao, G; Liu, A; Dong, J; Liu, X; Wenjun, W. A comprehensive evaluation framework for deep model robustness. Pattern Recogn.; 2023; 137, [DOI: https://dx.doi.org/10.1016/j.patcog.2023.109308] 109308.
5. Liu, A., Huang, T., Liu, X., Xu, Y., Ma, Y., Chen, X., Maybank, S.J. & Tao, D. Spatiotemporal attacks for embodied agents. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 122–138, (2020).
6. Ramler, R., Moser, M., Fischer, L., Nissl, M. & Heinzl, R. Industrial experience report on ai-assisted coding in professional software development. In Proceedings of the 1st International Workshop on Large Language Models for Code, pages 1–7, (2024).
7. Liu, A., Wang, J., Liu, X., Cao, B., Zhang, C. & Yu, H. Bias-based universal adversarial patch attack for automatic check-out. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16, pages 395–410, (2020).
8. Liu, A; Liu, X; Hang, Yu; Zhang, C; Liu, Q; Tao, D. Training robust deep neural networks via adversarial noise propagation. IEEE Trans. Image Process.; 2021; 30, pp. 5769-5781. [DOI: https://dx.doi.org/10.1109/TIP.2021.3082317] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34161231]
9. Zhang, C; Liu, A; Liu, X; Yitao, X; Hang, Yu; Ma, Y; Li, T. Interpreting and improving adversarial robustness of deep neural networks with neuron sensitivity. IEEE Trans. Image Process.; 2020; 30, pp. 1291-1304. [DOI: https://dx.doi.org/10.1109/TIP.2020.3042083] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33290221]
10. Liu, A., Guo, J., Wang, J., Liang, S., Tao, R., Zhou, W., Liu, C., Liu, X. & Tao, D. X-Adv: Physical adversarial object attacks against x-ray prohibited item detection. In 32nd USENIX Security Symposium (USENIX Security 23), pp. 3781–3798, (2023).
11. Liu, A; Tang, S; Chen, X; Huang, L; Qin, H; Liu, X; Tao, D. Towards defending multiple ℓp-norm bounded adversarial perturbations via gated batch normalization. Int. J. Comput. Vis.; 2024; 132.
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
We present a reinforcement learning framework that enhances natural language queries to improve DeepSeek code generation. A parametric refiner (Qwen with LoRA) is trained via REINFORCE while the generator remains fixed, using a scalar reward that can combine text similarity (BLEU-4, ROUGE-L, F1, Overlap) with execution signals (unit tests, syntax/timeout penalties). On the DS1000 benchmark (800 train / 200 test), RL4QE improves code similarity by 34.3%. Ablations show that BLEU-4 is the most reliable text reward overall (with F1 competitive at larger scale), and LoRA with rank r = 8 outperforms full fine-tuning on most metrics while training far fewer parameters.
Details
1 School of Computer Science, Guangdong University of Science and Technology, 523083, Dongguan, China (ROR: https://ror.org/054fysp39) (GRID: grid.472284.f)
2 School of Information Technology, Halmstad University, 30118, Halmstad, Sweden (ROR: https://ror.org/03h0qfp10) (GRID: grid.73638.39) (ISNI: 0000 0000 9852 2034)