With the widespread adoption of the internet, online advertising has grown exponentially, and various Multi-Armed Bandit (MAB) algorithms have been deployed to improve ad recommendation efficiency. Among these, the Thompson ε-Greedy algorithm integrates the ε-Greedy policy with Thompson Sampling. To optimize this algorithm, specifically to reduce cumulative regret and improve arm-selection accuracy, this paper analyzes the exploration parameter ε in the ε-Greedy framework. This paper argues that a fixed ε wastes the environmental information accumulated over time: as the number of rounds grows, understanding of the environment deepens, so ε should decay with both the round index and the selection count of the current best arm, T_(t,arm_max), since a higher selection count implies greater confidence that this arm is optimal. Two decay modes are considered, linear and nonlinear, and the parameters of each are optimized with a genetic algorithm. Results demonstrate that, after introducing the parameter T_(t,arm_max), nonlinear ε-decay achieves lower cumulative regret under optimal parameter settings, whereas linear decay shows no such improvement.
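The abstract does not specify the exact decay law or the GA-tuned parameter values, so the following is a minimal Python sketch of a Thompson ε-Greedy loop with one hypothetical nonlinear schedule, ε_t = ε₀ / (1 + c · t · T_(t,arm_max)); the names and constants (eps0, c) are assumptions for illustration, not the paper's tuned settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_eps_greedy(true_means, horizon=10_000, eps0=0.1, c=0.01):
    """Thompson eps-Greedy on Bernoulli arms with a nonlinear eps decay.

    eps_t = eps0 / (1 + c * t * pulls[best]) is a hypothetical decay form;
    the paper's exact schedule and GA-optimized parameters are not given here.
    """
    k = len(true_means)
    alpha = np.ones(k)   # Beta posterior: observed successes + 1
    beta = np.ones(k)    # Beta posterior: observed failures + 1
    pulls = np.zeros(k)  # selection counts T_{t, arm}
    regret = 0.0
    best_mean = max(true_means)
    for t in range(1, horizon + 1):
        best = int(np.argmax(alpha / (alpha + beta)))  # current empirical best arm
        eps = eps0 / (1.0 + c * t * pulls[best])       # decays with t and T_{t, arm_max}
        if rng.random() < eps:
            # explore: Thompson Sampling draw from each arm's posterior
            arm = int(np.argmax(rng.beta(alpha, beta)))
        else:
            # exploit: play the current empirical best arm
            arm = best
        reward = rng.random() < true_means[arm]  # Bernoulli reward
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
        regret += best_mean - true_means[arm]
    return regret

print(thompson_eps_greedy([0.05, 0.04, 0.06]))
```

Under this schedule, ε shrinks faster once the empirical best arm dominates the pull counts, which mirrors the paper's intuition that a frequently selected best arm warrants less exploration; a linear variant would instead subtract a term proportional to t and T_(t,arm_max) from ε₀.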