In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. [4]. For this
modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB
algorithm the regret in K-armed bandits after T trials is bounded by const · (K log T)/Δ, where Δ measures the distance between a suboptimal arm and the optimal arm, for the modified UCB algorithm we show an upper bound on the regret of const · (K log(TΔ²))/Δ.
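For reference, the sketch below shows the original UCB1 index policy of Auer et al. [4] to which the first bound refers (not the modified algorithm analyzed here): each arm maintains an empirical mean plus a confidence radius √(2 ln t / nᵢ), and the arm with the largest index is pulled. The Bernoulli test arms and the ucb1 interface are illustrative assumptions, not taken from the source.

```python
import math
import random

def ucb1(arms, T):
    """Sketch of UCB1: pull the arm maximizing mean + confidence radius.

    arms: list of callables, each returning a reward in [0, 1] when pulled
          (hypothetical interface for illustration).
    T: total number of trials.
    Returns the sequence of pulled arm indices.
    """
    K = len(arms)
    counts = [0] * K      # n_i: number of pulls of arm i
    means = [0.0] * K     # empirical mean reward of arm i
    history = []

    for t in range(1, T + 1):
        if t <= K:
            i = t - 1     # initialization: pull each arm once
        else:
            # UCB1 index: empirical mean + sqrt(2 ln t / n_i)
            i = max(range(K),
                    key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = arms[i]()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
        history.append(i)
    return history

# Usage: two Bernoulli arms with gap Δ = 0.2 (assumed toy instance)
if __name__ == "__main__":
    arms = [lambda: float(random.random() < 0.5),
            lambda: float(random.random() < 0.7)]
    pulls = ucb1(arms, T=10_000)
    print("pulls of suboptimal arm:", pulls.count(0))
```

In this toy instance the suboptimal arm is pulled on the order of (log T)/Δ² times, which is what yields the (K log T)/Δ regret bound of the original analysis.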