Multi-armed bandits, also called \(K\)-armed bandits, describe a type of reinforcement learning problem in which an agent selects one of \(K\) possible arms on each of any number of trials. After each selection, the agent experiences a reinforcement that influences its action on the next trial. Multi-armed bandits can help formalize a number of real-world problems. Consider a developer who wants to try different application icons and find the one that is most appealing to customers and maximizes the number of downloads. They might use a multi-armed bandit to learn which icon (arm) maximizes positive reinforcement (number of downloads).
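As a toy illustration of this setup (not this package's API; the function and object names below are hypothetical), each arm can be modeled as paying out a reinforcement of 1 with some hidden probability:

```{r}
# A toy K-armed Bernoulli bandit: each arm pays out 1 with a fixed
# success probability that is unknown to the agent.
# Illustrative sketch only; not part of this package's interface.
set.seed(42)

k <- 5                     # number of arms
true_probs <- runif(k)     # hidden success probability of each arm

pull_arm <- function(arm) {
  # Return a reinforcement of 1 (success) or 0 (failure) for the chosen arm
  rbinom(1, size = 1, prob = true_probs[arm])
}

pull_arm(3)                # the reinforcement from pulling arm 3 on this trial
```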
Similar ideas are explained in greater detail in Aleksandrs Slivkins' textbook, Introduction to Multi-Armed Bandits. In the introduction, Slivkins describes the key points of bandit algorithms, highlighting the central tradeoff between exploration and exploitation: gathering new information about the arms versus making optimal near-term decisions based on the information already available. Exploration involves selecting new arms; exploitation involves selecting the arm that previously led to the best reinforcement.
This tradeoff is often formalized with a decision-making policy, which describes how an agent makes a choice. In this package, we have implemented three widely-used policies: greedy, epsilon-greedy, and softmax. The latter two are each governed by a single parameter (illustrative sketches of both follow the list below):
- `epsilon`, which describes the propensity to explore the action (arm) space. The higher the epsilon, the more likely the agent is to select a random action; the lower the epsilon, the more likely the agent is to select the exploitative action (the one with the highest expected value).
- `tau`, which describes the propensity to explore the action space. The higher the tau (temperature), the more random the actions; the lower the tau (temperature), the more exploitative the actions (e.g., lower temperature increases the probability of taking the action with the highest expected value).
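To make the role of `epsilon` concrete, here is a minimal sketch of epsilon-greedy action selection in base R. The function name and the use of per-arm value estimates are illustrative assumptions, not this package's implementation.

```{r}
# Epsilon-greedy: with probability epsilon pick a random arm (explore),
# otherwise pick the arm with the highest estimated value (exploit).
# Illustrative sketch only.
epsilon_greedy_choose <- function(values, epsilon) {
  if (runif(1) < epsilon) {
    sample(length(values), 1)    # explore: uniform random arm
  } else {
    which.max(values)            # exploit: arm with highest estimated value
  }
}

values <- c(0.2, 0.5, 0.1)       # e.g., running mean reward observed for each arm
epsilon_greedy_choose(values, epsilon = 0.1)
```

With `epsilon = 0.1`, the agent picks the best-looking arm on roughly 90% of trials and a random arm on the remainder.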
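Similarly, a sketch of softmax (Boltzmann) action selection shows how `tau` controls randomness; again, the function below is an assumption for illustration, not this package's API.

```{r}
# Softmax (Boltzmann) selection: arms are sampled with probability
# proportional to exp(value / tau). Illustrative sketch only.
softmax_choose <- function(values, tau) {
  scaled <- values / tau
  probs  <- exp(scaled - max(scaled))   # subtract max for numerical stability
  probs  <- probs / sum(probs)
  sample(length(values), 1, prob = probs)
}

values <- c(0.2, 0.5, 0.1)       # same hypothetical value estimates as above
softmax_choose(values, tau = 0.1)   # low temperature: almost always the best arm
softmax_choose(values, tau = 5)     # high temperature: close to uniform random
```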