ondrejba / bandits

License: MIT
Comparison of bandit algorithms from the Reinforcement Learning bible.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to bandits

RL-code-resources
A collection of Reinforcement Learning GitHub code resources divided by frameworks and environments
Stars: ✭ 51 (+218.75%)
Mutual labels:  reinforcement-learning-algorithms, reinforcement-learning-agent
Fruit-API
A Universal Deep Reinforcement Learning Framework
Stars: ✭ 61 (+281.25%)
Mutual labels:  reinforcement-learning-algorithms
POMDP
Implementing a RL algorithm based upon a partially observable Markov decision process.
Stars: ✭ 31 (+93.75%)
Mutual labels:  reinforcement-learning-algorithms
king-pong
Deep Reinforcement Learning Pong Agent, King Pong, he's the best
Stars: ✭ 23 (+43.75%)
Mutual labels:  reinforcement-learning-algorithms
reinforcement-learning-resources
A curated list of awesome reinforcement courses, video lectures, books, library and many more.
Stars: ✭ 38 (+137.5%)
Mutual labels:  reinforcement-learning-algorithms
Reinforcement-Learning-on-google-colab
Reinforcement Learning algorithm's using google-colab
Stars: ✭ 33 (+106.25%)
Mutual labels:  reinforcement-learning-algorithms
reinforced-race
A model car learns driving along a track using reinforcement learning
Stars: ✭ 37 (+131.25%)
Mutual labels:  reinforcement-learning-algorithms
ReinforcementLearning Sutton-Barto Solutions
Solutions and figures for problems from Reinforcement Learning: An Introduction Sutton&Barto
Stars: ✭ 20 (+25%)
Mutual labels:  sutton-book
onn
Online Deep Learning: Learning Deep Neural Networks on the Fly / Non-linear Contextual Bandit Algorithm (ONN_THS)
Stars: ✭ 139 (+768.75%)
Mutual labels:  reinforcement-learning-algorithms
yarll
Combining deep learning and reinforcement learning.
Stars: ✭ 84 (+425%)
Mutual labels:  reinforcement-learning-algorithms
rl-algorithms
Reinforcement learning algorithms
Stars: ✭ 40 (+150%)
Mutual labels:  reinforcement-learning-algorithms
Deep Reinforcement Learning
Repo for the Deep Reinforcement Learning Nanodegree program
Stars: ✭ 4,012 (+24975%)
Mutual labels:  reinforcement-learning-algorithms
UAV-DDPG
Code for paper "Computation Offloading Optimization for UAV-assisted Mobile Edge Computing: A Deep Deterministic Policy Gradient Approach"
Stars: ✭ 133 (+731.25%)
Mutual labels:  reinforcement-learning-algorithms
Reinforcement-Learning-CheatSheet
Cheatsheet of Reinforcement Learning (Based on Sutton-Barto Book - 2nd Edition)
Stars: ✭ 22 (+37.5%)
Mutual labels:  reinforcement-learning-algorithms
deeprl-continuous-control
Learning Continuous Control in Deep Reinforcement Learning
Stars: ✭ 14 (-12.5%)
Mutual labels:  reinforcement-learning-algorithms
Neural-Fictitous-Self-Play
Scalable Implementation of Neural Fictitous Self-Play
Stars: ✭ 52 (+225%)
Mutual labels:  reinforcement-learning-algorithms
agentmodels.org
Modeling agents with probabilistic programs
Stars: ✭ 66 (+312.5%)
Mutual labels:  reinforcement-learning-algorithms
rlberry
An easy-to-use reinforcement learning library for research and education.
Stars: ✭ 124 (+675%)
Mutual labels:  reinforcement-learning-algorithms
l2rpn-baselines
L2RPN Baselines a repository to host baselines for l2rpn competitions.
Stars: ✭ 57 (+256.25%)
Mutual labels:  reinforcement-learning-algorithms
Master-Thesis
Deep Reinforcement Learning in Autonomous Driving: the A3C algorithm used to make a car learn to drive in TORCS; Python 3.5, Tensorflow, tensorboard, numpy, gym-torcs, ubuntu, latex
Stars: ✭ 33 (+106.25%)
Mutual labels:  reinforcement-learning-algorithms

Bandits

Experiments with bandit algorithms from the 2nd chapter of Sutton and Barto's Reinforcement Learning: An Introduction.

Results

You can generate each plot with the command written under it.

Stationary Environment

The experimental setup follows the one described in the book:

The value of each action is drawn from a normal distribution with zero mean and unit variance and does not change during the experiment. The bandits take 1000 steps in the environment, choosing from 10 actions at each step. The rewards the bandits observe are the action values plus noise drawn from a normal distribution with zero mean and unit variance. Each experiment is repeated 2000 times.
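As a reference for this setup, here is my own minimal sketch of the testbed (not the code used in this repository; names and constants are assumptions):

import numpy as np

NUM_ACTIONS, NUM_STEPS, NUM_RUNS = 10, 1000, 2000

def make_action_values():
    # true action values, fixed for the whole run in the stationary setting
    return np.random.normal(0.0, 1.0, size=NUM_ACTIONS)

def pull(q_true, action):
    # observed reward: true action value plus zero-mean, unit-variance noise
    return q_true[action] + np.random.normal(0.0, 1.0)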

I plot the average reward and the percentage of times the bandits chose the optimal action.

Comparison from the book

I replicated Figure 2.1 from the book to check my implementation. The ε-greedy bandits outperform the greedy bandit in this simple testbed.
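For illustration, an ε-greedy bandit with incremental sample-average estimates can be sketched as follows (my own code, not necessarily how this repository implements it):

import numpy as np

class EpsilonGreedyBandit:

    def __init__(self, num_actions, epsilon, init=0.0):
        self.epsilon = epsilon
        self.q = np.full(num_actions, init)   # value estimates
        self.n = np.zeros(num_actions)        # pull counts

    def select_action(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.q))   # explore uniformly
        return int(np.argmax(self.q))               # exploit current estimates

    def update(self, action, reward):
        # incremental sample average: Q <- Q + (R - Q) / n
        self.n[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n[action]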

plot_from_book_1 plot_from_book_2

python -m scripts.compare_bandits_stationary images/book_1 -a epsilon epsilon epsilon -s 0.0 0.01 0.1 -l "ε=0" "ε=0.01" "ε=0.1" -t "ε-greedy bandits"

Epsilon-greedy bandits

Next, I compare ε-greedy bandits with different exploration settings. ε=0.1 performs the best.

epsilon_1 epsilon_2

python -m scripts.compare_bandits_stationary images/epsilon -a epsilon epsilon epsilon epsilon epsilon epsilon -s 0.0 0.01 0.1 0.2 0.5 1.0 -l "ε=0" "ε=0.01" "ε=0.1" "ε=0.2" "ε=0.5" "ε=1.0" -t "ε-greedy bandits"

Softmax bandits

Another type of bandit presented in the book is the softmax bandit. Softmax bandits should perform better than ε-greedy bandits because they avoid bad actions during exploration. However, they are quite sensitive to the setting of the temperature parameter τ.
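The softmax (Boltzmann) policy turns the value estimates into a probability distribution over actions, controlled by τ; a sketch under my own naming:

import numpy as np

def softmax_action(q, tau):
    # subtract the max before exponentiating for numerical stability;
    # this does not change the resulting probabilities
    prefs = (q - np.max(q)) / tau
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(np.random.choice(len(q), p=probs))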

softmax_1 softmax_2

python -m scripts.compare_bandits_stationary images/softmax -a softmax softmax softmax softmax softmax -s 0.1 0.2 0.5 1.0 2.0 -l "τ=0.1" "τ=0.2" "τ=0.5" "τ=1.0" "τ=2.0" -t "softmax bandits"

Optimistic initialization

Optimistic initialization is an alternative to the ε-greedy and softmax exploration policies. It outperforms the ε-greedy bandit in this simple environment but has some drawbacks (e.g. it cannot track non-stationary rewards). Interestingly, the optimistically initialized bandit chooses the optimal action less frequently than the ε-greedy bandit but still achieves a higher average reward.
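In code, optimistic initialization only changes where the value estimates start; reusing the hypothetical EpsilonGreedyBandit sketch from above (the repository may handle the initial value differently, e.g. combine it with a constant step size):

# greedy bandit (ε=0) whose estimates all start at an optimistic 5.0;
# every action initially looks better than it really is, so greedy
# action selection itself drives the early exploration
bandit = EpsilonGreedyBandit(num_actions=10, epsilon=0.0, init=5.0)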

optimistic_init_1 optimistic_init_2

python -m scripts.compare_bandits_stationary images/optimistic_init -a epsilon epsilon -s 0.0 0.1 -i 5.0 0.0 -l "ε=0, init=5" "ε=0.1, init=0" -t "Optimistic Initialization"

Final Comparison

Finally, I compare the best ε-greedy, softmax and optimistically initialized bandits. The softmax bandit wins by a small margin.

epsilon_vs_softmax_vs_optimistic_1 epsilon_vs_softmax_vs_optimistic_2

python -m scripts.compare_bandits_stationary images/epsilon_vs_softmax_vs_optimistic -a epsilon epsilon softmax -s 0.1 0.0 0.2 -l "ε=0.1, init=0" "ε=0, init=5" "τ=0.2, init=0" -i 0.0 5.0 0.0

Non-stationary Environment

In this experiment, all action values start at 0. After all agents perform a single action, the action values take a small random step drawn from a normal distribution. Therefore, the action values change as the bandits interact with the environment.

I compare the ε-greedy bandit from the previous section with a modified version that uses a constant step size α instead of sample averaging. A constant α weights recent rewards more heavily, which lets the bandit track the changing action values better.
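The only difference between the two agents is the step size in the value update; a sketch of the two rules (my own function, not the repository's API):

def update(q, n, action, reward, alpha=None):
    # alpha=None: sample average, step size 1/n, all rewards weighted equally
    # constant alpha (e.g. 0.1): exponential recency weighting, recent rewards
    # dominate, so the estimates can track drifting action values
    n[action] += 1
    step = alpha if alpha is not None else 1.0 / n[action]
    q[action] += step * (reward - q[action])

The "α=1/k" label in the command below refers to the sample-average case.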

The agents take 5000 steps in the environment instead of 1000, so that we can see the gap between the two agents increase.

non_stationary_bandits_1 non_stationary_bandits_2

python -m scripts.compare_bandits_nonstationary images/nonstationary -a epsilon epsilon -s 0.1 0.1 --alphas 0.0 0.1 -l "α=1/k" "α=0.1" -t "ε-greedy bandits, ε=0.1"

Advanced Bandits in a Stationary Environment

UCB

UCB bandits establish an upper bound on regret, i.e. how much we lose by not playing the optimal action. UCB1 beats the ε-greedy bandit without having any parameters to tune.
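UCB1 adds a confidence-interval bonus to each value estimate and picks the maximizing action; a sketch under my own naming:

import numpy as np

def ucb1_action(q, n, t):
    # q: value estimates, n: pull counts per action, t: total pulls so far
    if np.any(n == 0):
        return int(np.argmin(n))              # play every action once first
    bonus = np.sqrt(2.0 * np.log(t) / n)      # exploration bonus shrinks as n grows
    return int(np.argmax(q + bonus))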

ucb1_1 ucb1_2

python -m scripts.compare_bandits_stationary images/ucb1 -a ucb1 epsilon -s 0.0 0.1 -l "UCB1" "ε-greedy (ε=0.1)"

UCB2 tightens the bound on the regret but introduces an additional parameter α.
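For reference, my reading of UCB2 (Auer et al., 2002): actions are played in epochs of length τ(r) = ⌈(1+α)^r⌉, and the exploration bonus for an action in its r-th epoch after t total pulls is roughly as sketched below (not the repository's code):

import math

def ucb2_bonus(t, r, alpha):
    # tau(r) = ceil((1 + alpha)^r) is the epoch-length schedule
    tau = math.ceil((1.0 + alpha) ** r)
    return math.sqrt((1.0 + alpha) * math.log(math.e * t / tau) / (2.0 * tau))

The action maximizing estimate plus bonus is then played τ(r+1) − τ(r) times in a row before its epoch counter r is incremented.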

ucb2_alpha_1 ucb2_alpha_2

python -m scripts.compare_bandits_stationary images/ucb2_alpha -a ucb2 ucb2 ucb2 ucb2 -s 0.001 0.01 0.1 0.5 -l "α=0.001" "α=0.01" "α=0.1" "α=0.5" -t UCB2

UCB2 reaches optimal performance faster than UCB1.

ucb2_1 ucb2_2

python -m scripts.compare_bandits_stationary images/ucb2 -a ucb2 ucb1 epsilon -s 0.5 0.0 0.1 -l "UCB2 (α=0.5)" "UCB1" "ε-greedy (ε=0.1)"

Setup

Install Python 3 and all packages listed in requirements.txt.
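For example, assuming pip is used to manage packages:

pip install -r requirements.txt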

Usage

Each script contains documentation for all of its arguments.

For stationary experiments, execute:

python -m scripts.compare_bandits_stationary -h

and for non-stationary experiments:

python -m scripts.compare_bandits_nonstationary -h