mtrazzi / Rl Book Challenge

License: MIT
Self-studying the Sutton & Barto book the hard way

Programming Languages

Python
139,335 projects; the #7 most used programming language

Projects that are alternatives to or similar to Rl Book Challenge

Policy Gradient
Minimal Monte Carlo Policy Gradient (REINFORCE) Algorithm Implementation in Keras
Stars: ✭ 135 (-7.53%)
Mutual labels:  reinforcement-learning
Awesome Deep Learning Papers For Search Recommendation Advertising
Awesome Deep Learning papers for industrial Search, Recommendation, and Advertising, focusing on Embedding, Matching, Ranking (CTR and CVR prediction), Post Ranking, Transfer, Reinforcement Learning, Self-supervised Learning, and more.
Stars: ✭ 136 (-6.85%)
Mutual labels:  reinforcement-learning
Allenact
An open source framework for research in Embodied-AI from AI2.
Stars: ✭ 144 (-1.37%)
Mutual labels:  reinforcement-learning
Ai plays snake
AI trained using a genetic algorithm and deep learning to play the game of Snake
Stars: ✭ 137 (-6.16%)
Mutual labels:  reinforcement-learning
Safe learning
Safe reinforcement learning with stability guarantees
Stars: ✭ 140 (-4.11%)
Mutual labels:  reinforcement-learning
Data Analysis
Mainly a summary of web-scraping and data-analysis projects, plus modeling, machine learning, and model evaluation.
Stars: ✭ 142 (-2.74%)
Mutual labels:  matplotlib
Reinforcement learning in python
Implementing reinforcement learning, namely the Q-learning and Sarsa algorithms, for global path planning of a mobile robot in an unknown environment with obstacles. Comparative analysis of Q-learning and Sarsa.
Stars: ✭ 134 (-8.22%)
Mutual labels:  reinforcement-learning
Sumo Rl
A simple interface to instantiate Reinforcement Learning environments with SUMO for Traffic Signal Control. Compatible with Gym Env from OpenAI and MultiAgentEnv from RLlib.
Stars: ✭ 145 (-0.68%)
Mutual labels:  reinforcement-learning
Flappy Es
Flappy Bird AI using Evolution Strategies
Stars: ✭ 140 (-4.11%)
Mutual labels:  reinforcement-learning
Data Science Question Answer
A repo for data science related questions and answers
Stars: ✭ 2,000 (+1269.86%)
Mutual labels:  reinforcement-learning
Zhihu Spider
A multi-threaded Python crawler that collects profile information of Zhihu users.
Stars: ✭ 137 (-6.16%)
Mutual labels:  matplotlib
Complete Life Cycle Of A Data Science Project
Complete-Life-Cycle-of-a-Data-Science-Project
Stars: ✭ 140 (-4.11%)
Mutual labels:  reinforcement-learning
Cherry
A PyTorch Library for Reinforcement Learning Research
Stars: ✭ 143 (-2.05%)
Mutual labels:  reinforcement-learning
Savn
Learning to Learn how to Learn: Self-Adaptive Visual Navigation using Meta-Learning (https://arxiv.org/abs/1812.00971)
Stars: ✭ 135 (-7.53%)
Mutual labels:  reinforcement-learning
Machin
A reinforcement learning library (framework) designed for PyTorch; implements DQN, DDPG, A2C, PPO, SAC, MADDPG, A3C, APEX, IMPALA ...
Stars: ✭ 145 (-0.68%)
Mutual labels:  reinforcement-learning
Ml Cheatsheet
A constantly updated Python machine learning cheatsheet
Stars: ✭ 136 (-6.85%)
Mutual labels:  matplotlib
Deep Learning Resources
A Collection of resources I have found useful on my journey finding my way through the world of Deep Learning.
Stars: ✭ 141 (-3.42%)
Mutual labels:  matplotlib
Tensor2tensor
Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Stars: ✭ 11,865 (+8026.71%)
Mutual labels:  reinforcement-learning
Articulations Robot Demo
Stars: ✭ 145 (-0.68%)
Mutual labels:  reinforcement-learning
Gb
A minimal C implementation of the Nintendo Game Boy: a fast research environment for reinforcement learning
Stars: ✭ 143 (-2.05%)
Mutual labels:  reinforcement-learning

In this repo

  1. Python replication of all the plots from Reinforcement Learning: An Introduction
  2. Solutions to all of the exercises
  3. Anki flashcard summary of the book

1. Replicate all the figures

To reproduce a figure, say Figure 2.2, do:

cd chapter2
python figures.py 2.2

Chapter 2

  1. Figure 2.2: Average performance of epsilon-greedy action-value methods on the 10-armed testbed
  2. Figure 2.3: Optimistic initial action-value estimates
  3. Figure 2.4: Average performance of UCB action selection on the 10-armed testbed
  4. Figure 2.5: Average performance of the gradient bandit algorithm
  5. Figure 2.6: A parameter study of the various bandit algorithms
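
For readers who want a feel for what Figure 2.2 replicates, here is a minimal sketch of sample-average epsilon-greedy action selection on a 10-armed testbed. It is not the chapter2/ code; the function name run_bandit and all parameter defaults are invented for this illustration.

import numpy as np

def run_bandit(eps, steps=1000, k=10, seed=0):
    """One run of sample-average epsilon-greedy on a k-armed testbed."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, k)             # true action values q*(a)
    q_est = np.zeros(k)                          # sample-average estimates Q(a)
    counts = np.zeros(k)
    rewards = np.zeros(steps)
    for t in range(steps):
        a = rng.integers(k) if rng.random() < eps else int(np.argmax(q_est))
        r = rng.normal(q_true[a], 1.0)           # reward drawn from N(q*(a), 1)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample-average update
        rewards[t] = r
    return rewards

# Averaging many independent runs gives the shape of the epsilon = 0.1 curve in Figure 2.2.
avg_reward = np.mean([run_bandit(0.1, seed=i) for i in range(200)], axis=0)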

Chapter 4

  1. Figure 4.2: Jack’s car rental problem (value function, policy)
  2. Figure 4.3: The solution to the gambler’s problem (value function, policy)
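
As context for the gambler's-problem figure, the value function is typically computed with value iteration. Below is a rough, self-contained sketch under the book's setup (ph is the probability of the coin coming up heads); it is not the chapter4/ implementation, and gamblers_value_iteration is a name invented here.

import numpy as np

def gamblers_value_iteration(ph=0.4, goal=100, theta=1e-9):
    """Value iteration for the gambler's problem (Example 4.3)."""
    V = np.zeros(goal + 1)
    V[goal] = 1.0                                   # reaching the goal is worth 1
    while True:
        delta = 0.0
        for s in range(1, goal):
            stakes = range(1, min(s, goal - s) + 1)
            returns = [ph * V[s + a] + (1 - ph) * V[s - a] for a in stakes]
            best = max(returns)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    return V

V = gamblers_value_iteration(ph=0.25)               # the setting used in Exercise 4.9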

Chapter 5

  1. Figure 5.1: Approximate state-value functions for the blackjack policy
  2. Figure 5.2: The optimal policy and state-value function for blackjack found by Monte Carlo ES
  3. Figure 5.3: Weighted importance sampling
  4. Figure 5.4: Ordinary importance sampling with surprisingly unstable estimates
  5. Figure 5.5: A couple of right turns for the racetrack task (1, 2, 3)
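
Figures 5.3 and 5.4 contrast weighted and ordinary importance sampling. The sketch below shows the two estimators given per-episode importance ratios and returns; it is illustrative only, and is_estimates and its arguments are invented names rather than the chapter5/ API.

import numpy as np

def is_estimates(rhos, returns):
    """Ordinary and weighted importance-sampling estimates of a state's value.
    rhos:    per-episode importance ratios, the product of pi(A_t|S_t) / b(A_t|S_t)
    returns: the corresponding episode returns G"""
    rhos = np.asarray(rhos, dtype=float)
    returns = np.asarray(returns, dtype=float)
    ordinary = np.sum(rhos * returns) / len(returns)
    weighted = np.sum(rhos * returns) / np.sum(rhos) if np.sum(rhos) > 0 else 0.0
    return ordinary, weighted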

Chapter 6

  1. Figure 6.1: Changes recommended in the driving home example by Monte Carlo methods (left) and TD methods (right)
  2. Example 6.2: Random walk (comparison)
  3. Figure 6.2: Performance of TD(0) and constant-α MC under batch training on the random walk task
  4. Example 6.5: Windy Gridworld
  5. Example 6.6: Cliff Walking
  6. Figure 6.3: Interim and asymptotic performance of TD control methods (comparison)
  7. Figure 6.5: Comparison of Q-learning and Double Q-learning
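
The cliff-walking and double Q-learning figures are all built on the tabular Q-learning update. A generic sketch follows; it is not the chapter6/ code, and it assumes integer states, a 2-D Q table, and an env following the gym-like API described under Design choices below.

import numpy as np

def q_learning_episode(env, Q, alpha=0.5, gamma=1.0, eps=0.1, rng=np.random.default_rng()):
    """One episode of tabular Q-learning; updates the Q table in place."""
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy behavior policy over the actions (columns of Q)
        a = rng.integers(Q.shape[1]) if rng.random() < eps else int(np.argmax(Q[s]))
        s_p, r, done, _ = env.step(a)
        # off-policy target: bootstrap from the greedy (max) action value of the next state
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_p]) * (not done) - Q[s, a])
        s = s_p
    return Q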

Chapter 7

  1. Figure 7.2: Performance of n-step TD methods on 19-state random walk (comparison)
  2. Figure 7.4: Gridworld example of the speedup of policy learning due to the use of n-step methods
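
The n-step methods compared in this chapter all revolve around the n-step return. A tiny illustrative helper (the name and argument layout are made up, not how chapter7/ structures it):

def n_step_return(rewards, V, s_next, gamma=1.0):
    """n-step return: the next n rewards plus the bootstrapped value of the state n steps ahead.
    rewards: R_{t+1}..R_{t+n}; s_next: S_{t+n}, or None if the episode ended before then."""
    G = sum(gamma ** i * r for i, r in enumerate(rewards))
    if s_next is not None:
        G += gamma ** len(rewards) * V[s_next]
    return G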

Chapter 8

  1. Figure 8.2: Average learning curves for Dyna-Q agents varying in their number of planning steps
  2. Figure 8.3: Policies found by planning and nonplanning Dyna-Q agents
  3. Figure 8.4: Average performance of Dyna agents on a blocking task
  4. Figure 8.5: Average performance of Dyna agents on a shortcut task
  5. Example 8.4: Prioritized sweeping significantly shortens learning time on the Dyna maze task
  6. Figure 8.7: Comparison of efficiency of expected and sample updates
  7. Figure 8.8: Relative efficiency of different update distributions
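
A compressed sketch of the Dyna-Q loop behind Figures 8.2 through 8.5, shown only for orientation: it is not the chapter8/ code, and it assumes a tabular setting with integer states and the gym-like env API described under Design choices below.

import numpy as np

def dyna_q_step(env, Q, model, s, alpha=0.1, gamma=0.95, eps=0.1, n_planning=5,
                rng=np.random.default_rng()):
    """One real environment step followed by n_planning simulated (planning) updates."""
    a = rng.integers(Q.shape[1]) if rng.random() < eps else int(np.argmax(Q[s]))
    s_p, r, done, _ = env.step(a)
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_p]) * (not done) - Q[s, a])   # direct RL
    model[(s, a)] = (r, s_p, done)                                           # model learning
    for _ in range(n_planning):                                              # planning from the model
        (ps, pa), (pr, ps_p, pdone) = list(model.items())[rng.integers(len(model))]
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps_p]) * (not pdone) - Q[ps, pa])
    return s_p, done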

Chapter 9

  1. Figure 9.1: Gradient Monte Carlo algorithm on the 1000-state random walk task
  2. Figure 9.2: Semi-gradient n-step TD algorithm on the 1000-state random walk task
  3. Figure 9.5: Fourier basis vs. polynomials on the 1000-state random walk task (comparison)
  4. Figure 9.10: State aggregation vs. tile coding on the 1000-state random walk task (comparison)
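
The 1000-state random-walk figures use gradient Monte Carlo with state aggregation as a baseline. A minimal sketch of that update (illustrative only; names, state numbering, and defaults are assumptions, not the chapter9/ implementation):

import numpy as np

def gradient_mc_update(w, episode, alpha=2e-5, n_states=1000, n_groups=10):
    """One gradient Monte Carlo sweep with state aggregation.
    w: one weight per group (the aggregated value estimates);
    episode: list of (state, return) pairs with states numbered 1..n_states."""
    group_size = n_states // n_groups
    for state, G in episode:
        g = (state - 1) // group_size      # group index of the visited state
        w[g] += alpha * (G - w[g])         # SGD step; the feature vector is one-hot on the group
    return w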

Chapter 10

  1. Figure 10.1: The cost-to-go function for the Mountain Car task in one run (428 steps; 12, 104, 1000, and 9000 episodes)
  2. Figure 10.2: Learning curves for semi-gradient Sarsa on the Mountain Car task
  3. Figure 10.3: One-step vs. multi-step performance of semi-gradient Sarsa on the Mountain Car task
  4. Figure 10.4: Effect of alpha and n on the early performance of n-step semi-gradient Sarsa
  5. Figure 10.5: Differential semi-gradient Sarsa on the access-control queuing task

Chapter 11

  1. Figure 11.2: Demonstration of instability on Baird’s counterexample
  2. Figure 11.5: The behavior of the TDC algorithm on Baird’s counterexample
  3. Figure 11.6: The behavior of the ETD algorithm in expectation on Baird’s counterexample

Chapter 12

  1. Figure 12.3: Off-line λ-return algorithm on 19-state random walk
  2. Figure 12.6: TD(λ) algorithm on 19-state random walk
  3. Figure 12.8: True online TD(λ) algorithm on 19-state random walk
  4. Figure 12.10: Sarsa(λ) with replacing traces on Mountain Car
  5. Figure 12.11: Summary comparison of Sarsa(λ) algorithms on Mountain Car
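
For orientation, here is a tabular sketch of TD(λ) with accumulating eligibility traces, the core of the random-walk comparisons above. It is not the chapter12/ code (the figures also cover replacing traces and true online TD(λ)); the env is assumed to follow the gym-like API under Design choices, with integer states and a dummy action.

import numpy as np

def td_lambda_episode(env, V, alpha=0.1, gamma=1.0, lam=0.8):
    """One episode of tabular TD(lambda) with accumulating traces; V is a numpy array."""
    z = np.zeros_like(V)                       # eligibility traces
    s = env.reset()
    done = False
    while not done:
        s_p, r, done, _ = env.step(0)          # assumption: the random-walk env ignores the action
        delta = r + gamma * V[s_p] * (not done) - V[s]
        z *= gamma * lam                       # decay every trace
        z[s] += 1.0                            # accumulate the trace of the visited state
        V += alpha * delta * z                 # every state is updated in proportion to its trace
        s = s_p
    return V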

Chapter 13

  1. Figure 13.1: REINFORCE on the short-corridor gridworld
  2. Figure 13.2: REINFORCE with baseline on the short-corridor gridworld
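
For Figure 13.1 the policy-gradient update is plain REINFORCE. Below is a rough sketch of the Monte Carlo policy-gradient step with a softmax policy over action preferences, state-independent as in the short-corridor example; it is not the chapter13/ code, and reinforce_update and its episode format are assumptions.

import numpy as np

def softmax(h):
    e = np.exp(h - np.max(h))
    return e / e.sum()

def reinforce_update(theta, episode, alpha=2e-4, gamma=1.0):
    """REINFORCE step; theta is a numpy array of action preferences h(a) = theta[a].
    episode: list of (action, reward) pairs. gamma defaults to 1, as in the short corridor."""
    G = 0.0
    for a, r in reversed(episode):       # walk the episode backwards, accumulating returns
        G = r + gamma * G
        pi = softmax(theta)
        grad_log = -pi                   # gradient of log pi(a) for a softmax over preferences
        grad_log[a] += 1.0
        theta += alpha * G * grad_log
    return theta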

2. Solutions to all of the exercises (text answers)

To reproduce the results of an exercise, say Exercise 2.5, do:

cd chapter2
python figures.py ex2.5

Chapter 2

  1. Exercise 2.5: Difficulties that sample-average methods have for nonstationary problems

  2. Exercise 2.11: Figure analogous to Figure 2.6 for the nonstationary case

Chapter 4

  1. Exercise 4.7: Modified Jack's car rental problem (value function, policy)

  2. Exercise 4.9: Gambler’s problem with ph = 0.25 (value function, policy) and ph = 0.55 (value function, policy)

Chapter 5

  1. Exercise 5.14: Modified MC Control on the racetrack (1, 2)

Chapter 6

  1. Exercise 6.4: Wider range of alpha values
  2. Exercise 6.5: High alpha, effect of initialization
  3. Exercise 6.9: Windy Gridworld with King’s Moves
  4. Exercise 6.10: Stochastic Wind
  5. Exercise 6.13: Double Expected Sarsa vs. Expected Sarsa

Chapter 7

  1. Exercise 7.2: Sum of TD errors vs. n-step TD on the 19-state random walk
  2. Exercise 7.3: 19 states vs. 5 states, left-side outcome of -1
  3. Exercise 7.7: Off-policy action-value prediction on a not-so-random walk
  4. Exercise 7.10: Off-policy action-value prediction on a not-so-random walk

Chapter 8

  1. Exercise 8.1: n-step Sarsa on the maze task
  2. Exercise 8.4: Gridworld experiment to test the exploration bonus

Chapter 11

  1. Exercise 11.3: One-step semi-gradient Q-learning applied to Baird’s counterexample

3. Anki flashcards (cf. this blog)

Appendix

Dependencies

numpy
matplotlib
seaborn

Credits

All of the code and answers are mine, except for the Mountain Car tile coding (URL given in the book).

This README is inspired by ShangtongZhang's repo.

Design choices

  1. All of the chapters are self-contained.
  2. The environments use a gym-like API with the methods:
s = env.reset()                # reset the environment and return the initial state
s_p, r, d, info = env.step(a)  # next state, reward, done flag, and an info dict
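
To make that API concrete, here is a sketch of what an environment with this interface could look like. The ToyCorridor class is a made-up example, not one of the repo's environments.

class ToyCorridor:
    """Minimal environment following the gym-like API used in this repo:
    s = env.reset() and s_p, r, d, info = env.step(a)."""

    def __init__(self, length=5):
        self.length = length
        self.s = 0

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        # a == 1 moves right, anything else moves left (clipped at 0)
        self.s = min(self.s + 1, self.length) if a == 1 else max(self.s - 1, 0)
        r = 1.0 if self.s == self.length else 0.0   # reward only at the goal
        d = self.s == self.length                   # episode ends at the goal
        return self.s, r, d, {}

env = ToyCorridor()
s = env.reset()
s_p, r, d, info = env.step(1)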

How long did it take?

The entire thing (plots, exercises, and Anki cards, including reviewing them) took about 400 hours of focused work.
