All Projects → omerbsezer → Reinforcement_learning_tutorial_with_demo

omerbsezer / Reinforcement_learning_tutorial_with_demo

Reinforcement Learning Tutorial with Demo: DP (Policy and Value Iteration), Monte Carlo, TD Learning (SARSA, QLearning), Function Approximation, Policy Gradient, DQN, Imitation, Meta Learning, Papers, Courses, etc..

Projects that are alternatives of or similar to Reinforcement learning tutorial with demo

Easy Rl
强化学习中文教程,在线阅读地址:https://datawhalechina.github.io/easy-rl/
Stars: ✭ 3,004 (+579.64%)
Mutual labels:  reinforcement-learning, deep-reinforcement-learning, policy-gradient, q-learning, a3c, imitation-learning
Reinforcement Learning With Tensorflow
Simple Reinforcement learning tutorials, 莫烦Python 中文AI教学
Stars: ✭ 6,948 (+1471.95%)
Mutual labels:  tutorial, reinforcement-learning, policy-gradient, actor-critic, q-learning, a3c
Reinforcement Learning
Minimal and Clean Reinforcement Learning Examples
Stars: ✭ 2,863 (+547.74%)
Mutual labels:  reinforcement-learning, deep-reinforcement-learning, policy-gradient, actor-critic, a3c
Hands On Reinforcement Learning With Python
Master Reinforcement and Deep Reinforcement Learning using OpenAI Gym and TensorFlow
Stars: ✭ 640 (+44.8%)
Mutual labels:  jupyter-notebook, reinforcement-learning, deep-reinforcement-learning, policy-gradient, q-learning
Deep-Reinforcement-Learning-With-Python
Master classic RL, deep RL, distributional RL, inverse RL, and more using OpenAI Gym and TensorFlow with extensive Math
Stars: ✭ 222 (-49.77%)
Mutual labels:  deep-reinforcement-learning, q-learning, policy-gradient, a3c, actor-critic
Reinforcementlearning Atarigame
Pytorch LSTM RNN for reinforcement learning to play Atari games from OpenAI Universe. We also use Google Deep Mind's Asynchronous Advantage Actor-Critic (A3C) Algorithm. This is much superior and efficient than DQN and obsoletes it. Can play on many games
Stars: ✭ 118 (-73.3%)
Mutual labels:  jupyter-notebook, reinforcement-learning, deep-reinforcement-learning, actor-critic, a3c
Deep Reinforcement Learning
Repo for the Deep Reinforcement Learning Nanodegree program
Stars: ✭ 4,012 (+807.69%)
Mutual labels:  dynamic-programming, jupyter-notebook, reinforcement-learning, deep-reinforcement-learning
Deeprl Tutorials
Contains high quality implementations of Deep Reinforcement Learning algorithms written in PyTorch
Stars: ✭ 748 (+69.23%)
Mutual labels:  jupyter-notebook, reinforcement-learning, deep-reinforcement-learning, actor-critic
Drq
DrQ: Data regularized Q
Stars: ✭ 268 (-39.37%)
Mutual labels:  jupyter-notebook, reinforcement-learning, deep-reinforcement-learning, actor-critic
Rl Course Experiments
Stars: ✭ 73 (-83.48%)
Mutual labels:  jupyter-notebook, reinforcement-learning, deep-reinforcement-learning, policy-gradient
Torch Ac
Recurrent and multi-process PyTorch implementation of deep reinforcement Actor-Critic algorithms A2C and PPO
Stars: ✭ 70 (-84.16%)
Mutual labels:  reinforcement-learning, deep-reinforcement-learning, actor-critic, a3c
Basic reinforcement learning
An introductory series to Reinforcement Learning (RL) with comprehensive step-by-step tutorials.
Stars: ✭ 826 (+86.88%)
Mutual labels:  jupyter-notebook, tutorial, reinforcement-learning, q-learning
Openai lab
An experimentation framework for Reinforcement Learning using OpenAI Gym, Tensorflow, and Keras.
Stars: ✭ 313 (-29.19%)
Mutual labels:  reinforcement-learning, deep-reinforcement-learning, policy-gradient, actor-critic
2048 Deep Reinforcement Learning
Trained A Convolutional Neural Network To Play 2048 using Deep-Reinforcement Learning
Stars: ✭ 169 (-61.76%)
Mutual labels:  jupyter-notebook, reinforcement-learning, deep-reinforcement-learning, q-learning
Pytorch sac
PyTorch implementation of Soft Actor-Critic (SAC)
Stars: ✭ 174 (-60.63%)
Mutual labels:  jupyter-notebook, reinforcement-learning, deep-reinforcement-learning, actor-critic
rl implementations
No description or website provided.
Stars: ✭ 40 (-90.95%)
Mutual labels:  deep-reinforcement-learning, policy-gradient, actor-critic, meta-learning
Slm Lab
Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".
Stars: ✭ 904 (+104.52%)
Mutual labels:  reinforcement-learning, deep-reinforcement-learning, policy-gradient, a3c
Pytorch A3c
PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning".
Stars: ✭ 879 (+98.87%)
Mutual labels:  reinforcement-learning, deep-reinforcement-learning, actor-critic, a3c
Pytorch Rl
Tutorials for reinforcement learning in PyTorch and Gym by implementing a few of the popular algorithms. [IN PROGRESS]
Stars: ✭ 121 (-72.62%)
Mutual labels:  jupyter-notebook, reinforcement-learning, policy-gradient, actor-critic
Explorer
Explorer is a PyTorch reinforcement learning framework for exploring new ideas.
Stars: ✭ 54 (-87.78%)
Mutual labels:  deep-reinforcement-learning, q-learning, policy-gradient, actor-critic

Reinforcement Learning (RL) Tutorial

There are many RL tutorials, courses, papers in the internet. This one summarizes all of the RL tutorials, RL courses, and some of the important RL papers including sample code of RL algorithms. It will continue to be updated over time.

Keywords: Dynamic Programming (Policy and Value Iteration), Monte Carlo, Temporal Difference (SARSA, QLearning), Approximation, Policy Gradient, DQN, Imitation Learning, Meta-Learning, RL papers, RL courses, etc.

NOTE: This tutorial is only for education purpose. It is not academic study/paper. All related references are listed at the end of the file.

Table of Contents

What is RL?

Machine learning mainly consists of three methods: Supervised Learning, Unsupervised Learning and Reinforcement Learning. Supervised Learning provides mapping functionality between input and output using labelled dataset. Some of the supervised learning methods: Linear Regression, Support Vector Machines, Neural Networks, etc. Unsupervised Learning provides grouping and clustering functionality. Some of the unsupervised learning methods: K-Means, DBScan, etc. Reinforcement Learning is different from supervised and unsupervised learning. RL provides behaviour learning.

"A reinforcement learning algorithm, or agent, learns by interacting with its environment. The agent receives rewards by performing correctly and penalties for performing incorrectly. The agent learns without intervention from a human by maximizing its reward and minimizing its penalty" *. RL agents are used in different applications: Robotics, self driving cars, playing atari games, managing investment portfolio, control problems. I am believing that like many AI laboratories do, reinforcement learning with deep learning will be a core technology in the future.

rl-agent-env [Sutton & Barto Book: RL: An Introduction]

Markov Decision Process :

  • It consists of five tuples: status, actions, rewards, state transition probability, discount factor.
  • Markov decision processes formally describe an environment for reinforcement learning.
  • There are 3 techniques for solving MDPs: Dynamic Programming (DP) Learning, Monte Carlo (MC) Learning, Temporal Difference (TD) Learning.

mdps [David Silver Lecture Notes]

Markov Property :

A state St is Markov if and only if P[St+1|St] =P[St+1|S1,...,St]

RL Components :

Rewards :

  • A reward Rt is a scalar feedback signal.
  • The agent’s job is to maximise cumulative reward

State Transition Probability :

  • p(s′,r|s,a).= Pr{St=s′,Rt=r|St-1=s,At−1=a},

Discount Factor :

  • The discount γ∈[0,1] is the present value of future rewards.

Return :

  • The return Gt is the total discounted reward from time-step t.

return [David Silver Lecture Notes]

Value Function :

  • Value function is a prediction of future reward. How good is each state and/or action.
  • The value function v(s) gives the long-term value of state s
  • Vπ(s) =Eπ[Rt+1+γRt+22Rt+3+...|St=s]
  • Value function has two parts: immediate reward and discounted value of successor state.

value_function [David Silver Lecture Notes]

Policy (π) :

A policy is the agent’s behaviour. It is a map from state to action.

  • Deterministic policy: a=π(s).
  • Stochastic policy: π(a|s) =P[At=a|St=s].

State-Value Function :

state-value [David Silver Lecture Notes]

Action-Value Function :

action- value [David Silver Lecture Notes]

Optimal Value Functions :

optimal-value [David Silver Lecture Notes]

Planning vs RL :

Planning:

  • Rules of the game are known.
  • A model of the environment is known.
  • The agent performs computations with its mode.
  • The agent improves its policy.

RL:

  • The environment is initially unknown.
  • The agent interacts with the environment.
  • The agent improves its policy.

Exploration and Exploitation :

  • Reinforcement learning is like trial-and-error learning.
  • The agent should discover a good policy.
  • Exploration finds more information about the environment (Gather more information).
  • Exploitation exploits known information to maximise reward (Make the best decision given current information).

Prediction & Control Problem (Pattern of RL algorithms) :

  • Prediction: evaluate the future (Finding value given a policy).
  • Control: optimise the future (Finding optimal/best policy).

Grid World :

  • Grid World is a game for demonstration. 12 positions, 11 states, 4 actions. Our aim is to find optimal policy.
  • Demo Code: gridWorldGame.py

grid-world

optimal-policy-grid

Dynamic Programming Method (DP): Full Model :

  • Dynamic Programming is a very general solution method for problems which have two properties: 1.Optimal substructure, 2.Overlapping subproblems.
  • Markov decision processes satisfy both properties. Bellman equation gives recursive decomposition. Value function stores and reuses solutions.
  • In DP method, full model is known, It is used for planning in an MDP.
  • There are 2 methods: Policy Iteration, Value Iteration.
  • DP uses full-width backups.
  • DP is effective for medium-sized problems (millions of states).
  • For large problems DP suffers Bellman’s curse of dimensionality.
  • Disadvantage of DP: requires full model of environment, never learns from experience.

Policy Iteration (with Pseudocode) :

policy iteration [David Silver Lecture Notes]

policy-iteration

Policy Evaluation (with Pseudocode) :

  • Problem: evaluate a given policy π.
  • Solution: iterative application of Bellman expectation backup.
  • v1 → v2→ ... → vπ.

iterative-policy-evaluation

Policy Improvement :

policy-improvement

Value Iteration (with Pseudocode) :

  • Policy iteration has 2 inner loop. However, value iteration has a better solution.
  • It combines policy evaluation and policy improvement into one step.
  • Problem: find optimal policy π.
  • Solution: iterative application of Bellman optimality backup.

value-iteration

Monte Carlo (MC) Method :

  • Demo Code: monte_carlo_demo.ipynb
  • MC methods learn directly from episodes of experience.
  • MC is model-free : no knowledge of MDP transitions / rewards.
  • MC uses the simplest possible idea: value = mean return.
  • Episode must terminate before calculating return.
  • Average return is calculated instead of using true return G.
  • First Visit MC: The first time-step t that state s is visited in an episode.
  • Every Visit MC: Every time-step t that state s is visited in an episode.

MC Calculating Returns (with Pseudocode) :

mc-calculating-returns

First-Visit MC (with Pseudocode) (MC Prediction Problem) :

first-visit-mc

MC Exploring-Starts (with Pseudocode) (MC Control Problem) :

  • Demo Code: monte_carlo_es_demo.ipynb
  • State s and Action a is randomly selected for all starting points.
  • Use Q instead of V
  • Update the policy after every episode, keep updating the same Q in-place.

mc-control1

MC Epsilon Greedy (without Exploring Starts) :

  • Demo Code: monte_carlo_epsilon_greedy_demo.ipynb
  • MC Exploring start is infeasible, because in real problems we can not calculate all edge cases (ex: in self-driving car problem, we can not calculate all cases).
  • Randomly selection for all starting points in code is removed.
  • Change policy to sometimes be random.
  • This random policy is Epsilon-Greedy (like multi-armed bandit problem)

Temporal Difference (TD) Learning Method :

  • Demo Code: td0_prediction.ipynb
  • TD methods learn directly from episodes of experience.
  • TD updates a guess towards a guess
  • TD learns from incomplete episodes, by bootstrapping.
  • TD uses bootstrapping like DP, TD learns experience like MC (combines MC and DP).

MC - TD Difference :

  • MC and TD learn from experience.
  • TD can learn before knowing the final outcome.
  • TD can learn online after every step. MC must wait until end of episode before return is known.
  • TD can learn without the final outcome.
  • TD can learn from incomplete sequences. MC can only learn from complete sequences.
  • TD works in continuing environments. MC only works for episodic environments.
  • MC has high variance, zero bias. TD has low variance, some bias.

mc-td-dif1 mc-td-dif2 mc-td-dif3 [David Silver Lecture Notes]

MC - TD - DP Difference in Visual :

mc-td-dp [David Silver Lecture Notes]

SARSA (TD Control Problem, On-Policy) :

  • Demo Code: SARSA_demo.ipynb
  • In on-policy learning the Q(s,a) function is learned from actions, we took using our current policy π.

updatingwithsarsa sarsa-algo [David Silver Lecture Notes]

Q-Learning (TD Control Problem, Off-Policy) :

  • Demo Code: q_learning_demo.ipynb
  • Looks like SARSA, instead of choosing a' based on argmax of Q, Q(s,a) is updated directly with max over Q(s',a')
  • In off-policy learning the Q(s,a) function is learned from different actions (for example, random actions). We even don't need a policy at all.

qfunction

updatingwithqlearning qlearning-algo [David Silver Lecture Notes]

Function Approximation :

  • Demo Code: func_approx_q_learning_demo.ipynb
  • Reinforcement learning can be used to solve large problems, e.g Backgammon: 1020 states; Computer Go: 10170 states; Helicopter: continuous state space.
  • So far we have represented value function by a lookup table, Every state s has an entry V(s) or every state-action pair s, a has an entry Q(s,a).
  • There are too many states and/or actions to store in memory. It is too slow to learn the value of each state individually. Tabulated Q may not fit memory.
  • Solution for large MDPs:

func-appr

  • Differentiable function approximators can be used: Linear combinations of features, Neural Networks.

value-func-appr

func-appr2 [David Silver Lecture Notes]

Feature Vector :

feature-vectors [David Silver Lecture Notes]

Open AI Gym Environment :

gym

Policy-Based Methods :

  • DP, MC and TD Learning methods are value-based methods (Learnt Value Function, Implicit policy).
  • In value-based methods, a policy was generated directly from the value function (e.g. using epsilon-greedy)
  • In policy-based, we will directly parametrise the policy ( πθ(s,a) =P[a|s,θ) ).
  • Policy Gradient method is a policy-based method (No Value Function, Learnt Policy).

policy-based

Advantages:

  • Better convergence properties,
  • Effective in high-dimensional or continuous action spaces,
  • Can learn stochastic policies.

Disadvantages:

  • Typically converge to a local rather than global optimum.
  • Evaluating a policy is typically inefficient and high variance.

Policy Objective Functions :

  • Policy based reinforcement learning is an optimisation problem.
  • Find θ that maximises J(θ).

policy-objective-func [David Silver Lecture Notes]

Policy-Gradient :

policy-gradient

score-function

policy-gradient-theorem [David Silver Lecture Notes]

Monte-Carlo Policy Gradient (REINFORCE) :

reinforce [David Silver Lecture Notes]

Actor-Critic :

  • Actor-Critic method is a policy-based method (Learnt Value Function, Learnt Policy).

actor-critique-intro [David Silver Lecture Notes]

  • The critic is solving a familiar problem: policy evaluation.

Action-Value Actor-Critic :

action-value-actor-critic

advantage_func [David Silver Lecture Notes]

  • The advantage function can significantly reduce variance of policy gradient.
  • So the critic should really estimate the advantage function.

Actor-critic algorithm: A3C :

a3c

Different Policy Gradients :

policy-gradient-summary [David Silver Lecture Notes]

Model-Based RL :

model-based-rl

model-based-rl2 [David Silver Lecture Notes]

  • Favourite planning algorithms: Value iteration,Policy iteration,Tree search,etc..
  • Sample-based Modeling: A simple but powerful approach to planning. Use the model only to generate samples. Sample experience from model.
  • Apply model-free RL to samples, e.g.: Monte-Carlo control, SARSA, Q-Learning.
  • Model-based RL is only as good as the estimated model.
  • When the model is inaccurate, planning process will compute a suboptimal policy: 1.when model is wrong, use model-free RL; 2.reason explicitly about model uncertainty.

Real and Simulated Experience :

real-simulated-exp

dyna-arch [David Silver Lecture Notes]

Dyna-Q Algorithm :

dynaq [David Silver Lecture Notes]

Sim-Based Search :

sim-based-search [David Silver Lecture Notes]

MC-Tree-Search :

  • AlphaGo- Supervised learning + policy gradients + value functions + Monte Carlo tree search D. Silver, A. Huang, C. J.Maddison, A. Guez, L. Sifre, et al. “Mastering the game of Go with deep neural networks and tree search”. Nature (2016).
  • Highly selective best-first search.
  • Evaluates states dynamically (unlike e.g. DP).
  • Uses sampling to break curse of dimensionality.
  • Computationally efficient, anytime, parallelisable.
  • Works for “black-box” models (only requires samples).

mc-tree-search [David Silver Lecture Notes]

Temporal-Difference Search :

  • Simulation-based search.
  • Using TD instead of MC (bootstrapping).
  • MC tree search applies MC control to sub-MDP from now.
  • TD search applies Sarsa to sub-MDP from now.
  • For simulation-based search, bootstrapping is also helpful.
  • TD search is usually more efficient than MC search.
  • TD(λ) search can be much more efficient than MC search.

td-search [David Silver Lecture Notes]

RL in Games :

rl-in-games [David Silver Lecture Notes]

Deep Q Learning (Deep Q-Networks: DQN) :

  • Gradient descent is simple and appealing. But it is not sample efficient.
  • Batch methods seek to find the best fitting value function.
  • Given the agent’s experience (“training data”)

Experience Replay :

dqn-experience-replay

DQN in Atari :

-V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, et al. “Playing Atari with Deep Reinforcement Learning”. (2013)

dqn-in-atari

Imitation Learning :

imitation-learning1

imitation-learning2

Dagger: Dataset Aggregation :

Paper: https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats11-NoRegret.pdf

dagger

PLATO: Policy Learning with Adaptive Trajectory Optimization :

plato

One-Shot Imitation Learning :

Meta-Learning :

  • Meta-learning = learning to learn
  • Supervised meta-learning = supervised learning with datapoints that are entire datasets
  • If we can meta-learn a faster reinforcement learner, we can learn new tasks efficiently!
  • What can a meta-learned learner do differently? 1.Explore more intelligently, 2.Avoid trying actions that are know to be useless, 3.Acquire the right features more quickly.
  • The promise of meta-learning: use past experience to simply acquire a much more efficient deep RL algorithm

meta-learning

POMDPs (Partial Observable MDP) :

pomdps [David Silver Lecture Notes]

Resources :

Important RL Papers :

References :

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].