
pat-coady / Trpo

License: MIT
Trust Region Policy Optimization with TensorFlow and OpenAI Gym

Projects that are alternatives to or similar to Trpo

Lagom
lagom: A PyTorch infrastructure for rapid prototyping of reinforcement learning algorithms.
Stars: ✭ 364 (+6.12%)
Mutual labels:  jupyter-notebook, reinforcement-learning, policy-gradient, mujoco
Pytorch Rl
This repository contains model-free deep reinforcement learning algorithms implemented in Pytorch
Stars: ✭ 394 (+14.87%)
Mutual labels:  reinforcement-learning, policy-gradient, mujoco
Drq
DrQ: Data regularized Q
Stars: ✭ 268 (-21.87%)
Mutual labels:  jupyter-notebook, reinforcement-learning, mujoco
Text summurization abstractive methods
Multiple implementations for abstractive text summarization, using Google Colab
Stars: ✭ 359 (+4.66%)
Mutual labels:  jupyter-notebook, reinforcement-learning, policy-gradient
Reinforcement learning tutorial with demo
Reinforcement Learning Tutorial with Demo: DP (Policy and Value Iteration), Monte Carlo, TD Learning (SARSA, QLearning), Function Approximation, Policy Gradient, DQN, Imitation, Meta Learning, Papers, Courses, etc..
Stars: ✭ 442 (+28.86%)
Mutual labels:  jupyter-notebook, reinforcement-learning, policy-gradient
Deep Algotrading
A resource for learning about deep learning techniques from regression to LSTM and Reinforcement Learning using financial data and the fitness functions of algorithmic trading
Stars: ✭ 173 (-49.56%)
Mutual labels:  jupyter-notebook, reinforcement-learning, policy-gradient
Hands On Reinforcement Learning With Python
Master Reinforcement and Deep Reinforcement Learning using OpenAI Gym and TensorFlow
Stars: ✭ 640 (+86.59%)
Mutual labels:  jupyter-notebook, reinforcement-learning, policy-gradient
Pytorch Rl
Tutorials for reinforcement learning in PyTorch and Gym by implementing a few of the popular algorithms. [IN PROGRESS]
Stars: ✭ 121 (-64.72%)
Mutual labels:  jupyter-notebook, reinforcement-learning, policy-gradient
Rl Course Experiments
Stars: ✭ 73 (-78.72%)
Mutual labels:  jupyter-notebook, reinforcement-learning, policy-gradient
Pytorch sac
PyTorch implementation of Soft Actor-Critic (SAC)
Stars: ✭ 174 (-49.27%)
Mutual labels:  jupyter-notebook, reinforcement-learning, mujoco
Multihopkg
Multi-hop knowledge graph reasoning learned via policy gradient with reward shaping and action dropout
Stars: ✭ 202 (-41.11%)
Mutual labels:  jupyter-notebook, reinforcement-learning, policy-gradient
Rl learn
My reinforcement learning notes and study materials 📖 still updating ... ...
Stars: ✭ 234 (-31.78%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Aleph star
Reinforcement learning with A* and a deep heuristic
Stars: ✭ 235 (-31.49%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Deep-Reinforcement-Learning-CS285-Pytorch
Solutions of assignments of Deep Reinforcement Learning course presented by the University of California, Berkeley (CS285) in Pytorch framework
Stars: ✭ 104 (-69.68%)
Mutual labels:  policy-gradient, mujoco
Nn
🧑‍🏫 50! Implementations/tutorials of deep learning papers with side-by-side notes 📝; including transformers (original, xl, switch, feedback, vit, ...), optimizers (adam, adabelief, ...), gans(cyclegan, stylegan2, ...), 🎮 reinforcement learning (ppo, dqn), capsnet, distillation, ... 🧠
Stars: ✭ 5,720 (+1567.64%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Rad
RAD: Reinforcement Learning with Augmented Data
Stars: ✭ 268 (-21.87%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Popular Rl Algorithms
PyTorch implementation of Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Actor-Critic (AC/A2C), Proximal Policy Optimization (PPO), QT-Opt, PointNet..
Stars: ✭ 266 (-22.45%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Trading Bot
Stock Trading Bot using Deep Q-Learning
Stars: ✭ 273 (-20.41%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Dinoruntutorial
Accompanying code for Paperspace tutorial "Build an AI to play Dino Run"
Stars: ✭ 285 (-16.91%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Applied Reinforcement Learning
Reinforcement Learning and Decision Making tutorials explained at an intuitive level and with Jupyter Notebooks
Stars: ✭ 229 (-33.24%)
Mutual labels:  jupyter-notebook, reinforcement-learning

Trust Region Policy Optimization with Generalized Advantage Estimation

By Patrick Coady: Learning Artificial Intelligence

Summary

NOTE: The code has been refactored to use TensorFlow 2.0 and PyBullet (instead of MuJoCo). See the tf1_mujoco branch for the old version.

The project's original goal was to "solve" 10 MuJoCo robotic control environments with the same algorithm, and, specifically, to do so without hand-tuning the hyperparameters (network sizes, learning rates, and TRPO settings) for each environment. This is challenging because the environments range from a simple cart-pole problem with a single control input to a humanoid with 17 controlled joints and 44 observed variables. The project was successful, nabbing top spots on almost all of the OpenAI Gym MuJoCo leaderboards.

With the release of TensorFlow 2.0, I decided to dust off this project and upgrade the code. And, while I was at it, I moved from the paid MuJoCo simulator to the free PyBullet simulator.

Here are the key points:

  • Trust Region Policy Optimization [1] [2]
  • Value function approximated with a 3 hidden-layer NN (tanh activations; sizing sketched after this list):
    • hid1 size = obs_dim x 10
    • hid2 size = geometric mean of hid1 and hid3 sizes
    • hid3 size = 5
  • Policy is a multivariate Gaussian parameterized by a 3 hidden-layer NN (tanh activations):
    • hid1 size = obs_dim x 10
    • hid2 size = geometric mean of hid1 and hid3 sizes
    • hid3 size = action_dim x 10
    • Diagonal covariance matrix variables are separately trained
  • Generalized Advantage Estimation (gamma = 0.995, lambda = 0.98) [3] [4]; see the sketch after this list
  • Adam optimizer used for both neural networks
  • The policy is evaluated for 20 episodes between updates, except:
    • 50 episodes for Reacher
    • 5 episodes for Swimmer
    • 5 episodes for HalfCheetah
    • 5 episodes for HumanoidStandup
  • Value function is trained on current batch + previous batch
  • KL loss factor and Adam learning rate are dynamically adjusted during training (one plausible adjustment rule is sketched after this list)
  • Policy and Value NNs built with TensorFlow
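
The layer-sizing rules above are compact enough to sketch directly. A minimal illustration in TensorFlow 2, with hypothetical helper and variable names rather than this repository's actual code:

import numpy as np
import tensorflow as tf

def build_mlp(obs_dim, out_dim, hid3_size):
    # Three tanh hidden layers, sized per the rules above.
    hid1_size = obs_dim * 10                         # hid1 = obs_dim x 10
    hid2_size = int(np.sqrt(hid1_size * hid3_size))  # geometric mean of hid1 and hid3
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hid1_size, activation='tanh', input_shape=(obs_dim,)),
        tf.keras.layers.Dense(hid2_size, activation='tanh'),
        tf.keras.layers.Dense(hid3_size, activation='tanh'),
        tf.keras.layers.Dense(out_dim),  # linear output head
    ])

obs_dim, act_dim = 44, 17  # e.g. the humanoid task
value_net = build_mlp(obs_dim, 1, hid3_size=5)
policy_mean_net = build_mlp(obs_dim, act_dim, hid3_size=act_dim * 10)
log_vars = tf.Variable(tf.zeros(act_dim))  # diagonal covariance, trained separately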

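Generalized Advantage Estimation itself is only a few lines. Below is the standard GAE recursion from [3] with the settings listed above; it is written against the published equations, not copied from this repository:

import numpy as np

def gae_advantages(rewards, values, gamma=0.995, lam=0.98):
    # rewards: per-step rewards for one episode, length T
    # values:  value estimates, length T + 1 (bootstrap value last; 0 if terminal)
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        last_adv = delta + gamma * lam * last_adv  # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = last_adv
    return advantages

The dynamic KL and learning-rate adjustment follows the familiar adaptive-KL heuristic. One plausible rule is sketched below; the thresholds and factors are assumptions, not necessarily the values used here:

def adapt_kl(kl, kl_targ, beta, lr_mult):
    # Policy moved too far: increase the KL penalty; shrink the LR if beta saturates.
    if kl > kl_targ * 2:
        beta = min(35.0, beta * 1.5)
        if beta > 30.0:
            lr_mult /= 1.5
    # Policy barely moved: relax the penalty; grow the LR if beta bottoms out.
    elif kl < kl_targ / 2:
        beta = max(1.0 / 35.0, beta / 1.5)
        if beta < 1.0 / 30.0:
            lr_mult *= 1.5
    return beta, lr_mult
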
PyBullet Gym Environments

HumanoidDeepMimicBulletEnv-v1
CartPoleBulletEnv-v1
MinitaurBulletEnv-v0
MinitaurBulletDuckEnv-v0
RacecarBulletEnv-v0
RacecarZedBulletEnv-v0
KukaBulletEnv-v0
KukaCamBulletEnv-v0
InvertedPendulumBulletEnv-v0
InvertedDoublePendulumBulletEnv-v0
InvertedPendulumSwingupBulletEnv-v0
ReacherBulletEnv-v0
PusherBulletEnv-v0
ThrowerBulletEnv-v0
StrikerBulletEnv-v0
Walker2DBulletEnv-v0
HalfCheetahBulletEnv-v0
AntBulletEnv-v0
HopperBulletEnv-v0
HumanoidBulletEnv-v0
HumanoidFlagrunBulletEnv-v0
HumanoidFlagrunHarderBulletEnv-v0
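
These environments register themselves with Gym when the pybullet_envs package is imported. A minimal way to instantiate and exercise one, using the classic Gym step API (standard PyBullet/Gym usage, independent of this repository's code):

import gym
import pybullet_envs  # importing registers the *BulletEnv environments with Gym

env = gym.make('HalfCheetahBulletEnv-v0')
obs = env.reset()
done, episode_reward = False, 0.0
while not done:
    action = env.action_space.sample()  # random actions, just to exercise the env
    obs, reward, done, info = env.step(action)
    episode_reward += reward
print('episode reward:', episode_reward)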

Using

I ran quick checks on three of the above environments and successfully stabilized a double-inverted pendulum and taught the "half cheetah" to run.

python train.py InvertedPendulumBulletEnv-v0
python train.py InvertedDoublePendulumBulletEnv-v0 -n 5000
python train.py HalfCheetahBulletEnv-v0 -n 5000 -b 5

Here -n sets the number of training episodes and -b the number of episodes per training batch.

Videos

During training, videos are automatically saved to the /tmp folder at regular intervals. They can be enjoyable to watch, and also instructive.
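
The recording is typically wired up with Gym's monitor wrapper. A sketch using the classic gym.wrappers.Monitor API; the wrapper choice, recording interval, and output path here are assumptions for illustration:

import gym
import pybullet_envs  # registers the Bullet environments

env = gym.make('HalfCheetahBulletEnv-v0')
env = gym.wrappers.Monitor(env, '/tmp/halfcheetah-videos',
                           video_callable=lambda ep: ep % 100 == 0,  # every 100th episode
                           force=True)  # overwrite any previous recordings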

Dependencies

  • Python 3
  • TensorFlow 2.x
  • OpenAI Gym
  • PyBullet
  • NumPy

References

  1. Trust Region Policy Optimization (Schulman et al., 2015)
  2. Emergence of Locomotion Behaviours in Rich Environments (Heess et al., 2017)
  3. High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., 2016)
  4. GitHub Repository with several helpful implementation ideas (Schulman)