CherryPieSexy / imitation_learning

Licence: other
PyTorch implementation of some reinforcement learning algorithms: A2C, PPO, Behavioral Cloning from Observation (BCO), GAIL.

PyTorch Reinforcement and Imitation Learning

This repository contains a parallel PyTorch implementation of several Reinforcement and Imitation Learning algorithms: A2C, PPO, BCO, GAIL, and V-trace. Short descriptions:

  • Advantage Actor-Critic (A2C) - a synchronous variant of A3C
  • Proximal Policy Optimization (PPO) - one of the most popular RL algorithms; see the PPO, Truly PPO, Implementation Matters, and A Large-Scale Empirical Study of PPO papers
  • Behavioral Cloning from Observation (BCO) - a technique for cloning expert behavior into an agent using only expert states (see the BCO paper; works poorly in my experiments and is not supported anymore)
  • Generative Adversarial Imitation Learning (GAIL) - an algorithm that mimics an expert policy by using a discriminator as the reward model (see the GAIL paper); a minimal sketch of this idea follows the list
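
As an illustration of the GAIL idea (a minimal sketch only, not this repository's actual discriminator code; the module and function names below are hypothetical), a discriminator is trained to separate expert transitions from agent transitions, and its logits are turned into a reward for the policy optimizer:

# Minimal GAIL-style discriminator sketch (illustrative only).
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, obs, act):
        # returns logits: positive values mean "looks like expert data"
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def gail_reward(discriminator, obs, act):
    # common surrogate reward -log(1 - D(s, a)), computed stably from logits
    with torch.no_grad():
        logits = discriminator(obs, act)
        return -torch.nn.functional.logsigmoid(-logits)

def discriminator_loss(discriminator, expert_obs, expert_act, agent_obs, agent_act):
    # binary cross-entropy: expert transitions labelled 1, agent transitions labelled 0
    bce = nn.BCEWithLogitsLoss()
    expert_logits = discriminator(expert_obs, expert_act)
    agent_logits = discriminator(agent_obs, agent_act)
    return (bce(expert_logits, torch.ones_like(expert_logits))
            + bce(agent_logits, torch.zeros_like(agent_logits)))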

Each algorithm supports vector/image/dict observation spaces and discrete/continuous/tuple action spaces. Data gathering and training are controlled by separate processes; the parallelism scheme is described in this file. The code is written with a focus on on-policy algorithms; recurrent policies are also supported.
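
The sketch below shows one generic way to split data gathering from training with worker processes and a queue. It is only an illustration of the scheme's shape (the repository's own parallelism is described in its readme), assumes the classic gym step API (4-tuple returns, gym < 0.26), and samples random actions where a real worker would query the current policy:

# Illustrative rollout-worker / trainer split with multiprocessing (not the repo's code).
import multiprocessing as mp
import gym

def rollout_worker(env_id, rollout_len, queue):
    env = gym.make(env_id)
    obs = env.reset()
    while True:
        rollout = []
        for _ in range(rollout_len):
            action = env.action_space.sample()  # a real worker would query the policy here
            next_obs, reward, done, _ = env.step(action)
            rollout.append((obs, action, reward, done))
            obs = env.reset() if done else next_obs
        queue.put(rollout)  # hand the finished rollout to the training process

if __name__ == "__main__":
    queue = mp.Queue(maxsize=4)
    workers = [
        mp.Process(target=rollout_worker, args=("CartPole-v1", 128, queue), daemon=True)
        for _ in range(2)
    ]
    for w in workers:
        w.start()
    for _ in range(10):  # the training process consumes rollouts and updates the model
        batch = queue.get()
        print(f"received rollout of length {len(batch)}")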

Current Functionality

Each algorithm supports discrete (Categorical, Bernoulli, GumbelSoftmax) and continuous (Beta, Normal, tanh(Normal)) policy distributions, plus an additional 'Tuple' distribution that can be used to mix the distributions above. For continuous action spaces, the Beta distribution works best in my experiments (tested on the BipedalWalker and Humanoid environments).
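
As a hedged illustration of a bounded-action policy head of this kind (not the repository's distribution classes; the layer sizes and names are made up), a Beta head outputs two concentrations greater than 1, and samples in (0, 1) can then be rescaled to the environment's action bounds:

# Illustrative Beta-distribution policy head for bounded continuous actions.
import torch
import torch.nn as nn
from torch.distributions import Beta

class BetaPolicyHead(nn.Module):
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.alpha_layer = nn.Linear(hidden_dim, action_dim)
        self.beta_layer = nn.Linear(hidden_dim, action_dim)

    def forward(self, hidden):
        # softplus(x) + 1 keeps both concentrations above 1, so the density is unimodal
        alpha = torch.nn.functional.softplus(self.alpha_layer(hidden)) + 1.0
        beta = torch.nn.functional.softplus(self.beta_layer(hidden)) + 1.0
        return Beta(alpha, beta)

def scale_action(sample_01, low, high):
    # Beta samples live in (0, 1); rescale them to the environment's action range
    return low + (high - low) * sample_01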

Environments with vector, image, or dict observation spaces are supported, as are recurrent policies.

Several return-estimation algorithms are supported: 1-step, n-step, GAE, and V-Trace (introduced in the IMPALA paper).
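
For reference, here is a minimal GAE sketch (the n-step and V-Trace estimators follow the same batched pattern; this is an illustration rather than the repository's exact code, and the tensor shapes are assumptions):

# Illustrative Generalized Advantage Estimation over a single rollout.
import torch

def gae(rewards, values, bootstrap_value, dones, gamma=0.99, lam=0.95):
    # rewards, values, dones: float tensors of shape [T];
    # bootstrap_value: value estimate of the state right after the rollout.
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_value, running_gae = bootstrap_value, 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running_gae = delta + gamma * lam * not_done * running_gae
        advantages[t] = running_gae
        next_value = values[t]
    returns = advantages + values  # targets for the value-function loss
    return advantages, returns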

As found in the Implementation Matters paper, PPO works largely because of "code-level" optimizations. Most of them are implemented here (one of them, running observation normalization, is sketched after the list):

  • Value function clipping (works better without it)
  • Observation normalization & clipping
  • Reward normalization/scaling & clipping
  • Orthogonal initialization of neural network weights
  • Gradient clipping
  • Learning rate annealing (will be added... sometime)
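
As an example of one of these optimizations, here is a running mean/variance observation normalizer with clipping, in the style of common PPO implementations (a sketch only; the repository's own normalizer may differ in details):

# Illustrative running observation normalizer (mean/std tracking + clipping).
import numpy as np

class RunningNormalizer:
    def __init__(self, shape, clip=10.0, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps
        self.clip = clip

    def update(self, batch):
        # batch: array of shape [N, *shape]; merge batch statistics into the running ones
        batch_mean, batch_var, batch_count = batch.mean(0), batch.var(0), batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        # parallel-variance formula for combining two groups of samples
        self.var = (m_a + m_b + delta ** 2 * self.count * batch_count / total) / total
        self.count = total

    def normalize(self, obs):
        return np.clip((obs - self.mean) / np.sqrt(self.var + 1e-8), -self.clip, self.clip)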

In addition, I implemented the rollback loss from the Truly PPO paper, which works well.
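
For illustration, a sketch of a PPO surrogate with such a rollback term (my reading of the Truly PPO idea, not necessarily this repository's exact loss): outside the clipping range the objective gets a negative slope instead of a flat region, so the gradient actively pushes the probability ratio back:

# Illustrative PPO policy loss with rollback beyond the clipping range.
import torch

def ppo_rollback_loss(log_prob, old_log_prob, advantage, clip_eps=0.2, alpha=0.05):
    ratio = torch.exp(log_prob - old_log_prob)
    surrogate = ratio * advantage
    # rollback branches: linear with negative slope, continuous at ratio = 1 +/- clip_eps
    rollback_high = (-alpha * ratio + (1.0 + alpha) * (1.0 + clip_eps)) * advantage
    rollback_low = (-alpha * ratio + (1.0 + alpha) * (1.0 - clip_eps)) * advantage
    objective = torch.where(
        (advantage >= 0) & (ratio > 1.0 + clip_eps), rollback_high,
        torch.where((advantage < 0) & (ratio < 1.0 - clip_eps), rollback_low, surrogate),
    )
    return -objective.mean()  # maximize the surrogate = minimize its negative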

How to use

Clone the repo and install the Python module:

git clone https://github.com/CherryPieSexy/imitation_learning.git
cd imitation_learning/
pip install -e .

Training example

Each experiment is described in a config; see the annotated config configs/cart_pole/cart_pole_ppo_annotated.py for an example. To run an experiment, execute the command:

python configs/cart_pole/cart_pole_ppo_annotated.py

Training results (including the training config, TensorBoard logs, and model checkpoints) will be saved in the log_dir folder.

Obtained policy:

[GIF: trained CartPole policy]

Testing example

The results of a trained policy can be shown with the cherry_rl/test.py script. To run it from any folder, execute:

python -m cherry_rl.test -f ${PATH_TO_LOG_DIR} -p ${CHECKPOINT_ID}

This script can:

  • show how the policy acts in the environment
  • measure the mean reward and episode length over a requested number of episodes
  • record a demo file with trajectories

Execute python -m cherry_rl.test -h to see a detailed description of available arguments.
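
For a sense of what such a script does internally, here is a generic evaluation loop (hypothetical policy callable, classic gym step API; this is not cherry_rl/test.py itself):

# Generic policy-evaluation sketch: mean reward and episode length over n episodes.
import gym
import numpy as np

def evaluate(policy, env_id="CartPole-v1", n_episodes=10, render=False):
    env = gym.make(env_id)
    rewards, lengths = [], []
    for _ in range(n_episodes):
        obs, done, episode_reward, episode_len = env.reset(), False, 0.0, 0
        while not done:
            if render:
                env.render()
            action = policy(obs)  # hypothetical: maps a single observation to an action
            obs, reward, done, _ = env.step(action)
            episode_reward += reward
            episode_len += 1
        rewards.append(episode_reward)
        lengths.append(episode_len)
    return np.mean(rewards), np.mean(lengths)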

Code structure

.
├── cherry_rl                        # folder with code
    ├── algorithms                      # algorithmic part of code
        ├── nn                          # folder with neural networks definitions.
            ├── agent_model.py          # special module for the agent.
            └── ...                     # various nn models: actor-critics, convolutional & recurrent encoders.
        ├── optimizers                  # folder with RL optimizers. Each optimizer shares
            ├── model_optimizer.py      # base optimizer for all models.
            ├── actor_critic_optimizer.py
            └── ...                     # core algorithms: a2c.py, ppo.py, bco.py
        ├── parallel
            ├── readme.md               # description of used parallelism scheme
            └── ...                     # modules responsible for parallel rollout gathering and training.
        ├── returns_estimator.py        # special module for estimating returns. Supported estimators: 1-step, n-step, GAE, V-Trace.
        └── ...                         # all other algorithmic modules that do not fit in any other folder. 
    ├── utils
        ├── vec_env.py                  # vector env (copy of OpenAI code, but w/o automatic resetting)
        └── ...                         # environment wrappers and other utils.
    └── test.py                         # script for watching trained agent and recording demo.
├── configs                             # subfolder name = environment, script name = algo
    ├── cart_pole
        ├── cart_pole_demo_10_ep.pickle  # demo file for training BCO or GAIL
        ├── cart_pole_a2c.py
        ├── cart_pole_ppo.py
        ├── cart_pole_ppo_gru.py        # recurrent policy
        ├── cart_pole_ppo_annotated.py  # ppo training script with comments
        ├── cart_pole_bco.py
        └── cart_pole_gail.py
    ├── bipedal                         # folder with similar scripts as cart_pole
    ├── humanoid
    └── car_racing

Modular neural network definition

Each agent has optional make_obs_encoder and obs_normalizer_size arguments. The observation encoder is a neural network (i.e. an nn.Module) that is applied directly to the observation, typically an image. The observation normalizer is a running mean-variance estimator that standardizes observations; it is applied before the encoder. Most of the time, an actor-critic trains better on such zero-mean, unit-variance observations or embeddings.

To train your own neural network architecture, import or define it in the config, initialize it in the make_ac_model function, and pass it as the make_actor_critic argument to AgentModel.
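
A hedged sketch of what such a config-defined architecture can look like (the exact make_ac_model / AgentModel signatures may differ; see the annotated config for the real interface, and note that the layer sizes below are arbitrary):

# Hypothetical custom encoder + actor-critic defined as plain nn.Modules.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, in_channels=3, emb_size=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(emb_size)  # infers the flattened size on first call

    def forward(self, obs):
        return torch.relu(self.fc(self.conv(obs)))

class ActorCritic(nn.Module):
    def __init__(self, emb_size=256, action_dim=3):
        super().__init__()
        self.policy = nn.Linear(emb_size, action_dim)  # e.g. distribution parameters
        self.value = nn.Linear(emb_size, 1)

    def forward(self, embedding):
        return self.policy(embedding), self.value(embedding)

def make_obs_encoder():
    return ConvEncoder()

def make_actor_critic():
    return ActorCritic()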

Trained environments

GIFs of some of the results:

BipedalWalker-v3: mean reward ~333, 0 fails over 1000 episodes, config.

[GIF: BipedalWalker-v3 policy]

Humanoid-v3: mean reward ~11.3k, 14 fails over 1000 episodes, config.

[GIF: Humanoid-v3 policy]

The Humanoid experiments were done in MuJoCo v2, which has an integration bug that makes the environment easier. For academic purposes, it is correct to use MuJoCo v1.5.

CarRacing-v0: mean reward = 894 ± 32, 26 fails over 100 episodes (an episode is considered failed if its reward is < 900), config.

[GIF: CarRacing-v0 policy]

Further plans

  • Try the Motion Imitation algorithm from the DeepMimic paper
  • Add a self-play trainer with PPO as the backbone algorithm
  • ...