
adik993 / ppo-pytorch

Licence: other
Proximal Policy Optimization (PPO) with Intrinsic Curiosity Module (ICM)

Programming Languages

python

Projects that are alternatives of or similar to ppo-pytorch

Reinforcement Learning With Tensorflow
Simple reinforcement learning tutorials with Chinese-language AI lessons by 莫烦Python (Morvan)
Stars: ✭ 6,948 (+8271.08%)
Mutual labels:  proximal-policy-optimization, ppo
reinforcement learning ppo rnd
Deep Reinforcement Learning by using Proximal Policy Optimization and Random Network Distillation in Tensorflow 2 and Pytorch with some explanation
Stars: ✭ 33 (-60.24%)
Mutual labels:  proximal-policy-optimization, ppo
imitation learning
PyTorch implementation of some reinforcement learning algorithms: A2C, PPO, Behavioral Cloning from Observation (BCO), GAIL.
Stars: ✭ 93 (+12.05%)
Mutual labels:  proximal-policy-optimization, ppo
Relational Deep Reinforcement Learning
No description or website provided.
Stars: ✭ 44 (-46.99%)
Mutual labels:  proximal-policy-optimization, ppo
Pytorch A2c Ppo Acktr Gail
PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
Stars: ✭ 2,632 (+3071.08%)
Mutual labels:  proximal-policy-optimization, ppo
Deep-Reinforcement-Learning-for-Automated-Stock-Trading-Ensemble-Strategy-ICAIF-2020
Live Trading. Please star.
Stars: ✭ 1,251 (+1407.23%)
Mutual labels:  ppo
model-free-algorithms
TD3, SAC, IQN, Rainbow, PPO, Ape-X and etc. in TF1.x
Stars: ✭ 56 (-32.53%)
Mutual labels:  ppo
rl trading
No description or website provided.
Stars: ✭ 14 (-83.13%)
Mutual labels:  ppo
Street-fighter-A3C-ICM-pytorch
Curiosity-driven Exploration by Self-supervised Prediction for Street Fighter III Third Strike
Stars: ✭ 149 (+79.52%)
Mutual labels:  icm
Deep RL with pytorch
A pytorch tutorial for DRL(Deep Reinforcement Learning)
Stars: ✭ 160 (+92.77%)
Mutual labels:  ppo
TF2-RL
Reinforcement learning algorithms implemented for Tensorflow 2.0+ [DQN, DDPG, AE-DDPG, SAC, PPO, Primal-Dual DDPG]
Stars: ✭ 160 (+92.77%)
Mutual labels:  ppo
Reinforcement Learning
Deep Reinforcement Learning Algorithms implemented with Tensorflow 2.3
Stars: ✭ 61 (-26.51%)
Mutual labels:  ppo
LWDRLC
Lightweight deep RL Libraray for continuous control.
Stars: ✭ 14 (-83.13%)
Mutual labels:  ppo
ElegantRL
Scalable and Elastic Deep Reinforcement Learning Using PyTorch. Please star. 🔥
Stars: ✭ 2,074 (+2398.8%)
Mutual labels:  ppo
Material-of-MCM-ICM
LaTeX template, Outstanding papers of last years and some useful books about MATLAB
Stars: ✭ 57 (-31.33%)
Mutual labels:  icm
xingtian
xingtian is a componentized library for the development and verification of reinforcement learning algorithms
Stars: ✭ 229 (+175.9%)
Mutual labels:  ppo
ReinforcementLearningZoo.jl
juliareinforcementlearning.org/
Stars: ✭ 46 (-44.58%)
Mutual labels:  ppo
Explorer
Explorer is a PyTorch reinforcement learning framework for exploring new ideas.
Stars: ✭ 54 (-34.94%)
Mutual labels:  ppo
td-reg
TD-Regularized Actor-Critic Methods
Stars: ✭ 28 (-66.27%)
Mutual labels:  ppo
Deep-Reinforcement-Learning-Notebooks
This repository contains a series of Google Colab notebooks I created to help people dive into deep reinforcement learning. The notebooks contain both theory and implementations of different algorithms.
Stars: ✭ 15 (-81.93%)
Mutual labels:  ppo

Proximal Policy Optimization (PPO) in PyTorch

This repository contains an implementation of the reinforcement learning algorithm Proximal Policy Optimization (PPO). It also implements the Intrinsic Curiosity Module (ICM).

Demos: CartPole-v1 (PPO) · MountainCar-v0 (PPO + ICM) · Pendulum-v0 (PPO + ICM)

What is PPO

PPO is an on-policy policy gradient algorithm built with stability in mind. It optimizes a clipped surrogate objective to make sure the new policy stays close to the previous one.
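
The clipped objective can be sketched in a few lines of PyTorch. The names below (log_probs_new, log_probs_old, advantages, clip_eps) are illustrative assumptions rather than the exact tensors used in this repository:

    import torch

    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
        # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed in log space.
        ratio = torch.exp(log_probs_new - log_probs_old)
        # Unclipped and clipped surrogate objectives.
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # PPO maximizes the minimum of the two; the optimizer minimizes, hence the negation.
        return -torch.min(surr1, surr2).mean()

Clipping the ratio to [1 - clip_eps, 1 + clip_eps] removes the incentive to push the new policy far away from the policy that collected the data.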

Since it is an on-policy algorithm, it uses the gathered experience to update the policy and then discards it (there is no replay buffer). Because of that it does well in environments with dense rewards, like CartPole-v1 where you get a reward immediately, but it struggles to learn a policy in environments with sparse rewards, like MountainCar-v0 where a positive reward is given only when the car reaches the top, which is a rare event. Off-policy algorithms like DQN solve sparse-reward problems much more easily because they can store these rare events in a replay buffer and reuse them multiple times during training.

To make learning in sparse-reward problems easier, we introduce the concept of curiosity.

What is curiosity

Curiosity is the concept of giving the agent an additional reward, called the intrinsic reward, on top of the reward from the environment itself, called the extrinsic reward. There are many ways to define curiosity; this project uses the Intrinsic Curiosity Module (ICM). Its authors define curiosity as a measure of the surprise an encountered state brings to the agent. They achieve that by encoding states into a latent vector and then training two models: a forward model that, given the encoded state and the action, predicts the encoded next state, and an inverse model that, given the encoded state and the encoded next state, tries to predict the action that must have been taken to move from one state to the other. The intrinsic reward is the distance between the actual encoded next state and the forward model's prediction of it. One may wonder what the inverse model is for, since it is not used to calculate the reward. The authors explain this with the example of an agent exploring an environment and seeing a tree with leaves moving in the wind. The leaves are outside the agent's control, yet it would stay curious about them. The inverse model is introduced to make sure the agent is only curious about the parts of the state it can actually influence.
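
The sketch below illustrates the idea with a minimal ICM in PyTorch. The encoder, the forward and inverse models, the scaling factor eta, and the use of MSE for the inverse loss (cross-entropy would be usual for discrete actions) are simplified assumptions, not the exact modules of this repository:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ICM(nn.Module):
        def __init__(self, state_dim, action_dim, latent_dim=64, eta=0.01):
            super().__init__()
            self.eta = eta
            # Encodes raw states into a latent feature vector.
            self.encoder = nn.Sequential(nn.Linear(state_dim, latent_dim), nn.ReLU())
            # Forward model: (encoded state, action) -> predicted encoded next state.
            self.forward_model = nn.Linear(latent_dim + action_dim, latent_dim)
            # Inverse model: (encoded state, encoded next state) -> predicted action.
            self.inverse_model = nn.Linear(2 * latent_dim, action_dim)

        def forward(self, state, next_state, action):
            phi = self.encoder(state)
            phi_next = self.encoder(next_state)
            phi_next_pred = self.forward_model(torch.cat([phi, action], dim=-1))
            action_pred = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
            # Intrinsic reward: distance between predicted and actual next-state features.
            intrinsic_reward = self.eta * 0.5 * (phi_next_pred - phi_next).pow(2).sum(dim=-1)
            # Training losses for the forward and inverse models (MSE assumed here).
            forward_loss = F.mse_loss(phi_next_pred, phi_next.detach())
            inverse_loss = F.mse_loss(action_pred, action)
            return intrinsic_reward, forward_loss, inverse_loss

The intrinsic reward is added to the extrinsic reward before the advantages are computed, so states the forward model predicts poorly become more rewarding to visit.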

How to run

First, make sure to install all the dependencies listed in requirements.txt. Then run one of the following scripts, or use them as a template for running the algorithm on any other environment:

  • CartPole-v1: python run_cartpole.py
  • MountainCar-v0: python run_mountain_car.py
  • Pendulum-v0: python run_pendulum.py

Implementation details

The agent (PPO) explores (Runner) multiple environments at once (MultiEnv) for a specified number of steps. If Curiosity is plugged in, the reward is augmented with the intrinsic reward from the curiosity module. If normalize_state or normalize_reward is enabled, normalization (Normalizer) is performed on the states and rewards respectively. Then the discounted reward (Reward) and the discounted advantage (Advantage) are calculated from the gathered rewards. That data is split into n_mini_batches and used to perform n_optimization_epochs of training with the Adam optimizer using learning_rate. Most of the classes accept a Reporter argument, which can be used to plug in the TensorBoardReporter that publishes data to TensorBoard for live tracking of the learning progress. A sketch of the advantage computation is shown below.
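
For illustration, here is a minimal sketch of generalized advantage estimation (see the references); the function name, argument layout, and the lack of per-environment batching are simplifying assumptions rather than the repository's actual Reward/Advantage code:

    import numpy as np

    def discounted_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
        """Generalized advantage estimation over a rollout of T steps.

        rewards, dones: shape (T,); values: shape (T + 1,), where the extra
        entry is the bootstrap value of the state after the rollout.
        """
        rewards = np.asarray(rewards, dtype=np.float32)
        values = np.asarray(values, dtype=np.float32)
        dones = np.asarray(dones, dtype=np.float32)
        T = len(rewards)
        advantages = np.zeros(T, dtype=np.float32)
        gae = 0.0
        for t in reversed(range(T)):
            non_terminal = 1.0 - dones[t]
            # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = rewards[t] + gamma * values[t + 1] * non_terminal - values[t]
            gae = delta + gamma * lam * non_terminal * gae
            advantages[t] = gae
        # Discounted returns used as value-function targets.
        returns = advantages + values[:-1]
        return advantages, returns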

(TensorBoard screenshot)

Normalize or not

Normalization may help on some complicated continuous problems like Pendulum-v0, but it may hurt performance on simple discrete environments like CartPole-v1.
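
As a rough illustration, a running mean/variance normalizer could look like the sketch below; this is a generic example with assumed names, not the exact Normalizer class used in this repository:

    import numpy as np

    class RunningNormalizer:
        """Keeps a running mean and variance and normalizes inputs with them."""

        def __init__(self, shape, eps=1e-8):
            self.mean = np.zeros(shape, dtype=np.float64)
            self.var = np.ones(shape, dtype=np.float64)
            self.count = eps

        def update(self, batch):
            # Merge batch statistics into the running estimates (parallel variance formula).
            batch_mean = batch.mean(axis=0)
            batch_var = batch.var(axis=0)
            batch_count = batch.shape[0]
            delta = batch_mean - self.mean
            total = self.count + batch_count
            new_mean = self.mean + delta * batch_count / total
            m2 = self.var * self.count + batch_var * batch_count \
                 + delta ** 2 * self.count * batch_count / total
            self.mean, self.var, self.count = new_mean, m2 / total, total

        def normalize(self, x):
            return (x - self.mean) / np.sqrt(self.var + 1e-8)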

TODO

  • Early stopping
  • Model saving
  • CNN
  • LSTM

References

  1. Proximal Policy Optimization Algorithms
  2. Curiosity-driven Exploration by Self-supervised Prediction
  3. High-Dimensional Continuous Control Using Generalized Advantage Estimation