
wassname / World Models Sonic Pytorch

Licence: MIT
Attempt at reinforcement learning with curiosity for Sonic the Hedgehog games. Ranked ~149 on the OpenAI Retro Contest leaderboard, but more work is needed.

Projects that are alternatives of or similar to World Models Sonic Pytorch

Amazon Sagemaker Examples
Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
Stars: ✭ 6,346 (+23403.7%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Deeprl Tutorials
Contains high quality implementations of Deep Reinforcement Learning algorithms written in PyTorch
Stars: ✭ 748 (+2670.37%)
Mutual labels:  jupyter-notebook, reinforcement-learning
David Silver Reinforcement Learning
Notes for the Reinforcement Learning course by David Silver along with implementation of various algorithms.
Stars: ✭ 623 (+2207.41%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Practical rl
A course in reinforcement learning in the wild
Stars: ✭ 4,741 (+17459.26%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Deeplearning Trader
backtrader with DRL ( Deep Reinforcement Learning)
Stars: ✭ 24 (-11.11%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Tensorflow Book
Accompanying source code for Machine Learning with TensorFlow. Refer to the book for step-by-step explanations.
Stars: ✭ 4,448 (+16374.07%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Reinforcement Learning 2nd Edition By Sutton Exercise Solutions
Solutions of Reinforcement Learning, An Introduction
Stars: ✭ 713 (+2540.74%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Reinforcement learning tutorial with demo
Reinforcement Learning Tutorial with Demo: DP (Policy and Value Iteration), Monte Carlo, TD Learning (SARSA, QLearning), Function Approximation, Policy Gradient, DQN, Imitation, Meta Learning, Papers, Courses, etc..
Stars: ✭ 442 (+1537.04%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Basic reinforcement learning
An introductory series to Reinforcement Learning (RL) with comprehensive step-by-step tutorials.
Stars: ✭ 826 (+2959.26%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Coursera
Quiz & Assignment of Coursera
Stars: ✭ 774 (+2766.67%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Rl Book
Source codes for the book "Reinforcement Learning: Theory and Python Implementation"
Stars: ✭ 464 (+1618.52%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Awesome Ai Books
Some awesome AI related books and pdfs for learning and downloading, also apply some playground models for learning
Stars: ✭ 855 (+3066.67%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Courses
Quiz & Assignment of Coursera
Stars: ✭ 454 (+1581.48%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Ml Mipt
Open Machine Learning course at MIPT
Stars: ✭ 480 (+1677.78%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Tensor House
A collection of reference machine learning and optimization models for enterprise operations: marketing, pricing, supply chain
Stars: ✭ 449 (+1562.96%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Hands On Reinforcement Learning With Python
Master Reinforcement and Deep Reinforcement Learning using OpenAI Gym and TensorFlow
Stars: ✭ 640 (+2270.37%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Learning Deep Learning
Paper reading notes on Deep Learning and Machine Learning
Stars: ✭ 388 (+1337.04%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Qlearning trading
Learning to trade under the reinforcement learning framework
Stars: ✭ 431 (+1496.3%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Hands On Meta Learning With Python
Learning to Learn using One-Shot Learning, MAML, Reptile, Meta-SGD and more with Tensorflow
Stars: ✭ 768 (+2744.44%)
Mutual labels:  jupyter-notebook, reinforcement-learning
Rainbow Is All You Need
Rainbow is all you need! A step-by-step tutorial from DQN to Rainbow
Stars: ✭ 938 (+3374.07%)
Mutual labels:  jupyter-notebook, reinforcement-learning

STATUS: This doesn't work well, and neither did similar independent attempts. They all got a score of ~2000, which is equivalent to learning to press right and occasionally jump. Applying this model to Sonic appears to be quite challenging, but it may work on simpler games.

world-models-sonic-pytorch

https://github.com/wassname/world-models-sonic-pytorch

My attempt at implementing an unsupervised dynamics model, with ideas from a few papers.

I use a dynamics model, an inverse model, and a VAE. For the controller I use Proximal Policy Optimization (PPO). I also use a form of curiosity as an auxiliary reward.
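
As a rough sketch of how those pieces fit together (all modules, sizes, and names below are tiny stand-ins, not the repo's actual classes):

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the real components; sizes, names and layers are illustrative only.
Z, A, H = 32, 7, 64                      # latent dim, number of actions, RNN hidden dim

vae_enc   = nn.Linear(128, Z)            # VAE encoder: observation -> latent z
dynamics  = nn.GRUCell(Z + A, H)         # dynamics model: (z, action) -> hidden state
predict_z = nn.Linear(H, Z)              # ... from which the next latent is predicted
inverse   = nn.Linear(2 * Z, A)          # inverse model: (z, z_next) -> action logits
policy    = nn.Linear(Z + H, A)          # PPO controller acts on z and the hidden state

obs, action, h = torch.randn(1, 128), torch.zeros(1, A), torch.zeros(1, H)
z = vae_enc(obs)                                             # encode the current frame
h = dynamics(torch.cat([z, action], dim=-1), h)              # roll the dynamics forward
z_next_pred = predict_z(h)                                   # predicted next latent
z_next = vae_enc(torch.randn(1, 128))                        # latent of the actual next frame
inv_action_logits = inverse(torch.cat([z, z_next], dim=-1))  # inverse model recovers the action
action_logits = policy(torch.cat([z, h], dim=-1))            # controller picks the next action
```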

The end result was a score of 2337/9000 and a place of ~149 on the leaderboard of the OpenAI Retro Contest. This is roughly the score you get from constantly running and jumping to the right, sometimes at the right times. So while it shows some interesting behaviour, it's far from the median score of ~4000 or the top score of ~6000.

Independent attempts to apply world models (and curiosity) to Sonic got similar results. I believe this approach may not be well suited to the game.

If you are trying to use and understand my code, I encourage you to check out dylandjian's write-up, since he tried a world-models approach, explained it well, and won "Best Write-up"!

If anyone finds the reason why this isn't converging, please let me know. I've rewritten it and tried many things, but perhaps Sonic is just too complex for world models.

Curiosity reward

I use a form of curiosity as an auxiliary reward. The theory is that we like to see/listen to things we can learn from, but once we've learnt all we can from them we don't want to see them again. There's more about that theory here. The hope is that this may teach Sonic to move backward and up and down. Without it, the agent tends to get stuck when it needs to backtrack, because it's only rewarded for going right.

One way to frame this in reinforcement learning is to reward the controller for finding novel states and giving them to the world model. We measure how much the world-model loss drops from before training on the rollout to after, and that reduction is our reward (there are probably better ways to frame it).
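
A minimal sketch of that framing, assuming a generic PyTorch world model and a mean-squared-error loss (none of these names are the repo's actual classes):

```python
import torch
import torch.nn as nn

def curiosity_reward(world_model: nn.Module, optimizer, batch_x, batch_y):
    """Reward = how much the world-model loss drops after training on this rollout."""
    loss_fn = nn.MSELoss()

    with torch.no_grad():
        loss_before = loss_fn(world_model(batch_x), batch_y).item()

    optimizer.zero_grad()
    loss = loss_fn(world_model(batch_x), batch_y)
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        loss_after = loss_fn(world_model(batch_x), batch_y).item()

    return loss_before - loss_after  # positive when the rollout taught the model something

# Toy usage with a stand-in dynamics model
model = nn.Linear(8, 8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 8), torch.randn(32, 8)
print(curiosity_reward(model, opt, x, y))
```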

So I tried that... but the agent liked to stand in place and fill a whole rollout with a single frame, until it had learnt all it could from it. Then it would casually stroll a little way and overfit to its new surroundings. This can be fixed with tweaking, such as adding a boredom threshold or decreasing the curiosity reward, but I haven't found anything reliable.
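
One such tweak, sketched here with made-up numbers: treat small loss reductions as "boring" and pay no reward for them, so standing still and replaying an already-mastered frame earns nothing.

```python
def bored_curiosity(loss_before: float, loss_after: float,
                    boredom_threshold: float = 0.01, scale: float = 1.0) -> float:
    """Zero out curiosity below a boredom threshold (threshold and scale are illustrative)."""
    reduction = loss_before - loss_after
    return scale * reduction if reduction > boredom_threshold else 0.0
```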

Sources of code

Setup

Running

I have included the pretrained models in releases

  • Run 04_train_PPO_v4-all-curiosity.ipynb to train from scratch. Run it with verbose=True to watch the performance in a live video.

Hyperparameters

Loss weights

The model uses joint training for the VAE, forward, and inverse models (https://arxiv.org/abs/1803.10122). This introduces a few new hyperparameters, but as of 2018-05-20 there is no published guidance on how to set them. The parameters are lambda_vae and lambda_finv.

We want the weighted loss reductions to stay within an order of magnitude of each other, and we prefer the VAE loss to be optimised preferentially, then the MDN-RNN, then finv. So we set the weights so that the weighted loss_vae remains the largest term.

For example, if the MDN-RNN is optimised in preference to the VAE, the VAE will learn to output blank images, which the MDN-RNN can then predict with perfect accuracy. Likewise, if finv is optimised preferentially, the model will only learn to encode the actions in otherwise blank images. These are unsatisfying local minima.

To set them, run for a few epochs with all the weights at 1 and record the three components of the loss. For example, you might get loss_vae=20,000, loss_mdnrnn=3, loss_finv=3. In this case I would set lambda_vae=1/100 and leave the others at one. Keep an eye on the balance between them and make sure they don't drift too far apart. Eventually my unbalanced losses were around loss_vae=2000, loss_mdnrnn=-2, loss_finv=0.1. This means the loss reduction of each was 1800, 5, and 2.5, and the balanced loss reductions were 18, 5, and 2.9: all values within an order of magnitude of each other, and in an order which follows our preferences.
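
As a sketch, the weighting described above amounts to something like this (lambda_mdnrnn is assumed here as an implicit weight of 1; the numbers are the example values from the text):

```python
# Weighted joint loss, using the example calibration from the text.
lambda_vae, lambda_mdnrnn, lambda_finv = 1 / 100, 1.0, 1.0

def joint_loss(loss_vae, loss_mdnrnn, loss_finv):
    # The weighted VAE term stays the largest, so it is optimised preferentially.
    return lambda_vae * loss_vae + lambda_mdnrnn * loss_mdnrnn + lambda_finv * loss_finv

# Losses recorded after a few epochs with all weights at 1:
print(joint_loss(20_000.0, 3.0, 3.0))  # 206.0
```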

Learning rate

Other hyperparameters sometimes need to be tweaked. A small learning rate may be needed to initially train the VAE, say 1e-5. Then a higher one may be needed to get the MDN-RNN to converge, say 3e-4.

Overall it can take quite a small learning rate to train multiple networks simultaneously without it being too high for any one of them.
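
A sketch of that two-stage schedule in PyTorch (the module is a stand-in; the values are the ones suggested above):

```python
import torch

world_model = torch.nn.Linear(10, 10)   # stand-in for the VAE + MDN-RNN + inverse model

# Stage 1: a small learning rate while the VAE learns to reconstruct at all.
optimizer = torch.optim.Adam(world_model.parameters(), lr=1e-5)
# ... train for a while ...

# Stage 2: raise the learning rate so the MDN-RNN can converge.
for group in optimizer.param_groups:
    group["lr"] = 3e-4
```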

Curiosity weights

I haven't found optimal settings for these, but you can leave them at the default values.

Gradient clipping

Set it too low and all your updates get normalized to look the same, so even when there is a lot to learn your model learns the same amount from surprising and unsurprising batches. Set it too high and outliers will knock your model too far off track. To set this, look at the variable "grad_norm", which is logged to TensorBoard, and choose a value somewhere between the mean and the maximum values you observe. This depends heavily on reward scaling, model architecture, and model convergence. Ideally this hyperparameter would be removed and the value would instead be calibrated from a moving average of gradient norms, but I haven't seen anyone do this yet.
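
A sketch of what that looks like in PyTorch; the max_norm of 5.0 and the logging tag are illustrative, and note that clip_grad_norm_ returns the total norm measured before clipping, which is the quantity to watch.

```python
import torch
from torch.utils.tensorboard import SummaryWriter

model = torch.nn.Linear(10, 1)                     # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
writer = SummaryWriter()

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()

# clip_grad_norm_ returns the total norm *before* clipping - log it, then pick
# max_norm somewhere between its typical mean and maximum.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
writer.add_scalar("grad_norm", grad_norm, global_step=0)

optimizer.step()
```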

Entropy weight

Make sure the PPO entropy_loss doesn't go to ~0. That would mean the entropy loss isn't working: it's meant to penalise the agent for being overly certain and therefore keep it exploring. A value near zero (compared to the policy_loss) means the agent has stopped exploring and has probably got stuck in a local minimum such as always pressing right. If so, increase your entropy_loss_weight by 10x and keep monitoring it. The potential decrease in the entropy_loss should be comparable to the policy_loss.

For Sonic I found this parameter particularly important, since there is a large false minimum where the agent just presses right. So the entropy weight needs to be higher than normal, but not so high that the agent can't vary its actions without incurring a penalty. In the end 0.04 worked, while 1 was too high and 0.01 was too low.
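
As a sketch of where that weight enters the objective (this is a generic PPO-style loss, not this repo's exact code; 0.04 is the value that worked here, the rest are stand-ins):

```python
import torch

def ppo_loss(policy_loss, value_loss, dist: torch.distributions.Categorical,
             entropy_loss_weight: float = 0.04, value_loss_weight: float = 0.5):
    """Generic PPO-style objective: the entropy term penalises over-confident policies."""
    entropy_loss = -dist.entropy().mean()          # more entropy => lower (better) loss
    return policy_loss + value_loss_weight * value_loss + entropy_loss_weight * entropy_loss

# Toy usage
logits = torch.randn(8, 7)                         # e.g. 7 discrete Sonic actions
dist = torch.distributions.Categorical(logits=logits)
print(ppo_loss(torch.tensor(0.1), torch.tensor(0.3), dist))
```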

Misc

  • The controller can't learn much until you freeze the world model or drastically slow its learning (lr << 1e-5); see the sketch after this list (TODO: confirm this on recent code)
  • The PPO minibatch size should be higher than normal (20+) or it may not learn (supported by https://arxiv.org/abs/1804.03720)
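
A sketch of the freezing mentioned in the first bullet (freeze is a hypothetical helper, not a function in this repo; the alternative is to drop the world model's learning rate well below 1e-5):

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Stop gradient updates for a module so only the PPO controller keeps learning."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

# e.g. freeze(world_model) once the VAE / MDN-RNN losses have plateaued
```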

Details
