
kachayev / gym-microrts-paper-sb3

Licence: other
RL agent to play μRTS with Stable-Baselines3 and PyTorch

Programming Languages

python

Projects that are alternatives of or similar to gym-microrts-paper-sb3

stadium
A graphical interface for reinforcement learning and gym-based environments. Integrates tensorboard and various configuration utilities for ease of usage.
Stars: ✭ 26 (+23.81%)
Mutual labels:  gym-environment, ppo
trading gym
a unified environment for supervised learning and reinforcement learning in the context of quantitative trading
Stars: ✭ 36 (+71.43%)
Mutual labels:  gym-environment, ppo
WarKingdoms
Unity RTS Prototype (Warcraft 3 Style)
Stars: ✭ 108 (+414.29%)
Mutual labels:  real-time-strategy
rocket-league-gym
A Gym-like environment for Reinforcement Learning in Rocket League
Stars: ✭ 107 (+409.52%)
Mutual labels:  gym-environment
Relational Deep Reinforcement Learning
No description or website provided.
Stars: ✭ 44 (+109.52%)
Mutual labels:  ppo
rl trading
No description or website provided.
Stars: ✭ 14 (-33.33%)
Mutual labels:  ppo
obstacle-env
An environment for an obstacle avoidance task
Stars: ✭ 30 (+42.86%)
Mutual labels:  gym-environment
Rainy
☔ Deep RL agents with PyTorch☔
Stars: ✭ 39 (+85.71%)
Mutual labels:  ppo
model-free-algorithms
TD3, SAC, IQN, Rainbow, PPO, Ape-X and etc. in TF1.x
Stars: ✭ 56 (+166.67%)
Mutual labels:  ppo
bandits
Comparison of bandit algorithms from the Reinforcement Learning bible.
Stars: ✭ 16 (-23.81%)
Mutual labels:  reinforcement-learning-agent
Reinforcement Learning
Deep Reinforcement Learning Algorithms implemented with Tensorflow 2.3
Stars: ✭ 61 (+190.48%)
Mutual labels:  ppo
imitation learning
PyTorch implementation of some reinforcement learning algorithms: A2C, PPO, Behavioral Cloning from Observation (BCO), GAIL.
Stars: ✭ 93 (+342.86%)
Mutual labels:  ppo
Deep-Reinforcement-Learning-for-Automated-Stock-Trading-Ensemble-Strategy-ICAIF-2020
Live Trading. Please star.
Stars: ✭ 1,251 (+5857.14%)
Mutual labels:  ppo
Explorer
Explorer is a PyTorch reinforcement learning framework for exploring new ideas.
Stars: ✭ 54 (+157.14%)
Mutual labels:  ppo
reinforcement learning ppo rnd
Deep Reinforcement Learning by using Proximal Policy Optimization and Random Network Distillation in Tensorflow 2 and Pytorch with some explanation
Stars: ✭ 33 (+57.14%)
Mutual labels:  ppo
RL-code-resources
A collection of Reinforcement Learning GitHub code resources divided by frameworks and environments
Stars: ✭ 51 (+142.86%)
Mutual labels:  reinforcement-learning-agent
gym-mtsim
A general-purpose, flexible, and easy-to-use simulator alongside an OpenAI Gym trading environment for MetaTrader 5 trading platform (Approved by OpenAI Gym)
Stars: ✭ 196 (+833.33%)
Mutual labels:  gym-environment
gym-cryptotrading
OpenAI Gym Environment API based Bitcoin trading environment
Stars: ✭ 111 (+428.57%)
Mutual labels:  gym-environment
ElegantRL
Scalable and Elastic Deep Reinforcement Learning Using PyTorch. Please star. 🔥
Stars: ✭ 2,074 (+9776.19%)
Mutual labels:  ppo
td-reg
TD-Regularized Actor-Critic Methods
Stars: ✭ 28 (+33.33%)
Mutual labels:  ppo

Gym-μRTS with Stable-Baselines3/PyTorch

This repo contains an attempt to reproduce the Gridnet PPO with invalid action masking algorithm for playing μRTS using the Stable-Baselines3 library. Apart from reproducibility, this might open access to a diverse set of well-tested algorithms and tooling for training, evaluation, and more.

Original paper: Gym-μRTS: Toward Affordable Deep Reinforcement Learning Research in Real-time Strategy Games.

Original code: gym-microrts-paper.

demo.gif

Install

Prerequisites:

  • Python 3.7.1+
  • Java 8.0+
  • FFmpeg (for video capturing)
git clone https://github.com/kachayev/gym-microrts-paper-sb3
cd gym-microrts-paper-sb3
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Note that I use a newer version of gym-microrts compared to the one that was originally used for the paper.
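
For orientation, here is a minimal sketch of how a vectorized Gym-μRTS environment is typically constructed with a recent gym-microrts release. Treat the constructor arguments (e.g. map_paths vs. the older map_path) and the reward weights as assumptions that vary between releases, not as the exact call made in the training script:

import numpy as np
from gym_microrts import microrts_ai
from gym_microrts.envs.vec_env import MicroRTSGridModeVecEnv

# 16 bot-controlled opponents: 8x coacAI and 8x randomBiasedAI, no self-play
envs = MicroRTSGridModeVecEnv(
    num_selfplay_envs=0,
    num_bot_envs=16,
    max_steps=2000,
    ai2s=[microrts_ai.coacAI] * 8 + [microrts_ai.randomBiasedAI] * 8,
    map_paths=["maps/16x16/basesWorkers16x16.xml"],
    reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
)
obs = envs.reset()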

Training

To train an agent:

$ python ppo_gridnet_diverse_encode_decode_sb3.py

If everything is set up correctly, you'll see the typical SB3 verbose logging:

Using cuda device
---------------------------------
| microrts/          |          |
|    avg_exec_time   | 0.00409  |
|    num_calls       | 256      |
|    total_exec_time | 1.05     |
| time/              |          |
|    fps             | 560      |
|    iterations      | 1        |
|    time_elapsed    | 10       |
|    total_timesteps | 6144     |
---------------------------------
-----------------------------------------
| microrts/               |             |
|    avg_exec_time        | 0.00321     |
|    num_calls            | 512         |
|    total_exec_time      | 1.64        |
| time/                   |             |
|    fps                  | 164         |
|    iterations           | 2           |
|    time_elapsed         | 74          |
|    total_timesteps      | 12288       |
| train/                  |             |
|    approx_kl            | 0.001475019 |
|    clip_fraction        | 0.0575      |
|    clip_range           | 0.1         |
|    entropy_loss         | -1.46       |
|    explained_variance   | 0.00712     |
|    learning_rate        | 0.00025     |
|    loss                 | 0.0579      |
|    n_updates            | 4           |
|    policy_gradient_loss | -0.0032     |
|    value_loss           | 0.261       |
-----------------------------------------

By default, all settings are set as close to the original implementation from the paper as possible. Though the script supports flexible parameters:

$ python ppo_gridnet_diverse_encode_decode_sb3.py \
  --total-timesteps 10_000 \
  --bot-envs coacAI=8 randomBiasedAI=8 \
  --num-selfplay-envs 12 \
  --batch-size 2048 \
  --n-epochs 10

A trained agent is automatically saved to the agents/ folder (or any other folder provided via the --exp-folder parameter). Now you can use enjoy.py to see it in action:

$ python enjoy.py \
  --agent-file agents/ppo_gridnet_diverse_encode_decode_sb3__1__1640241051.zip \
  --max-steps 1_000 \
  --bot-envs randomBiasedAI=1

Training progress is automatically logged to TensorBoard. Watch the progress locally:

$ tensorboard --logdir runs/
$ open http://localhost:6006

To profile the code, use cProfile:

$ python -m cProfile -s cumulative enjoy.py \
  --agent-file agents/ppo_gridnet_diverse_encode_decode_sb3__1__1640241051.zip \
  --max-steps 4_000 \
  --bot-envs workerRushAI=1

As soon as the correctness of the implementation is verified, I will provide details on how to use RL Baselines3 Zoo for training and evaluation.

Implementational Caveats

A few notes / pain points regarding the implementation of the algorithm and the process of integrating it with stable-baselines3:

  • Gym does not ship a space for the "array of multidiscrete" use case (let's be honest, it's not very common), but it does allow defining your own space when necessary. A new space, once defined, is not easy to integrate into SB3: in a few different places SB3 raises NotImplementedError when facing an unknown space (example 1, example 2).
  • Switching to a fully rolled-out MultiDiscrete space definition seems to carry a significant performance penalty. Still investigating whether this can be improved.
  • Invalid action masking is implemented by passing masks into observations from the wrapper (the observation space is replaced with a gym.spaces.Dict holding both observations and masks). Done this way, masks are available to the policy and fit the rollout buffer layout. Masking itself is applied by setting the logits of invalid actions to -inf (in practice, a very large negative number); see the sketch after this list.
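
The logit-masking trick from the last point can be sketched as follows (masked_categorical is an illustrative helper, not the exact code from this repo):

import torch
from torch.distributions import Categorical

def masked_categorical(logits: torch.Tensor, mask: torch.Tensor) -> Categorical:
    # Invalid actions (mask == 0) get a very large negative logit so that,
    # after softmax, their probability is effectively zero.
    neg_inf = torch.tensor(-1e8, dtype=logits.dtype, device=logits.device)
    return Categorical(logits=torch.where(mask.bool(), logits, neg_inf))

# Hypothetical shapes: one discrete action component with 7 choices, batch of 4
logits = torch.randn(4, 7)
mask = torch.randint(0, 2, (4, 7))  # 1 = valid action, 0 = invalid
action = masked_categorical(logits, mask).sample()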

Look for xxx(hack) comments in the code for more details.

More Experimentation

Additional experiments with implementation details (those not present in the original paper) are now moved to separate scripts (to avoid confusion).

Linear Critic

The idea is to implement the critic (value approximation) as an affine transformation rather than a 2-layer NN. In addition to this change, the CNN output is now L2-normalized.
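
A rough sketch of what such a value head could look like (class and argument names are illustrative, not the script's exact code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearCritic(nn.Module):
    """Value head as a single affine map over the L2-normalized encoder output."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.value = nn.Linear(latent_dim, 1)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        latent = F.normalize(latent, p=2, dim=-1)  # L2-normalize the CNN output
        return self.value(latent)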

$ python ppo_gridnet_linear_critic.py \
  --total-timesteps 10_000_000 \
  --bot-envs coacAI=24 randomBiasedAI=24 \
  --num-selfplay-envs 0 \
  --batch-size 2048 \
  --n-epochs 10

Linear Actor

After a quick analysis of the embedding space produced by the encoder, some observations:

  • encoder embeddings carry a weak signal for reconstructing features of the environment (using linear probes)
  • embeddings along a single trajectory do not exhibit smoothness

Hypothetically, this means the encoder has "collapsed" with the actor network (decisions are made mostly on the encoder side). Practically, this means weaker generalization. To test the hypothesis, ppo_gridnet_linear_actor implements the policy network as a simple linear controller applied to all cells on the map (leveraging the fact that the encoder produces a 256-dimensional vector).
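
One possible reading of the "linear controller applied to all cells" idea, as a sketch (names, shapes, and the per-cell logit count are assumptions, not the script's exact code):

import torch
import torch.nn as nn

class LinearActor(nn.Module):
    """Shared per-cell affine policy head: the same linear map turns each cell's
    256-dimensional feature into that cell's flattened action logits."""

    def __init__(self, latent_dim: int = 256, logits_per_cell: int = 78):
        super().__init__()
        self.policy = nn.Linear(latent_dim, logits_per_cell)

    def forward(self, cell_features: torch.Tensor) -> torch.Tensor:
        # cell_features: (batch, H*W, latent_dim) -> (batch, H*W, logits_per_cell)
        return self.policy(cell_features)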

$ python ppo_gridnet_linear_actor.py \
  --total-timesteps 10_000_000 \
  --bot-envs lightRushAI=12 workerRushAI=12 \
  --num-selfplay-envs 0 \
  --batch-size 2048 \
  --n-epochs 10