BlackHC / mdp

License: Apache-2.0
Make it easy to specify simple MDPs that are compatible with the OpenAI Gym.

Programming Languages

python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects
TeX
3793 projects

Projects that are alternatives of or similar to mdp

Gymfc
A universal flight control tuning framework
Stars: ✭ 210 (+600%)
Mutual labels:  openai-gym, rl
CartPole
Run OpenAI Gym on a Server
Stars: ✭ 16 (-46.67%)
Mutual labels:  openai-gym, rl
Mushroom Rl
Python library for Reinforcement Learning.
Stars: ✭ 442 (+1373.33%)
Mutual labels:  openai-gym, rl
Noreward Rl
[ICML 2017] TensorFlow code for Curiosity-driven Exploration for Deep Reinforcement Learning
Stars: ✭ 1,176 (+3820%)
Mutual labels:  openai-gym, rl
Reinforcement learning
Implementation of selected reinforcement learning algorithms in Tensorflow. A3C, DDPG, REINFORCE, DQN, etc.
Stars: ✭ 132 (+340%)
Mutual labels:  openai-gym, rl
Stable Baselines
Mirror of Stable-Baselines: a fork of OpenAI Baselines, implementations of reinforcement learning algorithms
Stars: ✭ 115 (+283.33%)
Mutual labels:  openai-gym, rl
Rl Baselines Zoo
A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
Stars: ✭ 839 (+2696.67%)
Mutual labels:  openai-gym, rl
Coach
Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
Stars: ✭ 2,085 (+6850%)
Mutual labels:  openai-gym, rl
gym-rs
OpenAI's Gym written in pure Rust for blazingly fast performance
Stars: ✭ 34 (+13.33%)
Mutual labels:  openai-gym, rl
pong-from-pixels
Training a Neural Network to play Pong from pixels
Stars: ✭ 25 (-16.67%)
Mutual labels:  openai-gym
FinRL Podracer
Cloud-native Financial Reinforcement Learning
Stars: ✭ 179 (+496.67%)
Mutual labels:  openai-gym
rl trading
No description or website provided.
Stars: ✭ 14 (-53.33%)
Mutual labels:  openai-gym
Pytorch-PCGrad
Pytorch reimplementation for "Gradient Surgery for Multi-Task Learning"
Stars: ✭ 179 (+496.67%)
Mutual labels:  rl
sc2gym
PySC2 OpenAI Gym Environments
Stars: ✭ 50 (+66.67%)
Mutual labels:  openai-gym
gym-cartpole-swingup
A simple, continuous-control environment for OpenAI Gym
Stars: ✭ 20 (-33.33%)
Mutual labels:  openai-gym
Pytorch-RL-CPP
A Repository with C++ implementations of Reinforcement Learning Algorithms (Pytorch)
Stars: ✭ 73 (+143.33%)
Mutual labels:  openai-gym
iroko
A platform to test reinforcement learning policies in the datacenter setting.
Stars: ✭ 55 (+83.33%)
Mutual labels:  openai-gym
gym-R
An R package providing access to the OpenAI Gym API
Stars: ✭ 21 (-30%)
Mutual labels:  openai-gym
Corailed
Unrailed! simulator using C++ with some reinforcement learning and Unrailed! AI using Python with OpenCV
Stars: ✭ 15 (-50%)
Mutual labels:  rl
modelicagym
Modelica models integration with Open AI Gym
Stars: ✭ 53 (+76.67%)
Mutual labels:  openai-gym

MDP environments for the OpenAI Gym

This Python framework makes it very easy to specify simple MDPs.


Installation

To install using pip, use:

pip install blackhc.mdp

To run the tests, use:

python setup.py test
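
As a quick sanity check after installing, importing the package and building one of the bundled example environments (the example module is described further below) should work:

python -c "from blackhc.mdp import example; print(example.ONE_ROUND_DMDP.to_env())"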

Whitepaper

A whitepaper is available at https://arxiv.org/abs/1709.09069. Here is a BibTeX entry that you can use in publications (or download CITE_ME.bib):

@techreport{blackhc.mdp,
    Author = {Andreas Kirsch},
    Title = {MDP environments for the OpenAI Gym},
    Year = {2017},
    Url = {https://arxiv.org/abs/1709.09069}
}

Introduction

In reinforcement learning, agents learn to maximize accumulated rewards from an environment that they can interact with by observing and taking actions. Usually, these environments satisfy a Markov property and are treated as Markov Decision Processes (MDPs).

The OpenAI Gym is a standardized and open framework that provides many different environments to train agents against through a simple API.
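
At its core, that API is a small agent-environment loop. The following sketch is purely illustrative (it uses the classic CartPole-v0 task, which is not part of this project) and assumes the Gym API as it existed when this framework was written:

import gym

env = gym.make('CartPole-v0')  # illustrative; any Gym environment exposes the same interface

observation = env.reset()
done = False
total_reward = 0.
while not done:
    action = env.action_space.sample()  # a random agent
    observation, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward)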

Even the simplest of these environments already has a level of complexity that is interesting for research but can make it hard to track down bugs. However, the gym provides four very simple environments that are useful for testing: the gym.envs.debugging package contains one-round and two-round environments, each with either deterministic or non-deterministic rewards. The author has found these environments very useful for smoke-testing code changes.

This Python framework makes it very easy to specify simple MDPs, like the ones described above, in an extensible way. With it, one can validate that agents converge correctly and examine other properties.

Specification of MDPs

MDPs are Markov processes that are augmented with actions, a reward function, and a discount factor. An MDP can be fully specified by a tuple of:

  • a finite set of states,
  • a finite set of actions,
  • a matrix that specifies the probabilities of transitioning to a new state for a given state and action,
  • a reward function that specifies the reward for a given action taken in a certain state, and
  • a discount factor.

The reward function can be deterministic, or it can specify a probability distribution over rewards.
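
In standard notation (a sketch of the usual formalization, not quoted from the whitepaper), such an MDP is the tuple

\[
(\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a), \qquad
\gamma \in [0, 1],
\]

where $R(s, a)$ is either a fixed reward or a distribution over rewards.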

Within this framework, MDPs can be specified in Python using a simple domain-specific language (DSL). For example, the one-round deterministic environment defined in gym.envs.debugging.one_round_deterministic_reward could be specified as follows:

from blackhc.mdp import dsl

start = dsl.state()
end = dsl.terminal_state()

action_0 = dsl.action()
action_1 = dsl.action()

start & (action_0 | action_1) > end
start & action_1 > dsl.reward(1.)

The DSL is based on the following grammar (given in EBNF):

TRANSITION ::= STATE '&' ACTION '>' OUTCOME
OUTCOME ::= (REWARD | STATE) ['*' WEIGHT]

ALTERNATIVES ::= ALTERNATIVE ('|' ALTERNATIVE)* 

See below for how alternatives work. Alternatives can be used in place of states, actions, and outcomes, and are themselves composed of states, actions, and outcomes.

For a given state and action, one or more outcomes can be specified. Outcomes are state transitions or rewards. If multiple state transitions or rewards are specified for the same state and action, the MDP is non-deterministic and the state transition (or reward) is drawn from a categorical distribution. By default, each outcome is weighted uniformly; the weights can be changed either by specifying duplicate transitions or by using an explicit weight factor.

For example, to specify that taking an action in a state yields a reward of +1 or -1 with equal probability, and that the state stays the same with probability 3/4 while transitioning to the next state with probability 1/4, we could write:

state & action > dsl.reward(-1.) | dsl.reward(1.)
state & action > state * 3 | next_state
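
Putting these pieces together, the one-round environment with non-deterministic rewards mentioned in the introduction could presumably be written along the following lines; this is a sketch that only uses the constructs introduced above, and the concrete reward values are illustrative:

from blackhc.mdp import dsl

start = dsl.state()
end = dsl.terminal_state()
action = dsl.action()

# The episode ends after a single step; the reward is 0 or 1 with equal probability.
start & action > end
start & action > dsl.reward(0.) | dsl.reward(1.)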

Alternatives are distributive with respect to both conjunctions (&) and outcome mappings (>), so:

(a | b) & (c | d) > (e | f) ===
(a & c > e) | (a & c > f) | (a & d > e) | 
(a & d > f) | (b & c > e) | ... 

Alternatives can consist of states, actions, outcomes, conjunctions or partial transitions. For example, the following are valid alternatives:

stateA & actionA | stateB & actionB
(actionA > stateC) | (actionB > stateD)

As the DSL is implemented within Python, operator overloading is used to implement the semantics. Operator precedence works in our favor: * binds more tightly than &, which binds more tightly than |, which binds more tightly than >. This allows transitions to be written naturally, without extra parentheses.
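
Fully parenthesized, the earlier transitions therefore read as follows (a restatement of how Python groups these operators, using the names from the examples above, not additional API):

# '*' binds tighter than '&', which binds tighter than '|', which binds tighter than '>'.
start & action_1 > dsl.reward(1.)        # parsed as: (start & action_1) > dsl.reward(1.)
state & action > state * 3 | next_state  # parsed as: (state & action) > ((state * 3) | next_state)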

Conventional API

The framework also supports specifying an MDP through a conventional API, since a DSL is not always preferred.

from blackhc import mdp

spec = mdp.MDPSpec()
start = spec.state('start')
end = spec.state('end', terminal_state=True)
action_0 = spec.action()
action_1 = spec.action()

spec.transition(start, action_0, mdp.NextState(end))
spec.transition(start, action_1, mdp.NextState(end))
spec.transition(start, action_1, mdp.Reward(1))

Visualization

To make debugging easier, MDPs can be converted to networkx graphs and rendered using pydotplus and GraphViz.

from blackhc import mdp
from blackhc.mdp import example

spec = example.ONE_ROUND_DMDP

spec_graph = spec.to_graph()
spec_png = mdp.graph_to_png(spec_graph)

mdp.display_mdp(spec)
Figure 1: One round deterministic MDP

Optimal values

The framework also contains a small module that can compute the optimal value functions using linear programming.

from blackhc.mdp import lp
from blackhc.mdp import example

solver = lp.LinearProgramming(example.ONE_ROUND_DMDP)
print(solver.compute_q_table())
print(solver.compute_v_vector())
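
For reference, the optimal state values solve the standard linear program below (a textbook formulation; the module's internal encoding may differ in detail), where $R(s, a)$ denotes the expected immediate reward:

\[
\min_{V} \sum_{s \in \mathcal{S}} V(s)
\quad \text{subject to} \quad
V(s) \ge R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')
\quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A},
\]

and the Q-table then follows as $Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')$.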

Gym environment

An environment that is compatible with the OpenAI Gym can be created easily by using the to_env() method. It supports rendering into Jupyter notebooks, as an RGB array (for storing videos), and as PNG byte data.

from blackhc import mdp
from blackhc.mdp import example

env = example.MULTI_ROUND_NMDP.to_env()

env.reset()
env.render()

is_done = False
while not is_done:
    state, reward, is_done, _ = env.step(env.action_space.sample())
    env.render()
Figure 2: env.render() of `example.MULTI_ROUND_NMDP`

Examples

The blackhc.mdp.example package provides five MDPs. Four of them match the ones in gym.envs.debugging, and the fifth is depicted in Figure 2.
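
A simple way to exercise the bundled examples is to run a random agent against them; the sketch below only uses the two example constants named in this document and follows the environment loop shown above:

from blackhc.mdp import example

for spec in (example.ONE_ROUND_DMDP, example.MULTI_ROUND_NMDP):
    env = spec.to_env()
    env.reset()
    is_done = False
    total_reward = 0.
    while not is_done:
        _, reward, is_done, _ = env.step(env.action_space.sample())
        total_reward += reward
    print(total_reward)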

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].