
MatthewJA / Inverse Reinforcement Learning

Licence: MIT
Implementations of selected inverse reinforcement learning algorithms.

Programming Languages

python
139,335 projects - #7 most used programming language

Projects that are alternatives to or similar to Inverse Reinforcement Learning

Recsim
A Configurable Recommender Systems Simulation Platform
Stars: ✭ 461 (-11.69%)
Mutual labels:  reinforcement-learning
Tensorflow Book
Accompanying source code for Machine Learning with TensorFlow. Refer to the book for step-by-step explanations.
Stars: ✭ 4,448 (+752.11%)
Mutual labels:  reinforcement-learning
Rosettastone
Hearthstone simulator using C++ with some reinforcement learning
Stars: ✭ 510 (-2.3%)
Mutual labels:  reinforcement-learning
Robosuite
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
Stars: ✭ 462 (-11.49%)
Mutual labels:  reinforcement-learning
Awesome Robotics
A curated list of awesome links and software libraries that are useful for robots.
Stars: ✭ 478 (-8.43%)
Mutual labels:  reinforcement-learning
Autokernel
AutoKernel is a simple, easy-to-use automatic operator optimization tool with a low barrier to entry, designed to improve the efficiency of deploying deep learning algorithms.
Stars: ✭ 485 (-7.09%)
Mutual labels:  reinforcement-learning
Courses
Quiz & Assignment of Coursera
Stars: ✭ 454 (-13.03%)
Mutual labels:  reinforcement-learning
Gym Starcraft
StarCraft environment for OpenAI Gym, based on Facebook's TorchCraft. (In progress)
Stars: ✭ 514 (-1.53%)
Mutual labels:  reinforcement-learning
Rl a3c pytorch
A3C LSTM Atari with Pytorch plus A3G design
Stars: ✭ 482 (-7.66%)
Mutual labels:  reinforcement-learning
Ai Toolbox
A C++ framework for MDPs and POMDPs with Python bindings
Stars: ✭ 500 (-4.21%)
Mutual labels:  reinforcement-learning
Neurojs
A JavaScript deep learning and reinforcement learning library.
Stars: ✭ 4,344 (+732.18%)
Mutual labels:  reinforcement-learning
Tensorflow Reinforce
Implementations of Reinforcement Learning Models in Tensorflow
Stars: ✭ 480 (-8.05%)
Mutual labels:  reinforcement-learning
Obstacle Tower Env
Obstacle Tower Environment
Stars: ✭ 496 (-4.98%)
Mutual labels:  reinforcement-learning
Rl Book
Source codes for the book "Reinforcement Learning: Theory and Python Implementation"
Stars: ✭ 464 (-11.11%)
Mutual labels:  reinforcement-learning
Dissecting Reinforcement Learning
Python code, PDFs and resources for the series of posts on Reinforcement Learning which I published on my personal blog
Stars: ✭ 512 (-1.92%)
Mutual labels:  reinforcement-learning
Arnold
Arnold - DOOM Agent
Stars: ✭ 457 (-12.45%)
Mutual labels:  reinforcement-learning
Ml Mipt
Open Machine Learning course at MIPT
Stars: ✭ 480 (-8.05%)
Mutual labels:  reinforcement-learning
Machine Learning Articles
Monthly Series - Top 10 Machine Learning Articles
Stars: ✭ 516 (-1.15%)
Mutual labels:  reinforcement-learning
Holodeck
High Fidelity Simulator for Reinforcement Learning and Robotics Research.
Stars: ✭ 513 (-1.72%)
Mutual labels:  reinforcement-learning
Reaver
Reaver: Modular Deep Reinforcement Learning Framework. Focused on StarCraft II. Supports Gym, Atari, and MuJoCo.
Stars: ✭ 499 (-4.41%)
Mutual labels:  reinforcement-learning

Inverse Reinforcement Learning

DOI

Implements selected inverse reinforcement learning (IRL) algorithms as part of COMP3710, supervised by Dr Mayank Daswani and Dr Marcus Hutter. My final report is available here and describes the implemented algorithms.

If you use this code in your work, you can cite it as follows:

@misc{alger16,
  author       = {Matthew Alger},
  title        = {Inverse Reinforcement Learning},
  year         = 2016,
  doi          = {10.5281/zenodo.555999},
  url          = {https://doi.org/10.5281/zenodo.555999}
}

Algorithms implemented

  • Linear programming IRL. From Ng & Russell, 2000. Small state space and large state space linear programming IRL.
  • Maximum entropy IRL. From Ziebart et al., 2008.
  • Deep maximum entropy IRL. From Wulfmeier et al., 2015; original derivation.

Additionally, the following MDP domains are implemented:

  • Gridworld (Sutton, 1998)
  • Objectworld (Levine et al., 2011)

Requirements

  • NumPy
  • SciPy
  • CVXOPT
  • Theano
  • Matplotlib (for examples)

Module documentation

The following is a brief list of the functions and classes exported by each module. Full documentation is included in the docstrings of each function or class; only functions and classes intended for use outside the module are documented here.

linear_irl

Implements linear programming inverse reinforcement learning (Ng & Russell, 2000).

Functions:

  • irl(n_states, n_actions, transition_probability, policy, discount, Rmax, l1): Find a reward function with inverse RL.
  • large_irl(value, transition_probability, feature_matrix, n_states, n_actions, policy): Find the reward in a large state space.
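
A minimal usage sketch, assuming the repository's package layout (modules importable as irl.linear_irl and irl.mdp.gridworld; adjust the import paths to your checkout). The Rmax and l1 values are illustrative, not recommendations:

import irl.linear_irl as linear_irl
import irl.mdp.gridworld as gridworld

# Build a 5x5 gridworld with a 30% chance of moving randomly.
gw = gridworld.Gridworld(5, 0.3, 0.2)

# Linear programming IRL expects a deterministic policy: one action per state.
policy = [gw.optimal_policy_deterministic(s) for s in range(gw.n_states)]

# Recover a reward function; Rmax=1 and l1=5 here are illustrative settings.
r = linear_irl.irl(gw.n_states, gw.n_actions, gw.transition_probability,
                   policy, gw.discount, 1, 5)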

maxent

Implements maximum entropy inverse reinforcement learning (Ziebart et al., 2008).

Functions:

  • irl(feature_matrix, n_actions, discount, transition_probability, trajectories, epochs, learning_rate): Find the reward function for the given trajectories.
  • find_svf(n_states, trajectories): Find the state visitation frequency from trajectories.
  • find_feature_expectations(feature_matrix, trajectories): Find the feature expectations for the given trajectories. This is the average path feature vector.
  • find_expected_svf(n_states, r, n_actions, discount, transition_probability, trajectories): Find the expected state visitation frequencies using algorithm 1 from Ziebart et al. 2008.
  • expected_value_difference(n_states, n_actions, transition_probability, reward, discount, p_start_state, optimal_value, true_reward): Calculate the expected value difference, which is a proxy to how good a recovered reward function is.
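
An end-to-end sketch under the same import assumptions as above; the trajectory counts and optimisation settings are illustrative:

import irl.maxent as maxent
import irl.mdp.gridworld as gridworld

gw = gridworld.Gridworld(5, 0.3, 0.01)

# Demonstrations: trajectories sampled from the gridworld's optimal policy.
trajectories = gw.generate_trajectories(20, 15, gw.optimal_policy)

# One-hot ("ident") state features, shape (n_states, d).
feature_matrix = gw.feature_matrix()

# Gradient-based optimisation of the maximum entropy objective.
r = maxent.irl(feature_matrix, gw.n_actions, gw.discount,
               gw.transition_probability, trajectories,
               epochs=200, learning_rate=0.01)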

deep_maxent

Implements deep maximum entropy inverse reinforcement learning based on Ziebart et al., 2008 and Wulfmeier et al., 2015, using symbolic methods with Theano.

Functions:

  • irl(structure, feature_matrix, n_actions, discount, transition_probability, trajectories, epochs, learning_rate, initialisation="normal", l1=0.1, l2=0.1): Find the reward function for the given trajectories.
  • find_svf(n_states, trajectories): Find the state visitation frequency from trajectories.
  • find_expected_svf(n_states, r, n_actions, discount, transition_probability, trajectories): Find the expected state visitation frequencies using algorithm 1 from Ziebart et al. 2008.
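
A sketch mirroring the maxent example. Here structure is assumed to give the layer sizes of the reward network with the feature dimension first; check the docstring for the exact convention:

import irl.deep_maxent as deep_maxent
import irl.mdp.gridworld as gridworld

gw = gridworld.Gridworld(5, 0.3, 0.01)
trajectories = gw.generate_trajectories(20, 15, gw.optimal_policy)
feature_matrix = gw.feature_matrix()

# Assumed convention: (input dimension, hidden layer sizes...).
# Two small hidden layers of 3 units each.
structure = (feature_matrix.shape[1], 3, 3)

r = deep_maxent.irl(structure, feature_matrix, gw.n_actions, gw.discount,
                    gw.transition_probability, trajectories,
                    epochs=200, learning_rate=0.01, l1=0, l2=0)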

value_iteration

Implements value iteration: finding the value function of a given policy, the optimal value function, and an optimal policy. Based on Sutton & Barto, 1998.

Functions:

  • value(policy, n_states, transition_probabilities, reward, discount, threshold=1e-2): Find the value function associated with a policy.
  • optimal_value(n_states, n_actions, transition_probabilities, reward, discount, threshold=1e-2): Find the optimal value function.
  • find_policy(n_states, n_actions, transition_probabilities, reward, discount, threshold=1e-2, v=None, stochastic=True): Find the optimal policy.
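
These helpers are also useful for evaluating a recovered reward, for example by computing the value function and policy it induces. A sketch, using the gridworld's ground-truth reward for illustration:

import numpy as np

import irl.value_iteration as value_iteration
import irl.mdp.gridworld as gridworld

gw = gridworld.Gridworld(5, 0.3, 0.9)
ground_r = np.array([gw.reward(s) for s in range(gw.n_states)])

# Optimal value function under the ground-truth reward...
v = value_iteration.optimal_value(gw.n_states, gw.n_actions,
                                  gw.transition_probability, ground_r,
                                  gw.discount)

# ...and a policy derived from it.
policy = value_iteration.find_policy(gw.n_states, gw.n_actions,
                                     gw.transition_probability, ground_r,
                                     gw.discount, v=v)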

mdp

gridworld

Implements the gridworld MDP.

Classes, instance attributes, methods:

  • Gridworld(grid_size, wind, discount): Gridworld MDP.
    • actions: Tuple of (dx, dy) actions.
    • n_actions: Number of actions. int.
    • n_states: Number of states. int.
    • grid_size: Size of grid. int.
    • wind: Chance of moving randomly. float.
    • discount: MDP discount factor. float.
    • transition_probability: NumPy array with shape (n_states, n_actions, n_states) where transition_probability[si, a, sk] is the probability of transitioning from state si to state sk under action a.
    • feature_vector(i, feature_map="ident"): Get the feature vector associated with a state integer.
    • feature_matrix(feature_map="ident"): Get the feature matrix for this gridworld.
    • int_to_point(i): Convert a state int into the corresponding coordinate.
    • point_to_int(p): Convert a coordinate into the corresponding state int.
    • neighbouring(i, k): Get whether two points neighbour each other. Also returns true if they are the same point.
    • reward(state_int): Reward for being in state state_int.
    • average_reward(n_trajectories, trajectory_length, policy): Calculate the average total reward obtained by following a given policy over n_trajectories trajectories of length trajectory_length.
    • optimal_policy(state_int): The optimal policy for this gridworld.
    • optimal_policy_deterministic(state_int): Deterministic version of the optimal policy for this gridworld.
    • generate_trajectories(n_trajectories, trajectory_length, policy, random_start=False): Generate n_trajectories trajectories with length trajectory_length, following the given policy.
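
For instance, converting between state integers and grid coordinates, and inspecting state features (a sketch under the same import assumptions as above):

import irl.mdp.gridworld as gridworld

gw = gridworld.Gridworld(5, 0.3, 0.9)

p = gw.int_to_point(7)    # state 7 as an (x, y) coordinate
i = gw.point_to_int(p)    # back to the state integer; i == 7
f = gw.feature_vector(i)  # feature vector under the default "ident" map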

objectworld

Implements the objectworld MDP described in Levine et al. 2011.

Classes, instance attributes, methods:

  • OWObject(inner_colour, outer_colour): Object in objectworld.

    • inner_colour: Inner colour of object. int.
    • outer_colour: Outer colour of object. int.
  • Objectworld(grid_size, n_objects, n_colours, wind, discount): Objectworld MDP.

    • actions: Tuple of (dx, dy) actions.
    • n_actions: Number of actions. int.
    • n_states: Number of states. int.
    • grid_size: Size of grid. int.
    • n_objects: Number of objects in the world. int.
    • n_colours: Number of colours to colour objects with. int.
    • wind: Chance of moving randomly. float.
    • discount: MDP discount factor. float.
    • objects: Set of objects in the world.
    • transition_probability: NumPy array with shape (n_states, n_actions, n_states) where transition_probability[si, a, sk] is the probability of transitioning from state si to state sk under action a.
    • feature_vector(i, discrete=True): Get the feature vector associated with a state integer.
    • feature_matrix(discrete=True): Get the feature matrix for this objectworld.
    • int_to_point(i): Convert a state int into the corresponding coordinate.
    • point_to_int(p): Convert a coordinate into the corresponding state int.
    • neighbouring(i, k): Get whether two points neighbour each other. Also returns true if they are the same point.
    • reward(state_int): Reward for being in state state_int.
    • average_reward(n_trajectories, trajectory_length, policy): Calculate the average total reward obtained by following a given policy over n_trajectories trajectories of length trajectory_length.
    • generate_trajectories(n_trajectories, trajectory_length, policy): Generate n_trajectories trajectories with length trajectory_length, following the given policy.
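
Constructing an objectworld and extracting its feature matrix might look like the following sketch; discrete=False is assumed to select the continuous, distance-based features described by Levine et al.:

import irl.mdp.objectworld as objectworld

# A 10x10 grid with 15 randomly placed objects drawn from 2 colours.
ow = objectworld.Objectworld(10, 15, 2, 0.3, 0.9)

# Continuous features (discrete=False): one row per state.
features = ow.feature_matrix(discrete=False)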