
rolyatmax / Tictactoe

Tic Tac Toe Machine Learning

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to Tictactoe

Async Deeprl
Playing Atari games with TensorFlow implementation of Asynchronous Deep Q-Learning
Stars: ✭ 44 (-21.43%)
Mutual labels:  reinforcement-learning
Holodeck Engine
High Fidelity Simulator for Reinforcement Learning and Robotics Research.
Stars: ✭ 48 (-14.29%)
Mutual labels:  reinforcement-learning
Dqn
Implementation of q-learning using TensorFlow
Stars: ✭ 53 (-5.36%)
Mutual labels:  reinforcement-learning
Ml In Tf
Get started with Machine Learning in TensorFlow with a selection of good reads and implemented examples!
Stars: ✭ 45 (-19.64%)
Mutual labels:  reinforcement-learning
Dher
DHER: Hindsight Experience Replay for Dynamic Goals (ICLR-2019)
Stars: ✭ 48 (-14.29%)
Mutual labels:  reinforcement-learning
Pytorch Rl
Stars: ✭ 52 (-7.14%)
Mutual labels:  reinforcement-learning
Orbit
Open source collection of Reinforcement Learning Environments.
Stars: ✭ 44 (-21.43%)
Mutual labels:  reinforcement-learning
Reinforcement Learning
Implementation of Reinforcement Learning algorithms in Python, based on Sutton's & Barto's Book (Ed. 2)
Stars: ✭ 55 (-1.79%)
Mutual labels:  reinforcement-learning
Gbrain
GPU Javascript Library for Machine Learning
Stars: ✭ 48 (-14.29%)
Mutual labels:  reinforcement-learning
Notes
The notes for Math, Machine Learning, Deep Learning and Research papers.
Stars: ✭ 53 (-5.36%)
Mutual labels:  reinforcement-learning
Deep traffic
MIT DeepTraffic top 2% solution (75.01 mph) 🚗.
Stars: ✭ 47 (-16.07%)
Mutual labels:  reinforcement-learning
Mujocounity
Reproducing MuJoCo benchmarks in a modern, commercial game/physics engine (Unity + PhysX).
Stars: ✭ 47 (-16.07%)
Mutual labels:  reinforcement-learning
Policy Gradient Methods
Implementation of Algorithms from the Policy Gradient Family. Currently includes: A2C, A3C, DDPG, TD3, SAC
Stars: ✭ 54 (-3.57%)
Mutual labels:  reinforcement-learning
Biped trajectory optimization
Implementing trajectory optimization on bipedal system
Stars: ✭ 45 (-19.64%)
Mutual labels:  reinforcement-learning
Reinforcepy
Collection of reinforcement learners implemented in python. Mainly including DQN and its variants
Stars: ✭ 54 (-3.57%)
Mutual labels:  reinforcement-learning
Deterministic Gail Pytorch
PyTorch implementation of Deterministic Generative Adversarial Imitation Learning (GAIL) for Off Policy learning
Stars: ✭ 44 (-21.43%)
Mutual labels:  reinforcement-learning
Gym Minigrid
Minimalistic gridworld package for OpenAI Gym
Stars: ✭ 1,047 (+1769.64%)
Mutual labels:  reinforcement-learning
Ml Surveys
📋 Survey papers summarizing advances in deep learning, NLP, CV, graphs, reinforcement learning, recommendations, etc.
Stars: ✭ 1,063 (+1798.21%)
Mutual labels:  reinforcement-learning
Demos
Some JavaScript works published as demos, mostly ML or DS
Stars: ✭ 55 (-1.79%)
Mutual labels:  reinforcement-learning
Notebooks
Some notebooks
Stars: ✭ 53 (-5.36%)
Mutual labels:  reinforcement-learning

Reinforcement Learning With TicTacToe

Making computers learn how to play tic-tac-toe.

I started messing around with reinforcement learning when I heard about a Flappy Bird RL project on GitHub. Out of the box, the algorithm took about 7 or 8 hours to train. I figured it could learn faster if multiple instances of the same algorithm spread out over the internet could all work to update the same matrix. So after forking the repo and creating a distributed learning version, I was able to get it to train in about 30 minutes with 8 browser tabs. I found myself wanting to explore reinforcement learning a bit more, and so this little project was born.

So, a general overview. This learning algorithm doesn't know how to play tic-tac-toe. It doesn't know the rules; it doesn't learn the rules; it doesn't even know it's playing against an opponent. All it knows is what the board looks like and what its options are. When it has made a move, it is rewarded or punished based on that move for that particular position. After a few thousand games, it effectively "learns" how to play. The algorithm's "knowledge" is represented in a matrix (called a "policy") where values are assigned to every possible move for each board state the algorithm has encountered.
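
The core bookkeeping can be pictured with a small sketch like the following (illustrative JavaScript only, not the actual code in Q.js; the names and the learning-rate constant are assumptions):

var policy = {};          // e.g. policy['AB--A---B'] = { 0: 0.4, 3: -0.1, ... }
var LEARNING_RATE = 0.1;  // assumed constant

// Nudge the stored value for a (state, move) pair toward the reward observed
// after making that move. A full Q-learning update would also fold in the
// estimated value of the resulting state.
function updatePolicy(state, move, reward) {
    var values = policy[state] || (policy[state] = {});
    var current = values[move] || 0;
    values[move] = current + LEARNING_RATE * (reward - current);
}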

The first two options presented, "Train Locally" and "Train with Distributed Learning", simply refer to where the q-matrix or "policy" is stored. "Train Locally" keeps all the training confined to the browser tab. You can watch it train from nothing (it might take 2 or 3 minutes). "Distributed Learning" trains locally as well, but it also syncs its policy with the server. That shared policy tends to be fairly well trained already and rarely loses.

For the "distributed learning" mode, the learner pushes all policy updates onto a stack and periodically sends the batch of updated values to the server. Each update consists of a state (the board), an action (a move), and a value for that state-action pair (higher values = better moves). Currently, the server goes through the updates, writing over the state-action pairs in its own canonical policy, and then sends back an entire copy of the new-and-improved policy, with which the client replaces its own.
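
In rough terms, the client side of that exchange might look something like this (a sketch only; the endpoint name, flush interval, and payload shape are assumptions, not the project's actual API):

var policy = {};           // local policy table, as in the earlier sketch
var pendingUpdates = [];   // the "stack" of updates waiting to be sent

function recordUpdate(state, action, value) {
    pendingUpdates.push({ state: state, action: action, value: value });
}

// Periodically flush the batch and accept the server's canonical policy back.
setInterval(function () {
    if (!pendingUpdates.length) return;
    var batch = pendingUpdates;
    pendingUpdates = [];
    fetch('/policy/updates', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(batch)
    })
    .then(function (res) { return res.json(); })
    .then(function (canonicalPolicy) {
        policy = canonicalPolicy;  // replace the local copy wholesale
    });
}, 5000);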

As you might have guessed, some clobbering takes place on the canonical policy. However, it might be possible to avoid clobbering entirely by passing the server only a state, an action, and a reward, letting the server run its own evaluation function instead of relying on values computed by the clients. This would also require the constants in the evaluation function to match on the client and the server.
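
A sketch of that alternative on the server side (function and field names are illustrative; the constant would have to equal the client's):

var LEARNING_RATE = 0.1;  // must match the client's constant

// Apply one reported (state, action, reward) triple to the canonical policy,
// so the server computes values itself rather than trusting client-sent ones.
function applyClientUpdate(canonicalPolicy, update) {
    var values = canonicalPolicy[update.state] || (canonicalPolicy[update.state] = {});
    var current = values[update.action] || 0;
    values[update.action] = current + LEARNING_RATE * (update.reward - current);
}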

The algorithm trains by playing against a version of itself. The "Smart" version (listed under the "Scores" section) always selects a move its trained policy recommends. The "Kinda Smart" version (a bit of a misnomer) will occasionally select random moves to learn from the outcome. It is this "exploration" that actually allows the algorithm to discover better policies. Because you can play a tic-tac-toe game perfectly and not win, the best measure of a policy's efficacy is how many wins it has given up to its opponent. Because the "Kinda Smart" version makes random moves for some percentage of its total plays, it happens to give up quite a few wins to its opponent even though it shares the exact same q-matrix with the "Smart" version.
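
That exploration strategy is essentially epsilon-greedy. A sketch of how the "Kinda Smart" player's move selection could work (the exploration rate and function names are assumptions, not the actual Q.js code):

var EXPLORATION_RATE = 0.2;  // fraction of moves chosen at random (assumed)

function chooseMove(policy, state, availableMoves) {
    if (Math.random() < EXPLORATION_RATE) {
        // explore: try any legal move, even one the policy scores poorly
        return availableMoves[Math.floor(Math.random() * availableMoves.length)];
    }
    // exploit: take the move with the highest learned value for this state
    var values = policy[state] || {};
    return availableMoves.reduce(function (best, move) {
        return (values[move] || 0) > (values[best] || 0) ? move : best;
    });
}

The "Smart" player would be the same function with the random branch removed.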

You can pause the training and play against the "Smart" version at any time. After the algorithm moves, it displays its options in the bottom corner of the screen. You can mouse over these options to see the relative favorability of each move.

There were some interesting optimizations I made which seemed to speed up learning quite a bit. The most important was probably the work I did normalizing equivalent board states. For (almost) every possible tic-tac-toe board, there are at least a few other boards that are essentially equivalent: you can take a given board and rotate it up to three times or flip it along the vertical and/or horizontal axes.

For example, this board

 X | O |
-----------
   | X |
-----------
   |   | O

is, for our purposes, the same as

   |   | O              O |   |
-----------            -----------
 O | X |        and       | X | O
-----------            -----------
 X |   |                  |   | X

Making sure these board states were considered equivalent cut down on the amount of memory required to store the policy and, consequently, the amount of time required to generate an effective policy.

To accomplish the normalization, the choosing function turns the board state into a string. The string representation of this board (from the X's point of view):

 X | O |
-----------
   | X |
-----------
   |   | O

is AB--A---B. ("A"s represent a player's own symbol, while "B"s represent the opponent's. From the O's point of view, the same board would be represented as BA--B---A.) It then checks the matrix for any equivalent permutations by rotating and flipping the board; the equivalent boards shown above, for example, would all map to the same entry.
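
The serialization step might look roughly like this (illustrative only; the real function in Q.js may differ):

// Turn the 9 cells (read left to right, top to bottom) into a string of
// 'A' (own symbol), 'B' (opponent's symbol), and '-' (empty).
function serializeBoard(cells, ownSymbol) {
    return cells.map(function (cell) {
        if (cell === '') return '-';
        return cell === ownSymbol ? 'A' : 'B';
    }).join('');
}

// serializeBoard(['X','O','','','X','','','','O'], 'X')  ->  'AB--A---B'
// serializeBoard(['X','O','','','X','','','','O'], 'O')  ->  'BA--B---A'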

If it finds a permutation, it remembers how many rotations and flips it used so that it can apply the same transformations to the move it selects, "translating" the moves, so to speak, between the actual board and the permutation.
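
On the 9-character strings, the rotations and flips reduce to index shuffles, and composing them produces all eight symmetries of the board. A sketch (the index mappings are standard, but the helper names are illustrative, not the actual Q.js code):

var ROTATE = [6, 3, 0, 7, 4, 1, 8, 5, 2];  // 90° clockwise: new cell i takes old cell ROTATE[i]
var FLIP   = [2, 1, 0, 5, 4, 3, 8, 7, 6];  // mirror across the vertical axis

function permute(state, mapping) {
    return mapping.map(function (oldIndex) { return state[oldIndex]; }).join('');
}

// Translating a move back: if the chosen move sits at index i on the rotated
// board, the corresponding cell on the real board is ROTATE[i], applied once
// per rotation used (FLIP is its own inverse, so it works in both directions).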

See it in action at tbaldw.in/tictactoe. Check out the code at github.com/rolyatmax/tictactoe. It's a bit messy in parts (as it has changed tremendously over time), but the meat of the learning algorithm is in public/js/Q.js. For more info about how Q-Learning works, check out the Wikipedia article.

To run on your own:

npm install

cd public && npm install && npm run build

cd .. && npm run start

Point your browser to localhost:8080
