
rolyatmax / Tictactoe

Tic Tac Toe Machine Learning

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to Tictactoe

Async Deeprl
Playing Atari games with TensorFlow implementation of Asynchronous Deep Q-Learning
Stars: ✭ 44 (-21.43%)
Mutual labels:  reinforcement-learning
Holodeck Engine
High Fidelity Simulator for Reinforcement Learning and Robotics Research.
Stars: ✭ 48 (-14.29%)
Mutual labels:  reinforcement-learning
Dqn
Implementation of q-learning using TensorFlow
Stars: ✭ 53 (-5.36%)
Mutual labels:  reinforcement-learning
Ml In Tf
Get started with Machine Learning in TensorFlow with a selection of good reads and implemented examples!
Stars: ✭ 45 (-19.64%)
Mutual labels:  reinforcement-learning
Dher
DHER: Hindsight Experience Replay for Dynamic Goals (ICLR-2019)
Stars: ✭ 48 (-14.29%)
Mutual labels:  reinforcement-learning
Pytorch Rl
Stars: ✭ 52 (-7.14%)
Mutual labels:  reinforcement-learning
Orbit
Open source collection of Reinforcement Learning Environments.
Stars: ✭ 44 (-21.43%)
Mutual labels:  reinforcement-learning
Reinforcement Learning
Implementation of Reinforcement Learning algorithms in Python, based on Sutton's & Barto's Book (Ed. 2)
Stars: ✭ 55 (-1.79%)
Mutual labels:  reinforcement-learning
Gbrain
GPU Javascript Library for Machine Learning
Stars: ✭ 48 (-14.29%)
Mutual labels:  reinforcement-learning
Notes
The notes for Math, Machine Learning, Deep Learning and Research papers.
Stars: ✭ 53 (-5.36%)
Mutual labels:  reinforcement-learning
Deep traffic
MIT DeepTraffic top 2% solution (75.01 mph) 🚗.
Stars: ✭ 47 (-16.07%)
Mutual labels:  reinforcement-learning
Mujocounity
Reproducing MuJoCo benchmarks in a modern, commercial game/physics engine (Unity + PhysX).
Stars: ✭ 47 (-16.07%)
Mutual labels:  reinforcement-learning
Policy Gradient Methods
Implementation of Algorithms from the Policy Gradient Family. Currently includes: A2C, A3C, DDPG, TD3, SAC
Stars: ✭ 54 (-3.57%)
Mutual labels:  reinforcement-learning
Biped trajectory optimization
Implementing trajectory optimization on bipedal system
Stars: ✭ 45 (-19.64%)
Mutual labels:  reinforcement-learning
Reinforcepy
Collection of reinforcement learners implemented in python. Mainly including DQN and its variants
Stars: ✭ 54 (-3.57%)
Mutual labels:  reinforcement-learning
Deterministic Gail Pytorch
PyTorch implementation of Deterministic Generative Adversarial Imitation Learning (GAIL) for Off Policy learning
Stars: ✭ 44 (-21.43%)
Mutual labels:  reinforcement-learning
Gym Minigrid
Minimalistic gridworld package for OpenAI Gym
Stars: ✭ 1,047 (+1769.64%)
Mutual labels:  reinforcement-learning
Ml Surveys
📋 Survey papers summarizing advances in deep learning, NLP, CV, graphs, reinforcement learning, recommendations, etc.
Stars: ✭ 1,063 (+1798.21%)
Mutual labels:  reinforcement-learning
Demos
Some JavaScript works published as demos, mostly ML or DS
Stars: ✭ 55 (-1.79%)
Mutual labels:  reinforcement-learning
Notebooks
Some notebooks
Stars: ✭ 53 (-5.36%)
Mutual labels:  reinforcement-learning

Reinforcement Learning With TicTacToe

Making computers learn how to play tic-tac-toe.

I started messing around with reinforcement learning when I heard about a Flappy Bird RL project on GitHub. Out of the box, the algorithm took about 7 or 8 hours to train. I figured it could learn faster if multiple instances of the same algorithm spread out over the internet could all work to update the same matrix. So after forking the repo and creating a distributed learning version, I was able to get it to train in about 30 minutes with 8 browser tabs. I found myself wanting to explore reinforcement learning a bit more, and so this little project was born.

So, a general overview. This learning algorithm doesn't know how to play tic-tac-toe. It doesn't know the rules; it doesn't learn the rules; it doesn't even know it's playing against an opponent. All it knows is what the board looks like and what its options are. When it has made a move, it is rewarded or punished based on that move for that particular position. After a few thousand games, it effectively "learns" how to play. The algorithm's "knowledge" is represented in a matrix (called a "policy") where values are assigned to every possible move for each board state the algorithm has encountered.
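
The core bookkeeping can be pictured with a small sketch like the following (illustrative JavaScript only, not the actual code in Q.js; the names and the learning-rate constant are assumptions):

var policy = {};          // e.g. policy['AB--A---B'] = { 0: 0.4, 3: -0.1, ... }
var LEARNING_RATE = 0.1;  // assumed constant

// Nudge the stored value for a (state, move) pair toward the reward observed
// after making that move. A full Q-learning update would also fold in the
// estimated value of the resulting state.
function updatePolicy(state, move, reward) {
    var values = policy[state] || (policy[state] = {});
    var current = values[move] || 0;
    values[move] = current + LEARNING_RATE * (reward - current);
}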

The first two options presented, "Train Locally" and "Train with Distributed Learning", simply refer to where the q-matrix or "policy" is stored. "Train Locally" keeps all the training confined to the browser tab. You can watch it train from nothing (it might take 2 or 3 minutes). "Distributed Learning" trains locally as well, but it also syncs its policy with the server. That shared policy tends to be fairly well trained already and rarely loses.

For the "distributed learning" mode, the learner pushes all policy updates onto a stack and periodically sends the batch of updated values to the server. Each update consists of a state (the board), an action (a move), and a value for that state-action pair (higher values = better moves). Currently, the server goes through the updates, writing over the state-action pairs in its own canonical policy, and then sends back an entire copy of the new-and-improved policy, with which the client replaces its own.
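
In rough terms, the client side of that exchange might look something like this (a sketch only; the endpoint name, flush interval, and payload shape are assumptions, not the project's actual API):

var policy = {};           // local policy table, as in the earlier sketch
var pendingUpdates = [];   // the "stack" of updates waiting to be sent

function recordUpdate(state, action, value) {
    pendingUpdates.push({ state: state, action: action, value: value });
}

// Periodically flush the batch and accept the server's canonical policy back.
setInterval(function () {
    if (!pendingUpdates.length) return;
    var batch = pendingUpdates;
    pendingUpdates = [];
    fetch('/policy/updates', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(batch)
    })
    .then(function (res) { return res.json(); })
    .then(function (canonicalPolicy) {
        policy = canonicalPolicy;  // replace the local copy wholesale
    });
}, 5000);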

As you might have guessed, some clobbering takes place on the canonical policy. However, it might be possible to avoid clobbering entirely by passing the server only a state, an action, and a reward, letting the server run its own evaluation function instead of relying on values computed by the clients. This would also require the constants in the evaluation function to match on the client and the server.
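
A sketch of that alternative on the server side (function and field names are illustrative; the constant would have to equal the client's):

var LEARNING_RATE = 0.1;  // must match the client's constant

// Apply one reported (state, action, reward) triple to the canonical policy,
// so the server computes values itself rather than trusting client-sent ones.
function applyClientUpdate(canonicalPolicy, update) {
    var values = canonicalPolicy[update.state] || (canonicalPolicy[update.state] = {});
    var current = values[update.action] || 0;
    values[update.action] = current + LEARNING_RATE * (update.reward - current);
}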

The algorithm trains by playing against a version of itself. The "Smart" version (listed under the "Scores" section) always selects a move its trained policy recommends. The "Kinda Smart" version (a bit of a misnomer) will occasionally select random moves to learn from the outcome. It is this "exploration" that actually allows the algorithm to discover better policies. Because you can play a tic-tac-toe game perfectly and not win, the best measure of a policy's efficacy is how many wins it has given up to its opponent. Because the "Kinda Smart" version makes random moves for some percentage of its total plays, it happens to give up quite a few wins to its opponent even though it shares the exact same q-matrix with the "Smart" version.
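
That exploration strategy is essentially epsilon-greedy. A sketch of how the "Kinda Smart" player's move selection could work (the exploration rate and function names are assumptions, not the actual Q.js code):

var EXPLORATION_RATE = 0.2;  // fraction of moves chosen at random (assumed)

function chooseMove(policy, state, availableMoves) {
    if (Math.random() < EXPLORATION_RATE) {
        // explore: try any legal move, even one the policy scores poorly
        return availableMoves[Math.floor(Math.random() * availableMoves.length)];
    }
    // exploit: take the move with the highest learned value for this state
    var values = policy[state] || {};
    return availableMoves.reduce(function (best, move) {
        return (values[move] || 0) > (values[best] || 0) ? move : best;
    });
}

The "Smart" player would be the same function with the random branch removed.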

You can pause the training and play against the "Smart" version at any time. After the algorithm moves, it displays its options in the bottom corner of the screen. You can mouse over these options to see the relative favorability of each move.

There were some interesting optimizations I made which seemed to speed up learning quite a bit. The most important was probably the work I did normalizing equivalent board states. For (almost) every possible tic-tac-toe board, there are at least a few other boards that are essentially equivalent: you can take a given board and rotate it up to three times or flip it along the vertical and/or horizontal axes.

For example, this board

 X | O |
-----------
   | X |
-----------
   |   | O

is, for our purposes, the same as

   |   | O              O |   |
-----------            -----------
 O | X |        and       | X | O
-----------            -----------
 X |   |                  |   | X

Making sure these board states were considered equivalent cut down on the amount of memory required to store the policy and, consequently, the amount of time required to generate an effective policy.

To accomplish the normalization, the choosing function turns the board state into a string. The string representation of this board (from the X's point of view):

 X | O |
-----------
   | X |
-----------
   |   | O

is AB--A---B. ("A"s represent a player's own symbol, while "B"s represent the opponent's. From the O's point of view, the same board would be represented as BA--B---A.) It then checks the matrix for any equivalent permutations by rotating and flipping the board; the equivalent boards shown above, for example, would all map to the same entry.
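
The serialization step might look roughly like this (illustrative only; the real function in Q.js may differ):

// Turn the 9 cells (read left to right, top to bottom) into a string of
// 'A' (own symbol), 'B' (opponent's symbol), and '-' (empty).
function serializeBoard(cells, ownSymbol) {
    return cells.map(function (cell) {
        if (cell === '') return '-';
        return cell === ownSymbol ? 'A' : 'B';
    }).join('');
}

// serializeBoard(['X','O','','','X','','','','O'], 'X')  ->  'AB--A---B'
// serializeBoard(['X','O','','','X','','','','O'], 'O')  ->  'BA--B---A'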

If it finds a permutation, it remembers how many rotations and flips it used so that it can apply the same transformations to the move it selects, "translating" the moves, so to speak, between the actual board and the permutation.
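
On the 9-character strings, the rotations and flips reduce to index shuffles, and composing them produces all eight symmetries of the board. A sketch (the index mappings are standard, but the helper names are illustrative, not the actual Q.js code):

var ROTATE = [6, 3, 0, 7, 4, 1, 8, 5, 2];  // 90° clockwise: new cell i takes old cell ROTATE[i]
var FLIP   = [2, 1, 0, 5, 4, 3, 8, 7, 6];  // mirror across the vertical axis

function permute(state, mapping) {
    return mapping.map(function (oldIndex) { return state[oldIndex]; }).join('');
}

// Translating a move back: if the chosen move sits at index i on the rotated
// board, the corresponding cell on the real board is ROTATE[i], applied once
// per rotation used (FLIP is its own inverse, so it works in both directions).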

See it in action at tbaldw.in/tictactoe. Check out the code at github.com/rolyatmax/tictactoe. It's a bit messy in parts (as it has changed tremendously over time), but the meat of the learning algorithm is in public/js/Q.js. For more info about how Q-Learning works, check out the Wikipedia article.

To run on your own:

npm install

cd public && npm install && npm run build

cd .. && npm run start

Point your browser to localhost:8080
