
cshenton / atari-leaderboard

Licence: MIT license
A leaderboard of human and machine performance on the Arcade Learning Environment (ALE).

Projects that are alternatives of or similar to atari-leaderboard

Pytorch A2c Ppo Acktr Gail
PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
Stars: ✭ 2,632 (+11863.64%)
Mutual labels:  atari, ale
games services
A Flutter plugin to support game center and google play games services.
Stars: ✭ 67 (+204.55%)
Mutual labels:  leaderboard
dqn-pytorch
DQN to play Atari Pong
Stars: ✭ 77 (+250%)
Mutual labels:  atari
cx leaderboard
Elixir library for fast, customizable leaderboards
Stars: ✭ 18 (-18.18%)
Mutual labels:  leaderboard
Provenance
iOS & tvOS multi-emulator frontend, supporting various Atari, Bandai, NEC, Nintendo, Sega, SNK and Sony console systems… Get Started: https://wiki.provenance-emu.com
Stars: ✭ 4,732 (+21409.09%)
Mutual labels:  atari
MSMARCO-MRC-Analysis
Analysis on the MS-MARCO leaderboard regarding the machine reading comprehension task.
Stars: ✭ 20 (-9.09%)
Mutual labels:  leaderboard
Deep-Q-Networks
Implementation of Deep/Double Deep/Dueling Deep Q networks for playing Atari games using Keras and OpenAI gym
Stars: ✭ 38 (+72.73%)
Mutual labels:  atari
RPC-Leaderboard
RPC Dataset Leaderboard
Stars: ✭ 63 (+186.36%)
Mutual labels:  leaderboard
laravel-gamify
Laravel Gamify: Gamification System with Points & Badges support
Stars: ✭ 35 (+59.09%)
Mutual labels:  leaderboard
InstanceRefer
[ICCV 2021] InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
Stars: ✭ 64 (+190.91%)
Mutual labels:  leaderboard
rclc
Rich Context leaderboard competition, including the corpus and current SOTA for required tasks.
Stars: ✭ 20 (-9.09%)
Mutual labels:  leaderboard
Opensource-Contribution-Leaderboard
Open Source project contributors tracking leaderboard built with ❤️ in NodeJS 😉
Stars: ✭ 30 (+36.36%)
Mutual labels:  leaderboard
discord-paginationembed
A pagination utility for MessageEmbed in Discord.JS
Stars: ✭ 93 (+322.73%)
Mutual labels:  leaderboard
ubeswitch
PCB for multisync switch for Atari
Stars: ✭ 20 (-9.09%)
Mutual labels:  atari
gci-leaders
A website showing Google Code-in information 🏆
Stars: ✭ 40 (+81.82%)
Mutual labels:  leaderboard
Galaxy-Attack
An inspiration of the original Atari Space Invaders game built in pygame 👾 🎮
Stars: ✭ 32 (+45.45%)
Mutual labels:  atari
Scavenger
A virtual "scavenger hunt" game for mobile devices using Unity, Azure, and PlayFab
Stars: ✭ 19 (-13.64%)
Mutual labels:  leaderboard
Quizzie
Open Sourced Quiz Portal which can be used for any event / competition with a custom leaderboard.
Stars: ✭ 31 (+40.91%)
Mutual labels:  leaderboard
Highway
✈️~🎢A Java game to drive your truck against other 2 trucks throughout the highway and compete to stay on the top of the leaderboard
Stars: ✭ 13 (-40.91%)
Mutual labels:  leaderboard
TheGame
The platform that MetaGame will be played on aka MetaOS - an open source framework for running decentralized societies. Currently featuring MetaSys, MyMeta Profiles, Dashboard, MetaMenu & Quests
Stars: ✭ 100 (+354.55%)
Mutual labels:  leaderboard

Atari Reinforcement Learning Leaderboard

Any scores out of date? Make a Pull Request.

This is a leaderboard comparing world record human performance to state-of-the-art machine performance in the Arcade Learning Environment (ALE).

| Game | Top Human Score | Top Machine Score | Best | Best Machine | Learning Type | Notes |
|---|---|---|---|---|---|---|
| Alien | 103583 | 9491 | Human | Rainbow | Q-gradient | |
| Amidar | 71529 | 5131 | Human | Rainbow | Q-gradient | |
| Assault | 8647 | 14497 | Machine | A3C | Policy-gradient | |
| Asterix | 1000000 | 428200 | Human | Rainbow | Q-gradient | |
| Asteroids | 57340 | 5093 | Human | A3C | Policy-gradient | * |
| Atlantis | 10604840 | 2311815 | Human | PPO | Policy-gradient | |
| Bank Heist | 45899 | 1611 | Human | Dueling DDQN | Q-gradient | |
| Battlezone | 98000 | 62010 | Human | Rainbow | Q-gradient | |
| Beamrider | 52866 | 26172 | Human | Prioritized DDQN | Q-gradient | 1B |
| Berzerk | 1057940 | 2545 | Human | Rainbow | Q-gradient | |
| Bowling | 279 | 135 | Human | HyperNEAT | Genetic Policy | J |
| Boxing | 99 | 99 | Draw | Rainbow, ACER | Q-gradient, Policy-gradient | |
| Breakout | 864 | 766 | Human | A3C | Policy-gradient | |
| Centipede | 453916 | 25275 | Human | HyperNEAT | Genetic Policy | |
| Chopper Command | 999999 | 16654 | Human | Rainbow | Q-gradient | |
| Crazy Climber | 219900 | 183135 | Human | Prioritized DDQN | Q-gradient | |
| Defender | 5443150 | 233021 | Human | A3C | Policy-gradient | N |
| Demon Attack | 100100 | 115201 | Machine | A3C | Policy-gradient | + |
| Enduro | 1666 | 2260 | Machine | Distribution DQN | Q-gradient | |
| Fishing Derby | 51 | 46 | Human | Dueling DDQN | Q-gradient | |
| Freeway | 38 | 34 | Human | Rainbow | Q-gradient | 1B |
| Frostbite | 248460 | 9590 | Human | Rainbow | Q-gradient | |
| Gopher | 30240 | 70354 | Machine | Rainbow | Q-gradient | |
| Gravitar | 39100 | 1419 | Human | Rainbow | Q-gradient | |
| HERO | 257310 | 55887 | Human | Rainbow | Q-gradient | J |
| Ice Hockey | 25 | 10 | Human | HyperNEAT | Genetic Policy | |
| Kangaroo | 1424600 | 14854 | Human | Dueling DDQN | Q-gradient | N |
| Krull | 104100 | 12601 | Human | HyperNEAT | Genetic Policy | N |
| Kung Fu Master | 79360 | 52181 | Human | Rainbow | Q-gradient | |
| Montezumas Revenge | 400000 | 384 | Human | Rainbow | Q-gradient | |
| Ms Pacman | 211480 | 6283 | Human | Dueling DDQN | Q-gradient | J |
| Name This Game | 21210 | 13439 | Human | Prioritized DDQN | Q-gradient | |
| Phoenix | 251180 | 108528 | Human | Rainbow | Q-gradient | |
| Pitfall | 114000 | 0 | Human | Several | Q-gradient | |
| Pong | 21 | 21 | Draw | Several | Several | E |
| Private Eye | 101800 | 15172 | Human | Distribution DQN | Q-gradient | ** |
| Qbert | 2400000 | 33817 | Human | Rainbow | Q-gradient | N |
| Road Runner | 210200 | 73949 | Human | A3C | Policy-gradient | |
| Robot Tank | 68 | 65 | Human | Dueling DDQN | Q-gradient | |
| Seaquest | 294940 | 50254 | Human | Dueling DDQN | Q-gradient | |
| Skiing | -3272 | -6522 | Human | Vanilla GA | Genetic Policy | |
| Space Invaders | 43710 | 23864 | Human | A3C | Policy-gradient | 1B |
| Star Gunner | 77400 | 164766 | Machine | A3C | Policy-gradient | N |
| Time Pilot | 34400 | 27202 | Human | A3C | Policy-gradient | |
| Tutankham | 2026 | 280 | Human | ACER | Policy-gradient | |
| Venture | 38900 | 1107 | Human | Distribution DQN | Q-gradient | N |
| Video Pinball | 3523988 | 533936 | Human | Rainbow | Q-gradient | 1B |
| Wizard of Wor | 129500 | 18082 | Human | A3C | Policy-gradient | |
| Yars Revenge | 2011099 | 102557 | Human | Rainbow | Q-gradient | ++ |
| Zaxxon | 83700 | 24622 | Human | A3C | Q-gradient | |
  • N: NTSC, no emulator results available
  • J: Score from jvgs.net
  • E: Game is so easy there's no world record category
  • 1B: Game 1, Difficulty B
  • *: Game 6, Difficulty B
  • +: Game 7, Difficulty B
  • **: Game 1, Points
  • ++: Game 2, Difficulty A

What's the point of this?

I decided to put this together after noticing two trends in reinforcement learning papers:

  • Not comparing to the state of the art.
  • Comparing an algorithm with thousands of hours of playtime to a human who played for a few hours.

Respectively, these make it hard to see the relative progress of the field from paper to paper, and the absolute progress compared to human-level game playing.

Though RL papers routinely quote >100% normalized human performance, the reality is that machine learning algorithms beat humans on only 5 of the 50 games listed here (with two draws), and humans hold a substantial lead on most of the rest. We have a long way to go.
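
For context, the ">100%" figures quoted in papers are usually the human-normalized score from the DQN line of work, which is measured against a professional tester's baseline rather than the world records used in this leaderboard. A minimal sketch, with purely illustrative numbers:

```python
# Human-normalized score as commonly reported in Atari RL papers:
#   normalized = (agent - random) / (human - random)
# "human" here is a professional tester's average score, not a world record,
# which is why >100% can coexist with a large gap to the records above.
def human_normalized(agent_score, random_score, human_score):
    return (agent_score - random_score) / (human_score - random_score)

# Illustrative numbers only: an agent well past the tester baseline (>100%)
# can still sit far below the human world record for the same game.
print(human_normalized(agent_score=400.0, random_score=1.7, human_score=30.5))  # ~13.8, i.e. ~1380%
```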

Performance Among Machines

When we exclude human scores, the per-algorithm win counts are as follows (a two-way tie counts as a win for both algorithms; a tie of three or more counts for none); a short tallying sketch follows the table:

| Algorithm | Type | Wins |
|---|---|---|
| Rainbow | Q-gradient | 18 |
| A3C (FF and LSTM) | Policy-gradient | 11 |
| Dueling DDQN | Q-gradient | 6 |
| HyperNEAT | Genetic Policy | 4 |
| Distribution DQN | Q-gradient | 3 |
| Prioritized DDQN | Q-gradient | 3 |
| ACER | Policy-gradient | 2 |
| PPO | Policy-gradient | 1 |
| Vanilla GA | Genetic Policy | 1 |
| Noisy DQN | Q-gradient | 0 |
| Vanilla ES | Genetic Policy | 0 |
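
The tally above follows the tie rule stated in words; here is a rough sketch of the same bookkeeping, where `rows` is a hypothetical stand-in for the Best Machine column rather than code from this repository:

```python
# Tie rule from the text: a two-way tie credits both algorithms with a win,
# while a tie of three or more (listed as "Several") credits none.
from collections import Counter

rows = [
    ("Assault", ["A3C"]),               # outright machine-best score
    ("Boxing",  ["Rainbow", "ACER"]),   # two-way tie: both credited
    ("Pong",    ["Several"]),           # 3+ way tie: nobody credited
]

wins = Counter()
for game, algos in rows:
    if "Several" in algos or len(algos) >= 3:
        continue                        # unfriendly tie, no credit
    for algo in algos:
        wins[algo] += 1                 # solo win or friendly two-way tie

print(wins.most_common())               # [('A3C', 1), ('Rainbow', 1), ('ACER', 1)]
```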

Methodology

Human Scores

Since the ALE uses the Stella Atari emulator, the Top Human Score is the top human score achieved on an emulator. Atari (and other) game releases tend to vary across regions, so this is the only way to ensure that both human and machine have, for example, equal access to game-breaking bugs.

Where possible, scores are taken from Twin Galaxies, which is the Guinness source for video game world records; otherwise, links are provided to the score sources.

Machine Scores

A valid machine score is one achieved by a reinforcement learning algorithm trained directly on pixels and raw rewards, for example one that can be trained against common ALE wrappers / forks such as gym or xitari. This means that algorithms like this one, which use hand-engineered intermediate rewards, do not qualify.
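
As a rough illustration of that interface (not code from this repository, and assuming the classic pre-0.26 gym Atari API with ROMs installed), an eligible agent only ever sees the raw frames and the raw score deltas:

```python
# Minimal sketch of the "pixels and raw rewards only" setup, using the
# classic gym API (4-tuple step returns); the env id and random policy
# are illustrative placeholders.
import gym

env = gym.make("BreakoutNoFrameskip-v4")   # raw frames, no reward shaping
obs = env.reset()                          # obs: 210x160x3 uint8 pixel array

episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()     # a real agent would act on pixels alone
    obs, reward, done, info = env.step(action)
    episode_return += reward               # raw game score delta, no hand-engineered bonus

print("episode score:", episode_return)
```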

Reference papers vary in:

  • Start type (no-op, random-op, human-op)
  • Number of test trials (from 30 to 200)

I take the approach here of favouring no-op starts over random ones (they usually have higher scores anyway), and treating all sample sizes equally.
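
A hedged sketch of that evaluation protocol, again assuming the classic gym API and a placeholder `agent.act` method; the specific constants (up to 30 no-ops, 30 trials) are common defaults in the papers rather than a rule of this leaderboard:

```python
# No-op start evaluation: each test episode begins with a random number of
# no-op actions (ALE action 0) before the agent takes over; the reported
# score is the mean over a fixed number of trials.
import random
import gym

NOOP_ACTION = 0
MAX_NOOPS = 30
N_TRIALS = 30            # papers use anywhere from 30 to 200 trials

def evaluate(agent, env_id="BreakoutNoFrameskip-v4", trials=N_TRIALS):
    env = gym.make(env_id)
    scores = []
    for _ in range(trials):
        obs = env.reset()
        for _ in range(random.randint(1, MAX_NOOPS)):   # random-length no-op start
            obs, _, done, _ = env.step(NOOP_ACTION)
            if done:
                obs = env.reset()
        total, done = 0.0, False
        while not done:
            obs, reward, done, _ = env.step(agent.act(obs))
            total += reward
        scores.append(total)
    return sum(scores) / len(scores)
```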

References
