Hacker News
Evolution Strategies as a Scalable Alternative to Reinforcement Learning (blog.openai.com)
118 points by melqdusy on March 25, 2017 | 25 comments


I feel like this is more an argument against the efficiency of flat (non-hierarchical, non-model-based) Deep Reinforcement Learning than an argument for Genetic/Evolutionary Algorithms. As in, if you're not solving a task more efficiently than Evolutionary Algorithms, or a task more difficult than they can handle, you're not doing something right. Similar to how Deep PCA and Deep Random Forests blew ConvNets out of the water on basic benchmarks like MNIST but couldn't compete on larger datasets: that comparison indicated the kind of proof we should require before getting excited about a new technique.


Not sure where you're getting this from, but CNNs are definitely still state of the art on MNIST. Papers often cite outdated CNN numbers; in fact the best published CNN accuracy numbers are probably lower than they could be - vanilla MNIST for supervised learning is a pretty useless benchmark for computer vision researchers now.


My bad, I should have been more specific. The Deep PCA paper shows superior performance not on baseline MNIST, but on the MNIST variations.

I also agree that vanilla MNIST is pretty useless for Computer Vision researchers and was trying (awkwardly) to support that idea by showing how these other non-Deep-Learning techniques performed equally well.


The Deep PCA paper shows no such thing. They've cherry-picked bad CNN baselines. The red flag for me is that the "state of the art" methods they compare to achieve >1% test error on vanilla MNIST, which is totally wrong since CNNs routinely achieve <0.5% test error (even by the standards of 2014, when the paper was published).



I think this is the Deep Forest paper you meant:

http://www.cv-foundation.org/openaccess/content_iccv_2015/pa...

That's Deep Neural Decision Forests; they benchmark against MNIST and obtain state of the art results.

The other paper is about improving decision forests, although it also uses MNIST as a benchmark.

Apologies if I'm wrong and you did mean the Zhou & Feng paper.


That one is good too!


Policy search is very effective. This is not a new finding, contrary to what the article's abstract seems to suggest.

Variants of ES have been used for years (http://dl.acm.org/citation.cfm?id=1645634). The article seems to ignore almost all the work in robotics, e.g. from Jan Peters' research group (http://www.jan-peters.net/ -> publications).

The good thing is that we have one more paper that justifies this research direction and a little bit more public attention.


Great write up.

I had a similar example in my 20-year-old book 'C++ Power Paradigms', in which I used a genetic algorithm to train the weights in a recurrent network. As a performance hack, weights were initially represented by just a few bits to keep the search space small, and the bit length would gradually be increased as evolution progressed. I never got this to scale past small networks, but I have thought of revisiting my old code since I have a lot more computing power available now than I did 20 years ago.
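For illustration, here's a toy sketch of that bit-widening scheme (a hypothetical reconstruction, not the book's actual code): each weight lives in a fixed-width chunk of a bit string, and precision is raised mid-run by appending a low-order zero bit to every chunk, which leaves the decoded values unchanged. The target vector is a made-up stand-in for whatever "good" weights the evolved network should recover.

```python
import random

random.seed(1)

def decode(genome, bits):
    # each bits-wide chunk encodes one weight in [-1, 1)
    return [2 * int(genome[i:i + bits], 2) / (1 << bits) - 1
            for i in range(0, len(genome), bits)]

def mutate(genome, rate=0.05):
    return ''.join(str(1 - int(b)) if random.random() < rate else b
                   for b in genome)

def widen(genome, bits):
    # append a low-order zero bit to every chunk: the integer doubles and
    # so does the scale, so each decoded weight keeps its exact value
    return ''.join(genome[i:i + bits] + '0'
                   for i in range(0, len(genome), bits))

target = [0.55, -0.21, 0.8]         # hypothetical "good" weights to recover
def fitness(genome, bits):
    return -sum((w - t) ** 2 for w, t in zip(decode(genome, bits), target))

bits = 3                            # start coarse: small search space
pop = [''.join(random.choice('01') for _ in range(bits * len(target)))
       for _ in range(30)]
for gen in range(60):
    pop.sort(key=lambda g: fitness(g, bits), reverse=True)
    pop = pop[:10] + [mutate(random.choice(pop[:10])) for _ in range(20)]
    if gen in (20, 40):             # periodically raise the precision
        pop = [widen(g, bits) for g in pop]
        bits += 1
best = max(pop, key=lambda g: fitness(g, bits))
```

The hack pays off because early generations search only 2^9 genomes rather than 2^15; extra precision is added once the coarse solution is already in the right basin.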


I'm not an expert, but it seems to me that the proposed approach doesn't compute the value function. By optimizing the policy function directly, aren't we losing some key ingredient for generalization? I mean, could this algorithm be used in a fully non-deterministic environment? Like human vs. machine?


I am not sure at the moment. I guess the main problem of ES in a complex non-deterministic environment is that it would average over multiple local optima that occur in one generation, which would result in a non-optimal solution. There are policy search methods that address this problem (e.g. VIPS https://scholar.google.com/citations?view_op=view_citation&h... , hierarchical REPS https://scholar.google.com/citations?view_op=view_citation&h...).

Temporal credit assignment is another problem: the policy is updated after a full episode and there is no way to use the information which step was responsible for which reward. Policy search usually works well if the value function is very complex and the optimal policy is simple.


It does not, at least not explicitly, which is a distinguishing feature of policy optimization algorithms. I don't think there's anything about this that makes it worse for non-deterministic problems. Note that the policy can still be stochastic, if desired (not sure if that's a good idea in general). A nice feature of black box policy optimization is that it's almost trivial to apply to either stochastic or deterministic problems.


> we were able to solve one of the hardest MuJoCo tasks (a 3D humanoid) using 1,440 CPUs across 80 machines in only 10 minutes. As a comparison, in a typical setting 32 A3C workers on one machine would solve this task in about 10 hours.

So using 80 times more machines makes you 60 times faster (assuming those are the same machines) "while performing better on 23 games tested, and worse on 28"[0]?

[0] The paper for this blog post https://arxiv.org/pdf/1703.03864.pdf

Asynchronous advantage actor critic (A3C) https://arxiv.org/pdf/1602.01783.pdf


"A Field Guide to Genetic Programming" (https://www.amazon.com/Field-Guide-Genetic-Programming/dp/14...) is one of the better books I've read on the matter (but I've yet to read anything by Koza). W. Langdon's webpage (http://www0.cs.ucl.ac.uk/staff/W.Langdon/) has a lot of great information.


Evolution Strategies is not an example of Genetic Programming, and that book by Langdon doesn't cover anything directly relevant to the linked article.


You're right, I'm sorry.


I have read and own Genetic Programming III by John Koza https://www.amazon.com/Genetic-Programming-III-Darwinian-Inv... and the best part about it was that it revisited problems that had been only superficially explored with GP a decade before. The increased computing power available allowed for multiple runs and provided insights into what parameters to tune and gave hard numbers on how much computation was needed to solve various classes of problems.

In the end, it doesn't matter that much which approach is taken because it's all classification problems. We just need the solution matrix, and ideally what computation went into solving it. I feel that this simple fact is lost amidst the complexity of how ML is taught today.

ML isn’t accelerating because of better code or research breakthroughs either. It’s happening because the big CPU manufacturers didn’t do anything for 20 years and the GPU manufacturers ate their lunch. ML is straightforward, even trivial in some cases, with effectively unlimited cores and bandwidth. We’re just rediscovering parallelization algorithms that were well known in functional programming generations ago. These discoveries are inevitable in a suitable playground.

I used to have this fantasy that I would get ahead of the curve enough to be able to dabble in the last human endeavor but I'm beginning to realize that that's probably never going to happen. Machines will soon beat humans in pretty much every category, and not because someone figures out how to make it all work, but because there simply isn't enough time to stop it now. There are a dozen teams around the world racing to solve any problem and anyone’s odds of being first are perhaps 10% at best. Compounded with darwinian capitalism, the risk/reward equation is headed towards infinity so fast that it’s looking like the smartest move is not to play.

Barring a dystopian future or cataclysm, I give us 10 years, certainly no more than 20, before computers can do anything people can do, at least economically. And the really eerie thing is that that won’t be the most impressive thing happening, because kids will know it’s all just hill climbing and throwing hardware at problems. It will be all the other associated technologies that come about as people abandon the old hard ways of doing things.


More relevant to this is the Essentials of Metaheuristics: https://cs.gmu.edu/~sean/book/metaheuristics/ (There is a free pdf download on that page.)


Reading the code, I don't get: 1. How to model the different actions

2. How to tie future rewards to the action at this timestep...

Can anyone help with a better nitty-gritty explanation?


The example code inline in the article just illustrates the basic idea of Evolution Strategies (ES), not their new work in applying ES.

The behavior of agents is determined by a "policy function". This function takes in inputs (e.g. what the agent sees) and outputs actions (e.g. what the agent does). The policy function has a set of internal parameters that determines the precise mapping from inputs to outputs.

In their work, they used a neural network as the policy function. The parameters are just all the weights of the network.

In a simple version, you start with some random weights for the NN. Then you make many copies of the network, each with a slight random variation made to the weights. For each of these altered networks, you use them to control an agent for a while, and see how well the agent performs during that trial period. Based on how well the different variations do during their trial runs, you adjust the weights of the network a small amount. You adjust the weights to be more similar to the variations that did well. Then you repeat the process indefinitely (generate new variations, test them, etc.).
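The loop described above can be sketched in a few lines of NumPy. This is a toy illustration of the basic ES idea only, not the paper's distributed implementation; the quadratic reward function here is a made-up stand-in for "run the agent for a trial and score it":

```python
import numpy as np

np.random.seed(0)

def evolution_strategy(reward_fn, dim, npop=50, sigma=0.1, alpha=0.01, iters=300):
    w = np.random.randn(dim)                    # initial (random) weights
    for _ in range(iters):
        noise = np.random.randn(npop, dim)      # one perturbation per copy
        rewards = np.array([reward_fn(w + sigma * n) for n in noise])
        if rewards.std() < 1e-8:                # all copies tied; nothing to learn
            continue
        adv = (rewards - rewards.mean()) / rewards.std()
        # move the weights toward the perturbations that scored well
        w += alpha / (npop * sigma) * noise.T @ adv
    return w

# toy stand-in for an agent's trial run: reward peaks when the weights
# match a hidden target vector
target = np.array([0.5, -0.2, 1.0])
solved = evolution_strategy(lambda w: -np.sum((w - target) ** 2), dim=3)
```

In the paper's setting, reward_fn would be a full episode rollout of the policy network, and the npop perturbed copies would be evaluated in parallel across workers, which is where the scalability comes from.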


Very good, sir! Makes more sense now! Thank you.


This is a fantastic summary of both RL and ES (also known as genetic algorithms). Kudos to the authors!


While they're in the same general family of black box optimization algorithms, ES and GAs are not really the same, and in fact were developed independently. The historical roots of both are actually quite interesting.


Thanks for correcting me, I appreciate it! This is a nice summary of differences: [0].

[0] http://stackoverflow.com/questions/7787232/difference-betwee...


Could you point me to some sources on their differing origins? I would be most interested to read them.



