
Thank you. Several people have pointed that out, and I'm probably not reading the right papers. When people introduce a new flavor of adaptive SGD, is it common for them to address how it handles saddles specifically? It is probably just a matter of what manages to bubble up to me rather than what work is actually getting done, but I felt like the non-convergence of Adam got talked about a lot, while I haven't seen people talking as much about how optimizers behave differently on the landscapes we actually observe.


Saddles are a way of conceptualizing high-dimensional optimization problems. If you have a surface in 3 dimensions, you can imagine a saddle as a curve that follows a minimum in at least one dimension.

Another way to conceptualize this is to think of being at the minimum of a parabola in 2 dimensions, but then seeing that you're not at a minimum in a 3rd dimension. Any time you're at a minimum in at least 1 dimension but not in all of them, you're on a saddle.
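
To make that concrete, here is a minimal sketch using the textbook surface f(x, y) = x^2 - y^2 (my choice of example, not something from a particular paper). Its gradient (2x, -2y) vanishes at the origin, and the code estimates the curvature along each axis there with finite differences. The function names are just for illustration.

```python
def f(x, y):
    # Textbook saddle surface: curves upward along x, downward along y.
    return x * x - y * y

def second_derivative(g, t, h=1e-3):
    # Central finite-difference estimate of g''(t).
    return (g(t + h) - 2.0 * g(t) + g(t - h)) / (h * h)

# Curvature of the 1-D slices through the origin along each axis.
curv_x = second_derivative(lambda t: f(t, 0.0), 0.0)
curv_y = second_derivative(lambda t: f(0.0, t), 0.0)

is_min_along_x = curv_x > 0  # the slice along x is a parabola opening upward
is_min_along_y = curv_y > 0  # the slice along y opens downward
is_saddle = is_min_along_x != is_min_along_y

print(curv_x, curv_y, is_saddle)
```

Positive curvature along x and negative curvature along y is exactly the "minimum in one dimension, not in another" picture.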

You can extend this concept to a neural net, which lives in millions of dimensions, undergoing SGD. At the beginning of an optimization run, SGD moves in some direction to minimize a bundled cost, inevitably stumbling into minima in (usually) many dimensions. Subsequent iterations will shift some dimensions out of minima and other dimensions into minima; the net is always living on a saddle during this process.
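
A toy version of that behavior, as a rough sketch: plain (non-stochastic) gradient descent on f(x, y) = x^2 - y^2, started almost exactly on the ridge. This is a two-dimensional stand-in I picked for illustration, with an arbitrary learning rate and step count, so don't read it as a model of a real network.

```python
def grad(x, y):
    # Gradient of f(x, y) = x**2 - y**2.
    return 2.0 * x, -2.0 * y

def descend(x, y, lr=0.1, steps=20):
    # Plain gradient descent; returns every visited point.
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path

# Start far from the minimum in x, but only a hair off the ridge in y.
path = descend(1.0, 1e-6)
x_end, y_end = path[-1]
print(x_end, y_end)
```

After 20 steps x has mostly settled into its minimum, while y has grown but is still tiny, so the iterate is sitting near the saddle with a very small gradient. That is one way "stuck during training" can look from the outside.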

There are many papers that discuss the process in these terms and others that use it implicitly. I wouldn't say it's a "hot area of research" but more of a tool for thinking about these processes and sometimes gaining some insight into why things get stuck during training.



