Applied Optimization [Featured]

For new readers and those who request to be added as a friend (好友), please read my announcement (公告) first.

Solving optimization problems algorithmically on a computer has been a major subject of computational research ever since computers became prevalent some 50 years ago. The most commonly used method is known as “gradient descent”. Colloquially, this can be described as “skiing downhill in a fog”. You cannot see or know the location of the base lodge (the global minimum). But you can sense the direction of descent at your current local position. So you follow this direction, hoping to eventually reach the base (the globally lowest point). Of course, this is not always successful. You may become trapped at a relative minimum with nowhere to go but up. You may waste a great deal of effort zigzagging down a narrow ditch, etc. Thus, optimization via gradient descent is ages old, and its difficulties are also well known. When the gradient information is noisy, the difficulties multiply. The classic reference is of course the Robbins–Monro stochastic approximation paper of 1951. The key equation is

next point of exploration = current point + stepsize * noisy gradient information
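The update equation above can be sketched in a few lines of code. This is my own illustration, not from the talk: the objective is f(x) = x², whose true gradient is 2x, and the observed gradient is corrupted by Gaussian noise; the function name and constants are my own.

```python
import random

def noisy_descent(x0=5.0, stepsize=0.1, steps=200, seed=0):
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        noisy_grad = 2 * x + rng.gauss(0, 1)  # true gradient plus noise
        x = x - stepsize * noisy_grad          # move against the gradient ("downhill")
    return x
```

With a constant step size, the iterate gets near the minimum at x = 0 but keeps rattling around it at a scale set by the noise and the step size.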

RM's point is that, in order to prove convergence, the step size must be continuously decreased to zero, in proportion to 1/t where t is the iteration number. The intuition behind this is well understood. Initially, when you are far away from the global minimum, the local gradient value is usually large compared to the noise. You want to follow this direction for improvement. In this situation you are in the so-called “bias phase” of the optimization: you want to follow the gradient information as much as possible, even though it is corrupted by noise. Hence the step size should be large. As time goes on, you get close to the true minimum. The gradient becomes small and near zero, and the noise is now often larger than the gradient value. In this regime, known as the “variance phase”, you don’t want to pay too much attention to the observed gradient value. Hence the step size should be smaller. RM proved that, under suitable assumptions about the surface (terrain) being optimized, using step size 1/t, where t = 1, 2, 3, … is the iteration number, guarantees convergence to the minimum. However, it is also well known that this condition results in extremely slow convergence, or sometimes no convergence at all, in practice.
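The classic 1/t schedule can be sketched as follows (my own minimal illustration, again on f(x) = x² with a noisy gradient; constants are my own):

```python
import random

def robbins_monro(x0=5.0, c=0.5, steps=2000, seed=0):
    rng = random.Random(seed)
    x = x0
    for t in range(1, steps + 1):
        noisy_grad = 2 * x + rng.gauss(0, 1)  # true gradient plus noise
        x -= (c / t) * noisy_grad             # step size a_t = c/t shrinks to zero
    return x
```

Because the step size shrinks, the iterate settles down near x = 0 instead of rattling around it, but the later steps are so small that progress becomes very slow.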

Thus the issues of applied optimization concern what we should do about

(1) the schedule for decreasing the step size, and

(2) how to sample the noisy gradient information.

On (1), the recommendation is: DON'T decrease the step size until you see practically no more improvement in performance. Then hold the new, smaller step size until the next point of no improvement. The intuition here is also clear: you don't want to decrease the step size during the “bias” phase, when the gradient information (even when noisy) is much greater than the noise. They call this modification “step-wise decrease”, as opposed to decreasing the step size at every iteration as the theory recommends.

On (2), they recommend taking a batch of noisy gradient samples and averaging them. The batch size, b, is a tuning parameter. The intuitive reason is again obvious: averaging b independent samples reduces the variance of the gradient estimate by a factor of b.
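The variance reduction from batch averaging is easy to check numerically. In this sketch (names and constants are my own), a single noisy gradient sample has noise variance about 1, while the average of a batch of b = 16 samples has variance about 1/16:

```python
import random
import statistics

def noisy_grad(x, rng):
    return 2 * x + rng.gauss(0, 1)   # true gradient of x**2 plus unit-variance noise

def batch_grad(x, b, rng):
    return sum(noisy_grad(x, rng) for _ in range(b)) / b  # average of b samples

rng = random.Random(0)
single = [noisy_grad(1.0, rng) for _ in range(10000)]
batched = [batch_grad(1.0, 16, rng) for _ in range(10000)]
print(statistics.variance(single))   # roughly 1
print(statistics.variance(batched))  # roughly 1/16
```

The cost, of course, is b gradient evaluations per step, which is why b is a tuning parameter.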

In very simple cases, people have been able to prove convergence with modifications (1) and (2), and there is experimental evidence of faster convergence in practical cases. This is the essence of a technical talk I recently listened to, with the title

A Statistical Lens into Applied Optimization

given by Professor Sham Kakade of the University of Washington. You can find more details by searching for the speaker’s name.

http://blog.sciencenet.cn/blog-1565-1177304.html

何毓琦 (Yu-Chi Ho)

GMT+8, 2020-5-27 11:15