Basics of Unconstrained Optimization
Tags: optimization numerical-methods
Introduction
Optimization is an essential tool in decision science, and it is widely applicable to many fields thanks to its application-agnostic definition. Optimization is integral to humans and their understanding of the world in many ways. For example, investors seek to create portfolios that achieve high rates of return, manufacturers and civil engineers aim for maximum efficiency in their design, and engineers adjust parameters to optimize the performance of their algorithms.
In this post, we will (i) cover the fundamentals of unconstrained optimization of smooth functions, (ii) formulate and implement a few popular optimization algorithms in Python, and (iii) end with a comparative analysis and discussion. The ideas and algorithms described in this post are highly relevant to the broader world of numerical methods.
Imports
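The original import cell is not reproduced here; a minimal set of imports that the code sketches in this post assume is:

```python
# Minimal imports assumed by the code sketches below.
import numpy as np
import matplotlib.pyplot as plt
```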
Formulation
Optimization is the maximization or minimization of an objective function $f$, potentially subject to constraints $c$ on some variables $x$. A mathematical convenience that arises is that to maximize $f$ is to minimize $-f$ and vice versa. Throughout this post, we will focus on multivariate minimization.
Unconstrained optimization algorithms either require the user to specify the starting point $x_0$ based on some prior knowledge, or choose $x_0$ by a systematic approach. These algorithms follow the general pseudocode given below:
- Begin at $x_0$
- Generate a sequence of iterates $\{x_k\}_{k=0}^\infty$
- Use information about $f$ at $x_k$ and possibly at earlier iterates $x_0, \ldots, x_{k-1}$ to decide on $x_{k+1}$
- Find an $x_{k+1}$ such that $f(x_{k+1}) < f(x_k)$
- Repeat
- Terminate when no more progress can be made or a solution point has been approximated with sufficient accuracy
There are two main strategies for moving from the current point $x_k$ to a new iterate $x_{k+1}$ in an optimization algorithm: line search and trust region. They differ in the order in which they choose the direction and distance of the move to the next iterate.
Line Search
The algorithm chooses a direction $p_k$ and searches along it for a new iterate with a lower function value. The distance to move along $p_k$ is found by solving for the step length $\alpha$ in the following minimization subproblem:
\[\underset{\alpha > 0}{\min} f(x_k + \alpha p_k)\]Solving this exactly would derive the maximum benefit from the direction $p_k$, but an exact minimization is expensive and usually unnecessary. Instead, the algorithm can generate a limited number of trial step lengths until it finds one that loosely approximates the minimum in each step.
Trust Region
The algorithm uses information gathered about $f$ to construct a model function $m_k$ that behaves similarly to $f$ near the current point $x_k$. However, $m_k$ might not approximate $f$ well when $x$ is far from $x_k$, so we restrict the search to some region around $x_k$. The candidate step $p$ can then be found by solving the following minimization subproblem:
\[\underset{p}{\min} m_k(x_k + p) \quad \text{where } x_k + p \text{ lies inside the trust region}\]The algorithms we will cover here will employ the line search strategy. Before introducing and implementing these algorithms, let’s first take a look at line search conditions and a related routine for picking a reasonable step length $\alpha$ at each time step. We will use this same routine for each line search algorithm introduced later.
Step Length
There is a fundamental tradeoff when choosing the step length $\alpha_k$: we want a substantial reduction of $f$, yet we cannot afford to spend too much time making the choice. In general, it is very expensive to identify the $\alpha$ that would yield a global or even a local minimizer, since the cost grows with the number of function evaluations required. Instead of an exact line search, most methods therefore use a bracketing phase to find a reasonable interval and an interpolation phase to compute a good step length within this interval. Due to this inexact nature, line search algorithms rely on various termination conditions, discussed next.
Wolfe Conditions
Sufficient Decrease Condition
The sufficient decrease condition states that a reasonable step length $\alpha_k$ should yield a sufficient decrease in $f$, as expressed by the inequality:
\[f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \nabla f_k^T p_k \quad \text{for some constant } c_1 \in (0, 1)\]where $c_1$ is chosen to be quite small, say $c_1 = 10^{-4}$, in practice.
Curvature Condition
The sufficient decrease condition is not enough by itself to ensure that the algorithm makes reasonable progress as it is satisfied for all sufficiently small values of $\alpha$ (see figure below). Therefore, the curvature condition rules out short step lengths by stating:
\[|\nabla f(x_k + \alpha_k p_k)^T p_k| \leq c_2 |\nabla f_k^T p_k| \quad \text{for some constant } c_2 \in (c_1, 1)\]where typical values of $c_2$ are $c_2 = 0.9$ when the search direction $p_k$ is chosen by a Newton or quasi-Newton method, and $c_2 = 0.1$ when $p_k$ is obtained from a nonlinear conjugate gradient method.
When both the sufficient decrease and curvature conditions are satisfied, the strong Wolfe conditions are said to hold. The figure below, taken from [1], illustrates this.
There is a one-dimensional search procedure that is guaranteed to find a step length satisfying the strong Wolfe conditions for any parameters $c_1$ and $c_2$ such that $0 < c_1 < c_2 < 1$. The procedure has two stages:
- The first stage begins with a trial estimate $\alpha_1$ and keeps increasing until an acceptable step length or interval for bracketing is found.
- The second stage employs interpolation and successively decreases the size of the interval until an acceptable step length is identified.
Let’s now implement this procedure alongside its prerequisites (e.g. line search conditions) ahead of time as it will come in handy later.
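The original listing is not reproduced here; below is a sketch of such a routine, loosely following Algorithms 3.5 and 3.6 of [1]. The identifiers (line_search, zoom) and default parameters are illustrative choices, the Wolfe conditions are checked inline, and the interpolation phase is replaced by simple bisection to keep the sketch short.

```python
def line_search(f, grad_f, x, p, c1=1e-4, c2=0.9, alpha_max=10.0, max_iters=50):
    """Find a step length along the descent direction p that satisfies the
    strong Wolfe conditions (bracketing phase followed by a zoom phase)."""
    phi = lambda a: f(x + a * p)            # phi(alpha) = f(x_k + alpha * p_k)
    dphi = lambda a: grad_f(x + a * p) @ p  # phi'(alpha)
    phi0, dphi0 = phi(0.0), dphi(0.0)

    def zoom(lo, hi):
        # Shrink the bracketing interval until an acceptable step is found.
        a = 0.5 * (lo + hi)
        for _ in range(max_iters):
            a = 0.5 * (lo + hi)             # bisection in place of interpolation
            if phi(a) > phi0 + c1 * a * dphi0 or phi(a) >= phi(lo):
                hi = a
            else:
                if abs(dphi(a)) <= -c2 * dphi0:   # strong curvature condition
                    return a
                if dphi(a) * (hi - lo) >= 0:
                    hi = lo
                lo = a
        return a

    a_prev, a_curr = 0.0, 1.0               # start the bracketing phase at alpha = 1
    for i in range(max_iters):
        phi_curr = phi(a_curr)
        # Sufficient decrease violated (or no improvement): bracket found.
        if phi_curr > phi0 + c1 * a_curr * dphi0 or (i > 0 and phi_curr >= phi(a_prev)):
            return zoom(a_prev, a_curr)
        dphi_curr = dphi(a_curr)
        if abs(dphi_curr) <= -c2 * dphi0:   # strong Wolfe conditions hold
            return a_curr
        if dphi_curr >= 0:                  # the minimum has been bracketed
            return zoom(a_curr, a_prev)
        a_prev, a_curr = a_curr, min(2.0 * a_curr, alpha_max)
    return a_curr
```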
The following error calculation code will also come in handy while measuring the performance of optimization algorithms.
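The error-measurement code is likewise not reproduced; a simple stand-in, assuming we measure the Euclidean distance of an iterate to the known minimizer, could be:

```python
def absolute_error(x, x_star):
    # Distance between the current iterate and the known minimizer x_star.
    return np.linalg.norm(np.asarray(x) - np.asarray(x_star))
```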
Algorithms
Steepest Descent Method
The steepest descent direction, $-\nabla f_k$, is the direction along which $f$ decreases most rapidly and is consequently the most obvious choice for the search direction of a line search strategy. To prove this, let’s introduce Taylor’s Theorem.
The Taylor series, and subsequently Taylor’s Theorem, is one of the most powerful tools mathematics has to offer for approximating a function. For a function of several variables, we can state it as follows: $ \DeclareMathOperator{\D}{\mathop{\: \mathrm{d}}} % differential $
\[\begin{align*} &\text{If } f: \mathbb{R}^n \to \mathbb{R} \text{ is continuously differentiable and } p \in \mathbb{R}^n \text{:} \\ &f(x + p) = f(x) + \nabla f (x + tp)^T p \quad \text{for some } t \in (0, 1) \\ &\text{If } f \text{ is twice continuously differentiable:} \\ &\nabla f(x + p) = \nabla f(x) + \int_{0}^{1} \nabla^2 f(x + tp)p \D{t} \quad \text{and} \\ &f(x + p) = f(x) + \nabla f(x)^T p + \frac{1}{2} p^T \nabla^2 f(x + tp)p \quad \text{for some } t \in (0, 1) \\ \end{align*}\]Taylor’s Theorem tells us that for any search direction $p$ and step length $\alpha$, we have:
\[f(x_k + \alpha p) = f(x_k) + \alpha p^T \nabla f_k + \frac{1}{2} \alpha^2 p^T \nabla^2 f(x_k + tp)p \quad \text{for some } t \in (0, \alpha)\]The rate of change in $f$ along direction $p$ at $x_k$ then becomes the coefficient of $\alpha$, namely $p^T \nabla f_k$. The unit direction $p$ of the most rapid decrease can be found by solving the following minimization subproblem:
\[\begin{align*} &\underset{p}{\min} p^T \nabla f_k \quad \text{subject to } ||p|| = 1 \\ & = \underset{p}{\min} ||p|| \, ||\nabla f_k|| \cos \theta \quad \text{by } a \cdot b = ||a|| \, ||b|| \cos \theta \\ & = \underset{p}{\min} ||\nabla f_k|| \cos \theta \quad \text{by } ||p|| = 1 \\ \end{align*}\]The minimizer is attained when $\cos \theta = -1$, and $\theta$ here denotes the angle between $p$ and $\nabla f_k$. This yields:
\[\underset{p}{\arg\min} \: p^T \nabla f_k = - \frac{\nabla f_k}{||\nabla f_k||}\]Notice that this proves the aforementioned steepest descent claim. The steepest descent method is a line search method that moves along $p_k = - \nabla f_k$ at every step. Its advantage is that it only requires a calculation of the gradient but not the second derivatives (i.e. the Hessian). Its disadvantage is that it can be slow on difficult problems.
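A minimal sketch of the steepest descent loop, reusing the illustrative line_search helper from above:

```python
def steepest_descent(f, grad_f, x0, iters=1024, tol=1e-8):
    # Move along p_k = -grad f(x_k) with a step length from the Wolfe line search.
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:      # no more progress can be made
            break
        p = -g                           # steepest descent direction
        x = x + line_search(f, grad_f, x, p) * p
        history.append(x.copy())
    return x, history
```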
Newton’s Method
The Newton direction is another viable choice for the search direction of a line search strategy. It is derived from the second-order Taylor series approximation:
\[f(x_k + p) \approx f_k + p^T \nabla f_k + \frac{1}{2} p^T \nabla^2 f_k p \overset{\text{def}}{=} m_k(p)\]If the matrix $\nabla^2 f_k$ is positive definite, the Newton direction can be found by $\underset{p}{\min} m_k (p)$. Setting the derivative of $m_k(p)$ to zero, we obtain:
\[p_k^N = - (\nabla^2 f_k)^{-1} \nabla f_k\]Unlike the steepest descent direction, which requires an adaptive step length scheme, the Newton direction has a natural step length of $\alpha = 1$. Therefore, most implementations of Newton’s method use $\alpha = 1$ where possible and adjust the step length only when it does not produce a satisfactory reduction in the value of $f$.
The advantage of Newton’s method is that it has a fast rate of local convergence, typically quadratic. Its disadvantage is that it requires the explicit computation of the Hessian, which in some cases can be error-prone and expensive. Finite-difference and automatic differentiation techniques can be helpful here.
The figure below from [3] depicts how second-order methods like Newton’s method can give better direction and step length than first-order methods like steepest descent. It also foreshadows another method we will cover: conjugate gradient.
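A sketch of Newton’s method in the same mold; the fallback to the steepest descent direction when the Newton step is not a descent direction is an implementation choice made here, not something prescribed by the post.

```python
def newtons_method(f, grad_f, hess_f, x0, iters=1024, tol=1e-8):
    # Move along p_k = -(hess f_k)^{-1} grad f_k; the line search tries alpha = 1 first.
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess_f(x), -g)   # Newton direction (solve, don't invert)
        if g @ p >= 0:                       # not a descent direction: fall back
            p = -g
        x = x + line_search(f, grad_f, x, p) * p
        history.append(x.copy())
    return x, history
```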
Quasi-Newton Methods
The quasi-Newton search directions are an alternative to the Newton direction: they do not require computation of the Hessian but still attain a superlinear rate of convergence. These methods replace the true Hessian $\nabla^2 f_k$ with an approximation $B_k$ at each step, which infers information about the second derivative of $f$ from changes in its gradient. The approximation is derived as follows:
\[\begin{align*} \nabla f(x + p) &= \nabla f(x) + \nabla^2 f(x) p + \int_0^1 \big[\nabla^2 f(x + tp) - \nabla^2 f(x)\big] p \D{t} \\ &\quad \: \text{by adding and subtracting } \nabla^2 f(x) p \text{ from Taylor's Theorem} \end{align*}\]Since the Hessian $\nabla^2 f(\cdot)$ is continuous, the final integral term is small relative to $||p||$:
\[\int_0^1 \big[\nabla^2 f(x + tp) - \nabla^2 f(x)\big] p \D{t} = o(||p||)\]By setting $x = x_k$ and $p = x_{k+1} - x_k$, we obtain:
\[\nabla f_{k+1} = \nabla f_k + \nabla^2 f_k(x_{k+1} - x_k) + o(||x_{k+1} - x_k||)\]When $x_k$ and $x_{k+1}$ lie in a region near the solution $x^*$, the final term is eventually dominated by $\nabla^2 f_k(x_{k+1} - x_k)$, which yields the following approximation involving the Hessian:
\[\nabla^2 f_k(x_{k+1} - x_k) \approx \nabla f_{k+1} - \nabla f_k\]We then require that the approximation $B_{k+1}$, just like the true Hessian, satisfies the secant equation:
\[B_{k+1} s_k = y_k \quad \text{where } s_k = x_{k+1} - x_{k} \text{ and } y_k = \nabla f_{k+1} - \nabla f_k\]The quasi-Newton search direction is then given by using $B_k$ in place of the exact Hessian:
\[p_k = - B_k^{-1} \nabla f_k\]This is the definition of a generic quasi-Newton method; specialized methods, as we shall see next, impose other conditions on the approximation $B_{k+1}$.
Symmetric-Rank-One (SR1) Method
The symmetric-rank-one (SR1) method states that
\[B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}\]NOTE: We will not implement this method, as BFGS is the more popular and common choice of the two for quasi-Newton methods.
BFGS Method
The BFGS method, named after its inventors Broyden, Fletcher, Goldfarb, and Shanno, states that
\[B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}\]
Conjugate Gradient Method
The conjugate gradient method chooses the direction as
\[p_k = - \nabla f(x_k) + \beta_k p_{k-1}\]where $\beta_k$ is a scalar that ensures that $p_k$ and $p_{k-1}$ are conjugate.
Experiments
Example Problem: Rosenbrock Function
The Rosenbrock function is a non-convex function introduced by Howard H. Rosenbrock in 1960. It has been widely used in performance testing for optimization algorithms. It is also known as Rosenbrock’s valley or Rosenbrock’s banana function.
The function is defined by:
\[f(x) = f(x_1, x_2) = (a - x_1)^2 + b(x_2 - x_1^2)^2 = a^2 - 2 a x_1 + x_1^2 + b x_2^2 - 2 b x_2 x_1^2 + b x_1^4\]The function has a global minimum at $(x_1, x_2) = (a, a^2)$ where $f(x_1, x_2) = 0$. Usually the parameter setting is: $a = 1$ and $b = 100$.
We can compute the gradient and Hessian of the function:
\[\begin{align*} \nabla f_k &= \begin{bmatrix} 4 b x_1^3 - 4 b x_1 x_2 + 2 x_1 - 2 a \\ 2 b x_2 - 2 b x_1^2 \\ \end{bmatrix} \\ \nabla^2 f_k &= H_k = \begin{bmatrix} 12 b x_1^2 - 4 b x_2 + 2 & -4 b x_1 \\ -4 b x_1 & 2 b\\ \end{bmatrix} \end{align*}\]Now, let’s implement the Rosenbrock function in Python and visualize its surface. The global minimum of the function is inside a long, flat valley. The valley is easily identifiable from the plots as well as mathematically; however, converging to the global minimum is more difficult.
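The original implementation and plotting code are not shown; below is a sketch of the function, its gradient and Hessian, and a quick surface plot. The parameter names and plotting ranges are illustrative choices.

```python
A, B_COEF = 1.0, 100.0                    # usual Rosenbrock parameters a = 1, b = 100

def rosenbrock(x):
    x1, x2 = x
    return (A - x1) ** 2 + B_COEF * (x2 - x1 ** 2) ** 2

def rosenbrock_grad(x):
    x1, x2 = x
    return np.array([4 * B_COEF * x1 ** 3 - 4 * B_COEF * x1 * x2 + 2 * x1 - 2 * A,
                     2 * B_COEF * x2 - 2 * B_COEF * x1 ** 2])

def rosenbrock_hess(x):
    x1, x2 = x
    return np.array([[12 * B_COEF * x1 ** 2 - 4 * B_COEF * x2 + 2, -4 * B_COEF * x1],
                     [-4 * B_COEF * x1, 2 * B_COEF]])

# Surface plot of the long, flat valley containing the global minimum at (1, 1).
x1g, x2g = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-1, 3, 200))
ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(x1g, x2g, rosenbrock([x1g, x2g]), cmap="viridis")
plt.show()
```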
Let’s now minimize the Rosenbrock function implemented above using the line search optimization algorithms we implemented previously.
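Using the sketches above, running each algorithm from a common starting point might look like this; the starting point is an illustrative choice, not necessarily one of the post’s predefined points.

```python
x0 = np.array([-1.5, 1.5])
for name, method in [("steepest descent", steepest_descent),
                     ("conjugate gradient", conjugate_gradient),
                     ("BFGS", bfgs)]:
    x_min, history = method(rosenbrock, rosenbrock_grad, x0)
    print(f"{name:>18}: x = {x_min}, f(x) = {rosenbrock(x_min):.2e}, iters = {len(history) - 1}")

x_min, history = newtons_method(rosenbrock, rosenbrock_grad, rosenbrock_hess, x0)
print(f"{'Newton':>18}: x = {x_min}, f(x) = {rosenbrock(x_min):.2e}, iters = {len(history) - 1}")
```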
Before doing deeper analysis on the performance and convergence of the algorithms, let's first plot the solution paths of each optimization algorithm with algorithm_iters = 1024 on the predefined points. This is mostly for fun, but it might also provide an additional perspective.
Let's now vary algorithm_iters and plot error as a function of the number of iterations. We will do this across different starting points with different difficulty levels. This will give us an idea of the convergence rate and relative performance of the optimization algorithms we have implemented.
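A simplified sketch of this experiment for a single starting point (the post’s actual plots cover several starting points and difficulty levels):

```python
x_star = np.array([1.0, 1.0])             # global minimizer for a = 1
x0 = np.array([-1.5, 1.5])
histories = {
    "steepest descent": steepest_descent(rosenbrock, rosenbrock_grad, x0)[1],
    "conjugate gradient": conjugate_gradient(rosenbrock, rosenbrock_grad, x0)[1],
    "BFGS": bfgs(rosenbrock, rosenbrock_grad, x0)[1],
    "Newton": newtons_method(rosenbrock, rosenbrock_grad, rosenbrock_hess, x0)[1],
}
for name, history in histories.items():
    errors = [absolute_error(x, x_star) for x in history]
    plt.loglog(range(1, len(errors) + 1), errors, label=name)
plt.xlabel("iteration")
plt.ylabel("absolute error")
plt.legend()
plt.show()
```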
Analysis
Solution Trajectory Visualization
The solution trajectory visualizations show that the steepest descent method is the noisiest, whereas the others are considerably smoother. This aligns well with the idea that Newton’s method and other methods that use second-derivative information get better search directions and step sizes. We notice that the conjugate gradient method overshoots the minimum on the hard level the first time it comes around. We see something similar with the BFGS method on the medium level.
Convergence Rate and Performance
First of all, we notice that Newton’s method, conjugate gradient, and BFGS all converge within $10^1$ to $10^2$ iterations at all levels, whereas steepest descent converges much later, between $10^3$ and $10^4$ iterations. Moreover, steepest descent always converges at a higher absolute error than the other algorithms. This supports our previous claims that steepest descent has the advantage of only requiring calculation of the gradient but not of second derivatives, yet it can be painfully slow on difficult problems. On the other hand, Newton’s method and quasi-Newton methods reach superlinear convergence but are more computationally expensive per iteration. They also achieve better performance compared to steepest descent, as previously discussed.
Moreover, Newton’s method has the highest convergence rate and achieves the best absolute error overall. The second-best convergence rate and absolute error belong to the quasi-Newton BFGS method. BFGS and other quasi-Newton methods do not require computation of the Hessian but still attain a superlinear rate of convergence; our graph supports this, as the rates of convergence are comparable. The rate of convergence of the conjugate gradient method follows closely after; however, its performance in terms of absolute error is poor compared to Newton’s method and BFGS. This makes sense: the conjugate gradient method tries to accelerate the convergence rate of steepest descent while avoiding the high computational cost of Newton’s method, and therefore falls right in between.
References
[1] Jorge Nocedal, Stephen J. Wright, Numerical Optimization (2nd Edition), Springer
[2] University of Stuttgart, Introduction to Optimization, Marc Toussaint, Lecture 02
[3] University of Stuttgart, Introduction to Optimization, Marc Toussaint, Lecture 04
[4] algopy documentation
[5] Carnegie Mellon University, Convex Optimization, Aarti Singh and Pradeep Ravikumar
[6] Columbia University, APMA E4300: Computational Math: Introduction to Numerical Methods, Marc Spiegelman
NOTE: The majority of the material covered here is inspired and adapted from [1].