# Basics of Unconstrained Optimization

*Tags: optimization, numerical-methods*

## Introduction

Optimization is an essential tool in decision science, and it is widely applicable to many fields thanks to its application-agnostic definition. It is also integral to how humans understand and act in the world. For example, investors seek portfolios that achieve high rates of return, manufacturers and civil engineers aim for maximum efficiency in their designs, and engineers tune parameters to optimize the performance of their algorithms.

In this post, we will (i) cover the fundamentals of **unconstrained** optimization of **smooth** functions, (ii) formulate and implement a few popular optimization algorithms in Python, and (iii) end with a comparative analysis and discussion. The ideas and algorithms described in this post are highly relevant to the bigger world of numerical methods.

## Imports
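The implementations in this post rely on `numpy` throughout (plotting additionally uses `matplotlib.pyplot`). A minimal setup, assuming these packages are installed, is:

```python
import numpy as np  # vectors, matrices, and linear algebra used throughout
```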

## Formulation

Optimization is the maximization or minimization of an objective function $f$, potentially subject to constraints $c$ on some variables $x$. A mathematical convenience that arises is that to maximize $f$ is to minimize $-f$ and vice versa. Throughout this post, we will focus on multivariate **minimization**.

**Unconstrained** optimization algorithms either require the user to specify the starting point $x_0$ based on some prior knowledge, or choose $x_0$ by a systematic approach. These algorithms follow the general pseudocode given below:

- Begin at $x_0$.
- Generate a sequence of iterates $\{x_k\}_{k=0}^\infty$.
- Use information about $f$ at $x_k$, and possibly at earlier iterates $x_0, \ldots, x_{k-1}$, to decide on $x_{k+1}$.
- Find an $x_{k+1}$ such that $f(x_{k+1}) < f(x_k)$.
- Repeat until no more progress can be made or a solution point has been approximated with sufficient accuracy.
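The loop above can be sketched directly. The following minimal illustration (all names are my own) uses a fixed step length on a simple quadratic purely for demonstration; choosing the step length well is the subject of the following sections:

```python
import numpy as np

def generic_descent(f, grad, x0, step=0.1, tol=1e-8, max_iters=1000):
    """Generic descent loop: repeatedly move to points with lower f."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)                   # information about f at the current iterate
        if np.linalg.norm(g) < tol:
            break                     # terminate: no more progress to be made
        x_next = x - step * g         # candidate x_{k+1}
        if f(x_next) >= f(x):
            break                     # f(x_{k+1}) < f(x_k) no longer achievable
        x = x_next
    return x

# Minimize f(x) = ||x||^2, whose unique minimizer is the origin.
f = lambda x: float(x @ x)
grad = lambda x: 2.0 * x
x_star = generic_descent(f, grad, np.array([3.0, -2.0]))
```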

There are two main strategies for moving from the current point $x_k$ to a new iterate $x_{k+1}$ in an optimization algorithm: **line search** and **trust region**. They differ in the order in which they choose the *direction* and *distance* of the move to the next iterate.

### Line Search

The algorithm chooses a direction $p_k$ and searches along it for a new iterate with a lower function value. The distance to move along $p_k$ is found by solving for the step length $\alpha$ in the following minimization subproblem:

\[\underset{\alpha > 0}{\min} f(x_k + \alpha p_k)\]

Solving this exactly would derive the maximum benefit from the direction $p_k$, but an exact minimization is expensive and usually unnecessary. Instead, the algorithm can generate a limited number of trial step lengths until it finds one that loosely approximates the minimizer.

### Trust Region

The algorithm uses information gathered about $f$ to construct a model function $m_k$ that behaves similarly to $f$ near the current point $x_k$. However, by definition, $m_k$ might not approximate $f$ well when $x$ is far from $x_k$, so we restrict the search to some region around $x_k$. The candidate step $p$ can then be found by solving the following minimization subproblem:

\[\underset{p}{\min} m_k(x_k + p) \quad \text{where } x_k + p \text{ lies inside the trust region}\]

The algorithms we will cover here employ the line search strategy. Before introducing and implementing them, let’s first take a look at line search conditions and a related routine for picking a reasonable step length $\alpha$ at each step. We will use this same routine for each line search algorithm introduced later.

### Step Length

There is a fundamental tradeoff when choosing the step length $\alpha_k$. We want a substantial reduction of $f$, yet we cannot afford to spend too much time making the choice. In general, it is very expensive to identify the $\alpha$ that yields a global or even a local minimizer along the ray, since doing so requires many evaluations of $f$ and possibly of $\nabla f$. Instead of an exact line search, most methods use a **bracketing** phase to find a reasonable interval and an **interpolation** phase to compute a good step length within this interval. Due to this inexact nature, line search algorithms have various termination conditions, as discussed next.

#### Wolfe Conditions

##### Sufficient Decrease Condition

The **sufficient decrease condition** states that a reasonable step length $\alpha_k$ should yield a sufficient decrease in $f$, denoted by:

\[f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k \quad \text{for some } c_1 \in (0, 1)\]

where $c_1$ is chosen to be quite small, say $c_1 = 10^{-4}$, in practice.

##### Curvature Condition

The sufficient decrease condition is not enough by itself to ensure that the algorithm makes reasonable progress, as it is satisfied for all sufficiently small values of $\alpha$ (see figure below). Therefore, the **curvature condition** rules out short step lengths by stating:

\[\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f_k^T p_k \quad \text{for some } c_2 \in (c_1, 1)\]

where typical values of $c_2$ are $c_2 = 0.9$ when the search direction $p_k$ is chosen by a Newton or quasi-Newton method, and $c_2 = 0.1$ when $p_k$ is obtained from a nonlinear conjugate gradient method.

When both the sufficient decrease and curvature conditions are satisfied, the step length is said to satisfy the **Wolfe conditions**. The **strong Wolfe conditions** additionally tighten the curvature condition to $|\nabla f(x_k + \alpha_k p_k)^T p_k| \le c_2 |\nabla f_k^T p_k|$, which keeps the new slope from being too steeply positive. The figure below, taken from [1], shows this.

There is a one-dimensional search procedure that is guaranteed to find a step length satisfying the strong Wolfe conditions for any parameters $c_1$ and $c_2$ such that $0 < c_1 < c_2 < 1$. The procedure has two stages:

- The first stage begins with a trial estimate $\alpha_1$ and keeps increasing it until an acceptable step length or a **bracketing** interval is found.
- The second stage employs **interpolation** and successively decreases the size of the interval until an acceptable step length is identified.

Let’s now implement this procedure, alongside its prerequisites (e.g. the line search conditions), as it will come in handy later.
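A compact sketch of the two-stage procedure, in the spirit of Algorithms 3.5 and 3.6 from [1], might look like the following. All names are my own, and plain bisection stands in for the polynomial interpolation a production implementation would use:

```python
import numpy as np

def wolfe_line_search(f, grad, x, p, c1=1e-4, c2=0.9, alpha_max=10.0, max_iters=50):
    """Return a step length along p that satisfies the strong Wolfe conditions."""
    phi = lambda a: f(x + a * p)          # f restricted to the ray x + alpha * p
    dphi = lambda a: grad(x + a * p) @ p  # slope of f along p at step length a
    phi0, dphi0 = phi(0.0), dphi(0.0)

    def zoom(lo, hi):
        # Stage 2: shrink the bracketing interval until a Wolfe point is found.
        for _ in range(max_iters):
            a = 0.5 * (lo + hi)           # bisection in place of interpolation
            if phi(a) > phi0 + c1 * a * dphi0 or phi(a) >= phi(lo):
                hi = a                    # sufficient decrease fails: move hi down
            else:
                if abs(dphi(a)) <= -c2 * dphi0:
                    return a              # both strong Wolfe conditions hold
                if dphi(a) * (hi - lo) >= 0:
                    hi = lo
                lo = a
        return a

    # Stage 1: increase the trial step until a bracketing interval is found.
    a_prev, a = 0.0, 1.0
    for i in range(max_iters):
        if phi(a) > phi0 + c1 * a * dphi0 or (i > 0 and phi(a) >= phi(a_prev)):
            return zoom(a_prev, a)
        if abs(dphi(a)) <= -c2 * dphi0:
            return a
        if dphi(a) >= 0:
            return zoom(a, a_prev)
        a_prev, a = a, min(2.0 * a, alpha_max)
    return a

# Example: search along the steepest descent direction of f(x) = ||x||^2.
f = lambda x: float(x @ x)
grad = lambda x: 2.0 * x
x0 = np.array([2.0])
alpha = wolfe_line_search(f, grad, x0, -grad(x0))
```

For the quadratic above the first stage brackets $[0, 1]$, and the first bisection lands on the exact one-dimensional minimizer $\alpha = 0.5$.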

The following error calculation code will also come in handy while measuring the performance of optimization algorithms.
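As a concrete stand-in (a hypothetical helper, not necessarily the post’s original code), the error can be measured as the distance of each iterate from a known minimizer $x^*$:

```python
import numpy as np

def absolute_errors(iterates, x_star):
    """Euclidean distance of each iterate from the known minimizer x_star."""
    x_star = np.asarray(x_star, dtype=float)
    return np.array([np.linalg.norm(np.asarray(x, dtype=float) - x_star)
                     for x in iterates])

# Errors shrink as the iterates approach the Rosenbrock minimizer (1, 1).
errs = absolute_errors([[0.0, 0.0], [0.5, 0.75], [1.0, 1.0]], [1.0, 1.0])
```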

## Algorithms

### Steepest Descent Method

The *steepest descent direction*, $-\nabla f_k$, is the direction along which $f$ decreases most rapidly and is consequently the most obvious choice of search direction for a line search strategy. To prove this, let’s introduce **Taylor’s Theorem**.

The Taylor series, and subsequently Taylor’s Theorem, is one of the most powerful tools mathematics has to offer for approximating a function. For a smooth function $f: \mathbb{R}^n \to \mathbb{R}$, we can state it as follows: $ \DeclareMathOperator{\D}{\mathop{\: \mathrm{d}}} % differential $

\[\begin{align*} &\text{If } f: \mathbb{R}^n \to \mathbb{R} \text{ is continuously differentiable and } p \in \mathbb{R}^n \text{:} \\ &f(x + p) = f(x) + \nabla f (x + tp)^T p \quad \text{for some } t \in (0, 1) \\ &\text{If } f \text{ is twice continuously differentiable:} \\ &\nabla f(x + p) = \nabla f(x) + \int_{0}^{1} \nabla^2 f(x + tp)p \D{t} \quad \text{and} \\ &f(x + p) = f(x) + \nabla f(x)^T p + \frac{1}{2} p^T \nabla^2 f(x + tp)p \quad \text{for some } t \in (0, 1) \\ \end{align*}\]

Taylor’s Theorem tells us that for any search direction $p$ and step length $\alpha$, we have:

\[f(x_k + \alpha p) = f(x_k) + \alpha p^T \nabla f_k + \frac{1}{2} \alpha^2 p^T \nabla^2 f(x_k + t \alpha p)p \quad \text{for some } t \in (0, 1)\]

The rate of change in $f$ along direction $p$ at $x_k$ is therefore the coefficient of $\alpha$, namely $p^T \nabla f_k$. The unit direction $p$ of the most rapid decrease can be found by solving the following minimization subproblem:

\[\begin{align*} &\underset{p}{\min} \; p^T \nabla f_k \quad \text{subject to } ||p|| = 1 \\ &= \underset{p}{\min} \; ||p|| \, ||\nabla f_k|| \cos \theta \quad \text{by } a \cdot b = ||a|| \, ||b|| \cos \theta \\ &= \underset{p}{\min} \; ||\nabla f_k || \cos \theta \quad \text{by } ||p|| = 1 \\ \end{align*}\]

The minimum is attained when $\cos \theta = -1$, where $\theta$ denotes the angle between $p$ and $\nabla f_k$. This yields:

\[\underset{p}{\arg\min} \; p^T \nabla f_k = - \frac{\nabla f_k}{||\nabla f_k||}\]

Notice that this proves the aforementioned steepest descent claim. The **steepest descent method** is a line search method that moves along $p_k = - \nabla f_k$ at every step. Its advantage is that it requires only a calculation of the gradient and not the second derivatives (i.e. the Hessian). Its disadvantage is that it can be slow on difficult problems.
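The method can be sketched as follows; this is a minimal illustration with my own names, using a simple backtracking (sufficient-decrease-only) line search for brevity, demonstrated on a simple quadratic:

```python
import numpy as np

def backtracking(f, grad, x, p, alpha=1.0, rho=0.5, c1=1e-4):
    """Shrink alpha until the sufficient decrease condition holds."""
    fx, slope = f(x), grad(x) @ p
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

def steepest_descent(f, grad, x0, tol=1e-6, max_iters=10000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = -g                              # steepest descent direction
        x = x + backtracking(f, grad, x, p) * p
    return x

# Minimize a mildly ill-conditioned quadratic: f(x) = x1^2 + 10 * x2^2.
f = lambda x: x[0]**2 + 10.0 * x[1]**2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x_star = steepest_descent(f, grad, np.array([3.0, 1.0]))
```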

### Newton’s Method

The *Newton direction* is another viable choice of search direction for a line search strategy. It is derived from the second-order Taylor series approximation to $f(x_k + p)$:

\[f(x_k + p) \approx f_k + p^T \nabla f_k + \frac{1}{2} p^T \nabla^2 f_k p \overset{\text{def}}{=} m_k(p)\]

If the matrix $\nabla^2 f_k$ is positive definite, the Newton direction can be found by solving $\underset{p}{\min} \, m_k (p)$. Setting the derivative of $m_k(p)$ to zero, we obtain:

\[p_k^N = - (\nabla^2 f_k)^{-1} \nabla f_k\]

Unlike the adaptive step length schemes used with the steepest descent direction, the Newton direction has a natural step length of $\alpha = 1$. Therefore, most implementations of **Newton’s method** use $\alpha = 1$ where possible and adjust the step length only when it does not produce a satisfactory reduction in the value of $f$.

The advantage of Newton’s method is that it has a fast rate of local convergence, typically quadratic. Its disadvantage is that it requires the explicit computation of the Hessian, which in some cases can be error-prone and expensive. Finite-difference and automatic differentiation techniques can be helpful here.
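A minimal, self-contained sketch of Newton’s method on the Rosenbrock function (defined later in this post) could look like the following; the crude step-halving fallback stands in for a proper line search, and all names are my own:

```python
import numpy as np

def newtons_method(f, grad, hess, x0, tol=1e-8, max_iters=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break                           # gradient is (nearly) zero: stop
        p = np.linalg.solve(hess(x), -g)    # Newton direction solves H p = -g
        alpha = 1.0                         # natural step length for Newton
        while f(x + alpha * p) > f(x) and alpha > 1e-10:
            alpha *= 0.5                    # adjust only when alpha = 1 fails
        x = x + alpha * p
    return x

# Rosenbrock function with a = 1, b = 100; global minimum at (1, 1).
f = lambda x: (1 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([
    -2 * (1 - x[0]) - 400.0 * x[0] * (x[1] - x[0]**2),
    200.0 * (x[1] - x[0]**2),
])
hess = lambda x: np.array([
    [2 + 1200.0 * x[0]**2 - 400.0 * x[1], -400.0 * x[0]],
    [-400.0 * x[0], 200.0],
])
x_star = newtons_method(f, grad, hess, np.array([-1.2, 1.0]))
```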

The figure below from [3] depicts how second-order methods like Newton’s method can give better direction and step length than first-order methods like steepest descent. It also foreshadows another method we will cover: conjugate gradient.

### Quasi-Newton Methods

The *quasi-Newton search directions* are an alternative to the *Newton direction*: they do not require computation of the Hessian but still attain a superlinear rate of convergence. These directions replace the true Hessian $\nabla^2 f_k$ with an approximation $B_k$ at each step, which infers information about the second derivative of $f$ from changes in its gradient. They are derived the following way. By Taylor’s Theorem applied to the gradient, we have:

\[\nabla f(x + p) = \nabla f(x) + \nabla^2 f(x) p + \int_0^1 \big[\nabla^2 f(x + tp) - \nabla^2 f(x)\big] p \D{t}\]

As $\nabla^2 f(\cdot)$ is continuous, we also know the following:

\[\int_0^1 \big[\nabla^2 f(x + tp) - \nabla^2 f(x)\big] p \D{t} = O(||p||)\]

By setting $x = x_k$ and $p = x_{k+1} - x_k$, we obtain:

\[\nabla f_{k+1} = \nabla f_k + \nabla^2 f_k(x_{k+1} - x_k) + O(||x_{k+1} - x_k||)\]

When $x_k$ and $x_{k+1}$ lie in a region near the solution $x^*$, the remainder term is eventually dominated by $\nabla^2 f_k(x_{k+1} - x_k)$. This yields the following approximation of the Hessian:

\[\nabla^2 f_k (x_{k+1} - x_k) \approx \nabla f_{k+1} - \nabla f_k\]

We then require that the approximation $B_{k+1}$, just like the true Hessian above, satisfies the *secant equation*:

\[B_{k+1} s_k = y_k \quad \text{where } s_k = x_{k+1} - x_k \text{ and } y_k = \nabla f_{k+1} - \nabla f_k\]

The quasi-Newton search direction is then given by using $B_k$ in place of the exact Hessian:

\[p_k = - B_k^{-1} \nabla f_k\]

This is the definition of a generic **quasi-Newton method**; specialized methods, as we shall see next, impose other conditions on the approximation $B_{k+1}$.

#### Symmetric-Rank-One (SR1) Method

The **symmetric-rank-one (SR1) method** updates the Hessian approximation with a symmetric rank-one matrix chosen so that the secant equation still holds:

\[B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}\]

*NOTE*: We will not implement this method, as BFGS is the more popular and common choice of the two for quasi-Newton methods.

#### BFGS Method

The **BFGS method**, named after its inventors Broyden, Fletcher, Goldfarb, and Shanno, uses a rank-two update that keeps $B_{k+1}$ symmetric and positive definite while satisfying the secant equation:

\[B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}\]
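A self-contained sketch of BFGS, using the equivalent inverse-Hessian form of the update so that no linear system needs to be solved. A simple backtracking line search stands in for the strong-Wolfe routine, so the update is skipped whenever the curvature $y_k^T s_k$ is not positive (which the Wolfe conditions would otherwise guarantee):

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-6, max_iters=1000):
    x = np.asarray(x0, dtype=float)
    n = x.size
    I = np.eye(n)
    H = I.copy()                            # inverse Hessian approximation
    g = grad(x)
    for _ in range(max_iters):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                          # quasi-Newton search direction
        alpha = 1.0                         # backtracking on sufficient decrease
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        s = alpha * p                       # s_k = x_{k+1} - x_k
        x = x + s
        g_new = grad(x)
        y = g_new - g                       # y_k = grad_{k+1} - grad_k
        if y @ s > 1e-10:                   # curvature check keeps H positive definite
            rho = 1.0 / (y @ s)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        g = g_new
    return x

# Rosenbrock function with a = 1, b = 100; global minimum at (1, 1).
f = lambda x: (1 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([
    -2 * (1 - x[0]) - 400.0 * x[0] * (x[1] - x[0]**2),
    200.0 * (x[1] - x[0]**2),
])
x_star = bfgs(f, grad, np.array([-1.2, 1.0]))
```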

### Conjugate Gradient Method

The **conjugate gradient method** chooses the direction as

\[p_k = - \nabla f_k + \beta_k p_{k-1}\]

where $\beta_k$ is a scalar that ensures that $p_k$ and $p_{k-1}$ are *conjugate*.
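A sketch of nonlinear conjugate gradient with the Fletcher–Reeves choice $\beta_k = \frac{\nabla f_k^T \nabla f_k}{\nabla f_{k-1}^T \nabla f_{k-1}}$ (one of several standard formulas for $\beta_k$), again with a backtracking line search and a periodic restart to steepest descent as a safeguard:

```python
import numpy as np

def conjugate_gradient(f, grad, x0, tol=1e-6, max_iters=2000):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                                  # first direction: steepest descent
    for k in range(max_iters):
        if np.linalg.norm(g) < tol:
            break
        alpha = 1.0                         # backtracking on sufficient decrease
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves formula
        p = -g_new + beta * p
        g = g_new
        if g @ p >= 0 or (k + 1) % x.size == 0:
            p = -g                          # safeguard: restart with steepest descent
    return x

# Minimize a mildly ill-conditioned quadratic: f(x) = x1^2 + 10 * x2^2.
f = lambda x: x[0]**2 + 10.0 * x[1]**2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x_star = conjugate_gradient(f, grad, np.array([3.0, 1.0]))
```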

## Experiments

### Example Problem: Rosenbrock Function

The **Rosenbrock function** is a non-convex function introduced by Howard H. Rosenbrock in 1960. It has been widely used in performance testing for optimization algorithms. It is also known as Rosenbrock’s valley or Rosenbrock’s banana function.

The function is defined by:

\[f(x) = f(x_1, x_2) = (a - x_1)^2 + b(x_2 - x_1^2)^2 = a^2 - 2 a x_1 + x_1^2 + b x_2^2 - 2 b x_2 x_1^2 + b x_1^4\]

The function has a global minimum at $(x_1, x_2) = (a, a^2)$ where $f(x_1, x_2) = 0$. Usually the parameter setting is $a = 1$ and $b = 100$.

We can compute the gradient and Hessian of the function:

\[\begin{align*} \nabla f_k &= \begin{bmatrix} 4 b x_1^3 - 4 b x_1 x_2 + 2 x_1 - 2 a \\ 2 b x_2 - 2 b x_1^2 \\ \end{bmatrix} \\ \nabla^2 f_k &= H_k = \begin{bmatrix} 12 b x_1^2 - 4 b x_2 + 2 & -4 b x_1 \\ -4 b x_1 & 2 b\\ \end{bmatrix} \end{align*}\]

Now, let’s implement the Rosenbrock function in Python and visualize its surface. The global minimum of the function is inside a long, flat valley. The valley is easily identifiable from the plots as well as mathematically; however, converging to the global minimum is more difficult.
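Following the formulas above, a direct implementation could look like this (names are my own; the surface-plotting code is omitted here):

```python
import numpy as np

def rosenbrock(x, a=1.0, b=100.0):
    """Rosenbrock function; global minimum of 0 at (a, a^2)."""
    return (a - x[0])**2 + b * (x[1] - x[0]**2)**2

def rosenbrock_grad(x, a=1.0, b=100.0):
    """Gradient from the closed-form expression above."""
    return np.array([
        4 * b * x[0]**3 - 4 * b * x[0] * x[1] + 2 * x[0] - 2 * a,
        2 * b * x[1] - 2 * b * x[0]**2,
    ])

def rosenbrock_hess(x, a=1.0, b=100.0):
    """Hessian from the closed-form expression above."""
    return np.array([
        [12 * b * x[0]**2 - 4 * b * x[1] + 2, -4 * b * x[0]],
        [-4 * b * x[0], 2 * b],
    ])
```

At the minimizer $(1, 1)$ the function value and gradient vanish, which makes for a quick sanity check.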

Let’s now minimize the Rosenbrock function implemented above using the line search optimization algorithms we implemented previously.

Before doing deeper analysis on the performance and convergence of the algorithms, let’s first plot the solution paths of each optimization algorithm with `algorithm_iters = 1024` on the predefined points. This is mostly for fun, but it might also provide an additional perspective.

Let’s now vary `algorithm_iters` and plot error as a function of the number of iterations. We will do this across different starting points with different difficulty levels. This will give us an idea of the convergence rate and relative performance of the optimization algorithms we have implemented.

## Analysis

### Solution Trajectory Visualization

The solution trajectory visualizations show that the steepest descent method is the noisiest, whereas the others are considerably smoother. This aligns well with the idea that Newton’s method and the other methods that use second-derivative information obtain better directions and step sizes. We notice that the conjugate gradient method overshoots the minimum on the hard level the first time it comes around; we see something similar with the BFGS method on the medium level.

### Convergence Rate and Performance

First of all, we notice that Newton’s method, conjugate gradient, and BFGS all converge within $10^1$ to $10^2$ iterations on all levels, whereas steepest descent converges much later, between $10^3$ and $10^4$ iterations. Moreover, steepest descent always converges to a higher absolute error than the other algorithms. This supports our previous claims: steepest descent has the advantage of requiring only the gradient and not the second derivatives, yet it can be painfully slow on difficult problems. On the other hand, Newton’s method and the quasi-Newton methods reach superlinear convergence but are more computationally expensive per iteration, and they achieve better accuracy than steepest descent as previously discussed.

Moreover, Newton’s method has the highest convergence rate and achieves the best absolute error overall. The second-best convergence rate and absolute error belong to the quasi-Newton BFGS method. BFGS and other quasi-Newton methods do not require computation of the Hessian but still attain a superlinear rate of convergence, and our graph supports this, as its rate of convergence is comparable to Newton’s. The rate of convergence of the conjugate gradient method follows closely after; however, its absolute error is poor compared to Newton’s method and BFGS. This makes sense: the conjugate gradient method tries to accelerate the convergence rate of steepest descent while avoiding the high computational cost of Newton’s method, and therefore falls right in between.

## References

[1] Jorge Nocedal, Stephen J. Wright, *Numerical Optimization (2nd Edition)*, Springer

[2] Marc Toussaint, *Introduction to Optimization*, University of Stuttgart, Lecture 02

[3] Marc Toussaint, *Introduction to Optimization*, University of Stuttgart, Lecture 04

[4] `algopy` documentation

[5] Carnegie Mellon University, *Convex Optimization*, Aarti Singh and Pradeep Ravikumar

[6] Columbia University, *APMA E4300: Computational Math: Introduction to Numerical Methods*, Marc Spiegelman

*NOTE:* The majority of the material covered here is inspired and adapted from [1].