
Lecture 4

Optimization

Ivan Rudik

AEM 7130


Software and stuff

Necessary things to do:

  • Install the QuantEcon Julia package
  • Install the Optim.jl or NLopt.jl Julia packages

Optimization

All econ problems are optimization problems

  • Min costs

  • Max PV E[welfare]

Some are harder than others:

  • Individual utility max: easy

  • Decentralized electricity market with nodal pricing and market power: hard

  • One input profit maximization problem: easy

  • N-input profit maximization with learning and forecasts: hard


Things we will do

  1. Linear rootfinding
  2. Non-linear rootfinding
  3. Complementarity problems
  4. Non-linear unconstrained maximization/minimization
  5. Non-linear constrained maximization/minimization
Linear rootfinding

How do we solve these?

Consider a simple generic problem:

Ax=b

Invert A

x = A^{-1}b

THE END

Non-linear rootfinding

With non-linear rootfinding problems we want to solve:

f(x) = 0, \quad f: \mathbb{R}^n \rightarrow \mathbb{R}^n

What's a common rootfinding problem?

Can we reframe a common economic problem as rootfinding?

Yes!

Fixed point problems are rootfinding problems:

g(x) = x \Leftrightarrow f(x) \equiv g(x) - x = 0

Rootfinding with constraints

Economic problems virtually always involve constraints

These constraints may or may not be binding

e.g. a firm profit max problem: a control vector x may be constrained to be x \in [a,b]

This results in a complementarity problem:

x_i > a_i \Rightarrow f_i(x) \geq 0 \quad i = 1,...,n
x_i < b_i \Rightarrow f_i(x) \leq 0 \quad i = 1,...,n

What do these equations mean?

Rootfinding with constraints

x_i > a_i \Rightarrow f_i(x) \geq 0 \quad i = 1,...,n
x_i < b_i \Rightarrow f_i(x) \leq 0 \quad i = 1,...,n

If the constraints on x_i do not bind, a_i < x_i < b_i, then the first-order condition is precisely zero

Suppose the upper bound binds, x_i = b_i

Then f_i(x) \geq 0 since x_i > a_i, but we can't guarantee that f_i(x) = 0 because marginal profit f_i(x) may still be positive at the bound

Basic non-linear rootfinders: Bisection method

What does the intermediate value theorem tell us?

If a continuous real-valued function on a given interval takes on two values a and b,
it achieves all values in the set [a,b] somewhere in its domain

How can this motivate an algorithm to find the root of a function?

If we have a continuous, 1 variable function that is positive at some value and negative at another, a root must fall in between those values

We know a root exists by IVT, what's an efficient way to find it?

Continually bisect the interval!

Write out the code to do it


The bisection algorithm

function bisection(f, lower_bound, upper_bound)
    tolerance = 1e-3                            # tolerance for solution
    guess = 0.5*(upper_bound + lower_bound)     # initial guess, bisect the interval
    difference = (upper_bound - lower_bound)/2  # initialize bound difference
    while difference > tolerance                # loop until convergence
        println("Intermediate guess of $guess.")
        difference = difference/2
        if sign(f(lower_bound)) == sign(f(guess))  # guess has the same sign as the lower bound:
            lower_bound = guess                    # solution is in the upper half of the interval
            guess = guess + difference
        else                                       # else the solution is in the lower half of the interval
            upper_bound = guess
            guess = guess - difference
        end
    end
    println("The root of f(x) is at $guess.")
end;

The bisection method

f(x) = x^3
## f (generic function with 1 method)
bisection(f, -4, 1)
## Intermediate guess of -1.5.
## Intermediate guess of -0.25.
## Intermediate guess of 0.375.
## Intermediate guess of 0.0625.
## Intermediate guess of -0.09375.
## Intermediate guess of -0.015625.
## Intermediate guess of 0.0234375.
## Intermediate guess of 0.00390625.
## Intermediate guess of -0.005859375.
## Intermediate guess of -0.0009765625.
## Intermediate guess of 0.00146484375.
## Intermediate guess of 0.000244140625.
## The root of f(x) is at -0.0003662109375.
g(x) = 3x^3 + 2x -4
## g (generic function with 1 method)
bisection(g, -6, 4)
## Intermediate guess of -1.0.
## Intermediate guess of 1.5.
## Intermediate guess of 0.25.
## Intermediate guess of 0.875.
## Intermediate guess of 1.1875.
## Intermediate guess of 1.03125.
## Intermediate guess of 0.953125.
## Intermediate guess of 0.9140625.
## Intermediate guess of 0.89453125.
## Intermediate guess of 0.904296875.
## Intermediate guess of 0.8994140625.
## Intermediate guess of 0.90185546875.
## Intermediate guess of 0.900634765625.
## The root of f(x) is at 0.9012451171875.
h(x) = cos(x)
## h (generic function with 1 method)
bisection(h, -pi, pi)
## Intermediate guess of 0.0.
## Intermediate guess of -1.5707963267948966.
## Intermediate guess of -2.356194490192345.
## Intermediate guess of -1.9634954084936207.
## Intermediate guess of -1.7671458676442586.
## Intermediate guess of -1.6689710972195777.
## Intermediate guess of -1.6198837120072371.
## Intermediate guess of -1.595340019401067.
## Intermediate guess of -1.5830681730979819.
## Intermediate guess of -1.5769322499464393.
## Intermediate guess of -1.573864288370668.
## Intermediate guess of -1.5723303075827824.
## The root of f(x) is at -1.5715633171888395.

The bisection method

The bisection method works by continually bisecting the interval and only keeping the half interval with the zero until "convergence"

  1. Select the midpoint of [a,b], (a+b)/2
  2. Zero must be in the lower or upper half
  3. Check the sign of the midpoint; if it has the same sign as the lower bound, a root must be in the right subinterval
  4. Select the midpoint of [(a+b)/2,b]...
The bisection method

The bisection method is incredibly robust: if a function f satisfies the conditions of the IVT, it is guaranteed to converge in a known number of iterations

A root can be calculated to arbitrary precision \epsilon
in a maximum of \log_2([b-a]/\epsilon) iterations
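
For example, for the f(x) = x^3 run above with [a,b] = [-4,1] and \epsilon = 10^{-3}, the bound is \log_2(5/10^{-3}) \approx 12.3 iterations, matching the dozen intermediate guesses printed earlier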

Robustness comes with drawbacks:

  1. It only works in one dimension
  2. It is slow because it only uses information about the function's level
Function iteration

Fixed points can be computed using function iteration: iterate x^{(k+1)} = f(x^{(k)}) until the iterates stop changing

Since we can recast fixed points as rootfinding problems we can use function iteration to find roots too

Code up a function iteration problem


Function iteration

Function iteration is pretty simple to implement

function function_iteration(f, guess)
    tolerance = 1e-3    # tolerance for solution
    x_old = guess       # initialize old x value
    x = 1e-10           # initialize current x
    difference = 1e10   # initialize difference between iterates
    @time while abs(difference) > tolerance  # loop until convergence
        println("Intermediate guess of $x.")
        x = f(x_old)    # new x = f(old x)
        difference = x - x_old
        x_old = x
    end
    println("The fixed point of f(x) is at $x.")
end;
f(x) = x^(-0.5)
## f (generic function with 1 method)
function_iteration(f, 2.)
## Intermediate guess of 1.0e-10.
## Intermediate guess of 0.7071067811865476.
## Intermediate guess of 1.189207115002721.
## Intermediate guess of 0.9170040432046712.
## Intermediate guess of 1.0442737824274138.
## Intermediate guess of 0.9785720620877002.
## Intermediate guess of 1.0108892860517005.
## Intermediate guess of 0.9945994234836332.
## Intermediate guess of 1.0027112750502025.
## Intermediate guess of 0.9986471128909702.
## Intermediate guess of 1.0006771306930664.
## Intermediate guess of 0.9996616064962437.
## 0.002017 seconds (924 allocations: 233.763 KiB)
## The fixed point of f(x) is at 1.0001692397053021.

Function iteration

Function iteration can be quick, but is not always guaranteed to converge

Newton's method

Newton's method and variants are the workhorses of solving n-dimensional non-linear problems

What's the idea?

Take a hard non-linear problem and replace it with a sequence of linear problems

Under certain conditions the sequence of solutions will converge to the true solution


Newton's method

Here's a graphical depiction of Newton's method:

Newton's method

Start with an initial guess of the root at x^{(0)}

Approximate the non-linear function with its first-order Taylor expansion about x^{(0)}

This is just the tangent line at x^{(0)}; solve for the root of this linear approximation and call it x^{(1)}

Repeat starting at x^{(1)}

This can be applied to a function with an arbitrary number of dimensions

Newton's method

Begin with some initial guess of the root vector x^{(0)}

Our new guess x^{(k+1)} given some arbitrary point in the algorithm, x^{(k)}, is obtained by approximating f(x) using a first-order Taylor expansion about x^{(k)} and solving for x:
f(x) \approx f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)}) = 0 \Rightarrow x^{(k+1)} = x^{(k)} - [f'(x^{(k)})]^{-1}f(x^{(k)})

Newton's method

Code up Newton's method

function newtons_method(f, f_prime, guess)
    diff = Inf    # initialize difference between iterates
    tol = 1e-5    # tolerance for solution
    x_old = guess
    x = 1e10
    while abs(diff) > tol
        x = x_old - f(x_old)/f_prime(x_old)  # root of the linear approximation
        diff = x - x_old
        x_old = x
    end
    println("The root of f(x) is at $x.")
end;
Newton's method

f(x) = x^3;
f_prime(x) = 3x^2;
newtons_method(f, f_prime, 1.)
## The root of f(x) is at 1.231347218094855e-6.
f(x) = sin(x);
f_prime(x) = cos(x);
newtons_method(f, f_prime, pi/4)
## The root of f(x) is at 5.941936124988917e-19.
Newton's method

Newton's method has nice properties regarding convergence and speed:

If f(x) is continuously differentiable, the initial guess is "sufficiently close" to the root, and f'(x) is invertible near the root, then Newton's method converges to the root

What is "sufficiently close"?

We need f'(x) to be invertible so the iteration above is well defined

If f'(x) is ill-conditioned we can run into problems with rounding error

Quasi-Newton: Secant method

We usually don't want to deal with analytic derivatives unless we have access to autodifferentiation

Why?

  1. Coding error / time
  2. Can actually be slower to evaluate than finite differences for a nonlinear problem, see Ken Judd's notes

Using our current root guess x^{(k)} and our previous root guess x^{(k-1)}:

f'(x^{(k)}) \approx \frac{f(x^{(k)}) - f(x^{(k-1)})}{x^{(k)} - x^{(k-1)}}

Quasi-Newton: Secant method

Our new iteration rule then becomes

x^{(k+1)} = x^{(k)} - \frac{x^{(k)} - x^{(k-1)}}{f(x^{(k)}) - f(x^{(k-1)})}\,f(x^{(k)})

Now we require two initial guesses so that we have an initial approximation of the derivative

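A minimal Julia sketch of this iteration rule (the function name and keyword defaults are illustrative, not from the lecture):

function secant_method(f, x0, x1; tol = 1e-8, maxiter = 100)
    x_prev, x = x0, x1
    for k in 1:maxiter
        slope = (f(x) - f(x_prev)) / (x - x_prev)  # finite difference approximation of f'
        x_new = x - f(x)/slope                     # root of the secant line
        abs(x_new - x) < tol && return x_new       # converged
        x_prev, x = x, x_new
    end
    return x
end

secant_method(x -> x^3 - 2, 1.0, 2.0)  # ≈ 1.2599, the cube root of 2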


Quasi-Newton: Broyden's method

Broyden's method is the most widely used rootfinding method for n-dimensional problems

It is a generalization of the secant method where we have a sequence of guesses of the Jacobian at the root

We must initially provide a guess of the root, x^{(0)}, but also a guess of the Jacobian, A^{(0)}

Quasi-Newton: Broyden's method

Root guess update is the same as before but with our guess of the Jacobian substituted in for the actual Jacobian or the finite difference approximation

x^{(k+1)} = x^{(k)} - [A^{(k)}]^{-1} f(x^{(k)})

We still need to update A^{(k)}: the update makes the smallest change, in terms of the Frobenius matrix norm, that satisfies the secant condition (underdetermined if n > 1): f(x^{(k+1)}) - f(x^{(k)}) = A^{(k+1)}\left(x^{(k+1)} - x^{(k)}\right)

The updated differences in root guesses, and the function value at those root guesses, should align with our estimate of the Jacobian at that point

A^{(k+1)} = A^{(k)} + \frac{\left[f(x^{(k+1)}) - f(x^{(k)}) - A^{(k)}\left(x^{(k+1)} - x^{(k)}\right)\right]\left(x^{(k+1)} - x^{(k)}\right)^T}{\left(x^{(k+1)} - x^{(k)}\right)^T\left(x^{(k+1)} - x^{(k)}\right)}

Accelerating Broyden

Why update the Jacobian and then invert when we can just update an inverted Jacobian B \equiv A^{-1}?

B^{(k+1)} = B^{(k)} + \frac{\left[d^{(k)} - u^{(k)}\right] d^{(k)T} B^{(k)}}{d^{(k)T} u^{(k)}}

where d^{(k)} = x^{(k+1)} - x^{(k)} and u^{(k)} = B^{(k)}\left[f(x^{(k+1)}) - f(x^{(k)})\right].

Broyden converges under relatively weak conditions:

  1. f is continuously differentiable,
  2. x^{(0)} is close to the root of f,
  3. f' is invertible around the root,
  4. A^{(0)} is sufficiently close to the Jacobian
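A minimal Julia sketch of Broyden's method with the inverse-Jacobian update (the function name, keyword defaults, and the identity-matrix initial guess for B^{(0)} are illustrative assumptions; a finite-difference Jacobian inverse is often a better starting point):

using LinearAlgebra

function broyden(f, x0; B0 = Matrix(1.0I, length(x0), length(x0)),
                 tol = 1e-8, maxiter = 100)
    x, B = x0, B0
    fx = f(x)
    for k in 1:maxiter
        x_new = x - B*fx                 # root of the local linear model
        fx_new = f(x_new)
        d = x_new - x                    # d^(k): step in x
        u = B*(fx_new - fx)              # u^(k): B^(k) times the change in f
        B += ((d - u)*(d'*B)) / (d'*u)   # inverse-Jacobian update from above
        x, fx = x_new, fx_new
        norm(fx) < tol && return x       # converged: f(x) ≈ 0
    end
    return x
end

Consistent with the conditions above, expect convergence only when x^{(0)} and B^{(0)} are sufficiently close to the root and the true (inverse) Jacobian.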

Convergence speed

Rootfinding algorithms will converge at different speeds in terms of the number of operations

A sequence of iterates x^{(k)} is said to converge to x^* at a rate of order p if there is a constant C such that

||x^{(k+1)} - x^*|| \leq C\,||x^{(k)} - x^*||^p

for sufficiently large k

If C < 1 and p = 1, the rate of convergence is linear

If 1 < p < 2, convergence is superlinear, and if p = 2 convergence is quadratic. The higher the order of the convergence rate, the faster it converges
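
For example, with C = 1 and p = 2, an error of 10^{-2} falls to 10^{-4} and then 10^{-8}: quadratic convergence roughly doubles the number of correct digits each iteration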

Convergence speed

How fast do the methods we've seen converge?

  • Bisection: linear rate with C=0.5 (kind of obvious once you see it)

  • Function iteration: linear rate with C = ||f'(x^*)||

  • Secant and Broyden: superlinear rate with p \approx 1.62

  • Newton: p=2


Convergence speed

Convergence rates only account for the number of iterations of the method

The steps taken in a given iteration of each solution method may vary in computational cost because of differences in the number of arithmetic operations

Although an algorithm may take more iterations to converge,
each iteration may be cheaper, so the overall algorithm can take less time

Convergence speed

Ex:

  • Bisection method only requires a single function evaluation during each iteration
  • Function iteration only requires a single function evaluation during each iteration
  • Broyden's method requires both a function evaluation and matrix multiplication
  • Newton's method requires a function evaluation, a derivative evaluation, and solving a linear system

Bisection and function iteration are usually slow

Broyden's method can be faster than Newton's method if derivatives are costly to compute


Convergence speed

Consider an example where f(x) = x - \sqrt{x} = 0

What does convergence look like across our main approaches in terms of the L^1-norm if all guesses start at x^{(0)} = 0.5?

Complementarity problems: rootfinding

Let:

  • x be an n-dimensional vector of some economic action
  • a_i denotes a lower bound on action i, and b_i denotes the upper bound on action i
  • f_i(x) denotes the marginal arbitrage profit of action i

There are disequilibrium profit opportunities if

  1. x_i < b_i and f_i(x) > 0 (here we can increase profits by raising x_i)
  2. x_i > a_i and f_i(x) < 0 (we can increase profits by decreasing x_i)
Complementarity problems

We obtain a no-arbitrage equilibrium if and only if x solves the complementarity problem CP(f,a,b)

We can write out the problem as finding a vector x \in [a,b] that solves
x_i > a_i \Rightarrow f_i(x) \geq 0 \quad i = 1,...,n
x_i < b_i \Rightarrow f_i(x) \leq 0 \quad i = 1,...,n

At an interior solution, the function must be precisely zero

At a corner solution at the upper bound b_i for x_i, f must be increasing in direction i

The opposite is true if we are at the lower bound


Complementarity problems

Economic problems are complementarity problems where we are
finding a root of a function (e.g. marginal profit) subject to some constraint (e.g. price floors)

The Karush-Kuhn-Tucker theorem shows that x solves the constrained optimization problem if and only if it solves the complementarity problem

Complementarity problems

Example: single commodity competitive spatial price equilibrium model

  • n regions of the world
  • excess demand for the commodity in region i is E_i(p_i)

If there is no trade, the equilibrium condition is E_i(p_i) = 0 in all regions of the world: a simple rootfinding problem

Trade between regions has marginal transportation cost c_{ij} between regions i and j

  • x_{ij}: the amount of the good shipped from region i to region j
  • capacity constraint: b_{ij}
Complementarity problems

Marginal arbitrage profit from shipping a unit of the good from i to j is p_j - p_i - c_{ij}

If it's positive: incentive to ship more goods from region i to region j

If it's negative: incentive to decrease shipments

We are at an equilibrium only if all the arbitrage opportunities are gone: for all region pairs i and j,
0 \leq x_{ij} \leq b_{ij}
x_{ij} > 0 \Rightarrow p_j - p_i - c_{ij} \geq 0
x_{ij} < b_{ij} \Rightarrow p_j - p_i - c_{ij} \leq 0

Complementarity problems

How do we formulate this as a complementarity problem?

Market clearing in each region i requires that net imports = excess demand

\sum_k \left[x_{ki} - x_{ik}\right] = E_i(p_i)

This implies that we can solve for the price in region i,

p_i = E_i^{-1}\left(\sum_k \left[x_{ki} - x_{ik}\right]\right)


Complementarity problems

Finally, if we define marginal arbitrage profit from shipping another unit from i to j as

f_{ij}(x) = E_j^{-1}\left(\sum_k \left[x_{kj} - x_{jk}\right]\right) - E_i^{-1}\left(\sum_k \left[x_{ki} - x_{ik}\right]\right) - c_{ij}

then x is an equilibrium vector of trade flows if and only if x solves CP(f,0,b), where x, f, and b are n^2 \times 1 vectors

Even this complex trade-equilibrium model can be reduced to a simple complementarity problem


Complementarity problems

Decentralized models can generally be solved by reducing them to complementarity problems, e.g.:

  1. Competitive energy markets with cross-market cap and trade or capacity constraints
  2. Federalism models
  3. etc etc

Complementarity problems

Four plots of marginal arbitrage profit f_i(x): what is the equilibrium choice of x?


How do we solve them?

A complementarity problem CP(f,a,b) can easily be re-framed as a rootfinding problem: \hat{f}(x) = \min(\max(f(x), a - x), b - x) = 0

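A one-line Julia sketch of this reformulation (assuming f, a, and b are already defined, with the bounds applied elementwise):

fhat(x) = min.(max.(f(x), a .- x), b .- x)  # solve fhat(x) = 0 with any rootfinder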

How do we solve them?

We can then solve the problem with conventional rootfinding techniques
or approximate this with a smooth function to make it easier on a rootfinder

The workhorse solver for complementarity problems is the PATH solver; a Julia wrapper can be found here

Maximization (minimization) methods

How we solve maximization problems has many similarities to rootfinding and complementarity problems

FOCs of an unconstrained maximization problem are just a rootfinding problem

FOCs of a constrained maximization problem are just a complementarity problem

I'll tend to frame problems as minimization problems because it is the convention in the optimization literature

We make two distinctions:

Local vs global: are we finding an optimum in a local region, or globally?

Derivative-using vs derivative-free: Do we want to use higher-order information?


Maximization (minimization) methods

I'll focus on local solvers, common global solvers I won't cover:

  1. Genetic algorithms
  2. Simulated annealing
  3. DIRECT
  4. StoGO

Derivative free optimization: Golden search

Similar to bisection, golden search looks for a solution of a one-dimensional problem over smaller and smaller brackets

If we have a continuous one dimensional function, f(x),
and we want to find a local minimum in some interval [a,b]

  1. Select points x_1, x_2 \in [a,b] where x_2 > x_1

  2. If f(x_1) < f(x_2), replace [a,b] with [a,x_2]; else replace [a,b] with [x_1,b]

  3. Repeat until convergence criterion is met

Replacing the endpoint next to the evaluated point with the higher value, while keeping the lower evaluated point inside the interval, guarantees that a local minimum still exists

Derivative free optimization: Golden search

How do we pick x_1 and x_2?

Achievable goal for selection process:

  • New interval is independent of whether the upper or lower bound is replaced
  • Only requires one function evaluation per iteration

There's one algorithm that satisfies this

Derivative free optimization: Golden search

Golden search algorithm for point selection:

x_i = a + \alpha_i(b - a), \qquad \alpha_1 = \frac{3 - \sqrt{5}}{2}, \quad \alpha_2 = \frac{\sqrt{5} - 1}{2}

The value of \alpha_2 is called the golden ratio and is where the algorithm gets its name

Write out a golden search algorithm

Golden search

function golden_search(f, lower_bound, upper_bound)
    alpha_1 = (3 - sqrt(5))/2   # golden search parameter 1
    alpha_2 = (sqrt(5) - 1)/2   # golden search parameter 2
    tolerance = 1e-2            # tolerance for convergence
    difference = 1e10
    while difference > tolerance
        x_1 = lower_bound + alpha_1*(upper_bound - lower_bound)  # new x_1
        x_2 = lower_bound + alpha_2*(upper_bound - lower_bound)  # new x_2
        if f(x_1) < f(x_2)      # reset bounds
            upper_bound = x_2
        else
            lower_bound = x_1
        end
        difference = x_2 - x_1
    end
    println("Minimum is at x = $((lower_bound+upper_bound)/2).")
end;
f(x) = 2x^2 + 9x;
golden_search(f, -10, 10)
## Minimum is at x = -2.2483173872886444.
Nelder-Mead: Simplex

Golden search is nice and simple but only works in one dimension

There are several derivative free methods for minimization that work in multiple dimensions, the most commonly used one is Nelder-Mead (NM)

NM works by first constructing a simplex: we evaluate the function at n+1 points in an n dimensional problem

It then manipulates the highest value point, similar to golden search

Nelder-Mead: Simplex

There are six operations:

  • Order: order the values at the vertices of the simplex f(x_1) \leq ... \leq f(x_{n+1})
  • Centroid: calculate x_0, the centroid of all points except x_{n+1}
  • Reflection: reflect x_{n+1} through the opposite face of the simplex and evaluate the new point: x_r = x_0 + \alpha(x_0 - x_{n+1}), \alpha > 0
    • If this improves upon the second-highest value point (i.e. it is lower) but is not the lowest value point, replace x_{n+1} with x_r and restart
    • If this is the lowest value point so far, go to step 4
    • If f(x_r) > f(x_n) go to step 5
Nelder-Mead: Simplex

  • Expansion: push the reflected point further in the same direction
  • Contract: contract the highest value point toward the middle
    • Compute x_c = x_0 + \gamma(x_0 - x_{n+1}), 0 < \gamma \leq 0.5
    • If x_c is better than the worst point, replace x_{n+1} with x_c and restart
    • Else go to step 6
  • Shrink: shrink the simplex toward the best point
    • Replace all points but the best one with x_i = x_1 + \sigma(x_i - x_1)

Nelder-Mead is a pain to code efficiently (i.e. don't spend the time doing it yourself) but is in the Optim.jl package


Nelder-Mead: Simplex

Nelder-Mead is commonly used but is slow and unreliable, with no useful convergence properties; avoid using it

What is a solution?

We typically want to find a global extremum, here a minimum,
of our objective function f

x^* is a global minimizer if f(x^*) \leq f(x) for all x over the domain of the function

Problem: most algorithms are local minimizers that find a point x^*
such that f(x^*) \leq f(x) for all x \in N, where N is a neighborhood of x^*

Typically analytical problems are set up to have a unique minimum so any local solver can generally find the global optimum

What is a solution?

Lots of problems have properties that don't satisfy the typical sufficiency conditions for a unique minimum (strictly decreasing and convex), like

  • Concave transitions
  • Games with multiple equilibria
  • Etc

How do we find a local minimum?

Do we need to evaluate every single point?

The general unconstrained approach

Optimization algorithms typically have the following set up:

  1. Start at some x_0
  2. Work through a series of iterates \{x_k\}_{k=1}^{\infty} until we have "converged" with sufficient accuracy

If the function is smooth, we can take advantage of that information
about the function's shape to figure out which direction to move in next

If f is twice continuously differentiable, we can use the gradient \nabla f and Hessian \nabla^2 f to figure out if x is a local minimizer

The general unconstrained approach

Taylor's Theorem tells us that if f is twice differentiable, then there exists a t \in (0,1) such that

f(x+p) = f(x) + \nabla f(x)^T p + \frac{1}{2!}\,p^T\,\nabla^2 f(x+tp)\,p

This is an exact equality

From here we can prove the usual necessary and sufficient conditions for a local optimum

Two large classes of algorithms

All modern algorithms have that general set up but may go about it in different ways

Most modern optimization problems fall into one of two classes:

  1. Line search
  2. Trust region

The relationship between these two approaches has a lot of similarities
to the relationship between the constrained problem and the dual Lagrange problem

Line search algorithms

General idea:

  1. Start at some current iterate x_k
  2. Select a direction to move in p_k
  3. Figure out how far along p_k to move

How do we figure out how far to move?

"Approximately" solve this problem to figure out the step length \alpha \min_{\alpha > 0} f(x_k + \alpha p_k)

We are finding the distance to move, \alpha in direction p_k that minimizes our objective f

Line search algorithms

Typically do not perform the full minimization problem since it is costly

We only try a limited number of step lengths \alpha before picking the best one and moving onto our next iterate x_{k+1}

We still haven't answered, what direction p_k do we decide to move in?

Line search: direction choice

What's an obvious choice for p_k?

The direction that yields the steepest descent

-\nabla \, f_k is the direction that makes f decrease most rapidly, k indicates we are evaluating f at iteration k

Line search: steepest descent

We can verify this is the direction of steepest descent by referring to Taylor's theorem

For any direction p and step length \alpha, we have that f(x_k + \alpha p) = f(x_k) + \alpha\,p^T\,\nabla\,f_k + \frac{1}{2!}\,\alpha^2p^T\,\nabla^2\,f(x_k+tp)\,p

The rate of change in f along p at x_k (\alpha = 0) is p^T \, \nabla\,f_k

Line search: steepest descent

The unit vector of quickest descent solves \min_p p^T\,\nabla\,f_k \,\,\,\,\, \text{subject to: }||p|| = 1

Re-express the objective as \min_{\theta,p} ||p||\,||\nabla\,f_k||cos\,\theta, where \theta is the angle between p and \nabla\,f_k

The minimum is attained when cos\,\theta = -1 and p = -\frac{\nabla\,f_k}{||\nabla\,f_k||}, so the direction of steepest descent is simply -\nabla\,f_k


Line search: steepest descent

The steepest descent method searches along this direction at every iteration k

It may select the step length \alpha_k in several different ways

A benefit of the algorithm is that we only require the gradient of the function, and no Hessian

However it can be very slow

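A minimal Julia sketch of steepest descent with a fixed step length (the function name, the fixed alpha, and the stopping rule are illustrative assumptions; in practice \alpha_k comes from a line search):

using LinearAlgebra

function steepest_descent(grad, x0; alpha = 0.1, tol = 1e-8, maxiter = 10_000)
    x = x0
    for k in 1:maxiter
        g = grad(x)                # gradient at the current iterate
        norm(g) < tol && return x  # stop when the gradient is near zero
        x -= alpha*g               # move in the direction -∇f_k
    end
    return x
end

steepest_descent(x -> 4x + 9, 0.0)  # minimizes f(x) = 2x^2 + 9x at x = -2.25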

Line search: alternative directions

We can always use search directions other than the steepest descent

Any descent direction, i.e. one that is within 45^\circ of -\nabla\,f_k,
is guaranteed to produce a decrease in f as long as the step size is sufficiently small

Line search: alternative directions

We can actually verify this with Taylor's theorem

f(x_k + \epsilon p_k) = f(x_k) + \epsilon\,p_k^T\,\nabla\,f_k + O(\epsilon^2)

If p_k is in a descending direction, \theta_k will be of an angle such that cos\,\theta_k < 0

This gives us

p_k^T\,\nabla f_k = ||p_k||\,||\nabla\,f_k||cos\,\theta_k < 0

Therefore f(x_k + \epsilon p_k) < f(x_k) for positive but sufficiently small \epsilon

Is -\nabla\,f_k always the best search direction?

Newton's method

The most important search direction is not steepest descent but Newton's direction

Newton's direction comes out of the second order Taylor series approximation to f(x_k + p) f(x_k + p) \approx f_k + p^T\,\nabla\,f_k + \frac{1}{2!}\,p^T\,\nabla^2f_k\,p Define this as \mathbf{m_k(p)}

We find the Newton direction by selecting the vector p that minimizes f(x_k + p)

This ends up being p^N_k = -\left[\nabla^2 f_k\right]^{-1}\nabla f_k

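A minimal Julia sketch of taking full Newton steps for minimization (the function name and keyword defaults are illustrative; g and H stand for routines returning the gradient and Hessian):

using LinearAlgebra

function newton_min(g, H, x0; tol = 1e-10, maxiter = 100)
    x = x0
    for k in 1:maxiter
        p = -(H(x) \ g(x))        # Newton direction: solve ∇²f_k p = -∇f_k
        norm(p) < tol && return x
        x += p                    # take the full step (no line search here)
    end
    return x
end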

Newton's method

This approximation to the function we are trying to solve has error of O(||p||^3),
so if p is small, the quadratic approximation is very accurate

Drawback: requires explicit computation of the Hessian, \nabla^2 \, f(x)

  • Quasi-Newton solvers also exist (e.g. BFGS, L-BFGS, etc)
Trust region methods

Trust region methods construct an approximating model, m_k
whose behavior near the current iterate x_k is close to that of the actual function f

We then search for a minimizer of m_k

Issue: m_k may not represent f well when far away from the current iterate x_k

Solution: Restrict the search for a minimizer to be within some region of x_k, called a trust region

76 / 103

Trust region methods

Trust region problems can be formulated as \min_p m_k(x_k + p) where

  • x_k+p \in \Gamma
  • \Gamma is a ball defined by ||p||_2 \leq \Delta
  • \Delta is called the trust region radius
77 / 103

Trust region methods

Typically the approximating model m_k is a quadratic function (i.e. a second-order Taylor approximation):

m_k(x_k + p) = f_k + p^T\,\nabla\,f_k + \frac{1}{2!}\,p^T\,B_k\,p

where B_k is the Hessian or an approximation to the Hessian

78 / 103
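
A minimal trust region run, again a sketch assuming Optim.jl, whose NewtonTrustRegion method uses the exact Hessian (here obtained by forward-mode autodiff) as B_k:

using Optim

f(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
res = optimize(f, [-1.0, 1.0], NewtonTrustRegion(); autodiff = :forward)
Optim.minimizer(res)             # ≈ [1.0, 1.0]
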

Line search vs trust region

What's the fundamental difference between line search and trust region?

79 / 103

Line search vs trust region

What's the fundamental difference between line search and trust region?

Line search first picks a direction then searches along that direction for the optimal step length

Trust region first defines our step length via the trust region radius, then searches for the optimal direction

79 / 103

Line search vs trust region

There is a special case of the trust region problem: if we set the approximate Hessian B_k to zero, the solution is p_k = -\frac{\Delta_k \,\nabla \,f_k}{||\nabla\,f_k||}

This is just the steepest descent solution for the line search problem

80 / 103

Problem scaling

The scaling of a problem matters for optimization performance

81 / 103

Problem scaling

The scaling of a problem matters for optimization performance

A problem is poorly scaled if changes to x in a certain direction
produce much bigger changes in f than changes to x in another direction

81 / 103

Problem scaling

The scaling of a problem matters for optimization performance

A problem is poorly scaled if changes to x in a certain direction
produce much bigger changes in f than changes to x in another direction

81 / 103

Problem scaling

Ex: f(x) = 10^9 x_1^2 + x_2^2 is poorly scaled

82 / 103

Problem scaling

Ex: f(x) = 10^9 x_1^2 + x_2^2 is poorly scaled

This happens when things change at different rates:

  • Investment rates are between 0 and 1
  • Consumption can be in trillions of dollars
82 / 103

Problem scaling

Ex: f(x) = 10^9 x_1^2 + x_2^2 is poorly scaled

This happens when things change at different rates:

  • Investment rates are between 0 and 1
  • Consumption can be in trillions of dollars

How do we solve this issue?

82 / 103

Problem scaling

Ex: f(x) = 10^9 x_1^2 + x_2^2 is poorly scaled

This happens when things change at different rates:

  • Investment rates are between 0 and 1
  • Consumption can be in trillions of dollars

How do we solve this issue?

Rescale the problem: put the variables in units that are
generally within an order of magnitude of 1

  • Investment rate in percentage terms: 0\%-100\%
  • Consumption in units of trillion dollars instead of dollars
82 / 103
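
A sketch of the fix on this slide's example, assuming Optim.jl's GradientDescent; substituting y_1 = \sqrt{10^9}\,x_1 equalizes the curvature in both directions:

using Optim

f_bad(x)  = 1e9 * x[1]^2 + x[2]^2   # curvatures differ by a factor of 10^9
f_good(y) = y[1]^2 + y[2]^2         # after the substitution y₁ = √(1e9) x₁

res_bad  = optimize(f_bad,  [1.0, 1.0], GradientDescent())
res_good = optimize(f_good, [1.0, 1.0], GradientDescent())
# the poorly scaled version typically stalls at the iteration limit,
# while the rescaled version converges almost immediately
println(Optim.converged(res_bad),  " after ", Optim.iterations(res_bad),  " iterations")
println(Optim.converged(res_good), " after ", Optim.iterations(res_good), " iterations")
# recover the original variable afterwards: x₁ = y₁ / √(1e9)
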

Constrained optimization

How do we solve constrained optimization problems?

83 / 103

Constrained optimization

How do we solve constrained optimization problems?

Typically as a variant of unconstrained optimization techniques

83 / 103

Constrained optimization

How do we solve constrained optimization problems?

Typically as a variant of unconstrained optimization techniques

We will discuss three types of constrained optimization algorithms

  • Penalty methods
  • Active set methods
  • Interior point methods
83 / 103

Constrained optimization

These are the algorithms in workhorse commercial solvers: KNITRO

84 / 103

Constrained optimization

These are the algorithms in workhorse commercial solvers: KNITRO

85 / 103

Constrained optimization

These are the algorithms in workhorse commercial solvers: fmincon/MATLAB

86 / 103

Constrained optimization: Penalty methods

Suppose we wish to minimize some function subject to equality constraints (easily generalizes to inequality):

\min_x f(x) \,\,\, \text{subject to:} \,\, c_i(x) = 0

87 / 103

Constrained optimization: Penalty methods

Suppose we wish to minimize some function subject to equality constraints (easily generalizes to inequality):

\min_x f(x) \,\,\, \text{subject to:} \,\, c_i(x) = 0

How does an algorithm know to not violate the constraint?

87 / 103

Constrained optimization: Penalty methods

Suppose we wish to minimize some function subject to equality constraints (easily generalizes to inequality):

\min_x f(x) \,\,\, \text{subject to:} \,\, c_i(x) = 0

How does an algorithm know to not violate the constraint?

One way is to introduce a penalty function into our objective and remove the constraint:

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

where \mu is the penalty parameter

87 / 103

Constrained optimization: Penalty methods

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

88 / 103

Constrained optimization: Penalty methods

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

The second term increases the value of the function: bigger \mu \rightarrow bigger penalty for violating the constraint

88 / 103

Constrained optimization: Penalty methods

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

The second term increases the value of the function: bigger \mu \rightarrow bigger penalty for violating the constraint

The penalty terms are smooth \rightarrow we can use unconstrained optimization techniques
to solve the problem by searching over iterates x_k

88 / 103

Constrained optimization: Penalty methods

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

The second term increases the value of the function: bigger \mu \rightarrow bigger penalty for violating the constraint

The penalty terms are smooth \rightarrow we can use unconstrained optimization techniques
to solve the problem by searching over iterates x_k

We also generally iterate on a sequence \mu_k \rightarrow \infty as k \rightarrow \infty, forcing the constraints to hold as we close in on the solution

88 / 103

Constrained optimization: Penalty methods

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

The second term increases the value of the function: bigger \mu \rightarrow bigger penalty for violating the constraint

The penalty terms are smooth \rightarrow we can use unconstrained optimization techniques
to solve the problem by searching over iterates x_k

We also generally iterate on a sequence \mu_k \rightarrow \infty as k \rightarrow \infty, forcing the constraints to hold as we close in on the solution

There are also augmented Lagrangian methods that take the quadratic penalty method and add in explicit estimates of Lagrange multipliers to help force binding constraints to bind precisely

88 / 103

Constrained optimization: Penalty method example

Example: \min x_1 + x_2 \,\,\,\,\,\text{ subject to: } \,\,\, x_1^2 + x_2^2 - 2 = 0

89 / 103

Constrained optimization: Penalty method example

Example: \min x_1 + x_2 \,\,\,\,\,\text{ subject to: } \,\,\, x_1^2 + x_2^2 - 2 = 0

The solution is easily shown to be (-1, -1)

89 / 103

Constrained optimization: Penalty method example

Example: \min x_1 + x_2 \,\,\,\,\,\text{ subject to: } \,\,\, x_1^2 + x_2^2 - 2 = 0

The solution is easily shown to be (-1, -1)

The penalty method function Q(x_1, x_2; \mu) is

Q(x_1, x_2; \mu) = x_1 + x_2 + \frac{\mu}{2} (x_1^2 + x_2^2 - 2)^2

89 / 103

Constrained optimization: Penalty method example

Example: \min x_1 + x_2 \,\,\,\,\,\text{ subject to: } \,\,\, x_1^2 + x_2^2 - 2 = 0

The solution is easily shown to be (-1, -1)

The penalty method function Q(x_1, x_2; \mu) is

Q(x_1, x_2; \mu) = x_1 + x_2 + \frac{\mu}{2} (x_1^2 + x_2^2 - 2)^2

Let's ramp up \mu and see what happens to how the function looks

89 / 103

Constrained optimization: Penalty method example

\mu = 1, solution is around (-1.1, -1.1)

90 / 103

Constrained optimization: Penalty method example

\mu = 10, solution is very close to (-1, -1); we can easily see the trough, and the rapid increase in value outside x_1^2 + x_2^2 = 2

91 / 103
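
A sketch of the full penalty iteration on this example, assuming Optim.jl's NelderMead; each subproblem is warm-started at the previous solution, and the minimizers approach (-1, -1) as \mu grows:

using Optim

Q(x, μ) = x[1] + x[2] + (μ / 2) * (x[1]^2 + x[2]^2 - 2)^2

function penalty_path(μs, x0)
    x = x0
    for μ in μs
        # solve the unconstrained subproblem, warm-started at the last solution
        x = Optim.minimizer(optimize(x -> Q(x, μ), x, NelderMead()))
        println("μ = ", μ, ": x ≈ ", x)   # μ = 1 gives roughly (-1.1, -1.1)
    end
    return x
end

penalty_path((1.0, 10.0, 100.0), [-1.0, -1.0])
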

Constrained optimization: Active set methods

Active set methods encapsulate sequential quadratic programming (SQP) methods

92 / 103

Constrained optimization: Active set methods

Active set methods encapsulate sequential quadratic programming (SQP) methods

Main idea:

  1. Replace the large non-linear constrained problem with a sequence of constrained quadratic programming subproblems
  2. Use Newton's method to solve the sequence of simpler quadratic problems
92 / 103

Constrained optimization: Active set methods

Active set methods encapsulate sequential quadratic programming (SQP) methods

Main idea:

  1. Replace the large non-linear constrained problem with a sequence of constrained quadratic programming subproblems
  2. Use Newton's method to solve the sequence of simpler quadratic problems

The Lagrangian is L(x,\lambda) = f(x) - \lambda^T\,c(x)

92 / 103

Constrained optimization: Active set methods

Active set methods encapsulate sequential quadratic programming (SQP) methods

Main idea:

  1. Replace the large non-linear constrained problem with a sequence of constrained quadratic programming subproblems
  2. Use Newton's method to solve the sequence of simpler quadratic problems

The Lagrangian is L(x,\lambda) = f(x) - \lambda^T\,c(x)

Denote A(x)^T as the Jacobian of the constraints: A(x)^T = [\nabla\,c_1(x),...,\nabla\,c_m(x)]

92 / 103

Constrained optimization: Active set methods

The first-order conditions F(x,\lambda) = 0 can be written as \begin{gather} \nabla\,f(x) - A(x)^T\,\lambda = 0 \notag\\ c(x) = 0 \notag \end{gather}

Any solution x^* to the equality constrained problem at which A(x^*) has full rank also satisfies these first-order necessary conditions

93 / 103

Constrained optimization: Active set methods

The first-order conditions F(x,\lambda) = 0 can be written as \begin{gather} \nabla\,f(x) - A(x)^T\,\lambda = 0 \notag\\ c(x) = 0 \notag \end{gather}

Any solution x^* to the equality constrained problem at which A(x^*) has full rank also satisfies these first-order necessary conditions

Active set methods use Newton's method to find the solution (x^*, \lambda^*) of F(x,\lambda)

93 / 103
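
A sketch of that Newton iteration on the earlier example, \min x_1 + x_2 subject to x_1^2 + x_2^2 - 2 = 0, with F and its Jacobian coded by hand (z stacks (x_1, x_2, \lambda)):

F(z) = [1 - 2 * z[3] * z[1],      # ∇f(x) - A(x)ᵀλ, first component
        1 - 2 * z[3] * z[2],      # second component
        z[1]^2 + z[2]^2 - 2]      # c(x) = 0

function JF(z)                    # Jacobian of F with respect to (x₁, x₂, λ)
    x1, x2, λ = z
    return [-2λ 0.0 -2x1;
            0.0 -2λ -2x2;
            2x1 2x2 0.0]
end

function newton_solve(z; iters = 10)
    for _ in 1:iters
        z = z - JF(z) \ F(z)      # Newton step on the full system
    end
    return z
end

newton_solve([-0.8, -0.8, -1.0])  # ≈ [-1.0, -1.0, -0.5]
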

Constrained optimization: Active set methods

Issue: if we have many constraints, keeping track of all of them can be expensive

94 / 103

Constrained optimization: Active set methods

Issue: if we have many constraints, keeping track of all of them can be expensive

Main idea: recognize that if an inequality constraint is not binding (i.e. not active), then it has no influence on the solution

\rightarrow in the iteration procedure we can effectively ignore it

94 / 103

Constrained optimization: Active set methods

Issue: if we have many constraints, keeping track of all of them can be expensive

Main idea: recognize that if an inequality constraint is not binding (i.e. not active), then it has no influence on the solution

\rightarrow in the iteration procedure we can effectively ignore it

Active set methods find ways to reduce the complexity of the optimization routine
by selectively ignoring inequality constraints that are neither active (their Lagrange multipliers are zero by complementary slackness) nor close to becoming active

94 / 103

Constrained optimization: Interior point methods

Interior point methods are also called barrier methods

95 / 103

Constrained optimization: Interior point methods

Interior point methods are also called barrier methods

These are typically used for inequality constrained problems

95 / 103

Constrained optimization: Interior point methods

Interior point methods are also called barrier methods

These are typically used for inequality constrained problems

The name interior point comes from the algorithm traversing the domain along the interior of the inequality constraints

95 / 103

Constrained optimization: Interior point methods

Interior point methods are also called barrier methods

These are typically used for inequality constrained problems

The name interior point comes from the algorithm traversing the domain along the interior of the inequality constraints

Issue: how do we ensure we are on the interior of the feasible set?

95 / 103

Constrained optimization: Interior point methods

Interior point methods are also called barrier methods

These are typically used for inequality constrained problems

The name interior point comes from the algorithm traversing the domain along the interior of the inequality constraints

Issue: how do we ensure we are on the interior of the feasible set?

Main idea: impose a barrier to stop the solver from letting a constraint bind

95 / 103

Constrained optimization: Interior point methods

Consider the following constrained optimization problem \begin{gather} \min_{x} f(x) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) \geq 0 \end{gather}

96 / 103

Constrained optimization: Interior point methods

Consider the following constrained optimization problem \begin{gather} \min_{x} f(x) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) \geq 0 \end{gather}

Reformulate this problem as \begin{gather} \min_{x,s} f(x) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) - s = 0, s \geq 0 \end{gather}

where s is a vector of slack variables for the constraints

96 / 103

Constrained optimization: Interior point methods

Final step: introduce a barrier function to eliminate the inequality constraint:

\begin{gather} \min_{x,s} f(x) - \mu \sum_{i=1}^m \log(s_i) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) - s = 0 \end{gather}

where \mu is a positive barrier parameter

97 / 103

Constrained optimization: Interior point methods

Final step: introduce a barrier function to eliminate the inequality constraint:

\begin{gather} \min_{x,s} f(x) - \mu \sum_{i=1}^m \log(s_i) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) - s = 0 \end{gather}

where \mu is a positive barrier parameter

The barrier function prevents the components of s from approaching
zero by imposing a logarithmic barrier \rightarrow it maintains slack in the constraints

97 / 103

Constrained optimization: Interior point methods

Final step: introduce a barrier function to eliminate the inequality constraint:

\begin{gather} \min_{x,s} f(x) - \mu \sum_{i=1}^m \log(s_i) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) - s = 0 \end{gather}

where \mu is a positive barrier parameter

The barrier function prevents the components of s from approaching
zero by imposing a logarithmic barrier \rightarrow it maintains slack in the constraints

Interior point methods solve a sequence of barrier problems until \{\mu_k\} converges to zero

97 / 103

Constrained optimization: Interior point methods

Final step: introduce a barrier function to eliminate the inequality constraint:

\begin{gather} \min_{x,s} f(x) - \mu \sum_{i=1}^m \log(s_i) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) - s = 0 \end{gather}

where \mu is a positive barrier parameter

The barrier function prevents the components of s from approaching
zero by imposing a logarithmic barrier \rightarrow it maintains slack in the constraints

Interior point methods solve a sequence of barrier problems until \{\mu_k\} converges to zero

The solution to the barrier problem converges to that of the original problem

97 / 103
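
A tiny log-barrier illustration on a hypothetical one-dimensional problem (not from the slides): \min_x x subject to x \geq 1. The barrier subproblem \min_x x - \mu \log(x-1) has minimizer x = 1 + \mu, which approaches the constrained solution as \mu \rightarrow 0:

using Optim

barrier(x, μ) = x > 1 ? x - μ * log(x - 1) : Inf   # +Inf outside the interior

for μ in (1.0, 0.1, 0.01, 0.001)
    res = optimize(x -> barrier(x, μ), 1.0, 3.0)   # univariate Brent search
    println("μ = ", μ, ": x* ≈ ", Optim.minimizer(res))   # → 1 + μ → 1
end
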

Best practices for optimization

Plug in your guess, let the solver go, and you're done, right?

98 / 103

Best practices for optimization

Plug in your guess, let the solver go, and you're done, right?

WRONG

98 / 103

Best practices for optimization

Plug in your guess, let the solver go, and you're done, right?

WRONG

These algorithms are not guaranteed to find even a local solution; you need to test and make sure you are converging correctly

98 / 103

Check exit flags: KNITRO-specific numbers here

Exit flags tell you why the solver stopped; in KNITRO, exit flags of 0 or -10X are generally good, anything else is bad

A -10X flag can indicate bad scaling, ill-conditioning, etc

99 / 103
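
Open-source solvers expose the same information; a hedged sketch of the Optim.jl analogue, using its documented accessor functions:

using Optim

res = optimize(x -> (x[1] - 1)^2 + (x[2] + 2)^2, zeros(2), BFGS())
Optim.converged(res)                  # did a convergence criterion trigger?
Optim.iteration_limit_reached(res)    # stopping at the cap is a warning sign
Optim.iterations(res)
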

Try alternative algorithms

Optimization is approximately 53% art

100 / 103

Try alternative algorithms

Optimization is approximately 53% art

Not all algorithms are suited for every problem \rightarrow it is useful to check how different algorithms perform

100 / 103

Try alternative algorithms

Optimization is approximately 53% art

Not all algorithms are suited for every problem \rightarrow it is useful to check how different algorithms perform

Interior-point is usually the default in constrained optimization solvers (low memory usage, fast), but try other algorithms and see if the solution generally remains the same

100 / 103
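
A sketch of that cross-check with Optim.jl (hypothetical example function): run several algorithms and confirm they agree on the minimizer:

using Optim

f(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
for alg in (NelderMead(), LBFGS(), NewtonTrustRegion())
    res = optimize(f, [-1.0, 1.0], alg)
    println(nameof(typeof(alg)), ": ", Optim.minimizer(res))
end
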

Be aware of tolerances

Two main tolerances in optimization:

  1. ftol is the tolerance for the change in the function value (absolute and relative)
  2. xtol is the tolerance for the change in the input values (absolute and relative)
101 / 103

Be aware of tolerances

Two main tolerances in optimization:

  1. ftol is the tolerance for the change in the function value (absolute and relative)
  2. xtol is the tolerance for the change in the input values (absolute and relative)

What is a suitable tolerance?

101 / 103

Be aware of tolerances

Two main tolerances in optimization:

  1. ftol is the tolerance for the change in the function value (absolute and relative)
  2. xtol is the tolerance for the change in the input values (absolute and relative)

What is a suitable tolerance?

It depends

101 / 103

Be aware of tolerances

Two main tolerances in optimization:

  1. ftol is the tolerance for the change in the function value (absolute and relative)
  2. xtol is the tolerance for the change in the input values (absolute and relative)

What is a suitable tolerance?

It depends

Explore sensitivity to tolerance, typically pick a conservative (small) number

  • Defaults in solvers are usually 1e-6
101 / 103
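
A sketch of setting and stress-testing tolerances, assuming Optim.jl's Options keywords (exact names can vary across versions); the quartic term makes the objective flat near the optimum, so a loose ftol stops noticeably early:

using Optim

f(x) = (x[1] - 1)^2 + (x[2] + 1)^4
tight = optimize(f, zeros(2), LBFGS(), Optim.Options(f_tol = 1e-10))
loose = optimize(f, zeros(2), LBFGS(), Optim.Options(f_tol = 1e-3))

# a visible gap between the two minimizers means the loose tolerance binds
Optim.minimizer(tight) - Optim.minimizer(loose)
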

Be aware of tolerances

There may be a substantial tradeoff between the accuracy of your solution and speed

102 / 103

Be aware of tolerances

There may be a substantial tradeoff between the accuracy of your solution and speed

A common bad practice is to pick a larger tolerance (e.g. 1e-3) so the problem "works" (e.g. so your big MLE converges)

102 / 103

Be aware of tolerances

There may be a substantial tradeoff between the accuracy of your solution and speed

A common bad practice is to pick a larger tolerance (e.g. 1e-3) so the problem "works" (e.g. so your big MLE converges)

The issue is that 1e-3 might be pretty big for your problem
if you haven't checked that your solution is insensitive to the tolerance

102 / 103

Perturb your initial guesses

Initial guesses matter

103 / 103

Perturb your initial guesses

Initial guesses matter

Good ones can improve performance

  • e.g. the initial guess for the next iteration's coefficient estimates should be the current iteration's estimates
103 / 103

Perturb your initial guesses

Initial guesses matter

Good ones can improve performance

  • e.g. the initial guess for the next iteration's coefficient estimates should be the current iteration's estimates

Bad ones can give you terrible performance, or wrong answers if your problem isn't perfect

  • e.g. bad scaling, not well-conditioned, multiple equilibria
103 / 103
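
A sketch of the perturbation check, assuming Optim.jl (the example function and jitter size are hypothetical):

using Optim, Random

Random.seed!(42)
f(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
x0 = [-1.0, 1.0]

sols = [Optim.minimizer(optimize(f, x0 .+ 0.5 .* randn(2), LBFGS()))
        for _ in 1:10]
# agreement across starts is reassuring; disagreement suggests multiple optima
# or a fragile problem (bad scaling, ill-conditioning)
all(s -> isapprox(s, sols[1]; atol = 1e-4), sols)
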
