
Lecture 4

Optimization

Ivan Rudik

AEM 7130


Software and stuff

Necessary things to do:

  • Install the QuantEcon Julia package
  • Install the Optim.jl or NLopt.jl Julia packages

Optimization

All econ problems are optimization problems

  • Min costs

  • Max PV E[welfare]

Some are harder than others:

  • Individual utility max: easy

  • Decentralized electricity market with nodal pricing and market power: hard

  • One input profit maximization problem: easy

  • N-input profit maximization with learning and forecasts: hard


Things we will do

  1. Linear rootfinding
  2. Non-linear rootfinding
  3. Complementarity problems
  4. Non-linear unconstrained maximization/minimization
  5. Non-linear constrained maximization/minimization
Linear rootfinding

How do we solve these?

Consider a simple generic problem:

Ax=b

Invert A

x = A^{-1}b

THE END

Non-linear rootfinding

With non-linear rootfinding problems we want to solve:

f(x) = 0, \quad f: \mathbb{R}^n \rightarrow \mathbb{R}^n

What's a common rootfinding problem?

Can we reframe a common economic problem as rootfinding?

Yes!

Fixed point problems are rootfinding problems:

g(x) = x \Leftrightarrow f(x) \equiv g(x) - x = 0

Rootfinding with constraints

Economic problems virtually always involve constraints

These constraints may or may not be binding

e.g. a firm profit max problem: a control vector x may be constrained to be x \in [a,b]

This results in a complementarity problem:

x_i > a_i \Rightarrow f_i(x) \geq 0 \quad i = 1,...,n
x_i < b_i \Rightarrow f_i(x) \leq 0 \quad i = 1,...,n

What do these equations mean?

Rootfinding with constraints

x_i > a_i \Rightarrow f_i(x) \geq 0 \quad i = 1,...,n
x_i < b_i \Rightarrow f_i(x) \leq 0 \quad i = 1,...,n

If the constraints on x_i do not bind, a_i < x_i < b_i, then the first-order condition is precisely zero

Suppose the upper bound binds, x_i = b_i

Then f_i(x) \geq 0 since x_i > a_i, but we can't guarantee that f_i(x) = 0 because marginal profit f_i(x) may still be positive at the bound

Basic non-linear rootfinders: Bisection method

What does the intermediate value theorem tell us?

If a continuous real-valued function on a given interval takes on two values a and b,
it achieves all values in the set [a,b] somewhere in its domain

How can this motivate an algorithm to find the root of a function?

If we have a continuous, 1 variable function that is positive at some value and negative at another, a root must fall in between those values

We know a root exists by IVT, what's an efficient way to find it?

Continually bisect the interval!

Write out the code to do it


The bisection algorithm

function bisection(f, lower_bound, upper_bound)
    tolerance = 1e-3                            # tolerance for solution
    guess = 0.5*(upper_bound + lower_bound)     # initial guess, bisect the interval
    difference = (upper_bound - lower_bound)/2  # initialize bound difference
    while difference > tolerance                # loop until convergence
        println("Intermediate guess of $guess.")
        difference = difference/2
        if sign(f(lower_bound)) == sign(f(guess))  # guess has the same sign as the lower bound:
            lower_bound = guess                    # solution is in the upper half of the interval
            guess = guess + difference
        else                                       # else the solution is in the lower half of the interval
            upper_bound = guess
            guess = guess - difference
        end
    end
    println("The root of f(x) is at $guess.")
end;

The bisection method

f(x) = x^3
## f (generic function with 1 method)
bisection(f, -4, 1)
## Intermediate guess of -1.5.
## Intermediate guess of -0.25.
## Intermediate guess of 0.375.
## Intermediate guess of 0.0625.
## Intermediate guess of -0.09375.
## Intermediate guess of -0.015625.
## Intermediate guess of 0.0234375.
## Intermediate guess of 0.00390625.
## Intermediate guess of -0.005859375.
## Intermediate guess of -0.0009765625.
## Intermediate guess of 0.00146484375.
## Intermediate guess of 0.000244140625.
## The root of f(x) is at -0.0003662109375.
g(x) = 3x^3 + 2x -4
## g (generic function with 1 method)
bisection(g, -6, 4)
## Intermediate guess of -1.0.
## Intermediate guess of 1.5.
## Intermediate guess of 0.25.
## Intermediate guess of 0.875.
## Intermediate guess of 1.1875.
## Intermediate guess of 1.03125.
## Intermediate guess of 0.953125.
## Intermediate guess of 0.9140625.
## Intermediate guess of 0.89453125.
## Intermediate guess of 0.904296875.
## Intermediate guess of 0.8994140625.
## Intermediate guess of 0.90185546875.
## Intermediate guess of 0.900634765625.
## The root of f(x) is at 0.9012451171875.
h(x) = cos(x)
## h (generic function with 1 method)
bisection(h, -pi, pi)
## Intermediate guess of 0.0.
## Intermediate guess of -1.5707963267948966.
## Intermediate guess of -2.356194490192345.
## Intermediate guess of -1.9634954084936207.
## Intermediate guess of -1.7671458676442586.
## Intermediate guess of -1.6689710972195777.
## Intermediate guess of -1.6198837120072371.
## Intermediate guess of -1.595340019401067.
## Intermediate guess of -1.5830681730979819.
## Intermediate guess of -1.5769322499464393.
## Intermediate guess of -1.573864288370668.
## Intermediate guess of -1.5723303075827824.
## The root of f(x) is at -1.5715633171888395.

The bisection method

The bisection method works by continually bisecting the interval and only keeping the half interval with the zero until "convergence"

  1. Select the midpoint of [a,b], (a+b)/2
  2. Zero must be in the lower or upper half
  3. Check the sign of the midpoint; if it has the same sign as the lower bound, a root must be in the right subinterval
  4. Select the midpoint of [(a+b)/2,b]...
The bisection method

The bisection method is incredibly robust: if a function f satisfies the conditions of the IVT, it is guaranteed to converge in a known number of iterations

A root can be calculated to arbitrary precision \epsilon
in a maximum of \log_2([b-a]/\epsilon) iterations
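
For example, for the f(x) = x^3 run above with [a,b] = [-4,1] and \epsilon = 10^{-3}, the bound is \log_2(5/10^{-3}) \approx 12.3 iterations, matching the dozen intermediate guesses printed earlier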

Robustness comes with drawbacks:

  1. It only works in one dimension
  2. It is slow because it only uses information about the function's level
Function iteration

Fixed points can be computed using function iteration: iterate x^{(k+1)} = f(x^{(k)}) until the iterates stop changing

Since we can recast fixed points as rootfinding problems we can use function iteration to find roots too

Code up a function iteration problem


Function iteration

Function iteration is pretty simple to implement

function function_iteration(f, guess)
    tolerance = 1e-3    # tolerance for solution
    x_old = guess       # initialize old x value
    x = 1e-10           # initialize current x
    difference = 1e10   # initialize difference between iterates
    @time while abs(difference) > tolerance  # loop until convergence
        println("Intermediate guess of $x.")
        x = f(x_old)    # new x = f(old x)
        difference = x - x_old
        x_old = x
    end
    println("The fixed point of f(x) is at $x.")
end;
f(x) = x^(-0.5)
## f (generic function with 1 method)
function_iteration(f, 2.)
## Intermediate guess of 1.0e-10.
## Intermediate guess of 0.7071067811865476.
## Intermediate guess of 1.189207115002721.
## Intermediate guess of 0.9170040432046712.
## Intermediate guess of 1.0442737824274138.
## Intermediate guess of 0.9785720620877002.
## Intermediate guess of 1.0108892860517005.
## Intermediate guess of 0.9945994234836332.
## Intermediate guess of 1.0027112750502025.
## Intermediate guess of 0.9986471128909702.
## Intermediate guess of 1.0006771306930664.
## Intermediate guess of 0.9996616064962437.
## 0.002017 seconds (924 allocations: 233.763 KiB)
## The fixed point of f(x) is at 1.0001692397053021.

Function iteration

Function iteration can be quick, but is not always guaranteed to converge

Newton's method

Newton's method and variants are the workhorses of solving n-dimensional non-linear problems

What's the idea?

Take a hard non-linear problem and replace it with a sequence of linear problems

Under certain conditions the sequence of solutions will converge to the true solution


Newton's method

Here's a graphical depiction of Newton's method:

Newton's method

Start with an initial guess of the root at x^{(0)}

Approximate the non-linear function with its first-order Taylor expansion about x^{(0)}

This is just the tangent line at x^{(0)}; solve for the root of this linear approximation and call it x^{(1)}

Repeat starting at x^{(1)}

This can be applied to a function with an arbitrary number of dimensions

Newton's method

Begin with some initial guess of the root vector x^{(0)}

Our new guess x^{(k+1)} given some arbitrary point in the algorithm, x^{(k)}, is obtained by approximating f(x) using a first-order Taylor expansion about x^{(k)} and solving for x:
f(x) \approx f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)}) = 0 \Rightarrow x^{(k+1)} = x^{(k)} - [f'(x^{(k)})]^{-1}f(x^{(k)})

Newton's method

Code up Newton's method

function newtons_method(f, f_prime, guess)
    diff = Inf    # initialize difference between iterates
    tol = 1e-5    # tolerance for solution
    x_old = guess
    x = 1e10
    while abs(diff) > tol
        x = x_old - f(x_old)/f_prime(x_old)  # root of the linear approximation
        diff = x - x_old
        x_old = x
    end
    println("The root of f(x) is at $x.")
end;
Newton's method

f(x) = x^3;
f_prime(x) = 3x^2;
newtons_method(f, f_prime, 1.)
## The root of f(x) is at 1.231347218094855e-6.
f(x) = sin(x);
f_prime(x) = cos(x);
newtons_method(f, f_prime, pi/4)
## The root of f(x) is at 5.941936124988917e-19.
Newton's method

Newton's method has nice properties regarding convergence and speed:

If f(x) is continuously differentiable, the initial guess is "sufficiently close" to the root, and f'(x) is invertible near the root, then Newton's method converges to the root

What is "sufficiently close"?

We need f'(x) to be invertible so the iteration above is well defined

If f'(x) is ill-conditioned we can run into problems with rounding error

Quasi-Newton: Secant method

We usually don't want to deal with analytic derivatives unless we have access to autodifferentiation

Why?

  1. Coding error / time
  2. Can actually be slower to evaluate than finite differences for a nonlinear problem, see Ken Judd's notes

Using our current root guess x^{(k)} and our previous root guess x^{(k-1)}:

f'(x^{(k)}) \approx \frac{f(x^{(k)}) - f(x^{(k-1)})}{x^{(k)} - x^{(k-1)}}

Quasi-Newton: Secant method

Our new iteration rule then becomes

x^{(k+1)} = x^{(k)} - \frac{x^{(k)} - x^{(k-1)}}{f(x^{(k)}) - f(x^{(k-1)})}\,f(x^{(k)})

Now we require two initial guesses so that we have an initial approximation of the derivative

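A minimal Julia sketch of this iteration rule (the function name and keyword defaults are illustrative, not from the lecture):

function secant_method(f, x0, x1; tol = 1e-8, maxiter = 100)
    x_prev, x = x0, x1
    for k in 1:maxiter
        slope = (f(x) - f(x_prev)) / (x - x_prev)  # finite difference approximation of f'
        x_new = x - f(x)/slope                     # root of the secant line
        abs(x_new - x) < tol && return x_new       # converged
        x_prev, x = x, x_new
    end
    return x
end

secant_method(x -> x^3 - 2, 1.0, 2.0)  # ≈ 1.2599, the cube root of 2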


Quasi-Newton: Broyden's method

Broyden's method is the most widely used rootfinding method for n-dimensional problems

It is a generalization of the secant method where we have a sequence of guesses of the Jacobian at the root

We must initially provide a guess of the root, x^{(0)}, but also a guess of the Jacobian, A^{(0)}

Quasi-Newton: Broyden's method

Root guess update is the same as before but with our guess of the Jacobian substituted in for the actual Jacobian or the finite difference approximation

x^{(k+1)} = x^{(k)} - [A^{(k)}]^{-1} f(x^{(k)})

We still need to update A^{(k)}: the update makes the smallest change, in terms of the Frobenius matrix norm, that satisfies the secant condition (underdetermined if n > 1): f(x^{(k+1)}) - f(x^{(k)}) = A^{(k+1)}\left(x^{(k+1)} - x^{(k)}\right)

The updated differences in root guesses, and the function value at those root guesses, should align with our estimate of the Jacobian at that point

A^{(k+1)} = A^{(k)} + \frac{\left[f(x^{(k+1)}) - f(x^{(k)}) - A^{(k)}\left(x^{(k+1)} - x^{(k)}\right)\right]\left(x^{(k+1)} - x^{(k)}\right)^T}{\left(x^{(k+1)} - x^{(k)}\right)^T\left(x^{(k+1)} - x^{(k)}\right)}

Accelerating Broyden

Why update the Jacobian and then invert when we can just update an inverted Jacobian B \equiv A^{-1}?

B^{(k+1)} = B^{(k)} + \frac{\left[d^{(k)} - u^{(k)}\right] d^{(k)T} B^{(k)}}{d^{(k)T} u^{(k)}}

where d^{(k)} = x^{(k+1)} - x^{(k)} and u^{(k)} = B^{(k)}\left[f(x^{(k+1)}) - f(x^{(k)})\right].

Broyden converges under relatively weak conditions:

  1. f is continuously differentiable,
  2. x^{(0)} is close to the root of f,
  3. f' is invertible around the root,
  4. A^{(0)} is sufficiently close to the Jacobian
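A minimal Julia sketch of Broyden's method with the inverse-Jacobian update (the function name, keyword defaults, and the identity-matrix initial guess for B^{(0)} are illustrative assumptions; a finite-difference Jacobian inverse is often a better starting point):

using LinearAlgebra

function broyden(f, x0; B0 = Matrix(1.0I, length(x0), length(x0)),
                 tol = 1e-8, maxiter = 100)
    x, B = x0, B0
    fx = f(x)
    for k in 1:maxiter
        x_new = x - B*fx                 # root of the local linear model
        fx_new = f(x_new)
        d = x_new - x                    # d^(k): step in x
        u = B*(fx_new - fx)              # u^(k): B^(k) times the change in f
        B += ((d - u)*(d'*B)) / (d'*u)   # inverse-Jacobian update from above
        x, fx = x_new, fx_new
        norm(fx) < tol && return x       # converged: f(x) ≈ 0
    end
    return x
end

Consistent with the conditions above, expect convergence only when x^{(0)} and B^{(0)} are sufficiently close to the root and the true (inverse) Jacobian.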

Convergence speed

Rootfinding algorithms will converge at different speeds in terms of the number of operations

A sequence of iterates x^{(k)} is said to converge to x^* at a rate of order p if there is a constant C such that

||x^{(k+1)} - x^*|| \leq C\,||x^{(k)} - x^*||^p

for sufficiently large k

If C < 1 and p = 1, the rate of convergence is linear

If 1 < p < 2, convergence is superlinear, and if p = 2 convergence is quadratic. The higher the order of the convergence rate, the faster it converges
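
For example, with C = 1 and p = 2, an error of 10^{-2} falls to 10^{-4} and then 10^{-8}: quadratic convergence roughly doubles the number of correct digits each iteration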

Convergence speed

How fast do the methods we've seen converge?

  • Bisection: linear rate with C=0.5 (kind of obvious once you see it)

  • Function iteration: linear rate with C = ||f'(x^*)||

  • Secant and Broyden: superlinear rate with p \approx 1.62

  • Newton: p=2


Convergence speed

Convergence rates only account for the number of iterations of the method

The steps taken in a given iteration of each solution method may vary in computational cost because of differences in the number of arithmetic operations

Although an algorithm may take more iterations to converge,
each iteration may be cheaper, so the overall algorithm can take less time

Convergence speed

Ex:

  • Bisection method only requires a single function evaluation during each iteration
  • Function iteration only requires a single function evaluation during each iteration
  • Broyden's method requires both a function evaluation and matrix multiplication
  • Newton's method requires a function evaluation, a derivative evaluation, and solving a linear system

Bisection and function iteration are usually slow

Broyden's method can be faster than Newton's method if derivatives are costly to compute


Convergence speed

Consider an example where f(x) = x - \sqrt{x} = 0

What does convergence look like across our main approaches in terms of the L^1-norm if all guesses start at x^{(0)} = 0.5?

Complementarity problems: rootfinding

Let:

  • x be an n-dimensional vector of some economic action
  • a_i denotes a lower bound on action i, and b_i denotes the upper bound on action i
  • f_i(x) denotes the marginal arbitrage profit of action i

There are disequilibrium profit opportunities if

  1. x_i < b_i and f_i(x) > 0 (here we can increase profits by raising x_i)
  2. x_i > a_i and f_i(x) < 0 (we can increase profits by decreasing x_i)
Complementarity problems

We obtain a no-arbitrage equilibrium if and only if x solves the complementarity problem CP(f,a,b)

We can write out the problem as finding a vector x \in [a,b] that solves
x_i > a_i \Rightarrow f_i(x) \geq 0 \quad i = 1,...,n
x_i < b_i \Rightarrow f_i(x) \leq 0 \quad i = 1,...,n

At an interior solution, the function must be precisely zero

At a corner solution at the upper bound b_i for x_i, f must be increasing in direction i

The opposite is true if we are at the lower bound


Complementarity problems

Economic problems are complementarity problems where we are
finding a root of a function (e.g. marginal profit) subject to some constraint (e.g. price floors)

The Karush-Kuhn-Tucker theorem shows that x solves the constrained optimization problem if and only if it solves the complementarity problem

Complementarity problems

Example: single commodity competitive spatial price equilibrium model

  • n regions of the world
  • excess demand for the commodity in region i is E_i(p_i)

If there is no trade, the equilibrium condition is E_i(p_i) = 0 in all regions of the world: a simple rootfinding problem

Trade between regions has marginal transportation cost c_{ij} between regions i and j

  • x_{ij}: the amount of the good shipped from region i to region j
  • capacity constraint: b_{ij}
Complementarity problems

Marginal arbitrage profit from shipping a unit of the good from i to j is p_j - p_i - c_{ij}

If it's positive: incentive to ship more goods from region i to region j

If it's negative: incentive to decrease shipments

We are at an equilibrium only if all the arbitrage opportunities are gone: for all region pairs i and j,
0 \leq x_{ij} \leq b_{ij}
x_{ij} > 0 \Rightarrow p_j - p_i - c_{ij} \geq 0
x_{ij} < b_{ij} \Rightarrow p_j - p_i - c_{ij} \leq 0

Complementarity problems

How do we formulate this as a complementarity problem?

Market clearing in each region i requires that net imports = excess demand

\sum_k \left[x_{ki} - x_{ik}\right] = E_i(p_i)

This implies that we can solve for the price in region i,

p_i = E_i^{-1}\left(\sum_k \left[x_{ki} - x_{ik}\right]\right)


Complementarity problems

Finally, if we define marginal arbitrage profit from shipping another unit from i to j as

f_{ij}(x) = E_j^{-1}\left(\sum_k \left[x_{kj} - x_{jk}\right]\right) - E_i^{-1}\left(\sum_k \left[x_{ki} - x_{ik}\right]\right) - c_{ij}

then x is an equilibrium vector of trade flows if and only if x solves CP(f,0,b), where x, f, and b are n^2 \times 1 vectors

Even this complex trade-equilibrium model can be reduced to a simple complementarity problem


Complementarity problems

Decentralized models can generally be solved by reducing them to complementarity problems, e.g.:

  1. Competitive energy markets with cross-market cap and trade or capacity constraints
  2. Federalism models
  3. etc etc

Complementarity problems

Four plots of marginal arbitrage profit f_i(x): what is the equilibrium choice of x?


How do we solve them?

A complementarity problem CP(f,a,b) can easily be re-framed as a rootfinding problem: \hat{f}(x) = \min(\max(f(x), a - x), b - x) = 0

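A one-line Julia sketch of this reformulation (assuming f, a, and b are already defined, with the bounds applied elementwise):

fhat(x) = min.(max.(f(x), a .- x), b .- x)  # solve fhat(x) = 0 with any rootfinder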

How do we solve them?

We can then solve the problem with conventional rootfinding techniques
or approximate this with a smooth function to make it easier on a rootfinder

The workhorse solver for complementarity problems is the PATH solver; a Julia wrapper can be found here

Maximization (minimization) methods

How we solve maximization problems has many similarities to rootfinding and complementarity problems

FOCs of an unconstrained maximization problem are just a rootfinding problem

FOCs of a constrained maximization problem are just a complementarity problem

I'll tend to frame problems as minimization problems because it is the convention in the optimization literature

We make two distinctions:

Local vs global: are we finding an optimum in a local region, or globally?

Derivative-using vs derivative-free: Do we want to use higher-order information?


Maximization (minimization) methods

I'll focus on local solvers, common global solvers I won't cover:

  1. Genetic algorithms
  2. Simulated annealing
  3. DIRECT
  4. StoGO

Derivative free optimization: Golden search

Similar to bisection, golden search looks for a solution of a one-dimensional problem over smaller and smaller brackets

If we have a continuous one dimensional function, f(x),
and we want to find a local minimum in some interval [a,b]

  1. Select points x_1, x_2 \in [a,b] where x_2 > x_1

  2. If f(x_1) < f(x_2), replace [a,b] with [a,x_2]; else replace [a,b] with [x_1,b]

  3. Repeat until convergence criterion is met

Replacing the endpoint next to the evaluated point with the higher value, while keeping the lower evaluated point inside the interval, guarantees that a local minimum still exists

Derivative free optimization: Golden search

How do we pick x_1 and x_2?

Achievable goal for selection process:

  • New interval is independent of whether the upper or lower bound is replaced
  • Only requires one function evaluation per iteration

There's one algorithm that satisfies this

Derivative free optimization: Golden search

Golden search algorithm for point selection:

x_i = a + \alpha_i(b - a), \qquad \alpha_1 = \frac{3 - \sqrt{5}}{2}, \quad \alpha_2 = \frac{\sqrt{5} - 1}{2}

The value of \alpha_2 is called the golden ratio and is where the algorithm gets its name

Write out a golden search algorithm

Golden search

function golden_search(f, lower_bound, upper_bound)
    alpha_1 = (3 - sqrt(5))/2   # golden search parameter 1
    alpha_2 = (sqrt(5) - 1)/2   # golden search parameter 2
    tolerance = 1e-2            # tolerance for convergence
    difference = 1e10
    while difference > tolerance
        x_1 = lower_bound + alpha_1*(upper_bound - lower_bound)  # new x_1
        x_2 = lower_bound + alpha_2*(upper_bound - lower_bound)  # new x_2
        if f(x_1) < f(x_2)      # reset bounds
            upper_bound = x_2
        else
            lower_bound = x_1
        end
        difference = x_2 - x_1
    end
    println("Minimum is at x = $((lower_bound+upper_bound)/2).")
end;
f(x) = 2x^2 + 9x;
golden_search(f, -10, 10)
## Minimum is at x = -2.2483173872886444.
Nelder-Mead: Simplex

Golden search is nice and simple but only works in one dimension

There are several derivative free methods for minimization that work in multiple dimensions, the most commonly used one is Nelder-Mead (NM)

NM works by first constructing a simplex: we evaluate the function at n+1 points in an n dimensional problem

It then manipulates the highest value point, similar to golden search

Nelder-Mead: Simplex

There are six operations:

  • Order: order the values at the vertices of the simplex f(x_1) \leq ... \leq f(x_{n+1})
  • Centroid: calculate x_0, the centroid of all points except x_{n+1}
  • Reflection: reflect x_{n+1} through the opposite face of the simplex and evaluate the new point: x_r = x_0 + \alpha(x_0 - x_{n+1}), \alpha > 0
    • If this improves upon the second-highest value point (i.e. it is lower) but is not the lowest value point, replace x_{n+1} with x_r and restart
    • If this is the lowest value point so far, go to step 4
    • If f(x_r) > f(x_n) go to step 5
Nelder-Mead: Simplex

  • Expansion: push the reflected point further in the same direction
  • Contract: contract the highest value point toward the middle
    • Compute x_c = x_0 + \gamma(x_0 - x_{n+1}), 0 < \gamma \leq 0.5
    • If x_c is better than the worst point, replace x_{n+1} with x_c and restart
    • Else go to step 6
  • Shrink: shrink the simplex toward the best point
    • Replace all points but the best one with x_i = x_1 + \sigma(x_i - x_1)

Nelder-Mead is a pain to code efficiently (i.e. don't spend the time doing it yourself) but is in the Optim.jl package


Nelder-Mead: Simplex

Nelder-Mead is commonly used but is slow and unreliable, with no useful convergence properties; avoid using it

What is a solution?

We typically want to find a global extremum, here a minimum,
of our objective function f

x^* is a global minimizer if f(x^*) \leq f(x) for all x over the domain of the function

Problem: most algorithms are local minimizers that find a point x^*
such that f(x^*) \leq f(x) for all x \in N, where N is a neighborhood of x^*

Typically analytical problems are set up to have a unique minimum so any local solver can generally find the global optimum

What is a solution?

Lots of problems have properties that don't satisfy the typical sufficiency conditions for a unique minimum (strictly decreasing and convex), like

  • Concave transitions
  • Games with multiple equilibria
  • Etc

How do we find a local minimum?

Do we need to evaluate every single point?

The general unconstrained approach

Optimization algorithms typically have the following set up:

  1. Start at some x_0
  2. Work through a series of iterates \{x_k\}_{k=1}^{\infty} until we have "converged" with sufficient accuracy

If the function is smooth, we can take advantage of that information
about the function's shape to figure out which direction to move in next

If f is twice continuously differentiable, we can use the gradient \nabla f and Hessian \nabla^2 f to figure out if x is a local minimizer

The general unconstrained approach

Taylor's Theorem tells us that if f is twice differentiable, then there exists a t \in (0,1) such that

f(x+p) = f(x) + \nabla f(x)^T p + \frac{1}{2!}\,p^T\,\nabla^2 f(x+tp)\,p

This is an exact equality

From here we can prove the usual necessary and sufficient conditions for a local optimum

Two large classes of algorithms

All modern algorithms have that general set up but may go about it in different ways

Most modern optimization problems fall into one of two classes:

  1. Line search
  2. Trust region

The relationship between these two approaches has a lot of similarities
to the relationship between the constrained problem and the dual Lagrange problem

Line search algorithms

General idea:

  1. Start at some current iterate x_k
  2. Select a direction to move in p_k
  3. Figure out how far along p_k to move

How do we figure out how far to move?

"Approximately" solve this problem to figure out the step length \alpha \min_{\alpha > 0} f(x_k + \alpha p_k)

We are finding the distance to move, \alpha in direction p_k that minimizes our objective f

Line search algorithms

Typically do not perform the full minimization problem since it is costly

We only try a limited number of step lengths \alpha before picking the best one and moving onto our next iterate x_{k+1}

We still haven't answered, what direction p_k do we decide to move in?

Line search: direction choice

What's an obvious choice for p_k?

The direction that yields the steepest descent

-\nabla \, f_k is the direction that makes f decrease most rapidly, k indicates we are evaluating f at iteration k

Line search: steepest descent

We can verify this is the direction of steepest descent by referring to Taylor's theorem

For any direction p and step length \alpha, we have that f(x_k + \alpha p) = f(x_k) + \alpha\,p^T\,\nabla\,f_k + \frac{1}{2!}\,\alpha^2p^T\,\nabla^2\,f(x_k+tp)\,p

The rate of change in f along p at x_k (\alpha = 0) is p^T \, \nabla\,f_k

Line search: steepest descent

The unit vector of quickest descent solves \min_p p^T\,\nabla\,f_k \,\,\,\,\, \text{subject to: }||p|| = 1

Re-express the objective as \min_{\theta,p} ||p||\,||\nabla\,f_k||cos\,\theta, where \theta is the angle between p and \nabla\,f_k

The minimum is attained when cos\,\theta = -1 and p = -\frac{\nabla\,f_k}{||\nabla\,f_k||}, so the direction of steepest descent is simply -\nabla\,f_k


Line search: steepest descent

The steepest descent method searches along this direction at every iteration k

It may select the step length \alpha_k in several different ways

A benefit of the algorithm is that we only require the gradient of the function, and no Hessian

However it can be very slow

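A minimal Julia sketch of steepest descent with a fixed step length (the function name, the fixed alpha, and the stopping rule are illustrative assumptions; in practice \alpha_k comes from a line search):

using LinearAlgebra

function steepest_descent(grad, x0; alpha = 0.1, tol = 1e-8, maxiter = 10_000)
    x = x0
    for k in 1:maxiter
        g = grad(x)                # gradient at the current iterate
        norm(g) < tol && return x  # stop when the gradient is near zero
        x -= alpha*g               # move in the direction -∇f_k
    end
    return x
end

steepest_descent(x -> 4x + 9, 0.0)  # minimizes f(x) = 2x^2 + 9x at x = -2.25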

Line search: alternative directions

We can always use search directions other than the steepest descent

Any descent direction, i.e. one that is within 45^\circ of -\nabla\,f_k,
is guaranteed to produce a decrease in f as long as the step size is sufficiently small

Line search: alternative directions

We can actually verify this with Taylor's theorem

f(x_k + \epsilon p_k) = f(x_k) + \epsilon\,p_k^T\,\nabla\,f_k + O(\epsilon^2)

If p_k is in a descending direction, \theta_k will be of an angle such that cos\,\theta_k < 0

This gives us

p_k^T\,\nabla f_k = ||p_k||\,||\nabla\,f_k||cos\,\theta_k < 0

Therefore f(x_k + \epsilon p_k) < f(x_k) for positive but sufficiently small \epsilon

Is -\nabla\,f_k always the best search direction?

Newton's method

The most important search direction is not steepest descent but Newton's direction

Newton's direction comes out of the second order Taylor series approximation to f(x_k + p) f(x_k + p) \approx f_k + p^T\,\nabla\,f_k + \frac{1}{2!}\,p^T\,\nabla^2f_k\,p Define this as \mathbf{m_k(p)}

We find the Newton direction by selecting the vector p that minimizes f(x_k + p)

This ends up being p^N_k = -\left[\nabla^2 f_k\right]^{-1}\nabla f_k

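A minimal Julia sketch of taking full Newton steps for minimization (the function name and keyword defaults are illustrative; g and H stand for routines returning the gradient and Hessian):

using LinearAlgebra

function newton_min(g, H, x0; tol = 1e-10, maxiter = 100)
    x = x0
    for k in 1:maxiter
        p = -(H(x) \ g(x))        # Newton direction: solve ∇²f_k p = -∇f_k
        norm(p) < tol && return x
        x += p                    # take the full step (no line search here)
    end
    return x
end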

Newton's method

This approximation to the function we are trying to solve has error of O(||p||^3),
so if p is small, the quadratic approximation is very accurate

Drawback: requires explicit computation of the Hessian, \nabla^2 \, f(x)

  • Quasi-Newton solvers also exist (e.g. BFGS, L-BFGS, etc)
Trust region methods

Trust region methods construct an approximating model, m_k
whose behavior near the current iterate x_k is close to that of the actual function f

We then search for a minimizer of m_k

Issue: m_k may not represent f well when far away from the current iterate x_k

Solution: Restrict the search for a minimizer to be within some region of x_k, called a trust region

76 / 103

Trust region methods

Trust region problems can be formulated as \min_p m_k(x_k + p) where

  • x_k+p \in \Gamma
  • \Gamma is a ball defined by ||p||_2 \leq \Delta
  • \Delta is called the trust region radius
77 / 103

Trust region methods

Typically the approximating model m_k is a quadratic function (i.e. a second-order Taylor approximation):

m_k(x_k + p) = f_k + p^T\,\nabla\,f_k + \frac{1}{2!}\,p^T\,B_k\,p

where B_k is the Hessian or an approximation to the Hessian

78 / 103
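
A minimal trust region run, again a sketch assuming Optim.jl, whose NewtonTrustRegion method uses the exact Hessian (here obtained by forward-mode autodiff) as B_k:

using Optim

f(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
res = optimize(f, [-1.0, 1.0], NewtonTrustRegion(); autodiff = :forward)
Optim.minimizer(res)             # ≈ [1.0, 1.0]
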

Line search vs trust region

What's the fundamental difference between line search and trust region?

79 / 103

Line search vs trust region

What's the fundamental difference between line search and trust region?

Line search first picks a direction then searches along that direction for the optimal step length

Trust region first defines our step length via the trust region radius, then searches for the optimal direction

79 / 103

Line search vs trust region

There is a special case of the trust region problem: if we set the approximate Hessian B_k to zero, the solution is p_k = -\frac{\Delta_k \,\nabla \,f_k}{||\nabla\,f_k||}

This is just the steepest descent solution for the line search problem

80 / 103

Problem scaling

The scaling of a problem matters for optimization performance

81 / 103

Problem scaling

The scaling of a problem matters for optimization performance

A problem is poorly scaled if changes to x in a certain direction
produce much bigger changes in f than changes to x in another direction

81 / 103

Problem scaling

The scaling of a problem matters for optimization performance

A problem is poorly scaled if changes to x in a certain direction
produce much bigger changes in f than changes to x in another direction

81 / 103

Problem scaling

Ex: f(x) = 10^9 x_1^2 + x_2^2 is poorly scaled

82 / 103

Problem scaling

Ex: f(x) = 10^9 x_1^2 + x_2^2 is poorly scaled

This happens when things change at different rates:

  • Investment rates are between 0 and 1
  • Consumption can be in trillions of dollars
82 / 103

Problem scaling

Ex: f(x) = 10^9 x_1^2 + x_2^2 is poorly scaled

This happens when things change at different rates:

  • Investment rates are between 0 and 1
  • Consumption can be in trillions of dollars

How do we solve this issue?

82 / 103

Problem scaling

Ex: f(x) = 10^9 x_1^2 + x_2^2 is poorly scaled

This happens when things change at different rates:

  • Investment rates are between 0 and 1
  • Consumption can be in trillions of dollars

How do we solve this issue?

Rescale the problem: put the variables in units that are
generally within an order of magnitude of 1

  • Investment rate in percentage terms: 0\%-100\%
  • Consumption in units of trillion dollars instead of dollars
82 / 103
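
A sketch of the fix on this slide's example, assuming Optim.jl's GradientDescent; substituting y_1 = \sqrt{10^9}\,x_1 equalizes the curvature in both directions:

using Optim

f_bad(x)  = 1e9 * x[1]^2 + x[2]^2   # curvatures differ by a factor of 10^9
f_good(y) = y[1]^2 + y[2]^2         # after the substitution y₁ = √(1e9) x₁

res_bad  = optimize(f_bad,  [1.0, 1.0], GradientDescent())
res_good = optimize(f_good, [1.0, 1.0], GradientDescent())
# the poorly scaled version typically stalls at the iteration limit,
# while the rescaled version converges almost immediately
println(Optim.converged(res_bad),  " after ", Optim.iterations(res_bad),  " iterations")
println(Optim.converged(res_good), " after ", Optim.iterations(res_good), " iterations")
# recover the original variable afterwards: x₁ = y₁ / √(1e9)
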

Constrained optimization

How do we solve constrained optimization problems?

83 / 103

Constrained optimization

How do we solve constrained optimization problems?

Typically as a variant of unconstrained optimization techniques

83 / 103

Constrained optimization

How do we solve constrained optimization problems?

Typically as a variant of unconstrained optimization techniques

We will discuss three types of constrained optimization algorithms

  • Penalty methods
  • Active set methods
  • Interior point methods
83 / 103

Constrained optimization

These are the algorithms in workhorse commercial solvers: KNITRO

84 / 103

Constrained optimization

These are the algorithms in workhorse commercial solvers: KNITRO

85 / 103

Constrained optimization

These are the algorithms in workhorse commercial solvers: fmincon/MATLAB

86 / 103

Constrained optimization: Penalty methods

Suppose we wish to minimize some function subject to equality constraints (easily generalizes to inequality):

\min_x f(x) \,\,\, \text{subject to:} \,\, c_i(x) = 0

87 / 103

Constrained optimization: Penalty methods

Suppose we wish to minimize some function subject to equality constraints (easily generalizes to inequality):

\min_x f(x) \,\,\, \text{subject to:} \,\, c_i(x) = 0

How does an algorithm know to not violate the constraint?

87 / 103

Constrained optimization: Penalty methods

Suppose we wish to minimize some function subject to equality constraints (easily generalizes to inequality):

\min_x f(x) \,\,\, \text{subject to:} \,\, c_i(x) = 0

How does an algorithm know to not violate the constraint?

One way is to introduce a penalty function into our objective and remove the constraint:

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

where \mu is the penalty parameter

87 / 103

Constrained optimization: Penalty methods

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

88 / 103

Constrained optimization: Penalty methods

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

The second term increases the value of the function: bigger \mu \rightarrow bigger penalty for violating the constraint

88 / 103

Constrained optimization: Penalty methods

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

The second term increases the value of the function: bigger \mu \rightarrow bigger penalty for violating the constraint

The penalty terms are smooth \rightarrow we can use unconstrained optimization techniques
to solve the problem by searching over iterates x_k

88 / 103

Constrained optimization: Penalty methods

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

The second term increases the value of the function: bigger \mu \rightarrow bigger penalty for violating the constraint

The penalty terms are smooth \rightarrow we can use unconstrained optimization techniques
to solve the problem by searching over iterates x_k

We also generally iterate on a sequence \mu_k \rightarrow \infty as k \rightarrow \infty, forcing the constraints to hold as we close in on the solution

88 / 103

Constrained optimization: Penalty methods

Q(x;\mu) = f(x) + \frac{\mu}{2} \sum_i c_i^2(x)

The second term increases the value of the function: bigger \mu \rightarrow bigger penalty for violating the constraint

The penalty terms are smooth \rightarrow we can use unconstrained optimization techniques
to solve the problem by searching over iterates x_k

We also generally iterate on a sequence \mu_k \rightarrow \infty as k \rightarrow \infty, forcing the constraints to hold as we close in on the solution

There are also augmented Lagrangian methods that take the quadratic penalty method and add in explicit estimates of Lagrange multipliers to help force binding constraints to bind precisely

88 / 103

Constrained optimization: Penalty method example

Example: \min x_1 + x_2 \,\,\,\,\,\text{ subject to: } \,\,\, x_1^2 + x_2^2 - 2 = 0

89 / 103

Constrained optimization: Penalty method example

Example: \min x_1 + x_2 \,\,\,\,\,\text{ subject to: } \,\,\, x_1^2 + x_2^2 - 2 = 0

The solution is easily shown to be (-1, -1)

89 / 103

Constrained optimization: Penalty method example

Example: \min x_1 + x_2 \,\,\,\,\,\text{ subject to: } \,\,\, x_1^2 + x_2^2 - 2 = 0

The solution is easily shown to be (-1, -1)

The penalty method function Q(x_1, x_2; \mu) is

Q(x_1, x_2; \mu) = x_1 + x_2 + \frac{\mu}{2} (x_1^2 + x_2^2 - 2)^2

89 / 103

Constrained optimization: Penalty method example

Example: \min x_1 + x_2 \,\,\,\,\,\text{ subject to: } \,\,\, x_1^2 + x_2^2 - 2 = 0

The solution is easily shown to be (-1, -1)

The penalty method function Q(x_1, x_2; \mu) is

Q(x_1, x_2; \mu) = x_1 + x_2 + \frac{\mu}{2} (x_1^2 + x_2^2 - 2)^2

Let's ramp up \mu and see what happens to how the function looks

89 / 103

Constrained optimization: Penalty method example

\mu = 1, solution is around (-1.1, -1.1)

90 / 103

Constrained optimization: Penalty method example

\mu = 10, solution is very close to (-1, -1); we can easily see the trough, and the rapid increase in value outside x_1^2 + x_2^2 = 2

91 / 103
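
A sketch of the full penalty iteration on this example, assuming Optim.jl's NelderMead; each subproblem is warm-started at the previous solution, and the minimizers approach (-1, -1) as \mu grows:

using Optim

Q(x, μ) = x[1] + x[2] + (μ / 2) * (x[1]^2 + x[2]^2 - 2)^2

function penalty_path(μs, x0)
    x = x0
    for μ in μs
        # solve the unconstrained subproblem, warm-started at the last solution
        x = Optim.minimizer(optimize(x -> Q(x, μ), x, NelderMead()))
        println("μ = ", μ, ": x ≈ ", x)   # μ = 1 gives roughly (-1.1, -1.1)
    end
    return x
end

penalty_path((1.0, 10.0, 100.0), [-1.0, -1.0])
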

Constrained optimization: Active set methods

Active set methods encapsulate sequential quadratic programming (SQP) methods

92 / 103

Constrained optimization: Active set methods

Active set methods encapsulate sequential quadratic programming (SQP) methods

Main idea:

  1. Replace the large non-linear constrained problem with a sequence of constrained quadratic programming subproblems
  2. Use Newton's method to solve the sequence of simpler quadratic problems
92 / 103

Constrained optimization: Active set methods

Active set methods encapsulate sequential quadratic programming (SQP) methods

Main idea:

  1. Replace the large non-linear constrained problem with a sequence of constrained quadratic programming subproblems
  2. Use Newton's method to solve the sequence of simpler quadratic problems

The Lagrangian is L(x,\lambda) = f(x) - \lambda^T\,c(x)

92 / 103

Constrained optimization: Active set methods

Active set methods encapsulate sequential quadratic programming (SQP) methods

Main idea:

  1. Replace the large non-linear constrained problem with a sequence of constrained quadratic programming subproblems
  2. Use Newton's method to solve the sequence of simpler quadratic problems

The Lagrangian is L(x,\lambda) = f(x) - \lambda^T\,c(x)

Denote A(x)^T as the Jacobian of the constraints: A(x)^T = [\nabla\,c_1(x),...,\nabla\,c_m(x)]

92 / 103

Constrained optimization: Active set methods

The first-order conditions F(x,\lambda) = 0 can be written as \begin{gather} \nabla\,f(x) - A(x)^T\,\lambda = 0 \notag\\ c(x) = 0 \notag \end{gather}

Any solution x^* to the equality constrained problem at which A(x^*) has full rank also satisfies these first-order necessary conditions

93 / 103

Constrained optimization: Active set methods

The first-order conditions F(x,\lambda) = 0 can be written as \begin{gather} \nabla\,f(x) - A(x)^T\,\lambda = 0 \notag\\ c(x) = 0 \notag \end{gather}

Any solution x^* to the equality constrained problem at which A(x^*) has full rank also satisfies these first-order necessary conditions

Active set methods use Newton's method to find the solution (x^*, \lambda^*) of F(x,\lambda)

93 / 103
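
A sketch of that Newton iteration on the earlier example, \min x_1 + x_2 subject to x_1^2 + x_2^2 - 2 = 0, with F and its Jacobian coded by hand (z stacks (x_1, x_2, \lambda)):

F(z) = [1 - 2 * z[3] * z[1],      # ∇f(x) - A(x)ᵀλ, first component
        1 - 2 * z[3] * z[2],      # second component
        z[1]^2 + z[2]^2 - 2]      # c(x) = 0

function JF(z)                    # Jacobian of F with respect to (x₁, x₂, λ)
    x1, x2, λ = z
    return [-2λ 0.0 -2x1;
            0.0 -2λ -2x2;
            2x1 2x2 0.0]
end

function newton_solve(z; iters = 10)
    for _ in 1:iters
        z = z - JF(z) \ F(z)      # Newton step on the full system
    end
    return z
end

newton_solve([-0.8, -0.8, -1.0])  # ≈ [-1.0, -1.0, -0.5]
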

Constrained optimization: Active set methods

Issue: if we have many constraints, keeping track of all of them can be expensive

94 / 103

Constrained optimization: Active set methods

Issue: if we have many constraints, keeping track of all of them can be expensive

Main idea: recognize that if an inequality constraint is not binding (i.e. not active), then it has no influence on the solution

\rightarrow in the iteration procedure we can effectively ignore it

94 / 103

Constrained optimization: Active set methods

Issue: if we have many constraints, keeping track of all of them can be expensive

Main idea: recognize that if an inequality constraint is not binding (i.e. not active), then it has no influence on the solution

\rightarrow in the iteration procedure we can effectively ignore it

Active set methods find ways to reduce the complexity of the optimization routine
by selectively ignoring inequality constraints that are neither active (their Lagrange multipliers are zero by complementary slackness) nor close to becoming active

94 / 103

Constrained optimization: Interior point methods

Interior point methods are also called barrier methods

95 / 103

Constrained optimization: Interior point methods

Interior point methods are also called barrier methods

These are typically used for inequality constrained problems

95 / 103

Constrained optimization: Interior point methods

Interior point methods are also called barrier methods

These are typically used for inequality constrained problems

The name interior point comes from the algorithm traversing the domain along the interior of the inequality constraints

95 / 103

Constrained optimization: Interior point methods

Interior point methods are also called barrier methods

These are typically used for inequality constrained problems

The name interior point comes from the algorithm traversing the domain along the interior of the inequality constraints

Issue: how do we ensure we are on the interior of the feasible set?

95 / 103

Constrained optimization: Interior point methods

Interior point methods are also called barrier methods

These are typically used for inequality constrained problems

The name interior point comes from the algorithm traversing the domain along the interior of the inequality constraints

Issue: how do we ensure we are on the interior of the feasible set?

Main idea: impose a barrier to stop the solver from letting a constraint bind

95 / 103

Constrained optimization: Interior point methods

Consider the following constrained optimization problem \begin{gather} \min_{x} f(x) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) \geq 0 \end{gather}

96 / 103

Constrained optimization: Interior point methods

Consider the following constrained optimization problem \begin{gather} \min_{x} f(x) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) \geq 0 \end{gather}

Reformulate this problem as \begin{gather} \min_{x,s} f(x) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) - s = 0, s \geq 0 \end{gather}

where s is a vector of slack variables for the constraints

96 / 103

Constrained optimization: Interior point methods

Final step: introduce a barrier function to eliminate the inequality constraint:

\begin{gather} \min_{x,s} f(x) - \mu \sum_{i=1}^m \log(s_i) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) - s = 0 \end{gather}

where \mu is a positive barrier parameter

97 / 103

Constrained optimization: Interior point methods

Final step: introduce a barrier function to eliminate the inequality constraint:

\begin{gather} \min_{x,s} f(x) - \mu \sum_{i=1}^m \log(s_i) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) - s = 0 \end{gather}

where \mu is a positive barrier parameter

The barrier function prevents the components of s from approaching
zero by imposing a logarithmic barrier \rightarrow it maintains slack in the constraints

97 / 103

Constrained optimization: Interior point methods

Final step: introduce a barrier function to eliminate the inequality constraint:

\begin{gather} \min_{x,s} f(x) - \mu \sum_{i=1}^m \log(s_i) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) - s = 0 \end{gather}

where \mu is a positive barrier parameter

The barrier function prevents the components of s from approaching
zero by imposing a logarithmic barrier \rightarrow it maintains slack in the constraints

Interior point methods solve a sequence of barrier problems until \{\mu_k\} converges to zero

97 / 103

Constrained optimization: Interior point methods

Final step: introduce a barrier function to eliminate the inequality constraint:

\begin{gather} \min_{x,s} f(x) - \mu \sum_{i=1}^m \log(s_i) \notag\\ \text{subject to: } c_E(x) = 0, c_I(x) - s = 0 \end{gather}

where \mu is a positive barrier parameter

The barrier function prevents the components of s from approaching
zero by imposing a logarithmic barrier \rightarrow it maintains slack in the constraints

Interior point methods solve a sequence of barrier problems until \{\mu_k\} converges to zero

The solution to the barrier problem converges to that of the original problem

97 / 103
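
A tiny log-barrier illustration on a hypothetical one-dimensional problem (not from the slides): \min_x x subject to x \geq 1. The barrier subproblem \min_x x - \mu \log(x-1) has minimizer x = 1 + \mu, which approaches the constrained solution as \mu \rightarrow 0:

using Optim

barrier(x, μ) = x > 1 ? x - μ * log(x - 1) : Inf   # +Inf outside the interior

for μ in (1.0, 0.1, 0.01, 0.001)
    res = optimize(x -> barrier(x, μ), 1.0, 3.0)   # univariate Brent search
    println("μ = ", μ, ": x* ≈ ", Optim.minimizer(res))   # → 1 + μ → 1
end
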

Best practices for optimization

Plug in your guess, let the solver go, and you're done, right?

98 / 103

Best practices for optimization

Plug in your guess, let the solver go, and you're done, right?

WRONG

98 / 103

Best practices for optimization

Plug in your guess, let the solver go, and you're done, right?

WRONG

These algorithms are not guaranteed to find even a local solution; you need to test and make sure you are converging correctly

98 / 103

Check exit flags: KNITRO-specific numbers here

Exit flags tell you why the solver stopped; in KNITRO, exit flags of 0 or -10X are generally good, anything else is bad

A -10X flag can indicate bad scaling, ill-conditioning, etc

99 / 103
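
Open-source solvers expose the same information; a hedged sketch of the Optim.jl analogue, using its documented accessor functions:

using Optim

res = optimize(x -> (x[1] - 1)^2 + (x[2] + 2)^2, zeros(2), BFGS())
Optim.converged(res)                  # did a convergence criterion trigger?
Optim.iteration_limit_reached(res)    # stopping at the cap is a warning sign
Optim.iterations(res)
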

Try alternative algorithms

Optimization is approximately 53% art

100 / 103

Try alternative algorithms

Optimization is approximately 53% art

Not all algorithms are suited for every problem \rightarrow it is useful to check how different algorithms perform

100 / 103

Try alternative algorithms

Optimization is approximately 53% art

Not all algorithms are suited for every problem \rightarrow it is useful to check how different algorithms perform

Interior-point is usually the default in constrained optimization solvers (low memory usage, fast), but try other algorithms and see if the solution generally remains the same

100 / 103
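
A sketch of that cross-check with Optim.jl (hypothetical example function): run several algorithms and confirm they agree on the minimizer:

using Optim

f(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
for alg in (NelderMead(), LBFGS(), NewtonTrustRegion())
    res = optimize(f, [-1.0, 1.0], alg)
    println(nameof(typeof(alg)), ": ", Optim.minimizer(res))
end
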

Be aware of tolerances

Two main tolerances in optimization:

  1. ftol is the tolerance for the change in the function value (absolute and relative)
  2. xtol is the tolerance for the change in the input values (absolute and relative)
101 / 103

Be aware of tolerances

Two main tolerances in optimization:

  1. ftol is the tolerance for the change in the function value (absolute and relative)
  2. xtol is the tolerance for the change in the input values (absolute and relative)

What is a suitable tolerance?

101 / 103

Be aware of tolerances

Two main tolerances in optimization:

  1. ftol is the tolerance for the change in the function value (absolute and relative)
  2. xtol is the tolerance for the change in the input values (absolute and relative)

What is a suitable tolerance?

It depends

101 / 103

Be aware of tolerances

Two main tolerances in optimization:

  1. ftol is the tolerance for the change in the function value (absolute and relative)
  2. xtol is the tolerance for the change in the input values (absolute and relative)

What is a suitable tolerance?

It depends

Explore sensitivity to tolerance, typically pick a conservative (small) number

  • Defaults in solvers are usually 1e-6
101 / 103
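
A sketch of setting and stress-testing tolerances, assuming Optim.jl's Options keywords (exact names can vary across versions); the quartic term makes the objective flat near the optimum, so a loose ftol stops noticeably early:

using Optim

f(x) = (x[1] - 1)^2 + (x[2] + 1)^4
tight = optimize(f, zeros(2), LBFGS(), Optim.Options(f_tol = 1e-10))
loose = optimize(f, zeros(2), LBFGS(), Optim.Options(f_tol = 1e-3))

# a visible gap between the two minimizers means the loose tolerance binds
Optim.minimizer(tight) - Optim.minimizer(loose)
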

Be aware of tolerances

There may be a substantial tradeoff between the accuracy of your solution and speed

102 / 103

Be aware of tolerances

There may be a substantial tradeoff between the accuracy of your solution and speed

A common bad practice is to pick a larger tolerance (e.g. 1e-3) so the problem "works" (e.g. so your big MLE converges)

102 / 103

Be aware of tolerances

There may be a substantial tradeoff between the accuracy of your solution and speed

A common bad practice is to pick a larger tolerance (e.g. 1e-3) so the problem "works" (e.g. so your big MLE converges)

The issue is that 1e-3 might be pretty big for your problem
if you haven't checked that your solution is insensitive to the tolerance

102 / 103

Perturb your initial guesses

Initial guesses matter

103 / 103

Perturb your initial guesses

Initial guesses matter

Good ones can improve performance

  • e.g. the initial guess for the next iteration's coefficient estimates should be the current iteration's estimates
103 / 103

Perturb your initial guesses

Initial guesses matter

Good ones can improve performance

  • e.g. the initial guess for the next iteration's coefficient estimates should be the current iteration's estimates

Bad ones can give you terrible performance, or wrong answers if your problem isn't perfect

  • e.g. bad scaling, not well-conditioned, multiple equilibria
103 / 103
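
A sketch of the perturbation check, assuming Optim.jl (the example function and jitter size are hypothetical):

using Optim, Random

Random.seed!(42)
f(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
x0 = [-1.0, 1.0]

sols = [Optim.minimizer(optimize(f, x0 .+ 0.5 .* randn(2), LBFGS()))
        for _ in 1:10]
# agreement across starts is reassuring; disagreement suggests multiple optima
# or a fragile problem (bad scaling, ill-conditioning)
all(s -> isapprox(s, sols[1]; atol = 1e-4), sols)
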
