A brief yet long-winded discussion of Lagrangian-relaxation algorithms.

Click for background material…

Lagrangian relaxation

Lagrangian relaxation is a sophisticated generalization of the greedy approach: to approximately solve a given “hard” optimization problem, one solves a sequence of carefully chosen “easier” problems, then aggregates the solutions of the easier problems, typically by taking a weighted combination of them.

The easy problems are obtained from the original problem by relaxing some of the more difficult constraints, replacing them by a single surrogate constraint. Typically, the surrogate constraint is a weighted combination of the constraints it replaces, and the weights come from the gradient of a smooth potential function that somehow proxies the replaced constraints. To prove the algorithm’s quality of approximation, one proves some invariant involving this potential function.

Historically, Lagrangian-relaxation algorithms have been used (i) for problem formulations involving huge numbers of variables (as a practical alternative to ellipsoid methods — similarly to column generation; the Held-Karp lower bound for the traveling salesman problem is one famous example), and (ii) for problems that are composed of many subproblems that are loosely coupled by side constraints (multicommodity flow is a canonical example). The last few decades have seen improved bounds on convergence rates — that is, bounds on the number of subproblem solutions needed to approximate the original problem to the desired extent. Similarly to greedy algorithms, the form that these algorithms take makes them useful in many contexts (for example, distributed, parallel, and online settings).

The basic observation — that a “hard” problem can be approximately solved by a weighted combination of “easy” problems — can also be used to prove some nice combinatorial theorems. This is roughly because, if the easy-problem solutions have some nice combinatorial property, the property may carry over to their weighted combination, and thus to some approximate solution to the hard problem.

The basic observation is also directly related to the learning-theory concept of boosting (the problem of strongly learning a concept reduces to the problem of weakly learning it). Relatedly, some Lagrangian-relaxation algorithms can be viewed as learning-theory algorithms for following expert advice (such as multiplicative-weights-update algorithms).

Replacing many constraints by a single surrogate

A typical optimization problem is of the form find a vector $x\in {\mathbb R}^ n$ meeting some constraints, possibly while maximizing or minimizing some objective function.

We will write the constraints in the form $x\in {\cal P},\, Ax \le \mathbf1$ . That is, we partition the constraints into two sets: the first set we write abstractly as “ $x\in {\cal P}$ ”, where ${\cal P}$ can be an arbitrary set, but is typically some polytope (these are the constraints we want to leave “in place”); the second set “ $Ax \le \mathbf1$ ” we call side constraints. (Technically, we require that $Ax\ge \mathbf0$ for all $x\in {\cal P}$ , that is, the side constraints are packing or covering constraints.)

The algorithm will replace the side constraints by a single “surrogate” constraint. Somehow, the algorithm will compute a coefficient vector $\alpha \in {\mathbb R}_+^ m$ , such that $\sum _ i \alpha _ i = 1$ . The surrogate constraint will be the single linear constraint that is their $\alpha$ -weighted average:

\begin{equation} ({\textstyle \sum _ i \alpha _ i A_ i}) \cdot x \le 1\label{eq:1}, ~ \mbox{ in other words, }~ (\alpha ^{\scriptscriptstyle \sf T}A) x \le 1. \end{equation}

(Here $A_ i$ is the $i$ th row of $A$ , so the $i$ th side constraint is $A_ i\cdot x \le 1$ .)

The resulting surrogate problem (for this $\alpha$ ) is find $x\in {\cal P}$ such that the single linear constraint \eqref{eq:1} holds (while maximizing or minimizing the objective function, if present). Clearly, the surrogate problem is a relaxation of the original problem. That is, any feasible solution to the original problem is also feasible for the surrogate problem.

The algorithm takes a weighted average of surrogate solutions.

The Lagrangian-relaxation algorithms that we discuss here all work as follows. In each iteration, the algorithm computes a particular vector $\alpha \in {\mathbb R}_+^ m$ of constraint multipliers, then solves the surrogate problem for that particular $\alpha$ . Finally, it returns a weighted average $\overline x$ of the solutions to the surrogate problems.

Example.

A main point is that the surrogate problem is easier to solve than the original problem. As a simple example, suppose we start with the packing problem $\max \{ |x|_1 \, :\, Ax \le \mathbf1, x\in {\mathbb R}_+^ n\}$ , where the polytope ${\cal P}$ is simply the positive quadrant. Then the surrogate problem for a given $\alpha \in {\mathbb R}_+^ m$ will be of the form

\[ \max \{ |x|_1 \, :\, \beta \cdot x \le 1,\, x\in {\mathbb R}_+^ n\} , \]

where $\beta = \alpha ^{\scriptscriptstyle \sf T}A$ . The surrogate problem has just one constraint, so is solved by taking just one index $j$ and taking that $x_ j$ as large as possible ( $x_ j = 1/\beta _ j$ ). The index $j$ should be chosen to maximize the resulting cost, $1/\beta _ j$ .

The surrogate problems are feasible, and they bound OPT

Each surrogate problem is a relaxation of the original problem (including its objective), so, for each surrogate problem, the solution to the surrogate problem will meet the constraint $x\in {\cal P}$ and will have cost at most (or, for maximization, at least) the optimal cost of the original problem. Hence the weighted average $\overline x$ will also be in ${\cal P}$ and will have cost at most (or at least) the optimal cost of the original problem. If the algorithm chooses the surrogate problems well, then the weighted average $\overline x$ should also (approximately) meet all side constraints.

Solving the surrogate problems approximately.

Each surrogate problem doesn’t have to be solved exactly. If each surrogate problem’s solution is, say, a $\lambda$ -approximate solution for its problem, it is also a $\lambda$ -approximate solution for the original problem (minus the side constraints). The weighted combination $\overline x$ of such solutions will necessarily also be a $\lambda$ -approximate solution to the original problem (minus the side constraints).

Exercise (using an approximate subroutine).

Modify the correctness proof in the multicommodity flow example to show that, if in each iteration the algorithm chooses a path $p’$ such that $\sum _{e\in p'} \alpha _ e/c_ e \le \lambda \, \min _{p\in P} \sum _{e\in p}\alpha _ e/c_ e$ , then the final flow $f$ will have value at least $(1-O(\varepsilon ))|f^*|/\lambda$ .

Choosing the constraint multipliers

Typically, the algorithm will maintain a current solution $x$ of some kind, and will be guided by a real-valued potential function $\phi (x)$ that somehow smoothly proxies all of the side constraints. (For example, for the side constraints $Ax\le \mathbf1$ , one might use the potential function $\Phi (x) = \ln \sum _{i} \exp (A_ i x)$ .) The constraint multiplier vector $\alpha$ will typically be the gradient $\alpha = \nabla \Phi (x)$ of the potential function with respect to $x$ .

The algorithms are not typically pure gradient-descent algorithms, though. For example, in each iteration, the algorithm might choose just one coordinate $x_ j$ of the current solution $x$ , and then increase only that coordinate. It might choose the $x_ j$ so as to maximize $c_ j/(\delta \Phi /\delta x_ j)$ — the rate of increase in the cost divided by the rate of increase in the potential function.

Potential-function invariant for the side constraints

In this way, the algorithm will maintain some potential-function invariant on the current solution $x$ . For example, the invariant might be $\Phi (x) \le (c\cdot x)/{\textsc{opt}}+ \ln m$ . At termination (once $c\cdot x$ is large enough) the invariant will imply that the side constraints are approximately met.

Step size and trust regions

The potential functions, which are typically not linear, are generally smooth. That is, at any point $\tilde x$ , they are well-approximated by their first-order approximation: for any “small” $d$ ,

\[ \Phi (\tilde x+d) \approx \Phi (\tilde x) + \nabla \Phi (\tilde x) \cdot d. \]

With each surrogate subproblem, the algorithm will typically take the current overall solution $\tilde x$ , and then move it in the direction of the solution to the surrogate subproblem: $\tilde x \leftarrow \tilde x + \lambda x$ , where $x$ is the solution to the surrogate subproblem.

The algorithm will typically choose the step size $\lambda$ so that the increase in the potential function is $(1+O(\varepsilon ))$ -approximated by its first-order approximation $\lambda \nabla \Phi (\tilde x) \cdot \tilde x$ . That is, it will move $\tilde x$ to a new point that is within the trust region around the current point.

This will suffice to maintain the desired invariant. (A tolerance for this $(1+O(\varepsilon ))$ -approximation of the change will be built into the potential function or the invariant.)

Convergence

There is a large literature on Lagrangian-relaxation algorithms in Computer Science. Some of the more well-known papers are by Plotkin, Shmoys, and Tardos (e.g. [12]), and by Grigoriadis and Khachiyan (e.g. [6]).

With current algorithms (e.g. ), the number of iterations needed in the worst case to converge to a $(1\pm \varepsilon )$ -approximate solution is typically $O(\min (m,\omega )\log (m)/\epsilon ^2)$ , where $m$ is the number of constraints, and $\omega$ is the so-called width of the problem (here, the maximum value of $A_ i\cdot x$ for any $x\in {\cal P}$ and matrix row $A_ i$ ).

The time to solve the subproblem in each iteration depends on the subproblem. For explicitly given pure packing or pure covering problems (where ${\cal P}$ is the just $\{ x~ |~ x\ge 0\}$ ), the time can (roughly) be reduced to $O(1)$ per iteration [9].

Surrogate duality

The surrogate relaxation described above is a form of partial linear-programming duality: for any constraint-multiplier vector $\alpha$ , the optimal solution to the surrogate relaxation of the problem gives a bound on opt. This is a kind of duality called surrogate duality.

It often translates directly into standard LP dual solutions. The vector $\alpha$ corresponds to a dual solution (for some reformulation of the original LP). For example, recall the surrogate problem for the simple packing problem $\max \{ |x|_1 \, :\, Ax \le \mathbf1, x\in {\mathbb R}_+^ n\}$ . The surrogate is $\max \{ |x|_1 \, :\, \beta \cdot x \ge 1, x\in {\mathbb R}_+^ n\} ,$ where $\beta = \alpha ^{\scriptscriptstyle \sf T}A$ . The surrogate problem is solved by finding the index $j$ that maximizes $1/\beta _ j$ , taking that $x_ j$ to be $1/\beta _ j$ , and leaving all other coordinates zero. Since this is the optimal solution for the surrogate problem, which is in turn a relaxation of the original problem, the cost achieved by this $x$ is an upper bound on opt:

\[ {\textsc{opt}}~ \le ~ 1/\min _ j \beta _ j ~ =~ 1/\min _ j \alpha \cdot A^{\scriptscriptstyle \sf T}_ j. \]

(Here $A^{\scriptscriptstyle \sf T}_ j$ is the $j$ th column of $A$ .)

The standard LP dual of the original problem is $\min \{ |y|_1 \, :\, A^{\scriptscriptstyle \sf T}y \ge 1, x\in {\mathbb R}_+^ n\}$ . We leave it as an exercise to show that scaling the constraint-multiplier vector $\alpha$ gives a dual solution $y$ whose cost (by weak duality) gives the same upper bound on . (Recall that $|\alpha |_1=1$ .)

Problems having many variables

Lagrangian-relaxation algorithms can be used for problems with many variables, as long as the surrogate problem can be solved. For example, consider a simple multicommodity-flow problem in feasibility form:

\begin{equation} \label{primal}\textstyle \mbox{find } f\in {\mathbb R}^ P \text { such that } |f| = 1 \mbox{ and } \forall e\in E.\, f(e) \le \, c. \end{equation}

There is a, implicitly, a variable for every path $p\in P$ . There can be exponentially many, but, applying Lagrangian-relaxation with the capacity constraints as side constraints, the surrogate problem that arises is this: for a given $\alpha$ (with $|\alpha |_1=1$ ),

\begin{equation} \label{surrogate}\textstyle \mbox{find } f\in {\mathbb R}^ P \text { such that } |f| = 1 \mbox{ and } \sum _{e\in E} \alpha _ e f(e)\, \le \, c. \end{equation}

The solution (if it exists) will be to send $1$ unit of flow along a path $p\in P$ such that $\sum _{e\in P} \alpha _ e \le c$ . That is, \eqref{surrogate} is equivalent to

\begin{equation} \label{surrogate2}\textstyle \mbox{find } p\in P \text { such that } \sum _{e\in p} \alpha _ e \, \le \, c. \end{equation}

Such a path can be found quickly, if it exists, using a shortest-path algorithm. In this way, Lagrangian-relaxation can reduce the multicommodity-flow problem to a (poly-length) sequence of shortest-path problems.

Similarity to column generation.

Column generation is a practical technique with a similar spirit. The idea is that if a problem formulation has many more variables than constraints, one may run the Simplex algorithm, but only explicitly write down columns (variables) that have been brought into the current solution with non-zero values. To do the pivot step at each iteration requires checking whether a new variable (column) should be brought in. Dantzig-Wolfe decomposition explicitly combines this idea with Lagrangian relaxation. Benders’ decomposition brings in rows as needed, instead of columns.

Relation to dual Ellipsoid method

Recall that the Ellipsoid method uses a “separation oracle” framework: to try find a feasible point in a convex body ${\cal B}$ , the Ellipsoid method supplies a candidate point $\tilde\alpha$ to the separation oracle for ${\cal B}$ . The oracle must either determine that $\tilde\alpha$ is in ${\cal B}$ (and we are done), or must return a linear constraint that is valid for ${\cal B}$ but not for $\tilde\alpha$ (a proof that $\tilde\alpha$ is not in ${\cal B}$ ), in which case the algorithm continues. Crucially for many applications, the convex body ${\cal B}$ may be defined (implicitly) as the intersection of exponentially many constraints. As long as the oracle can find a constraint violated by a given $\tilde\alpha$ (if there is one) in polynomial time, the Ellipsoid method will finish in polynomial time.

Consider applying the separation-oracle framework to the dual of the problem \eqref{primal} above. One dual formulation (after appropriate scaling) is

\begin{equation} \label{dual}\textstyle \mbox{find } \alpha \in {\mathbb R}^ m \text { such that } |\alpha |_1 = 1 \mbox{ and } \forall p\in P.\, \sum _{e\in p} \alpha _ e \gt c. \end{equation}

(Note that the last inequality is strict.) This dual is feasible iff the primal is infeasible.

Take the polytope ${\cal B}$ to be the set of $\alpha$ ’s meeting the constraints in \eqref{dual}, then apply Ellipsoid. For this polytope, Ellipsoid will give the separation oracle a candidate $\alpha$ . To determine whether $\alpha$ is feasible for ${\cal B}$ , the separation oracle can check directly that $|\alpha |_1=1$ . Determining whether the remaining constraints are feasible is equivalent to solving the surrogate problem \eqref{surrogate2}.

Generally, whenever one can apply Ellipsoid to solve a given packing or covering problem using a particular separation oracle, then it is also possible to solve the problem by applying Lagrangian relaxation to the dual: the surrogate problem that arises will be the one solved by the separation oracle. The converse is also true.

Machine learning

Some types of Lagrangian-relaxation algorithms are closely related to multiplicative-weights update algorithms from machine learning, which play a role in boosting and following expert advice with low regret. Boosting reduces the problem of strongly learning a concept to a sequence of problems of weakly learning the concept against some distribution $\alpha$ . Technically, this reduction corresponds to the reduction done by Lagrangian-relaxation algorithms, reducing a hard problem to a sequence of surrogate problem. See [3] for more details.

Lagrangian-relaxation algorithms can also be interpreted as fictitious-play algorithms for two-player zero-sum games with non-negative payoffs.

Differential equations and Lyapunov functions

Lagrangian-relaxation algorithms may be thought of as gradient-descent algorithms, at least in that each step is chosen to augment the current solution in a direction parallel to some coordinate axis, and along which the potential-function change is suitably bounded.

Making the step size $\varepsilon$ infinitesmal gives a continuous-time algorithm. The resulting system avoids the discrete step, so is easier to study than a discrete algorithm: it can be modeled as a system of differential equations, where the potential function of the algorithm becomes a Lyapunov function for the system. Such systems are deeply studied in control theory [2], and are used to analyze congestion-control mechanisms [11, 7, 5].

Oblivious randomized rounding

Many of the notes here explore the following connection between greedy/Lagrangian-relaxation algorithms and randomized rounding: Lagrangian-relaxation algorithms can be understood as randomized-rounding algorithms, and can be systematically derived using the method of conditional expectations. In this view, the algorithms and their underlying potential functions are derived using standard probabilistic tools. The potential functions are “given meaning” as conditional expectations or pessimistic estimators.

Lagrangian-relaxation algorithms have a long history. They were analyzed as early as the 1950s (e.g., by von Neumann [10]). Their main historical use has been to reduce a hard problem (such as multicommodity flow) to a sequence of simpler problems (such as shortest paths). For a brief summary of their history within computer science, see [8]. See also the technical survey [1]. For an operations-research perspective, see [13].

On Wikipedia

Lagrangian relaxation, Lagrange multiplier, Trust region
Column generation, Dantzig-Wolfe decomposition, Benders’ decomposition
Ellipsoid method
the Winnow algorithm and AdaBoost (multiplicative-weights algorithms in machine learning)
Fictitious play
Lyapunov function, Congestion control, control theory (Caltech)

Bibliography

[1]	S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
[2]	K. ström and R. Murray. Feedback systems: an introduction for scientists and engineers. Princeton University Press, 2008. Online at http://www.cds.caltech.edu/ $\sim$ murray/amwiki.
[3]	Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
[4]	N. Garg and J. Könemann. Faster and simpler algorithms for multicommodity flow and other fractional packing problems. SIAM Journal on Computing, 37:630, 2007. Preliminary version in FOCS’98 (pp. 300–309).
[5]	N. Garg and N. Young. On-line end-to-end congestion control. In Foundations of Computer Science, 2002. Proceedings. The 43rd Annual IEEE Symposium on, pages 303–310. IEEE, 2002.
[6]	M. D. Grigoriadis, L. G. Khachiyan, L. Porkolab, and J. Villavicencio. Approximate max-min resource sharing for structured concave optimization. SIAM Journal on Optimization, 11:1081–1091, 2001.
[7]	F. Kelly. Mathematical modelling of the internet. Mathematics Unlimited-2001 and Beyond, pages 685–702, 2001.
[8]	P. Klein and N. E. Young. On the number of iterations for Dantzig-Wolfe optimization and packing-covering approximation algorithms. Lecture Notes in Computer Science, 1610:320–327, 1999.
[9]	C. Koufogiannakis and N. E. Young. Beating simplex for fractional packing and covering linear programs. In Forty Eighth Annual Symposium on Foundations of Computer Science, Providence, Rhode Island, 21–23 Oct. 2007. IEEE.
[10]	H. W. Kuhn and A. W. Tucker. Review of ‘A numerical method for determination of the value and the best strategies of a zero-sum two-person game with large numbers of strategies’, by John von Neumann. In John von Neumann: Collected Works, volume VI, pages 96–97. Pergamon Press, 1963.
[11]	S. Low, F. Paganini, and J. Doyle. Internet congestion control. Control Systems Magazine, IEEE, 22(1):28–43, 2002.
[12]	S. A. Plotkin, D. B. Shmoys, and E. Tardos. Fast approximation algorithms for fractional packing and covering problems. Math. Operations Research, 20(2):257–301, 1995. Preliminary version in FOCS’91.
[13]	M. J. Todd. The many facets of linear programming. Mathematical Programming, 91(3):417–436, 2002.

Notes on algorithms

Lecture notes on algorithms

Lagrangian relaxation / discussion