A brief yet long-winded discussion of Lagrangian-relaxation algorithms.
Lagrangian relaxation
Lagrangian relaxation is a sophisticated generalization of the greedy approach: to approximately solve a given “hard” optimization problem, one solves a sequence of carefully chosen “easier” problems, then aggregates the solutions of the easier problems, typically by taking a weighted combination of them.
The easy problems are obtained from the original problem by relaxing some of the more difficult constraints, replacing them by a single surrogate constraint. Typically, the surrogate constraint is a weighted combination of the constraints it replaces, and the weights come from the gradient of a smooth potential function that somehow proxies the replaced constraints. To prove the algorithm’s quality of approximation, one proves some invariant involving this potential function.
Historically, Lagrangian-relaxation algorithms have been used (i) for problem formulations involving huge numbers of variables (as a practical alternative to ellipsoid methods — similarly to column generation; the Held-Karp lower bound for the traveling salesman problem is one famous example), and (ii) for problems that are composed of many subproblems that are loosely coupled by side constraints (multicommodity flow is a canonical example). The last few decades have seen improved bounds on convergence rates — that is, bounds on the number of subproblem solutions needed to approximate the original problem to the desired extent. Similarly to greedy algorithms, the form that these algorithms take makes them useful in many contexts (for example, distributed, parallel, and online settings).
The basic observation — that a “hard” problem can be approximately solved by a weighted combination of “easy” problems — can also be used to prove some nice combinatorial theorems. This is roughly because, if the easy-problem solutions have some nice combinatorial property, the property may carry over to their weighted combination, and thus to some approximate solution to the hard problem.
The basic observation is also directly related to the learning-theory concept of boosting (the problem of strongly learning a concept reduces to the problem of weakly learning it). Relatedly, some Lagrangian-relaxation algorithms can be viewed as learning-theory algorithms for following expert advice (such as multiplicative-weights-update algorithms).
Replacing many constraints by a single surrogate
A typical optimization problem is of the form find a vector x\in {\mathbb R}^ n meeting some constraints, possibly while maximizing or minimizing some objective function.
We will write the constraints in the form x\in {\cal P},\, Ax \le \mathbf1. That is, we partition the constraints into two sets: the first set we write abstractly as “x\in {\cal P}”, where {\cal P} can be an arbitrary set, but is typically some polytope (these are the constraints we want to leave “in place”); the second set “Ax \le \mathbf1” we call side constraints. (Technically, we require that Ax\ge \mathbf0 for all x\in {\cal P}, that is, the side constraints are packing or covering constraints.)
The algorithm will replace the side constraints by a single “surrogate” constraint. Somehow, the algorithm will compute a coefficient vector \alpha \in {\mathbb R}_+^ m, such that \sum _ i \alpha _ i = 1. The surrogate constraint will be the single linear constraint that is their \alpha -weighted average:
(Here A_ i is the ith row of A, so the ith side constraint is A_ i\cdot x \le 1.)
The resulting surrogate problem (for this \alpha ) is find x\in {\cal P} such that the single linear constraint \eqref{eq:1} holds (while maximizing or minimizing the objective function, if present). Clearly, the surrogate problem is a relaxation of the original problem. That is, any feasible solution to the original problem is also feasible for the surrogate problem.
The Lagrangian-relaxation algorithms that we discuss here all work as follows. In each iteration, the algorithm computes a particular vector \alpha \in {\mathbb R}_+^ m of constraint multipliers, then solves the surrogate problem for that particular \alpha . Finally, it returns a weighted average \overline x of the solutions to the surrogate problems.
A main point is that the surrogate problem is easier to solve than the original problem. As a simple example, suppose we start with the packing problem \max \{ |x|_1 \, :\, Ax \le \mathbf1, x\in {\mathbb R}_+^ n\} , where the polytope {\cal P} is simply the positive quadrant. Then the surrogate problem for a given \alpha \in {\mathbb R}_+^ m will be of the form
where \beta = \alpha ^{\scriptscriptstyle \sf T}A. The surrogate problem has just one constraint, so is solved by taking just one index j and taking that x_ j as large as possible (x_ j = 1/\beta _ j). The index j should be chosen to maximize the resulting cost, 1/\beta _ j.
The surrogate problems are feasible, and they bound OPT
Each surrogate problem is a relaxation of the original problem (including its objective), so, for each surrogate problem, the solution to the surrogate problem will meet the constraint x\in {\cal P} and will have cost at most (or, for maximization, at least) the optimal cost of the original problem. Hence the weighted average \overline x will also be in {\cal P} and will have cost at most (or at least) the optimal cost of the original problem. If the algorithm chooses the surrogate problems well, then the weighted average \overline x should also (approximately) meet all side constraints.
Each surrogate problem doesn’t have to be solved exactly. If each surrogate problem’s solution is, say, a \lambda -approximate solution for its problem, it is also a \lambda -approximate solution for the original problem (minus the side constraints). The weighted combination \overline x of such solutions will necessarily also be a \lambda -approximate solution to the original problem (minus the side constraints).
Modify the correctness proof in the multicommodity flow example to show that, if in each iteration the algorithm chooses a path p’ such that \sum _{e\in p’} \alpha _ e/c_ e \le \lambda \, \min _{p\in P} \sum _{e\in p}\alpha _ e/c_ e, then the final flow f will have value at least (1-O(\varepsilon ))|f^*|/\lambda .
Choosing the constraint multipliers
Typically, the algorithm will maintain a current solution x of some kind, and will be guided by a real-valued potential function \phi (x) that somehow smoothly proxies all of the side constraints. (For example, for the side constraints Ax\le \mathbf1, one might use the potential function \Phi (x) = \ln \sum _{i} \exp (A_ i x).) The constraint multiplier vector \alpha will typically be the gradient \alpha = \nabla \Phi (x) of the potential function with respect to x.
The algorithms are not typically pure gradient-descent algorithms, though. For example, in each iteration, the algorithm might choose just one coordinate x_ j of the current solution x, and then increase only that coordinate. It might choose the x_ j so as to maximize c_ j/(\delta \Phi /\delta x_ j) — the rate of increase in the cost divided by the rate of increase in the potential function.
Potential-function invariant for the side constraints
In this way, the algorithm will maintain some potential-function invariant on the current solution x. For example, the invariant might be \Phi (x) \le (c\cdot x)/{\textsc{opt}}+ \ln m. At termination (once c\cdot x is large enough) the invariant will imply that the side constraints are approximately met.
Step size and trust regions
The potential functions, which are typically not linear, are generally smooth. That is, at any point \tilde x, they are well-approximated by their first-order approximation: for any “small” d,
With each surrogate subproblem, the algorithm will typically take the current overall solution \tilde x, and then move it in the direction of the solution to the surrogate subproblem: \tilde x \leftarrow \tilde x + \lambda x, where x is the solution to the surrogate subproblem.
The algorithm will typically choose the step size \lambda so that the increase in the potential function is (1+O(\varepsilon ))-approximated by its first-order approximation \lambda \nabla \Phi (\tilde x) \cdot \tilde x. That is, it will move \tilde x to a new point that is within the trust region around the current point.
This will suffice to maintain the desired invariant. (A tolerance for this (1+O(\varepsilon ))-approximation of the change will be built into the potential function or the invariant.)
Convergence
There is a large literature on Lagrangian-relaxation algorithms in Computer Science. Some of the more well-known papers are by Plotkin, Shmoys, and Tardos (e.g. [12]), and by Grigoriadis and Khachiyan (e.g. [6]).
With current algorithms (e.g. [4]), the number of iterations needed in the worst case to converge to a (1\pm \varepsilon )-approximate solution is typically O(\min (m,\omega )\log (m)/\epsilon ^2), where m is the number of constraints, and \omega is the so-called width of the problem (here, the maximum value of A_ i\cdot x for any x\in {\cal P} and matrix row A_ i).
The time to solve the subproblem in each iteration depends on the subproblem. For explicitly given pure packing or pure covering problems (where {\cal P} is the just \{ x~ |~ x\ge 0\} ), the time can (roughly) be reduced to O(1) per iteration [9].
Surrogate duality
The surrogate relaxation described above is a form of partial linear-programming duality: for any constraint-multiplier vector \alpha , the optimal solution to the surrogate relaxation of the problem gives a bound on opt. This is a kind of duality called surrogate duality.
It often translates directly into standard LP dual solutions. The vector \alpha corresponds to a dual solution (for some reformulation of the original LP). For example, recall the surrogate problem for the simple packing problem \max \{ |x|_1 \, :\, Ax \le \mathbf1, x\in {\mathbb R}_+^ n\} . The surrogate is \max \{ |x|_1 \, :\, \beta \cdot x \ge 1, x\in {\mathbb R}_+^ n\} , where \beta = \alpha ^{\scriptscriptstyle \sf T}A. The surrogate problem is solved by finding the index j that maximizes 1/\beta _ j, taking that x_ j to be 1/\beta _ j, and leaving all other coordinates zero. Since this is the optimal solution for the surrogate problem, which is in turn a relaxation of the original problem, the cost achieved by this x is an upper bound on opt:
(Here A^{\scriptscriptstyle \sf T}_ j is the jth column of A.)
The standard LP dual of the original problem is \min \{ |y|_1 \, :\, A^{\scriptscriptstyle \sf T}y \ge 1, x\in {\mathbb R}_+^ n\} . We leave it as an exercise to show that scaling the constraint-multiplier vector \alpha gives a dual solution y whose cost (by weak duality) gives the same upper bound on opt. (Recall that |\alpha |_1=1.)
Problems having many variables
Lagrangian-relaxation algorithms can be used for problems with many variables, as long as the surrogate problem can be solved. For example, consider a simple multicommodity-flow problem in feasibility form:
There is a, implicitly, a variable for every path p\in P. There can be exponentially many, but, applying Lagrangian-relaxation with the capacity constraints as side constraints, the surrogate problem that arises is this: for a given \alpha (with |\alpha |_1=1),
The solution (if it exists) will be to send 1 unit of flow along a path p\in P such that \sum _{e\in P} \alpha _ e \le c. That is, \eqref{surrogate} is equivalent to
Such a path can be found quickly, if it exists, using a shortest-path algorithm. In this way, Lagrangian-relaxation can reduce the multicommodity-flow problem to a (poly-length) sequence of shortest-path problems.
Column generation is a practical technique with a similar spirit. The idea is that if a problem formulation has many more variables than constraints, one may run the Simplex algorithm, but only explicitly write down columns (variables) that have been brought into the current solution with non-zero values. To do the pivot step at each iteration requires checking whether a new variable (column) should be brought in. Dantzig-Wolfe decomposition explicitly combines this idea with Lagrangian relaxation. Benders’ decomposition brings in rows as needed, instead of columns.
Relation to dual Ellipsoid method
Recall that the Ellipsoid method uses a “separation oracle” framework: to try find a feasible point in a convex body {\cal B}, the Ellipsoid method supplies a candidate point \tilde\alpha to the separation oracle for {\cal B}. The oracle must either determine that \tilde\alpha is in {\cal B} (and we are done), or must return a linear constraint that is valid for {\cal B} but not for \tilde\alpha (a proof that \tilde\alpha is not in {\cal B}), in which case the algorithm continues. Crucially for many applications, the convex body {\cal B} may be defined (implicitly) as the intersection of exponentially many constraints. As long as the oracle can find a constraint violated by a given \tilde\alpha (if there is one) in polynomial time, the Ellipsoid method will finish in polynomial time.
Consider applying the separation-oracle framework to the dual of the problem \eqref{primal} above. One dual formulation (after appropriate scaling) is
(Note that the last inequality is strict.) This dual is feasible iff the primal is infeasible.
Take the polytope {\cal B} to be the set of \alpha ’s meeting the constraints in \eqref{dual}, then apply Ellipsoid. For this polytope, Ellipsoid will give the separation oracle a candidate \alpha . To determine whether \alpha is feasible for {\cal B}, the separation oracle can check directly that |\alpha |_1=1. Determining whether the remaining constraints are feasible is equivalent to solving the surrogate problem \eqref{surrogate2}.
Generally, whenever one can apply Ellipsoid to solve a given packing or covering problem using a particular separation oracle, then it is also possible to solve the problem by applying Lagrangian relaxation to the dual: the surrogate problem that arises will be the one solved by the separation oracle. The converse is also true.
Machine learning
Some types of Lagrangian-relaxation algorithms are closely related to multiplicative-weights update algorithms from machine learning, which play a role in boosting and following expert advice with low regret. Boosting reduces the problem of strongly learning a concept to a sequence of problems of weakly learning the concept against some distribution \alpha . Technically, this reduction corresponds to the reduction done by Lagrangian-relaxation algorithms, reducing a hard problem to a sequence of surrogate problem. See [3] for more details.
Lagrangian-relaxation algorithms can also be interpreted as fictitious-play algorithms for two-player zero-sum games with non-negative payoffs.
Differential equations and Lyapunov functions
Lagrangian-relaxation algorithms may be thought of as gradient-descent algorithms, at least in that each step is chosen to augment the current solution in a direction parallel to some coordinate axis, and along which the potential-function change is suitably bounded.
Making the step size \varepsilon infinitesmal gives a continuous-time algorithm. The resulting system avoids the discrete step, so is easier to study than a discrete algorithm: it can be modeled as a system of differential equations, where the potential function of the algorithm becomes a Lyapunov function for the system. Such systems are deeply studied in control theory [2], and are used to analyze congestion-control mechanisms [11, 7, 5].
Oblivious randomized rounding
Many of the notes here explore the following connection between greedy/Lagrangian-relaxation algorithms and randomized rounding: Lagrangian-relaxation algorithms can be understood as randomized-rounding algorithms, and can be systematically derived using the method of conditional expectations. In this view, the algorithms and their underlying potential functions are derived using standard probabilistic tools. The potential functions are “given meaning” as conditional expectations or pessimistic estimators.
Related
Lagrangian-relaxation algorithms have a long history. They were analyzed as early as the 1950s (e.g., by von Neumann [10]). Their main historical use has been to reduce a hard problem (such as multicommodity flow) to a sequence of simpler problems (such as shortest paths). For a brief summary of their history within computer science, see [8]. See also the technical survey [1]. For an operations-research perspective, see [13].
On Wikipedia
-
Column generation, Dantzig-Wolfe decomposition, Benders’ decomposition
-
the Winnow algorithm and AdaBoost (multiplicative-weights algorithms in machine learning)
-
Lyapunov function, Congestion control, control theory (Caltech)