A brief yet longwinded discussion of Lagrangianrelaxation algorithms.
Lagrangian relaxation
Lagrangian relaxation is a sophisticated generalization of the greedy approach: to approximately solve a given “hard” optimization problem, one solves a sequence of carefully chosen “easier” problems, then aggregates the solutions of the easier problems, typically by taking a weighted combination of them.
The easy problems are obtained from the original problem by relaxing some of the more difficult constraints, replacing them by a single surrogate constraint. Typically, the surrogate constraint is a weighted combination of the constraints it replaces, and the weights come from the gradient of a smooth potential function that somehow proxies the replaced constraints. To prove the algorithm’s quality of approximation, one proves some invariant involving this potential function.
Historically, Lagrangianrelaxation algorithms have been used (i) for problem formulations involving huge numbers of variables (as a practical alternative to ellipsoid methods — similarly to column generation; the HeldKarp lower bound for the traveling salesman problem is one famous example), and (ii) for problems that are composed of many subproblems that are loosely coupled by side constraints (multicommodity flow is a canonical example). The last few decades have seen improved bounds on convergence rates — that is, bounds on the number of subproblem solutions needed to approximate the original problem to the desired extent. Similarly to greedy algorithms, the form that these algorithms take makes them useful in many contexts (for example, distributed, parallel, and online settings).
The basic observation — that a “hard” problem can be approximately solved by a weighted combination of “easy” problems — can also be used to prove some nice combinatorial theorems. This is roughly because, if the easyproblem solutions have some nice combinatorial property, the property may carry over to their weighted combination, and thus to some approximate solution to the hard problem.
The basic observation is also directly related to the learningtheory concept of boosting (the problem of strongly learning a concept reduces to the problem of weakly learning it). Relatedly, some Lagrangianrelaxation algorithms can be viewed as learningtheory algorithms for following expert advice (such as multiplicativeweightsupdate algorithms).
Replacing many constraints by a single surrogate
A typical optimization problem is of the form find a vector x\in {\mathbb R}^ n meeting some constraints, possibly while maximizing or minimizing some objective function.
We will write the constraints in the form x\in {\cal P},\, Ax \le \mathbf1. That is, we partition the constraints into two sets: the first set we write abstractly as “x\in {\cal P}”, where {\cal P} can be an arbitrary set, but is typically some polytope (these are the constraints we want to leave “in place”); the second set “Ax \le \mathbf1” we call side constraints. (Technically, we require that Ax\ge \mathbf0 for all x\in {\cal P}, that is, the side constraints are packing or covering constraints.)
The algorithm will replace the side constraints by a single “surrogate” constraint. Somehow, the algorithm will compute a coefficient vector \alpha \in {\mathbb R}_+^ m, such that \sum _ i \alpha _ i = 1. The surrogate constraint will be the single linear constraint that is their \alpha weighted average:
(Here A_ i is the ith row of A, so the ith side constraint is A_ i\cdot x \le 1.)
The resulting surrogate problem (for this \alpha ) is find x\in {\cal P} such that the single linear constraint \eqref{eq:1} holds (while maximizing or minimizing the objective function, if present). Clearly, the surrogate problem is a relaxation of the original problem. That is, any feasible solution to the original problem is also feasible for the surrogate problem.
The Lagrangianrelaxation algorithms that we discuss here all work as follows. In each iteration, the algorithm computes a particular vector \alpha \in {\mathbb R}_+^ m of constraint multipliers, then solves the surrogate problem for that particular \alpha . Finally, it returns a weighted average \overline x of the solutions to the surrogate problems.
A main point is that the surrogate problem is easier to solve than the original problem. As a simple example, suppose we start with the packing problem \max \{ x_1 \, :\, Ax \le \mathbf1, x\in {\mathbb R}_+^ n\} , where the polytope {\cal P} is simply the positive quadrant. Then the surrogate problem for a given \alpha \in {\mathbb R}_+^ m will be of the form
where \beta = \alpha ^{\scriptscriptstyle \sf T}A. The surrogate problem has just one constraint, so is solved by taking just one index j and taking that x_ j as large as possible (x_ j = 1/\beta _ j). The index j should be chosen to maximize the resulting cost, 1/\beta _ j.
The surrogate problems are feasible, and they bound OPT
Each surrogate problem is a relaxation of the original problem (including its objective), so, for each surrogate problem, the solution to the surrogate problem will meet the constraint x\in {\cal P} and will have cost at most (or, for maximization, at least) the optimal cost of the original problem. Hence the weighted average \overline x will also be in {\cal P} and will have cost at most (or at least) the optimal cost of the original problem. If the algorithm chooses the surrogate problems well, then the weighted average \overline x should also (approximately) meet all side constraints.
Each surrogate problem doesn’t have to be solved exactly. If each surrogate problem’s solution is, say, a \lambda approximate solution for its problem, it is also a \lambda approximate solution for the original problem (minus the side constraints). The weighted combination \overline x of such solutions will necessarily also be a \lambda approximate solution to the original problem (minus the side constraints).
Modify the correctness proof in the multicommodity flow example to show that, if in each iteration the algorithm chooses a path p’ such that \sum _{e\in p’} \alpha _ e/c_ e \le \lambda \, \min _{p\in P} \sum _{e\in p}\alpha _ e/c_ e, then the final flow f will have value at least (1O(\varepsilon ))f^*/\lambda .
Choosing the constraint multipliers
Typically, the algorithm will maintain a current solution x of some kind, and will be guided by a realvalued potential function \phi (x) that somehow smoothly proxies all of the side constraints. (For example, for the side constraints Ax\le \mathbf1, one might use the potential function \Phi (x) = \ln \sum _{i} \exp (A_ i x).) The constraint multiplier vector \alpha will typically be the gradient \alpha = \nabla \Phi (x) of the potential function with respect to x.
The algorithms are not typically pure gradientdescent algorithms, though. For example, in each iteration, the algorithm might choose just one coordinate x_ j of the current solution x, and then increase only that coordinate. It might choose the x_ j so as to maximize c_ j/(\delta \Phi /\delta x_ j) — the rate of increase in the cost divided by the rate of increase in the potential function.
Potentialfunction invariant for the side constraints
In this way, the algorithm will maintain some potentialfunction invariant on the current solution x. For example, the invariant might be \Phi (x) \le (c\cdot x)/{\textsc{opt}}+ \ln m. At termination (once c\cdot x is large enough) the invariant will imply that the side constraints are approximately met.
Step size and trust regions
The potential functions, which are typically not linear, are generally smooth. That is, at any point \tilde x, they are wellapproximated by their firstorder approximation: for any “small” d,
With each surrogate subproblem, the algorithm will typically take the current overall solution \tilde x, and then move it in the direction of the solution to the surrogate subproblem: \tilde x \leftarrow \tilde x + \lambda \tilde x, where \tilde x is the solution to the surrogate subproblem.
The algorithm will typically choose the step size \lambda so that the increase in the potential function is (1+O(\varepsilon ))approximated by its firstorder approximation \lambda \nabla \Phi (\tilde x) \cdot \tilde x. That is, it will move \tilde x to a new point that is within the trust region around the current point.
This will suffice to maintain the desired invariant. (A tolerance for this (1+O(\varepsilon ))approximation of the change will be built into the potential function or the invariant.)
Convergence
There is a large literature on Lagrangianrelaxation algorithms in Computer Science. Some of the more wellknown papers are by Plotkin, Shmoys, and Tardos (e.g. [12]), and by Grigoriadis and Khachiyan (e.g. [6]).
With current algorithms (e.g. [4]), the number of iterations needed in the worst case to converge to a (1\pm \varepsilon )approximate solution is typically O(\min (m,\omega )\log (m)/\epsilon ^2), where m is the number of constraints, and \omega is the socalled width of the problem (here, the maximum value of A_ i\cdot x for any x\in {\cal P} and matrix row A_ i).
The time to solve the subproblem in each iteration depends on the subproblem. For explicitly given pure packing or pure covering problems (where {\cal P} is the just \{ x~ ~ x\ge 0\} ), the time can (roughly) be reduced to O(1) per iteration [9].
Surrogate duality
The surrogate relaxation described above is a form of partial linearprogramming duality: for any constraintmultiplier vector \alpha , the optimal solution to the surrogate relaxation of the problem gives a bound on opt. This is a kind of duality called surrogate duality.
It often translates directly into standard LP dual solutions. The vector \alpha corresponds to a dual solution (for some reformulation of the original LP). For example, recall the surrogate problem for the simple packing problem \max \{ x_1 \, :\, Ax \le \mathbf1, x\in {\mathbb R}_+^ n\} . The surrogate is \max \{ x_1 \, :\, \beta \cdot x \ge 1, x\in {\mathbb R}_+^ n\} , where \beta = \alpha ^{\scriptscriptstyle \sf T}A. The surrogate problem is solved by finding the index j that maximizes 1/\beta _ j, taking that x_ j to be 1/\beta _ j, and leaving all other coordinates zero. Since this is the optimal solution for the surrogate problem, which is in turn a relaxation of the original problem, the cost achieved by this x is an upper bound on opt:
(Here A^{\scriptscriptstyle \sf T}_ j is the jth column of A.)
The standard LP dual of the original problem is \min \{ y_1 \, :\, A^{\scriptscriptstyle \sf T}y \ge 1, x\in {\mathbb R}_+^ n\} . We leave it as an exercise to show that scaling the constraintmultiplier vector \alpha gives a dual solution y whose cost (by weak duality) gives the same upper bound on opt. (Recall that \alpha _1=1.)
Problems having many variables
Lagrangianrelaxation algorithms can be used for problems with many variables, as long as the surrogate problem can be solved. For example, consider a simple multicommodityflow problem in feasibility form:
There is a, implicitly, a variable for every path p\in P. There can be exponentially many, but, applying Lagrangianrelaxation with the capacity constraints as side constraints, the surrogate problem that arises is this: for a given \alpha (with \alpha _1=1),
The solution (if it exists) will be to send 1 unit of flow along a path p\in P such that \sum _{e\in P} \alpha _ e \le c. That is, \eqref{surrogate} is equivalent to
Such a path can be found quickly, if it exists, using a shortestpath algorithm. In this way, Lagrangianrelaxation can reduce the multicommodityflow problem to a (polylength) sequence of shortestpath problems.
Column generation is a practical technique with a similar spirit. The idea is that if a problem formulation has many more variables than constraints, one may run the Simplex algorithm, but only explicitly write down columns (variables) that have been brought into the current solution with nonzero values. To do the pivot step at each iteration requires checking whether a new variable (column) should be brought in. DantzigWolfe decomposition explicitly combines this idea with Lagrangian relaxation. Benders’ decomposition brings in rows as needed, instead of columns.
Relation to dual Ellipsoid method
Recall that the Ellipsoid method uses a “separation oracle” framework: to try find a feasible point in a convex body {\cal B}, the Ellipsoid method supplies a candidate point \tilde\alpha to the separation oracle for {\cal B}. The oracle must either determine that \tilde\alpha is in {\cal B} (and we are done), or must return a linear constraint that is valid for {\cal B} but not for \tilde\alpha (a proof that \tilde\alpha is not in {\cal B}), in which case the algorithm continues. Crucially for many applications, the convex body {\cal B} may be defined (implicitly) as the intersection of exponentially many constraints. As long as the oracle can find a constraint violated by a given \tilde\alpha (if there is one) in polynomial time, the Ellipsoid method will finish in polynomial time.
Consider applying the separationoracle framework to the dual of the problem \eqref{primal} above. One dual formulation (after appropriate scaling) is
(Note that the last inequality is strict.) This dual is feasible iff the primal is infeasible.
Take the polytope {\cal B} to be the set of \alpha ’s meeting the constraints in \eqref{dual}, then apply Ellipsoid. For this polytope, Ellipsoid will give the separation oracle a candidate \alpha . To determine whether \alpha is feasible for {\cal B}, the separation oracle can check directly that \alpha _1=1. Determining whether the remaining constraints are feasible is equivalent to solving the surrogate problem \eqref{surrogate2}.
Generally, whenever one can apply Ellipsoid to solve a given packing or covering problem using a particular separation oracle, then it is also possible to solve the problem by applying Lagrangian relaxation to the dual: the surrogate problem that arises will be the one solved by the separation oracle. The converse is also true.
Machine learning
Some types of Lagrangianrelaxation algorithms are closely related to multiplicativeweights update algorithms from machine learning, which play a role in boosting and following expert advice with low regret. Boosting reduces the problem of strongly learning a concept to a sequence of problems of weakly learning the concept against some distribution \alpha . Technically, this reduction corresponds to the reduction done by Lagrangianrelaxation algorithms, reducing a hard problem to a sequence of surrogate problem. See [3] for more details.
Lagrangianrelaxation algorithms can also be interpreted as fictitiousplay algorithms for twoplayer zerosum games with nonnegative payoffs.
Differential equations and Lyapunov functions
Lagrangianrelaxation algorithms may be thought of as gradientdescent algorithms, at least in that each step is chosen to augment the current solution in a direction parallel to some coordinate axis, and along which the potentialfunction change is suitably bounded.
Making the step size \varepsilon infinitesmal gives a continuoustime algorithm. The resulting system avoids the discrete step, so is easier to study than a discrete algorithm: it can be modeled as a system of differential equations, where the potential function of the algorithm becomes a Lyapunov function for the system. Such systems are deeply studied in control theory [2], and are used to analyze congestioncontrol mechanisms [11, 7, 5].
Oblivious randomized rounding
Many of the notes here explore the following connection between greedy/Lagrangianrelaxation algorithms and randomized rounding: Lagrangianrelaxation algorithms can be understood as randomizedrounding algorithms, and can be systematically derived using the method of conditional expectations. In this view, the algorithms and their underlying potential functions are derived using standard probabilistic tools. The potential functions are “given meaning” as conditional expectations or pessimistic estimators.
Related
Lagrangianrelaxation algorithms have a long history. They were analyzed as early as the 1950s (e.g., by von Neumann [10]). Their main historical use has been to reduce a hard problem (such as multicommodity flow) to a sequence of simpler problems (such as shortest paths). For a brief summary of their history within computer science, see [8]. See also the technical survey [1]. For an operationsresearch perspective, see [13].
On Wikipedia

Column generation, DantzigWolfe decomposition, Benders’ decomposition

the Winnow algorithm and AdaBoost (multiplicativeweights algorithms in machine learning)

Lyapunov function, Congestion control, control theory (Caltech)