# Dense Discrete Optimization

### Introduction

This page contains the instructions for the practical. A PDF version of these instructions is also available here. The templates for the code to fill in is available here.

The main file orchestrating everything is part_1.m. It calls other functions that you will be guided through implementing. You need to run the code from the directory where those files are.

We provide you with an efficient implementation of the matrix-vector product for the specific pairwise potentials studied in this practical in the permutohedral subfolder. There is normally no need for you to change anything in this folder but if you encounter problems with loading the unary terms or with performing this product, you might have to recompile the Mex files for your own architecture. Some instructions to do so are given in the permutohedral/README file.

In this practical, we consider the classic problem of Semantic Segmentation. In this problem, given an image and a set of semantic labels (such as car, person, sheep...), the goal is to associate a single class to each pixel of the image.

This problem has been widely studied due to its importance for many industrial applications in robotics and heavy use in medical imaging. More recently, this problem has seen a new interest for self driving cars. Indeed, it is required to find, given an image, what it contains to be able to build a self driving system.

The semantic segmentation task is very challenging and can be solved using several techniques. A classic framework to solve this problem is based on a 2-step process:

• Get a coarse per-pixel estimate of the labelling.
• Use a graphical model to enforce local consistency of the labelling.

The first part used to be done using hand-crafted feature descriptors followed by simple classifiers. More recently, Convolutional Neural Networks have been used to obtain better quality labellings. The second step is done using a model such as a Conditional Random Fields (CRF), which is what is going to be studied in this practical.

### CRF model

CRFs are used as graphical models to represent a given problem. We define an energy on the CRF that will represent the quality of the current labelling. Optimizing the labelling to minimize this energy will then solve the corresponding problem.

When working with segmentation of images, we will consider that every pixel of the input image is a node of the CRF. These are the observed nodes, in grey in the image below. Associated to each of this pixel, is a label, in white on the graph below. These are not observed and will have to be inferred. Because we want to enforce continuous regions of the image to have the same label, we will make each label node dependent on its neighbours.

For this practical, we define the energy of the CRF as:

$E(\textbf{x}) = \sum_a \phi_a(x_a) + \sum_a \sum_b \psi_{a,b}(x_a,x_b)$

$\psi_{a,b}(l_1, l_2) = \gamma \cdot \mu(l_1, l_2) \cdot \omega_{a,b}$

As a reminder:

• $$x_a \in \{0, .. , L\}$$ is the label taken by pixel $$a$$, $$\textbf{x}$$ is the vector containing all $$x_a$$ for each pixel

• $$\phi_a(x_a)$$ is called the unary potential for the pixel $$a$$ taking the label $$x_a$$

• $$\psi_{a,b}(x_a,x_b)$$ is called the pairwise potential for the pixels $$a$$ and $$b$$ taking the labels $$x_a$$ and $$x_b$$ respectively.

• $$\gamma$$ is a weighting factor to balance the importance assigned between the smoothness terms of the pairwise potentials and the predictive terms of the unary.

• $$\mu(l_1, l_2)$$ represents the dissimilarity measure between labels. Here, we consider a Potts Model. The value is 0 if $$l_1 = l_2$$ or 1 if $$l_1 \neq l_2$$.

• $$\omega_{a,b}$$ is the weight assigned to this particular pairwise term. If two pixels are really similar, we want to assign more weight to the fact that they should have the same label. On the other hand, pixels that are really different shouldn't be penalized too much for taking different labels, even if they are in the same neighbourhood.

In this practical, we will consider bilateral gaussian edge weights:

$\omega_{a,b} = \exp \left( -\frac{\| p_a - p_b \|^2}{\sigma_{\text{spatial}}} - \frac{\| rgb_a - rgb_b \|^2}{\sigma_{\text{img}}} \right)$,

where $$\mathbf{p}_a$$ is the vector containing the position of the pixel a in the image ( (x,y) coordinates) and $$rgb_a$$ is the vector containing the RGB values of the pixel $$a$$. Pixels having similar values will have a higher similarity that different pixels.

In this practical, we are going to consider the problem where every pixel is connected to every other pixel. This allows to represent long range interactions between pixels, as opposed to traditional sparse neighbourhood where only the closest pixels are connected.

### Formulation as an IP

Our method to deal with these large neighbourhoods is to make use of continuous relaxations. We are first going to formulate it as the following Integer program (IP):

$\begin{split} \text{E}_{IP}^* = \min_{\mathbf{y}} \ \text{E}_{IP} (\mathbf{y}) = \min_{\mathbf{y}} \quad& \sum_{a=1}^{N} \sum_{i \in \mathcal{L}} \phi_{a}(i) y_{a}(i) + \sum_{a=1}^{N} \sum_{\substack{b=1\\b \neq a}}^{N} \sum_{i,j \in \mathcal{L}} \psi_{a,b}(i,j) y_{a}(i)y_{b}(j),\\ \text{s.t. }& \sum_{i \in \mathcal{L}} y_{a}(i) = 1 \quad \forall a \in [1, N],\qquad\qquad\qquad\qquad\qquad\qquad(1)\\ & y_{a}(i) \in \{0,1\} \quad \forall a \in [1, N] \quad \forall i \in \mathcal{L}.\\ \end{split}$

Where $$y_a(i)$$ are the variables that we optimize. When $$y_a(i)=1$$, that means that the pixel $$a$$ takes the label $$i$$.

Our approach to solve the problem is going to rely on solving a relaxation of the problem in order to obtain a fractional solution, that is, a solution respecting all the constraints except the Integer constraints. Once this solution is obtained, we will round it to obtain a proper integer solution.

Task 1: Implement a heuristic to initialize the state to a valid point in the feasible region. Fill in getInitialState.m. A possible choice of heuristics can be an argmin over the unaries only, disregarding the pairwise terms but a lot of options are possible.

Task 2: Even after solving the relaxation, we need to get back to an integer solution. Implement a rounding scheme to go from a fractional solution to an integer one.

The simplest rounding possible consists in performing an argmax for each possible pixel and picking the resulting label. More advanced rounding methods exists that provide better guarantees for certain models but we won't discuss them here.

You can now run the code of part_1.m until the end of the Data Loading section and should be able to see the initial segmentation, based only on your heuristic.

### Before everything: Compute what you want to optimize

In order to check the validity of our optimization, we need to be able to evaluate the value of our objective function. However, note that a naive implementation of the evaluation function will prove very slow. Indeed, the straightforward computation of the pairwise terms contains a double loop over the pixels (as in our model, we have assumed a dense connection model) and a double loop over the labels. The complexity of the computation performed this way is therefore of order $$\mathcal{O}(N^2L^2)$$. As $$N$$ is large (68160 for an image of size 213 by 320, which is still relatively small as far as images are concerned), such a complexity is not acceptable.

Luckily, our matrix $$\psi_{a,b}(i,j)$$ has a lot of structure that we can leverage to make the computation more efficient. You have been provided with a matlab function computeMvProduct that leverages a data structure called the permutohedral lattice that allows to perform matrix multiplication involving the $$\psi(a,b)$$ matrix in linear time! More information about the permutohedral lattice can be found here

Task 3: Using this efficient product, implement the evalEnergyQP.m function to evaluate the value of $$\text{E}_{IP}(\mathbf{y})$$.

## QP relaxation

### Part 1: Problem formulation

#### From Integer program to Quadratic Program

Most problems that contains Integer constraints, and are therefore in the domain of Discrete Optimization, are NP-Hard. We will start by relaxing the problem to not contain any Integer constraints. If we just drop those constraints from Problem (1), we obtain the following Quadratic Program (QP):

$\begin{split} \text{E}_{QP}^* = \min_{\mathbf{y}} \ \text{E}_{QP} (\mathbf{y}) = \min_{\mathbf{y}} \quad& \sum_{a=1}^{N} \sum_{i \in \mathcal{L}} \phi_{a}(i) y_{a}(i) + \sum_{a=1}^{N} \sum_{\substack{b=1\\b \neq a}}^{N} \sum_{i,j \in \mathcal{L}} \psi_{a,b}(i,j) y_{a}(i)y_{b}(j),\\ \text{s.t. }& \sum_{i \in \mathcal{L}} y_{a}(i) = 1 \quad \forall a \in [1, N],\qquad\qquad\qquad\qquad\qquad\qquad(2)\\ & y_{a}(i) \geq 0 \quad \forall a \in [1, N] \quad \forall i \in \mathcal{L},\\ & y_{a}(i) \leq 1 \quad \forall a \in [1, N] \quad \forall i \in \mathcal{L}.\\ \end{split}$

This is a quadratic program. If you order all the $$y_{.}(.)$$ into a vector and put the terms of the unaries and the pairwise into the appropriate forms, you can show that the objective function is of the form:

$\mathbf{\phi}^{T} \mathbf{y} + \mathbf{y}^{T} \mathbf{\psi} \mathbf{y},$

• $$\mathbf{\phi}$$ is the vector containing the unaries.

• $$\mathbf{\psi}$$ is a matrix whose elements are the $$\psi_{a,b}{i,j}$$ as described above, with the exception that they are 0 if $$a = b$$, so as to compute the correct sum.

which is clearly a quadratic function with respect to the variable $$\mathbf{y}$$. Similarly, it is easy to observe that the constraints are all linears.

(Optional) Question 1: Prove that $$\text{E}_{IP}^* \geq \text{E}_{QP}^*$$.

(Optional) Question 2: Prove that $$\text{E}_{QP}^* \geq \text{E}_{IP}^*$$.

Hint: you can write $$\text{E}_{QP}(\mathbf{y})$$ as an expectation of multiple solutions of the IP if you assign label $$i$$ to pixel $$a$$ with probability $$y_{a}(i)$$.

(Optional) Question 3: Prove that the QP relaxation that we just introduced is tight (its optimal solution is the same as the IP optimal solution).

It can be in fact shown that from any optimal solution of the problem (2), an optimal solution to the problem (1) can be obtained.

The relaxed problem is actually a continuous optimization problem called a Quadratic Program. In general, Quadratic Program are NP-hard to solve. Moreover, the problem that we presented above has no special structure (such as convexity) that would allow use to solve it efficiently in the general case. We therefore relax this formulation even further.

To do so, we will modify the problem by introducing an additional vector $$\mathbf{d}$$ such that each of its elements is given by:

$d_a(i) = \sum_{\substack{b,j\\b \neq a}} |\psi_{a,b}(i,j)|$

Given their definition, we know that all the $$\psi_{a,b}(i,j)$$ are positive so the computation of $$\mathbf{d}$$ can be done by multiplying the matrix $$\psi$$ by a vector containing only 1s.

$\mathbf{d} = \psi \mathbf{1}$

The problem that we are going to solve is therefore the following: $\begin{split} \min_{\mathbf{y}} \quad& \sum_{a=1}^{N} \sum_{i \in \mathcal{L}} [\phi_{a}(i) - d_{a}(i)] y_{a}(i) + \sum_{a=1}^{N} \sum_{\substack{b=1\\b \neq a}}^{N} \sum_{i,j \in \mathcal{L}} \psi_{a,b}(i,j) y_{a}(i)y_{b}(j) + \sum_{a=1}^{N} \sum_{i\in \mathcal{L}} d_{a}(i) y_{a}(i)y_{a}(i),\\ \text{s.t. }& \sum_{i \in \mathcal{L}} y_{a}(i) = 1 \quad \forall a \in [1, N],\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad(3)\\ & y_{a}(i) \geq 0 \quad \forall a \in [1, N] \quad \forall i \in \mathcal{L},\\ & y_{a}(i) \leq 1 \quad \forall a \in [1, N] \quad \forall i \in \mathcal{L}.\\ \end{split}$

Seen from a matrix point of view, this is equivalent to having the objective function be:

$\mathbf{\phi^{\prime}}^{T} \mathbf{y} + \mathbf{y}^{T} \mathbf{\psi^{\prime}} \mathbf{y},$

$\mathbf{\phi^{\prime}} = \mathbf{\phi}-\mathbf{d} \quad\quad\quad \text{ and } \quad\quad\quad \mathbf{\psi^{\prime}}=\mathbf{\psi}+\mathbf{diag(d)},$ where $$\mathbf{diag(d)}$$ is the square matrix containing the elements of the vector $$\mathbf{d}$$ on the diagonal.

Question 4: Compare the value of integer solutions of the problem using the energy terms $$\mathbf{\phi}$$ and $$\mathbf{\psi}$$ (Problem 2.)with the values of the problem using the energy terms $$\mathbf{\phi^{\prime}}$$ and $$\mathbf{\psi^{\prime}}$$ (Problem 3.).

Question 5: Prove that Problem 3. is a convex problem.

Hint: A symmetric diagonally dominant matrix is semi-definite positive. A diagonally dominant matrix is a matrix where the diagonal terms are bigger than the sum of the extradiagonal terms: $$\forall i, |a_{i,i}| \geq \sum_{j \neq i} |a_{i,j}|$$.

Task 4: Fill in the evalEnergyConvexQP function that, given a fractional solution, will evaluate the energy of Problem 3. (Reuse some of the code form evalEnergyQP).

You can now run the code until the end of the Compute the energy of the initialisation section.

In order to solve this quadratic problem, while still respecting the constraints, we are going to use the method of conditional gradients, also known as Frank-Wolfe algorithm. The goal of this section will be to implement this algorithm.

The general principle of the conditional gradient method is the following.

To solve an optimisation problem of the form: $\begin{split} \min_{\mathbf{y}} \quad& f(\mathbf{y})\\ \text{s.t. }& \mathbf{y} \in \mathcal{M}, \end{split}$ where we try to minimize a function $$f$$ over a convex set $$\mathcal{M}$$


- get y_0 in M
- While not converged:
- Compute the conditional gradient      :  s = argmin(s'*g), for s in M
- Pick a step size alpha                :  alpha = argmin(f(alpha y_t + (1-alpha) s)), alpha in [0, 1]
- Update the current position           :  y_{t+1} = alpha y_t + (1 - alpha) s
- end

Question 6: For other gradient descent based methods used for continuous optimization, a projection step is usually necessary. Indeed, when the optimisation is constrained on a given domain, you need to be careful that all your steps are going to remain in the feasible region. For this reason, after each gradient step, a projection onto the feasible region needs to be done. Explain why this step is not necessary in the case of the conditional gradients method when the domain $$\mathcal{M}$$ is convex.

### Derivation and implementation

#### Step 1: Compute the gradient

Question 7: Derive the formula of the gradient of the objective function of Problem 3.

Task 5: Implement it in computeConvexGradient.m.

#### Step 2: Compute the conditional gradient

Once the gradient has been obtained, what needs to be computed is the conditional gradient. The conditional gradient is the result of a Linear program, where the cost vector is the gradient of the objective function and the constraints are the constraints of the original problem.

Question 8: In our specific problem, the constraints have a lot of structure. While it would be straightforward to solve this LP using the built-in linprog matlab function, we can probably make it faster by implementing this optimisation ourselves. Provide an algorithm that, when given the gradient of the function, will compute the associated conditional gradient.

Task 6: Implement it in the function computeConditionalGradients.m.

#### Step 3: Compute the step size

The performance of gradient descent type algorithm is heavily dependent on the choice of step-size. A possibility is to try different values and annealing schedules of the step-size until finding a successful one.

The alternative is to use the structure of the problem to our advantage and adaptively choose the step size. In the case of our convex QP, it is possible to find the optimal step-size which is going to bring the largest decrease in the objective function.

Question 9: Derive analytically the value of the optimal step-size for our problem.

Hint: write the value of f(alpha y_t + (1-alpha) s) as a function of alpha.

Task 7: Implement the computation of the optimal step-size in optimalStepSize.m.

#### Step 4: Bringing it all together and solving the QP

All the necessary steps have been implemented and just need to be brought together to implement the Frank-Wolfe algorithm.

Task 8: runOneStepConvexQP.m contains the implementation of the Frank-Wolfe algorithm to solve the convex QP relaxation. Look at it and verify that this implement the described algorithm.

Task 9: Run the part_1.m script to perform the semantic segmentation. Observe the results obtained.

### Non convex QP

The relaxation that we implemented has the advantage of being convex, which means that there is a unique optimal solution and that it can be found efficiently. This notably allows us to have optimal step sizes at each step of our algorithm. On the other hand, we pay the cost of not directly minimizing our target objective function.

We can also decide to not perform the convex relaxation and apply Franke-Wolfe to our non-convex QP relaxation.

Task 10: Implement the computation of the gradient for the non-convex QP relaxation in the files computeGradient.m. Note that in runOneStepQP, we can't compute an optimal step size anymore so we need to pick an arbitrary value. Replace the value of optim in the beginning of part_1.m from cvxQP to QP.

Task 11: Perform some segmentation using the non-convex QP relaxation. Compare with the results obtained by the convex relaxation.