M. Pawan Kumar















M. Pawan Kumar and D. Koller



We consider the problem of simultaneously dividing an image into coherent regions and assigning labels to regions using a global energy function. We form a large dictionary of putative regions using bottom-up over-segmentation techniques and formulate the problem of selecting the regions and their labels as an integer program. We provide an efficient dual decomposition method to solve an accurate linear programming relaxation of the integer program.

Why are Regions Useful?

While pixel-based models provide very accurate solutions for low-level vision problems such as interactive binary segmentation or dense stereo correspondence, they have had limited success in high-level vision. The underlying reason is the noise present in features extracted from regularly shaped patches around the pixels. To overcome this problem, several researchers have suggested the use of regions (a set of coherent, connected pixels) and shown promising results on various applications, such as those shown below.

Scene Layout [1]
3D Reconstruction [2]
Scene Segmentation [3]

Region Selection using Energy Minimization

Region-based models are typically employed using a two-step strategy: (i) obtain the regions using a standard over-segmentation approach such as mean-shift; and (ii) assign labels to the region (depth values for 3D reconstruction, semantic classes for segmentation). However, regions obtained in this manner have no information about the high-level task.

We propose to simultaneously obtain the regions and their labels by minimizing a global energy function. This problem is extremely challenging since the number of pixel-to-region assignments is more than exponential in the number of pixels. To overcome this problem, we make use of the fact that the union and intersection of the segments obtained by an over-segmentation approach can provide us with a (large) dictionary of putative regions. We can then cast our problem as a combinatorial optimization task of selecting regions and their labels from the dictionary.

The regions and their labels are selected according to the following criteria: (i) the selected regions must cover the entire image; (ii) no two selected regions should overlap with each other; and (iii) the global energy function should be minimized. This can be cast as an integer program, which allows us to obtain an approximate solution using a linear programming relaxation that replaces the binary variables of the integer program with real variables in the interval [0,1].

Efficient Dual Decomposition

We solve the linear programming relaxation using dual decomposition [4] by diving the problem into the following subproblems.

Tree Subproblem. We consider minimizing the energy of a sub-dictionary such that the regions of the sub-dictionary form a tree. Note that two regions are connected by an edge if an only if they share a common boundary pixel and do not overlap with each other. This problem can be solved efficiently using dynamic programming (or belief propagation).

Covering Subproblem. The intersection of all the regions in a dictionary defines a set of super-pixels. In order to satisfy the constraints of region selection, we specify one subproblem for each super-pixel which enforces it to be covered by exactly one selected region. This ensures that no two selected regions overlap and the entire image is explained.

Clique Subproblem. The linear programming relaxation specified by the above two problems is very weak and often has fractional optimal solutions. In fact, the relaxation is not tight even with the inclusion of the so-called cycle inequalities, as can be shown using a counter example. In order to tighten the relaxation, we add subproblems that minimize the energy of a sub-dictionary that forms a clique. We solve these subproblems using efficient enumeration.


We demonstrate the efficacy of our method using scene segmentation as an example application. We compare the results obtained using our regions with other types of regions that are commonly used in computer vision, namely (i) Intersection - the intersection of a set of over-segmentations (in our case, 3 different over-segmentation obtained by changing the parameters of mean-shift); (ii) Segmentation - the best over-segmentation in terms of the energy funciton; and (iii) Gould et al. - the region selection method of [3]. We also show a comparison with the pixel-based model described in [3].

As can be seen from the table on the right, our approach provides solutions with the lowest energy. The low value of the energy also results in higher accuracy (measured as the average number of pixels that get assigned to the correct semantic class). The improvements obtained are statistically significant under the standard pairwise t-test.

Note that in all our experiments (except the pixel-based model) we use the same set of parameters that were learned using the method of [3]. However, learning the parameters using an inference-based technique (where the inference is performed by solving the linear programming relaxation as described above) may result in further improvements in accuracy. We plan to explore this direction in our future work.


The dataset used in this work can be downloaded from the DAGS scene understanding webpage. The code for region selection from a dictionary is available for download.


M. Pawan Kumar and D. Koller
Efficiently Selecting Regions for Scene Understanding
In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2010


[1] D. Hoiem, A. Efros and M. Herbert. Geometric context from a single image. In ICCV, 2005.
[2] A. Saxena, M. Sun and A. Ng. Make3D: Learning 3D scene structure from a single still image. In PAMI, 2008.
[3] S. Gould, R. Fulton and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, 2009.
[4] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.