The goal of this project was to implement one of several state-of-the art
region based image segmentation algorithm published since 1995. For this
report the algorithm published by Tapas Kanungo, Byron Dom, Wayne Niblack,
David Steele and Jacob Sheinvald, "MDL-Based Multi-Band Image Segmentation
Using a Fast Region Merging Scheme", IBM Research Report, RJ 9960
(87919) May 17, 1995 was implemented and tested on a set of synthetic and
real images.
The algorithm implemented for this project is utilizing the minimum description
length (MDL) principle. The image segmentation is driven by an objective
function that seeks to minimize the amount of information necessary to describe
the segmented image. The segmented regions are described by their boundaries,
polynomial surfaces and residuals. The degree of the polynomials are increased
during the segmentation. The segmentation starts with single pixel regions
and uses a region merging scheme that minimizes the objective function following
the greedy principle.
The MDL criterion was first used for gray level images with polynomial approximations
of 2nd order degree [2]. In [3], a continuation scheme was used to minimize
the description length of the segmentation, resulting in a potentially better
segmentation because local minima are avoided. However the continuation
scheme is computationally much more expensive. The MDL principle was also
used to obtain segmentation by encoding the topology of the regions [4]
and for model based segmentation [5]. Many more researchers applied the
MDL and other information related principles to the problem of image segmentation
and other applications. A more complete summary of related work and image
segmentation in general can be found in [1] and [6].
The algorithm tries to achieve a segmentation of the image by finding the
global optimum of an objective function that measures the amount of information
necessary to encode the image together with its segmentation. The segmentation
consists of the boundaries of the regions, the coefficients of the polynomials
that describe the contents of each region and the residuals of all the pixels
in each region, i.e., the difference in gray value between each pixel and
the polynomial approximation.
The algorithm assumes that the image consists of regions and that the pixel
values in each region arose from stochastic processes that lead to polynomial
intensity distributions plus white spatially uncorrelated Gaussian noise.
Furthermore, the algorithm assumes that the segmentation of the image can
be described with a minimal amount of information, i.e, that the segmentation
corresponds to the minimum of the objective function.
A segmentation is described by the boundaries of the regions. Assuming
the region boundaries are known, the parameters for the polynomials are
obtained through multivariate regression. Once the boundaries and the polynomial
approximations are known for each region, only the residuals of the pixel
values are necessary to reproduce the image exactly. Therefore, the objective
function measures the amount of information necessary to encode the boundaries
(or more specifically, their lengths), the parameters of the polynomial
approximations and the residuals.
At each stage and for a given polynomial degree, the algorithm seeks to
merge two regions such that the description length is maximally reduced
until no such regions can be found in which case the degree of the polynomials
used for the approximations is increased. The segmentation is stopped at
a given maximum polynomial degree.
One feature of the algorithm is that the merging of two regions is done
incrementally, i.e., the algorithm can use available information about two
regions to obtain the corresponding information for the combined regions.
In practice, the algorithm keeps pairs of neighboring regions the decrease (or increase) in description length that would result from a merging of the regions. At each step it selects the pair that reduces the DL maximally, merges the regions and recalculates all pairs that are affected by the merger.
For a successful implementation additional points had to be considered:
The algorithm was applied to a large number of different images. We considered
the following classes:
(i) single band vs. multi-band images
(ii) synthetic vs. natural images
(iii) noise vs. no noise (for synthetic images)
Single band synthetic noise free image
_
Figure 1: A synthetic single band image without noise and the result
of the segmentation algorithm (right). The image shown here is enlarged
by a factor of two.
In Fig. 1 the application of the algorithm in case of a noise free synthetic single band image can be seen. The algorithm produced the correct segmentation. A small deviation from the perfect result can be seen in the bottom right corner. However, the result is very satisfactory.
Single band synthetic noisy image
_
Figure 2: A synthetic single band image with noise (sigma=40 pixel)
and the result of the segmentation algorithm (right). The image shown
here is enlarged by a factor of two.
In Fig. 2 Gaussian noise with a sigma of 40 pixel has been added. The algorithm is able to handle the noise; it still produces a perfect segmentation.
The cost to encode noise vs. encoding the boundaries
_
Figure 3: Segmentation of a synthetic wheel.
When the algorithm is applied to the synthetic image of a 'wheel' it does not manage to capture the 'spokes' (Fig. 3). This is a surprising result. It appears as if the intensity difference of the spokes compared to the background is too small. The algorithm finds it better to encode the difference as noise rather than introducing additional boundaries. Due to its length, the boundary of the spokes is especially expensive to encode.
_
Figure 4: Same as Fig. 3 but with the intensity of the 'spokes'
increased.
Increasing the intensity of the spokes in Fig. 4 indeed leads to a different result but the algorithm still tries to save boundary length by merging the 'spokes' with the 'rim' of the wheel.
_
Figure 5: Fig. 3 with noise of sigma=40 added and the white corner
removed.
Adding noise leads to yet another result. The result is similar to the one for Fig. 3 but one of holes in the wheel was segmented. This indicates that the difference between encoding the 'spokes' and not encoding them is small.
The above series shows that the algorithm does not always produce results that are 'natural'. The decision whether or not to segment a region depends on the length of its boundary.
Figure 6: A single band image similar to the one considered in
[1] (without noise).
In Fig. 6 an image of a circle with a gray value gradient is shown. This image is similar to the one displayed by the authors of Ref. [1]. This image tests the influence of the degree of the polynomials that approximate the contents of the regions.
_
_
Figure 7: Segmentation with polynomials of degree 0, 1 and 2 (left
to right). The outline of the ball is segmented perfectly the region boundary
only blends with the image background.
The result of the segmentation algorithm applied to Fig. 6 reveals that the algorithm produces a different result than the implementation in [1]. Even with a degree of 0 the algorithm nearly encodes the whole circle in one region. This behavior could not be traced back to an error in the algorithm. The implementation for this project appears to weigh the cost of encoding region boundaries higher than the cost of encoding the 'noise' (i.e., the residuals).
Single band natural scene
Figure 8: A single band image of a dolphin.
_
_
Figure 9: Segmentation of Fig. 8 with polynomials of degree 0,
1 and 2 (left to right).
Applying the algorithm to a natural scene, the image of a dolphin, Fig.
8, yields the result in Fig. 9 for varying degrees of the polynomials. Except
for the dark region of the left fin, the algorithm manages to produce
a result that does not seem unlikely. Here as with many other images of
natural scenes, the question of what exactly the ground truth is, arises.
The MDL objective function produces good results in many situations but
the concept of MDL is not necessarily always compatible with the 'objective
function' of a human.
Figure 10: A common natural image.
_
_
Figure 11: Segmentation of Fig. 10 with polynomials of degree 0,
1 and 2 (left to right).
Fig. 10 and 11 show another segmentation of natural scene. The result of the segmentation is not what one would expect. The algorithm seems to tend to combine neighboring regions with low contrast but also obvious regions such as the arch to the right in the background have been missed. The most likely cause of the relatively bad segmentation seems to be the low overall contrast in the image. Again, the algorithm seems to merge regions more happily than it should.
_
Figure 12: A part of the previous image with equalized histogram
to increase the contrast.
Fig. 12 shows parts of Fig. 10 enlarged with the contrast increased. The segmentation of this image is considerably better indicating that the low contrast in Fig. 10 was indeed the reason for the bad result.
Multi-band synthetic images
_
Figure 13: A synthetic three band image without noise (RGB) and
the result of the segmentation (right).
Fig. 13 shows the segmentation of a synthetic noise free three band color image. The result is almost perfect.
_
Figure 14: A synthetic three band image with noise (RGB) and the
result of the segmentation (right).
Fig. 14 shows the same image as Fig. 13 with noise added. The algorithm fails to merge some small regions. Noise will most likely always have an affect like this on a segmentation.
Multi-band natural scenes
Figure 15: A four band color image (CMYK) of a natural scene.
_
Figure 16: Segmentation of Fig. 13 with polynomial degrees of
0 and 2 (left to right), segmentation of degree 1 is identical to the
one of degree 2.
Fig. 15 and 16 show the segmentation of a four band image of a natural scene. Again the question of ground truth arises but the balloons are not segmented and blend with the crowd. Some colors of the balloons where similar to the sky. All the flying balloons where segmented but the top-most was divided into many small regions. It is worth noting that the increase in degree of the polynomial 'ate' one of the balloons.
This shows how the algorithm is, unlike humans, unaware of texture information. Although the blue color of the balloons is somewhat similar to the sky, a human can clearly distinguish the blue areas of the balloons from the sky because of the variation in texture (ignoring the obvious contextual information). The algorithm as it was applied here is not adapted to handle texture.
It is desirable to test the algorithm on an image where the 'solution' is more obvious.
Figure 16: Parts of the image from Fig. 13.
_
_
Figure 17: Segmentation of Fig. 16 with varying polynomial degrees
(as before).
Fig. 16 and 17 show the application of the algorithm to a natural scene where the ground truth is more obvious than in the previous example. The resulting segmentation can be considered to be very good. The balloons are clearly segmented and almost surprisingly, the blue color of the larger balloon did not merge with the sky.
It is worth noting that the computational time of the algorithm is very high. The segmentation of Fig. 16 (141 pixel x 182 pixel x 4 bands) took 146 minutes on a 400Mhz Pentium PC.
Increasing the weight of the residuals
The algorithm did not show the expected behavior when the degree of the approximating polynomials was increased. As stated above our implementation of the algorithm seemed to have a tendency to merge and encode the residuals rather than encoding additional region boundaries. An ad hoc solution to this problem was to multiply the term that encodes the residuals with a small factor greater than one.
Applying the modified algorithm to an image containing a gray value gradient gives the following result:
_
_
Figure 18: Segmentation of a gray value gradient giving more weight
to the cost of encoding the residuals. The right image is a segmentation
with polynomials of degree at most 2 which is virtually identical to order
0 and 1.
Fig. 18 shows that this changed the behavior of the segmentation but that
the algorithm is now reluctant to merge regions. Even with 2nd order polynomials
the different gray levels are still segmented separately.