1 Introduction

This paper describes a computer vision system for the automatic inventory of a commercial cooler. The goal is to count, for each brand, the number of beverage products (bottles and cans) contained in the cooler at any given moment in order to efficiently schedule a refill if necessary. This is done through the continuous analysis of the images of the cooler’s shelves taken by (low-cost) wide-angle cameras.

Although at first glance the task looks trivial, as the objects to be recognized are clearly distinguishable, rigid and in a well-known static environment, it is in fact a challenging one due to a combination of several factors. A first difficulty arises from the severe occlusion conditions under which the system has to work: in a typical scenario involving densely packed shelves, visibility decreases row by row, with the rear products almost completely hidden by the front ones (see Fig. 1 for some typical examples). The items are also typically very close to each other, which makes segmentation and detection more difficult. Recognition is further complicated by the lighting conditions: light is not uniform in the images, not only because of the shadows generated by the shelves and by the products themselves, but also because of the influence of external light. As a result, our images typically have poorly defined edges and distorted colors, making segmentation and brand classification harder. Moreover, the system has to be flexible enough to recognize new products added after software installation. These difficulties are exacerbated by the need to cut production costs and by the consequent use of low-quality cameras and limited computational resources. Indeed, the whole system has to run on an embedded low-performance computer, and this poses serious limitations on the kind of algorithms that can be used, as computationally intensive techniques are clearly not feasible.

Fig. 1.
figure 1

Typical images analyzed by our system.

Fig. 2.
figure 2

Flow-chart of the proposed system.

The proposed system uses a combination of simple techniques to address these limitations. It is implemented as a pipeline of simple modules, as shown in Fig. 2. The pipeline begins with an edge detector, which extracts the features used by the distance transform module to construct a distance image. The next step in the pipeline is chamfer matching [1], which detects the shape of beverage products by shifting their templates across the distance image. A matching measure is used to detect a candidate beverage shape, which is then checked by a false positive elimination module. Finally, the brand of the beverage products is recognized using simple color histogram matching: the color histogram of the pixels lying under a detected shape is compared with the color histograms built from images of reference products. Despite the simplicity of the techniques used, preliminary results show the effectiveness of the proposed system in terms of both detection accuracy and computational time.

2 The Pipeline

The proposed pipeline is based on simple techniques applied in a cascade to enhance recognition accuracy and to provide robustness. As previously mentioned, the pipeline begins with a learning-based edge detector [4] which extracts the most useful product edges, used to construct a distance image. This is used by the chamfer matching module [1] to detect candidate product shapes, which are then checked by the false positive elimination module. The last module is histogram matching, which performs brand recognition. The algorithm is optimized by using 3D modeling techniques for template generation and by a space management system which allows a faster image scan and avoids the need for non-maximum suppression. Further accuracy is achieved by splitting a beverage into its main characterizing parts, processing them independently and considering the results as a whole. Occlusion is dealt with by building an occlusion mask which keeps track of the image portions occupied by the detected beverages and masks the templates' occluded parts. Figure 2 shows the flow chart of the proposed pipeline.

2.1 Edge Detection

Edge detection is the preprocessing stage of the pipeline. It relies on the OpenCV 3.2.0 [7] implementation of the fast edge detector proposed by Dollár and Zitnick [4], which is inspired by the work of Kontschieder et al. [8]. The detector exploits the high interdependence of edges within a local image patch: edges exhibit well-known patterns that can be used to train a structured learning model. Dollár and Zitnick's algorithm segments an image into local patches used to train a structured random forest model, which provides a local edge mask applied to extract edges accurately and efficiently. Figure 3 shows edge detection results obtained with Dollár and Zitnick's algorithm.

Fig. 3.
figure 3

Example of the edge detection results: the original shelf image on the left and the edge image on the right.

2.2 Shape Detection

Template matching is the first stage of the proposed system, in which beverage candidates are evaluated and discarded if they do not satisfy the shape requirements. It relies on chamfer template matching [6] for shape detection, on 3D modeling for template generation, on a smart sliding window for space management, and on a simple yet essential mechanism for occlusion management.

Chamfer matching is a simple template matching algorithm which offers high performance and robust detection, as it is very flexible and more tolerant of low-quality edges than other algorithms of the same kind. First, a morphological transformation, known as the distance transform [5], is applied to the previously extracted edges. The result is a gray-scale image in which each pixel holds the distance from that pixel to the nearest edge. Then, a query template is slid over the distance image. At each position, a matching measure is computed by summing the pixel values of the distance transform image which lie under the edge pixels of the template. If the computed matching measure lies below a certain threshold, the target beverage shape is considered detected. The template threshold should be chosen to achieve the desired trade-off between false positives and false negatives.
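The steps above can be sketched on a toy edge map. This is a minimal pure-Python illustration, not the paper's C++ implementation: the brute-force distance transform, the 6×6 image, the square template and the threshold value are all hypothetical choices made for the example.

```python
# Hedged sketch of chamfer matching: distance transform of an edge image,
# then the average distance under a template's edge pixels at each offset.

def distance_transform(edges):
    """Brute-force distance transform: each cell gets the Euclidean
    distance to the nearest edge pixel (fine for tiny toy images)."""
    h, w = len(edges), len(edges[0])
    pts = [(r, c) for r in range(h) for c in range(w) if edges[r][c]]
    return [[min(((r - er) ** 2 + (c - ec) ** 2) ** 0.5 for er, ec in pts)
             for c in range(w)] for r in range(h)]

def chamfer_score(dist, template, top, left):
    """Average distance under the template's edge pixels; lower is better."""
    vals = [dist[top + r][left + c]
            for r in range(len(template))
            for c in range(len(template[0])) if template[r][c]]
    return sum(vals) / len(vals)

# Toy example: a 6x6 edge image containing a 3x3 square outline at (1, 1).
edges = [[0] * 6 for _ in range(6)]
square = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
for r in range(3):
    for c in range(3):
        edges[1 + r][1 + c] = square[r][c]

dist = distance_transform(edges)
THRESHOLD = 0.5  # hypothetical trade-off between false positives/negatives
best = min((chamfer_score(dist, square, t, l), t, l)
           for t in range(4) for l in range(4))
print(best[0] <= THRESHOLD, (best[1], best[2]))  # exact match at (1, 1)
```

The exhaustive scan over every offset is exactly the inefficiency that the 3D model and the smart scan, described next, are designed to avoid.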

Chamfer matching is very inefficient if all beverage templates of varying shape and size have to be tested at each location of the distance image. Thus, a 3D model of the shelf is introduced to speed up the matching process: it allows checking only one template per product at each location of the distance image, instead of a set of templates of varying shape and size for each product. To this aim, we exploit the available information about the objects, the cooler and the camera in order to render the shelf and build the templates for shape matching. In particular, each object is measured as follows: first the bottom diameter is measured; then, going up, for each change in the shape, the height and the corresponding diameter are collected. In this way we summarize the product contour as a collection of diameter discontinuities and their relative heights. This partition of a beverage into contour and horizontal parts reproduces well most bottles and cans, even those without a circular base, with little error. Furthermore, the camera's intrinsic parameters are collected, while the real position of the camera and its rotation angles are measured. For this purpose we introduce artificial reference points in the picture: a special sheet of paper with a printed grid is laid on the shelf, while the same grid is rendered in a 3D representation of that shelf using the cooler information. At the beginning, the virtual grid is in a random position but, using special keyboard buttons, a user can modify the camera position and rotation angles so as to match the virtual grid with the real grid as closely as possible. When the grids match, the camera position and orientation are obtained with good accuracy. This whole process has to be done only once, when the camera is installed. Figure 4 shows the calibration process.
Finally, after the calibration step, the template of each product is rendered at any desired point of the shelf (Fig. 5).
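The measurement scheme above can be illustrated with a small sketch: a product profile stored as (height, diameter) pairs, one per shape change, from which the silhouette half-width at any height is interpolated. The bottle dimensions and the function name are hypothetical, chosen only for this example.

```python
# Hedged sketch of the product profile representation: a sequence of
# (height_mm, diameter_mm) measurement points, bottom to top.

# A hypothetical 500 ml bottle: cylindrical body, shoulder taper, neck, cap.
profile = [
    (0.0,   65.0),  # bottom diameter
    (150.0, 65.0),  # body stays cylindrical up to the shoulder
    (200.0, 30.0),  # shoulder tapers down to the neck
    (230.0, 30.0),  # neck
    (235.0, 32.0),  # cap lip
    (250.0, 32.0),  # cap top
]

def radius_at(profile, h):
    """Half-width of the silhouette at height h, interpolating linearly
    between consecutive measurement points."""
    for (h0, d0), (h1, d1) in zip(profile, profile[1:]):
        if h0 <= h <= h1:
            t = 0.0 if h1 == h0 else (h - h0) / (h1 - h0)
            return (d0 + t * (d1 - d0)) / 2.0
    raise ValueError("height outside the measured profile")

print(radius_at(profile, 100.0))  # inside the cylindrical body
print(radius_at(profile, 175.0))  # halfway along the shoulder taper
```

Rendering such a profile through the calibrated camera model is what produces the per-position templates of Fig. 5.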

Fig. 4.
figure 4

Calibration procedure: the goal is to match the grid on the shelf. (1): Real grid in the shelf. (2): Starting virtual grid with predefined camera position and orientation. (3): Close match of the grids. (4): Good grid match; now the camera parameters are known.

Fig. 5.
figure 5

Examples of templates generated by the 3D modeling.

To further speed up the matching process, a smart sliding window for space management, named the smart scan module, is introduced. It relies on the 3D shelf model, which allows switching from virtual coordinates (pixels of the image) to physical ones (millimeters on the real shelf) (Fig. 6). The scan is then performed in physical shelf coordinates (x, z), so that the spatial information can be exploited to skip points where the template cannot fit for lack of space. In particular, the scan starts from the lower-right corner (\(x = maxLength\), \(z = 0\)) and proceeds column-wise: at each detection step we keep x fixed and increase z by \(step_z\) until the innermost part is reached; then we reset z to 0, shift left by \(step_x\) (\(x = x - step_x\)) and start increasing z again; this procedure continues until the upper-left corner is reached. Thus, the 3D model and the smart scan allow checking only one template per product at each permissible position (x, z), speeding up the template matching phase.
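The scan order just described can be written down as a short generator. This is an illustrative sketch, not the paper's code; the step sizes are hypothetical, while the 654 × 594 mm shelf size is taken from the experiments in Sect. 3.

```python
# Hedged sketch of the smart scan: physical shelf coordinates (x, z) in
# millimetres, scanned column-wise from the lower-right corner.

def smart_scan(max_length, max_depth, step_x, step_z):
    """Yield (x, z) positions starting at (max_length, 0), increasing z
    toward the rear, then shifting left by step_x, until the top-left
    corner is reached."""
    x = max_length
    while x >= 0:
        z = 0
        while z <= max_depth:
            yield (x, z)
            z += step_z
        x -= step_x

# Hypothetical steps on the 654 x 594 mm shelf used in the experiments.
positions = list(smart_scan(max_length=654, max_depth=594,
                            step_x=109, step_z=99))
print(positions[0], positions[-1])  # (654, 0) first, (0, 594) last
```

A real implementation would additionally skip positions where the rendered template cannot fit, which is the point of the space management.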

Fig. 6.
figure 6

Real shelf and camera coordinate systems.

To deal with the occlusion conditions, a binary image, called the occlusion mask (see Fig. 7), keeps track of the detections found at every step. The occlusion mask has the same size as the shelf image and can be thought of as a sort of shadow of the shelf: each time a detection is confirmed at an image point, the occlusion mask is updated by setting to zero all the pixels belonging to the filled template shape at that point. In this way the occlusion mask is a binary image in which black pixels denote the image space occupied by the products found so far, while white pixels denote the free space left. We then update the query template by masking it with the occlusion mask, so that only the visible template portion is used in the subsequent matching. If the remaining template portion falls below a certain threshold, the candidate is discarded as not reliable enough. This solution offers good performance while keeping the problem very simple, but it is not always accurate enough, as it is based on a strong assumption which sometimes does not hold: products are assumed to be detected in order from the most visible to the most occluded.
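The mask update and the visibility check can be sketched on a toy binary grid. The grid size, the filled 3×3 template and the 0.5 reliability threshold below are hypothetical values for illustration only.

```python
# Hedged sketch of the occlusion mask: 1 = free space, 0 = occupied by an
# already confirmed detection.

def update_mask(mask, shape, top, left):
    """Zero out the pixels of a confirmed detection (filled template)."""
    for r in range(len(shape)):
        for c in range(len(shape[0])):
            if shape[r][c]:
                mask[top + r][left + c] = 0

def visible_fraction(mask, shape, top, left):
    """Fraction of a candidate template's pixels still unoccluded."""
    total = vis = 0
    for r in range(len(shape)):
        for c in range(len(shape[0])):
            if shape[r][c]:
                total += 1
                vis += mask[top + r][left + c]
    return vis / total

mask = [[1] * 8 for _ in range(8)]
box = [[1] * 3 for _ in range(3)]         # 3x3 filled template
update_mask(mask, box, 0, 0)              # first confirmed detection

MIN_VISIBLE = 0.5                         # hypothetical reliability threshold
frac = visible_fraction(mask, box, 0, 2)  # candidate overlapping one column
print(frac, frac >= MIN_VISIBLE)          # 2/3 visible -> candidate kept
```

Only the visible pixels would then contribute to the chamfer score, which is how the masked template enters the matching stage.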

Fig. 7.
figure 7

Example of the mask image during an ongoing detection. The source image is on the left, the occlusion mask is in the middle and the objects found until that moment are on the right. (Color figure online)

Finally, to achieve better accuracy, a false positive elimination procedure is performed: each beverage part of a candidate detection is compared against the result achieved by chamfer matching applied to a reference background image. If the two scores are too close, the algorithm declares the match a false one (a part of the background wrongly detected as a real object).

2.3 Color Classification

Histogram matching is the second and last stage of the proposed pipeline in which the brand of a previously detected shape is recognized. In particular, the histogram matching module exploits all the elements defining a visual beverage, i.e. shape and color, to enhance the correctness of the shape detection and to recognize the brand of previously detected shapes.

This module relies on the same distinction between product parts made in the template matching module: a product is split into its main components (cap, bottle liquid and logo for bottles; the top part and the surface for cans), so that simple algorithms can be used while keeping the spatial color information (as an example, the cap should be blue while the liquid is green, and not the opposite). It is worth noting that within the same product part the color is often uniform, so there is no need to split the objects further.

The color analysis is based on simple color histograms [2, 3, 9] guided by the 3D model: only the image portion under the filled template is used to build the histogram. The color space is divided into n sub-ranges, called bins, each covering a specific color range. Three normalized color histograms, one per channel, are then computed. Finally, the histogram of each product part is compared against the histograms built from the product database in order to decide the fitness of the detection.
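Building one such normalized channel histogram can be sketched in a few lines. The pixel values, the number of bins n = 4 and the function name are hypothetical illustration choices, not the paper's actual configuration.

```python
# Hedged sketch of a normalized single-channel color histogram over the
# pixels lying under a filled template.

def channel_histogram(values, n_bins=4, max_val=256):
    """Normalized histogram of one color channel over the masked pixels."""
    hist = [0.0] * n_bins
    for v in values:
        hist[v * n_bins // max_val] += 1   # map value to its bin index
    total = sum(hist)
    return [h / total for h in hist]

# Hypothetical red-channel values of the pixels under a detected shape.
red = [10, 20, 200, 210, 220, 230, 60, 70]
h = channel_histogram(red)
print(h, abs(sum(h) - 1.0) < 1e-9)  # normalized: bins sum to 1
```

The same computation is repeated per channel; the three resulting histograms are then compared with the reference ones via measure (1) below.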

The product database contains reference photos of each product the algorithm should recognize. In particular, for each product, a series of photos is taken under controlled conditions: the middle shelf of the reference fridge is divided into 9 zones and, for each zone, four pictures are taken using 90\(^\circ \) rotations.

Histogram comparison is based on the following measure:

$$\begin{aligned} d(H(I),H(I')) = d_{mode}(H(I),H(I')) (1 - H(I) \cap H(I')) \end{aligned}$$
(1)

where H(I) and \(H(I')\) are a pair of normalized histograms, each containing n bins; \(d_{mode}\) is the distance between the indexes of the highest-frequency bins (the modes) of the two histograms, and \(H(I)\cap H(I')\) is the sum of the smallest corresponding bins of the two histograms, i.e. the histogram intersection.

Measure (1) is a weighted distance which is robust against color distortion thanks to the modes, while providing a finer histogram comparison thanks to the intersection.
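Measure (1) can be sketched directly from its definition. The toy histograms below are hypothetical; the function simply combines the mode-index distance with one minus the histogram intersection, as in the formula.

```python
# Hedged sketch of measure (1): mode distance weighted by one minus the
# histogram intersection, for normalized histograms of equal length.

def hist_distance(h1, h2):
    """d(H, H') = d_mode(H, H') * (1 - H intersect H')."""
    d_mode = abs(h1.index(max(h1)) - h2.index(max(h2)))    # mode bin distance
    intersection = sum(min(a, b) for a, b in zip(h1, h2))  # histogram overlap
    return d_mode * (1.0 - intersection)

a = [0.1, 0.6, 0.2, 0.1]
b = [0.1, 0.5, 0.3, 0.1]  # similar histogram, same mode bin -> distance 0
c = [0.6, 0.1, 0.2, 0.1]  # mode shifted by one bin -> nonzero distance
print(hist_distance(a, b), hist_distance(a, c))
```

Note how the mode term zeroes the distance whenever the dominant colors agree, which is what makes the measure tolerant of the color distortion discussed in the introduction.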

3 Experimental Results

We have performed a series of experiments to verify the performance and the accuracy that can be obtained by our system. All the modules of the pipeline have been implemented in GNU C++ and run on a dual-core CPU with 1.6 GHz per core and 1 GB of RAM.

The results here presented are divided into two sections:

  • the first section shows examples of products placed at random in the shelf;

  • the second section shows examples of real cooler cases, where a shelf is filled by columns and each column contains only bottles/cans of the same brand.

The experiments have been conducted on a \(654\times 594\) mm cooler shelf with 10 beverage brands. For each test we show: the original shelf image (on the left); the beverage edge image, where detected caps are highlighted in red (in the middle); and, finally, the 3D rendering of the products detected by the pipeline (on the right).

3.1 Random Shelf Configurations

Figure 8 shows some examples of products randomly placed in the shelf, with a few products placed at the rear. The recognition accuracy is high, even if some Lipton cans are recognized as Kickstarter, since they are very similar; note that the difference between the cans themselves is very small, as only a small part of the logo differs. It is worth noting that the Gatorade bottles are detected despite having a different shape from the one in our database: this shows that the algorithm is flexible enough to recognize even unknown products sharing similar properties with known ones. As with the cans, the Lipton bottle variants (brown bottles) are so similar that it is almost impossible to distinguish between them. Finally, Pepsi and MtnDew (green bottles) have distinctive colors, hence we achieve good accuracy on them.

Fig. 8.
figure 8

Examples of products randomly placed in the shelf and a few products placed at the rear.

3.2 Ordered Shelf Configurations

Figure 9 shows some examples of real cooler cases, where a shelf is filled by columns and each column contains only bottles/cans of the same brand. In the first row, two tea bottles placed at the rear of an almost empty fridge are correctly recognized, while in the second row two Gatorade bottles and three Lipton cans are correctly recognized as well. The cooler is recognized as almost empty in both cases. In the third row there are some missed Pepsi bottles, due to weak edges which are not picked up by the template matching. In the last row, the shelf is full of bottles and, in this case, some products are missed.

Fig. 9.
figure 9

Examples of real cooler cases.

From the analysis of 100 experiments we can state that:

  • the overall average accuracy we obtained is over 80%. In particular, an empty shelf is identified with 100% precision, while the accuracy decreases to 70% when the shelf is almost full, because product occlusion forces the algorithm to rely only on the top part of each product instead of considering it in its entirety.

  • Since the system should send a cooler inventory every 10 min, the performance is quite satisfactory, as a whole scan of a \(654\times 594\) mm cooler shelf takes approximately 100 s.

  • Some products are more easily detectable than others, since the colors of beverages like Pepsi, MtnDew and Gatorade create a well-defined contrast with the background and are very different from the colors of other products. By contrast, Aquafina is very difficult to identify because of its transparent bottle and its white cap, which blends into the background.

4 Conclusions

We have described a simple yet effective system for monitoring the content of a commercial cooler through the visual analysis of the shelves' images taken with low-cost wide-angle cameras. The difficulty of this task lies mainly in the challenging set-up in which it has to be carried out: severe or almost complete occlusion, uneven lighting conditions, poor image quality, and low-cost hardware. The proposed solution combines simple techniques which effectively work under these challenging conditions.

Despite the simplicity of the techniques used, we achieved a satisfactory accuracy level, detecting from 70% to 95% of the products on a whole shelf across 100 test images. Since the system should send a cooler inventory every 10 min, the computational performance is acceptable, as a full shelf scan takes approximately 100 s using limited computational resources. Finally, the system is very flexible, as it needs just a simple and quick learning phase to add new products.

In the future, we plan to better handle irregular light intensity and color distortion in order to improve the recognition accuracy.