
Grocery stores through the eyes of AI: Building real-time product recognition

Contributors: Luka Slibar, Matej Balun and Vito Pauletic
At Microblink, we apply artificial intelligence to real-world problems with the goal of making life easier for as many people as possible. Recently, our ML teams have been looking into ways of using computer vision to bring the best of online shopping into supermarkets across the world. We wanted to let shoppers interact with supermarket products from their smartphones to quickly surface things like:
- Product reviews
- Cashback offers
- Allergen information and nutritional values
- Anything else that will make their purchase experience better
On the other hand, we wanted suppliers, retailers and Consumer Packaged Goods (CPG) brands to use this technology to run targeted promotions, guide their pricing strategy and keep the store running smoothly.

We’re not the first ones to work on this problem – Google has been trying to gain ground with a similar solution – but we were better positioned to tackle it effectively. Our Shopper Intelligence product already captures purchase data from retail receipts so that brands can create – and shoppers cash in on – data-driven loyalty programs. Over the years, we’ve processed over 5 billion unique purchases and created a comprehensive catalog of supermarket products. We’ve also built a number of ML models designed to run efficiently on mobile devices. One of them, used to detect receipts, also served as a baseline for detecting products on shelves.
Getting the data
Apart from our product catalog, we tried using a number of open-source datasets like SKU-110K to get training data. These were a good starting point for data collection, but their commercial use is prohibited. We were left with no choice but to do things ourselves, so our team took to major retail stores nationwide, snapping images of whole shelves as well as individual products. They were also asked to photograph each product’s barcode so that we could connect it with its UPC (Universal Product Code) and get a reliable identifier for retrieval. In just a few months, we had collected millions of photos.

Next, our 100-strong annotation team went over these images to identify the products that appeared in them. To speed up their workflow, we built an initial pre-annotation model, which allowed us to label millions of product packages and tens of thousands of shelf images.
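As a rough illustration of that pre-annotation step, here is a minimal sketch. The detect_products call stands in for our initial detector and is hypothetical, as is the output format; our actual annotation tooling differs.

```python
import json
from pathlib import Path

def export_preannotations(image_dir: str, out_path: str, score_threshold: float = 0.5):
    """Run an initial detector over unlabeled shelf images and dump its
    boxes as pre-annotations that human annotators only need to correct."""
    records = []
    for image_path in Path(image_dir).glob("*.jpg"):
        # detect_products() is a placeholder for the first, rough detector;
        # assume it returns (x1, y1, x2, y2, score) tuples for each candidate box.
        for (x1, y1, x2, y2, score) in detect_products(image_path):
            if score < score_threshold:
                continue  # low-confidence boxes create more work than they save
            records.append({
                "image": image_path.name,
                "bbox": [x1, y1, x2, y2],
                "label": "product",  # annotators refine this to a concrete UPC later
                "source": "pre-annotation",
            })
    Path(out_path).write_text(json.dumps(records, indent=2))
```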
We were now ready to start doing some proper model training.
Detecting products from a camera stream
The first thing we needed to do was detect individual products as they sit on the shelves so that they can be classified by other models later in the pipeline. Our detector had to run in real time on mobile devices and crop out product images whether the user is scanning the shelf at close range or from afar.
The model we trained puts a bounding box around each individual product package. In the future, we might switch to polygon segmentation, as consumer goods tend to come in all shapes and sizes. We used a relatively strict intersection over union (IoU) threshold of 0.7 and were still able to achieve an F1 score of 92%. We also brought model inference time down to well under 100 ms on the iPhone 8 and newer, thanks to our in-house inference engine.
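For context, evaluating a detector at an IoU threshold of 0.7 looks roughly like the sketch below, using simple greedy matching between predictions and ground truth; our actual evaluation harness is more involved.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def f1_at_iou(predictions, ground_truth, threshold=0.7):
    """Greedily match predicted boxes to ground-truth boxes; a prediction
    counts as a true positive only if its best unmatched IoU >= threshold."""
    matched, tp = set(), 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            overlap = iou(pred, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, i
        if best_iou >= threshold:
            matched.add(best_idx)
            tp += 1
    precision = tp / max(len(predictions), 1)
    recall = tp / max(len(ground_truth), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```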
The biggest challenge for our product detector at the moment is image perspective. When an image is taken from an angle, a product package might end up having a completely different, distorted shape. To address this problem, we added a shelf detection model to the mix. We can use the shelf as a reference point to quickly dewarp images and boost our chances of making accurate detections.
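The dewarping itself can be done with a standard perspective transform once the shelf gives us reference points. Here is a minimal OpenCV sketch under the assumption that shelf detection yields four corner points; how those points are actually derived isn't covered here.

```python
import cv2
import numpy as np

def dewarp_shelf(image, shelf_corners, out_w=1280, out_h=480):
    """Rectify a shelf region using its four detected corners
    (top-left, top-right, bottom-right, bottom-left) as reference points."""
    src = np.float32(shelf_corners)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    homography = cv2.getPerspectiveTransform(src, dst)
    # Product detection then runs on the rectified crop, where packages
    # keep a roughly frontal, undistorted shape.
    return cv2.warpPerspective(image, homography, (out_w, out_h))
```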

Telling the products apart
Once detected, products need to be classified into their respective classes. When we say classes, we really mean UPCs – and there are A LOT of them. Not only are there millions of supermarket products out there, but brands love to change their packaging whenever they feel like it. It’s definitely not your everyday classification problem.
The sheer scale of potential classes nudged us toward a different approach: an embedding and retrieval system. The idea is simple – convert each product crop to a feature vector, then retrieve similar vectors from the database. These condensed representations of product images can be compared much more quickly and are far less sensitive to glare and scanning angle.
We currently have around one million indexed products that are stored and queried against input embeddings using a k-NN search, with the similarity score for each pair ranging from -1 to 1. The closer the dot product is to 1, the more similar the products. In our case, the majority of retrievals scoring above 0.75 have turned out to be correct, but this threshold is bound to go up as we continue to expand the index.
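Conceptually, the retrieval step looks something like the sketch below. It assumes the embeddings are L2-normalized, which is what makes the dot product a similarity score between -1 and 1; the 0.75 cutoff is the threshold mentioned above.

```python
import numpy as np

def retrieve(query_embedding, index_embeddings, index_upcs, k=100, threshold=0.75):
    """Return the k most similar indexed products for one query embedding.

    index_embeddings: (N, D) float32 matrix of L2-normalized product embeddings
    index_upcs:       list of N UPC strings aligned with the rows above
    """
    q = query_embedding / np.linalg.norm(query_embedding)  # assume unit norm
    scores = index_embeddings @ q                          # dot products in [-1, 1]
    top = np.argsort(-scores)[:k]
    # Keep only matches above the similarity threshold; everything else is
    # treated as "product not in the catalog yet".
    return [(index_upcs[i], float(scores[i])) for i in top if scores[i] >= threshold]
```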

We experimented with a variety of model architectures to find the setup that works best. We wanted the model to retrieve the k most similar vectors when given a new, unseen image of a product. But this becomes a real challenge with fine-grained differences like package sizes and flavors, which are sometimes hard to distinguish even for humans.

Our goal was to optimize for hit rate at the first retrieval result, achieve maximum recall, and have the model generalize well to products it hasn’t seen during training. Hit rate at the first result means that after we create the embedding vector and use it to retrieve, say, a hundred nearest neighbors from the database, the closest vector is actually the correct product. That’s the perfect scenario, though, and the embedder may still struggle to differentiate between slight product variations. That’s why strong recall matters: our re-ranking methods then stand a better chance of picking the real winner when it isn’t the first result.
Our current best model has a 93% hit rate at the first result on unseen product examples. Hit rate at 10 nearest neighbors is 98%, which means that re-ranking should yield precise results in most cases.
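Hit rate at k is straightforward to compute from the retrieval output; a minimal sketch, assuming each query comes with its ground-truth UPC:

```python
def hit_rate_at_k(results, ground_truth_upcs, k=1):
    """Fraction of queries whose correct UPC appears in the top-k retrievals.

    results:           list of per-query retrieval lists, each sorted by score,
                       e.g. [[("upc_a", 0.91), ("upc_b", 0.84), ...], ...]
    ground_truth_upcs: list of the correct UPC for each query
    """
    hits = sum(
        any(upc == truth for upc, _ in retrieved[:k])
        for retrieved, truth in zip(results, ground_truth_upcs)
    )
    return hits / len(ground_truth_upcs)

# hit_rate_at_k(results, truths, k=1)  -> ~0.93 on unseen product examples
# hit_rate_at_k(results, truths, k=10) -> ~0.98
```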
Just like the detector, this model needs to run in real time on mobile devices. And since there could be a couple dozen products in any given frame of the user’s camera feed, high per-product performance is paramount. Again, our in-house inference engine proved its worth here. Inference time is under 10 ms per product, which means we can do detection and embedding for an average shelf in well under a second. What we now have is a snappy system our design team can build a stunning UX on top of.
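As a rough back-of-the-envelope check: a busy frame with, say, 25 detected products adds about 25 × 10 ms = 250 ms of embedding time on top of the sub-100 ms detection pass, which is comfortably under a second.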

The retrieval process depends on an internet connection, but by embedding products on the device, we eliminated the need to send images to the backend. We only have to send about 1 KB per product, which is really not that bad.
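To put that roughly 1 KB figure in perspective: a 256-dimensional float32 embedding, for example, takes 256 × 4 = 1,024 bytes, so even a fully stocked shelf adds up to a few tens of kilobytes per request (the dimensionality here is illustrative, not necessarily the one we ship).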
Product retrieval
The embedder model is where the magic happens in our pipeline, but backend retrieval is equally important in terms of performance and accuracy. The performance part is easy, as there are some great open-source vector databases available; the accuracy part is trickier. The more images you have in your retrieval system, the easier it is to catch the right one, especially if you index multiple variations of each product taken in different lighting conditions and from different angles.
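As an illustration of the backend side, here is a minimal sketch using FAISS, an open-source similarity search library (our actual backend choice isn't named here). It shows how indexing several image variants per UPC and collapsing hits back to unique UPCs can fit together.

```python
import faiss
import numpy as np

def build_index(embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """embeddings: (N, D) float32 matrix of L2-normalized product embeddings,
    possibly several rows per UPC (different angles, lighting conditions)."""
    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine for unit vectors
    index.add(embeddings)
    return index

def query_upcs(index, row_to_upc, query: np.ndarray, k=100, threshold=0.75):
    """Search one query embedding and collapse row hits to unique UPCs,
    keeping the best score per UPC."""
    scores, rows = index.search(query.reshape(1, -1).astype(np.float32), k)
    best = {}
    for score, row in zip(scores[0], rows[0]):
        if row == -1 or score < threshold:
            continue
        upc = row_to_upc[row]
        best[upc] = max(best.get(upc, -1.0), float(score))
    return sorted(best.items(), key=lambda kv: -kv[1])
```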

Looking ahead
The product recognition pipeline we’ve outlined opens up a world of opportunities for consumers and businesses alike. From augmenting the in-store shopping experience to improving store execution, we are excited to explore all of the potential use cases of this technology and continue improving it.