Shelf Image Capture is an experimental project exploring the optimal UX for large-area image capture


ShelfView Mobile supports capturing high-resolution images of shelves with a mobile phone and analyzing on-shelf availability with computer vision. The images are used for planogram compliance, stock management, and price label verification. The design team explored the possibility of capturing large objects like retail shelves and also evaluated the ergonomics of spatial interactions with live AR feedback.

The project is still in the concept exploration stage. As a UX team lead, I’ve been providing usability testing and design iteration guidance to my direct report, Stephanie Gysler, and prototyping guidance to ScanvengAR, an external AR prototyping startup based in Berlin.

 
 
 
 
 

INITIATIVE


ShelfView Mobile was conceived around the idea of ‘smart shelf management’: analyzing images of current shelf layouts. Retailers use display blueprints, known as planograms, to specify how and where products should be arranged on shelves. A key responsibility of store associates is to ensure these shelves match the planogram by identifying discrepancies and adjusting the layout accordingly—a process known as planogram compliance.

For most U.S. retailers, planogram compliance still requires significant human involvement, relying on manual comparison and correction. This approach is not only time-intensive but also prone to human error, leading to deviations from the intended planogram. Recognizing these challenges, Scandit sought to revolutionize shelf management by leveraging image analysis to assess shelf conditions and streamline the compliance process.

 
 
 
 
 
 
 

Challenges

However, scanning retail shelves is challenging because they are large (at least 5 m, or 16 ft, wide). Some innovative U.S. retailers have adopted scanning robots that capture images of shelves, but these are costly to purchase and maintain. As a result, more and more retailers want to adopt a ‘bring-your-own-device’ policy to replace such dedicated scanning devices and save cost. Given this trend, Scandit explored the possibility of capturing shelf images with an associate’s own smartphone.

Our UX team was asked to evaluate the usability of shelf image capture with a smartphone. While evaluating each concept candidate, we focused on answering the following questions:


  • Can a smartphone capture such a large area at a sufficient level of resolution?

  • Can first-time users learn how to capture images without a tutorial or thorough training?

  • Are the ergonomics of the capturing UX comfortable enough for a full 8-hour shift?

 
 
 
 

BLUE-SKY EXPLORATIONS

The biggest challenge our team faced was that the area to capture is very large compared to the size of the mobile phone screen. Because users have to move continuously along the shelf to capture such a large area, controlling their movement was critical for getting high-quality images.



The mocks on the right are blue-sky concepts that I explored:

  • Left: Prompt users to aim at a static central dot on screen and capture photos

  • Middle: Show a big AR plane upon photo-taking that indicates which section of the shelf users captured

  • Right: Panoramic (horizontal) image capture

At this stage, I handed the project over to Stephanie and asked her to lead the concept design.

 
 

Stephanie led a team design sprint to gather insights from the engineering team, focusing on identifying the best implementation methods. The team explored two potential approaches: capturing a large shelf image as a continuous video or as a still image. I encouraged her to think outside the box and break down the concepts further, categorizing them into video versus image capture, and to evaluate the pros and cons of each method. Below are the UX mockups she created to visualize the options.

 
 
 
 
 

There are notable trade-offs between image capture and video capture. While continuous video recording tends to be more straightforward, achieving a high-quality still image is often hindered by motion blur. Although AI tools could potentially mitigate motion blur effects, utilizing them would considerably extend the project's timeline, leading us to forgo this approach. Considering our time constraints and the technical challenges involved, we decided to focus exclusively on image capture methods.

Subsequently, Stephanie refined our options based on her previous explorations, incorporating concepts I had explored in the past. This collaborative effort resulted in three new concept candidates that were ready for prototyping.

  • Concept 1: Defining ‘modules’ manually using a static frame, taking photos of each, and stitching them (left)

  • Concept 2: Drawing a big AR plane on the shelf to capture super-wide-angle shots (middle)

  • Concept 3: Panoramic image capture (right)

 
 
 
 
 
 

CONCEPT EVALUATIONS

To build working prototypes, Stephanie and I reached out to ScanvengAR, shared the first round of design iterations, and estimated the feasibility of each. Based on their assessment, we decided to prototype Concept 2 with some modifications, including:

  • Extensive guidance for defining the entire shelf capture area.

  • A capturing UX that ‘stitches’ small areas into a big one - due to a technical limitation of ARKit, we needed to slice the defined shelf area into multiple pieces to capture high-quality images.

  • A UI that helps users aim at each area to capture.
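As a rough illustration of the slicing step, here is a minimal Python sketch (not the actual implementation, which lived in ARKit/Swift) of how a defined shelf plane could be divided into capture tiles. The 1 m maximum tile size and the `slice_shelf` helper are assumptions for the example, not the product’s real parameters.

```python
import math
from dataclasses import dataclass

@dataclass
class Tile:
    """One slice of the shelf plane, in shelf coordinates (meters)."""
    x: float  # left edge
    y: float  # bottom edge
    w: float  # tile width
    h: float  # tile height

def slice_shelf(width_m: float, height_m: float,
                max_tile_m: float = 1.0) -> list[Tile]:
    """Split the shelf plane into a grid of tiles no larger than
    max_tile_m on a side, so each tile can be captured close-up
    at full camera resolution and stitched together afterwards."""
    cols = math.ceil(width_m / max_tile_m)
    rows = math.ceil(height_m / max_tile_m)
    tile_w, tile_h = width_m / cols, height_m / rows
    return [Tile(c * tile_w, r * tile_h, tile_w, tile_h)
            for r in range(rows) for c in range(cols)]

# A 5 m x 2 m shelf with 1 m tiles yields a 5 x 2 grid.
tiles = slice_shelf(5.0, 2.0)
print(len(tiles))  # 10
```

Note that the tile size directly drives the piece count: the same 5 m shelf sliced at 0.5 m would produce 40 tiles, which matches the usability finding below that more than 20 pieces felt overwhelming to capture.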


ScanvengAR suggested two UX options for the shelf area drawing experience: 1) manual shelf area drawing, and 2) automatic drawing by anchoring one point of the shelf.

 
 

With the working prototypes, Stephanie facilitated hallway usability testing with employees who were unfamiliar with the project. She asked 8 participants to try the two AR prototypes in the lab and think aloud about how they felt.

 
 
 

ROUND 1 - MANUAL AREA DEFINITION

In the first round, we tested the usability of ‘drawing the shelf area manually and capturing the image pieces created within it.’ We asked 6 participants to capture a shelf with the prototype.


VERDICT: NO-GO
Overall, participants felt they had little control over setting the points and weren’t confident they had set up the plane properly. Because the space is 3D, they often drew the plane tilted forward or backward. Also, when the AR plane was divided into more than 20 pieces, capturing them all felt overwhelming.

 
 
 

ROUND 2 - AUTOMATIC PLANE DEFINITION
In the second round, we replaced the manual plane drawing with automatic rendering. We expected R2 to be better because 1) users don’t have to move much to draw the plane, and 2) the plane is always upright, never tilted.


VERDICT: NO-GO
Unfortunately, the R2 prototype received more negative feedback than R1. Some participants couldn’t understand how the static plane-drawing UI related to the AR rendering in the live camera view. And even after drawing a shelf area on the static screen, dragging it to adjust the grid was difficult and took more time than defining a plane in R1.

 
 
 

We learned from the previous rounds that defining the shelf area was very challenging and hard to understand. Although ARKit could guarantee screenshot quality to a certain extent, we concluded that the learning curve of the AR experience was far too steep for our target users. From R3 on, we pivoted back to the ‘traditional’ way of capturing shelf images, creating one big stitched image at the end of the process. To reduce the learning curve, we added an illustration explaining how to capture images. For this round, we prototyped with ProtoPie instead.

ROUND 3 - PANORAMIC SHOT WITH NORMAL CAPTURE
We were fortunate to get the chance to test our R3 concept with real target users in the exact target context. Stephanie traveled to a Kroger store and asked 6 store associates to try the R3 prototype and think aloud about the capturing experience. In addition to the hand-held capturing experience, Kroger also asked us to mount the phone on a moving pole and have their associates simulate shelf image capturing, to compare the two ergonomics.

VERDICT: MORE USABLE THAN THE OTHER VERSIONS, BUT NEEDS AN AID
As expected, the panoramic approach of R3 successfully reduced the learning curve of shelf image capturing. However, because shelves are so wide, participants wanted to take photos with a physical aid such as a cart or a moving pole.

 
 
 
 

KEY TAKEAWAYS FROM ON-SITE TESTING

From the on-site testing, the team distilled a couple of important takeaways from participants’ responses:

Take a conservative approach for the capturing UX
We need to be mindful that our target end users are generally not tech-savvy, and they don’t receive formal on-the-job training for solutions like this. Therefore, referring to the most commonly used capturing UI patterns (e.g., a camera shutter) helps them understand the tool instantly.


Capturing shelves in narrow aisles is challenging
Small stores often have aisles narrower than 1.8 m (6 ft). There, capturing shelves is even more challenging because users can’t fit an entire module on the phone screen. We need to consider how to optimize the capturing experience under such constraints.

 
 
 
 
 

DEEP DIVE INTO THE ‘NARROW AISLE PROBLEM’

Visiting Kroger surfaced another major physical challenge to delve into: the narrow-aisle problem. While warehouse-style stores have plenty of space between aisles, small stores usually have aisles narrower than 1.8 m (6 ft), where it’s even harder to capture a shelf image in a single shot. The PM wanted us to dig deeper into this problem, so Stephanie explored three candidates for the narrow-aisle shelf capture UX.
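The geometry makes the constraint concrete. With a pinhole camera model, the horizontal extent of shelf that fits in one frame is w = 2·d·tan(fov/2). The sketch below uses assumed numbers (a ~70° horizontal field of view, typical for a phone’s main camera in landscape, and a 1.5 m capture distance inside a 1.8 m aisle), not measured values from our devices:

```python
import math

def capturable_width(distance_m: float, hfov_deg: float) -> float:
    """Horizontal extent of shelf that fits in one frame at the given
    camera distance, from the pinhole model: w = 2 * d * tan(fov / 2)."""
    return 2.0 * distance_m * math.tan(math.radians(hfov_deg) / 2.0)

# Assumed: ~70 degree horizontal FOV, 1.5 m distance in a 1.8 m aisle.
w = capturable_width(1.5, 70.0)
print(round(w, 2))  # 2.1
```

At roughly 2.1 m per frame against a 5 m shelf, multiple captures (or panning) are unavoidable in a narrow aisle, which is why all three candidates below stitch several shots together.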

 
 

Option 1 - Benchmarking Instagram’s layout capture
Instagram Stories supports a layout capture mode that lets users stitch up to 4 images at a time. We benchmarked its UX and enabled stitching up to 3 images to capture a module, hoping users would find the layout capture familiar.


Option 2 - Give static capturing guidance
Instead of using AR feedback, show static capturing guidance in the live view and display the captured images in a carousel UI.


Option 3 - Give AR feedback on the captured area
We applied ARKit and visualized the captured area in the immersive view. Users can stitch areas in real time to make sure they have captured the entire module.
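The core of Option 3’s feedback is coverage tracking: knowing which parts of the module have already been photographed. A minimal sketch of that idea, with the real frustum-on-plane projection replaced by an axis-aligned rectangle in normalized module coordinates (the `CoverageTracker` class and its grid size are illustrative assumptions, not the shipped logic):

```python
class CoverageTracker:
    """Track which parts of a module have been captured, on a coarse grid.
    A real implementation would project the camera frustum onto the ARKit
    shelf plane; here a capture is just an axis-aligned rectangle in
    normalized module coordinates (0..1)."""

    def __init__(self, cols: int = 10, rows: int = 4):
        self.cols, self.rows = cols, rows
        self.covered = [[False] * cols for _ in range(rows)]

    def add_capture(self, x0: float, y0: float, x1: float, y1: float):
        """Mark every grid cell whose center falls inside the rectangle."""
        for r in range(self.rows):
            for c in range(self.cols):
                cx = (c + 0.5) / self.cols
                cy = (r + 0.5) / self.rows
                if x0 <= cx <= x1 and y0 <= cy <= y1:
                    self.covered[r][c] = True

    def completeness(self) -> float:
        """Fraction of the module covered so far, 0.0 to 1.0."""
        hit = sum(cell for row in self.covered for cell in row)
        return hit / (self.cols * self.rows)

tracker = CoverageTracker()
tracker.add_capture(0.0, 0.0, 0.5, 1.0)   # left half of the module
tracker.add_capture(0.4, 0.0, 1.0, 1.0)   # overlapping right half
print(tracker.completeness())  # 1.0
```

The overlap between the two captures is intentional: requiring overlapping shots is what makes stitching feasible and is how the live AR highlights can show users the remaining gaps.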

 
 
 
 
 

HALLWAY USER TESTING


We ran another hallway test with 8 participants, asking them to capture the entire shelf from only a few steps away. We received recurring feedback on the usability of the three candidates:


Participants preferred the solid step-by-step guidance of Option 2
Because none of them were familiar with this type of workflow, participants wanted to be shown how to capture such a large area, so they found Option 2 the most understandable. Contrary to our hypothesis, not even regular Instagram users appreciated the benchmarked UX of Option 1.



The immersive feedback of Option 3 was the most straightforward way for participants to understand which area they had captured
Because capturing the entire module at once was challenging, participants greatly appreciated the real-time feedback ARKit provided. That said, the performance of the AR feedback was still unstable.

 
 
 
 

ENVISION

Based on what we learned from the series of user tests, we built an MVP that helps capture shelves in narrow aisles with multiple shots and showcased it at NRF 2024. The prototype supports both high-end and low-end devices, with different features and UIs to optimize the experience under each hardware’s limitations.

For high-performance devices, the team added live AR feedback on the captured area: users can see which areas have been captured via live AR highlights. We decided not to show live AR feedback on low-end scanning devices, because their tracking quality isn’t as good as that of high-performance phones and the AR feedback would only add confusion. Instead of letting users stitch images in the live view, the MVP for low-performance devices shows static guidance on which part to capture next.