OCTO+: A Suite for Automatic Open-Vocabulary Object Placement in Mixed Reality

University of California, Santa Barbara
*Equal contribution

Overview of the methods we experimented with, including OCTO+. To determine where a cupcake should be placed, the pipeline performs three stages: 1) image understanding: generate a list of all objects in the image (OCTO+ uses RAM++); 2) reasoning: select the most natural placement object with GPT-4; 3) locating: locate the 2D coordinates of the selected object in the image and ray cast to determine the 3D location in the AR scene.
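The three stages compose into a single placement call. Below is a minimal Python sketch under stated assumptions: tag_objects, select_object, and locate_object are hypothetical wrappers standing in for RAM++, GPT-4, and the locating model, and the final 2D-to-3D ray cast is left to the AR client.

# Minimal sketch of the three-stage placement flow (the wrapper functions
# tag_objects, select_object, and locate_object are hypothetical stand-ins
# for RAM++, GPT-4, and the locating model, respectively).
def place_virtual_object(image, virtual_object):
    # Stage 1 - image understanding: list every object visible in the scene.
    candidate_objects = tag_objects(image)                      # e.g., RAM++ tags

    # Stage 2 - reasoning: ask an LLM which object the virtual item would
    # most naturally be placed on.
    target = select_object(virtual_object, candidate_objects)   # e.g., GPT-4

    # Stage 3 - locating: find the 2D pixel of the chosen object; the AR
    # client then ray casts through that pixel to obtain a 3D anchor.
    x, y = locate_object(image, target)                         # e.g., Grounded-SAM center
    return x, y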

Introduction

One key challenge in Augmented Reality is the placement of virtual content in natural locations. Most existing automated techniques only work with a closed-vocabulary, fixed set of objects. In this paper, we introduce and evaluate several methods for automatic object placement using recent advances in open-vocabulary vision-language models. Through a multifaceted evaluation, we identify a new state-of-the-art method, OCTO+. We also introduce a benchmark for automatically evaluating the placement of virtual objects in augmented reality, alleviating the need for costly user studies. Using this benchmark, together with human evaluations, we find that OCTO+ places objects in a valid region over 70% of the time, outperforming other methods on a range of metrics.

In summary, our contributions are three-fold:

  • We present the state-of-the-art pipeline OCTO+, which outperforms GPT-4V and the predecessor OCTOPUS method on virtual content placement in augmented reality scenes.
  • We conduct extensive experimentation with state-of-the-art multimodal large language models, image editing models, and approaches that chain a series of models, leading to an overall 3-stage conceptualization of the automatic placement problem.
  • We introduce PEARL, a benchmark for Placement Evaluation of Augmented Reality ELements.

Results

Metric        Unnatural   Random     GPT-4V    OCTOPUS   GPT-4V + CLIPSeg   OCTO+ (ours)   Human
SCORE (↑)     -176.375    -106.113   -34.282   -15.300   -4.492             7.634          17.987
IN MASK (↑)   0.010       0.161      0.321     0.588     0.692              0.702          0.907
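The IN MASK row can be read as the fraction of predicted 2D placements that land inside an annotated valid-placement region. Below is a minimal sketch of that computation, assuming each scene provides a boolean pixel mask of valid regions (the annotation format here is an assumption, not necessarily the benchmark's exact one).

import numpy as np

def in_mask_rate(placements, masks):
    # placements: list of (x, y) pixel coordinates, one per scene
    # masks: list of boolean H x W arrays marking valid placement regions
    hits = 0
    for (x, y), mask in zip(placements, masks):
        h, w = mask.shape
        if 0 <= int(y) < h and 0 <= int(x) < w and mask[int(y), int(x)]:
            hits += 1
    return hits / len(placements)

# Example: one scene where the prediction falls inside the valid region.
mask = np.zeros((4, 4), dtype=bool)
mask[2, 2] = True
print(in_mask_rate([(2, 2)], [mask]))   # -> 1.0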

Leaderboard

Scores of various methods on the PEARL benchmark.

# Tagger Filter Selector Locator In Mask Score MTurk Expert
- Natural Placement* 0.907 17.987 1.000 1.000
1 RAM++ G-DINO GPT-4 G-SAM (Center) 0.702 7.634 0.527 0.690
2 GPT-4V(ision) CLIPSeg (Max) 0.692 -4.492 0.582 0.620
3 GPT-4V(ision) G-SAM (Center) 0.686 4.317 0.580 -
4 RAM++ G-DINO GPT-4 CLIPSeg (Max) 0.671 -4.185 0.547 -
5 LLaVA-v1.5-13B CLIPSeg (Max) 0.649 -13.170 - -
6 SCP G-DINO GPT-4 G-SAM (Center) 0.615 -6.464 - -
7 SCP CLIPSeg GPT-4 G-SAM (Center) 0.613 -10.783 - -
8 SCP G-DINO GPT-4 CLIPSeg (Max) 0.596 -13.005 - -
9 SCP ViLT GPT-4 CLIPSeg (Max) 0.588 -15.300 0.514 0.570
10 SCP CLIPSeg GPT-4 CLIPSeg (Max) 0.572 -20.730 - -
11 GPT-4V (Pixel Location) 0.321 -34.282 - -
12 InstructPix2Pix G-SAM (Bottom) 0.283 -60.852 - -
13 Random Placement* 0.161 -106.113 0.467 0.040
14 Unnatural Placement* 0.010 -176.375 0.167 0.020
Overall results of different models on the PEARL benchmark. Baselines are denoted with *; row 1 corresponds to OCTO+ and row 9 to the original OCTOPUS method.

PEARL Score



Is PEARL Score aligned with human preferences?

We observe a strong positive correlation between the automated PEARL metrics and human scores.
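One simple way to check this alignment (an illustration, not necessarily the paper's exact protocol) is to correlate the per-method Score and MTurk columns from the leaderboard above, e.g. with a Spearman rank correlation.

from scipy.stats import spearmanr

# Per-method (Score, MTurk) pairs copied from the leaderboard above.
pearl_scores  = [17.987, 7.634, -4.492, 4.317, -4.185, -15.300, -106.113, -176.375]
mturk_ratings = [1.000, 0.527, 0.582, 0.580, 0.547, 0.514, 0.467, 0.167]

rho, p_value = spearmanr(pearl_scores, mturk_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")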



OCTO+ Runtime

BibTeX

@article{sharma2024octo,
  title   = {OCTO+: A Suite for Automatic Open-Vocabulary Object Placement in Mixed Reality},
  author  = {Aditya Sharma and Luke Yoffe and Tobias Höllerer},
  journal = {arXiv preprint arXiv:2401.08973},
  year    = {2024},
}