OCTO+: A Suite for Automatic Open-Vocabulary Object Placement in Mixed Reality

University of California, Santa Barbara
*Equal contribution

Overview of the methods we experimented with, including OCTO+. To determine where a cupcake should be placed, the pipeline performs three stages: 1) image understanding: generate a list of all objects in the image (OCTO+ uses RAM++); 2) reasoning: select the most natural placement object with GPT-4; 3) locating: locate the 2D coordinates of the selected object in the image and ray cast to determine the 3D location in the AR scene.
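The three stages compose into a single placement call. Below is a minimal Python sketch under stated assumptions: tag_objects, select_object, and locate_object are hypothetical wrappers standing in for RAM++, GPT-4, and the locating model, and the final 2D-to-3D ray cast is left to the AR client.

# Minimal sketch of the three-stage placement flow (the wrapper functions
# tag_objects, select_object, and locate_object are hypothetical stand-ins
# for RAM++, GPT-4, and the locating model, respectively).
def place_virtual_object(image, virtual_object):
    # Stage 1 - image understanding: list every object visible in the scene.
    candidate_objects = tag_objects(image)                      # e.g., RAM++ tags

    # Stage 2 - reasoning: ask an LLM which object the virtual item would
    # most naturally be placed on.
    target = select_object(virtual_object, candidate_objects)   # e.g., GPT-4

    # Stage 3 - locating: find the 2D pixel of the chosen object; the AR
    # client then ray casts through that pixel to obtain a 3D anchor.
    x, y = locate_object(image, target)                         # e.g., Grounded-SAM center
    return x, y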

Introduction

One key challenge in Augmented Reality is the placement of virtual content in natural locations. Most existing automated techniques only work with a closed-vocabulary, fixed set of objects. In this paper, we introduce and evaluate several methods for automatic object placement using recent advances in open-vocabulary vision-language models. Through a multifaceted evaluation, we identify a new state-of-the-art method, OCTO+. We also introduce a benchmark for automatically evaluating the placement of virtual objects in augmented reality, alleviating the need for costly user studies. Using this benchmark, together with human evaluations, we find that OCTO+ places objects in a valid region over 70% of the time, outperforming other methods on a range of metrics.

In summary, our contributions are three-fold:

  • We present the state-of-the-art pipeline OCTO+, which outperforms GPT-4V and the predecessor OCTOPUS method on virtual content placement in augmented reality scenes.
  • We conduct extensive experimentation with state-of-the-art multimodal large language models, image editing models, and approaches that chain a series of models, leading to an overall 3-stage conceptualization of the automatic placement problem.
  • We introduce PEARL, a benchmark for Placement Evaluation of Augmented Reality ELements.

Results

Metric        Unnatural   Random     GPT-4V    OCTOPUS   GPT-4V + CLIPSeg   OCTO+ (ours)   Human
SCORE (↑)     -176.375    -106.113   -34.282   -15.300   -4.492             7.634          17.987
IN MASK (↑)   0.010       0.161      0.321     0.588     0.692              0.702          0.907
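The IN MASK row can be read as the fraction of predicted 2D placements that land inside an annotated valid-placement region. Below is a minimal sketch of that computation, assuming each scene provides a boolean pixel mask of valid regions (the annotation format here is an assumption, not necessarily the benchmark's exact one).

import numpy as np

def in_mask_rate(placements, masks):
    # placements: list of (x, y) pixel coordinates, one per scene
    # masks: list of boolean H x W arrays marking valid placement regions
    hits = 0
    for (x, y), mask in zip(placements, masks):
        h, w = mask.shape
        if 0 <= int(y) < h and 0 <= int(x) < w and mask[int(y), int(x)]:
            hits += 1
    return hits / len(placements)

# Example: one scene where the prediction falls inside the valid region.
mask = np.zeros((4, 4), dtype=bool)
mask[2, 2] = True
print(in_mask_rate([(2, 2)], [mask]))   # -> 1.0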

Leaderboard

Scores of various methods on the PEARL benchmark.

# Tagger Filter Selector Locator In Mask Score MTurk Expert
- Natural Placement* 0.907 17.987 1.000 1.000
1 RAM++ G-DINO GPT-4 G-SAM (Center) 0.702 7.634 0.527 0.690
2 GPT-4V(ision) CLIPSeg (Max) 0.692 -4.492 0.582 0.620
3 GPT-4V(ision) G-SAM (Center) 0.686 4.317 0.580 -
4 RAM++ G-DINO GPT-4 CLIPSeg (Max) 0.671 -4.185 0.547 -
5 LLaVA-v1.5-13B CLIPSeg (Max) 0.649 -13.170 - -
6 SCP G-DINO GPT-4 G-SAM (Center) 0.615 -6.464 - -
7 SCP CLIPSeg GPT-4 G-SAM (Center) 0.613 -10.783 - -
8 SCP G-DINO GPT-4 CLIPSeg (Max) 0.596 -13.005 - -
9 SCP ViLT GPT-4 CLIPSeg (Max) 0.588 -15.300 0.514 0.570
10 SCP CLIPSeg GPT-4 CLIPSeg (Max) 0.572 -20.730 - -
11 GPT-4V (Pixel Location) 0.321 -34.282 - -
12 InstructPix2Pix G-SAM (Bottom) 0.283 -60.852 - -
13 Random Placement* 0.161 -106.113 0.467 0.040
14 Unnatural Placement* 0.010 -176.375 0.167 0.020
Overall results of different models on the PEARL benchmark. Baselines are denoted with *; row 1 corresponds to OCTO+ and row 9 to the original OCTOPUS method.

PEARL Score



Is PEARL Score aligned with human preferences?

We observe a strong positive correlation between the automated PEARL metrics and human scores.
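One simple way to check this alignment (an illustration, not necessarily the paper's exact protocol) is to correlate the per-method Score and MTurk columns from the leaderboard above, e.g. with a Spearman rank correlation.

from scipy.stats import spearmanr

# Per-method (Score, MTurk) pairs copied from the leaderboard above.
pearl_scores  = [17.987, 7.634, -4.492, 4.317, -4.185, -15.300, -106.113, -176.375]
mturk_ratings = [1.000, 0.527, 0.582, 0.580, 0.547, 0.514, 0.467, 0.167]

rho, p_value = spearmanr(pearl_scores, mturk_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")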



OCTO+ Runtime

BibTeX

@article{sharma2024octo,
  title   = {OCTO+: A Suite for Automatic Open-Vocabulary Object Placement in Mixed Reality},
  author  = {Aditya Sharma and Luke Yoffe and Tobias Höllerer},
  journal = {arXiv preprint arXiv:2401.08973},
  year    = {2024},
}