ImaginateAR

AI‑Assisted In‑Situ Authoring in Augmented Reality

Accepted to UIST 2025

Jaewook Lee1*, Filippo Aleotti2*, Diego Mazala2, Guillermo Garcia-Hernando2, Sara Vicente2, Oliver Johnston2, Isabel Kraus-Liang2, Jakub Powierza2, Donghoon Shin1, Jon E. Froehlich1, Gabriel Brostow2,3, Jessica Van Brummelen2

1University of Washington    2Niantic Spatial, Inc.    3University College London
*Equal contribution

A system overview and user flow diagram. On the left, a person is scanning a real-world environment, which ImaginateAR uses to perform scene understanding offline. On the right, a user arrives at the same location and localizes against the real world environment. They can then speak to ImaginateAR’s AI agents and make manual modifications, before exploring the generated AR scene.

Abstract


While augmented reality (AR) enables new ways to play, tell stories, and explore ideas rooted in the physical world, authoring personalized AR content remains difficult for non-experts, often requiring professional tools and time. Prior systems have explored AI-driven XR design but typically rely on manually defined VR environments and fixed asset libraries, limiting creative flexibility and real-world relevance. We introduce ImaginateAR, the first mobile tool for AI-assisted AR authoring to combine offline scene understanding, fast 3D asset generation, and LLMs—enabling users to create outdoor scenes through natural language interaction. For example, saying “a dragon enjoying a campfire” (P7) prompts the system to generate and arrange relevant assets, which can then be refined manually. Our technical evaluation shows that our custom pipelines produce more accurate outdoor scene graphs and generate 3D meshes faster than prior methods. A three-part user study (N=20) revealed preferred roles for AI, how users create in freeform use, and design implications for future AR authoring tools. ImaginateAR takes a step toward empowering anyone to create AR experiences anywhere—simply by speaking their imagination.


Video



Example Creations


We first showcase AR scenes authored using ImaginateAR. These include six scenes created by the research team as a proof by demonstration, as well as 24 scenes authored by participants in our user study (N=20 + 4 pilot) during a free-form authoring phase.


A proof by demonstration showing screenshots of ImaginateAR with various AR assets in real-world scenes, such as a colorful ‘silly hat’ on a real-world, metallic pig statue.
Six example creations from our technical evaluation, situated in a park, schoolyard, playground, shopping center, and backyard. Each scene was first generated with AI tools, then refined with light manual adjustments to reflect typical ImaginateAR use. Some are whimsical (A, F), while others are educational (C, E) or playful (B, D).

A collage of 24 AR scenes created by participants during Part 2 of the user study. Examples include ‘a dog enjoying a campfire’, ‘animal kingdom with cat, horse, and fox’, and ‘a dancing T-Rex on grass’.
AR experiences created by participants (N=20 + 4 pilot) while interacting with the ImaginateAR prototype. Users were encouraged to create freely without limitations. 'PP' denotes pilot participants and 'P' study participants.

System Overview


ImaginateAR integrates three technical innovations: (1) outdoor scene understanding, which uses enhanced OpenMask3D with GPT‑4o semantic labeling and HDBSCAN clustering to build structured scene graphs; (2) fast 3D mesh generation via GPT‑4o prompt expansion, reference image synthesis, segmentation (DIS), and mesh lifting (InstantMesh); and (3) LLM‑driven speech interaction, enabling users to place and refine assets through natural spoken commands in real time.
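To illustrate the speech-interaction loop, below is a minimal Python sketch of how a transcribed voice command might be mapped to a structured scene edit with an LLM. It assumes an OpenAI-style chat API; the JSON action schema and the interpret_command helper are illustrative assumptions, not ImaginateAR's actual implementation.

# Hypothetical sketch: turning a transcribed voice command into a structured
# scene edit via an LLM. The action schema and helper name are illustrative,
# not the system's actual interface.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You convert a user's spoken AR request into a JSON action. "
    'Respond with {"action": "add" | "move" | "delete", '
    '"asset_prompt": <text prompt for the asset>, '
    '"anchor_label": <label of an existing scene object>}.'
)

def interpret_command(transcript, scene_labels):
    """Ask the LLM to ground a spoken command against known scene-graph labels."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Scene objects: {scene_labels}. Command: {transcript}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Example: interpret_command("put a dragon next to the campfire", ["campfire", "grass", "tree"])
# might return {"action": "add", "asset_prompt": "a dragon", "anchor_label": "campfire"}.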


Diagram of the 3D scene understanding pipeline. A point cloud becomes an initial set of masks through 3D mask prediction, which becomes a semantic point cloud through filtering and classification, and is then clustered into the final set of masks. The result is a set of 3D bounding boxes corresponding to the objects' masks.
Diagram of the 3D scene understanding pipeline. Given an input point cloud, we first estimate 3D masks. Next, we assign a semantic label to each mask using a VLM and propagate the label to all points within the mask, producing a semantic point cloud. We then cluster nearby points with the same label to infer the final set of 3D masks, from which we extract 3D bounding boxes. For visualization, we show only the bounding boxes, not the underlying masks. The Pavement box is enclosed within the Road box and is therefore not visible.
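As a rough illustration of the clustering step described above, the following Python sketch groups points that share a semantic label with HDBSCAN and derives an axis-aligned bounding box per cluster. The parameter values and the boxes_from_semantic_points helper are assumptions for illustration, not the authors' code.

# Minimal sketch (not the authors' code): points sharing a semantic label are
# grouped spatially with HDBSCAN, and each cluster yields a 3D bounding box.
import numpy as np
import hdbscan

def boxes_from_semantic_points(points, labels, min_cluster_size=50):
    """points: (N, 3) xyz coordinates; labels: (N,) semantic label per point."""
    boxes = []
    for label in np.unique(labels):
        pts = points[labels == label]
        if len(pts) < min_cluster_size:
            continue
        cluster_ids = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(pts)
        for cid in np.unique(cluster_ids):
            if cid == -1:  # HDBSCAN marks noise points with -1
                continue
            cluster = pts[cluster_ids == cid]
            boxes.append({
                "label": label,
                "min_corner": cluster.min(axis=0),
                "max_corner": cluster.max(axis=0),
            })
    return boxes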

Results of the 3D scene understanding module on the Vase, House, and Garden 3D environments. Real-world features such as flowers, sidewalks, trees, and grass are represented as 3D bounding boxes.
Results of the 3D scene understanding module. For each of the three scans—Vase, House, and Garden—we visualize the input point cloud (left) and the final set of labeled 3D bounding boxes inferred by our scene understanding pipeline (right). We also report the total time (in minutes) required to estimate the scene graph for each scan. Note that some bounding boxes may be enclosed within others and may therefore be occluded.

Example of 3D asset generation. A simple prompt, 'a Roman statue', is prompt-boosted with visual examples to include beneficial words like 'detailed'. This goes through text-to-image generation, then image-to-3D, to produce the final generated mesh.
Example of 3D asset generation. Given a user prompt, we first apply prompt boosting, then use Dall-E 2 to generate a consistent image by editing the center region of a white canvas. The image is then lifted to 3D using InstantMesh. The 'Bad' example (right) illustrates a failure case because it would produce a partial 3D object (i.e., only the dragon’s head). Prompt boosting helps avoid such incomplete generations.
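For readers curious how the image-synthesis step could be set up, here is a hedged Python sketch that renders a boosted prompt into the transparent center of a white canvas via the OpenAI image-edit endpoint, which encourages a single, fully visible object. The file names, canvas sizes, and boosting phrasing are assumptions; segmentation and lifting to 3D with InstantMesh are omitted.

# Hedged sketch of the reference-image step (assumed parameters, not the
# authors' code). A boosted prompt is painted into the transparent center of a
# white canvas via the OpenAI image-edit endpoint, favoring a single complete
# object. Segmentation (DIS) and lifting to 3D (InstantMesh) are omitted.
from openai import OpenAI
from PIL import Image

client = OpenAI()

def make_canvas_and_mask(size=1024, hole=768):
    """White canvas plus a mask whose transparent center marks the edit region."""
    Image.new("RGBA", (size, size), (255, 255, 255, 255)).save("canvas.png")
    mask = Image.new("RGBA", (size, size), (255, 255, 255, 255))
    off = (size - hole) // 2
    mask.paste((0, 0, 0, 0), (off, off, off + hole, off + hole))
    mask.save("mask.png")

def generate_reference_image(user_prompt):
    boosted = f"{user_prompt}, detailed, full body, centered, plain white background"
    result = client.images.edit(
        image=open("canvas.png", "rb"),
        mask=open("mask.png", "rb"),
        prompt=boosted,
        n=1,
        size="1024x1024",
    )
    return result.data[0].url  # image is then segmented and lifted to a 3D mesh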

User Study & Evaluation


We evaluated ImaginateAR through a technical assessment and a three-part user study (N=20) in a public park. Our scene understanding pipeline outperformed the base OpenMask3D model and ablated variants, while our asset generation pipeline matched state-of-the-art quality with a faster, sub-minute runtime. The user study included: a comparison task across three authoring modes—manual, AI-assisted, and AI-decided—to explore control vs. automation (Part 1); a free-form phase where participants designed their own AR experiences (Part 2); and a co-design session reflecting on AI’s role and envisioning future features (Part 3). Overall, participants enjoyed creating diverse AR scenes and favored a hybrid approach—using AI for rapid, creative generation while retaining manual control for customization. Examples of both research team–created and user-authored scenes appear earlier on this page. For more details, please see our paper.


A comparison of bounding boxes from the ground truth, OpenMask3D, and our method. OpenMask3D produces far more bounding boxes than either the ground truth or our method.
From left to right: bounding boxes from the ground truth, OpenMask3D, and our proposed method. OpenMask3D predicts a large number of masks, resulting in excessive bounding boxes that over-represent the same scene objects. In contrast, our method produces fewer, more accurate boxes. (Box colors are arbitrary and can be ignored.)

UI layout of ImaginateAR. Users can access different features through on-screen buttons, such as a lightbulb icon for brainstorming and a mic icon for speaking to an AI agent.
Screen captures of ImaginateAR's mobile interface showing the UI layout and functionality. Users can access manual, AI-assisted, and AI-decided modes across different features through buttons on the screen.

Resources


Paper

Paper thumbnail

Supplemental

Supplemental thumbnail

BibTeX

If you find this work useful for your research, please cite:

@inproceedings{lee2025imaginatear,
  author = {Lee, Jaewook and Aleotti, Filippo and Mazala, Diego and Garcia-Hernando, Guillermo and Vicente, Sara and Johnston, Oliver James and Kraus-Liang, Isabel and Powierza, Jakub and Shin, Donghoon and Froehlich, Jon E. and Brostow, Gabriel and Van Brummelen, Jessica},
  title = {ImaginateAR: AI-Assisted In-Situ Authoring in Augmented Reality},
  year = {2025},
  isbn = {9798400720376},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3746059.3747635},
  doi = {10.1145/3746059.3747635},
  booktitle = {Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology},
  location = {Busan, Republic of Korea},
  series = {UIST '25},
}

Acknowledgements


We thank the ImaginateAR research team and study participants. ImaginateAR builds on earlier work by many of the same authors (e.g., CoCreatAR). Specific funding and acknowledgements are referenced in the full preprint.

© This webpage was inspired by this template.