PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories

Amazon, Barcelona
PLACID teaser showing multi-object compositing results

PLACID enables simultaneous compositing of multiple objects into seamless natural scenes, with optional text guidance for captions, color instructions, and creative compositions.

Abstract

Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short of studio-level multi-object compositing. This task simultaneously demands (i) near-perfect preservation of each item's identity, (ii) precise background and color fidelity, (iii) control over layout and design elements, and (iv) complete, appealing displays showcasing all objects. Current state-of-the-art models, however, often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations.

To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve object consistency, identities, and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model's temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image.

Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with fewer omitted objects and visually appealing results.

Example Results

Given a set of object images, an optional background, and a text caption, PLACID produces a coherent multi-object composite. Below we show inputs (object photos + background) and the corresponding outputs on our evaluation set.

  • 1 object: watch + background → watch on yellow backdrop
  • 2 objects: ring 1, ring 2 + background → rings on purple backdrop
  • 2 objects: ceramics, vase + background → ceramics and vase
  • 3 objects: duck, truck, robot + background → duck, truck, robot
  • 3 objects: orange chair, wooden table, multicolored vase + background → chair, table, vase
  • 3 objects: scarf, necklace, bracelet + background → scarf, necklace, accessories
  • 4 objects: teddy bear, Rubik's cube, toy, toy + background → teddy bear and toys
  • 4 objects: plaid chair, blue chair, table, lamp + background → armchairs, table, lamp
  • 5 objects: wine, salad, sauce, watermelon, plate + background → food items on table

Video Outputs

PLACID generates videos where objects transition from random initial positions to coherent layouts. The final frame serves as the composite image.

Method Overview

PLACID model architecture

Our architecture builds upon an image-to-video diffusion transformer (DiT) with text guidance. The visual inputs, encoded via CLIP, include: (i) first frame F1 (a random assembly of unprocessed object images), (ii) individual object images I1..N, and (iii) an optional background B.

A caption describing the desired composition is encoded via T5. Image and text encodings are fed to the DiT through separate cross-attention mechanisms. The model flexibly handles varying numbers of objects, with or without a background image.
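As an illustration, the separate cross-attention streams described above can be sketched in a few lines. This is a hedged, minimal sketch: token counts, the embedding dimension, and the identity projections are placeholders, and a real DiT block adds learned query/key/value projections, multiple heads, normalization, and many stacked layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    # Scaled dot-product attention from DiT tokens (queries) onto
    # a conditioning sequence (context); projections omitted for brevity.
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

rng = np.random.default_rng(0)
d = 64
video_tokens = rng.standard_normal((16, d))  # DiT latent tokens
image_tokens = rng.standard_normal((8, d))   # CLIP encodings: F1, objects I1..N, background B
text_tokens = rng.standard_normal((12, d))   # T5 encoding of the caption

# Two independent cross-attention streams, each added to the token stream,
# so the model can attend to visual and textual conditions separately.
out = (video_tokens
       + cross_attention(video_tokens, image_tokens, d)
       + cross_attention(video_tokens, text_tokens, d))
print(out.shape)  # (16, 64)
```

Keeping the two conditioning streams separate lets the model handle a variable number of object images (and an absent background) without re-tokenizing the caption.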

Training Data Generation

We synthesize short videos where objects follow smooth, physically plausible trajectories from initial random locations to desired final positions. This temporal scaffold helps preserve object identity during movement and prevents object erasure or duplication. We obtain training data from four complementary sources, yielding about 50K diverse annotated tuples: in-the-wild Unsplash compositions, manually designed layouts, Subject-200k paired data, and synthetic side-by-side 3D renders.

Training data generation pipeline

Top: Naive interpolation produces half-faded objects. Bottom: Our motion-based trajectories yield temporally consistent videos.

Unsplash (in-the-wild)
Manual designs
Subject-200k pairs
Synthetic side-by-side
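The motion-based trajectories above can be sketched as simple eased interpolation from random start positions to target layout positions. This is an illustrative sketch only: the easing function, frame count, and coordinate convention are assumptions, not the paper's implementation, which uses physically plausible motion.

```python
import numpy as np

def smoothstep(t):
    # Ease-in/ease-out: zero velocity at both endpoints,
    # so objects start and settle smoothly.
    return t * t * (3.0 - 2.0 * t)

def synth_trajectories(starts, targets, num_frames=16):
    """Per-frame positions for each object, moving smoothly from
    random start locations to the target layout. Shapes:
    starts, targets: (num_objects, 2); returns (num_frames, num_objects, 2)."""
    ts = np.linspace(0.0, 1.0, num_frames)
    w = smoothstep(ts)[:, None, None]  # (T, 1, 1) blend weights
    return (1.0 - w) * starts[None] + w * targets[None]

rng = np.random.default_rng(1)
starts = rng.uniform(0.0, 1.0, size=(3, 2))              # random initial (x, y)
targets = np.array([[0.2, 0.7], [0.5, 0.6], [0.8, 0.7]])  # desired layout
traj = synth_trajectories(starts, targets, num_frames=16)
print(traj.shape)  # (16, 3, 2)
```

Because every frame shows each object fully opaque at an interpolated position, the synthetic clips avoid the half-faded objects produced by naive pixel-space interpolation.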

Comparison to State of the Art

Comparison to state of the art methods

Qualitative comparison of PLACID to VACE, UNO, DSD, OmniGen, MS-Diffusion, NanoBanana and Qwen-Image-Edit. PLACID achieves superior identity preservation, background fidelity, and fewer missing objects across diverse compositing scenarios.

Quantitative Results & User Studies

Method         CLIP-I ↑  DINO ↑  CLIP-T ↑  MSE-BG ↓  Chamfer ↓  Missing ↓

Multi-Subject Guided Image Generation
UNO             0.696    0.450    0.346     0.062     14.733     0.099
DSD             0.650    0.362    0.347     0.083     11.886     0.102
OmniGen         0.724    0.478    0.337     0.119     15.120     0.128
MS-Diffusion    0.574    0.245    0.314     0.166     16.322     0.071

Image and Video Editing Models
VACE            0.689    0.439    0.343     0.096      9.948     0.096
NanoBanana      0.662    0.390    0.344     0.029     13.146     0.138
Qwen            0.625    0.308    0.317     0.097     49.317     0.115
Ours (PLACID)   0.705    0.440    0.336     0.019      4.641     0.044

Metrics: CLIP-I and DINO measure identity preservation, CLIP-T measures text alignment, MSE-BG measures background faithfulness, Chamfer measures background color fidelity, and Missing is the fraction of omitted objects.
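As context for the Chamfer column, a generic symmetric Chamfer distance over sets of RGB colors can be sketched as below. This is an assumption about the metric's general shape, not the paper's exact formulation; the function name and inputs are illustrative.

```python
import numpy as np

def chamfer_color(a, b):
    """Symmetric Chamfer distance between two color sets a (N, 3) and
    b (M, 3): mean nearest-neighbor distance in each direction, summed.
    Lower is better (0 for identical palettes)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

pal = np.array([[255.0, 0.0, 0.0], [0.0, 255.0, 0.0]])
print(chamfer_color(pal, pal))          # identical palettes: 0.0
print(chamfer_color(pal, pal + 10.0))   # uniform +10 shift per channel
```

Because each color is matched only to its nearest neighbor, the metric rewards preserving the palette of the background rather than exact per-pixel agreement.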

User study results

User studies with 1265 side-by-side comparisons by eight external users, assessing identity preservation and overall visual preference. PLACID outperforms all open-source alternatives in both studies.

Emerging Capabilities

Emerging capabilities of PLACID

While PLACID is trained for text-guided multi-object compositing, it exhibits several emerging capabilities:

  • Creative Composite Layouts: Automatically arranges objects in plausible scenes without explicit layout guidance.
  • Multi-Entity Subject-Driven Generation: Creates photorealistic scenes from text even without a background image, including interacting elements.
  • Virtual Try-On: Integrates new objects into existing scenes, enabling applications such as virtual clothing try-on.
  • Image Editing: Leverages text guidance, identity preservation, and fine-grained color control for tasks from simple color adjustments to complex compositional changes.
  • Video Generation: Produces short, consistent videos for edits or scene completions, usable as animated creative content.

BibTeX

@article{canettarres2026placid,
  title={PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories},
  author={Canet Tarr{\`e}s, Gemma and Baradad, Manel and Moreno-Noguer, Francesc and Li, Yumeng},
  journal={arXiv preprint arXiv:2602.00267},
  year={2026}
}