Qualitative comparison of PLACID to VACE, UNO, DSD, OmniGen, MS-Diffusion, NanoBanana and Qwen-Image-Edit. PLACID achieves superior identity preservation, background fidelity, and fewer missing objects across diverse compositing scenarios.
Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short of studio-level multi-object compositing. This task simultaneously demands (i) near-perfect preservation of each item's identity, (ii) precise background and color fidelity, (iii) control over layout and design elements, and (iv) complete, appealing displays that showcase all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentation.
To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve object consistency, identities, and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model's temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image.
Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with fewer omitted objects and visually appealing results.
Given a set of object images, an optional background, and a text caption, PLACID produces a coherent multi-object composite. Below we show inputs (object photos + background) and the corresponding outputs on our evaluation set.
PLACID generates videos where objects transition from random initial positions to coherent layouts. The final frame serves as the composite image.
Our architecture builds upon an image-to-video diffusion transformer (DiT) with text guidance. The visual inputs, encoded via CLIP, include: (i) first frame F1 (a random assembly of unprocessed object images), (ii) individual object images I1..N, and (iii) an optional background B.
A caption describing the desired composition is encoded via T5. Image and text encodings are fed to the DiT through separate cross-attention mechanisms. The model flexibly handles varying numbers of objects, with or without a background image.
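The separate cross-attention streams described above can be sketched as follows. This is a minimal single-head NumPy illustration of the conditioning pathway, not the released PLACID code; all function names, weight shapes, and the single-block structure are illustrative assumptions.

```python
import numpy as np

def attention(q_tokens, kv_tokens, wq, wk, wv):
    """Single-head scaled dot-product attention (NumPy sketch)."""
    q, k, v = q_tokens @ wq, kv_tokens @ wk, kv_tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ v

def dual_cross_attn_block(x, img_tokens, txt_tokens, params):
    """One DiT-style block: self-attention over video latent tokens,
    then separate residual cross-attention streams for the CLIP image
    tokens and the T5 text tokens."""
    x = x + attention(x, x, *params["self"])
    x = x + attention(x, img_tokens, *params["img"])
    x = x + attention(x, txt_tokens, *params["txt"])
    return x
```

Because the image and text encodings enter through independent cross-attention streams, the image stream can simply receive more or fewer CLIP tokens as the object count varies, with or without a background token set.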
We synthesize short videos where objects follow smooth, physically plausible trajectories from initial random locations to desired final positions. This temporal scaffold helps preserve object identity during movement and prevents object erasure or duplication. We obtain training data from four complementary sources, yielding about 50K diverse annotated tuples: in-the-wild Unsplash compositions, manually designed layouts, Subject-200k paired data, and synthetic side-by-side 3D renders.
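The trajectory-based video synthesis above can be sketched as below: each object slides from a random start to its annotated target position along a smoothstep-eased straight line, so that every intermediate frame shows fully opaque, un-faded objects. The naive paste-based frame composition and the specific easing function are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def smoothstep(t):
    """Ease-in/ease-out: zero velocity at both ends, so objects start and stop gently."""
    return t * t * (3.0 - 2.0 * t)

def make_trajectory_video(bg, objects, targets, num_frames=16, seed=0):
    """bg: HxWx3 array; objects: list of hxwx3 patches; targets: list of (y, x) corners.
    Returns num_frames frames; the last frame is the target composite layout."""
    rng = np.random.default_rng(seed)
    H, W, _ = bg.shape
    starts = [(rng.integers(0, H - o.shape[0]), rng.integers(0, W - o.shape[1]))
              for o in objects]
    frames = []
    for f in range(num_frames):
        t = smoothstep(f / (num_frames - 1))
        frame = bg.copy()
        for obj, (y0, x0), (y1, x1) in zip(objects, starts, targets):
            # Linear position interpolation with eased time parameter.
            y = int(round((1 - t) * y0 + t * y1))
            x = int(round((1 - t) * x0 + t * x1))
            frame[y:y + obj.shape[0], x:x + obj.shape[1]] = obj
        frames.append(frame)
    return frames
```

Because every frame contains each object exactly once at full opacity, training on such sequences teaches the model to move objects rather than to fade, erase, or duplicate them.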
Top: Naive interpolation produces half-faded objects. Bottom: Our motion-based trajectories yield temporally consistent videos.
| Method | CLIP-I ↑ | DINO ↑ | CLIP-T ↑ | MSE-BG ↓ | Chamfer ↓ | Missing ↓ |
|---|---|---|---|---|---|---|
| *Multi-Subject Guided Image Generation* | | | | | | |
| UNO | 0.696 | 0.450 | 0.346 | 0.062 | 14.733 | 0.099 |
| DSD | 0.650 | 0.362 | 0.347 | 0.083 | 11.886 | 0.102 |
| OmniGen | 0.724 | 0.478 | 0.337 | 0.119 | 15.120 | 0.128 |
| MS-Diffusion | 0.574 | 0.245 | 0.314 | 0.166 | 16.322 | 0.071 |
| *Image and Video Editing Models* | | | | | | |
| VACE | 0.689 | 0.439 | 0.343 | 0.096 | 9.948 | 0.096 |
| NanoBanana | 0.662 | 0.390 | 0.344 | 0.029 | 13.146 | 0.138 |
| Qwen | 0.625 | 0.308 | 0.317 | 0.097 | 49.317 | 0.115 |
| Ours (PLACID) | 0.705 | 0.440 | 0.336 | 0.019 | 4.641 | 0.044 |
Metrics: CLIP-I/DINO (identity preservation), CLIP-T (text alignment), MSE-BG (background faithfulness), Chamfer (background color fidelity), Missing (fraction of omitted objects). ↑ higher is better; ↓ lower is better.
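As one concrete example of the metrics above, a symmetric Chamfer distance between the two images' sampled RGB colors can be computed as sketched below. This is one plausible reading of the color-fidelity metric; the paper's exact sampling, color space, and normalization are assumptions here.

```python
import numpy as np

def chamfer_color(img_a, img_b, n_samples=512, seed=0):
    """Symmetric Chamfer distance between sampled RGB pixel sets of two HxWx3 images."""
    rng = np.random.default_rng(seed)
    a = img_a.reshape(-1, 3)[rng.choice(img_a.shape[0] * img_a.shape[1], n_samples)]
    b = img_b.reshape(-1, 3)[rng.choice(img_b.shape[0] * img_b.shape[1], n_samples)]
    # Pairwise Euclidean distances between the two color samples.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Average nearest-neighbor distance in both directions.
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Because the distance is taken between unordered color sets rather than aligned pixels, it rewards matching the overall palette even when object placement differs from the reference.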
User studies comprising 1,265 side-by-side comparisons from eight external raters, assessing identity preservation and overall visual preference. PLACID outperforms all open-source alternatives in both studies.
While PLACID is trained for text-guided multi-object compositing, it also exhibits several emergent capabilities.
@article{canettarres2026placid,
title={PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories},
author={Canet Tarr{\`e}s, Gemma and Baradad, Manel and Moreno-Noguer, Francesc and Li, Yumeng},
journal={arXiv preprint arXiv:2602.00267},
year={2026}
}