Edit3r

Instant 3D Scene Editing from Sparse Unposed Images

Anonymous Authors

Instruction-driven 3D Gaussian scene editing without per-scene optimization.

Feed-forward 3D editing · SAM2 recoloring · Asymmetric inputs
Edit3r teaser showing editing results

One-pass 3D Gaussian edits from sparse, unposed, and view-inconsistent 2D edits.

Abstract

We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike optimization-heavy pipelines, Edit3r directly predicts instruction-aligned 3D Gaussians, enabling fast, photorealistic rendering without pose estimation. Training leverages (i) SAM2-based recoloring to create cross-view-consistent supervision and (ii) asymmetric edited/original input pairs that teach the network to fuse inconsistent observations. We also introduce DL3DV-Edit-Bench, a 100-edit benchmark over 20 diverse scenes. Edit3r delivers stronger semantic alignment and multi-view consistency than recent baselines while running in real time.

Overview & Demo

End-to-end Edit3r walkthrough: instruction -> 3D Gaussian edit -> rendered result.

Pipeline sketch

Instruction-driven 3D editing: asymmetric inputs, cross-view fusion, and direct Gaussian prediction.

Training Recipe

  • SAM2-based recoloring builds reliable, cross-view-consistent supervision.
  • Asymmetric edited/original pairs inject realistic inconsistencies during training (sketched after the figures below).
  • Feed-forward reconstruction predicts canonical 3D Gaussians directly.
  • 2D + 3D losses combine CLIP, LPIPS, and low-frequency MSE terms with 3D center and geometry regularizers.
Training pipeline overview

Four-step training loop: SAM2 recolor -> asymmetric inputs -> reconstruction -> joint 2D/3D losses.

Training pipeline animation: recolor supervision, asymmetric pairs, reconstruction, and mixed 2D/3D losses.
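
To make the asymmetric-input step concrete, here is a minimal sketch of how edited/original pairs might be assembled for one training sample. The function name, the per-view mixing probability, and the pairing of frames are illustrative assumptions, not the released implementation.

```python
import random

def build_asymmetric_batch(views, edited_views, edit_prob=0.5):
    """Pair edited and original views for one training sample.

    views:       list of original RGB frames for a scene
    edited_views: the same frames after SAM2-based recoloring
    edit_prob:   chance that an input view uses the edited frame
                 (illustrative value; the actual schedule may differ)
    """
    inputs, targets = [], []
    for orig, edit in zip(views, edited_views):
        # Randomly mix edited and original frames so the network must
        # fuse view-inconsistent observations at training time.
        inputs.append(edit if random.random() < edit_prob else orig)
        # Supervision always comes from the recolored frame, which is
        # cross-view consistent by construction.
        targets.append(edit)
    return inputs, targets
```

Because the targets are always the consistent recolored frames, the network learns to resolve disagreements among its inputs rather than copy any single view.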

SAM2 recoloring pipeline

SAM2 Recoloring

Automatic mask discovery + propagation to synthesize edited supervision that stays consistent across views.

Masks are filtered by predicted IoU, stability score, area, aspect ratio, and boundary margin before recoloring.
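
The page names the filtering criteria but not the cutoffs, so the sketch below fills in plausible values; treat every threshold as an assumption.

```python
import numpy as np

def keep_mask(mask, pred_iou, stability, img_h, img_w,
              min_iou=0.8, min_stability=0.9,
              min_area_frac=0.005, max_area_frac=0.5,
              max_aspect=5.0, margin_px=8):
    """Heuristic mask filter applied before recoloring.

    mask is a (H, W) boolean array; pred_iou and stability are the
    per-mask quality scores. All thresholds here are illustrative.
    """
    # Quality gates on the mask predictor's own scores.
    if pred_iou < min_iou or stability < min_stability:
        return False

    # Area gate: drop specks and near-full-frame masks.
    area = mask.sum()
    if not (min_area_frac * img_h * img_w <= area
            <= max_area_frac * img_h * img_w):
        return False

    # Aspect-ratio gate: drop overly elongated regions.
    ys, xs = np.nonzero(mask)
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    if max(h / w, w / h) > max_aspect:
        return False

    # Boundary gate: masks touching the frame edge propagate poorly.
    if (ys.min() < margin_px or xs.min() < margin_px
            or ys.max() > img_h - 1 - margin_px
            or xs.max() > img_w - 1 - margin_px):
        return False
    return True
```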

Benchmark: DL3DV-Edit-Bench

100 text-driven edits over 20 indoor/outdoor scenes, covering Add / Remove / Modify / Global instructions.

  • Source: DL3DV test split; resolution 512x512.
  • Metrics: CLIPt2i for edit faithfulness; C-FID / C-KID for multi-view realism (a minimal CLIPt2i sketch follows this list).
  • Scenes with ambiguous prompts, low texture, or motion blur were filtered out.
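
For concreteness, a minimal CLIPt2i sketch using Hugging Face's CLIP bindings; the specific checkpoint, prompt handling, and per-frame averaging are assumptions about the evaluation protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed model choice; the benchmark may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_t2i(prompt: str, frame_paths: list[str]) -> float:
    """Mean cosine similarity between the edit prompt and rendered frames."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    batch = processor(text=[prompt], images=images,
                      return_tensors="pt", padding=True)
    out = model(**batch)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```

C-FID / C-KID would be computed analogously with standard FID/KID implementations, comparing rendered frames against reference image distributions.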
DL3DV-Edit-Bench examples

Representative prompts and edited results across all four categories.

Quantitative & Ablations

Quantitative results · Ablation results

The full model leads on CLIPt2i, C-FID, and C-KID; removing the recoloring, the 3D loss, SAM, or the random-drop strategy hurts every metric.

Qualitative Comparisons

Qualitative comparison against baselines

Edit3r vs. EditSplat, GaussCtrl, and NoPoSplat on Add / Remove / Modify / Global edits (DL3DV-Edit-Bench).

One-pass Inference

Feed-forward 3D Gaussian prediction without test-time optimization enables real-time interactive editing.

Inference speed illustration
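
A hypothetical calling convention for one-pass editing, shown only to emphasize that there is no test-time optimization loop; `model`, its signature, and the returned Gaussian container are placeholders for whatever the released code exposes.

```python
import time
import torch

@torch.no_grad()
def edit_scene(model, input_views):
    """Single forward pass from sparse, unposed views to edited 3D Gaussians.

    `model` and its output are placeholder names; the point is the absence
    of pose estimation and per-scene optimization at inference time.
    """
    start = time.perf_counter()
    gaussians = model(torch.stack(input_views))  # one pass, no poses needed
    if torch.cuda.is_available():
        torch.cuda.synchronize()                 # make the timing honest
    print(f"edited in {time.perf_counter() - start:.3f}s")
    return gaussians                             # render with any 3DGS rasterizer
```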

3D Regularization Matters

Effect of 3D losses

3D center and geometry consistency losses prevent depth drift and layered Gaussians caused by view-inconsistent 2D edits.
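
A minimal sketch of a 3D center regularizer in this spirit: the edited branch's Gaussian centers are pulled toward a detached reference geometry so that inconsistent 2D edits cannot push points off in depth. The one-to-one Gaussian pairing and the use of an original-input reconstruction as the reference are assumptions; the paper's exact formulation may differ.

```python
import torch

def center_consistency_loss(mu_edit: torch.Tensor,
                            mu_ref: torch.Tensor,
                            weight: float = 1.0) -> torch.Tensor:
    """Keep edited Gaussian centers anchored to stable geometry.

    mu_edit: (N, 3) centers predicted from view-inconsistent edited inputs
    mu_ref:  (N, 3) centers from the original-input reconstruction, detached
             so only the edited branch is regularized (an assumption about
             how the anchoring works)
    """
    return weight * ((mu_edit - mu_ref.detach()) ** 2).mean()
```

A geometry-consistency term could analogously tie covariances or rendered depth maps between the two branches to suppress layered, semi-transparent Gaussians.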

Edited Examples

More edits: party decor, pink blossoms, garden lamps, low-poly style, checkerboard plazas, sky lanterns.

More qualitative results

Example Videos

Takeaways

  • Edit3r: instant, instruction-driven 3D scene editor on 3D Gaussians.
  • SAM2-based multi-view recoloring + asymmetric inputs teach consistency from view-inconsistent 2D edits.
  • Single forward pass, no per-scene optimization -> real-time edits.
  • Strong instruction following and multi-view realism; benchmark + ablations available.