We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images.
Unlike optimization-heavy pipelines, Edit3r directly predicts instruction-aligned 3D Gaussians, enabling fast, photorealistic rendering without pose estimation.
Training leverages (i) SAM2-based recoloring to create cross-view-consistent supervision and (ii) asymmetric edited/original input pairs that teach the network to fuse inconsistent observations.
We also introduce DL3DV-Edit-Bench, a benchmark of 100 instruction edits spanning 20 diverse scenes.
Edit3r delivers stronger semantic alignment and multi-view consistency than recent baselines while running in real time.
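The two training ingredients can be illustrated with a minimal sketch. Assuming object masks are already available (in the paper they come from SAM2; here they are simply given), applying one shared color shift inside each view's mask yields cross-view-consistent edited supervision, and pairing an edited view with an unedited one forms the asymmetric inputs that force the network to fuse inconsistent observations. All function names below are hypothetical, not the paper's API.

```python
import numpy as np

def recolor_views(views, masks, color_shift):
    """Apply the SAME color shift inside each view's object mask.

    Because the shift is identical across views, the edited images stay
    cross-view consistent by construction (hypothetical sketch of the
    SAM2-based recoloring supervision; masks are given, not computed).
    """
    edited = []
    for img, m in zip(views, masks):
        out = img.astype(np.float32)
        out[m] = np.clip(out[m] + color_shift, 0, 255)  # broadcast (3,) shift
        edited.append(out.astype(np.uint8))
    return edited

def asymmetric_pairs(originals, edited):
    """Pair an edited view with an ORIGINAL view from a different
    viewpoint, so the model sees deliberately inconsistent inputs."""
    n = len(originals)
    return [(edited[i], originals[(i + 1) % n]) for i in range(n)]
```

A training batch would then feed each `(edited, original)` pair to the feed-forward network with the consistently recolored renders as targets; this sketch only shows the data construction, not the Gaussian prediction itself.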