OSOR: One-Step Diffusion Inpainting for Effect-Aware Object Removal


Authors

Qinming Zhou¹,²,*, Chenxi Sun¹,³,*, Deyang Kong¹,³, Junhao He¹, Xiangheng Tang¹,⁴, Peike Yu¹,⁵, Haotian Wu¹, Leilei Cao⁶, Linfeng Zhang¹

¹Shanghai Jiao Tong University, ²Tsinghua University, ³University of Electronic Science and Technology of China, ⁴Xidian University, ⁵Tongji University, ⁶Transsion

Abstract

Real-world object removal is challenging due to two key difficulties: the target object's non-local effects, such as shadows and reflections, are difficult to model, and user-provided masks are often inaccurate or incomplete. With billions of parameters and tens of denoising steps, diffusion-based models achieve strong removal performance at the expense of substantial computational cost, limiting their use in interactive applications and on edge devices. To address these challenges, we present OSOR (One-Step Object Removal), which simultaneously achieves efficient, effect-aware, and mask-robust object removal. Concretely, OSOR introduces: (1) an occupancy-guided discriminator for precise boundary supervision, enabling stable single-step diffusion training; (2) an alpha head that leverages knowledge from pretrained diffusion models to predict appropriate removal regions with minimal overhead, thereby handling imperfect masks; and (3) a semantic-anchored verification pipeline (SAVP) that filters noisy instruction-based triplets to produce effect-aware supervision at scale. Using SAVP, we curate CORNE, which contains 280K verified removal pairs, and further annotate AnimeEraseBench and TextEraseBench to evaluate performance on more complex removal tasks. Experiments show that OSOR surpasses strong multi-step diffusion baselines in perceptual quality while achieving 4× to 30× faster inference.
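To make the role of a predicted removal region concrete, the sketch below shows one common way a per-pixel alpha map can be used: blending a generated image into the original so that only the predicted region changes. This is an illustrative assumption, not OSOR's documented formulation; the abstract does not state how the alpha head's output is applied, and the function name `composite` and the nested-list image layout are hypothetical.

```python
def composite(original, generated, alpha):
    """Blend two single-channel images per pixel.

    ASSUMPTION: alpha = 1 takes the generated pixel and alpha = 0 keeps the
    original pixel. This standard alpha-blending rule is used for illustration
    only; the source does not specify OSOR's compositing step.
    """
    return [
        [a * g + (1.0 - a) * o for o, g, a in zip(o_row, g_row, a_row)]
        for o_row, g_row, a_row in zip(original, generated, alpha)
    ]


# Toy 1x3 image: the first pixel is fully replaced, the second is kept,
# and the third is an even mix of the two sources.
original = [[10.0, 20.0, 30.0]]
generated = [[0.0, 0.0, 10.0]]
alpha = [[1.0, 0.0, 0.5]]
blended = composite(original, generated, alpha)
```

Under this assumed rule, pixels outside the predicted region pass through unchanged, which is one way a soft region estimate could compensate for a mask that is too small or too large.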

Visual Results

Examples from RORD-Val and RemovalBench.


CORNE

We introduce CORNE, a large-scale training dataset for effect-aware object removal with 280K removal pairs, together with CORNE-Val.

Erased results are generated by OSOR.

AnimeEraseBench

We introduce AnimeEraseBench, an object removal benchmark for anime-style scenes.

Erased results are generated by OSOR.

TextEraseBench

We introduce TextEraseBench, a text removal benchmark for evaluating object removal and inpainting methods on text overlays and scene-text objects.

Erased results are generated by OSOR.

Scribble-Guided Object Removal

OSOR removes objects marked by irregular scribbles while preserving the surrounding scene structure and visual consistency.
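As a rough illustration of what a scribble input might look like to a mask-based removal model, the sketch below rasterizes sampled scribble points into a coarse binary mask with a round brush. The page does not describe how OSOR encodes scribbles, so the function name `scribble_to_mask`, the `(x, y)` point format, and the `radius` parameter are all assumptions made for this example.

```python
def scribble_to_mask(points, height, width, radius=2):
    """Rasterize scribble sample points into a binary mask (1 = marked).

    ASSUMPTION: each point is an (x, y) pixel coordinate and is stamped with
    a filled disc of the given radius. This is a hypothetical preprocessing
    step, not the encoding used by OSOR.
    """
    mask = [[0] * width for _ in range(height)]
    for px, py in points:
        # Visit only the bounding box of the brush, clipped to the image.
        for y in range(max(0, py - radius), min(height, py + radius + 1)):
            for x in range(max(0, px - radius), min(width, px + radius + 1)):
                if (x - px) ** 2 + (y - py) ** 2 <= radius ** 2:
                    mask[y][x] = 1
    return mask


# A short diagonal stroke on an 8x8 canvas.
stroke = [(2, 2), (3, 3), (4, 4)]
mask = scribble_to_mask(stroke, height=8, width=8, radius=1)
```

A mask produced this way covers only part of the object, which is the kind of incomplete input that a mask-robust model is meant to tolerate.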