We enable users to edit images with a simple cut-and-paste-like approach, and fix up those edits automatically.
We propose a generative model that, given a coarsely edited image, synthesizes a photorealistic output that follows the prescribed layout. Our method transfers fine details from the original image and preserves the identity of its parts, yet adapts them to the lighting and context defined by the new layout. Our key insight is that videos are a powerful source of supervision for this task: objects and camera motions provide many observations of how the world changes with viewpoint, lighting, and physical interactions. We construct an image dataset in which each sample is a pair of source and target frames extracted from the same video at randomly chosen time intervals. We warp the source frame toward the target using two motion models that mimic the expected test-time user edits. Starting from a pretrained diffusion model, we supervise our model to translate the warped image into the ground truth. Our model design explicitly enables fine detail transfer from the source frame to the generated image, while closely following the user-specified layout. We show that with simple segmentations and coarse 2D manipulations, we can synthesize a photorealistic edit faithful to the user's input while addressing second-order effects like harmonizing the lighting and physical interactions between edited objects.
Videos carry useful information about how objects deform and interact in the real world. Leveraging that insight, we use video datasets to supervise photo editing as follows: for each video, we sample a reference frame and a target frame. We then warp the reference frame (using a combination of optical-flow warping and affine transformations) to align it with the target, producing a "coarse" edit. Finally, we train a diffusion-based model to clean up the coarse edit, using a reconstruction loss against the target frame as ground truth. A simplified sketch of how one training pair is built is shown below.
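The following is a minimal sketch of assembling one (coarse edit, ground truth) pair from two video frames, assuming OpenCV's Farneback optical flow and a naive nearest-neighbor forward splat; the actual pipeline may rely on a learned flow estimator and a more careful splatting scheme, so the function and its arguments are illustrative only.

```python
import cv2
import numpy as np

def make_training_pair(ref_rgb, tgt_rgb):
    """Forward-warp the reference frame toward the target to mimic a coarse user edit."""
    ref_gray = cv2.cvtColor(ref_rgb, cv2.COLOR_RGB2GRAY)
    tgt_gray = cv2.cvtColor(tgt_rgb, cv2.COLOR_RGB2GRAY)
    # Dense flow mapping reference pixels toward the target frame.
    flow = cv2.calcOpticalFlowFarneback(ref_gray, tgt_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = ref_gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Naive nearest-neighbor splat: push every reference pixel along its flow
    # (collisions resolve arbitrarily; holes remain where nothing lands).
    xd = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    yd = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    coarse = np.zeros_like(ref_rgb)
    coarse[yd, xd] = ref_rgb[ys, xs]
    return coarse, tgt_rgb  # (model input, reconstruction target)
```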
By spatially rearranging the scene in a coarse way, we can quickly clean up the edit and make it photorealistic with Magic Fixup, which fixes up the global illumination, stitches edited pieces together, and adapts objects that are moved to regions with a different focus.
Inspired by ZoomShop (Liu et al. 2022), we edit the scene perspective by rendering regions at different depths with different focal lengths. Cleaning up such an edit required as much as 4 hours of manual work in ZoomShop; with Magic Fixup, it takes less than 5 seconds! The ZoomShop outputs are not directly comparable, since they do not use the coarse edit we create, but we show results taken directly from the ZoomShop paper for reference.
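As a rough illustration of this kind of coarse perspective edit (not ZoomShop's actual algorithm), one can split the image into depth layers and rescale each layer about the image center by its own factor. The depth bins and scale factors below are made-up placeholders, and the output is only the coarse edit that Magic Fixup then cleans up.

```python
import cv2
import numpy as np

def coarse_perspective_edit(img, disparity, bins=(0.33, 0.66), scales=(1.0, 1.2, 1.5)):
    """img: HxWx3 uint8; disparity: HxW in [0, 1], larger values = closer to the camera."""
    h, w = disparity.shape
    out = np.zeros_like(img)
    layer_ids = np.digitize(disparity, bins)   # 0 = far, 1 = mid, 2 = near
    for layer, s in enumerate(scales):         # composite far-to-near so near occludes far
        mask = (layer_ids == layer).astype(np.uint8)
        # Scale this depth layer about the image center to mimic a longer focal length.
        M = cv2.getRotationMatrix2D((w / 2, h / 2), 0, s)
        layer_img = cv2.warpAffine(img * mask[..., None], M, (w, h))
        layer_mask = cv2.warpAffine(mask, M, (w, h))
        out[layer_mask > 0] = layer_img[layer_mask > 0]
    return out
```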
Although the model never saw color edits during training, we find that by coloring objects with a partial-opacity brush, we can get a much cleaner edit through Magic Fixup. Here we show multiple samples to highlight the diverse generations we get from Magic Fixup.
Since our model only expects the original and edited images as input, we can do our editing directly in Photoshop and take advantage of all the editing tools it provides! Here we use Puppet Warp to repose a bonsai and to finally give Lex a big smile.
Although we filter out the majority of non-photorealistic videos from our training data, we find that Magic Fixup still generalizes to new domains. By providing the reference image to the model through the detail extraction network, we can preserve the global image identity.
We created a user interface we call the Collage Transform, where users can simply segment objects (or parts) and then apply affine transformations to, duplicate, or delete the segments: a simplified yet expressive editing interface. Here we show an example of using our interface (we speed up the segmentation and sampling time for brevity); a sketch of the underlying compositing step follows below.
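To make the compositing step concrete, here is a hypothetical sketch of what a single Collage Transform operation could produce: the user picks a segment mask and a 2x3 affine matrix, and the transformed cutout is pasted over the image, optionally leaving a hole at the source location for the model to fill in. The function name, mask source, and example matrix are illustrative, not the interface's actual code.

```python
import cv2
import numpy as np

def apply_segment_affine(img, mask, M, delete_source=True):
    """Move one segment with an affine transform to build a coarse edit.
    img: HxWx3 uint8, mask: HxW (nonzero inside the segment), M: 2x3 affine matrix."""
    h, w = mask.shape
    edit = img.copy()
    if delete_source:
        edit[mask > 0] = 0  # leave a hole where the segment used to be
    cutout = cv2.warpAffine(img * (mask[..., None] > 0), M, (w, h))
    moved_mask = cv2.warpAffine((mask > 0).astype(np.uint8), M, (w, h))
    edit[moved_mask > 0] = cutout[moved_mask > 0]
    return edit

# Example: shift a segment 40 px to the right and scale it up by 10%.
# M = np.float32([[1.1, 0.0, 40.0], [0.0, 1.1, 0.0]])
# coarse = apply_segment_affine(image, segment_mask, M)
```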
Here we compare our method against the text-based editing methods InstructPix2Pix (Brooks et al. 2023) and MasaCtrl (Cao et al. 2023). We make a best effort to describe the user edits as text prompts, but we stress that text is strictly less expressive than editing the image directly. Note that for methods that rely on DDIM inversion (like MasaCtrl below), the reconstruction can differ significantly from the input image because DDIM inversion is not always reliable (as shown in the fox example). Our method, in contrast, takes the edited image directly, so it is consistently faithful to the input and faster to run. More comparison results here.
By augmenting our user interface to keep track of dense correspondences, we can generate the drag handles needed by DragDiffusion (Shi et al. 2024) and the dense flow needed by Motion Guidance (Geng et al. 2024); the conversion is sketched below.
However, we find that both of these SoTA methods are unable to handle complex reposing scenarios.
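The conversion from tracked segment motion to baseline inputs can be sketched as follows, assuming each edited segment carries the 2x3 affine transform the user applied to it. The handle-sampling strategy and variable names are illustrative rather than the exact code used for the comparisons.

```python
import numpy as np

def segment_to_baseline_inputs(mask, M, n_handles=4):
    """mask: HxW boolean segment; M: 2x3 affine applied by the user edit."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(np.float32)  # N x 3 homogeneous
    moved = pts @ M.T                                                      # N x 2 destinations
    # Dense flow: displacement of every segment pixel under the affine transform.
    flow = np.zeros((*mask.shape, 2), dtype=np.float32)
    flow[ys, xs] = moved - pts[:, :2]
    # Drag handles: a few (source, target) point pairs sampled inside the segment.
    idx = np.linspace(0, len(pts) - 1, n_handles).astype(int)
    handles = [(tuple(pts[i, :2]), tuple(moved[i])) for i in idx]
    return handles, flow
```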
What happens if we pass in a reference image that is completely unrelated to the user edit? Here we attempt to understand what roles the reference and the edited image each play in the generated image. We see that the reference image influences the global style, with some influence on the details, but the content is driven by the provided edit. The detail extraction network thus essentially passes along the global appearance of the reference, which explains why the model can handle domains it has not seen during training (like cartoons and sketches, as shown above).
During training, to align the reference and target frames from a video, we use two motion models: (1) flow-based warping, where we forward-warp the reference, and (2) coarse affine transformations, where we segment everything in the image and estimate the best affine transform per segment to align it with the target. We show that using both types of transformation is essential: the flow model helps us preserve details, while the coarse affine transformations help us harmonize misaligned objects and synthesize more content when needed. A sketch of the per-segment affine fitting is shown below.
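A minimal sketch of the per-segment affine fitting, assuming a dense flow field between the reference and target is already available: for each segment, we fit the affine transform that best explains the flow of its pixels. Here this is done with OpenCV's estimateAffine2D under RANSAC; the paper's exact fitting procedure may differ.

```python
import cv2
import numpy as np

def fit_segment_affine(mask, flow):
    """Return a 2x3 affine mapping segment pixels in the reference toward the target,
    or None if the fit fails. mask: HxW boolean; flow: HxWx2 reference-to-target flow."""
    ys, xs = np.nonzero(mask)
    src = np.stack([xs, ys], axis=1).astype(np.float32)
    dst = src + flow[ys, xs]  # where each segment pixel lands in the target frame
    M, _inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
    return M
```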
Diffusion models struggle to generate images with a wide dynamic range when starting from pure noise, due to a misalignment between training and inference: during training, the model always sees a noisy version of the ground truth (with a variable noise level), but it never denoises starting from pure noise. To address this misalignment, we start the denoising process from a very noisy version of the user edit (instead of starting at step T with pure noise, we start from step T-1).
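A sketch of this inference-time change, assuming a standard DDPM-style parameterization with cumulative alphas; the single-step reverse update denoise_step is a placeholder, and the variable names are illustrative rather than the actual sampling code.

```python
import torch

@torch.no_grad()
def sample_from_edit(model, edit_latent, alphas_cumprod, denoise_step, T):
    """Start sampling from a heavily noised version of the coarse edit, not pure noise."""
    t = T - 1                                   # start one step below pure noise
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(edit_latent)
    # Noise the (latent of the) coarse edit to the level the model saw at step T-1.
    x = a_bar.sqrt() * edit_latent + (1 - a_bar).sqrt() * noise
    for step in range(t, -1, -1):               # usual reverse process from T-1 down to 0
        x = denoise_step(model, x, step)        # one reverse-diffusion update (placeholder)
    return x
```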
Since we train the model to spatially rearrange the reference image, it does not preserve the identity of newly inserted objects. Similar to what we show in the style transfer section above, the model stylizes inserted objects in the style of the original image rather than preserving their identity.
@misc{alzayer2024magicfixup,
title={Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos},
author={Hadi Alzayer and Zhihao Xia and Xuaner Zhang and Eli Shechtman and Jia-Bin Huang and Michael Gharbi},
year={2024},
eprint={2403.13044},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2403.13044},
}
Template from Nerfies