Magic Fixup: Streamlining Photo Editing
by Watching Dynamic Videos

Hadi AlZayer1, 2     Zhihao Xia1     Xuaner (Cecilia) Zhang1     Eli Shechtman1     Jia-Bin Huang2     Michael Gharbi1
1Adobe             2University of Maryland

We enable users to edit images with a simple cut-and-paste-like approach, and fix up those edits automatically.


Abstract

We propose a generative model that, given a coarsely edited image, synthesizes a photorealistic output that follows the prescribed layout. Our method transfers fine details from the original image and preserves the identity of its parts, yet adapts them to the lighting and context defined by the new layout. Our key insight is that videos are a powerful source of supervision for this task: objects and camera motions provide many observations of how the world changes with viewpoint, lighting, and physical interactions. We construct an image dataset in which each sample is a pair of source and target frames extracted from the same video at randomly chosen time intervals. We warp the source frame toward the target using two motion models that mimic the expected test-time user edits. We supervise our model to translate the warped image into the ground truth, starting from a pretrained diffusion model. Our model design explicitly enables fine detail transfer from the source frame to the generated image, while closely following the user-specified layout. We show that by using simple segmentations and coarse 2D manipulations, we can synthesize a photorealistic edit faithful to the user's input while addressing second-order effects like harmonizing the lighting and physical interactions between edited objects.

High-level approach

Videos carry useful information on how objects deform and interact in the real world. Leveraging that insight, we use video datasets to supervise photo editing as follows: for each video, we sample a reference frame and a target frame. We then warp the reference frame (using a combination of optical-flow warping and affine transformations) so that it aligns with the target, producing a "coarse" edit. Finally, we train a diffusion-based model to clean up the coarse edit, computing the reconstruction loss against the target frame as ground truth.
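To make the pairing procedure concrete, below is a minimal sketch of how one (reference, coarse edit, target) training triplet could be assembled. The helper names (compute_flow, forward_splat, make_training_pair), the nearest-neighbor splatting, and the sampling ranges are illustrative assumptions rather than the released implementation; compute_flow stands in for any off-the-shelf optical-flow estimator (e.g. RAFT).

# Sketch: build one (coarse edit, target) training pair from a video clip.
# Assumptions (not the released code): `frames` is a (T, 3, H, W) float tensor in [0, 1],
# and `compute_flow` is any off-the-shelf optical-flow estimator returning (2, H, W) motion.
import torch

def forward_splat(src, flow):
    """Nearest-neighbor forward warping: push each source pixel along its flow vector."""
    _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=src.device),
                            torch.arange(w, device=src.device), indexing="ij")
    xt = (xs + flow[0]).round().long().clamp(0, w - 1)
    yt = (ys + flow[1]).round().long().clamp(0, h - 1)
    out = torch.zeros_like(src)
    out[:, yt.reshape(-1), xt.reshape(-1)] = src[:, ys.reshape(-1), xs.reshape(-1)]
    return out

def make_training_pair(frames, compute_flow, max_gap=30):
    t_ref = torch.randint(0, len(frames) - 1, (1,)).item()
    t_tgt = min(t_ref + torch.randint(1, max_gap + 1, (1,)).item(), len(frames) - 1)
    reference, target = frames[t_ref], frames[t_tgt]
    flow = compute_flow(reference, target)        # reference -> target motion
    coarse_edit = forward_splat(reference, flow)  # mimics a rough user rearrangement
    return reference, coarse_edit, target

The diffusion model takes the reference and the coarse edit as input and is supervised with a standard denoising reconstruction loss against the target; the per-segment affine variant of the warping is sketched in the motion model ablation below.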


Spatial recomposition results

By coarsely rearranging the scene spatially, we can quickly clean up the edit and make it photorealistic through Magic Fixup: it fixes the global illumination, connects the edited pieces together, and handles objects moved into regions with a different focus.



Misc. editing

Perspective editing

Inspired by ZoomShop (Liu et al. 2022), we edit the scene perspective by rendering regions at different depths with different focal lengths. To clean up the edit, ZoomShop required as much as 4 hours of manual editing; with Magic Fixup, this can be done in less than 5 seconds! The ZoomShop outputs are not directly comparable, since they do not use the coarse edit we create; they are results taken directly from the ZoomShop paper for reference.
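As a rough illustration of the kind of coarse perspective edit described above, the sketch below rescales depth layers by different amounts before compositing them back-to-front. The depth map, layer thresholds, scale factors, and function names are arbitrary assumptions; ZoomShop's actual camera model is more principled.

# Sketch: coarse perspective edit by rescaling depth layers with different "focal lengths".
# Assumptions: `image` is (H, W, 3) float32 in [0, 1], `depth` is an (H, W) depth map,
# and the layer thresholds / scales below are illustrative choices.
import numpy as np
import cv2

def rescale_layer(image, mask, scale):
    h, w = mask.shape
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), 0, scale)  # scale about the image center
    rgb = cv2.warpAffine(image * mask[..., None], M, (w, h))
    alpha = cv2.warpAffine(mask, M, (w, h))
    return rgb, alpha

def coarse_perspective_edit(image, depth,
                            layers=((20.0, 1e9, 1.8), (5.0, 20.0, 1.3), (0.0, 5.0, 1.0))):
    out = np.zeros_like(image)
    for near, far, scale in layers:  # composite back-to-front (farthest layer first)
        mask = ((depth >= near) & (depth < far)).astype(np.float32)
        rgb, alpha = rescale_layer(image, mask, scale)
        out = rgb + (1.0 - alpha[..., None]) * out
    return out  # seams and disocclusion holes are left for Magic Fixup to clean up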



Colorization

Although the model was never trained on color edits, we find that by coloring objects with a partial-opacity brush, we can obtain a much cleaner edit through Magic Fixup. Here we show multiple samples to highlight the diversity of generations we get from Magic Fixup.
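For reference, the partial-opacity brush stroke itself amounts to simple alpha blending over the object mask; the mask, color, and opacity below are arbitrary illustrative choices, not part of our method.

# Sketch: paint a partial-opacity color over a masked object to form the coarse colorization edit.
# Assumptions: `image` is (H, W, 3) float in [0, 1], `mask` is (H, W) in {0, 1}.
import numpy as np

def brush_color(image, mask, color=(0.9, 0.3, 0.2), opacity=0.5):
    overlay = np.asarray(color, dtype=image.dtype)[None, None, :]
    alpha = opacity * mask[..., None]
    return (1.0 - alpha) * image + alpha * overlay  # Magic Fixup then cleans up the flat tint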



Fixing up Photoshop edits

Since our model only expects the original and edited images as input, we can do our editing directly in Photoshop and take advantage of all the editing tools it provides! Here we use Puppet Warp to repose a bonsai and finally give Lex a big smile.


Beyond real photos

While we filter out the majority of non-photorealistic videos in our training data, we notice that Magic Fixup can still generalize to new domains. By providing the reference image to the model through the detail extraction network, we can preserve the global image identity.



User interface demo

We created a user interface, which we call the Collage Transform, where users can simply segment objects (or parts) and then apply affine transformations to, duplicate, or delete the segments: a simplified yet expressive editing interface. Here we show an example of using our interface (we speed up the segmentation and sampling time for brevity).
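A minimal sketch of how the segment edits from such an interface could be composited into the coarse input for Magic Fixup; the (mask, affine, remove) representation of an edit is an assumption about the interface, not its actual implementation.

# Sketch: composite user-chosen affine edits of segments into a coarse edit image.
# Assumptions: `image` is (H, W, 3) float in [0, 1]; each edit is (mask, 2x3 affine, remove)
# with masks from any segmentation tool. A duplicate uses remove=False; a deletion can be
# expressed as remove=True with an affine that moves the segment off-canvas.
import numpy as np
import cv2

def collage_transform(image, edits):
    h, w = image.shape[:2]
    canvas = image.copy()
    for mask, affine, remove in edits:
        if remove:  # a moved segment leaves a hole at its original location
            canvas = canvas * (1.0 - mask[..., None])
        rgb = cv2.warpAffine(image * mask[..., None], affine, (w, h))
        alpha = cv2.warpAffine(mask.astype(np.float32), affine, (w, h))[..., None]
        canvas = rgb + (1.0 - alpha) * canvas  # "over" compositing of the transformed segment
    return canvas  # holes and seams are left for Magic Fixup to resolve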


Comparison with baselines

Comparison with text-based methods

Here we compare our method against the text-based editing methods InstructPix2Pix (Brooks et al. 2023) and Masa-ctrl (Cao et al. 2023). We make our best effort to describe the user edits with text prompts, but we stress that text is strictly less expressive than simply editing the image directly. Note that for methods relying on DDIM inversion (like Masa-ctrl below), the reconstruction can differ significantly from the input image because DDIM inversion is not always reliable (as shown in the fox example). In contrast, our method takes the edited image directly, so it is consistently faithful to the input and faster to run. More comparison results here.






Comparison with reposing methods

By augmenting our user interface to keep track of dense correspondences, we can generate the dragging key-handles for DragDiffusion (Shi et al. 2024) and the dense flow needed for Motion Guidance (Geng et al. 2024). However, we find that both of these state-of-the-art methods struggle with complex reposing scenarios.
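For reference, here is a sketch of how the tracked correspondences of a single segment edit could be converted into the inputs these baselines expect; the (mask, affine) representation and the handle count are assumptions for illustration.

# Sketch: derive baseline inputs (dense flow, drag handles) from one segment edit.
# Assumptions: `mask` is (H, W) in {0, 1} and `affine` is the 2x3 matrix applied to that segment.
import numpy as np

def segment_flow_and_handles(mask, affine, num_handles=8):
    affine = np.asarray(affine, dtype=np.float64)
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(np.float64)  # (N, 3) homogeneous
    moved = pts @ affine.T                                                 # (N, 2) new positions
    flow = np.zeros(mask.shape + (2,), dtype=np.float32)
    flow[ys, xs] = (moved - pts[:, :2]).astype(np.float32)  # dense flow for Motion Guidance
    idx = np.linspace(0, len(xs) - 1, num_handles).astype(int)
    handles = list(zip(pts[idx, :2].tolist(), moved[idx].tolist()))  # (src, dst) pairs for DragDiffusion
    return flow, handles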



Ablations

Style transfer effect

What happens if we pass in a reference image that is completely unrelated to the user edit? Here we examine how the reference and the edited image each contribute to the generated output. We see that the reference image influences the global style, with some influence on the details, while the content is driven by the provided edit. The detail extraction network thus essentially passes along the global appearance of the reference, which explains why the model can handle domains it has not seen during training (like the cartoons and sketches shown above).






Motion model ablation

During training, to align the reference and target frames from a video, we use two motion models: (1) flow-based warping, where we forward-warp the reference using optical flow, and (2) coarse affine transformations, where we segment everything in the image and estimate the best affine transform per segment to align it with the target. We show that using both types of transformation is essential: the flow model helps preserve details, while the coarse affine transformations help harmonize misaligned objects and synthesize more content when needed.
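A minimal sketch of the per-segment affine fit, assuming correspondences are taken from the estimated optical flow inside each mask; the function and variable names are illustrative, not the released code.

# Sketch: estimate a best-fit 2x3 affine transform per segment from flow correspondences.
# Assumptions: `flow` is (H, W, 2) reference->target motion and `masks` is a list of (H, W)
# {0, 1} segment masks (e.g. from an automatic "segment everything" model).
import numpy as np

def per_segment_affines(flow, masks):
    affines = []
    for mask in masks:
        ys, xs = np.nonzero(mask)
        src = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(np.float64)  # (N, 3)
        dst = src[:, :2] + flow[ys, xs]                                        # (N, 2) matched points
        A, *_ = np.linalg.lstsq(src, dst, rcond=None)                          # least-squares fit
        affines.append(A.T)                                                    # 2x3 affine for this segment
    return affines

Each segment can then be warped with its estimated affine (as in the Collage Transform sketch above) to form the coarse training edit.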





Latent noise initialization

Diffusion models struggle to generate images with a wide dynamic range when starting from pure noise, due to a mismatch between training and inference: during training, the model always sees a noisy version of the ground truth (with a variable noise level), but it never denoises from pure noise. To address this mismatch, we start the denoising process from a very noisy version of the user edit (i.e., instead of starting at step T with pure noise, we start from step T-1).
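A sketch of this initialization in a generic latent-diffusion setup, assuming a diffusers-style VAE and scheduler API; the exact calls and names are assumptions, not the released code.

# Sketch: start sampling from a heavily noised encoding of the user edit instead of pure noise.
# Assumptions: a standard latent-diffusion setup with a diffusers-style VAE and scheduler.
import torch

def init_latents(vae, scheduler, edited_image, generator=None):
    # Encode the user edit into the latent space.
    latents = vae.encode(edited_image).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn(latents.shape, generator=generator,
                        device=latents.device, dtype=latents.dtype)
    # scheduler.set_timesteps(...) is assumed to have been called; timesteps[0] is step T,
    # so timesteps[1] is step T-1: very noisy, but it still carries the edit's low frequencies.
    t = scheduler.timesteps[1]
    return scheduler.add_noise(latents, noise, t)  # denoising proceeds from here instead of pure noise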



Limitations

Since we train the model to spatially edit the reference image, it does not preserve the identity of newly inserted objects. Similar to the style transfer effect shown above, the model stylizes inserted objects to match the original image rather than preserving their identity well.




BibTeX

@inproceedings{alzayer2024magicfixup,
      title={Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos},
      author={Alzayer, Hadi and Xia, Zhihao and Zhang, Xuaner and Shechtman, Eli and Huang, Jia-Bin and Gharbi, Michael},
      booktitle={arXiv},
      pages={--},
      year={2024}
    }

Template from Nerfies