VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models


Hyeonho Jeong* Geon Yeong Park* Jong Chul Ye
*Indicates Equal Contribution

BISPL, Korea Advanced Institute of Science and Technology
CVPR 2024

Video Motion Customization refers to the task of adapting pre-trained video generative models to create videos that feature a particular motion across various visual contexts and scenes. The goal is to retain the original motion patterns from an input video while presenting them in diverse visual settings. For example, given a video depicting sharks swimming, VMC aims to generate videos that follow the same motion as the sharks but in entirely distinct scenarios, such as airplanes flying in the sky or spaceships navigating in space.


Input Video of Motion M
Goldfish + M + In the ocean
Iron Sharks + M + In the sky

Airplanes + M + In the sky
Spaceships + M + In space

Abstract

Text-to-video diffusion models have advanced video generation significantly. However, customizing these models to generate videos with tailored motions presents a substantial challenge. Specifically, they encounter hurdles in (1) accurately reproducing motion from a target video, and (2) creating diverse visual variations. For example, straightforward extensions of static image customization methods to video often lead to intricate entanglements of appearance and motion data. To tackle this, we present the Video Motion Customization (VMC) framework, a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective using residual vectors between consecutive noisy latent frames as a motion reference. The diffusion process then preserves low-frequency motion trajectories while mitigating high-frequency motion-unrelated noise in image space. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.



Overview



The proposed Video Motion Customization (VMC) framework distills the motion trajectories from the residual between consecutive (latent) frames, namely the motion vector $\delta v_t^n$ for $t \geq 0$. Specifically, we fine-tune only the temporal attention layers of the key-frame generation model by aligning the ground-truth and predicted motion vectors. After training, the customized key-frame generator is used for target motion-driven video generation with new appearance contexts, e.g., "A chicken is walking in a city".
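To make the idea of aligning motion vectors concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each latent frame is flattened into a plain list of floats, takes the motion vector to be the difference between frame n+1 and frame n, and uses a cosine distance as the alignment measure. The cosine choice and the function names `frame_residuals`, `cosine_distance`, and `motion_alignment_loss` are illustrative assumptions, not details stated on this page.

```python
import math


def frame_residuals(frames):
    """Residuals between consecutive frames: frames[n + 1] - frames[n]."""
    return [
        [b - a for a, b in zip(cur, nxt)]
        for cur, nxt in zip(frames, frames[1:])
    ]


def cosine_distance(u, v, eps=1e-8):
    """1 - cosine similarity; eps guards against zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / max(norm_u * norm_v, eps)


def motion_alignment_loss(target_frames, predicted_frames):
    """Mean cosine distance between target and predicted frame residuals.

    Assumption: cosine distance stands in for the alignment objective.
    """
    target = frame_residuals(target_frames)
    predicted = frame_residuals(predicted_frames)
    if not target or len(target) != len(predicted):
        raise ValueError("need at least two frames and equal frame counts")
    distances = [cosine_distance(t, p) for t, p in zip(target, predicted)]
    return sum(distances) / len(distances)


# Toy "videos" with three frames of two values each (hypothetical data).
target = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
# Same motion, different appearance: a constant offset on every frame.
shifted = [[5.0, 5.0], [6.0, 5.0], [7.0, 5.0]]
# Opposite motion: the same frames played backwards.
backward = target[::-1]

same_motion_loss = motion_alignment_loss(target, shifted)
opposite_motion_loss = motion_alignment_loss(target, backward)
```

In this toy setting a constant offset added to every frame cancels in the residuals, so the loss for `shifted` is zero even though the frame values differ. That is the intuition for why a residual-based objective can follow motion while leaving appearance free to change.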


Video Motion Customization Results

from Low-resolution & Short Input Video to High-resolution, Long Output Video



Motion of Cars

Input Video of Motion M
Tank + M + On the snow
Police car + M + In a town
Lamborghini + M + In desert
Lamborghini + M + In space
Car + M + Underwater

Input Video of Motion M
Tank + M + On the road
Car + M + In desert

Input Video of Motion M
Tank + M + On the road
Horse + M + On the road
Car + M + On the ice
Bus + M + On the ice

Input Video of Motion M
Elephant + M + In Africa
Bus + M + In a town
Ferrari + M



Motion of Airplanes

Input Video of Motion M
Spaceship + M + In space
Satellite + M
Shark + M + Under the sea

Input Video of Motion M
Spaceship + M
Satellite + M
Shark + M + In the ocean



Motion of Birds (flying)

Input Video of Motion M
Phoenix + M + Above lava
Majestic dragon + M + In forest



Motion of Birds (taking off)

Input Video of Motion M
Owl + M + From a cliff
Eagle + M + On the stone
Seagull + M + On the sand



Motion of Birds (walking)

Input Video of Motion M
Chicken + M + On the road
Chicken + M + In a city
Seagull + M + Underwater

Input Video of Motion M
Eagle + M + On edge
Duck + M + Around a pond
Owl + M + In the forest



Motion of Birds (floating)

Input Video of Motion M
Puppy + M + On the water
Turtle + M + On water
Boat + M + On water
Balloon + M + On water



Motion of Humans

Input Video of Motion M
Spider-man + M
Astronaut + M + underwater in deep sea

Input Video of Motion M
M + In the desert
M + On the ice
M + motorbike
Monkey + M



Motion of Plants

Input Video of Motion M
Rose + M
Starfish + M
Starfish + M + Chinese watercolor style



Motion of Diffusion

Input Video of Motion M
Ink + M + In water
Jellyfish + M
Flower + M



Motion of Fall

Input Video of Motion M
Gem stones + M
Colorful confetti + M
Bubbles + M
Feathers + M
Stars + M
Snow + M
Stones + M + Underwater



Motion of Space

Input Video of Motion M
M + In deep water
Stars + M + In sky



Motion of Huge animals

Input Video of Motion M
Tiger + M
Bulldog + M + Around flowers

Input Video of Motion M
Origami dinosaur + M
Pigeon + M + In the style of oil art

Input Video of Motion M
Bear + M + In a bamboo grove
Bear + M + On the pond
Lion + M + On the pond

Input Video of Motion M
Panda + M
Panda + M + On the snow
Tiger + M + On the snow



Motion of Small animals

Input Video of Motion M
Dog + M + On the grass
White fox + M + On the beach
Wolf + M + In the flowers

Input Video of Motion M
Puppy + M
Tiger + M
Tiger + M + On the grass
Wolf + M + On the snow


Input Video of Motion M
Dog + M + Watermelon
Fox + M + Watermelon
Rabbit + M + Watermelon + On the grass
Rabbit + M + Watermelon + On the sand
Squirrel + M + Watermelon
Squirrel + M + Orange
Squirrel + M + Watermelon + On the grass
Squirrel + M + Watermelon + On the Sand
(Train M for 300 Steps)
Squirrel + M + Watermelon + On the Sand
(Train M for 400 Steps)


Comparisons to Baselines

VMC is compared against 4 state-of-the-art baselines
⇨   VideoComposer,   Gen-1,   Tune A Video,   Control A Video



A car is moving.   ⇨   A car is moving, underwater.

Input Video

VideoComposer

Gen-1

Ours
Tune A Video
Control A Video


Two sharks are moving.   ⇨   Two airplanes are moving in the sky.

Input Video

VideoComposer

Gen-1

Ours

Tune A Video

Control A Video


An owl is taking off.   ⇨   A seagull is taking off on the sand.

Input Video

VideoComposer

Gen-1

Ours

Tune A Video

Control A Video


Ink is spreading.   ⇨   Flower is spreading.

Input Video

VideoComposer

Gen-1

Ours

Tune A Video

Control A Video


Video Style Transfer

Input Video
Starry Night by Vincent Van Gogh.
Input Video
Oil painting of flowers.
Input Video
Classic anime from 1990.
Input Video
Starry Night by Vincent Van Gogh.


Backward Motion Customization

We customize a video diffusion model to learn extremely improbable motions:
backward motions from reversed real-world videos.
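As a small illustration of what reversing a video does to its motion, the sketch below reuses the residual-between-consecutive-frames definition from the overview on toy data. It assumes frames are flat lists of floats, and the helper names `frame_residuals` and `reverse_video` are hypothetical. Under that definition, playing the frames backwards yields the original residuals negated and in reverse order.

```python
def frame_residuals(frames):
    """Residuals between consecutive frames: frames[n + 1] - frames[n]."""
    return [
        [b - a for a, b in zip(cur, nxt)]
        for cur, nxt in zip(frames, frames[1:])
    ]


def reverse_video(frames):
    """Return the frames in reverse temporal order (a new list)."""
    return frames[::-1]


# Toy one-value-per-frame "video" (hypothetical data).
forward = [[0.0], [1.0], [3.0]]
backward = reverse_video(forward)

forward_motion = frame_residuals(forward)
backward_motion = frame_residuals(backward)

# Negate each forward residual and reverse their order.
negated_reversed = [[-x for x in r] for r in forward_motion[::-1]]
```

Here `backward_motion` and `negated_reversed` are equal, which is one way to see why a reversed clip is a distinct motion target rather than a re-ordering of the same one.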


Real-world Video with Motion M
Reversed Video with Motion M⁻¹
Ink + M⁻¹ + In water
Jellyfish + M⁻¹
Real-world Video with Motion M
Reversed Video with Motion M⁻¹
Car + M⁻¹ + In desert
Tank + M⁻¹ + On the road
Real-world Video with Motion M

Reversed Video with Motion M⁻¹

Lamborghini + M⁻¹ + In space

Car + M⁻¹ + Underwater

Real-world Video with Motion M


Reversed Video with Motion M⁻¹


Eagle + M⁻¹ + On edge


Duck + M⁻¹ + Around a pond


BibTeX

@article{jeong2023vmc,
  title={VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models},
  author={Jeong, Hyeonho and Park, Geon Yeong and Ye, Jong Chul},
  journal={arXiv preprint arXiv:2312.00845},
  year={2023}
}