
FramePack: A Novel “Frame-Packing” Approach to Video Generation

Published on May 1, 2025

Introduction

With video generation, we typically observe two problems: forgetting and drifting. Forgetting manifests as the model struggling to retain earlier context, leading to a fractured narrative. Drifting, also known as exposure bias, is an insidious process in which visual quality degrades over time as errors accumulate.

The dilemma is that adjusting for one often makes the other worse. Improving memory to tackle forgetting can accelerate error accumulation: initial errors in individual frames persist and compound across subsequent frames, leading to more drifting. Conversely, disrupting error propagation to reduce drifting can weaken temporal dependencies, exacerbating forgetting.

To address these two obstacles, researchers from Stanford University developed FramePack, a memory-aware approach designed to handle the intensive computational demands of video generation more efficiently.

Prerequisites

This article begins with an overview of the model and ends with a code implementation on DigitalOcean GPU Droplets. To follow the overview section, familiarity with deep learning fundamentals and with how video generation models like Wan2.1 and HunyuanVideo work will be helpful. Feel free to skip to the implementation section if you’re only interested in running the model.

In the implementation, DigitalOcean GPU Droplets (H100s) will be used to launch a Gradio demo of FramePack. The demo lets users upload an image and provide a descriptive prompt specifying the actions, movements, or transformations to apply; the model then generates a video from that image. Note that this next-frame-section prediction model can be sensitive to differences in noise and hardware, so results may vary slightly between devices.

Overcoming Forgetting

The FramePack architecture tackles “forgetting” by compressing input frames according to their relative significance. In other words, newer frames are compressed the least.

This approach ensures that the total transformer context length remains within a fixed limit, irrespective of the video’s duration. As a result, the model can encode a much larger number of frames without escalating computational demands, thus promoting better memory retention.
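To see why the context stays bounded, consider a minimal sketch (our own illustration, not the authors’ code) in which the newest frame costs 1,536 tokens, as in the 480p HunyuanVideo example discussed in the next section, and each older frame is compressed to half the tokens of the frame after it:

# Illustration only: a geometric compression schedule bounds the total context.
# Assumptions: the newest frame costs 1536 tokens and each older frame gets
# half the tokens of the frame after it (a compression factor of 2 per step).
def total_context_tokens(num_frames, newest_frame_tokens=1536, compression=2):
    total = 0
    for i in range(num_frames):
        total += newest_frame_tokens // (compression ** i)  # very old frames round to 0
    return total

print(total_context_tokens(16))    # 3070
print(total_context_tokens(1000))  # still 3070: the series converges

Regardless of how many frames are fed in, the context stays near twice the cost of the newest frame, which is what lets the model keep a long history within a fixed compute budget.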

Optimizing GPU Memory Layout for Frames with Patchifying Kernels

Consider the GPU memory layout of frames, where each frame is encoded with a patchifying kernel: F0, the most recent frame, receives the least compression, while F1 and earlier frames are compressed progressively more.

A patchifying kernel reduces the dimensionality of the input data by breaking it down into smaller patches. A kernel of size (pf, ph, pw) divides the input frame into patches of size pf × ph × pw along the temporal, height, and width dimensions, respectively. For instance, a 480p frame in HunyuanVideo is typically represented by 1,536 tokens with a (1, 2, 2) patchifying kernel; switching to a (2, 4, 4) kernel reduces this to 192 tokens per frame (1536 / 2³). This makes the data easier to process with transformer models, which handle sequences of patches more efficiently than entire frames.
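To make that arithmetic concrete, here is a small sketch (our own illustration, not FramePack’s code) that computes the per-frame token count for a given kernel; the latent size of 64 × 96 is an assumption chosen because it reproduces the 1,536-token figure quoted above for the (1, 2, 2) kernel:

# Tokens per frame = ceil(T/pf) * ceil(H/ph) * ceil(W/pw) over the latent grid,
# amortized across the frames in the chunk. The latent size (64, 96) is assumed.
from math import ceil

def tokens_per_frame(num_latent_frames, latent_hw, kernel):
    h, w = latent_hw
    pf, ph, pw = kernel
    total = ceil(num_latent_frames / pf) * ceil(h / ph) * ceil(w / pw)
    return total / num_latent_frames

print(tokens_per_frame(4, (64, 96), (1, 2, 2)))  # 1536.0 tokens per frame
print(tokens_per_frame(4, (64, 96), (2, 4, 4)))  # 192.0 tokens per frame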

The authors tested a number of different FramePack variants, each applying a different compression schedule to the input frames.

Overcoming Drifting

The researchers addressed drifting in next-frame prediction models by introducing bi-directional sampling techniques. They observed that drifting only occurs under causal (vanilla) sampling, where the model has access only to past frames, and that providing access to even a single future frame eliminates it. They therefore proposed two sampling methods:

  1. They modified the vanilla sampling method to a bi-directional method where the first iteration simultaneously generates both beginning and ending sections of the video.
  2. They developed a variant that inverts the sampling order, which is particularly effective for image-to-video generation as it treats the user input as a high-quality first frame and continuously refines generations to approximate it.

Additionally, the researchers implemented necessary modifications to RoPE (Rotary Position Embedding) to support non-consecutive phases in the time dimension, allowing the model to skip non-queried frames.
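The exact RoPE changes live in the FramePack codebase; as a simplified illustration of the underlying idea (computing rotary phases directly from explicit, possibly non-consecutive time indices instead of an implicit 0, 1, 2, … range), consider the following sketch:

# Simplified 1-D rotary-embedding sketch: phases are computed from whatever
# time indices we pass in, so skipped (non-queried) frames simply never appear
# in the index list. This is an illustration, not the FramePack implementation.
import torch

def rope_phases(time_indices, dim, base=10000.0):
    # time_indices: 1-D tensor of (possibly non-consecutive) frame positions
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(time_indices.float(), freqs)  # (num_positions, dim/2)
    return torch.cos(angles), torch.sin(angles)

# Consecutive positions vs. a non-consecutive, compressed history:
cos_a, sin_a = rope_phases(torch.tensor([0, 1, 2, 3]), dim=64)
cos_b, sin_b = rope_phases(torch.tensor([0, 1, 2, 4, 8, 16]), dim=64)
print(cos_a.shape, cos_b.shape)  # torch.Size([4, 32]) torch.Size([6, 32])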


These approaches prevent drifting by establishing the ending frames early and having all subsequent generations approximate them, maintaining video quality even as the video length increases.
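As a rough sketch of these orderings (our own simplification, not the authors’ scheduler), the following prints which temporal sections of a video would be generated at each step under vanilla sampling, the bi-directional anti-drifting variant, and the inverted order used for image-to-video:

# Illustrative generation orders over 5 temporal sections of a video.
def vanilla_order(n):
    # step i generates section i, conditioning only on the past (prone to drifting)
    return [[i] for i in range(n)]

def anti_drifting_order(n):
    # first step generates both the beginning and the ending section;
    # later steps fill in the middle, always with a future anchor available
    return [[0, n - 1]] + [[i] for i in range(1, n - 1)]

def inverted_anti_drifting_order(n):
    # image-to-video: treat the user image as section 0 and generate the rest
    # in reverse time order, refining toward that high-quality first frame
    return [[i] for i in range(n - 1, 0, -1)]

print(vanilla_order(5))                 # [[0], [1], [2], [3], [4]]
print(anti_drifting_order(5))           # [[0, 4], [1], [2], [3]]
print(inverted_anti_drifting_order(5))  # [[4], [3], [2], [1]]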

Implementation Details

We will be running demo_gradio.py. The implementation uses HunyuanVideo as the base model, but this can be swapped out if desired.

After setting up a DigitalOcean GPU Droplet and opening your Web Console, implementing FramePack involves the following steps.

In the Web console:

Step 1: Install Pip and PyTorch

apt install python3-pip
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
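Optionally, you can confirm that the Droplet’s GPU is visible to PyTorch before continuing:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"

This should print the installed PyTorch version followed by True.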

Step 2: Clone the repository

git clone https://github.com/lllyasviel/FramePack
cd FramePack

Step 3: Install Requirements

pip3 install -r requirements.txt

Step 4: Run the Demo

python3 demo_gradio.py --share
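If you would rather not expose a public Gradio link, one alternative is to drop the --share flag and tunnel the demo’s port over SSH from your local machine instead (assuming the demo serves on Gradio’s default port 7860; replace your_droplet_ip with your Droplet’s address):

ssh -L 7860:localhost:7860 root@your_droplet_ip

You can then open http://localhost:7860 in your local browser.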

You will be presented with a public Gradio link that you can paste into your web browser to open the demo interface.


For our demo, we will use an image of Iron Man and generate a prompt with Claude 3.7 Sonnet.

We’re quite impressed with the speed (~1 min) and quality of this generation. The video captures the dynamic feeling of flight with the figure’s arms extended outward.

Conclusion

We discussed FramePack, an approach that leverages techniques such as progressive frame compression and smarter sampling of input frames to overcome common issues observed with video generation: forgetting and drifting. We’re big fans of clever optimizations, and so FramePack was a no-brainer for us to cover. We’d love to hear your take on the approach and how the implementation played out. Comment your thoughts below!

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.


About the author

Melani Maheswaran
