Video-ASTAR: Synthesizing Compositional Videos from Text Description

Prajwal Singh1, Kuldeep Kulkarni2, Shanmuganathan Raman1, Harsh Rangwani2

1CVIG Lab, IIT Gandhinagar, 2Adobe, Bangalore

Published at [WACV 2026]

[Paper]

[Code] (coming soon)

Abstract

Existing pre-trained text-to-video diffusion models can generate high-quality videos, but they often suffer from misalignment between the generated content and the input text, particularly when composing scenes with multiple objects. To tackle this issue, we propose a straightforward, training-free approach for compositional video generation from text. We introduce Video-ASTAR, which performs test-time aggregation and segregation of attention with a novel centroid loss to enhance alignment, enabling the generation of multiple objects in a scene while modeling their actions and interactions. Additionally, we extend our approach to the Multi-Action Video Generation setting, where only the specified action should vary across a sequence of prompts. To ensure coherent action transitions, we introduce a novel token-swapping and latent-interpolation strategy. Extensive experiments and ablation studies show that our method significantly outperforms baseline methods, generating videos with improved semantic and compositional consistency as well as improved temporal coherence.

Overview

The following are the contributions of our work toward improving text-based video synthesis:

Video-ASTAR

Loss Functions. The figure shows how each loss operates on the attention map generated for each token.

Token Visualization. The figure shows the cross-attention map between text and video frame pooled from a downsampling layer in a spatial transformer block of VideoCrafter2 [1].
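
As an aside on how such per-token maps can be pooled for visualization, here is a minimal sketch assuming the cross-attention probabilities of one frame have been captured (e.g., via a forward hook) as a tensor of shape (heads, H*W, text_tokens); the shape and the helper name `pool_token_attention` are assumptions for illustration, not VideoCrafter2's actual internals.

```python
import torch
import torch.nn.functional as F

def pool_token_attention(attn_probs: torch.Tensor, token_idx: int,
                         spatial_hw, out_hw=(256, 256)) -> torch.Tensor:
    """attn_probs: (heads, H*W, text_tokens) softmaxed cross-attention for one frame.
    Returns an (out_h, out_w) map for the chosen text token, normalized to [0, 1]."""
    h, w = spatial_hw
    # Average over attention heads, then select the column of the token of interest.
    token_map = attn_probs.mean(dim=0)[:, token_idx].reshape(1, 1, h, w)
    # Upsample to display resolution and rescale for overlaying on the frame.
    token_map = F.interpolate(token_map, size=out_hw, mode="bilinear", align_corners=False)
    token_map = (token_map - token_map.min()) / (token_map.max() - token_map.min() + 1e-8)
    return token_map.squeeze()
```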

Effect of Optimization Method

(a) The effect of different optimization functions with VideoCrafter2 [1]; (b) the effect of different parameters on the Video-ASTAR loss.

We performed an ablation study to evaluate the effect of different optimization methods on the VideoCrafter2 (VC2) [1] pipeline. As shown in the figures above, for the given prompts the baseline VC2 fails to generate the object "bench." Applying the A&E loss [5] recovers the missing object for the first prompt, but it still exhibits alignment issues there and fails on the second prompt. In contrast, incorporating Video-ASTAR recovers the missing entities in both prompts, demonstrating better contextual understanding and consistent generation across video frames.
The Video-ASTAR loss combines two components: attention segregation and attention retention. Attention segregation reduces the overlap between the attention maps of different entities within a frame. Attention retention maintains token focus during denoising by applying a consistent attention mask. Panels (b–c) of the ablation figure show that fixed thresholding fails to capture all entities, while a mean-based threshold recovers them effectively. Panels (d–e) demonstrate that reusing a binary mask from the first frame across all frames helps preserve context, such as consistently representing a "big rock." In our final setup, we generate a binary mask using the mean threshold from the first frame and apply it across the video. We also introduce a centroid-based loss to improve compositional generation by enhancing entity interaction in the attention maps. The figure below compares different optimization methods on VideoCrafter2, showing that our approach better captures the entities described in the prompt. We provide further results and details in the supplement.
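
Before turning to that comparison, the following is a minimal PyTorch sketch of the segregation and retention terms. The tensor shapes (per-entity cross-attention of shape (entities, frames, H, W)), the helper names, and the exact normalization are illustrative assumptions; the precise formulation used in Video-ASTAR is given in the paper.

```python
import torch

def segregation_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """attn_maps: (entities, frames, H, W) per-entity cross-attention (assumed shape).
    Penalizes spatial overlap between every pair of entity attention maps."""
    n = attn_maps.shape[0]
    loss = attn_maps.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            inter = torch.minimum(attn_maps[i], attn_maps[j]).sum()
            union = torch.maximum(attn_maps[i], attn_maps[j]).sum()
            loss = loss + inter / (union + 1e-8)
    return loss

def mean_threshold_mask(first_frame_attn: torch.Tensor) -> torch.Tensor:
    """first_frame_attn: (entities, H, W). Binary mask from a mean-based threshold,
    computed on the first frame and reused for every subsequent frame."""
    thresh = first_frame_attn.flatten(1).mean(-1)[:, None, None]
    return (first_frame_attn > thresh).float()

def retention_loss(attn_maps: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """masks: (entities, H, W) first-frame masks. Encourages each entity's attention
    in every frame to stay inside its retained (masked) region."""
    kept = (attn_maps * masks[:, None]).flatten(2).sum(-1)   # (entities, frames)
    total = attn_maps.flatten(2).sum(-1) + 1e-8              # (entities, frames)
    return (1.0 - kept / total).mean()
```

In a test-time-guidance setup, such terms would be evaluated on the cross-attention maps at selected denoising steps, e.g. `loss = segregation_loss(A) + retention_loss(A, mean_threshold_mask(A[:, 0]))`, and used to update the latent before continuing denoising.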

This figure shows the effect of different methods when integrated with the VideoCrafter2 [1] text-to-video generation network.

In addition to attention segregation and retention, Video-ASTAR uses a centroid-based loss for compositional generation. The figure above shows videos generated for a given text prompt by different optimization methods on VideoCrafter2 [1]; our proposed optimization method helps generate the entities given in the text description.
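
The exact centroid-based term is defined in the paper; as a hedged illustration only, the sketch below computes a differentiable soft centroid of an entity's attention map, which could, for example, be penalized for drifting across frames to keep object attention spatially consistent. The function names and shapes are assumptions.

```python
import torch

def attention_centroid(attn: torch.Tensor) -> torch.Tensor:
    """attn: (frames, H, W) attention for one entity (assumed shape).
    Returns (frames, 2) soft (y, x) centroids, differentiable w.r.t. the attention."""
    _, h, w = attn.shape
    probs = attn.flatten(1) / (attn.flatten(1).sum(-1, keepdim=True) + 1e-8)
    ys = torch.arange(h, dtype=attn.dtype, device=attn.device)
    xs = torch.arange(w, dtype=attn.dtype, device=attn.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    cy = (probs * grid_y.flatten()).sum(-1)
    cx = (probs * grid_x.flatten()).sum(-1)
    return torch.stack([cy, cx], dim=-1)

def centroid_drift_loss(attn: torch.Tensor) -> torch.Tensor:
    """Penalizes frame-to-frame movement of the attention centroid (illustrative only)."""
    c = attention_centroid(attn)
    return (c[1:] - c[:-1]).pow(2).sum(-1).mean()
```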

Quantitative Comparison

Quantitative comparison with baseline methods using VBench [3] across three prompt categories, evaluated with three random seeds. We report the average over all runs.

As shown in the table above, the proposed optimization method achieves competitive performance across all VBench [3] dimensions and, in some cases, outperforms the baselines. Notably, the Multiple Objects category shows a significant improvement over the other methods. This gain can be attributed to the attention retention and centroid losses, which keep the attention maps active and the object attention spatially consistent.

Results

Multiple Objects

Methods compared: VideoCrafter2, VideoCrafter2+A&E, VICO, Ours

Prompts:
a boat and an airplane
a bottle and a chair
a car and a train
a chair and a couch
an apple and a cell phone

Overall Consistency

Methods compared: VideoCrafter2, VideoCrafter2+A&E, VICO, Ours

Prompts:
an ice cream is melting on the table.
A teddy bear washing the dishes
Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks.
Vincent van Gogh is painting in the room
A cute fluffy panda eating Chinese food in a restaurant

Spatial Relation

Methods compared: VideoCrafter2, VideoCrafter2+A&E, VICO, Ours

Prompts:
a baseball bat on the left of a baseball glove, front view
a boat on the left of an airplane, front view
an elephant on the left of a bear, front view
a parking meter on the right of a bench, front view
a snowboard on the bottom of a kite, front view

Multi-Action Video Generation (MAVG)

Token Swapping. Qualitative comparison between results generated with the FIFO [4] baseline and with our token-swapping method.

Latent Interpolation. Qualitative comparison between results generated with the FIFO [4] baseline and with our proposed optimization method for MAVG. In our case, the bench remains consistent and hallucination is avoided.
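
As a rough sketch of the two mechanisms named above, under assumed interfaces: per-prompt token embeddings of shape (tokens, dim), a slice marking the action-token positions, and per-action latents. Where exactly the swap and the interpolation are applied inside the FIFO-style pipeline is described in the paper; the functions below only illustrate the operations themselves.

```python
import torch

def swap_action_tokens(src_emb: torch.Tensor, dst_emb: torch.Tensor,
                       action_slice: slice) -> torch.Tensor:
    """Keep the shared (non-action) token embeddings from the source prompt and copy
    only the action-token embeddings from the next prompt (illustrative token swapping)."""
    out = src_emb.clone()
    out[action_slice] = dst_emb[action_slice]
    return out

def interpolate_latents(z_a: torch.Tensor, z_b: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Linearly blend the latents of two consecutive actions to smooth the transition
    (the actual schedule may differ). Returns a tensor of shape (num_steps, *z_a.shape)."""
    alphas = torch.linspace(0.0, 1.0, num_steps, device=z_a.device)
    alphas = alphas.view(-1, *([1] * z_a.dim()))
    return (1.0 - alphas) * z_a.unsqueeze(0) + alphas * z_b.unsqueeze(0)
```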

References

[1] Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., & Shan, Y. (2024). VideoCrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7310-7320).

[2] Agarwal, A., Karanam, S., Joseph, K. J., Saxena, A., Goswami, K., & Srinivasan, B. V. (2023). A-STAR: Test-time attention segregation and retention for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2283-2293).

[3] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., ... & Liu, Z. (2024). VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 21807-21818).

[4] Kim, J., Kang, J., Choi, J., & Han, B. (2024). FIFO-Diffusion: Generating infinite videos from text without training. Advances in Neural Information Processing Systems, 37, 89834-89868.

[5] Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., & Cohen-Or, D. (2023). Attend-and-Excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4), 1-10.