¹CVIG Lab, IIT Gandhinagar, ²Adobe, Bangalore
Published at WACV 2026
Existing pre-trained text-to-video diffusion models can generate high-quality videos, but they often suffer from misalignment between the generated content and the input text, particularly when composing scenes with multiple objects. To tackle this issue, we propose a straightforward, training-free approach for compositional video generation from text. We introduce Video-ASTAR, a test-time attention aggregation and segregation method with a novel centroid loss that enhances text-video alignment, enabling the generation of multiple objects in a scene together with their actions and interactions. We further extend our approach to the Multi-Action Video Generation (MAVG) setting, where only the specified action varies across a sequence of prompts. To ensure coherent action transitions, we introduce a novel token-swapping and latent-interpolation strategy. Extensive experiments and ablation studies show that our method significantly outperforms baseline methods, generating videos with improved semantic and compositional consistency as well as improved temporal coherence.
Our main contributions toward improving text-based video synthesis are as follows:
Framework. The figure illustrates the VideoCrafter2 (VC2) [1] framework with the proposed latent optimization step. The VC2 video latent diffusion model conditions the noisy latent input on the given text prompt, and during denoising, the proposed method ensures that the attention maps attend to each token of the text. The final denoised latent is then passed to the pre-trained latent-to-frame decoder network of VC2 for video generation.
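The latent optimization step can be pictured as a gradient update on the noisy latent in between denoising steps. Below is a minimal sketch, assuming a diffusers-style scheduler and a UNet wrapper that also returns its cross-attention maps; `attn_loss_fn`, `step_size`, and the update schedule are illustrative placeholders, not the paper's exact implementation.

```python
import torch

def guided_denoising_step(unet, scheduler, latents, t, text_emb,
                          attn_loss_fn, step_size=0.1):
    """One denoising step with test-time latent optimization (sketch).

    The noisy latent is nudged along the negative gradient of an
    attention-based alignment loss before the regular scheduler update,
    so that the cross-attention maps attend to every object token.
    """
    latents = latents.detach().requires_grad_(True)

    # Forward pass through the denoiser. We assume a wrapper that also
    # returns the cross-attention maps (in practice, collected via hooks).
    noise_pred, attn_maps = unet(latents, t, encoder_hidden_states=text_emb)

    # Attention-based alignment objective (aggregation/segregation and
    # centroid terms); attn_loss_fn stands in for the paper's loss.
    loss = attn_loss_fn(attn_maps)

    # Gradient step directly on the noisy latent (test-time optimization).
    grad = torch.autograd.grad(loss, latents)[0]
    latents = (latents - step_size * grad).detach()

    # Re-predict noise from the updated latent and apply the usual
    # diffusion update (diffusers-style scheduler assumed).
    with torch.no_grad():
        noise_pred, _ = unet(latents, t, encoder_hidden_states=text_emb)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```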
Loss Functions. The figure shows how each loss operates on the attention map generated for each token.
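To give a concrete sense of how a centroid-style objective can act on per-token attention maps, here is a minimal sketch in PyTorch. The pairwise margin formulation and the diagonal normalization are assumptions made for illustration; they are not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def attention_centroid(attn_map):
    """Spatial centroid (x, y) of a single H x W attention map."""
    h, w = attn_map.shape
    attn = attn_map / (attn_map.sum() + 1e-8)
    ys = torch.arange(h, device=attn.device, dtype=attn.dtype)
    xs = torch.arange(w, device=attn.device, dtype=attn.dtype)
    cy = (attn.sum(dim=1) * ys).sum()   # weighted mean row index
    cx = (attn.sum(dim=0) * xs).sum()   # weighted mean column index
    return torch.stack([cx, cy])

def centroid_separation_loss(token_attn_maps, margin=0.5):
    """Illustrative loss encouraging distinct object tokens to occupy
    different spatial regions: pairs of attention centroids that lie
    closer than a margin (relative to the map diagonal) are penalized."""
    centroids = [attention_centroid(a) for a in token_attn_maps]
    h, w = token_attn_maps[0].shape
    diag = (h ** 2 + w ** 2) ** 0.5
    loss = torch.zeros((), dtype=token_attn_maps[0].dtype)
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            dist = torch.norm(centroids[i] - centroids[j]) / diag
            loss = loss + F.relu(margin - dist)   # margin value is assumed
    return loss
```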
Token Visualization. The figure shows the cross-attention maps between the text tokens and a video frame, pooled from a downsampling layer in a spatial transformer block of VideoCrafter2 [1].
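As a rough illustration of how such a visualization can be produced, the snippet below pools one layer's cross-attention weights into a per-token heat map; the tensor layout and the hook-based capture of the weights are assumptions about the implementation.

```python
import torch

def pooled_token_map(attn_probs, token_idx, height, width):
    """Turn one layer's cross-attention probabilities into a per-token
    heat map for visualization.

    attn_probs: (num_heads, H*W, num_text_tokens) softmax-ed cross-attention
    weights captured (e.g., via a forward hook) from a spatial transformer
    block; this shape convention is an assumption.
    """
    # Average over attention heads, select the column of the chosen text
    # token, and fold it back into the spatial grid.
    token_map = attn_probs.float().mean(dim=0)[:, token_idx]   # (H*W,)
    token_map = token_map / (token_map.max() + 1e-8)           # normalize for display
    return token_map.reshape(height, width)
```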
Ablations. (a) The effect of different optimization functions with VideoCrafter2 [1]; (b) the effect of different parameters on the Video-ASTAR loss.
This figure shows the effect of different methods when integrated with the VideoCrafter2 [1] text-to-video generation network.
Quantitative comparison with baseline methods using VBench [3] across three prompt categories, evaluated with three random seeds. We report the average across all runs.
Qualitative comparison of VideoCrafter2, VideoCrafter2+A&E, VICO, and our method on multi-object prompts: "a boat and an airplane", "a bottle and a chair", "a car and a train", "a chair and a couch", and "an apple and a cell phone".
Qualitative comparison of VideoCrafter2, VideoCrafter2+A&E, VICO, and our method on action and scene prompts: "an ice cream is melting on the table.", "A teddy bear washing the dishes", "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks.", "Vincent van Gogh is painting in the room", and "A cute fluffy panda eating Chinese food in a restaurant".
Qualitative comparison of VideoCrafter2, VideoCrafter2+A&E, VICO, and our method on spatial-relationship prompts: "a baseball bat on the left of a baseball glove, front view", "a boat on the left of an airplane, front view", "an elephant on the left of a bear, front view", "a parking meter on the right of a bench, front view", and "a snowboard on the bottom of a kite, front view".
Token Swapping. Qualitative comparison between results generated with the FIFO-Diffusion [4] baseline and with our token-swapping method.
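A token swap of this kind can be sketched as replacing only the action-token embeddings while keeping the rest of the prompt embedding fixed. The helper below is illustrative rather than the exact method, and it assumes the action-token positions are aligned between consecutive prompts.

```python
import torch

def swap_action_tokens(base_text_emb, next_text_emb, action_token_ids):
    """Replace only the action-token embeddings of the current prompt with
    those of the next prompt, keeping all shared tokens untouched.

    base_text_emb, next_text_emb: (seq_len, dim) per-token text embeddings
    of two prompts that differ only in the action phrase (assumed aligned).
    action_token_ids: positions of the action tokens to swap.
    """
    swapped = base_text_emb.clone()
    swapped[action_token_ids] = next_text_emb[action_token_ids]
    return swapped
```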
Latent Interpolation. Qualitative comparison between results generated with the FIFO-Diffusion [4] baseline and with our proposed optimization method for MAVG. With our method, the bench remains consistent across frames and hallucination is avoided.
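The idea of latent interpolation for smooth action transitions can be sketched as blending the latents of consecutive action segments over a few transition frames; the linear schedule below is an assumption, not necessarily the paper's exact strategy.

```python
import torch

def interpolate_transition_latents(latent_a, latent_b, num_transition_frames):
    """Linearly blend the latent of the outgoing action segment into the
    latent of the incoming one over a few frames, so that shared scene
    content (e.g., the bench) stays consistent through the transition."""
    blended = []
    for k in range(1, num_transition_frames + 1):
        w = k / (num_transition_frames + 1)          # linear weight (assumed)
        blended.append((1.0 - w) * latent_a + w * latent_b)
    return torch.stack(blended)
```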
[1] Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., & Shan, Y. (2024). VideoCrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7310-7320).
[2] Agarwal, A., Karanam, S., Joseph, K. J., Saxena, A., Goswami, K., & Srinivasan, B. V. (2023). A-STAR: Test-time attention segregation and retention for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2283-2293).
[3] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., ... & Liu, Z. (2024). VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 21807-21818).
[4] Kim, J., Kang, J., Choi, J., & Han, B. (2024). FIFO-Diffusion: Generating infinite videos from text without training. Advances in Neural Information Processing Systems, 37, 89834-89868.
[5] Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., & Cohen-Or, D. (2023). Attend-and-Excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4), 1-10.