The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

Aoxiong Yin1* Kai Shen2* Yichong Leng2 Xu Tan2† Xinyu Zhou2 Juncheng Li1† Siliang Tang1
1Zhejiang University 2Moonshot AI
*Equal contribution †Corresponding author

Abstract

Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a 14,000× compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source model HunyuanVideo (13B) as well as commercial models such as Sora, Kling, and Hailuo. Furthermore, our model achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field.

Overview

The architecture of LanDiff. Given text inputs, we first extract text embeddings and employ an LLM to generate semantic tokens in the first stage. Subsequently, we utilize a diffusion model to synthesize perceptual features conditioned on these semantic tokens, followed by a VAE decoder that transforms these features into the final video frames.
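The data flow described above can be sketched as a chain of four stages. This is only an illustrative sketch, not the LanDiff implementation: the class name `LanDiffPipelineSketch`, its field names, and the toy stand-in functions are all hypothetical, and the real stages are neural networks operating on tensors rather than Python lists.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LanDiffPipelineSketch:
    """Hypothetical wrapper that only illustrates the order of the stages."""

    text_encoder: Callable[[str], List[float]]           # text -> text embeddings
    language_model: Callable[[List[float]], List[int]]   # embeddings -> semantic tokens
    diffusion_model: Callable[[List[int]], List[float]]  # tokens -> perceptual features
    vae_decoder: Callable[[List[float]], List[str]]      # features -> video frames

    def generate(self, prompt: str) -> List[str]:
        # Stage 1: the language model produces coarse semantic tokens.
        text_embeddings = self.text_encoder(prompt)
        semantic_tokens = self.language_model(text_embeddings)
        # Stage 2: the diffusion model refines the tokens into perceptual
        # features, which the VAE decoder turns into frames.
        perceptual_features = self.diffusion_model(semantic_tokens)
        return self.vae_decoder(perceptual_features)


# Toy stand-ins (assumptions for illustration only, not real models).
pipeline = LanDiffPipelineSketch(
    text_encoder=lambda prompt: [float(len(word)) for word in prompt.split()],
    language_model=lambda emb: [int(x) % 8 for x in emb],
    diffusion_model=lambda tokens: [t / 8.0 for t in tokens],
    vae_decoder=lambda feats: [f"frame_{i}:{f:.3f}" for i, f in enumerate(feats)],
)

frames = pipeline.generate("a dog drinking water")
```

Keeping the stages as separate callables mirrors the coarse-to-fine split: the language model only ever sees text embeddings and emits discrete tokens, while the diffusion model is conditioned on those tokens alone.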

Video Generation Models Performance Comparison

Radar chart visualization of performance comparison across different dimensions on VBench. The plot compares LanDiff against five competitive baselines: Sora, Hailuo, HunyuanVideo, Kling, and CogVideoX-5B. For better readability, the values in the radar chart have been normalized to a scale ranging from 0.3 to 0.8. The normalization was performed using the min-max scaling formula: $\text{normalized} = 0.3 + 0.5 \times \frac{\text{value} - \text{min\_value}}{\text{max\_value} - \text{min\_value}}$
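A minimal sketch of the min-max scaling described above, mapping raw scores into the range 0.3 to 0.8. The function name `normalize_for_radar`, the sample scores, and the handling of identical scores are assumptions for illustration. The page does not say whether the minimum and maximum are taken per dimension or over all dimensions, so the sketch simply scales whatever list it is given.

```python
from typing import List


def normalize_for_radar(values: List[float], lo: float = 0.3, hi: float = 0.8) -> List[float]:
    """Min-max scale raw scores into the interval [lo, hi]."""
    min_value = min(values)
    max_value = max(values)
    span = max_value - min_value
    if span == 0:
        # All scores are equal, so the formula would divide by zero.
        # Placing them at the midpoint is an assumption of this sketch.
        return [(lo + hi) / 2 for _ in values]
    return [lo + (hi - lo) * (v - min_value) / span for v in values]


# Made-up scores for one dimension across three models (not real VBench data).
scores = [70.0, 80.0, 90.0]
normalized = normalize_for_radar(scores)
```

With the defaults, the lowest score in the list maps to 0.3 and the highest to 0.8, so the chart shows relative standing among the compared models rather than absolute VBench scores.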

Text-to-Video Generation Demos
A snail with a brown and tan shell is seen crawling on a bed of green moss. The snail's body is grayish-brown, and it has two prominent tentacles extended forward. The environment suggests a natural, outdoor setting with a focus on the snail's movement across the mossy surface.
A vintage 1980s kitchen, with checkered floor tiles and pastel-colored appliances, is the setting for a surreal scene. In the center, an ostrich stands tall, its feathers a stark contrast against the dated decor. The camera slowly pushes in, capturing the ostrich's curious gaze as it peers around the room, its long neck gracefully arching. The lighting is soft and warm, casting a nostalgic glow over the scene. The camera continues to move closer, the focus shifting from the room to the ostrich, until its large, expressive eyes fill the frame, a symbol of the strange and unexpected in the mundane.
A close-up shot reveals a clear glass container filled with water, with a dropper poised above it. The dropper releases a golden stream of olive oil, which descends into the water, forming a mesmerizing, slow-motion dance. The oil gradually separates into distinct droplets, each suspended in the water, creating a captivating interplay of light and shadow. Next, a second dropper introduces a dark stream of balsamic vinegar, which cascades into the container, mingling with the oil droplets and forming intricate patterns. The camera captures the fascinating interaction between the two fluids, as they swirl and separate, highlighting the unique properties of oil and vinegar.
A camera slowly tilts upward from the base of a towering, intricately-carved marble statue, revealing its grandeur and craftsmanship. The statue, dressed in ancient robes, stands in a sunlit square, its majestic head adorned with a laurel wreath. As the camera moves, the statue's detailed features, including its strong jawline and piercing eyes, become more prominent. The sunlight casts dramatic shadows on the statue's face, highlighting the artist's skill. The final shot captures the statue's serene expression, as if it's surveying its domain with wisdom and authority.
A close-up shot of a Victoria crowned pigeon reveals its striking blue plumage and red chest, with a delicate, lacy crest and striking red eyes. The bird tilts its head slightly, exuding regality and majesty. The background is blurred, drawing attention to the bird's striking appearance. The pigeon's feathers shimmer under the soft lighting, highlighting its majestic stature. The camera slowly zooms in, capturing the intricate details of the bird's plumage and the subtle movements of its head. The pigeon's eye, a deep, captivating red, seems to hold a story of its own. The video ends with a slow, graceful movement of the bird's head, leaving a lasting impression of its regal presence.
A drone captures the breathtaking view of waves crashing against the rugged cliffs along Big Sur's Garrapata State Park beach. The blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff's edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff's edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.
A group of playful golden retriever puppies, with their fluffy fur and sparkling eyes, frolic in a pristine snow-covered field. Their heads pop out of the snow, covered in white powder, as they joyfully wrestle and tumble, their tails wagging with excitement. The puppies, with their warm golden coats, are a stark contrast against the snowy landscape, creating a heartwarming scene. As they play, their faces are covered in snow, making them look like little snowballs, adding to the charm of the moment.
A series of images depict the Earth from space, focusing on the continent of Australia. The images showcase the continent's vast deserts, green vegetation, and coastal regions. The curvature of the Earth is visible, with the blue of the oceans contrasting against the brown and green hues of the land. Cloud formations can be seen over parts of the continent, and the darkness of space is visible in the background.
A young man in his 20s, wearing a casual white t-shirt and blue jeans, sits on a fluffy white cloud in a clear blue sky, engrossed in a hardcover book. His brown hair is tousled, and he sports a pair of black-rimmed glasses. The sun casts a warm glow on his face, highlighting his focused expression. As he reads, a gentle breeze rustles the pages of the book, and the cloud slowly drifts across the sky, creating a serene and whimsical scene.
A close-up shot reveals a vibrant blue parrot, its feathers shimmering with an iridescent glow as they catch the sunlight, showcasing the bird's unique plumage and dazzling colors. The parrot's feathers are a rich, deep blue, with hints of green and purple, creating a mesmerizing visual effect. The bird's eyes are a striking contrast, a piercing orange that seems to hold a depth of wisdom. The camera slowly pans over the parrot's body, capturing the intricate patterns and the way the feathers seem to dance with each movement. The video ends with a close-up of the parrot's beak, a vibrant red that complements the blue feathers, as it opens its mouth to reveal a bright yellow tongue, completing the stunning display of this magnificent creature's vibrant colors.
A group of majestic woolly mammoths, their fur a stark contrast against the snow, tread through a pristine meadow, their long, woolly coats gently swaying in the breeze. In the distance, snow-dusted trees and dramatic, snow-capped mountains create a breathtaking backdrop. The mid-afternoon light filters through wispy clouds, casting a warm glow over the scene. The camera, positioned low, captures the grandeur of these colossal creatures with stunning clarity, the depth of field accentuating their size and the vastness of the landscape.
A woman with long, wavy red hair and a black sleeveless top is holding and looking at a smartphone with a smile. She is standing on a city street with buildings and a storefront in the background. Her expression changes from neutral to smiling as she interacts with the device. The lighting suggests it is daytime, and the focus is on the woman with a shallow depth of field blurring the background.
Text-to-Video Model Comparison
CogVideoX-5B
LanDiff
"A colossal, human-shaped cloud towers over the earth, its massive form casting a shadow across the landscape. The cloud man's features are distinct, with a stern expression and outstretched arms. Suddenly, the cloud man releases a barrage of lightning bolts, illuminating the sky as they streak towards the earth. The scene is set against a backdrop of a stormy sky, with dark clouds and distant thunder adding to the dramatic atmosphere."
"A life-sized ice sculpture of a playful dog, with intricate details and a joyful expression, stands in the middle of a sunlit, grassy field on a sweltering summer day. The ice dog, initially solid and vibrant, begins to melt under the relentless heat, with droplets of water forming on its surface. As the day progresses, the ice dog's form gradually diminishes, with its once sharp features becoming blurred and distorted. The melting process accelerates, and the ice dog's body starts to collapse, pooling into a puddle of water on the ground. By the end of the day, all that remains is a shallow puddle, reflecting the cloudless sky, with the memory of the once majestic ice dog now just a memory."
"Two vibrant hot air balloons, one red and the other blue, are seen soaring through a clear blue sky, their baskets gently bumping against each other mid-air. The red balloon features intricate gold patterns on its surface, while the blue balloon boasts a white and silver design. As they collide, the passengers in the baskets, dressed in casual attire, react with surprise and excitement. The scene is set against a backdrop of a picturesque landscape, with lush green hills and a sparkling river below. The balloons' vibrant colors contrast beautifully with the azure sky, creating a visually stunning and dynamic scene."
"Wind turbines stand in a vast, open field, their blades spinning gracefully in the breeze. The elegant motion of the turbine blades is captured against a clear blue sky. The wind turbines appear serene in the landscape, highlighting the vastness of the open field as they become smaller in the distance."
"A breathtaking view of a majestic snowy mountain peak is reflected in a pristine alpine lake, creating a flawless mirror image. The sun's rays cast a subtle shimmering effect on the water's surface, enhancing the serene ambiance. The camera captures the scene from a distance, highlighting the symmetry and beauty of nature's perfect reflection. The mountain's snow-capped peak, the lake's clear waters, and the shimmering effect all contribute to a mesmerizing and tranquil atmosphere."
"A close-up view of a Christmas tree reveals a variety of decorations including a purple ornament with a gold pattern, a gold textured ornament, a small white house-shaped ornament with red roof and gold details, and a brown pine cone. The tree branches are dense and green, providing a natural backdrop for the ornaments. The camera pans slightly across the scene, maintaining focus on the ornaments while subtly shifting the perspective."
"A sleek, modern train glides effortlessly over a towering steel bridge. The polished exterior of the train reflects the golden hues of the setting sun. The bridge, an architectural marvel, spans a deep, verdant valley, with lush forests and a winding river far below. As the train moves, its rhythmic clatter harmonizes with the distant calls of birds and the gentle rustling of leaves. The majestic bridge stands silhouetted against a vibrant, twilight sky, as the train continues its journey into the horizon."
"A sleek white sailboat glides gracefully across a calm, azure sea, its sails billowing gently in the breeze. Above, a silver airplane soars through a clear blue sky. The boat's hull reflects the sunlight, creating a shimmering effect on the water's surface. The airplane, seen in a high-altitude flyover, casts a shadow that momentarily aligns with the boat's path, creating a fleeting connection between sea and sky. The scene is captured in a wide shot, ensuring both the boat and airplane are prominently centered, emphasizing their contrasting yet harmonious presence in the vast expanse."
"A group of silver-colored fish with darker fins swim among green aquatic plants in an aquarium setting. The fish move gracefully through the water, navigating around the plants, which are of various sizes and shades of green. The aquarium environment is designed to mimic a natural habitat, with rocks and shadows in the background contributing to the underwater scene."
Long Video Generation Comparison
StreamingT2V
FreeNoise
OpenSora
LanDiff
"a dog drinking water."
"a boat accelerating to gain speed."
"a car accelerating to gain speed."
Video Reconstruction Results
Reference Video
Reconstruction Video