The world's largest open-source video model is now made in China, courtesy of StepFun

In the world of large models, Chinese power is never absent, nor does it lag behind.

Author: Hengyu, from Aofeisi


Image source: generated by Wujie AI

Just now, StepFun (Step Star) and Geely Auto Group jointly open-sourced two multimodal large models!

There are two new models:

  • Step-Video-T2V, the world's largest open-source video generation model

  • Step-Audio, the industry's first product-grade open-source voice interaction model

The multimodal "volume king" has begun open-sourcing its multimodal models, and Step-Video-T2V ships under the most open and permissive MIT license: free to modify and use commercially.

(As usual, the GitHub, Hugging Face, and ModelScope links can be found at the end of the article.)

During development of the two models, the two parties played to each other's strengths in areas such as computing power, algorithms, and scenario-based training, "significantly enhancing the performance of multimodal large models."

According to the official technical report, the two open-source models performed strongly on benchmarks, surpassing comparable open-source models both at home and abroad.

Hugging Face's official account also retweeted the high praise from its head of China.

To highlight the key points: "the next DeepSeek" and "HUGE SoTA."


Oh, really?

Here at QbitAI, we'll work through the technical report and our own hands-on tests in this article to see whether the models live up to the name.


QbitAI has verified that both new open-source models are already live in the Yuewen app, where anyone can try them.

The multimodal volume king open-sources its multimodal models for the first time

Step-Video-T2V and Step-Audio are the first open-source multimodal models released by StepFun.

Step-Video-T2V

Let’s first take a look at the video generation model Step-Video-T2V.

Its parameter count reaches 30B, making it the largest known open-source video generation model in the world, and it natively supports bilingual Chinese and English input.


According to the official introduction, Step-Video-T2V has four major technical features:

First, it can directly generate videos up to 204 frames long at 540P resolution, ensuring that the generated content has extremely high consistency and information density.

Second, the team designed and trained a high-compression Video-VAE for video generation tasks. While preserving reconstruction quality, it compresses video 16×16 in the spatial dimensions and 8× in the temporal dimension.

Currently, most VAEs on the market compress at 8×8×4. At the same number of video frames, this Video-VAE provides an additional 8× compression, which is claimed to raise both training and generation efficiency by 64×.
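As a sanity check on those figures, here is the back-of-the-envelope arithmetic (our illustration; the report itself does not spell out this calculation):

```python
# Compression factors, in pixels per latent token (illustrative only).
typical_vae = 8 * 8 * 4    # common Video-VAE: 8x8 spatial, 4x temporal -> 256x
step_vae = 16 * 16 * 8     # Step-Video-T2V's Video-VAE: 16x16 spatial, 8x temporal -> 2048x

extra = step_vae / typical_vae
print(extra)       # 8.0 -> 8x fewer latent tokens for the same clip

# If self-attention over latent tokens dominates compute, cost grows roughly
# quadratically with token count, so 8x fewer tokens can mean up to ~8^2 = 64x
# less attention compute per step -- consistent with the 64x efficiency claim.
print(extra ** 2)  # 64.0
```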

Third, Step-Video-T2V applies deep, systematic optimization to the DiT model's hyperparameter settings, architecture, and training efficiency, ensuring an efficient and stable training process.

Fourth, the report details the complete training strategy across pre-training and post-training, including the training tasks, learning objectives, and data construction and filtering methods at each stage.

In addition, Step-Video-T2V introduces Video-DPO (video preference optimization) at the end of training: an RL-style optimization algorithm for video generation that further improves output quality and makes generated videos more plausible and stable.

The end result is smoother motion, richer detail, and more accurate instruction alignment in the generated videos.
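The report's exact objective is not reproduced here, but Video-DPO follows the general direct-preference-optimization recipe. A minimal sketch of the standard DPO loss, with log-likelihoods of preferred vs. rejected videos as stand-ins (our assumption; for a diffusion model a surrogate likelihood would be needed, and StepFun's actual formulation may differ):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: push the policy to prefer human-chosen samples
    over rejected ones, measured relative to a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage: random numbers stand in for the model's log-likelihoods of the
# preferred (chosen) and rejected video samples in a preference pair.
logp_c = torch.randn(4, requires_grad=True)
logp_r = torch.randn(4, requires_grad=True)
loss = dpo_loss(logp_c, logp_r, torch.randn(4), torch.randn(4))
loss.backward()  # gradients flow to the policy's log-probs
print(loss.item())
```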


To comprehensively evaluate the performance of open-source video generation models, StepFun also released a new benchmark dataset for text-to-video quality evaluation, Step-Video-T2V-Eval.

The dataset is also open source.

It contains 128 Chinese evaluation prompts sourced from real users, designed to assess generated-video quality across 11 content categories, including sports, scenery, animals, combined concepts, and surrealism.
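For a sense of how such a prompt set is typically organized, here is a hypothetical record layout (the field names and file format are our assumption, not the repository's actual schema):

```python
import json
from collections import Counter

# Two made-up records in the style of a text-to-video eval set such as
# Step-Video-T2V-Eval; the real file's fields and wording will differ.
records = [
    {"id": 1, "category": "sports", "prompt": "A man plays badminton on an indoor court..."},
    {"id": 2, "category": "scenery", "prompt": "A lake beneath snow-capped mountains at sunset..."},
]

with open("eval_prompts.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Tally prompts per content category before generation and judging.
print(Counter(rec["category"] for rec in records))
```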

Step-Video-T2V's evaluation results on this benchmark:


As the results show, Step-Video-T2V outperforms the previous best open-source video model in instruction following, motion smoothness, physical plausibility, and aesthetics.

This means the entire video generation field can now study and innovate on top of this new strongest base model.
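Getting the weights is straightforward; a minimal sketch using huggingface_hub (the repo id below is assumed from the release naming, so verify it against the official links first):

```python
from huggingface_hub import snapshot_download

# Download the open weights locally; "stepfun-ai/stepvideo-t2v" is our assumed
# repo id -- check the official Hugging Face page before use.
local_dir = snapshot_download(repo_id="stepfun-ai/stepvideo-t2v")
print(local_dir)  # pass this path to the repository's own inference entry point
```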

As for the actual results, StepFun's official introduction reads:

In terms of generation quality, Step-Video-T2V has powerful capabilities in complex motion, aesthetically pleasing characters, visual imagination, basic text rendering, native bilingual Chinese-English input, and camera language, with outstanding semantic understanding and instruction following, efficiently helping video creators achieve precise creative expression.

What are you waiting for? Let's take a look——

Following the order of the official introduction, round one tests whether Step-Video-T2V can handle complex motion.

Earlier video generation models would reliably produce strange imagery when generating complex movement such as ballet, ballroom dance, Chinese dance, rhythmic gymnastics, karate, and martial arts.

Think of a third leg appearing out of nowhere, or arms crossing and fusing together: quite unsettling.

To target exactly this, we sent Step-Video-T2V a prompt:

Indoor badminton court, horizontal perspective, fixed camera recording a man playing badminton. A man in a red short-sleeved shirt and black shorts, holding a badminton racket, stands in the middle of a green badminton court. The net spans the court, dividing it in two. The man swings the racket and hits the shuttlecock to the other side. The lighting is bright and even, and the picture is clear.

The scene, characters, camera, lighting, and movements all match.

Generating footage with "aesthetically pleasing characters" is the second challenge QbitAI set for Step-Video-T2V.

To be fair, the photorealistic portraits produced by today's text-to-image models are absolutely lifelike in static shots and local details.

In video, however, once the characters start moving, recognizable physical or logical flaws still creep in.

As for the performance of Step-Video-T2V——

Prompt: Close-up of a man wearing a black suit, dark tie, and white shirt, with scars on his face and a serious expression.

“It doesn’t feel like AI.”

That was the unanimous verdict in the QbitAI editorial office after the video of this handsome fellow made the rounds.

The facial features are regular, the skin texture looks real, and the scars on the face are clearly visible, all of which makes it look distinctly "un-AI."

It is realistic too, yet the protagonist avoids the hollow eyes and stiff expression that usually give AI away.

In the two rounds above, Step-Video-T2V was held to a fixed camera position.

So how does it handle pushing, pulling, panning, and tilting?

The third level tests Step-Video-T2V's mastery of camera movements, such as push-pull, pan, rotate, and follow.

If you want it to rotate, it rotates:

Not bad! It could shoulder a Steadicam and pass for a camera operator on set (not really).

After a round of testing, the generated results give the answer:

As the evaluation results show, Step-Video-T2V has outstanding capabilities in semantic understanding and instruction following.

Even basic text rendering is well within its grasp:

Step-Audio

The other open-source model, Step-Audio, is the industry's first product-grade open-source voice interaction model.

On StepEval-Audio-360, a multi-dimensional evaluation suite built and open-sourced by StepFun, Step-Audio achieved the best results across dimensions including logical reasoning, creativity, instruction control, language ability, role-playing, word games, and emotional intelligence.


On five major public test sets, including LLaMA Questions and Web Questions, Step-Audio outperformed comparable open-source models in the industry, ranking first.

Its performance on the HSK-6 (Chinese Proficiency Test, Level 6) assessment stands out in particular.

Our hands-on test:

According to the Step-Audio team, the model can produce emotions, dialects, languages, songs, and personalized styles on demand for different scenarios, and can hold natural, high-quality conversations with users.

At the same time, the speech it generates is not only realistic, natural, and high in emotional intelligence, but also supports high-quality timbre reproduction and role-playing.

In short, the Step-Audio package can fully cover application needs across film and TV entertainment, social networking, gaming, and other industry scenarios.

The open source ecosystem is snowballing

How to put it? In one word: relentless.

StepFun really does grind, especially in multimodal models, its specialty——

Ever since they first appeared, the multimodal models in its Step series have been regulars at the top of major authoritative evaluation suites and arenas at home and abroad.

Looking at just the past three months, it has topped the charts several times:

  • On November 22 last year, the multimodal understanding model Step-1V entered the latest Big Model Arena rankings with a total score tied with Gemini-1.5-Flash-8B-Exp-0827, ranking first among Chinese models in the vision category.

  • In January this year, the newly released Step-1o series took first place on the real-time multimodal leaderboard of the domestic evaluation platform "Sinan" (OpenCompass).

  • On the same day, per the latest Big Model Arena rankings, the multimodal model Step-1o-vision took first place among domestic models in the vision category.


Second, StepFun's multimodal models combine strong performance and quality with a high cadence of R&D iteration.

To date, StepFun has released 11 multimodal large models.

Last month it released 6 models in 6 days, covering language, speech, vision, and reasoning, further cementing its title of multimodal volume king.

This month, it open-sourced two more multimodal models.

If it keeps up this rhythm, it can keep proving its standing as a "full-suite multimodal player."

On the back of these multimodal capabilities, the market and developers have, since 2024, recognized and widely adopted the Step API, building a huge user base.

Take consumer goods: tea chain Cha Baidao has connected thousands of stores nationwide to the multimodal understanding model Step-1V, exploring large-model applications in the tea beverage industry through intelligent store inspection and AIGC marketing.

Public data shows that, on average, millions of cups of tea are delivered to consumers every day under the watch of the large-model intelligent inspection system.

Step-1V saves Cha Baidao's supervisors an average of 75% of their daily self-inspection review time, giving tea consumers more dependable, higher-quality service.

Among independent developers, the popular AI app "Stomach Book" and the AI emotional-healing app "Forest Chat Healing Room" both settled on the Step multimodal API after A/B testing most domestic models.

(Whispering: Because it has the highest payment rate)

Specifically, in the second half of 2024, calls to the Step multimodal large-model API grew more than 45-fold.


Moreover, what is being open-sourced this time is the multimodal work StepFun does best.

Notably, with its market and developer reputation and adoption rising sharply, this open-source release is designed to make deep integration with the models easy.

On the one hand, Step-Video-T2V adopts the most open and permissive MIT license: modify and commercialize at will.

You could say it is holding nothing back.

On the other hand, StepFun says it will "make every effort to lower the threshold for industry adoption."

Take Step-Audio: unlike open-source offerings on the market that require redeployment and redevelopment, Step-Audio is a complete real-time conversation solution that can handle live dialogue after a simple deployment.

An end-to-end experience with zero wind-up.

After this whole sequence of moves, an open-source technology ecosystem unique to StepFun has begun to take shape around the company and its trump-card multimodal models.

In this ecosystem, technology, creativity and commercial value are intertwined, jointly driving the development of multimodal technology.

And with the continued development and iteration of the Step models, the rapid, ongoing onboarding of developers, and the support and joint efforts of ecosystem partners, the "snowball effect" of the Step ecosystem has kicked in and keeps growing.

China's open source forces are speaking with strength

Once upon a time, when people talked about the leaders in the field of large model open source, what came to mind were Meta's LLaMA and Albert Gu's Mamba.

By now there is no doubt: the open-source power of China's large-model industry has shone around the world, rewriting the "stereotype" with sheer strength.

January 20, the eve of the Year of the Snake Spring Festival, was a day when model heavyweights at home and abroad went head to head.

Most eye-catching of all, DeepSeek-R1 was released that day, with reasoning performance comparable to OpenAI's o1 at only a third of the cost.

The shock was so great that Nvidia shed $589 billion in market value overnight (about 4.24 trillion yuan), a record single-day loss in U.S. stock market history.

More important, and more dazzling: beyond excellent reasoning and an affordable price, what lifted R1 to heights that thrilled hundreds of millions of people was its open-source nature.

One stone stirred up a thousand waves. Even OpenAI, long ridiculed as "no longer open," has seen CEO Sam Altman address the issue publicly more than once.

"On open-weight AI models, we are on the wrong side of history," Altman said.

He added: “There’s a real need for open source models in the world that can provide a lot of value to people. I’m glad there are some great open source models out there.”


Now, StepFun has also begun to open-source its new trump cards.

And open source was the intention from the start.

Officially, the purpose of open-sourcing Step-Video-T2V and Step-Audio is to promote the sharing of and innovation in large-model technology and to advance the inclusive development of artificial intelligence.

And the moment they went open source, the models demonstrated their strength across multiple evaluation suites.


On today's open-source large-model roster, DeepSeek is strong in reasoning, Step is heavy on multimodality, and all kinds of players keep pushing forward...

Their strength is not only top-tier within the open-source circle; it is impressive across the entire large-model landscape.

——China's open source power, having made its mark, is taking a further step.


Take StepFun's open-source release as an example: it has broken new technical ground in the multimodal field and changed the selection logic of developers around the world.

Technologists active in EleutherAI and other open-source communities have taken the initiative to test the Step models, with some "thanking Chinese open source."


Wang Tiezhen, head of Hugging Face China, said outright that StepFun could be the next "DeepSeek."


From "technological breakthrough" to "ecological openness", China's big model is moving towards a more stable path.

That said, StepFun's open-source double release may be just a footnote in the 2025 AI race.

On a deeper level, it demonstrates the technological confidence of China's open source power and sends a signal:

In the future world of AI big models, Chinese power will never be absent, nor will it lag behind.
