ByteDance hits the accelerator on AI Agents

After experiencing the impact of DeepSeek and Manus in early 2025, major companies are redefining their next strategies.

Author: Wan Chen

DeepSeek-R1's strong writing, GPT-4o's Ghibli-style images, OpenAI o3 inferring where a photo was taken just by looking at it...

These are the breakout AI products that have flooded screens over the past two months. They make two things clear: reinforcement learning can finally generalize, and multimodal models are becoming genuinely usable. That also means 2025 is truly the year Agent applications land and accelerate.

The team behind Manus, the AI agent that went viral earlier, once revealed that by the end of last year Claude 3.5 Sonnet had reached the level an agent requires in planning long-horizon tasks and solving problems step by step, which was the precondition for Manus to be born.

Now, with the further maturity of deep thinking models and multimodal model capabilities, there will definitely be more agents that can handle complex tasks.

Based on this judgment, on April 17 ByteDance's cloud and AI service platform "Volcano Engine" released a more capable model for the enterprise market, the Doubao 1.5 Deep Thinking model, marking the first public appearance of the reasoning model behind ByteDance's AI application Doubao App. It also launched the Doubao text-to-image model 3.0 and an upgraded version of its visual understanding model.

Regarding the models released this time, Tan Dai, president of Volcano Engine, said: "Deep thinking models are the foundation for building agents. A model must be able to think, plan, and reflect, and must support multimodality, just as humans have vision and hearing, for agents to handle complex tasks well."

As AI gains end-to-end autonomous decision-making and execution capabilities and moves into core production workflows, Volcano Engine has also prepared the architecture and tools agents need to operate in the digital and physical worlds: the OS Agent solution and an AI cloud-native inference suite, helping companies build and deploy agent applications faster and more economically.

In Tan Dai's view, developing an agent is like developing a website or an app: a model API alone cannot solve the whole problem, and many AI cloud-native components are needed. In the past, cloud-native had its own core elements, such as containers and elasticity; AI cloud-native will likewise have its own key elements. Through continuous thinking, exploration, and rapid action in AI cloud-native, building components such as middleware, evaluation, monitoring, observability, data processing, safety and security, and sandboxes, Volcano Engine aims to become the best infrastructure choice for the AI era.

01 Doubao's deep thinking model: like a human, it watches, thinks, and searches at the same time

Since the release of DeepSeek-R1 at the beginning of the year, many consumer-facing applications have plugged into the R1 reasoning model; Doubao App was the exception. The "Deep Thinking" mode it launched in early March is based on the Doubao deep thinking model developed by ByteDance itself.

Now this reasoning model, the Doubao 1.5 Deep Thinking model, has been officially released and can be tried out and called on the Volcano Ark platform.
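For developers, a call to the model typically looks like an ordinary chat-completion request. Below is a minimal sketch assuming Volcano Ark's OpenAI-compatible endpoint; the base URL, environment variable, and endpoint ID are placeholders rather than values confirmed by this article.

```python
# Minimal sketch: calling a deep thinking model on Volcano Ark through an
# OpenAI-compatible chat API. Base URL, env var, and endpoint ID are assumptions.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://ark.cn-beijing.volces.com/api/v3",  # assumed Ark endpoint
    api_key=os.environ["ARK_API_KEY"],                     # hypothetical env var
)

resp = client.chat.completions.create(
    model="<your-deep-thinking-endpoint-id>",  # placeholder endpoint ID
    messages=[{"role": "user",
               "content": "Recommend a camping kit for two adults and one child within 2000 RMB."}],
)
print(resp.choices[0].message.content)
```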

Turn on web search, and Doubao can work through a problem much as a person would: think, search, then think again... until the problem is finally solved.

This is an example of a shopping scenario, where Doubao recommends a suitable set of camping equipment given constraints such as budget and size.

On this task, Doubao first broke down the considerations and planned what information it needed, then identified what was missing and searched the web. Here it ran three rounds of search: first for prices and specifications to stay within budget and meet the requirements, then for the child's particular needs, and finally for the weather along with detailed product reviews. It kept searching while thinking until it had gathered all the context needed for a decision, then gave a well-reasoned answer.
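Conceptually, this search-while-thinking behavior is an iterative loop: plan, decide what is missing, search, and fold the results back into the reasoning. The sketch below only illustrates that idea; `chat` and `web_search` are hypothetical helpers, not a documented Doubao interface.

```python
# Schematic "search while thinking" loop: the model plans, asks for missing
# information via web search, and answers once it has enough context.
def solve_with_search(question: str, chat, web_search, max_rounds: int = 3) -> str:
    context = []
    for _ in range(max_rounds):
        # Ask the model what it still needs to know, given what was gathered so far.
        plan = chat(
            f"Question: {question}\nKnown: {context}\n"
            "If information is missing, reply SEARCH:<query>; otherwise reply ANSWER:<answer>."
        )
        if plan.startswith("ANSWER:"):
            return plan[len("ANSWER:"):].strip()
        query = plan[len("SEARCH:"):].strip()
        context.append({"query": query, "results": web_search(query)})
    # Fall back to answering with whatever was gathered.
    return chat(f"Question: {question}\nKnown: {context}\nGive your best answer.")
```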

In addition to searching and thinking at the same time, Doubao's deep thinking model also has visual reasoning capabilities. Like humans, it can think not only based on text, but also based on the pictures it sees.

Take ordering food as an example. With the May Day Golden Week approaching, travelers abroad no longer need to photograph a menu and upload it to a translation app: Doubao's deep thinking model can help them order directly from the picture.

In the example below, Doubao's deep thinking model first converted the exchange rate to keep the order within budget, then weighed the preferences of the elderly and the children while carefully avoiding dishes they are allergic to, and directly produced a menu plan.
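In API terms, this kind of request simply attaches the photo to a chat message. The sketch below assumes an OpenAI-compatible vision endpoint on Volcano Ark; the endpoint ID, base URL, and the constraints in the prompt are illustrative assumptions.

```python
# Minimal sketch: sending a photographed menu to a vision-capable chat model,
# as in the ordering example above. Endpoint ID and base URL are placeholders.
import base64
import os
from openai import OpenAI

client = OpenAI(base_url="https://ark.cn-beijing.volces.com/api/v3",
                api_key=os.environ["ARK_API_KEY"])

with open("menu.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="<your-vision-endpoint-id>",  # placeholder endpoint ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Order for two adults, one child, and one elderly person, "
                     "about 300 RMB total, avoiding any dishes with peanuts."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```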

With web search, thinking, reasoning, and multimodality combined, the Doubao 1.5 deep thinking model shows well-rounded reasoning ability and can solve more complex problems.

According to the technical report, the Doubao 1.5 Deep Thinking model performs strongly on professional reasoning tasks. Its score on the AIME 2024 mathematical reasoning benchmark is on par with OpenAI o3-mini-high, and its scores on programming-competition and scientific-reasoning benchmarks are close to o1. On general tasks such as creative writing and humanities question answering, the model also generalizes well and can handle a broader range of usage scenarios.

The Doubao deep thinking model is also low-latency. Its technical report shows that the model uses an MoE architecture with 200B total parameters, of which only 20B are activated, achieving performance comparable to top-tier models with far fewer active parameters. Backed by efficient algorithms and a high-performance inference system, the Doubao model API service sustains high concurrency while keeping latency as low as 20 milliseconds.
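The gap between 200B total and 20B activated parameters comes from the MoE design itself: a router sends each token to only a few experts, so most parameters sit idle on any single forward pass. The toy example below illustrates that routing idea in general; it is not ByteDance's implementation, and all sizes are made up.

```python
# Toy MoE routing: a router picks the top-k experts per token, so only a small
# fraction of total parameters is active for any given forward pass.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=10, k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):               # route each token to its top-k experts
            for j in range(self.k):
                expert = self.experts[idx[t, j].item()]
                out[t] += weights[t, j] * expert(x[t])
        return out

moe = ToyMoE()
print(moe(torch.randn(4, 64)).shape)  # only k of 10 experts run per token
```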

It also has multimodal capabilities, bringing deep thinking to a wider range of scenarios. For example, it can understand complex enterprise project-management flowcharts, quickly locate key information, and, with strong instruction following, answer customers' questions strictly according to the flowchart. When analyzing aerial photos, it can combine terrain features to judge whether an area is suitable for development.

Beyond the reasoning model, the Doubao model family also received two other updates. For text-to-image, Doubao launched the upgraded 3.0 version, which delivers better text layout in images, more realistic photo-style generation, and 2K high-definition output.


The new model not only fixes the generation of small text and long passages of text within images, but also improves image layout. For example, the two posters on the far left, "Appearance" and "Harvest Plan", have finer details and a more natural layout, and are usable as-is.

Another upgrade is the Doubao 1.5 visual understanding model. The new version brings two key improvements: more accurate visual localization and smarter video understanding.

For visual localization, the Doubao 1.5 visual understanding model supports bounding-box and point localization of multiple targets, small targets, and general targets, as well as counting located objects, describing what was located, and 3D localization. Stronger localization lets the model expand into more application scenarios, such as offline store inspection, GUI agents, robot training, and autonomous-driving training.

The model's video understanding has also improved markedly, including memory, summarization and comprehension, speed perception, and long-video understanding. Enterprises can build more interesting commercial applications on top of it. In the home, for instance, surveillance footage can be searched semantically by combining video understanding with vector search.

In the example below, a cat owner who wants to know what the cat has been up to can simply search "What did the kitten do at home today?" and quickly get back semantically relevant video clips to review.
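One plausible way to wire this up is to embed each clip with a video-understanding model, store the vectors, and retrieve the clips nearest to the query embedding. The sketch below only shows that retrieval step; `embed_clip` and `embed_text` stand in for whatever embedding service is used and are assumptions, not a named API.

```python
# Rough sketch of semantic video search: embed clips, embed the query,
# and return the clips with the highest cosine similarity.
import numpy as np

def build_index(clips, embed_clip):
    # Stack one embedding vector per clip into an (n_clips, dim) matrix.
    return np.stack([embed_clip(c) for c in clips])

def search(query, clips, index, embed_text, top_k=3):
    q = embed_text(query)
    # Cosine similarity between the query and every clip embedding.
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-8)
    best = np.argsort(-sims)[:top_k]
    return [clips[i] for i in best]

# usage: search("What did the kitten do at home today?", clips, index, embed_text)
```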

With reasoning models that combine visual understanding and strong reasoning ability, many things that used to be impossible become achievable, unlocking more scenarios. Cameras with such capabilities will surely become more popular, and AI glasses, AI toys, smart cameras, door locks, and the like will gain new room to grow.

02 The cloud enters the era of agentic AI

In the past few days, OpenAI researcher Yao Shunyu (a core author of Deep Research and Operator) argued in his essay "The Second Half of AI" that reinforcement learning has finally found a path to generalization: it is effective not only in narrow domains, as with AlphaGo beating human players, but can also approach top human performance in software engineering, creative writing, IMO-level mathematics, and mouse-and-keyboard operation. In that world, it becomes easier to chase benchmarks and post higher scores on ever harder leaderboards, but that way of evaluating progress is outdated.

What matters now is the ability to define problems; in other words, which problems should AI solve in real life?

In 2025, the answer is productivity agents. AI application scenarios are rapidly entering the era of agentic AI, in which AI can increasingly complete entire tasks that are highly specialized and time-consuming. Against this backdrop, Volcano Engine has built a series of infrastructure for enterprises to "define their own general agents."

The most important piece is the model: it must be able to plan, reflect, decide, and execute autonomously end to end, and move into core production workflows. It also needs multimodal reasoning, so it can complete tasks in the real world through ears, mouth, and eyes.

Beyond the model, the infrastructure stack also has to keep evolving. For example, as the MoE architecture proves its efficiency advantages and gradually becomes mainstream, scheduling and adapting MoE models demands more sophisticated and flexible cloud computing architecture and tooling.

For the enterprise general-agent scenario, Volcano Engine has now launched a better-suited architecture and toolset, the OS Agent solution, which lets large models operate in the digital and physical worlds: an agent can drive a browser, search product pages, and complete an iPhone price-comparison task, or even use Jianying on a remote computer to edit video and add music.

Currently, the Volcano Engine OS Agent solution includes the Doubao UI-TARS model, along with veFaaS function services, cloud servers, cloud phones, and other products, which together let agents operate code, browsers, computers, phones, and more. The Doubao UI-TARS model integrates on-screen visual understanding, logical reasoning, and interface-element localization and operation, breaking the dependence of traditional automation tools on preset rules and providing a model foundation for agent interaction that is closer to how humans operate.
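At a high level, such a GUI agent runs a perceive-decide-act loop: capture the screen, ask the model for the next action, execute it, and repeat until the task is done. The sketch below shows only that loop; `capture_screen`, `ask_ui_model`, and `perform` are hypothetical helpers, not parts of the OS Agent solution's actual API.

```python
# Schematic perceive-decide-act loop for a GUI agent. The real solution would
# wire these helpers to cloud phones/servers and a UI-capable model.
def run_gui_agent(task: str, capture_screen, ask_ui_model, perform, max_steps: int = 20):
    history = []
    for _ in range(max_steps):
        screenshot = capture_screen()
        # The model sees the task, prior actions, and the current screen, and
        # returns a structured action, e.g. {"type": "click", "x": ..., "y": ...}.
        action = ask_ui_model(task=task, history=history, screenshot=screenshot)
        if action.get("type") == "done":
            return action.get("result")
        perform(action)
        history.append(action)
    return None  # gave up after max_steps
```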

Around the general-agent scenario, Volcano Engine uses this OS Agent solution to let enterprises, individuals, or specific domains define and explore agents on their own terms.

For vertical agents, Volcano Engine is exploring areas where it is already strong, such as the previously launched intelligent programming assistant Trae and the data product Data Agent; the latter maximizes data-processing capability by building a data flywheel.

On the other hand, as agents spread, model inference consumption will grow. To meet large-scale inference demand, Volcano Engine has built the AI cloud-native ServingKit inference suite, which makes model deployment faster and inference cheaper, cutting GPU consumption by 80% compared with traditional solutions.

In Tan Dai's view, to meet the needs of the AI era, Volcano Engine will keep pushing on three fronts: continuously improving its models to stay competitive; continuously lowering costs, both prices and latency, while raising throughput; and making its products easier to adopt, with developer tools such as Coze and HiAgent and cloud-native components such as the OS Agent solution. Staying ahead on product and technology should also keep it ahead in market share: IDC's "China Public Cloud Large Model Service Market Landscape Analysis, 1Q25" showed Volcano Engine ranking first with a 46.4% share.

Last December, the Doubao model's average daily token calls stood at 4 trillion; by the end of March this year the figure had surpassed 12.7 trillion, more than 106 times the volume when the Doubao model was first released less than a year earlier. Going forward, as deep thinking and visual reasoning mature further and AI cloud infrastructure keeps improving, agents will drive even larger token call volumes.
