Let’s talk about Agent in detail. Is it a “colleague” or a “tool”? What are the entrepreneurial opportunities and value?
Compiled by: Moonshot
Source: Geek Park
2025 is the year when Agent presses the accelerator.
From the stir caused by DeepSeek at the beginning of the year to the successive debuts of GPT-4o and Claude 3.5, the boundaries of large models have been redrawn again and again. Yet what has truly put the AI industry chain on edge is not model performance iteration, but the emergence of the Agent.
The popularity of products such as Manus and Devin reinforces a consensus: large models will no longer be mere tools, but intelligent entities capable of scheduling themselves.
Agent has thus become the fastest-forming consensus topic in the global technology community since large models themselves.
From the strategic restructuring of giants to the rapid follow-up on the startup track, Agent is becoming the direction everyone bets on. Yet while C-end products appear in dense succession and developers embrace them enthusiastically, projects that truly close the user-value loop are rare, and more and more products are mired in the anxiety of "fitting old needs to new technology."
After the heat wave, the market has cooled again: Is Agent a paradigm shift or just new packaging? Can the so-called division between "general" and "vertical" paths really open sustainable market space? And behind the "new entrance," is this an evolution of interaction, or a projection of the old world?
Following these questions, we will find that the real threshold of Agent may not lie in the model capability, but in the underlying infrastructure on which it depends. From the controllable operating environment to the memory system, context awareness, and tool calling, the absence of each basic module is the biggest obstacle for Agent to move from demonstration to practical use.
These underlying engineering problems constitute the biggest obstacle for Agent to move from a "trendy toy" to a "productivity tool", and are also precisely the most certain and highest-value entrepreneurial blue ocean at the moment.
At a stage where supply is overflowing and demand is unclear, we would like to use this conversation to answer an increasingly pressing question: Where are the real problems and opportunities for Agents hidden?
In this in-depth conversation, we invited Li Guangmi, founder of Shixiang Technology, and Zhong Kaiqi, AI Research Lead of Shixiang Technology, who are both on the front line. The two practitioners will analyze the real problems and opportunities of current agents from multiple dimensions such as product form, technical path, business model, user experience and even Infra construction.
We will follow their thinking and explore where the real opportunities for startups are hidden at the poker table surrounded by giants; how a pragmatic growth path that smoothly transitions from "Copilot" to "Agent" is verified step by step; and why coding, a seemingly vertical field, is regarded as the "highest value" and "key indicator" leading to AGI.
Ultimately, this conversation will extend into the future, providing a glimpse into the new collaborative relationship between humans and agents, as well as the core challenges and unlimited opportunities faced in building the next generation of intelligent infrastructure.
Highlights
- The best approach in the general agent field is “Model as Agent”.
- When developing an agent, you don’t have to “start with the end in mind” and aim for a fully automated agent from the beginning. You can start with Copilot. In this process, you can collect user data, improve user experience, occupy the user’s mind, and then slowly transform.
- AGI may be first realized in a coding environment, because this environment is the simplest and can exercise the core capabilities of AI. Coding is the "universal machine" in this world. With it, AI can build and create. Coding may take away the value of the entire large model industry at a certain stage.
- AI Native products are not just for humans; they must also serve AI. A true AI Native product should have a built-in two-way mechanism that serves both AI and humans.
- Today’s AI products are moving from “tools” to “relationships.” People don’t build relationships with tools, but they will build relationships with an AI that has memories, understands you, and can “be in tune” with you.
The following is the summary of the live broadcast of "Tech Talk Tonight" that day, compiled by Geek Park.
01 Which Agent products have emerged in this boom?
Zhang Peng: In the past period of time, everyone has been discussing Agent, believing that this may be an important issue at this stage and a rare development opportunity for startups.
I have seen that Shixiang Technology has conducted in-depth research on the Agent system and has also experienced and analyzed many related products. I would like to first hear from you two, which Agent-related products have left a deep impression on you recently? Why?
Li Guangmi: The two that impressed me the most are: one is Anthropic's Claude's programming ability, and the other is OpenAI ChatGPT's Deep Research function.
Regarding Claude, the main thing is its programming ability. My view is that programming (Coding) is the most critical leading indicator for measuring AGI. If AI cannot develop software applications end-to-end and at scale, progress in other fields will be slow. We must first achieve strong ASI (Artificial Superintelligence) in the coding environment before other fields can accelerate. In other words, we first realize AGI in a digital environment and then expand outward.
Devin, the world's first AI programmer | Source: Cognition Labs
Regarding Deep Research, it has been very helpful to me and I use it almost every day. It is actually a search agent that helps me retrieve a large number of web pages and materials. The experience is very good and it has greatly expanded my research space.
Zhang Peng: Kaiqi, from your perspective, which products have left a deep impression on you?
Cage: I can introduce the thinking model I usually follow when observing and using Agents, and then introduce one or two representative products in each category.
First of all, people often ask: general agent or vertical agent? We believe the best general agent is "Model as Agent". Take OpenAI's Deep Research, which Guangmi just mentioned, or OpenAI's newly released o3 model: o3 is a textbook example of "Model as Agent". It stitches all the components of an agent (the large language model, context, tool use, and environment) together and trains them end-to-end with reinforcement learning. The result is a model that can complete the information-retrieval tasks performed by all kinds of agents.
So my "bold theory" is: the demand for general agents boils down to information retrieval and light code writing, and GPT-4o already handles both categories well. The general agent market is therefore the main battlefield of large model companies, and it is hard for startups to grow by serving only general needs.
The startups that impressed me the most are basically focused on vertical fields.
Starting with the vertical ToB field, we can divide people's work into front-office work and back-office work.
Back-office work is highly repetitive, demands high concurrency, and usually follows a long SOP (Standard Operating Procedure). Many of these tasks are well suited for AI Agents to take over, and for reinforcement learning in a relatively large exploration space. The most representative examples I want to share here are some AI-for-Science startups building Multi-agent systems.
In this system, various scientific research tasks are included, such as literature retrieval, experimental planning, prediction of frontier progress, and data analysis. Its feature is that it is no longer a single agent like Deep Research, but a very complex system that can achieve higher resolution for scientific research systems. It has a very interesting function called "Contradiction Finding", which can handle adversarial tasks, such as finding contradictions between two top journal papers. This represents a very interesting paradigm in research agents.
Front-office work often involves dealing with people and handling external liaison. Voice agents are currently the best fit here: nurse follow-up calls in healthcare, recruitment, logistics communication, and so on.
Here I want to share a company called HappyRobot. They found a scenario that sounds small, specializing in telephone communication in the field of logistics and supply chain. For example, when a truck driver encounters a problem or the goods arrive, the agent can quickly call him. This is a very special ability of AI Agent: responding and reacting quickly 24 hours a day, 7 days a week. This is enough for most logistics needs.
In addition to the above two categories, there are some special ones, such as Coding Agent.
02 From Copilot to Agent, is there a more pragmatic growth path?
Zhong Kaiqi: In the field of code development, there has been a lot of enthusiasm for entrepreneurship recently. A good example is Cursor. The release of Cursor 1.0 basically turned a product that originally looked like a Copilot (assisted driving) into a complete Agent product. It can operate asynchronously in the background and has a memory function, which is exactly what we imagined an Agent to be.
The comparison between it and Devin is very interesting. The lesson is that when developing an agent, you don't have to "start with the end in mind" and aim for a fully automated agent from day one. You can start with Copilot, and along the way collect user data, improve the user experience, and occupy the user's mind, then transform gradually. Manus, which has done well in China, also started with a Copilot as its earliest product.
Finally, I will use the "environment" thinking model to distinguish different agents. For example, Manus's environment is a virtual machine, Devin's environment is a browser, flowith's environment is a notebook, SheetZero's environment is a table, Lovart's environment is a canvas, etc. This "environment" corresponds to the definition of environment in reinforcement learning, which is also a classification method worth referring to.
Flowith, created by a domestic startup team | Source: Flowith
Zhang Peng: Let’s talk in depth about the Cursor example. What is the technology stack and growth path behind it?
Cage: The example of autonomous driving is very interesting. To this day, Tesla still doesn’t dare to remove the steering wheel, brakes, and accelerator. This shows that AI cannot completely surpass humans in many key decisions. As long as AI’s capabilities are similar to those of humans, some key decisions will definitely require human intervention. This is exactly what Cursor thought clearly from the beginning.
So the first feature they adapted was the one that humans need most: autocompletion. They made this feature a tab-triggered one. With the emergence of models like Claude 3.5, Cursor increased the accuracy of Tab to more than 90%. With this accuracy, I can use it 5 to 10 times in a task flow, and the flow experience appears. This is the first stage of Cursor as Copilot.
In the second phase, they worked on the feature of code refactoring. Both Devin and Cursor wanted to do this, but Cursor did it more cleverly. It would pop up a dialog box, and when I entered the requirement, it would start a parallel modification mode outside the file to refactor the code.
When this feature was first released, the accuracy was not high, but because users expected it to be a Copilot, everyone accepted it. And they accurately predicted that the model's coding ability would improve rapidly. So while they were polishing the product features and waiting for the model's ability to improve, the Agent's ability emerged smoothly.
The third step is the Cursor state we see today, a relatively end-to-end Agent running in the background. It has a sandbox-like environment behind it, and I can even assign tasks I don’t want to do to it while at work, and it can use my computing resources in the background to complete them. At the same time, I can focus on the core tasks I want to do most.
Finally, it tells me the result in the form of asynchronous interaction, just like sending an email or Feishu message. This process smoothly realizes the transformation from Copilot to Autopilot (or Agent).
The key is to grasp people's interactive mentality and make users more willing to accept synchronous interaction from the beginning, so that a large amount of user data and feedback can be collected.
03 Why is Coding the “key testing ground” on the road to AGI?
Zhang Peng: Guangmi just said, "Coding is the key to AGI. If we can't achieve ASI (super intelligence) in this field, it will be difficult in other fields." Why?
Li Guangmi: There are several logics. First, the data of Code is the cleanest and easiest to close the loop, and the results are verifiable. I have a guess that Chatbot may not have a data flywheel (a feedback loop mechanism that continuously optimizes AI models by collecting data from interactions or processes, thereby generating better results and more valuable data). But the Code field has the opportunity to run a data flywheel because it can perform multiple rounds of reinforcement learning, and Code is the key environment for running multiple rounds of reinforcement learning.
On the one hand, I understand Code as a programming tool, but I prefer to think of it as an environment for realizing AGI. AGI may be realized first in this environment because this environment is the simplest and it can exercise the core capabilities of AI. If AI cannot even develop an end-to-end application software, it will be even more difficult in other fields. If it cannot replace basic software development work on a large scale in the future, it will also be difficult in other fields.
Moreover, as the coding ability improves, the model’s ability to follow instructions will also improve. For example, when dealing with very long prompts, Claude is obviously better. We guess this is logically related to its coding ability.
Another point is that I think the future AGI will be realized in the digital world first. In the next two years, Agents will be able to do almost everything that people do on their phones and computers. On the one hand, it can be done through simple coding, and if that doesn't work, it can also call other virtual tools. Therefore, it is a big logic to realize AGI in the digital world first and make it run faster.
04 How to judge a good Agent?
Zhang Peng: Coding is the "universal machine" in this world. With it, AI can build and create. Moreover, the field of programming is relatively structured, which is suitable for AI to play. When evaluating the quality of an agent, in addition to user experience, from what perspective do you evaluate the potential of an agent?
Cage: A good Agent must first have an environment that helps build a data flywheel, and the data itself must be verifiable.
Recently, Anthropic researchers have frequently mentioned the term RLVR (Reinforcement Learning with Verifiable Rewards), where the "V" refers to verifiable rewards. Code and mathematics are the standard verifiable fields: after completing a task, you can immediately check whether the result is right or wrong, and the data flywheel is naturally established.
The working mechanism of the data flywheel | Image source: NVIDIA
Therefore, building an agent product is to build such an environment. In this environment, the success or failure of the user's task is not important, because the current agent will definitely fail. The key is that when it fails, it can collect data with signals, rather than noise data, to guide the optimization of the product itself. This data can even be used as cold start data for reinforcement learning environments.
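As a minimal sketch of what "verifiable" means in the code domain, consider a binary reward that runs a candidate solution against unit tests. The harness below and the assumed entry-point name `solve` are illustrative only, not any lab's actual RLVR setup:

```python
def verifiable_reward(candidate_src: str, test_cases: list) -> int:
    """Binary verifiable reward (the 'V' in RLVR): 1 if the candidate
    code passes every test case, 0 otherwise."""
    namespace = {}
    try:
        exec(candidate_src, namespace)      # run the candidate definition
        fn = namespace["solve"]             # assumed entry point
        return 1 if all(fn(x) == y for x, y in test_cases) else 0
    except Exception:
        return 0  # any crash is an unambiguous failure signal

# A correct candidate earns reward 1; a broken one earns 0.
good = "def solve(x):\n    return x * 2"
bad = "def solve(x):\n    return x + 2"
tests = [(1, 2), (3, 6)]
print(verifiable_reward(good, tests), verifiable_reward(bad, tests))  # → 1 0
```

The point is that the reward needs no human judgment: pass/fail is computed mechanically, which is exactly what makes code and math natural flywheel environments.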
Second, is the product sufficiently "Agent Native". That is, when designing a product, we must consider the needs of both people and agents. A typical example is The Browser Company. Why did it make a new browser? Because the previous Arc was designed purely to improve the efficiency of human users. When designing their new browser, many new features will be used by AI agents themselves in the future. This is very important when the underlying design logic of the product changes.
In terms of results, objective evaluation is also critical.
1. Task completion rate + success rate: First, the task must be completed so that the user at least receives feedback. Second is the success rate: for a 10-step task, if each step is 90% accurate, the end-to-end success rate is only about 35%. The handoff between steps therefore has to be optimized. At present, the industry's passing line is probably a success rate above 50%.
2. Cost and efficiency: including computing cost (token cost) and user time cost. If GPT-4o runs a task in 3 minutes, while another agent takes 30 minutes, it will be a huge waste for the user. Moreover, the computing power consumption in these 30 minutes is huge, which will affect the scale effect.
3. User metrics: The most typical is stickiness. Are users willing to keep using the product after trying it? For example, the daily active/monthly active (DAU/MAU) ratio, next-month retention, payment rate, and so on. These are the fundamental indicators that keep a company from enjoying only "false prosperity" (fleeting popularity).
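The compounding effect behind that 35% figure is easy to check; a quick sketch, assuming independent per-step accuracy:

```python
def end_to_end_success(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step of a sequential task succeeds,
    assuming each step fails independently."""
    return step_accuracy ** num_steps

# A 10-step task at 90% per-step accuracy succeeds only ~35% of the time,
# which is why optimizing the handoffs between steps matters so much.
print(round(end_to_end_success(0.9, 10), 2))  # → 0.35
```

The same arithmetic shows why small per-step gains compound: at 95% per step, the same 10-step task succeeds about 60% of the time, already above the 50% "passing line" mentioned above.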
Li Guangmi: Let me add another perspective: how well the Agent matches current model capabilities. Today, 80% of an Agent's capability relies on the model engine. For example, when GPT reached 3.5, the general paradigm of multi-round dialogue emerged and the Chatbot product form became feasible. The rise of Cursor likewise came when models reached the level of Claude 3.5, which made its code-completion capability viable.
Devin actually came out quite early, so it is very important for the founding team to understand the boundaries of the model's capabilities. They need to know where the model can go today and in the next six months, which is closely related to the goals that the Agent can achieve.
Zhang Peng: What is an "AI Native" product? I think AI Native products are not just for people to use, they must also serve AI.
In other words, if a product does not have reasonable data to debug and is not built for the future AI working environment, then it only uses AI as a tool to reduce costs and increase efficiency. Such a product has limited vitality and can easily be overwhelmed by the wave of technology. A true AI Native product should have a built-in two-way mechanism to serve AI and humans. Simply put, when AI is serving users, are users also serving AI?
Cage: I like this concept very much. Agent data does not exist in the real world. No one will break down the thinking process step by step when completing a task. So what should we do? One way is to find a professional labeling company, and the other way is to leverage users and capture the users' real usage methods and the Agent's own operation process.
Zhang Peng: If we let humans “feed” data to AI through agents, what kind of tasks would be the most valuable?
Cage: Instead of asking how data can serve AI, it is better to ask which of AI's strengths should be magnified. Take scientific research: before AlphaGo, humans thought Go and mathematics were the hardest problems, but with reinforcement learning these turned out to be the easiest for AI. The same holds in science. For a long time now, no single scholar has been able to grasp every corner of every discipline, but AI can. So tasks like scientific research are hard for humans but not necessarily hard for AI, which is why we should find more data and services to support them. The rewards of this type of task are also more verifiable than most. In the future, humans may even help AI "shake the test tubes" and tell AI whether the result is right or wrong, lighting up the technology tree together.
Li Guangmi: Data must be cold-started at the beginning. Building an agent is like founding a startup: the founder has to do the cold start personally. Next, setting up the environment is crucial, because it determines the agent's direction. After that comes building the reward system. I think environment and reward are the two critical factors; on this basis, the agent entrepreneur just needs to be the agent's "CEO". Today, AI can already write code that humans cannot read but that runs. We don't necessarily have to understand the end-to-end logic of reinforcement learning; we just need to build the environment and set the rewards.
05 Where will Agent’s business model go?
Zhang Peng: Recently, we have seen a lot of ToB agents, especially in the United States. Have their business models and growth models changed? Or are there new models emerging?
Cage: The biggest feature now is that more and more products enter corporate organizations bottom-up, starting from the C-end. The most typical example is Cursor. Beyond it, there are many AI Agent or Copilot products that individual employees adopt first. This is no longer the traditional SaaS model of winning over the CIO and signing a big contract up front; at least that is no longer the first step.
Another interesting product is OpenEvidence, which is targeting doctors. They first conquered the doctor group, and then gradually implanted advertisements for medical devices and drugs. These businesses do not need to negotiate with hospitals at the beginning, because negotiating with hospitals is very slow. The most critical thing for AI startups is speed. It is useless to rely solely on technical moats. Growth needs to be achieved through this bottom-up approach.
AI medical unicorn OpenEvidence|Source: OpenEvidence
Regarding business models, there is now a trend of gradually moving from cost-based pricing to value-based pricing.
1. Cost-based: This is like traditional cloud services, adding a layer of software value on top of the CPU/GPU cost.
2. Pay per action: On the Agent side, one way is to charge by “action”. For example, the logistics agent I mentioned earlier charges a few cents for a phone call to a truck driver.
3. Charge by workflow: A higher level of abstraction is charging by "workflow", such as completing an entire logistics order. This is further from the cost side and closer to the value side, because it is really involved in the work. But this requires a relatively convergent scenario.
4. Pay by results: Going up one level, it is pay by results. Because the success rate of agents is not high, users want to pay for successful results. This requires the agent company to have a high level of product polishing capabilities.
5. Pay by Agent: In the future, we may actually pay by “Agent”. For example, there is a company called Hippocratic AI that makes AI nurses. In the United States, it costs about $40 per hour to hire a human nurse, while their AI nurses cost only $9 to $10 per hour, about a quarter of the cost. In a market like the United States where labor is expensive, this is very reasonable. If the Agent does even better in the future, I could even give it bonuses and year-end bonuses. These are all innovations in business models.
Li Guangmi: What we are most looking forward to is a value-based pricing method. For example, if Manus AI builds a website, is it worth $300? If it builds an application, is it worth $50,000? But the value of today's tasks is still difficult to price. How to establish a good measurement and pricing method is worth exploring for entrepreneurs.
In addition, Kaiqi just mentioned that payment is based on agent, which is just like a company signing a contract with its employees. In the future, when we hire an agent, do we need to issue it an "ID card"? Do we need to sign a "labor contract"? This is actually a smart contract. I am looking forward to how smart contracts in the Crypto field will be applied to agents in the digital world in the future. When the task is completed, economic benefits will be distributed through a good measurement and pricing method. This may be an opportunity to combine agents with Crypto smart contracts.
06 What will the collaborative relationship between humans and agents look like?
Zhang Peng: Recently in the field of Coding Agent, there are two terms that have been discussed a lot: "Human in the loop" and "Human on the loop". What are they discussing?
Cage: "Human on the loop" means that humans should reduce the number of decisions in the loop as much as possible and only participate at critical moments. It is a bit like Tesla's FSD. When the system encounters a dangerous decision, it will warn humans to take over the accelerator and brake. In the virtual world, this usually refers to non-instantaneous, asynchronous human-computer collaboration. People can intervene in key decisions that AI is unsure of.
"Human in the loop" is more like the AI "pinging" you from time to time to confirm something. For example, Manus shows a virtual machine on the right side, and I can watch what it does in the browser in real time. It's like an open white box: I can roughly see what the agent intends to do.
These two concepts are not black and white, but a spectrum. Now it is more "in the loop", and people still need to review and approve at many key points. The reason is simple, the software has not reached that stage yet, and someone has to be responsible if there is a problem. The accelerator and brake must not be removed.
It is foreseeable that for highly repetitive tasks, people will eventually read only the summaries, and the degree of automation will be very high. For some hard problems, such as having AI read pathology reports, we can raise the Agent's "false positive rate" a little so that it is more inclined to flag "there is a problem", and then, "on the loop", send those cases by email to human doctors. Human doctors then review more flagged cases, but every case the Agent judges "negative" passes through automatically. If only 20% of pathology reports are genuinely difficult, the working bandwidth of human doctors is magnified 5 times. So don't worry too much about "in" versus "on"; find the right combination point and human-machine collaboration works well.
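The arithmetic behind that 5x figure can be written down as a toy calculation; the 20% flag rate is the number from the conversation, and the model assumes the agent auto-clears everything it does not flag:

```python
def review_bandwidth_multiplier(flag_rate: float) -> float:
    """If the agent auto-clears all cases it judges negative and humans
    review only the flagged fraction, each doctor-hour covers
    1 / flag_rate as many cases as full manual review would."""
    return 1.0 / flag_rate

# If the agent flags ~20% of reports for human review (erring toward
# false positives so real problems are rarely missed), human reviewing
# bandwidth is effectively multiplied 5x.
print(review_bandwidth_multiplier(0.20))  # → 5.0
```

This also shows the trade-off in tuning the false-positive rate: flagging more cases (say 50%) shrinks the multiplier to 2x, but lowers the chance that a true problem slips through as a false negative.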
Li Guangmi: Behind the question asked by Brother Peng, there is actually a huge opportunity, which is "new interaction" and "how people and agents work together". This can be simply understood as online (synchronous) and offline (asynchronous). For example, when we broadcast a meeting live, we must be online in real time. But if I, as a CEO, assign tasks to my colleagues, the project progress is asynchronous.
The greater significance of this is that when agents are put into use on a large scale, it is worth exploring how people and agents can interact with each other, and how agents can interact with each other. Today, we still interact with AI through text, but in the future, there will be many ways to interact with agents. Some may run automatically in the background, while others require people to watch in front. Exploring new interactions is a huge opportunity.
07 Excess capacity and insufficient demand, when will Agent’s “killer application” appear?
Zhang Peng: Coding Agent is still an extension of IDE. Will there be any changes in the future? If everyone is crowded on this road, how can the latecomers catch up with Cursor?
Cage: IDE is just an environment, and replicating an IDE is not valuable in itself. But making an Agent in an IDE or another good environment is valuable in itself. I will think about whether its users are just professional developers, or whether it can be expanded to "civilian developers" beyond professional developers - those white-collar workers who have many automation needs.
What is lacking now? It is not the supply capacity, because products like Cursor have magnified AI's coding supply capacity by 10 times or even 100 times. In the past, if I wanted to make a product, I needed to outsource an IT team, and the cost of trial and error was very high. Now, in theory, I can try and error by just saying one word and paying a monthly fee of $20.
What is lacking now is demand. Everyone is using old demands to fit new technologies, which is a bit like "looking for nails with a hammer". Most of the current demands are for landing pages or basic toy websites. In the future, we need to find a convergent product form. This is a bit like when the recommendation engine came out, it was a very good technology. Later, a product form called "information flow" appeared, which really brought the recommendation engine to the public. But the AI Coding field has not yet found a killer product like "information flow".
Li Guangmi: I think Coding may, at a certain stage, capture the value of the entire large model industry. How can this value grow? Today's first act is still serving the world's 30 million programmers. An example: Photoshop serves 20 to 30 million professional designers worldwide, with a very high threshold. But once Jianying (CapCut), Canva, and Meitu Xiuxiu appeared, perhaps 500 million or more users could use these tools and create far more popular content.
Code has one advantage: it is a platform for creative expression. More than 90% of tasks in this society can be expressed through Code, so it has the potential to become a creative platform. In the past, the threshold for application development was very high and a large number of long-tail demands went unmet. When the threshold drops dramatically, these demands will be unleashed. What I look forward to is an "explosion of applications". The largest output of the mobile Internet was content; the largest content produced by this wave of AI may be new application software. It is like the difference between long-video platforms such as Youku and iQiyi versus Douyin. Compare the big model to a camera: on top of it you can build killer applications like Douyin and Jianying. This may be the essence of so-called "Vibe Coding", a new creative platform.
Zhang Peng: To improve the output value of Agent, input also becomes very important. But in terms of products and technology, are there any ways to improve the quality of input to ensure better output?
Cage: When it comes to products, we cannot assume that it is the users’ fault if they cannot use the product well. The most critical word to focus on is “context”. Can an agent establish “context awareness”?
For example, if I write code in a large Internet company, the Agent will not only look at the code I have on hand, but also the entire company's codebase, and even my conversations with product managers and colleagues in Feishu, as well as my previous coding and communication habits. Giving all these contexts to the Agent will make my input more efficient.
Therefore, for Agent developers, the most important thing is to make the connection between the memory mechanism and the context good enough, which is also a major challenge for the Agent infrastructure.
Agent challenges: good memory mechanism and context connection | Image source: Retail Science
In addition, it is also important for developers to prepare cold-start data for reinforcement learning and to define clear rewards. What that reward really encodes is how to decompose a user's needs when they are not clearly expressed. For example, when I asked an unclear question, OpenAI's Deep Research would first ask four guiding questions, and in the process of interacting with it I was actually thinking my own needs through.
For today's users, the most important thing is to think about how to express their needs clearly and how to evaluate the results. You don't have to "start with the end in mind", but you should have a rough expectation of what good and bad output look like. When we write prompts, we should write them like code, with clear instructions and logic, to avoid a lot of wasted output.
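"Writing prompts like code" can be made concrete with a small contrast. The example below is purely illustrative; the file names and constraints are invented for the sketch.

```python
# Illustrative only: a vague request vs. a prompt "written like code",
# with explicit goal, inputs, constraints, and expected output.
vague = "Help me with my website."

structured = """Goal: build a one-page personal site.
Inputs: resume.pdf, three project links.
Constraints: static HTML/CSS only, no frameworks.
Output: a single index.html, followed by a list of assumptions you made.
"""

# Stating a rough expectation of good vs. bad output up front gives the
# model something concrete to satisfy or to push back on.
print(structured)
```

The structured version does what Cage describes: it spells out the need and a success criterion before any generation happens.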
Li Guangmi: I would like to add two points. First, the importance of context. We often discuss internally that if the context is done well, there will be new opportunities at the level of Alipay and PayPal.
In the past, e-commerce companies looked at GMV, but in the future, they will look at task completion rate. Task completion is about intelligence on one hand and context on the other. For example, if I want to build a personal website, if I provide my Notion notes, WeChat data, and email data to AI, then the content of my personal website will definitely be very rich.
Second, autonomous learning. After setting up the environment, the agent must be able to iterate, which is very critical. If it cannot continue to learn and iterate, the result will be that it will be eaten up by the model itself, because the model is a learning system. In the last wave of mobile Internet, companies that did not do machine learning and recommendation did not grow big. In this wave, if the agent cannot do end-to-end autonomous learning and iteration well, I think it will not be able to succeed.
08 What other changes and opportunities are there amid the competition among giants?
Zhang Peng: How do we determine whether the capabilities of future agents will appear in the form of a super interface or be discretely distributed in various scenarios?
Cage: I see a big trend. First, it is definitely multi-agent. Even if it is to complete a task, in products like Cursor, the agents doing code completion and unit testing may be different, because they require different "personalities" and are good at different things.
Second, will the entrance change? I think the entrance is a second-order problem. What needs to happen first is that everyone has many agents and collaborates with them. Behind these agents will be a network I call a "Botnet". For example, in the future, more than 60% of my routine purchases may be completed by agents.
The same is true in productivity scenarios. In the future, programmers' daily meetings may be replaced by collaboration between agents, which will push alerts on abnormal metrics and on product development progress. When these things happen, changes in the entry point may appear. At that point, API calls will no longer be made mainly by humans, but between agents.
Zhang Peng: What are the Agent strategies and moves of the capable giants, such as OpenAI, Anthropic, Google, and Microsoft?
Li Guangmi: The key word in my mind is "differentiation". Last year, everyone was chasing GPT-4, but now there are more things that can be done, and each company has begun to differentiate.
The first to diverge was Anthropic. Because it started later than OpenAI and its all-around capabilities were not as strong, it focused on coding. I feel it has drawn the first big card on the road to AGI: the Coding Agent. They may believe AGI can be reached through coding, since coding brings instruction-following and Agent capabilities, which is a logically self-consistent closed loop.
But OpenAI holds even bigger cards. The first is ChatGPT, which Sam Altman may want to turn into a product with 1 billion daily active users. The second is its "o" series of reasoning models (o1, etc.), which are expected to bring more generalization capability. The third is multimodality: its multimodal reasoning ability has improved, and this will also show up in generation in the future. So Anthropic has drawn one big card, while OpenAI has drawn three.
Another big company is Google. I think by the end of this year, Google may catch up in all aspects. Because it has TPU, Google Cloud, the top Gemini model, Android and Chrome. You can't find another company in the world that has all these elements and is almost independent of external parties. Google's end-to-end capabilities are very strong. Many people are worried that its advertising business will be disrupted, but I feel that it may find new ways to combine products in the future and transform from an information engine to a task engine.
Look at Apple: because it does not have its own AI capabilities, it is very passive in iteration. Microsoft is known for its developers, but Cursor and Claude have actually grabbed much of developers' attention. Of course, Microsoft's base is very stable, with GitHub and VS Code, but it must also have very strong AGI and model capabilities. So you can see it announced that Claude has become one of GitHub's preferred models, and it has iterated its own developer products. Microsoft must hold on to the developer side, or its foundation is gone.
So everyone started to split up. Maybe OpenAI wants to be the next Google, and Anthropic wants to be the next Windows (living on API).
Zhang Peng: What are the changes and opportunities in Agent-related infrastructure?
Cage: Agent has several key components. Besides the model, the first is the environment. In the early stages of Agent development, 80% of the problems were due to the environment. For example, the early AutoGPT either started with Docker, which was very slow, or was deployed directly on the local machine, which was inefficient and unsafe. If an agent wants to "work" with me, I have to equip it with a "computer", and that is where the environment opportunity comes from.
There are two major requirements for configuring a computer:
1. Virtual Machine/Sandbox: Provide a safe execution environment. If the task is done wrong, it can be rolled back. The execution process cannot harm the actual environment. It must be able to start quickly and run stably. Companies like E2B and Modal Labs are providing such products.
2. Browser: Information retrieval is the biggest demand, and agents need to crawl information from various websites. Traditional crawlers are easily blocked, so agents need to be equipped with a dedicated browser that can understand information. This gave rise to companies such as Browserbase and Browser Use.
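The virtual-machine/sandbox requirement can be illustrated with a deliberately minimal sketch: run untrusted agent code in a throwaway working directory with a hard timeout. This is only an illustration of the interface; real products such as E2B or Modal use VMs and containers for genuine isolation and fast, stable startup, which a temp directory does not provide.

```python
# Minimal sketch of the sandbox idea: execute agent-generated code in a fresh
# temporary directory (discarded afterwards, a crude "rollback") under a hard
# timeout. NOT real isolation -- real sandboxes use VMs/containers.
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout: int = 5) -> tuple[int, str]:
    """Run Python code in a throwaway working dir; return (exit code, stdout)."""
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            cwd=workdir,                  # file writes land in the throwaway dir
            capture_output=True,
            text=True,
            timeout=timeout,              # runaway tasks are killed
        )
        return proc.returncode, proc.stdout

rc, out = run_in_sandbox("open('scratch.txt', 'w').write('hi'); print('done')")
print(rc, out.strip())  # → 0 done
```

The two properties the passage names, rollback on error and protection of the real environment, map here to the discarded temp directory and the child process boundary; a production sandbox hardens both.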
The second component is the context. This includes:
- Information retrieval: Traditional RAG companies are still around, but there are also new ones, such as MemGPT, which builds lightweight memory and context-management tools for AI agents.
- Tool discovery: There will be a huge number of tools in the future, and we will need a Dianping-like platform to help agents discover and select useful tools.
- Memory: The agent needs infrastructure that can emulate the complex combination of human long-term and short-term memory.
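The long/short-term memory combination mentioned above can be sketched as a bounded buffer of recent turns plus a searchable archive. This design, and the keyword lookup standing in for vector search, are my illustrative assumptions, not how MemGPT or any specific product actually works.

```python
# Hedged sketch of an agent memory split: a small short-term buffer that is
# always in context, plus a long-term archive searched on demand.
from collections import deque

class AgentMemory:
    def __init__(self, short_term_size: int = 4):
        self.short_term = deque(maxlen=short_term_size)  # recent turns, always in context
        self.long_term: list[str] = []                   # full history, queried when needed

    def remember(self, item: str) -> None:
        self.short_term.append(item)   # oldest entries fall out automatically
        self.long_term.append(item)    # archive keeps everything

    def recall(self, query: str) -> list[str]:
        """Naive keyword retrieval standing in for embedding/vector search."""
        return [m for m in self.long_term if query.lower() in m.lower()]

mem = AgentMemory()
for note in ["User prefers dark mode", "Deploy target is AWS", "User's name is Li"]:
    mem.remember(note)
print(mem.recall("aws"))  # → ['Deploy target is AWS']
```

The design choice mirrors the human analogy in the text: working memory is small and cheap to consult, while long-term memory is large and needs a retrieval step.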
The third component is tools, which include simple search, complex payment, automated backend development, etc.
Finally, when the agent becomes more powerful, an important opportunity is agent security.
Li Guangmi: Agent Infra is very important. We can reason backwards from the end state: three years from now, when trillions of agents are performing tasks in the digital world, the demand for Infra will be enormous, and it will reshape cloud computing and the entire digital world.
But today we still don’t know what kind of Agent can grow big and what kind of Infra it needs. So now is a very good window period for entrepreneurs to co-design and co-create Infra tools with those Agent companies that are doing well.
I think the most important things today are, first, virtual machines, and second, tools. For example, future agent search will certainly differ from human search, and there will be huge demand for machine search. Today, human searches across the whole web may total 20 billion a day; in the future, machine searches may reach hundreds of billions or even trillions. This kind of search does not need ranking optimized for humans; a large database may be enough. There are great opportunities here for cost optimization and entrepreneurship.
09 When AI is no longer just a big model, in which direction will it evolve?
Zhang Peng: Agents can never get around models. Looking back today, what key steps do you think model technology has taken in the past two years?
Li Guangmi: I think there are probably two key milestones. One is the Scaling Law paradigm represented by GPT-4, which means that in the pre-training stage, scaling up is still effective and can bring universal generalization capabilities.
The second major milestone is the paradigm of "models thinking" represented by the "o" series of models. It significantly improves reasoning ability through longer thinking time (thinking chain).
I think these two paradigms are the left and right arms of today's AGI. On this basis, the Scaling Law is far from stopping, and the thinking mode will continue. For example, Scaling can continue in multimodality, and the thinking ability of the "o" series can be added to multimodality, so that multimodality can have a longer reasoning ability, and the controllability and consistency of generation will become very good.
My feeling is that the next two years may see faster progress than the past two years. Today, we may be in a state where thousands of top AI scientists around the world are jointly promoting the Renaissance of human technology. With sufficient resources and platforms, breakthroughs may occur in many areas.
Zhang Peng: What technological advancements and leaps are you looking forward to seeing in the field of AI in the next one or two years?
Cage: The first is multimodality. At present, the understanding and generation of multimodality are still relatively scattered. In the future, it will definitely move towards "grand unification", that is, the integration of understanding and generation. This will greatly open up the imagination of products.
The second is autonomous learning. I really like the concept of "the era of experience" proposed by Richard Sutton (the father of reinforcement learning), that is, AI improves its ability through the experience of performing tasks online. This was not seen before because there was no foundation of world knowledge. But starting this year, this will be a continuous thing.
2024 Turing Award winner Richard Sutton | Source: Amii
The third is memory. If, at the product and technical level, the model can really do Agent memory well, the breakthrough will be huge, and product stickiness will truly appear. The moment ChatGPT started to have memory, the app really became sticky for me.
Finally, there are new interactions. Will there be new interactions that are no longer text input boxes? Because typing is actually a very high threshold. Will there be more intuitive and instinctive interactions in the future? For example, I have an "always-on" AI product that constantly listens to me and thinks asynchronously in the background, and can capture key context at the moment when I get inspiration. I think these are what I am looking forward to.
Zhang Peng: Indeed, today we face both challenges and opportunities. On the one hand, we cannot be overwhelmed by the speed of technological development and must maintain continuous attention. On the other hand, today's AI products are moving from "tools" to "relationships." People will not establish a relationship with tools, but will establish a relationship with an AI that has memory, understands you, and can "be in tune" with you. This kind of relationship is essentially habit and inertia, which is also an important barrier in the future.
Today's discussion was very in-depth. Thanks to Guangmi and Kaiqi for their wonderful sharing. Thanks to the audience in the live broadcast room for their company. See you in the next issue of "Tonight's Technology Talk".
Li Guangmi: Thanks.
Cage: Thanks.