Article source: China Entrepreneur Magazine
Image source: Generated by AI
“Abandon generative models, abandon LLMs (large language models); we cannot get AI to reach human-level intelligence through text training alone.” Recently, Yann LeCun, Meta’s chief AI scientist, once again took aim at generative AI at the 2025 Artificial Intelligence Action Summit in Paris, France.
LeCun believes that although existing large models run efficiently, their reasoning process is divergent: each generated token may fall outside the range of reasonable answers, which is why some large models hallucinate. Although today’s generative models let AI pass the bar exam and solve math problems, they cannot do housework; what humans do without thinking remains very hard for generative AI.
He also said that generative models are not suited to making video at all. The video-generation AI models we see today do not understand the physical world; they merely produce beautiful pictures. LeCun favors models that can understand the physical world. He proposed the Joint Embedding Predictive Architecture (JEPA), which he argues is better suited to predicting video content. He has long maintained that only when AI truly understands the physical world can it reach intelligence comparable to humans.
Finally, LeCun stressed the need for open-source artificial intelligence platforms. In the future we will have universal virtual assistants that mediate all our interactions with the digital world. They will need to speak every language in the world and understand every culture, value system, and center of interest; such AI systems cannot come from a handful of companies in Silicon Valley. They must be built collaboratively.
His key points are as follows:
1. We need human-level intelligence because we are used to interacting with people. We look forward to the emergence of AI systems with human intelligence. In the future, ubiquitous AI assistants will become a bridge between humans and the digital world, helping humans better interact with the digital world.
2. There is no way we can get AI to reach the level of human intelligence just through text training. This is impossible.
3. At Meta, we call this kind of AI that can reach the level of human intelligence advanced machine intelligence. We don’t like the term “AGI” (artificial general intelligence), so we call it “AMI”, which is pronounced like the French word for “friend”.
4. Generative models are not suited to making video at all. You may have seen AI models that generate videos, but they don’t really understand physics; they just produce beautiful pictures.
5. If you are interested in AI that reaches human-level intelligence and you are in academia, don’t study LLMs, because you would be competing against teams of hundreds of people with tens of thousands of GPUs, which makes no sense.
6. AI platforms need to be shared. They must speak every language in the world and understand all cultures, all value systems, and all centers of interest. No single company in the world can train such a foundation model; it must be built collaboratively.
7. Open-source models are slowly but surely overtaking closed-source models.
The following is the full text of the talk (abridged):
Why we need AI at the level of human intelligence
As we all know, we need human-level artificial intelligence. This is not only an interesting scientific issue, but also a product requirement. In the future, we will wear smart devices, such as smart glasses, and use these smart devices to access AI assistants and interact with them at any time.
We need human-level intelligence because we are used to interacting with people. We look forward to the emergence of AI systems with human intelligence. In the future, ubiquitous AI assistants will become a bridge between humans and the digital world, helping humans interact with it better. However, compared with humans and animals, current machine learning is still very limited. We have not yet built machines with human-level learning ability, common sense, or an understanding of the material world. Both animals and humans can plan actions based on common sense, and those behaviors are essentially goal-driven.
So the artificial intelligence systems that almost everyone is using today do not have the properties we want, because they generate tokens one after another, autoregressively, using the preceding tokens to predict the next one. The way these systems are trained is to feed information in at the input and get them to reproduce that information at the output. The structure is causal: the system cannot cheat by looking at the token it is trying to predict; it can only look at the tokens that came before. This makes it very efficient. People call these general-purpose large models, and you can use them to generate text and pictures.
But this reasoning process is divergent. Every token you generate may fall outside the range of reasonable answers and take you farther and farther from the correct one. When that happens, there is no way to recover. This is why some large models hallucinate and produce nonsense.
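To make the divergence argument concrete: if each generated token has some probability of stepping outside the set of acceptable continuations, and those missteps are irreparable, the chance that a long answer stays on a correct path shrinks exponentially. A tiny illustrative Python sketch (the per-token error rate is an assumed, made-up number, and independence between tokens is a simplification of the argument, not a measured property of any real model):

```python
# Illustrative only: exponential drift of autoregressive generation.
# 'e' is a hypothetical per-token probability of leaving the set of
# acceptable continuations; errors are assumed independent.
e = 0.01  # assumed per-token error rate
for n in (10, 100, 1000):
    p_correct = (1 - e) ** n
    print(f"{n:>5} tokens: P(still on a correct path) ~= {p_correct:.3f}")
```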
Today’s artificial intelligence cannot replicate human intelligence; we cannot even replicate the intelligence of animals such as cats or mice, which understand how the physical world works and can carry out common-sense actions without planning. A 10-year-old child can clear the dishes and clean the table without being taught, and a 17-year-old can learn to drive in about 20 hours. Yet we still cannot build a robot that works reliably in the home, which shows that something very important is still missing from today’s AI research and development.
Our existing AI can pass bar exams, solve math problems, and prove theorems, but it cannot do housework. What we think of as effortless is very hard for AI and robots, while what we think of as uniquely human, such as language, playing chess, or writing poetry, today’s AI and robots can do easily.
There is no way we can get AI to reach human-level intelligence just through text training. It is impossible. Some vested interests will tell you that AI will reach PhD-level intelligence next year, but that is simply not going to happen. AI may reach PhD level in certain narrow domains such as chess or translation, but a general-purpose model cannot. These models are trained on specific classes of problems: if your question is phrased in the standard way, the answer comes back in seconds, but if you change the wording slightly, the AI may give the same answer anyway, because it is not really thinking about the question. So it will take time to develop an artificial intelligence system that reaches human-level intelligence.
Not “AGI” but “AMI”
At Meta, we call this kind of AI that can reach the level of human intelligence advanced machine intelligence. We don’t like the term “AGI” (artificial general intelligence), so we call it “AMI”, which is pronounced like the French word for “friend”. We need models that collect information through their senses and learn from it, that we can manipulate in our minds, and that learn intuitive physics from video; systems with persistent memory, systems that can plan actions hierarchically, systems that can reason, and systems that are controllable and safe by design rather than by fine-tuning.
Now, I believe the only way to build such systems is to change the way current artificial intelligence systems reason. The way current LLMs reason is to run a fixed number of neural network layers (a Transformer) to produce a token, feed that token back in, and run the fixed number of layers again. The problem with this kind of reasoning is that whether you ask a simple question or a complex one, whether you ask the system to answer “yes” or “no”, it spends exactly the same amount of computation on the answer. So people cheat by telling the system how to answer: humans know this reasoning trick, so they prompt the system to generate more tokens and thereby spend more compute on the question.
In fact, this is not how reasoning works. In many different fields, such as classical statistical artificial intelligence and structured prediction, reasoning works like this: you have a function that measures the compatibility (or incompatibility) between your observation and your output, and inference consists of searching for the output value that minimizes that function. We call this function an energy function. The system reasons by optimizing: if the reasoning problem is harder, it spends more time on inference. In other words, it thinks longer about complex problems.
Many things in classical artificial intelligence are related to reasoning and search, so essentially any computational problem can be reduced to an optimization, inference, or search problem. This type of reasoning is similar to what psychologists call System 2: thinking about what you will do before acting. System 1 covers the things you can do without thinking, which have become subconscious.
Source: Video screenshot
Let me briefly explain the energy-based model: we capture the dependency between variables through an energy function. Given an observation X and an output Y, the energy function takes a low value when X and Y are compatible and a high value when they are incompatible. You don’t want to compute Y directly from X; you just want an energy function that measures the degree of incompatibility, so that given an X you can search for a Y with low energy.
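As a minimal sketch of what “inference as energy minimization” means, here is a toy scalar energy function minimized over Y by gradient descent with PyTorch. The quadratic form of the energy and all the numbers are assumptions made for illustration; a real energy function would be a learned network:

```python
import torch

def energy(x, y):
    # Toy energy: low when y is compatible with x (here, y close to 2*x),
    # high otherwise. A real energy function would be learned from data.
    return ((y - 2.0 * x) ** 2).sum()

x = torch.tensor(3.0)                    # observation X
y = torch.zeros(1, requires_grad=True)   # output Y we search over
opt = torch.optim.SGD([y], lr=0.1)

for _ in range(100):                     # inference = optimization over Y
    opt.zero_grad()
    E = energy(x, y)
    E.backward()
    opt.step()

print(y.item())  # converges near 6.0, the low-energy (compatible) output
```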
Now let’s look more closely at how the architecture of a world model is built and how it relates to thinking and planning. The system works like this: observing the world requires a perception module, which summarizes the state of the world. Of course, the state of the world is not fully observable, so you may need to combine it with memory. The memory contains your beliefs about the state of the world, and the combination of the two feeds a world model.
So what is a world model? Given a summary of the current state of the world and a sequence of actions you imagine taking, in an abstract representation space, the world model predicts the state of the world after you take those actions. If I ask you to imagine a cube floating in front of you and then rotate it 90° about a vertical axis, what would it look like? You can easily picture in your mind what it looks like after the rotation.
I think we will have world models that really work from video before we have human-level intelligence. If we had such a world model that could predict the outcome of a sequence of actions, we could feed it a task objective and use it to measure how well the predicted final state satisfies the goal we set for ourselves. That is just an objective function. We can also add constraints and treat them as requirements the system must satisfy to operate safely. With these constraints you can guarantee the safety of the system, because it cannot override them: they are hard requirements, imposed by design rather than learned during training or inference.
Now, for a sequence of actions, the same world model should be used repeatedly across multiple time steps. If you take the first action, it predicts the state after that action; if you take the second action, it predicts the next state; and so on along the trajectory, with task objectives and constraints attached. If the world is not fully deterministic and predictable, the world model may need latent variables to account for everything about the world we have not observed, which makes the predictions uncertain. Ultimately, what we want is a system that can plan hierarchically. It may have several levels of abstraction: at the lower levels we plan low-level actions, such as basic muscle control, while at the higher levels we plan abstract macro-actions. For example, sitting in my office at New York University, I decide to go to Paris. I can divide this task into two sub-tasks: getting to the airport and catching a plane. Then I plan each step in detail: grab my bag, go outside, hail a taxi, take the elevator, buy a ticket…
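Here is a rough sketch of the rollout just described: one world model reused across time steps, and an action sequence optimized against a task objective plus a guardrail term standing in for the safety constraints. All the modules, dimensions, and the specific cost below are placeholder assumptions for illustration, not Meta’s actual architecture:

```python
import torch
import torch.nn as nn

D_STATE, D_ACT, HORIZON = 16, 4, 5

encoder = nn.Linear(32, D_STATE)                    # perception: observation -> abstract state
world_model = nn.Linear(D_STATE + D_ACT, D_STATE)   # predicts the next abstract state

def cost(state, goal, constraint_weight=10.0):
    # Task objective: distance to the goal state in representation space,
    # plus a placeholder penalty acting as a safety guardrail.
    task = ((state - goal) ** 2).sum()
    guardrail = constraint_weight * torch.relu(state.abs().max() - 5.0)
    return task + guardrail

obs = torch.randn(32)
goal = torch.randn(D_STATE)
actions = torch.zeros(HORIZON, D_ACT, requires_grad=True)
opt = torch.optim.Adam([actions], lr=0.05)

for _ in range(200):                                 # planning = optimizing the action sequence
    opt.zero_grad()
    s = encoder(obs)
    total = 0.0
    for t in range(HORIZON):                         # reuse the same world model at every step
        s = world_model(torch.cat([s, actions[t]]))
        total = total + cost(s, goal)
    total.backward()
    opt.step()
```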
We usually don’t feel that we are doing hierarchical planning for these things; they are almost subconscious actions, but we don’t know how to make machine learning do this. Every hierarchical planning system we have today has its levels of abstraction designed by hand. We need to train an architecture that can learn these abstract representations by itself: not only the state of the world, but also the predictions about the world and abstract actions at different levels of abstraction, so that machine learning can plan hierarchically as effortlessly as humans do.
How to make AI understand the world
With all these reflections in mind, I wrote a long paper three years ago explaining the areas I believe artificial intelligence research should focus on. I wrote that paper before ChatGPT exploded, and to this day my views on the question have not changed. ChatGPT didn’t change anything. The paper was about a path towards autonomous machine intelligence, which we now call advanced machine intelligence because the word “autonomous” scares people. I have presented it in talks on various occasions.
A common way to get a system to understand how the world works is to take the recipe we used to train natural language systems and apply it to video: if a system can predict what happens in a video, show it a short clip and ask it to predict what comes next. Training it to make predictions should let it learn the underlying structure of the world. This works for text because predicting words is relatively simple: the number of words, or tokens, is finite. We cannot predict exactly which word follows another, or which word is missing from a text, but we can produce a probability for every word in the dictionary.
But we cannot do this with images or video. We do not have a good way to represent distributions over video frames, and every time we try, we run into mathematical difficulties. You can try to work around the problem with statistics and mathematics invented by physicists, but in fact it is best to abandon the idea of probabilistic modeling altogether.
Because we cannot predict exactly what will happen in the world, a system trained to predict individual frames will not do well. The way to solve this is a new architecture, which I call the Joint Embedding Predictive Architecture (JEPA). Generative models are not suited to making video at all. You may have seen AI models that generate videos, but they don’t really understand physics; they just produce beautiful pictures. The idea of JEPA is to encode both the observation and the output, so that instead of predicting pixels, the system predicts what happens in the video in an abstract representation space.
Source: Video screenshot
Let’s compare the two architectures. On the left is the generative architecture: you feed the observation X into an encoder and then directly predict Y; it is a straightforward prediction. In the JEPA architecture on the right, you run both X and Y through encoders, which may be the same or different, and then predict the representation of Y from the representation of X in this abstract space. The effect is that the system learns an encoder that eliminates everything it cannot predict, and that is really the point.
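Here is a minimal sketch of the JEPA training objective as just described: encode X and Y separately and regress the representation of Y from the representation of X. The architecture sizes, the plain MSE loss, and the crude stop-gradient on the target branch are my assumptions for illustration; real JEPA variants use extra machinery (EMA target encoders, variance/covariance regularization) to prevent representational collapse:

```python
import torch
import torch.nn as nn

enc_x = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16))  # encoder for X
enc_y = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16))  # encoder for Y
predictor = nn.Linear(16, 16)   # predicts repr(Y) from repr(X) in the abstract space

opt = torch.optim.Adam(list(enc_x.parameters()) + list(predictor.parameters()), lr=1e-3)

x = torch.randn(8, 64)            # batch of observations
y = x + 0.1 * torch.randn(8, 64)  # toy "outcome" related to x

for _ in range(100):
    opt.zero_grad()
    sy = enc_y(y).detach()        # target representation (no gradient: a crude stand-in
                                  # for the stop-gradient / EMA tricks real JEPAs use)
    sx = enc_x(x)
    loss = ((predictor(sx) - sy) ** 2).mean()   # predict in representation space, not pixels
    loss.backward()
    opt.step()
```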
Imagine we are filming this room and the camera starts to move. Neither humans nor AI can predict who will appear in the next shot or what the texture of the walls or floor will look like; there are many things we simply cannot predict. So instead of insisting on making probabilistic predictions about things that cannot be predicted, we should give up predicting them and learn a representation in which all those details are essentially eliminated. Prediction becomes much easier, and we have simplified the problem.
There are various styles of JEPA architecture. I will set aside the latent-variable versions for now and talk about the action-conditioned ones, which are the most interesting part, because they really are models of the world. You have an observation X, the current state of the world; you encode it and feed in the action you plan to take; and the model gives you a prediction of the state of the world after you take that action. This is the world model, and this is how you plan.
Recently we have been doing in-depth work on Video JEPA. How does this model work? First, 16 consecutive frames are extracted from a video as the input sample; some of those frames are then masked or corrupted; the partially corrupted clip is fed into the encoder, and a predictor is trained to reconstruct the representation of the complete video from the incomplete information. Experiments show that this self-supervised learning approach has clear advantages: the deep features it learns transfer directly to downstream tasks such as video action classification, and it achieves excellent performance on multiple benchmarks.
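A sketch of one training step in that spirit, with frames reduced to toy feature vectors and a simple frame mask. The 16-frame clip length follows the text; the tiny linear modules and the specific masking scheme are simplifications I am assuming for illustration:

```python
import torch
import torch.nn as nn

FRAMES, D = 16, 128                       # 16-frame clips, toy per-frame features

encoder = nn.Linear(D, 64)                # per-frame encoder (stand-in for a video backbone)
predictor = nn.Linear(64 * FRAMES, 64 * FRAMES)  # fills in representations of masked frames

clip = torch.randn(FRAMES, D)             # one video clip
mask = torch.zeros(FRAMES, 1)
mask[4:8] = 1.0                           # frames 4-7 are masked / corrupted

target = encoder(clip).detach()           # representation of the full clip (no gradient)
corrupted = clip * (1 - mask)             # zero out the masked frames
pred = predictor(encoder(corrupted).flatten()).view(FRAMES, 64)

# Loss only on what was hidden: reconstruct the *representation*, not the pixels.
loss = ((pred - target) ** 2 * mask).mean()
loss.backward()
```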
One very interesting thing is that if you show the system a video in which something very strange happens, its prediction error spikes. You take a video, feed in 16 frames at a time, and measure the system’s prediction error. If something strange happens, such as an object spontaneously disappearing or changing shape, the prediction error rises. This tells you that although the system is simple, it has learned a certain level of common sense: it can tell you when something very strange is happening in the world.
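A sketch of that diagnostic: run a (hypothetically pretrained) frame encoder and next-step predictor over a clip, compute the per-step prediction error, and flag the steps where it spikes. The modules, threshold rule, and dimensions are invented for illustration:

```python
import torch
import torch.nn as nn

encoder = nn.Linear(128, 64)        # hypothetical pretrained frame encoder
predictor = nn.Linear(64, 64)       # hypothetical pretrained next-step predictor

clip = torch.randn(16, 128)         # 16 frames of toy features
with torch.no_grad():
    states = encoder(clip)
    # Per-step prediction error; a spike suggests the clip violates the
    # model's learned common sense (e.g. an object vanishing mid-video).
    errors = ((predictor(states[:-1]) - states[1:]) ** 2).mean(dim=1)

threshold = errors.mean() + 2 * errors.std()
print("suspicious steps:", (errors > threshold).nonzero().flatten().tolist())
```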
I would also like to share our latest work, DINO-WM, a new way to build dynamic visual world models without reconstructing the visual world. You take an image of the world, run it through the DINO encoder, and train a predictor on top of it. The robot then takes an action, which gives you the next video frame; you run that frame through the DINO encoder as well, obtaining a new representation, and you train your predictor to predict it from the previous representation and the action taken.
Planning is then very simple. You observe an initial state and run it through the DINO encoder, then use imagined actions to run the world model forward over multiple time steps. You also have a target state, represented by a goal image: you run it through the encoder too, then compute, in representation space, the difference between the predicted state and the representation of the goal image, and search for the action sequence with the lowest cost.
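A sketch of that planning loop using random shooting over candidate action sequences in representation space. The generic linear `encoder` below stands in for a frozen DINO encoder, and the dimensions, number of candidates, and horizon are invented for illustration:

```python
import torch
import torch.nn as nn

D_OBS, D_REP, D_ACT, HORIZON, N_CAND = 256, 64, 4, 8, 512

encoder = nn.Linear(D_OBS, D_REP)              # stand-in for a frozen DINO encoder
predictor = nn.Linear(D_REP + D_ACT, D_REP)    # latent world model: (state, action) -> next state

current_img = torch.randn(D_OBS)
goal_img = torch.randn(D_OBS)

with torch.no_grad():
    goal = encoder(goal_img)
    s = encoder(current_img).expand(N_CAND, D_REP)
    candidates = torch.randn(N_CAND, HORIZON, D_ACT)   # random candidate action sequences

    for t in range(HORIZON):                           # roll out the world model in latent space
        s = predictor(torch.cat([s, candidates[:, t]], dim=1))

    cost = ((s - goal) ** 2).sum(dim=1)                # distance to the goal representation
    best = candidates[cost.argmin()]                   # lowest-cost action sequence

print(best[0])   # first action to execute, model-predictive-control style
```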
Source: Video screenshot
It’s a very simple idea, but it works well. Suppose you have a small T-shaped object and you want to push it to a specific position. You know where it has to go because you put the image of that target position into the encoder, which gives you the target state in representation space. When the planned sequence of actions is executed, you can compare what actually happens in the real world with the system’s internal prediction of that sequence, decoded back into images of the predicted internal states.
Please abandon studying generative models
Finally, I have some suggestions to share with you. The first is to abandon generative models. They are the most popular approach right now and everyone is working on them; instead, you can work on JEPAs, which are not generative models: they predict what will happen in the world in representation space. Give up reinforcement learning; I have been saying for a long time that it is inefficient. If you are interested in AI that reaches human-level intelligence and you are in academia, don’t work on LLMs: you would be competing against teams of hundreds of people with tens of thousands of GPUs, which makes no sense. There are still many problems left for academia to solve. Planning algorithms are very inefficient and we need better ones. JEPAs with latent variables are a completely open problem, as is hierarchical planning under uncertainty. Scholars are welcome to explore these.
In the future, we will have universal virtual assistants that accompany us at all times and mediate all our interactions with the digital world. We cannot let these AI systems come from just a few companies in Silicon Valley or China, which means the platforms used to build them need to be open source and widely available. These systems are expensive to train, but once you have a foundation model, fine-tuning it for a specific application is relatively cheap and affordable for many people.
AI platforms need to be shared. They need to speak every language in the world and understand all cultures, all value systems, and all centers of interest. No single company in the world can train such a foundation model; it must be built collaboratively.
Therefore, an open-source artificial intelligence platform is necessary. The danger I see in Europe and elsewhere is that geopolitical competition tempts some governments to effectively outlaw the release of open-source models, because they want to keep scientific secrets in order to stay ahead. This is a huge mistake. When you do research in secret, you inevitably fall behind; what happens is that the rest of the world adopts open-source technology and overtakes you. This is what is happening right now: open-source models are slowly but surely overtaking closed-source models.