
Why can't Sora be a world model?

By Wang Zhiyuan

After writing an article on spatial intelligence, I shared it in a group chat and talked about how spatial intelligence uses data from virtual spaces to train robots to understand the world.

As a result, a friend asked a question:

Does text-to-video count as spatial intelligence? It can also generate virtual scenes, so why isn't it the best route? An interesting question. My first thought was Sora.

Text-to-video has risen fast: a few sentences can generate a video. Within two years, ByteDance, Tencent, and other model makers have all crowded onto this track.

Two years on, however, people have found it is not so perfect. Generated portraits always carry an uncanny-valley strangeness. Even Yann LeCun, Meta's chief AI scientist, has commented that Sora just draws pretty pictures and doesn't understand the laws of physics.

So I dug in with two questions: why can't the seemingly powerful Sora become a real world simulator? And what is the gap between it and spatial intelligence?

01

Einstein had a classic saying:

"If you can't explain it simply, you don't understand it well enough."

So to dig in, we have to start from the underlying technical principles.

At Sora's core is a diffusion model: start from pure random noise, let the network strip the noise away step by step until a clear image emerges, then string those images together into video. It sounds like magic, but it rests on solid mathematics.
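To make the idea concrete, here is a minimal sketch of that denoising loop. It is illustrative only: the `denoise` function is a dummy stand-in for the trained network (Sora's real denoiser is a large model trained on video), and the update rule is heavily simplified.

```python
import numpy as np

def denoise(x_t, t):
    """Stand-in for a trained denoising network.

    A real model (a U-Net or diffusion transformer) would predict the
    noise present in x_t at step t; here we fake it with zeros just to
    show the shape of the loop.
    """
    return np.zeros_like(x_t)

def generate_image(shape=(64, 64, 3), steps=50):
    # Start from pure Gaussian noise...
    x = np.random.randn(*shape)
    # ...and subtract a little predicted noise at each step,
    # gradually revealing a clean image.
    for t in reversed(range(steps)):
        predicted_noise = denoise(x, t)
        x = x - predicted_noise / steps
    return x

# A video model applies the same idea to a whole stack of frames at once.
frame = generate_image()
```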

It also has a helper many people have heard of: the Transformer. What does it do? It excels at sequential data, threading scattered pieces of information into a line. In Sora, it parses the text prompt and links frame after frame into smooth motion.
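For intuition, here is a toy single-head self-attention over frame embeddings, the mechanism that lets every frame "look at" every other. The shapes and numbers are made up for illustration; real video transformers add learned projections, many heads, and many layers.

```python
import numpy as np

def self_attention(tokens):
    """Minimal single-head self-attention over a sequence of frame tokens.

    Every token (frame embedding) attends to every other, which is how
    a transformer stitches scattered frames into one coherent motion.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)          # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per row
    return weights @ tokens                          # mix information across frames

frames = np.random.randn(16, 128)   # 16 frames, 128-dim embedding each
mixed = self_attention(frames)      # each frame now "knows" about the others
```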

An example:

If you type "a boat sailing in a coffee cup," Sora first parses "boat" and "coffee cup," then links the related concepts and visuals: the hull, rippling waves, the boat tilting.

All of this relies on massive video data and heavy compute, producing tens of seconds of footage in just a few seconds.

But can a model really understand the physical world by fitting data alone? The answer is no, and the problem lies in the architecture.

Diffusion models are good at learning pixel patterns from data and predicting what the next picture should look like; the Transformer can join frames seamlessly. So visually, Sora is clever and can imitate the continuity of real video. But think about it carefully and the problem surfaces.

How can a boat fit in a cup? I tried typing "the cat jumps onto the table," and the footage was so smooth I couldn't fault it; yet the cat's legs passed straight through the tabletop, like a clipping model in a video game. Why does this happen?

Because Sora's generative logic is "draw it beautifully," not "draw it correctly."

It doesn't understand how gravity brings feet to the ground, or why a table should block a cat's legs. The uncanny-valley portraits make this even more obvious: zoom in and the facial details fall apart. It only predicts pixels; it never consults the rules of reality.
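A toy contrast makes the difference concrete. The sketch below (with arbitrary numbers) shows what a physics step enforces and a pure next-frame extrapolation does not: a hard constraint that the table is solid.

```python
# A physics step knows the table is solid; a pixel predictor only
# extrapolates motion patterns and can let the leg sink through.

TABLE_TOP = 0.75   # table surface height in metres (arbitrary)
GRAVITY = -9.8     # m/s^2

def physics_step(y, vy, dt=0.02):
    """One simulation step with gravity and a hard collision constraint."""
    vy += GRAVITY * dt
    y += vy * dt
    if y < TABLE_TOP:       # the table blocks the cat's legs
        y, vy = TABLE_TOP, 0.0
    return y, vy

def pixel_predictor_step(y, dt=0.02, vy=-2.0):
    """What a pure next-frame predictor effectively does: continue the
    motion pattern with no notion that the table is solid."""
    return y + vy * dt

y_phys, v = 1.2, -2.0
y_pix = 1.2
for _ in range(60):
    y_phys, v = physics_step(y_phys, v)
    y_pix = pixel_predictor_step(y_pix)

print(f"physics engine: y = {y_phys:.2f} (rests on the table)")
print(f"pixel predictor: y = {y_pix:.2f} (clipped straight through)")
```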

So Sora’s strengths and weaknesses are two sides of the same coin.

Visual smoothness is its strength, and physical implausibility is its Achilles' heel. As Yann LeCun said, it doesn't understand why the apple falls. I think that view is correct: Sora's architecture never set out to understand the physical world; it only wants the picture to look real.

Since Sora does not understand the physical world, can it become a world simulator?

I doubt it. Why?

A world simulator is a virtual environment that runs on physical rules, helping robots learn real-world causality. Sora's videos may look the part, but they have no fidelity to reality.

Think about it: what can a video of a ship in a cup teach a robot? The robot might conclude that a cup can hold a 10,000-ton vessel, which is worse than useless.

The goal of diffusion models and transformers is visual generation, not physical simulation; Sora is more an artistic tool that pursues good-looking pictures than a correct world. Which makes me think Sora's limitation is that its architecture was never aimed at this target.

02

Now a question arises: what are the key features of a world simulator?

I think there are three basic points:

First, it must know the rules real objects obey, so that when they are moved into a virtual scene they don't deviate much; second, it must understand how objects affect one another; third, it must be able to combine different objects and reason about their interactions.

This is a bit abstract, let me give you an example:

Say you're teaching a robot to hold things. The virtual cup in the world simulator must match the weight, material, and shape of a real cup, so the robot learns how much force to use when grasping it.

If the simulator gets gravity, mass, or friction wrong, the robot will grip too tightly or too loosely, and the object will drop or even break.
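A simple worked example of what getting the physics right buys you: for a two-finger friction grasp, the grip force must satisfy 2μN ≥ mg, so N ≥ mg/(2μ). The numbers below are illustrative.

```python
def min_grip_force(mass_kg, friction_coeff, g=9.8, safety=1.5):
    """Smallest normal force for a two-finger grasp that keeps an object
    from slipping: 2 * mu * N >= m * g, so N >= m * g / (2 * mu).
    A safety factor covers modelling error."""
    return safety * mass_kg * g / (2 * friction_coeff)

# A real ceramic cup vs. a simulator that got mass and material wrong:
print(min_grip_force(0.30, 0.40))   # ~5.5 N, about right
print(min_grip_force(0.10, 0.80))   # ~0.9 N, too loose: the real cup drops
```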

Let’s talk about intelligent transportation.

In reality, traffic jams are a big problem. Solving them relies on algorithms and data analysis, such as staggered-peak travel schemes.

Suppose we have a world simulator. If it cannot model traffic-light timing and vehicle speeds, it cannot predict where and when jams will form, and it cannot plan around the peaks.

Likewise, if the simulator is fuzzy about tire friction, it cannot tell whether a car can pull away smoothly at a green light or stop in time at a red one; if it is fuzzy about how vehicles interact, the simulated traffic turns chaotic and may even produce accidents.
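The red-light case reduces to one textbook formula: braking distance d = v²/(2μg). A quick sketch, with illustrative friction values:

```python
def stopping_distance(speed_mps, friction_coeff, g=9.8):
    """Braking distance on a flat road: d = v^2 / (2 * mu * g).
    Get mu wrong in the simulator and every red-light decision is wrong."""
    return speed_mps ** 2 / (2 * friction_coeff * g)

v = 50 / 3.6                       # 50 km/h in m/s
print(stopping_distance(v, 0.7))   # dry asphalt: ~14 m
print(stopping_distance(v, 0.3))   # wet road the sim didn't model: ~33 m
```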

So a world simulator's job is to pin down complex physical rules and the relationships between objects, so that robots, intelligent transportation, and other advanced systems can work better.

Measured against these key features, Sora clearly falls short. It does a great job at visual generation, but it cannot meet a world simulator's requirements for physical rules and causal reasoning.

This problem is not unique to Sora; some domestic large models share the same architectural flaw. On Douyin I often see clips from image-to-video models where a person suddenly morphs into a dog. It looks funny, but it plainly violates the logic of reality.

The reason is simple: the architecture cannot supply genuine physical understanding for a world simulator, so its applications in embodied intelligence and other fields will be sharply limited.

One conclusion follows: world models and text-to-video models have fundamentally different architectures. To simulate the real world, a world model must understand physical laws and real-world logic; text-to-video mainly generates visuals and is far less strict about logic and fidelity.

03

By contrast, I think what really deserves attention are models oriented toward physical-rule modeling and causal reasoning. For example: Fei-Fei Li's World Labs, Jensen Huang's NVIDIA Cosmos WFMs, and Manycore Tech's spatial intelligence.

Why take them as examples? There are three points:

Start with the goals. Jensen Huang's Cosmos WFMs (world foundation models) aim to build a "virtual brain" that can simulate the real world. This brain must understand the rules of physics: how objects move, how forces act, and how cause leads to effect.

The goal of Fei-Fei Li's World Labs is for AI to truly understand the world: by simulating physical rules, causality, and complex scenes, AI can not only see the world but comprehend it.

For example, an AI system could predict how events unfold in a virtual scene, or make sound decisions across different situations. That ability is crucial for raising the intelligence of robots, autonomous driving, and similar fields.

Manycore Tech's spatial intelligence aims to move the real world into the digital world so that AI can understand and use it, then apply the data to home design, architectural planning, AR, VR, and other fields, making those industries more efficient.

To put it bluntly, it hopes to create a digital twin world where people, AI, and space can think and act to solve practical problems.

With the goals clear, let's look at the three technical paths.

Cosmos WFMs' technical path is to build generative world foundation models (WFMs) and give developers efficient tooling, combining key pieces such as advanced tokenizers, guardrails, and an accelerated video-processing pipeline.

Concretely, it uses NVIDIA NeMo to fine-tune the foundation models and offers open access through GitHub and Hugging Face, helping developers generate highly realistic physics data.

Cosmos also targets tasks such as multi-view video generation, path planning, and obstacle avoidance, further strengthening physical AI in robotics and autonomous driving.

Does that sound hard to follow?

In plain terms: the system they've built teaches AI to see the road, plan routes, and dodge obstacles the way a human does, and it can also generate video from multiple angles. It's especially suited to robotics and autonomous driving.

World Labs' technical path is intelligent 2D-to-3D conversion, so AI can not only understand flat pictures but also generate complete three-dimensional spaces.

Their system starts from an ordinary photo, estimates the 3D structure of the scene, and then completes the invisible parts of the image, ultimately creating a virtual world where users can freely explore and interact.

Simply put: use AI to turn a flat image into a three-dimensional space people can walk into and look around, as if in the real world. The technology is especially useful for robot navigation and virtual reality, where spatial intelligence must understand and respond to complex 3D environments.
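The core geometric step, lifting pixels into 3D with a pinhole camera model, can be sketched in a few lines. This is a generic illustration, not World Labs' actual pipeline: the depth map here is a dummy stand-in for a learned monocular depth estimate, and the generative "fill in the unseen parts" step is omitted.

```python
import numpy as np

def unproject(depth, fx, fy, cx, cy):
    """Lift a depth map (H x W, metres) into a 3D point cloud using the
    pinhole camera model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    return np.stack([X, Y, depth], axis=-1)   # (H, W, 3) points

# Stand-in for a learned depth estimate of an ordinary photo:
depth = np.full((480, 640), 2.0)              # pretend everything is 2 m away
points = unproject(depth, fx=500, fy=500, cx=320, cy=240)
```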

Manycore Tech's spatial intelligence, simply put, works like this:

a roughly 10,000-GPU cluster supplies the compute for the home-furnishing and construction industries to rapidly produce large numbers of 3D models, accumulating vast amounts of 2D and 3D design data; pulling that data onto one platform can generate remarkably realistic virtual scenes.

Companies can then use the platform to train robots, from robot vacuums to autonomous-driving systems, letting them rehearse real environments in the virtual world: how to move, how to avoid obstacles, how to get smarter.

So whether it's Jensen Huang's Cosmos WFMs, Fei-Fei Li's World Labs, or Manycore Tech's spatial intelligence, the core goal is the same: by simulating the real world's physical rules and causality, train AI to be smarter about space and better at solving practical problems.

04

I believe that achieving this goal requires one key factor: high-quality data. Data is the basis for building world models and spatial intelligence, but it is also the biggest “obstacle” to development.

Why?

"Embodied intelligence" is a bit abstract, so let's use a more concrete phrase: virtual training. Virtual training has two important ingredients:

One is massive generated data: text models like GPT rely on ultra-large corpora and heavy compute to learn and reason. The other is real data: a pillow's size, weight, and material, how light reflects, how objects collide; these are physical-interaction scenarios.

Such real data comes from the physical world and directly determines whether virtual training can reproduce behaviors and reactions that follow real-world logic.

In other words, virtual training needs two kinds of data: big data generated virtually, and physical data from real scenes; the latter is usually the development bottleneck.

The reason is simple: generative technologies such as text-to-video and text-to-image can produce rich content, but they struggle to capture real physical rules and precise interaction details.

For example, text-to-video can generate a rolling ball, but it may not accurately simulate the ball's friction, bounce height, or collision response on different ground materials.
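Those material-dependent quantities are exactly what a simulator must parameterize explicitly. A toy sketch, with rough, illustrative coefficients:

```python
# Material parameters a video model never learns explicitly but a
# physics simulator must get right (values here are rough, illustrative):
MATERIALS = {
    "concrete": {"restitution": 0.85, "rolling_friction": 0.010},
    "carpet":   {"restitution": 0.40, "rolling_friction": 0.060},
}

def bounce_height(drop_height, surface):
    """Height after one bounce: h' = e^2 * h, where e is the surface's
    coefficient of restitution."""
    e = MATERIALS[surface]["restitution"]
    return e ** 2 * drop_height

print(bounce_height(1.0, "concrete"))  # ~0.72 m
print(bounce_height(1.0, "carpet"))    # ~0.16 m
```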

So where does the data for real scenes come from? It can only come from the real world.

It is collected from real environments through sensors, cameras, lidar, and other devices. When you drive, sensors record the vehicle's trajectory, force changes, and light reflections, along with vehicle spacing, pedestrian behavior, even how weather affects road conditions; all of it is uploaded to a platform for analysis and training.
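What one such record might look like, schematically; the field names below are illustrative, not any vendor's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DriveLogRecord:
    """One timestamped sample of the kind of real-world signal described
    above; fields are hypothetical, for illustration only."""
    timestamp: float          # seconds since epoch
    speed_mps: float          # vehicle speed
    accel_mps2: float         # longitudinal acceleration (force proxy)
    lead_gap_m: float         # distance to the vehicle ahead
    pedestrian_count: int     # pedestrians detected in frame
    weather: str              # e.g. "rain", which affects road friction

record = DriveLogRecord(1718000000.0, 12.5, -0.8, 22.3, 2, "rain")
```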

But having data is not enough.

Platform data alone cannot guarantee that the next action will be correct; heavy training in a virtual environment is still required. A self-driving car must simulate drives again and again, perhaps thousands of times, until it can handle all kinds of complex scenarios; only then can it be deployed in the real world.
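In rough pseudocode, that loop looks like this; the "policy" and its improvement rule are dummy stand-ins, just to show the run-until-reliable structure:

```python
import random

def run_episode(policy, scenario):
    """Stand-in for one simulated drive; returns True on success."""
    return random.random() < policy["skill"]

def train_in_sim(scenarios, target_rate=0.999, max_runs=100_000):
    """Repeat simulated drives until the policy handles enough of the
    complex scenarios; only then is it allowed onto a real road."""
    policy = {"skill": 0.5}
    for run in range(max_runs):
        scenario = random.choice(scenarios)
        if not run_episode(policy, scenario):
            policy["skill"] = min(0.9999, policy["skill"] + 0.001)  # "learn"
        if policy["skill"] >= target_rate:
            return run, policy
    return max_runs, policy

runs, policy = train_in_sim(["merge", "jaywalker", "fog", "stalled_truck"])
print(f"needed {runs} simulated runs before deployment")
```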

Once you see this, you realize it is not just a problem for autonomous driving and robotics; other industries face it too.

Whether in medical care, manufacturing, or agriculture, world models and spatial intelligence need the support of massive real data, and their capabilities must be verified and optimized through repeated training in virtual environments.

In other words, whether for autonomous driving, robot navigation, or bespoke intelligent applications in other industries, the core challenge is obtaining high-quality real data, then combining the virtual and the real so AI can genuinely solve practical problems. That is the key to landing the next wave of technology.

Whoever holds the underlying architecture and the data earns a seat at the table.
