Does pseudo-LIDAR help Tesla or its competitors more?

Tesla doesn't want to use LIDAR, so they are hoping for success in a technique known as pseudo-LIDAR, where you train neural networks to look at images and calculate the distance to everything in the scene, as though you had a LIDAR. It's not here yet, but an interesting question is: should this succeed, is it better for Tesla, or for their LIDAR-using competitors who already have tons of experience working with 3D point clouds?

Read about that at Does pseudo-LIDAR help Tesla or its competitors more?

Comments

Given that an object's apparent area varies inversely with the square of its distance, it seems not too difficult to deduce that if some object in the road has increased its apparent area 4 times in 10 seconds, you will hit it 10 seconds from now if you continue moving at the same speed. It seems Skydio already does this: https://www.youtube.com/watch?v=AOYxlj5iuvo
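A minimal sketch of that time-to-contact reasoning (the function and numbers are illustrative, not from Skydio or any real system):

```python
# Apparent area scales as 1/distance^2, so an area ratio of 4 means the distance has halved.
# At constant closing speed, the remaining gap takes as long to cover as the gap already closed.

def time_to_contact(area_ratio: float, elapsed_s: float) -> float:
    """Estimate seconds until contact from the growth in apparent area.

    area_ratio: apparent area now divided by apparent area elapsed_s seconds ago (> 1).
    """
    distance_ratio = area_ratio ** -0.5      # d_now / d_then
    closed_fraction = 1.0 - distance_ratio   # fraction of the original gap already covered
    return elapsed_s * distance_ratio / closed_fraction

print(time_to_contact(area_ratio=4.0, elapsed_s=10.0))  # -> 10.0 seconds, as in the comment
```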

You don't even have 500 milliseconds to come to a decision about surprise obstacles, let alone 20 seconds.

Tesla does not just depend on cameras
It also has a forward-looking RADAR that gives precise, instant distance readings.

Are you sure Tesla's output will be a 3D point cloud and not some impossible-to-understand AI thing? I don't see the need to go through the extra step. If so, it would actually leave the pack further behind.

No, I'm not sure. The teams trying to do this outside Tesla are indeed trying to generate a point cloud from the images. Tesla, or a team trying to build a whole perception system, might well not do that, and instead just try to build a classifier with an idea of distance. Whatever Tesla makes, it is unlikely to share, though once it is demonstrated how to do it, others will do it as well.

"When an obstacle on the road is revealed to your sensors suddenly, by surprise, you need to know how far AWARE it is with high reliability"

should be (caps added for emphasis)

"When an obstacle on the road is revealed to your sensors suddenly, by surprise, you need to know how far AWAY it is with high reliability"

Yes, it probably will be a point cloud output, because it needs to compare the data with the radar data. If this becomes accurate, then there is still the possibility of using lidar as a further check and fail-safe sensor. Lidar is affected by direct sunlight slightly differently, so it is better to have as a backup rather than a second camera.
If/when lidars cost less than US$100, I still see them being used.

Well, if the LIDAR is cheap, yes, you might use it at the same time as pseudo-lidar. In addition, lidar could make pseudo-lidar more accurate and reliable, with lower resolution from the lidar but an always accurate base distance for each region. Right now lidar does have reliability issues which might make people prefer a really good pseudo-lidar, and there is the potential for more range (in the daytime).
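One way to read the "accurate base distance for each region" idea, sketched with assumed region size, array layout, and function name (not any production system): use sparse but accurate lidar returns to rescale a dense camera depth map region by region.

```python
import numpy as np

def anchor_depth(pred_depth, lidar_depth, region=64):
    """Rescale each region of a dense camera depth map so its median matches
    the sparse (but accurate) lidar returns falling in that region.

    pred_depth:  HxW dense depth prediction from cameras (pseudo-lidar), in metres.
    lidar_depth: HxW array of lidar ranges, NaN where there is no return.
    """
    out = pred_depth.copy()
    h, w = pred_depth.shape
    for r in range(0, h, region):
        for c in range(0, w, region):
            pred = pred_depth[r:r+region, c:c+region]
            lid = lidar_depth[r:r+region, c:c+region]
            hits = ~np.isnan(lid)
            if hits.any():
                # Lidar provides the absolute scale; cameras provide the dense detail.
                out[r:r+region, c:c+region] = pred * (np.median(lid[hits]) / np.median(pred[hits]))
    return out
```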

The ability to look at images and calculate the distance to everything in the scene isn't necessary. It also isn't possible. It's definitely not the way humans drive.

Do you have a link to your source that Tesla "has promoted" this approach? It doesn't sound at all like something that would be very useful for them, given the inherent limitations.

Which is easier: to train a neural network to look at images and calculate the distance to everything in the scene, or to train a neural network to look at images and determine whether or not to slow down? Surely it's the latter. (And in case you're going to say it's the former because you can do it unsupervised, no, the latter can also be done unsupervised, as long as you have a system that can go from a point cloud to the decision of whether or not to slow down. Drive around with lidar and cameras. Process the camera data through a system that goes from images to a determination of whether or not to slow down. Process the lidar data through a system that goes from LIDAR data to a determination of whether or not to slow down. Also have the car record whether or not the human driver decides to slow down. Use occasional manual review of scenarios where the three answers differ to confirm that the system you think works very well does in fact work very well.)

Of course, "do I slow down" is only one of the many questions a car has to decide. But this "Software 2.0" approach is the direction development needs to be going in if this problem is ever going to be solved. And we know that Tesla is answering some questions this way. We know they've built a neural network to go from images (plus, I assume, other information like radar data) to "am I about to be cut off?" They do this without generating a 3D point cloud, which makes sense because the middle step in going from images/radar to 3D point cloud to "am I about to be cut off?" would only make the answer less accurate.
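A rough sketch of that three-way comparison, with hypothetical names and structure (this is not Tesla's actual pipeline): only the frames where the camera model, the lidar-based system, and the human driver disagree get queued for manual review.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    camera_says_slow: bool  # output of an image -> "slow down?" network (assumed to exist)
    lidar_says_slow: bool   # output of a point-cloud -> "slow down?" system (assumed to exist)
    human_slowed: bool      # logged from what the human driver actually did

def frames_needing_review(frames):
    """Return only the frames where the three answers disagree.

    Agreement needs no labelling; disagreement is rare and goes to occasional manual review.
    """
    return [f for f in frames if len({f.camera_says_slow, f.lidar_says_slow, f.human_slowed}) > 1]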

Or for another real world example, how about stop-sign detection? You don't predict the location of a stop sign before you can see it by generating a 3D point cloud. https://twitter.com/karpathy/status/1224408275627601920

I feel like you're missing the key reason why humans have TWO eyes instead of one. Depth perception.

The wider the 'eyes' (in this case cameras), the better the ability to calculate depth.

Also, please explain to me how LIDAR can see colour on signs, traffic lights, car brake/indicator lights and road markings....

Correction: by "the wider the eyes", I meant the wider the distance between two eyes (or two cameras)

But I point out that humans can drive with just one eye. Stereo depth perception just doesn't work out to the distances necessary here, even if you put the cameras on opposite sides of the car to get a long baseline. It can be useful for close-in operations but the need for distant perception is so strong that most teams just don't bother to do stereo.
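A back-of-the-envelope sketch of why stereo falls off with distance (baseline, focal length, and disparity precision below are assumed values, not anyone's real camera specs): depth error grows with the square of the range, so even a car-width baseline gives metres of error at highway sighting distances.

```python
def stereo_depth_error(range_m, baseline_m=1.5, focal_px=1000.0, disparity_err_px=0.25):
    """Approximate stereo depth uncertainty: dz ~ z^2 * d_err / (f * B)."""
    return (range_m ** 2) * disparity_err_px / (focal_px * baseline_m)

for z in (20, 50, 100, 200):
    # ~1.5 m baseline (cameras on opposite sides of the car), ~1000 px focal length
    print(f"{z} m range -> ~{stereo_depth_error(z):.1f} m depth error")
```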

LIDAR does not see colour; why did you think it did? It is not used for that purpose; the cameras are. (Mostly for traffic lights and brake indicators; not everybody looks at the colours on signs and road markings, but when they do look at those, they use cameras in any event.)

This is not unsupervised learning. There is still ground-truth. Unsupervised learning is inference without ground-truth.

Yes, a better term for this which is sometimes used is self-supervised.

However, the technique where you rely on the fact that, to be correct, distance estimates must follow the principles of physics is one I would consider to be true unsupervised learning.

Among other things, this video shows how Tesla is able to use self-supervised learning to develop depth estimation without using lidar: https://youtu.be/hx7BXih7zx8

The work they are doing on their bird's eye view network was also fascinating. Lidar or no-lidar, one key to self-driving is the ability to predict the existence of things you can't directly perceive.

I don't know how well defined these terms are, but usually unsupervised learning, as a reader pointed out, doesn't get much ground truth data given to it; perhaps just a scoring function or a goal like not crashing. Supervised learning has humans actually labeling the thing they want the system to learn, such as "A car looks like the thing in this box." Self-supervised is a term I have seen used when a non-human information source gives you information about what you are trying to learn, e.g. a lidar tells you "this thing is actually 30m away," or even just "there is something here." Comma trains in this way, with a lidar on their training car. And it's a good way to train a pseudo-lidar.

The method I saw Tesla use (I have not watched Andrew's video) was unsupervised. They would try to get a distance estimate for things every frame. If the distance estimates stayed consistent with physics (gradual and continuous changes in distance), they were considered likely true. If they suddenly jumped (say you got 30 frames of 10m away and one frame of 20m away), you could decide the 20m estimate was wrong.
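A minimal sketch of that consistency check (the frame rate and speed threshold are made-up assumptions, and this is not Tesla's actual code): per-frame distance estimates for one tracked object are flagged when they jump by more than physics plausibly allows between frames.

```python
def flag_inconsistent(estimates_m, dt_s=1/30, max_speed_mps=60.0):
    """Mark distance estimates that change faster than a plausible closing speed."""
    flags = [False]  # the first estimate has nothing to compare against
    for prev, cur in zip(estimates_m, estimates_m[1:]):
        flags.append(abs(cur - prev) > max_speed_mps * dt_s)
    return flags

track = [10.0] * 30 + [20.0]          # 30 frames near 10 m, then one sudden jump to 20 m
print(flag_inconsistent(track)[-1])   # -> True: the 20 m reading is treated as wrong
```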

I believe the method you describe is one of the ones that was described in the video. In addition to that, a comparison is made between the estimates from different cameras in places where they overlap.

The speaker in the video called it "self-supervised".

Well, there is really a continuum, from fully supervised, where a human tells you "this is a cat," to fully unsupervised, where the network discovers patterns in the data it did not know to look for and learns to generalize and identify them. The key factor, though, is the cost. Supervised is expensive. Anything that doesn't need a human is much cheaper and can be done at bigger scale. Using a LIDAR sort of needs a human, because somebody has to drive around with a LIDAR, and somebody even has to drive around with a camera, but both are comparatively cheap.

I'm not sure if it's cheap to get someone to drive around with a LIDAR.

I do know that Tesla has hundreds of thousands of people driving around with cameras for free. And they're encountering all kinds of situations that companies hiring people to drive around in their tiny geofenced zones aren't seeing at all.
