On 19th August 2021, Elon Musk and the Tesla AI team presented the technical progress in the field of artificial intelligence and answered questions from the audience. This is the English transcript of the first part of the presentation (timestamp 47:09 – 1:24:30), in which Andrej Karpathy and Ashok Elluswamy describe the approach and development over the last years and give an insight into how neural network image processing works. Many of the images included in the transcript contain links to video sequences shown at the event. You can access the YouTube video, part 2 of the presentation and the German translation of this transcript by clicking on the links.

Elon Musk: (47:09) Hello everyone, sorry for the delay. Thanks for coming. And sorry, we had some technical difficulties – really need AI for this. So what we want to show today is that Tesla is much more than an electric car company, that we have deep AI activity in hardware, on the inference level, on the training level. And basically (47:38…) I think we are arguably the leaders in real-world AI as it applies to the real world. And those of you who have seen the full self-driving beta can appreciate the rate at which the Tesla neural net is learning to drive. And this is a particular application of AI.
But I think there are more applications down the road that will make sense. And we’ll talk about that later in the presentation. But yeah, we basically want to encourage anyone who is interested in solving real-world AI problems at either the hardware or the software level to join Tesla, or to consider joining Tesla. So let’s see, we’ll start off with Andrej.
Andrej Karpathy: (48:35) Hi, everyone. Welcome. My name is Andrej, and I lead the vision team here at Tesla autopilot. And I’m incredibly excited to be here to kick off this section, giving you a technical deep dive into the autopilot stack and showing you all the under the hood components that go into making the car drive all by itself. So we’re going to start off with the vision component here.
Now, in the vision component, what we’re trying to do is, we’re trying to design a neural network that processes the raw information, which, in our case, is the eight cameras that are positioned around the vehicle, and they send us images. And we need to process that in real time into what we call the vector space. And this is a three dimensional representation of everything you need for driving. So, this is the three dimensional positions of lanes, edges, curbs, traffic signs, traffic lights, cars, their positions, orientations, depths, velocities, and so on. So here I’m showing a video of… – actually hold on. Apologies.

(49:47) Here I’m showing the video of the raw inputs that come into the stack, and then the neural net processes that into the vector space. And you are seeing parts of that vector space rendered in the instrument cluster on the car.

Now, what I find kind of fascinating about this is that we are effectively building a synthetic animal from the ground up. So, the car can be thought of as an animal, it moves around, it senses the environment and, you know, acts autonomously and intelligently. And we are building all the components from scratch in-house. So, we are building, of course, all of the mechanical components of the body, the nervous system, which has all the electrical components, and for our purposes, the brain of the autopilot, and specifically for this section, the synthetic visual cortex.
Now, the biological visual cortex actually has quite intricate structure and a number of areas that organize the information flow of this brain. And so, in particular, in your visual cortex, the light hits the retina, and the information goes through the LGN (lateral geniculate nucleus) all the way to the back of your visual cortex, through areas V1, V2, V4, the IT, the ventral and dorsal streams, and the information is organized in a certain layout. And so, when we are designing the visual cortex of the car, we also want to design the neural network architecture of how the information flows in the system.
(51:13) So, the processing starts in the beginning when light hits our artificial retina, and we are going to process this information with neural networks. Now, I’m going to roughly organize this section chronologically: starting off with some of the neural networks and what they looked like roughly four years ago when I joined the team, and how they have developed over time.
So roughly four years ago, the car was mostly driving in a single lane going forward on the highway. And so it had to keep lane, and it had to keep distance away from the car in front of us. And at that time, all of the processing was only at the individual image level. So a single image had to be analyzed by a neural net and processed into a little piece of the vector space. So this processing took the following shape.

We take a 1280 by 960 input, and this is 12-bit integers streaming in at roughly 36 Hertz. Now we’re going to process that with a neural network. So we instantiate a feature extractor backbone. In this case, we use residual neural networks, so we have a stem and a number of residual blocks connected in series. Now, the specific class of ResNets that we use are RegNets, because RegNets offer a very nice design space for neural networks: they allow you to very nicely trade off latency and accuracy.
Now, these RegNets give us as output a number of features at different resolutions, at different scales. So in particular, at the very bottom of this feature hierarchy, we have very high resolution information with very low channel counts, and all the way at the top, we have low resolution spatially but high channel counts. So on the bottom, we have a lot of neurons that are really scrutinizing the detail of the image, and on the top, we have neurons that can see most of the image and have a lot of that scene context.
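As a rough illustration of this kind of multi-scale backbone (a toy stand-in, not the production RegNet; the channel counts and the 3-channel float input are illustrative assumptions), a minimal PyTorch sketch might look like this:

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Toy stand-in for a RegNet-style backbone: a stem followed by a few
    stride-2 stages, returning feature maps at several resolutions."""
    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels[0], 3, stride=2, padding=1),
            nn.BatchNorm2d(channels[0]), nn.ReLU(inplace=True))
        self.stages = nn.ModuleList()
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            self.stages.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)))

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)   # high res / few channels first, low res / many channels last
        return feats

# An input close to the stated 1280 x 960 camera frames.
for f in TinyBackbone()(torch.randn(1, 3, 960, 1280)):
    print(f.shape)   # spatial resolution halves while the channel count grows
```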
(52:54) We then like to process this with feature pyramid networks, in our case BiFPNs (Bi-directional Feature Pyramid Networks), which get the multiple scales to talk to each other effectively and share a lot of information. So for example, if you’re a neuron all the way down in the network, and you’re looking at a small patch, and you’re not sure if this is a car or not, it definitely helps to know from the top layers that “Hey, you are actually in the vanishing point of this highway”. And so that helps you to see that this is probably a car.
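To make the cross-scale talking concrete, here is a minimal sketch of one bidirectional fusion step in the spirit of a BiFPN (a simplified illustration, not the actual BiFPN in the stack; the common width and the resizing choices are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBiFPN(nn.Module):
    """Minimal bidirectional fusion: project every scale to a common width,
    then do one top-down and one bottom-up pass so coarse context and fine
    detail can talk to each other."""
    def __init__(self, in_channels, width=64):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, width, 1) for c in in_channels])

    def forward(self, feats):                       # feats: fine -> coarse
        x = [p(f) for p, f in zip(self.proj, feats)]
        for i in range(len(x) - 2, -1, -1):         # top-down: push context into finer maps
            x[i] = x[i] + F.interpolate(x[i + 1], size=x[i].shape[-2:], mode="nearest")
        for i in range(1, len(x)):                  # bottom-up: push detail into coarser maps
            x[i] = x[i] + F.adaptive_max_pool2d(x[i - 1], x[i].shape[-2:])
        return x

# e.g. three feature maps at decreasing resolution, as a backbone would emit them
feats = [torch.randn(1, 64, 120, 160), torch.randn(1, 128, 60, 80), torch.randn(1, 256, 30, 40)]
fused = TinyBiFPN([64, 128, 256])(feats)
```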
(53:19) After a BiFPN and a feature fusion across scales, we then go into task-specific heads. So for example, if you are doing object detection, we have a one-stage, YOLO-like (‘you only look once’) object detector here, where we initialize a raster, and there’s a binary bit per position telling you whether or not there’s a car there. And then in addition to that, if there is, here’s a bunch of other attributes you might be interested in: the x, y, width, height offsets, or any of the other attributes, like what type of a car this is, and so on. This is for the detection by itself.
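A minimal sketch of such a one-stage head (the attribute count and layout are hypothetical) could be as simple as two 1x1 convolutions over the fused feature map:

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """YOLO-style one-stage head sketch: for every raster position, emit a
    single 'is there a car here?' logit plus a small vector of attributes
    (x/y offsets, width, height, class logits, ...)."""
    def __init__(self, in_channels, num_attributes=8):
        super().__init__()
        self.objectness = nn.Conv2d(in_channels, 1, 1)               # the binary bit per position
        self.attributes = nn.Conv2d(in_channels, num_attributes, 1)  # only meaningful where an object exists

    def forward(self, feat):
        return self.objectness(feat), self.attributes(feat)
```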
Now, very quickly, we discovered that we don’t just want to detect cars; we want to do a large number of tasks. So for example, we want to do traffic light recognition and detection, lane prediction and so on. So very quickly, we converged on this kind of architectural layout, where there’s a common shared backbone that then branches off into a number of heads. We therefore call these HydraNets, and these are the heads of the Hydra.

(54:10) This architectural layout has a number of benefits. Number one, because of the feature sharing, we can amortize the forward-pass inference in the car at test time, and so this is very efficient to run; if we had to have a backbone for every single task, that would be a lot of backbones in the car. Number two, this decouples all of the tasks, so we can individually work on every one task in isolation. For example, we can work on any of the datasets or change some of the architecture of a head and so on without impacting any of the other tasks, and so we don’t have to revalidate all the other tasks, which can be expensive.
And number three, because there’s this bottleneck here in features, what we do fairly often is that we actually cache these features to disk, and when we are doing these fine-tuning workflows, we only fine-tune from the cached features up, so we only fine-tune the heads. So most often, in terms of our training workflows, we will do an end-to-end training run once in a while, where we train everything jointly, then we cache the features at the multi-scale feature level, then we fine-tune off of that for a while, and then do an end-to-end training once again, and so on.
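Structurally, the HydraNet idea and the cache-then-fine-tune workflow can be sketched like this (the task names and module sizes are stand-ins, not the production configuration):

```python
import torch
import torch.nn as nn

class HydraNet(nn.Module):
    """Sketch of the HydraNet layout: a shared backbone and neck produce one
    common feature tensor, and many small task heads branch off it."""
    def __init__(self, backbone, neck, heads):
        super().__init__()
        self.backbone, self.neck = backbone, neck
        self.heads = nn.ModuleDict(heads)

    def features(self, images):
        return self.neck(self.backbone(images))          # the shared, cacheable features

    def forward(self, images):
        feats = self.features(images)
        return {name: head(feats) for name, head in self.heads.items()}

# Toy instantiation with stand-in modules:
model = HydraNet(backbone=nn.Conv2d(3, 8, 3, stride=4, padding=1),
                 neck=nn.Identity(),
                 heads={"vehicles": nn.Conv2d(8, 1, 1), "traffic_lights": nn.Conv2d(8, 4, 1)})

# Fine-tuning workflow sketch: run the expensive shared part once, cache it,
# then train only the heads off the cached features.
with torch.no_grad():
    cached = model.features(torch.randn(4, 3, 96, 128))  # in practice, saved to disk
vehicle_logits = model.heads["vehicles"](cached)          # cheap, head-only training pass
```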

(55:16) So here’s the kinds of predictions that we were obtaining, I would say, several years ago now, from one of these Hydra nets. So again, we are processing individual images. There we go.

We are processing just individual images and we’re making a large number of predictions about these images. So for example, here you can see predictions of the stop signs, the stop lines, the lines, the edges, the cars, the traffic lights, the curbs here, whether or not the car is parked, all of the static objects like trash cans, cones, and so on. And everything here is coming out of the net – here, in this case, out of the hydra net. So, that was all fine and great, but as we worked towards FSD, we quickly found that this is not enough. So, where this first started to break was when we started to work on Smart Summon.

Here, I’m showing some of the predictions of only the curb detection task. And I’m showing it now for every one of the cameras. So, we’d like to wind our way around the parking lot to find the person who is summoning the car. Now, the problem is that you can’t just directly drive on image space predictions. You actually need to cast them out and form some kind of a vector space around you. So we attempted to do this using C++ and developed what we call the ‘occupancy tracker’ at the time.

So here, we see that the curb detections from the images are being stitched up across camera boundaries and over time. Now there were two major problems, I would say, with this setup. Number one, we very quickly discovered that tuning the occupancy tracker and all of its hyperparameters was extremely complicated. You don’t want to do this explicitly by hand in C++; you want this to be inside the neural network and train it end to end. Number two, we very quickly discovered that the image space is not the correct output space. You don’t want to make predictions in image space, you really want to make them directly in the vector space. So here’s a way of illustrating the issue.

(57:01) Here I’m showing, on the first row, the predictions of our curbs and our lines in red and blue. And they look great in the image, but once you cast them out into the vector space, things start to look really terrible, and we are not going to be able to drive on this. So, you see all the predictions are quite bad in vector space. And the reason for this, fundamentally, is that you need to have an extremely accurate depth per pixel in order to actually do this projection. And you can imagine just how high of a bar it is to predict that depth so accurately in every single pixel of the image. And also, if there’s any occluded area where you’d like to make predictions, you will not be able to, because it’s not an image-space concept in that case.

The other problem with this, by the way, also shows up in object detection. If you are only making predictions per camera, then sometimes you will encounter cases like this, where a single car actually spans five of the eight cameras. And so, if you are making individual predictions, then no single camera sees all of the car, and so obviously, you’re not going to be able to do a very good job of predicting that car, and it’s going to be incredibly difficult to fuse these measurements.
So, we had this intuition that what we’d like to do instead is to take all of the images and simultaneously feed them into a single neural net and directly output in vector space. Now, this is very easily said, much more difficult to actually achieve. But roughly, we want to lay out a neural net in this way, where we process every single image with a backbone, and then we want to somehow fuse them, and we want to re-represent the features from image-space features directly into some kind of vector-space features, and then go into the decoding of the heads.
(58:42) There are two problems with this. Problem number one, how do you actually create the neural network components that do this transformation? And you have to make it differentiable, so that end to end training is possible. And number two, if you want vector space predictions from your neural net, you need vector space based datasets. So just labeling images and so on is not going to get you there – you need vector space labels. We’re going to talk a lot more about problem number two later in the talk. For now, I want to focus on the neural network architectures, so I’m going to deep dive into problem number one.

So, here’s the rough problem, right? We’re trying to have this bird’s eye view prediction instead of image space predictions. So for example, let’s focus on a single pixel in the output space, in yellow. And this pixel is trying to decide “Am I part of a curb or not?”, as an example. And now, where should the support for this kind of a prediction come from in the image space? Well, we know roughly how the cameras are positioned, and their extrinsics and intrinsics. So we can roughly project this point into the camera images. And, you know, the evidence for whether or not this is a curb may come from somewhere here in the images.
The problem is that this projection is really hard to actually get correct because it is a function of the road surface, the road surface could be sloping up or sloping down. Or also there could be other data dependent issues. For example, there could be occlusion due to a car. So if there’s a car occluding this viewpoint, this part of the image, then actually, you may want to pay attention to a different part of the image – not the part where it projects. And so, because this is data dependent, it’s really hard to have a fixed transformation for this component.

In order to solve this issue, we use a transformer to represent this space. And this transformer uses multi-headed self-attention, in blocks of it. In this case, actually, we can get away with even a single block doing a lot of this work. And effectively, what this does is: you initialize a raster of the size of the output space that you would like, and you tile it with positional encodings, with sines and cosines, in the output space, and then these get encoded with an MLP into a set of query vectors. And then all of the images and their features also emit their own keys and values. And then the queries, keys and values feed into the multi-headed self-attention.
And so effectively, what’s happening is that every single image piece is broadcasting, in its key, what it is a part of, so “Hey, I’m part of a pillar in roughly this location, and I’m seeing this kind of stuff.” And that’s in the key. And then every query is something along the lines of, “Hey, I’m a pixel in the output space at this position, and I’m looking for features of this type.” Then the keys and the queries interact multiplicatively, and then the values get pooled accordingly.
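Reading this as cross-attention between a bird’s-eye-view query raster and the flattened image features from all cameras, a minimal sketch might look like this; the raster size, embedding width and the simple sine/cosine encoding are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn

class ImageToBEV(nn.Module):
    """Sketch of the image-space -> vector-space transformer: a fixed raster
    of output (bird's-eye-view) positions becomes queries via positional
    encodings and an MLP; flattened image features supply keys and values."""
    def __init__(self, bev_h, bev_w, feat_dim, embed_dim=128, num_heads=8):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        self.query_mlp = nn.Sequential(
            nn.Linear(4, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.kv_proj = nn.Linear(feat_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, image_tokens):   # image_tokens: (B, N, feat_dim), all cameras flattened
        B = image_tokens.shape[0]
        ys, xs = torch.meshgrid(torch.linspace(0, 2 * math.pi, self.bev_h),
                                torch.linspace(0, 2 * math.pi, self.bev_w), indexing="ij")
        pos = torch.stack([ys.sin(), ys.cos(), xs.sin(), xs.cos()], dim=-1)   # sines and cosines
        queries = self.query_mlp(pos.view(1, -1, 4).expand(B, -1, -1))        # "I'm a pixel at this output position..."
        kv = self.kv_proj(image_tokens)                                       # "...I'm part of a pillar over here..."
        bev, _ = self.attn(queries, kv, kv)        # keys and queries interact, values get pooled
        return bev.reshape(B, self.bev_h, self.bev_w, -1)

# e.g. 8 cameras x 300 feature tokens each, each of width 256
out = ImageToBEV(bev_h=20, bev_w=20, feat_dim=256)(torch.randn(1, 8 * 300, 256))
```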
(1:01:13) And so this re-represents the space, and we found this to be very effective for this transformation. So, if you do all of the engineering correctly – and this, again, is very easily said, difficult to do… – There’s actually one more problem before that. I’m not sure what’s up with the slides.

So one more thing. You have to be careful with some of the details here when you are trying to get this to work. In particular, all of our cars are slightly cockeyed in a slightly different way. And so, if you’re doing this transformation from image space to the output space, you really need to know what your camera calibration is, and you need to feed that in somehow into the neural net. And so, you could definitely just concatenate the camera calibrations of all of the images and somehow feed them in with an MLP. But actually, we found that we can do much better by transforming all of the images into a synthetic virtual camera using a special rectification transform. So this is what that would look like.

(1:02:07) We insert a new layer right above the image – the rectification layer. It’s a function of camera calibration, and it translates all of the images into a virtual common camera. So, if you were to average up a lot of repeater images, for example, which face to the back, then without doing this you would get a kind of a blur. But after doing the rectification transformation, you see that the back mirror gets really crisp. So, once you do this, this improves the performance quite a bit.
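A minimal sketch of such a rectification layer, assuming for simplicity that each camera differs from the virtual camera only by its intrinsics and a small rotation (the real calibration has more degrees of freedom):

```python
import torch
import torch.nn.functional as F

def rectify_to_virtual_camera(image, K_real, K_virtual, R):
    """Warp one camera image into a common virtual camera. For every virtual
    pixel we compute where it lands in the real camera via the homography
    H = K_real @ R @ K_virtual^-1 and bilinearly sample there."""
    _, _, h, w = image.shape
    H = K_real @ R @ torch.linalg.inv(K_virtual)
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)  # virtual pixel coords
    src = (H @ pix.T).T
    src = src[:, :2] / src[:, 2:3]                                           # real-camera pixel coords
    grid = torch.stack([src[:, 0] / (w - 1) * 2 - 1,                         # normalize for grid_sample
                        src[:, 1] / (h - 1) * 2 - 1], dim=-1).reshape(1, h, w, 2)
    return F.grid_sample(image, grid, align_corners=True)

# Identity calibration maps the image onto itself:
K = torch.tensor([[100.0, 0.0, 64.0], [0.0, 100.0, 48.0], [0.0, 0.0, 1.0]])
out = rectify_to_virtual_camera(torch.rand(1, 3, 96, 128), K, K, torch.eye(3))
```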

Here are some of the results. So, on the left, we are seeing what we had before, and on the right, we’re now seeing significantly improved predictions coming directly out of the neural net. This is a multi-camera network predicting directly in vector space. And you can see that it’s basically night and day – you can actually drive on this. And this took some time, some engineering, and incredible work from the AI team to actually get this to work, deploy it and make it efficient in the car.

(1:03:01) This also improved a lot of our object detection. So for example, here, in this video, I’m showing single-camera predictions in orange and multi-camera predictions in blue. And basically, you can’t predict these cars, if you are only seeing a tiny sliver of a car. So your detections are not going to be very good, and their positions are not going to be good. But a multi-camera network does not have an issue. Here’s another video from a more nominal sort of situation. And we see that as these cars in this tight space cross camera boundaries, there’s a lot of junk that enters into the predictions and basically, the whole setup just doesn’t make sense, especially for very large vehicles like this one. And we can see that the multi-camera networks struggle significantly less with these kinds of predictions.
(1:03:40) Okay, so at this point, we have multi-camera networks, and they’re giving predictions directly in vector space. But we are still operating at every single instant in time completely independently. So very quickly, we discovered that there’s a large number of predictions we want to make that actually require the video context, and we need to somehow figure out how to feed this into the net. So, in particular: is this car parked or not? Is it moving? How fast is it moving? Is it still there, even though it’s temporarily occluded? Or, for example, if I’m trying to predict the road geometry ahead, it’s very helpful to know about the signs or the road markings that I saw 50 meters ago.

So, we tried to insert video modules into our neural network architecture, and this is one of the solutions that we’ve converged on. So, we have the multi-scale features as we had them before, and what we are now going to insert is a feature queue module that is going to cache some of these features over time, and then a video module that is going to fuse this information temporally. And then we’re going to continue into the heads that do the decoding. I’m going to go into both of these blocks one by one. Also, in addition, notice here that we are also feeding in the kinematics. This is basically the velocity and acceleration that’s telling us how the car is moving. So not only are we going to keep track of what we’re seeing from all the cameras, but also how the car has traveled.

(1:04:54) So here’s the feature queue and the rough layout of it. We are basically concatenating these features over time, with the kinematics of how the car has moved and the positional encodings. And that’s being concatenated, encoded and stored in a feature queue, and that’s going to be consumed by a video module. Now, there are a few details here, again, to get right. So, in particular, with respect to the pop and push mechanisms – especially when do you push, basically.

So, here’s a cartoon diagram illustrating some of the challenges here. The ego car is coming from the bottom and coming up to this intersection here, and then traffic is going to start crossing in front of us, and it’s going to temporarily start occluding some of the cars ahead. And then we’re going to be stuck at this intersection for a while, just waiting our turn. This is something that happens all the time and is a cartoon representation of some of the challenges here. So, number one, with respect to the feature queue and when we want to push into the queue: obviously, we’d like to have some kind of a time-based queue, where, for example, we enter the features into the queue, say, every 27 milliseconds.
And so, if a car gets temporarily occluded, then the neural network now has the power to look at and reference the memory in time, and learn the association that “Hey, even though this thing looks occluded right now, there’s a record of it in my previous features, and I can use this to still make a detection.” So, that’s kind of the more obvious one. But the one that we also discovered is necessary in our case is, for example: suppose you’re trying to make predictions about the road surface and the road geometry ahead, and you’re trying to predict that “I’m in a turning lane, and the lane next to us is going straight.” (1:06:26) Then it’s really necessary to know about the lane markings and the signs, and sometimes they occurred a long time ago.
And so, if you only have a time-based queue, you may forget the features while you’re waiting at your red light. So, in addition to a time-based queue, we also have a space-based queue: we push every time the car travels a certain fixed distance. Some of these details actually can matter quite a bit, and so, in this case, we have a time-based queue and a space-based queue to cache our features. And that continues into the video module.
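The push logic itself is simple to sketch; the 27-millisecond interval comes from the talk, while the queue length and the 1-metre spacing below are assumptions for illustration:

```python
from collections import deque

class FeatureQueue:
    """Sketch of the time-based and space-based feature queues: features
    (together with kinematics and positional encodings) are pushed either
    every fixed time interval or every fixed distance travelled, so neither
    waiting at a red light nor driving fast flushes out needed context."""
    def __init__(self, maxlen=32, push_every_s=0.027, push_every_m=1.0):
        self.time_queue = deque(maxlen=maxlen)
        self.space_queue = deque(maxlen=maxlen)
        self.push_every_s, self.push_every_m = push_every_s, push_every_m
        self.last_t, self.last_d = None, None

    def maybe_push(self, features, t, distance_travelled):
        if self.last_t is None or t - self.last_t >= self.push_every_s:
            self.time_queue.append(features)
            self.last_t = t
        if self.last_d is None or distance_travelled - self.last_d >= self.push_every_m:
            self.space_queue.append(features)
            self.last_d = distance_travelled
```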

Now, for the video module, we looked at a number of possibilities of how to fuse this information temporally. We looked at three-dimensional convolutions, transformers, axial transformers (in an effort to make them more efficient), and recurrent neural networks of a large number of flavors. But the one that we actually like quite a bit, and that I want to spend some time on, is a spatial recurrent neural network video module.

And so what we’re doing here is, because of the structure of the problem – we’re driving on two-dimensional surfaces – we can actually organize the hidden state into a two-dimensional lattice. And then, as the car is driving around, we update only the parts that are near the car and where the car has visibility. So as the car is driving around, we are using the kinematics to integrate the position of the car into the hidden features grid, and we are only updating the RNN at the points that are near us.
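A toy sketch of that mechanic (a plain GRU cell over a 2D grid of hidden vectors, updating only a window around the car; the real video module is of course richer and learns what to read and write):

```python
import torch
import torch.nn as nn

class SpatialRNN(nn.Module):
    """Toy spatial recurrent module: the hidden state is a 2D lattice of
    feature vectors laid out over the ground plane; each step runs a GRU
    update only on the cells in a window around the car's current position."""
    def __init__(self, feat_dim=32, hidden_dim=32, grid=(200, 200), window=40):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden_dim)
        self.hidden = torch.zeros(grid[0], grid[1], hidden_dim)
        self.window = window

    def step(self, local_feats, car_row, car_col):       # car position comes from the kinematics
        w = self.window
        r0, c0 = max(car_row - w // 2, 0), max(car_col - w // 2, 0)
        patch = self.hidden[r0:r0 + w, c0:c0 + w]         # hidden cells near the car
        x = local_feats[:patch.shape[0], :patch.shape[1]]
        new = self.cell(x.reshape(-1, x.shape[-1]), patch.reshape(-1, patch.shape[-1]))
        self.hidden[r0:r0 + w, c0:c0 + w] = new.reshape(patch.shape)
        return self.hidden

rnn = SpatialRNN()
grid_state = rnn.step(torch.randn(40, 40, 32), car_row=100, car_col=100)
```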

(1:07:45) So here’s an example of what that looks like. Here, what I’m going to show you is the car driving around, and we’re looking at the hidden state of this RNN, and these are different channels in the hidden state. So, you can see that – this is after optimization and training this neural net – you can see that some of the channels are keeping track of different aspects of the road, like for example, the centers of the road, the edges, the lines, the road surface, and so on.
Here’s another cool video of this. So, this is looking at the mean of the first 10 channels in the hidden state for different traversals of different intersections. And all I want you to see basically is that there’s cool activity, as the recurrent neural network is keeping track of what’s happening at any point in time. And you can imagine that we’ve now given the power to the neural network to actually selectively read and write to this memory. For example, if there’s a car right next to us, and is occluding some parts of the road, then now the network has the ability to not write to those locations. But when the car goes away, and we have a really good view, then the recurrent neural net can say, “Okay, we have very clear visibility, we definitely want to write information about what’s in that part of space.”

(1:08:53) Here are a few predictions that show what this looks like. So here we are making predictions about the road boundaries in red, intersection areas in blue, road centers, and so on. We’re only showing a few of the predictions here, just to keep the visualization clean. And yeah, this is done by the spatial RNN. And this is only showing a single clip, a single traversal. And you can imagine there could be multiple trips through here, and basically a number of cars and a number of clips could be collaborating to build this map – effectively an HD map, except it’s not in the space of explicit items; it’s in the space of features of a recurrent neural network, which is kind of cool. I haven’t seen that before.

(1:09:35) The video networks also improved our object detection quite a bit. So, in this example, I want to show you a case where there are two cars over there, and one car is going to drive by and occlude them briefly. So, look at what’s happening with the single-frame and the video predictions as the cars pass in front of us. Yeah, so that makes a lot of sense. So, a quick play-through of what’s happening:
When both of them are in view, the predictions are roughly equivalent, and you are seeing multiple orange boxes because they’re coming from different cameras. When they are occluded, the single-frame networks drop the detection, but the video module remembers it, and we can persist the cars. And then, when they are only partially occluded, the single-frame network is forced to make its best guess about what it’s seeing, and it’s forced to make a prediction – and it makes a really terrible prediction. But the video module has the information, knows that this part is not very easily visible right now, and doesn’t actually take that into account.

We also saw significant improvements in our ability to estimate depth and, of course, especially velocity. So here I’m showing a clip from our ‘remove the radar’ push, where we are seeing the radar depth and velocity in green, and we were trying to match or even surpass, of course, that signal with video networks alone. And what you’re seeing here is, in orange, the single-frame performance, and in blue, again, the video modules. And so you see that the quality of depth is much higher. And for velocity – of course, you can’t get velocity out of a single-frame network, so for the orange signal we just differentiate depth to get it – the video module is basically right on top of the radar signal. And so we found that this works extremely well for us.

(1:11:20) So here’s putting everything together. This is roughly what our architecture looks like today. We have raw images feeding in at the bottom; they go through a rectification layer to correct for camera calibration and put everything into a common virtual camera. We pass them through RegNet residual networks to process them into a number of features at different scales. We fuse the multi-scale information with a BiFPN. This goes through a transformer module to re-represent it into the vector space, the output space. This feeds into a feature queue in time or space, which gets processed by a video module like the spatial RNN, and then continues into the branching structure of the HydraNet, with trunks and heads for all the different tasks. So that’s roughly what the architecture looks like today. And on the right, you are seeing some of its predictions, visualized both in a top-down vector space and also in images.
(1:12:10) This architecture has definitely complexified from just a very simple image-based single network about three or four years ago, and it continues to evolve. It is definitely quite impressive, but there are still opportunities for improvement that the team is actively working on. For example, you’ll notice that our fusion of time and space is fairly late in neural network terms. So maybe we can actually do earlier fusion of space or time and do, for example, cost volumes or optical-flow-like networks at the bottom.
Or, for example, our outputs are dense rasters and it’s actually pretty expensive to post-process some of these dense rasters in the car. And of course, we are under very strict latency requirements, so this is not ideal. We actually are looking into all kinds of ways of predicting just the sparse structure of the road, maybe like, you know, point by point or in some other fashion that doesn’t require expensive post-processing. But this basically is how you achieve a very nice vector space. And now I believe Ashok is going to talk about how we can run planning and control on top of it.
Ashok Elluswamy: (1:13:11) Thank you, Andrej. Hi, everyone. My name is Ashok, and I lead the planning and controls, auto-labeling and simulation teams. Like Andrej mentioned, the vision networks take dense video data and then compress it down into a 3D vector space. The role of the planner now is to consume this vector space and get the car to the destination while maximizing the safety, comfort and efficiency of the car.

(1:13:34) Even back in 2019, our planner was a pretty capable driver. It was able to stay in the lanes, make lane changes as necessary and take exits off the highway. But city street driving is much more complicated: there are construction lane lines, vehicles do much more free-form driving, and the car has to respond to all of (…1:13:53) and to crossing vehicles and pedestrians doing funny things.
(1:14:00) What is the key problem in planning? Number one, the action space is very non-convex, and number two, it is high dimensional. What I mean by non-convex is: there can be multiple possible solutions that are independently good, but getting a globally consistent solution is pretty tricky. So there can be pockets of local minima that the planner can get stuck in. And secondly, it is high dimensional because the car needs to plan for the next 10 to 15 seconds and needs to produce positions, velocities and accelerations over the (…1:14:34) window. This is a lot of parameters to produce at runtime.
(1:14:39) Discrete search methods are really great at solving non-convex problems, because they are discrete: they don’t get stuck in local minima, whereas continuous function optimization can easily get stuck in local minima and produce poor solutions. On the other hand, for high dimensional problems, discrete search sucks, because it does not use any gradient information, so it literally has to go and explore each point to know how good it is, whereas continuous optimization uses gradient-based methods to very quickly get to a good solution.
(1:15:10) Our solution to this problem is to break it down hierarchically: first use a coarse search method to crunch down the non-convexity and come up with a convex corridor, and then use continuous optimization techniques to make the final smooth trajectory. Let’s see an example of how the search operates.

So here, we’re trying to do a lane change. In this case, the car needs to do two back-to-back lane changes to make the left turn up ahead. For this, the car searches over different maneuvers. So, for the first one, it searches a lane change that’s close by, but the car brakes pretty harshly, so it’s pretty uncomfortable. The next maneuver it tries is a lane change that’s a bit later, so it speeds up, goes beyond the other cars, goes in front of them and finally makes the lane change – but now it risks missing the left turn. We do thousands of such searches in a very short time span. Because these are all physics-based models, these futures are very easy to simulate. And in the end, we have a set of candidates and we finally choose one based on the optimality conditions of safety, comfort and easily making the turn.
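As a toy illustration of this candidate search (the maneuver parameterization, numbers and scoring terms here are invented; the real planner rolls out thousands of physics-based candidates and scores them on safety, comfort and making the turn):

```python
import numpy as np

def rollout(duration_s, n_steps=40, dt=0.25, lane_width=3.5):
    """Hypothetical physics rollout of one lane-change maneuver: the lateral
    move is spread over `duration_s` seconds with a smooth-step profile."""
    t = np.arange(n_steps) * dt
    phase = np.clip(t / duration_s, 0.0, 1.0)
    y = lane_width * (3 * phase**2 - 2 * phase**3)       # lateral position
    ay = np.gradient(np.gradient(y, dt), dt)             # lateral acceleration
    return t, y, ay

def score(t, y, ay, must_finish_by=6.0):
    comfort = -np.abs(ay).max()                          # harsh maneuvers score worse
    finished = t[np.argmax(y >= 3.4)]                    # when we are essentially in the new lane
    feasible = 0.0 if finished <= must_finish_by else -1e3   # e.g. otherwise we miss the left turn
    return comfort + feasible

# Search over a handful of maneuvers (the real planner searches thousands)
candidates = [rollout(d) for d in np.linspace(1.0, 8.0, 15)]
best = max(candidates, key=lambda c: score(*c))   # the gentlest lane change that still makes the turn
```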
(1:16:23) So now the car has chosen this path, and you can see that as the car executes this trajectory, it pretty much matches what we had planned. The cyan plot on the right side here is the actual velocity of the car, and the white line underneath it was the plan. We are able to plan for 10 seconds here and able to match that plan when you look at it in hindsight. So, this is a well-made plan. When driving alongside other agents, it’s important to not just plan for ourselves; instead, we have to plan for everyone jointly and optimize for the overall scene’s traffic flow. In order to do this, we literally run the autopilot planner on every single relevant object in the scene. Here’s an example of why that’s necessary.

(1:17:09) This is a narrow corridor – I’ll let you watch the video for a second. Yeah, that was autopilot driving through a narrow corridor, going around parked cars, cones and poles. Here’s the 3D view of the same thing. The oncoming car arrives now, and autopilot slows down a little bit, but then realizes that we cannot yield to them, because we don’t have any space to our side – but the other car can yield to us instead. So instead of just blindly braking here, (…1:17:44) reasons that that car has a low enough velocity that they can pull over and should yield to us, because we cannot yield to them, and assertively makes progress.
(1:17:55) A second oncoming car arrives now. This vehicle has higher velocity. And like I said earlier, we literally run the autopilot planner for the other object. So in this case, we run the planner for them. That object’s plan now goes around their side’s parked cars, and then, after they pass the parked cars, it goes back to the right side of the road for them. Since we don’t know what’s in the mind of the driver, we actually have multiple possible futures for this car. Here, one future is shown in red, the other one is shown in green. The green one is a plan that yields to us. But since this object’s velocity and acceleration are pretty high, we don’t think that this person is going to yield to us; they actually want to go around these parked cars. So, autopilot decides: “Okay, I have space here, this person is definitely going to come, so I’m going to pull over.”
(1:18:38) So as autopilot is pulling over, we notice that that car has chosen to yield to us, based on their yaw rate and their acceleration, and autopilot immediately changes its mind and continues to make progress. This is why we need to plan for everyone: otherwise we wouldn’t know that this person is going to go around the other parked cars and come back to their side. If we didn’t do this, autopilot would be too timid, and it would not be a practical self-driving car. So, now we saw how the search and the planning for other people set up a convex corridor.

Finally, we do a continuous optimization to produce the final trajectory that the planner needs to take. Here, the gray thing is the convex corridor. And we initialize a spline in heading and acceleration, parameterized over the arc length of the plan. And you can see that the (…1:19:24) position continuously makes fine-grained changes to reduce all of its costs.
Some of the costs, for example, are distance from obstacles, traversal time, and comfort. For comfort, you can see that the lateral acceleration plots on the right have nice trapezoidal shapes… – it’s going to come first… yeah, here on the right side, the green plot – that’s a nice trapezoidal shape. And if you record a human trajectory, this is pretty much what it looks like. The lateral jerk is also minimized. So, in summary: we do a search for both us and everyone else in the scene, we set up a convex corridor, and then optimize for a smooth path. Together, these can do some really neat things, like shown above.
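A toy version of that continuous stage, assuming the coarse search has already handed us the corridor as per-step lateral bounds (the real optimizer works on a spline in heading and acceleration over arc length; here we just gradient-descend a lateral profile that stays in the corridor while minimizing a comfort cost):

```python
import torch

def smooth_trajectory(corridor_lo, corridor_hi, n_iters=500, lr=0.05):
    """Given per-step lower/upper bounds on lateral position (the convex
    corridor), find a smooth trajectory inside it by penalizing lateral
    acceleration (comfort) and corridor violations."""
    lo, hi = torch.tensor(corridor_lo), torch.tensor(corridor_hi)
    y = ((lo + hi) / 2).clone().requires_grad_(True)     # start in the middle of the corridor
    opt = torch.optim.Adam([y], lr=lr)
    for _ in range(n_iters):
        accel = y[2:] - 2 * y[1:-1] + y[:-2]             # second finite difference ~ lateral accel
        comfort = (accel ** 2).sum()
        barrier = ((y - hi).clamp(min=0) ** 2 + (lo - y).clamp(min=0) ** 2).sum()
        loss = comfort + 100.0 * barrier                 # heavy penalty for leaving the corridor
        opt.zero_grad()
        loss.backward()
        opt.step()
    return y.detach()

# A corridor where an obstacle at steps 8-12 forces us at least 2 m to the left:
lo = [0.0] * 8 + [2.0] * 5 + [0.0] * 7
hi = [4.0] * 20
print(smooth_trajectory(lo, hi))   # a smooth swerve around the obstacle
```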

(1:20:02) But driving looks a bit different in other places, like where I grew up. It is much more unstructured: cars and pedestrians cutting each other off, harsh braking, honking – it’s a crazy world. We can try to scale up these methods, but it’s going to be really difficult to efficiently solve this at runtime. What we instead want to do is use learning-based methods to efficiently solve them. And I want to show why this is true. So we’re going to go from this complicated problem to a much simpler toy parking problem, but it still illustrates the core of the issue.

(1:20:35) Here, this is a parking lot. The ego car is in blue and needs to park in the green parking spot here. So, it needs to go around the curbs, the parked cars, and the cones shown in orange here. This is a simple baseline: a standard A* algorithm that uses a lattice-based search, and the heuristic here is the Euclidean distance to the goal. So, you can see that it directly shoots towards the goal, but very quickly gets trapped in a local minimum; it backtracks from there and then searches a different path to try to go around these parked cars. Eventually, it makes progress and gets to the goal, but it ends up using 400,000 nodes to do so.
(1:21:14) Obviously, this is a terrible heuristic; we want to do better than this. So if you add a navigation route to it and have the car follow the navigation route while being close to the goal, this is what happens. The navigation route helps immediately, but still, when it encounters cones or other obstacles, it basically does the same thing as before: it backtracks and then searches a whole new path. The search has no idea that these obstacles exist; it literally has to go there, check if it’s in collision, and if it’s in collision, back up. The navigation heuristic helped, but this still took 22,000 nodes.
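For reference, the baseline being described is essentially this (a minimal A* on an occupancy grid with the Euclidean-distance heuristic; node counts on a toy grid won’t match the talk’s numbers, but the expansion count is the quantity being compared):

```python
import heapq
import itertools

def a_star(grid, start, goal):
    """Minimal A* on a 2D occupancy grid (grid[r][c] == 1 means blocked),
    using the Euclidean distance to the goal as the heuristic. Returns the
    path and the number of nodes expanded."""
    def h(p):
        return ((p[0] - goal[0]) ** 2 + (p[1] - goal[1]) ** 2) ** 0.5

    tie = itertools.count()
    open_set = [(h(start), 0.0, next(tie), start, None)]
    came_from, expanded = {}, 0
    while open_set:
        _, g, _, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue
        came_from[node] = parent
        expanded += 1
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1], expanded
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in came_from):
                heapq.heappush(open_set, (g + 1 + h(nxt), g + 1, next(tie), nxt, node))
    return None, expanded

grid = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
print(a_star(grid, (0, 0), (2, 3)))
```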

(1:21:51) We can design more and more of these heuristics to make the search go faster and faster, but it’s really tedious and hard to design a globally optimal heuristic. Even if you had a distance function from the cones that guided the search, this would only be effective for a single cone; what we need is a global value function. So instead, what we’re going to use is neural networks to give us this heuristic. The vision networks produce a vector space, and we have cars moving around in it. This basically looks like an Atari game, and it’s a multiplayer version. So, we can use techniques such as MuZero, AlphaZero, etc., that were used to solve Go and other Atari games, to solve the same problem.
So, we’re working on neural networks that can produce state and action distributions, which can then be plugged into Monte Carlo tree search with various cost functions. Some of the cost functions can be explicit cost functions like collisions, comfort, traversal time, etc., but they can also come from interventions in actual manual driving events. We trained such a network for this simple parking problem. So here again, the same problem – let’s see how MCTS searches this.

(1:23:01) Here you notice that the planner is basically able to make progress towards the goal in one shot. And notice that this is not even using a navigation heuristic here: just given the scene, the planner is able to go directly towards the goal. All the other options you’re seeing are possible options; it’s not using any of them, just the option that directly takes it towards the goal. The reason is that the neural network is able to absorb the global context of the scene and then produce a value function that effectively guides it towards the global minimum, as opposed to getting sucked into any local minima. So, this only takes 288 nodes – several orders of magnitude less than what was done with A* and the Euclidean distance heuristic.
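As a much-simplified stand-in for the MuZero/MCTS machinery described above, here is a sketch where a small value network (untrained here; in practice it would be trained on expert data, search outcomes or intervention signals) replaces the hand-designed heuristic in a best-first search over the same kind of lattice; all names and sizes are illustrative:

```python
import heapq
import itertools
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Toy value function: given a rasterized scene (obstacle, goal, and ego
    channels) it scores how promising the current state is."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, scene):          # scene: (1, channels, H, W)
        return self.net(scene).squeeze()

def make_scene(grid, pos, goal):
    """Rasterize the state into the three channels the toy value net expects."""
    scene = torch.zeros(1, 3, len(grid), len(grid[0]))
    scene[0, 0] = torch.tensor(grid, dtype=torch.float32)    # obstacles
    scene[0, 1, goal[0], goal[1]] = 1.0                       # goal
    scene[0, 2, pos[0], pos[1]] = 1.0                         # ego position
    return scene

def guided_search(grid, start, goal, value_net):
    """Best-first search where the learned value estimate orders expansion,
    playing the role of the global heuristic discussed above."""
    tie = itertools.count()
    frontier = [(0.0, next(tie), start, [start])]
    visited = set()
    while frontier:
        _, _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in visited):
                with torch.no_grad():
                    priority = -value_net(make_scene(grid, nxt, goal)).item()
                heapq.heappush(frontier, (priority, next(tie), nxt, path + [nxt]))
    return None

grid = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
print(guided_search(grid, (0, 0), (2, 3), ValueNet()))
```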

(1:23:41) So, this is what the final architecture is going to look like. The vision system is going to crush down the dense video data into a vector space, which is going to be consumed by both an explicit planner and a neural network planner. In addition to this, the network planner can also consume intermediate features of the network. Together, this produces a trajectory distribution, and it can be optimized end to end, both with explicit cost functions and with human interventions and other imitation data. This then goes into an explicit planning function that does whatever is easy for it and produces the final steering and acceleration commands for the car. With that, we need to now explain how we train these networks. And for training these networks, we need large datasets. Andrej now speaks briefly about manual labeling. (1:24:30)