Applications and Algorithms of Inverse Reinforcement Learning (IRL) in Self-Driving Cars


I worked on several self-driving car projects in the MSC Lab at UC Berkeley. Here I summarize what I learned from this field and briefly outline what we did in the projects.

Background on self-driving cars

A good self-driving car has to ensure three things: comfort for the passengers, etiquette in interaction, and, most importantly, safety. The first two require a driving style similar to a human driver's, and the last two require the car to predict human drivers' behaviors. Thus, the essence of these problems is learning behaviors from humans, which is the topic of IRL. In the self-driving scenario, some works pose the problem as optimal control instead of reinforcement learning by ignoring the stochasticity in the system transitions, so some related works are named Inverse Optimal Control (IOC).

To further abstract the problem of self-driving, we study only the trajectories of the car and use the following notation.

  • Trajectory: $\zeta \in S$
  • Feature: $f: S \rightarrow R$
  • Feature count: $\boldsymbol{f}: S \rightarrow R^{k}$
  • Weights: $\theta \in R^{k}$
  • Cost (reward): $\theta^{T} \boldsymbol{f}(\zeta)$

Features are human-defined characteristics of a trajectory. For example, a_lon: the longitudinal acceleration over the whole trajectory (the L2 norm of the acceleration at each time point). For convenience, we let each feature represent an aspect of the "cost" of a trajectory, so that all the weights share the same sign.

In the projects I participated in, we assume the cost function is a linear combination of the features, so each feature has a corresponding weight. Note that this assumption also makes feature engineering important (an L2-norm a_lon feature is quite different from an L1-norm one).
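
As a minimal sketch (in Python, with made-up features and a made-up trajectory format, not the ones from our projects), a trajectory can be mapped to a feature count and then to a linear cost like this:

```python
import numpy as np

def feature_count(traj, dt=0.1):
    """Map a trajectory (T x 2 array of [x, y] positions) to a feature vector.

    Hypothetical features: L2 norm of the longitudinal acceleration (a_lon)
    and the total path length. Real projects use many more features.
    """
    vel = np.gradient(traj, dt, axis=0)            # finite-difference velocity
    acc = np.gradient(vel, dt, axis=0)             # finite-difference acceleration
    speed = np.linalg.norm(vel, axis=1) + 1e-6
    heading = vel / speed[:, None]                 # unit vector along the motion
    a_lon = np.sum(acc * heading, axis=1)          # longitudinal component of acceleration
    f_a_lon = np.linalg.norm(a_lon)                # L2 norm over the whole trajectory
    f_length = np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))
    return np.array([f_a_lon, f_length])

def cost(theta, traj):
    """Linear cost: theta^T f(traj)."""
    return theta @ feature_count(traj)
```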

A tricky problem in IRL is how to judge whether the algorithm actually learned something. One common learning goal in machine learning is to maximize the log likelihood of the expert trajectories, and the same holds in our task. When demonstrating their algorithms, some papers instead use the match between the feature counts of the human trajectories and of the predicted trajectories. In the formulation below, $\tilde{\boldsymbol{f}}$ is the feature count of the expert trajectory. \(E_{P(\zeta | \theta)}[\boldsymbol{f}]=\tilde{\boldsymbol{f}}\)

To compute the expectation $E$ in the above equation, the following maximum entropy distribution [1] is often used: \(P(\zeta | \theta)=\frac{1}{Z} e^{-\theta^{T}\boldsymbol{f}(\zeta)}\) where $Z$ is the normalization term, making the probability integrate to 1: \(Z = \int e^{-\theta^{T} \boldsymbol{f}(\zeta)} d \zeta\)
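
To illustrate what the normalization does, here is a small sketch that replaces the integral with a sum over a finite set of candidate trajectories (the finite set is my simplification, not part of the formulation above):

```python
import numpy as np

def trajectory_distribution(theta, feature_counts):
    """P(zeta | theta) over a finite candidate set: a softmax of negative costs.

    feature_counts: (N, k) array, one feature count per candidate trajectory.
    The integral defining Z is replaced here by a sum over the candidates.
    """
    costs = feature_counts @ theta                  # theta^T f(zeta) for each candidate
    logits = -costs - np.max(-costs)                # shift for numerical stability
    unnormalized = np.exp(logits)                   # e^{-theta^T f(zeta)} (up to a constant)
    return unnormalized / unnormalized.sum()        # divide by Z so probabilities sum to 1
```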

If you are curious about the assumptions behind this learning goal and the maximum entropy principle, please check the Background understanding section.

Learning algorithms

In the previous section, the hardest problem is the calculation of the probability of a trajectory. To be precise, the problem is how to approximate the normalization term $Z$:

\[P(\tilde{\zeta} | \theta)=\frac{e^{-\theta^{T} \boldsymbol{f}(\tilde{\zeta})}}{\int e^{-\theta^{T} \boldsymbol{f}(\zeta)} d \zeta}\]

[2] fits the integral locally as the integral of a Gaussian, and obtains \(\mathcal{L}=-\frac{1}{2} g^{T} H^{-1} g+\frac{1}{2} \log |H|-\frac{d_{\zeta}}{2} \log 2 \pi\) where $d_{\zeta}$ is the dimension of the trajectory variables. It uses the local gradient and Hessian of the cost to compute the shape of the Gaussian, so the paper claims its algorithm "removes the assumption that the expert demonstrations are globally optimal".
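
A minimal sketch of evaluating this approximated log likelihood, assuming `g` and `H` are the gradient and Hessian of the cost with respect to the trajectory variables (obtaining them is the planner's job and is not shown here):

```python
import numpy as np

def laplace_log_likelihood(g, H):
    """Approximate log likelihood under the local Gaussian (Laplace) fit,
    following the form quoted above:
        L = -1/2 g^T H^{-1} g + 1/2 log|H| - d/2 log(2 pi)
    g: (d,) gradient of the cost w.r.t. the trajectory variables
    H: (d, d) Hessian of the cost, assumed positive definite here
    """
    d = g.shape[0]
    H_inv_g = np.linalg.solve(H, g)          # H^{-1} g without forming the inverse
    _, logdet = np.linalg.slogdet(H)         # log|H|
    return -0.5 * g @ H_inv_g + 0.5 * logdet - 0.5 * d * np.log(2.0 * np.pi)
```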

With this approximation, the algorithm can directly maximize the approximated log likelihood with standard gradient-based optimization to train the weights.

[3] calculates the term $Z$ in a more straightforward way. It substitutes the likelihood of an optimal trajectory for the integral over all possible trajectories, where the optimal trajectory is obtained by solving the trajectory optimization problem with the current weights.

The paper found that the gradient of the experts' log likelihood takes a very simple form:

\[\mathrm{g}=E_{P(\zeta|\theta)}[\boldsymbol{f}]-\tilde{\boldsymbol{f}}\]

Thus, their algorithm uses gradient descent to train the weights, where $E_{P}[\boldsymbol{f}]$ is approximated by the feature count of the optimized trajectory.
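
A sketch of one such update, where `optimize_trajectory` and `feature_count` are placeholders for the planner and the feature-count function (these names are mine, not the paper's):

```python
def weight_update(theta, expert_f, optimize_trajectory, feature_count, lr=0.01):
    """One update in the style of [3]: approximate E_P[f] by the feature
    count of the trajectory that is optimal under the current weights,
    then step along g = E_P[f] - f_expert.
    """
    opt_traj = optimize_trajectory(theta)       # trajectory optimal under the current theta
    g = feature_count(opt_traj) - expert_f      # gradient of the expert log likelihood
    # Ascent on the log likelihood (equivalently, descent on its negative).
    return theta + lr * g
```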

The algorithm we used in our projects is similar to 3, by approximating the expected feature count $E_{P}[f]$ with an averaged feature count from a bunch of sampled trajectories. Human priories are integrated into the trajectory sampler so that the sampler covers the space of high probability trajectories. I might introduce more here after the paper get published.
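
Since the paper is not out yet, the sketch below is only a generic reading of such a sample-based estimate: each sampled trajectory is weighted by $e^{-\theta^{T} \boldsymbol{f}}$ and the weighted average stands in for $E_{P}[\boldsymbol{f}]$ (the exact weighting we used may differ).

```python
import numpy as np

def expected_feature_count(theta, sampled_feature_counts):
    """Estimate E_{P(zeta|theta)}[f] from sampled trajectories.

    sampled_feature_counts: (N, k) feature counts of trajectories drawn
    from the hand-designed sampler. Each sample is weighted by
    exp(-theta^T f) and normalized, i.e. by its probability under the
    maximum entropy model restricted to the sampled set.
    """
    costs = sampled_feature_counts @ theta
    w = np.exp(-(costs - costs.min()))            # shift costs for numerical stability
    w /= w.sum()
    return w @ sampled_feature_counts             # weighted average feature count
```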

Background understanding

One assumption in the IRL framework is that trajectories with the same cost have the same probability. In addition, we want to constrain the expected feature count to be equal to the expert feature count. Thus, the maximum entropy formulation is almost there once we write down the optimization problem in the following form.

\[\begin{aligned} \min_{p_{i}} \quad & \sum_{i} p_{i} \log p_{i} \\ \text{s.t.} \quad & \sum_{i} \boldsymbol{f}_{i} p_{i}=\tilde{\boldsymbol{f}} \\ & \sum_{i} p_{i}=1 \end{aligned}\]

where $\boldsymbol{f}_{i}$ is the feature count of the $i$-th trajectory and $p_{i}$ its probability.
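
For completeness, solving this problem with Lagrange multipliers (the standard maximum entropy derivation, with $\theta$ as the multipliers on the feature constraints) recovers the exponential form used earlier:

\[\mathcal{L}(p, \theta, \lambda)=\sum_{i} p_{i} \log p_{i}+\theta^{T}\Big(\sum_{i} \boldsymbol{f}_{i} p_{i}-\tilde{\boldsymbol{f}}\Big)+\lambda\Big(\sum_{i} p_{i}-1\Big)\]

Setting $\partial \mathcal{L} / \partial p_{i}=\log p_{i}+1+\theta^{T} \boldsymbol{f}_{i}+\lambda=0$ gives

\[p_{i}=e^{-1-\lambda} e^{-\theta^{T} \boldsymbol{f}_{i}}=\frac{1}{Z} e^{-\theta^{T} \boldsymbol{f}_{i}},\]

where $Z$ absorbs the multiplier $\lambda$ so that the probabilities sum to 1, which is exactly the $P(\zeta | \theta)$ used in the first section.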

References

  1. Dunn, A. M., Hofmann, O. S., Waters, B., & Witchel, E. (2011). Cloaking malware with the trusted platform module. Proceedings of the 20th USENIX Security Symposium. 

  2. Kim, D., Di Carlo, J., Katz, B., Bledt, G., & Kim, S. (2019). Highly Dynamic Quadruped Locomotion via Whole-Body Impulse Control and Model Predictive Control. Retrieved from http://arxiv.org/abs/1909.06586 

  3. Kuderer, M., Gulati, S., & Burgard, W. (2015). Learning driving styles for autonomous vehicles from demonstration. Proceedings - IEEE International Conference on Robotics and Automation, 2015-June(June), 2641–2646. https://doi.org/10.1109/ICRA.2015.7139555