# Tutorial 8

# Reinforcement Learning

# Introduction

At this point, our vehicle has every tool it needs to navigate its environment efficiently. In a more complex setting such as an urban road, however, it would need a much wider set of rules to obey traffic laws properly. In our simple simulation, the car only has a few tasks to focus on to succeed: mapping, speed, steering, and obstacle avoidance. What if a traffic light is introduced, and you are eliminated from the competition if the car does not stop? One option is to program every case the car will encounter. This approach is still the most common in industry today (largely for financial reasons), but it breaks down in realistic environments because of the chaotic and unpredictable nature of the real world; consider how many distinct situations a driver faces on an ordinary trip. More recently, companies such as Tesla have moved toward learning-based navigation techniques that scale better as the environment grows more complex.

That is where reinforcement learning comes in. Q-Learning is a reinforcement learning algorithm first introduced in 1989 (https://link.springer.com/content/pdf/10.1007/BF00992698.pdf); decades later, it was combined with deep neural networks to master Atari games without hand-crafted rules or labeled training data. Today, deep reinforcement learning methods such as PPO (Proximal Policy Optimization) regularly make headlines, driven heavily by OpenAI's work on multi-agent interaction (https://openai.com/research/emergent-tool-use) and on Minecraft agents that outperform humans in unseen environments.

# Fundamentals

Explaining the Q-Learning algorithm in full is beyond the scope of this tutorial. The easiest starting point is this video, which covers the fundamentals well: https://www.youtube.com/watch?v=0iqz4tcKN58. Alternatively, the paper linked above that introduced Q-Learning is available, but it is considerably less intuitive.

# Vocabulary

  • State: the robot's perception of its environment at a particular moment.
  • Action: something the robot can perform.
  • Q-value: the estimated value of taking a particular action in a given state.

# Iteration Process

  • Set the car at a random spot on the map
  • For each step
    • Begin by choosing action a for state s according to the current policy (e.g., ε-greedy).
    • Take action a, then observe the next state s' and the reward r.
    • Update the Q-value for the state-action pair (s, a):
    • Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
    • Where
      • Q(s, a) represents the current estimate of the Q-value for state s and action a
      • \alpha is the learning rate
      • r is the reward received after taking action a in state s (0, in our case)
      • \gamma is the discount factor
      • \max_{a'} Q(s', a') is the maximum estimated Q-value for the next state s' considering all possible actions a'.
      • The term r + \gamma \max_{a'} Q(s', a') - Q(s, a) represents the temporal difference error [TD(0)].
    • TD(2) would look like:
      • Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 \max_{a'} Q(s_{t+2}, a') - Q(s, a) \right]
      • Where R_{t+1} is the reward received after taking action a in state s_{t}, leading to state s_{t+1}
  • Tweak hyperparameters such as the learning rate \alpha and the discount factor \gamma to achieve convergence over time.
  • Repeat (a minimal code sketch of this update step follows below).
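
As a concrete reference for the update rule above, here is a minimal Python sketch of a single TD(0) Q-learning step. The NumPy table layout, the q_update name, and the hyperparameter values are illustrative assumptions, not code from this project:

```python
import numpy as np

ALPHA = 0.1   # learning rate (alpha) -- illustrative value
GAMMA = 0.95  # discount factor (gamma) -- illustrative value

def q_update(q_table, state, action, reward, next_state):
    """One TD(0) Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
    """
    td_target = reward + GAMMA * np.max(q_table[next_state])  # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_table[state][action]             # temporal difference error
    q_table[state][action] += ALPHA * td_error                # move Q(s, a) toward the target
    return td_error
```

The function works whether the state is a flat row index into a 2-D table or a tuple index into a multi-dimensional array, as long as q_table[state] yields the row of Q-values for that state.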

# State-Action (Q) Table

When deciding how to format our Q-table, it is important to consider which state and action pairs we want and how large the resulting table becomes. A smaller Q-table will converge faster. A good example could look like:

| State | Action 1 | Action 2 | Action 3 |
| --- | --- | --- | --- |
| [0,0,0,0] | 7 | 35 | 4 |
| [0,0,0,1] | 82 | 9 | 13 |
| ... | ... | ... | ... |
| [20,20,20,10] | 22 | 1 | 79 |

Here, action 1 means slowly turn left at a constant velocity, action 2 means go straight, and action 3 means slowly turn right at a constant velocity. Each state is an array with the following values:

  • [ distance to the car's left diagonal, distance straight ahead of the car, distance to the car's right diagonal, velocity ], or [ d1, d2, d3, v ]

  • Each LiDAR distance is rounded to the nearest integer and clamped to [0…20]. The same is done for velocity, clamped to [0…10].

Even with these simplified states, our table already contains on the order of 20 × 20 × 20 × 10 = 80,000 rows (states) and 80,000 × 3 = 240,000 Q-values that need to be optimized. Adding another set of three actions to change the car's velocity would scale that to roughly 480,000 entries. It is vital to keep this in mind when developing the program, as table size can quickly become a limitation.
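
For concreteness, such a table could be stored as a NumPy array indexed by the discretized [d1, d2, d3, v] state. The bin counts, the discretize helper, and the example readings below are illustrative assumptions:

```python
import numpy as np

N_DIST = 21    # distances rounded and clamped to [0, 20] -> 21 integer bins
N_VEL = 11     # velocity rounded and clamped to [0, 10]  -> 11 integer bins
N_ACTIONS = 3  # left, straight, right

# One Q-value per (d1, d2, d3, v, action) combination -- roughly the
# 240,000 values discussed above.
q_table = np.zeros((N_DIST, N_DIST, N_DIST, N_VEL, N_ACTIONS))

def discretize(d1, d2, d3, v):
    """Round raw LiDAR distances and velocity to integers and clamp them."""
    clamp = lambda x, hi: int(min(max(round(x), 0), hi))
    return (clamp(d1, 20), clamp(d2, 20), clamp(d3, 20), clamp(v, 10))

state = discretize(3.7, 12.2, 8.9, 4.4)       # -> (4, 12, 9, 4)
best_action = int(np.argmax(q_table[state]))  # greedy action for this state
```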

# Implementation

Using the iteration process detailed above, the program should be able to converge. With this knowledge at hand, implementation is within reach. Good luck!
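
To tie the pieces together, here is a minimal training-loop skeleton that reuses the q_update sketch above. The epsilon-greedy choose_action helper and the env.reset()/env.step() hooks are assumptions about your simulator's interface, not part of this tutorial's code:

```python
import numpy as np

EPSILON = 0.1      # exploration rate for the epsilon-greedy policy
N_EPISODES = 5000  # number of training episodes (illustrative)

def choose_action(q_table, state, n_actions=3):
    """Epsilon-greedy: explore with probability EPSILON, otherwise exploit."""
    if np.random.rand() < EPSILON:
        return np.random.randint(n_actions)
    return int(np.argmax(q_table[state]))

def train(env, q_table):
    """Run the iteration process from this tutorial over many episodes."""
    for _ in range(N_EPISODES):
        state = env.reset()   # assumed hook: place the car at a random spot
        done = False
        while not done:
            action = choose_action(q_table, state)
            next_state, reward, done = env.step(action)  # assumed hook: observe s' and r
            q_update(q_table, state, action, reward, next_state)  # TD(0) update from above
            state = next_state
```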