9.7s
for Stanford Deep Learning, CS230. Today's lecture is going to be about deep reinforcement learning. I actually switched from the original plan of talking about neural network interpretability and LLM visualization, simply because you haven't had the chance to study attention maps and convolutional neural networks, so it would have been overkill to do
34.5s
that in week five. So we're going to talk about neural network interpretability and visualization in a later lecture. But today our focus will be on deep reinforcement learning, which is probably my favorite lecture of the class. I feel like I say that every week, but it's okay. I like it.
54.8s
The agenda is pretty packed. We're going to start with deep reinforcement learning, which you can think of as the marriage between deep learning and reinforcement learning. Together, the baby is called deep reinforcement learning. We're going to see how reinforcement learning works and how neural networks can play a part in building a reinforcement learning
81.3s
agent. In the second half of the class, we will focus on a very specific concept called reinforcement learning from human feedback, which you might have heard of. It's one of the core concepts that really made the difference between what you might remember as GPT-2 and ChatGPT. That's the leap.
109.1s
That's really the technique that has democratized access to LLMs, because of the performance improvements and the alignment with humans. So we're going to see what this concept of RLHF is, how it works, and why it allows us to align a language model to
131.3s
human preferences. Ready to go? As always, let's try to make it interactive. So, the motivation behind deep reinforcement learning. As usual, you're going to have all the most important papers that are covered in the class listed at the bottom of each slide. Reinforcement learning has
151.3s
grown in popularity. One of the very popular papers, called "Human-level control through deep reinforcement learning", is work from DeepMind that showed us that a single algorithm or training method can allow us to train AI that can play many, many Atari games better than
179.4s
humans. A single algorithm, over 40 or 50 games where it exceeds human capability, which is quite impressive when you think about the fact that machine learning used to be niche, and you would have to train a really niche algorithm to perform each different task. Here's an algorithm that can just learn sort of every Atari game. A little later,
201.1s
you might have heard of AlphaGo. AlphaGo is an algorithm that was developed to beat and exceed human performance in the game of Go. We'll talk about it a little more. The game of Go is a very complex game. Some would argue it is way more complex than chess, from a decision-making standpoint and from the
223.2s
possibilities that can happen on the board. And so it actually got solved in 2017, again by the DeepMind team and David Silver's lab, later on. And again, another great paper from DeepMind showed us that reinforcement learning can also be used for strategy games that might be a
248.3s
touch more complex than chess or Go, that might actually involve multiple players playing with each other or against each other. Some of you might have played StarCraft, for example. That's an example of a game that requires a lot of long-term thinking and short-term thinking. Another one is
267.9s
Dota. Some of you might have played Dota or League of Legends, where you have a team playing against another team. Those are examples of games that involve multiple agents playing collaboratively. And it's pretty hard to develop systems that can play with each other against multiple opponents. And finally, most recently, this is 2022, so alongside the
289.7s
release of ChatGPT, this paper introduces the concept of reinforcement learning with human feedback applied to aligning language models with human preferences, and we'll talk about that later. All this to say that reinforcement learning has allowed us to exceed human performance in a variety of tasks. The first one I want us to
314.6s
think about is the game of Go. So let's say that you were asked to solve the game of Go with classic supervised learning. Okay, everything we've seen together so far: labeled data. How would you solve the game of Go with classic supervised learning? ... be the label, etc.
356.6s
plenty of games, hopefully from good players if you want the algorithm to work. And you look at X, the input, as the current state of the board, and Y as the next state of the board. This would tell you what move was selected, and you learn the move, essentially. And hopefully, if you do
376.2s
that across many, many games, you might see the agent become more attuned to the game and develop better strategies. So, really, hopefully it's a professional player. What are the disadvantages of that, or the shortcomings that you can anticipate?
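As a rough sketch of the supervised framing just described (X is the current board, Y is the next board), one recorded game can be turned into training pairs as follows. The board encoding and the names `make_supervised_pairs` and `played_move` are assumptions made for illustration only, and captured stones are ignored.

```python
from typing import List, Optional, Tuple

# Assumed encoding: 0 = empty, 1 = black stone, 2 = white stone.
Board = Tuple[Tuple[int, ...], ...]


def make_supervised_pairs(game: List[Board]) -> List[Tuple[Board, Board]]:
    """Turn one recorded game into (x, y) = (current board, next board) pairs."""
    return list(zip(game[:-1], game[1:]))


def played_move(x: Board, y: Board) -> Optional[Tuple[int, int, int]]:
    """Recover the newly placed stone by diffing two consecutive boards.

    Returns (row, column, color), or None if no new stone is found.
    Captures are ignored in this simplified sketch.
    """
    for r, (row_x, row_y) in enumerate(zip(x, y)):
        for c, (before, after) in enumerate(zip(row_x, row_y)):
            if before == 0 and after != 0:
                return (r, c, after)
    return None
```

Each pair only records a single move, which is exactly why the model sees one-off decisions rather than a strategy.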
397.4s
Yes. You can't see the entire space of possible states of the board, which is what you said. So you might miss out on a lot of different strategies. The game of Go is actually a game with two players. One
425.0s
player that uses the black stones and one player that uses the white stones. And iteratively, they're going to place those stones on the grid, a 13-by-13 grid that you can see on screen, with the goal of surrounding their opponent. So you're constantly trying to surround the stones of the opponent, and the opponent is
445.1s
trying to surround your stones. And so you can imagine that for every intersection on the grid, there are multiple possibilities: either there's a black stone, or a white stone, or nothing. And on a 13-by-13 grid, you can imagine how many possible board states there are. It's impossible to capture
466.6s
all of that with historical moves from professional players. It would just never cover that. The same thing could be said of chess as well. You know that even professional players can plan X number of steps in advance, but nobody knows where the game takes you, and in the late stages of the game, or
483.6s
the endgame, players always find themselves playing a different game, and that's part of the magic of being good at chess. So yeah, that's the problem. What's another problem or shortcoming, beyond the fact that we can't possibly observe all the states?
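To put a number on the "too many states" argument above: if every intersection is black, white, or empty, an n-by-n grid has at most 3^(n*n) arrangements. The sketch below computes that upper bound; the function name is hypothetical, and the bound also counts arrangements that are not legal Go positions. The 19-by-19 case is included because that is the standard full-size board.

```python
def board_configurations_upper_bound(n: int) -> int:
    """Upper bound on board arrangements: 3 choices per intersection, n*n intersections."""
    return 3 ** (n * n)


for n in (9, 13, 19):
    count = board_configurations_upper_bound(n)
    # len(str(count)) - 1 is the power of ten of the leading digit.
    print(f"{n}x{n} grid: on the order of 10^{len(str(count)) - 1} arrangements")
```

Even the 13-by-13 bound is far beyond what any archive of professional games could cover.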
517.0s
you said, well, first, you don't even know if this was a good move. So maybe it was not even a good move, and you're learning something that was not a good move and labeling it as a good move. And second, you're actually only getting partial information, meaning you don't have the
531.1s
information of what's in the person's mind and what strategy they're trying to execute. So you're sort of looking at a single example within a long-term strategy, and you can't expect the model to guess what the long-term strategy is, because it was just trained on X and Y, matching the inputs to a
550.4s
possible output. So you don't really have any concept of a strategy at that point. Every decision of the model looks like a one-off. Okay, those are really good points. The other one is that the ground truth might be ill-defined. What I mean by that is, even the best humans in the world do not
574.4s
play their best game every day, and even their best game is not the ground truth. And that creates an issue, because you're essentially training against a target that is off by a certain margin. You're never going to get better than the best human, and the best human is not executing the best possible
592.6s
strategy at every point. So you could argue, what if we get a panel of experts that we're monitoring, and those are the best players in the world? Even with a panel of experts that decides every move, you still have an ill-defined ground truth. So that's a big issue. Too many states
611.9s
in the game, as you mentioned, and we will likely not generalize, which is what you said, meaning we're looking at one-off situations. We're not looking at entire strategies, and so when we face a board state that we've never seen before, because the model was not trained on strategy, we sort of get stuck. Yeah.
631.4s
Okay. And this is an example of a perfect application for reinforcement learning, because reinforcement learning is all about delayed labels and making sequences of good decisions. So if you had to remember in one sentence what RL is: RL is making sequences of good decisions.
667.1s
the difference between classic supervised learning and RL is that in classic supervised learning you teach by example. In reinforcement learning you teach by experience, which is a different concept. You're not just showing cats and non-cats to a model. You're actually letting the model experience an environment until it figures out what
691.8s
were the best decisions it made and learns from them. Reinforcement learning applications: I'm going to mention them. We have gaming, of course, which we already covered. What are other applications of AI where we need good sequences of decisions? Yes.
715.1s
Autonomous driving. Yeah, correct. I mean, in driving, you could argue RL could work, and there's some RL going on, but what you mean, I think, is that you have some sort of dynamic planning algorithm that allows you to strategize. If you see a red light ahead, you might start slowing down over time, but
735.1s
maybe it will turn green, so you might not slow down completely. This is an example of a strategy that you need, of course. Yeah. Robotic control. That's a great example, also related to autonomous driving. But imagine you want to
750.5s
teach a robot to move from point A to point B. The number of good decisions that the robot needs to make in terms of moving each of its joints is tremendous. It's actually super unlikely that a robot would move from A to B if it's not trained to make good sequences of decisions.
783.8s
like it, but it happens to be the biggest one for reinforcement learning.
795.9s
Marketing. You're right. So, yeah, we talked about robotics. Advertisement is another example. Advertisement is a long game. Companies are showing you multiple ads before you buy. And in fact, the reason reinforcement learning is important is because they're planning a strategy that might lead a buyer to
817.6s
execute a purchase over time, and it requires long-term thinking. So there's a lot of reinforcement learning applied to marketing, advertisement, real-time bidding processes, etc. Okay, clear on what RL is and how it differs from classic supervised learning? No? Okay. So let's put some vocabulary around that concept. In
843.1s
reinforcement learning, you have an agent, and the agent interacts with an environment. In the environment, the agent will perform certain actions that we will denote a_t, where t is a time step. And the environment will show you states that transition from time step t to time step
869.4s
t plus one. So, subject to an action a_t, an environment may transition from s_t to s_{t+1}. You can think of the game of Go. I take the action of putting my black stone on a certain grid intersection, and the environment has changed. It moved from the state at time step t
888.8s
to time step t plus one, where my stone is on the grid. After that state update happens, there are two things that the agent observes: an observation that we will denote o_t, and a reward r_t. And of course, the goal of the agent will be to maximize the rewards.
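The vocabulary above (action a_t, state s_t moving to s_{t+1}, observation o_t, reward r_t) can be sketched as a small interaction loop. Everything here is a hypothetical toy: `LineWorld`, `StepResult`, `run_episode`, the reward of 1.0 at the right end, and the `done` flag are assumptions made for illustration, not something defined in the lecture.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    observation: int  # o_t: what the agent gets to see
    reward: float     # r_t: scalar feedback for this step
    done: bool        # assumed flag: True when the interaction ends


class LineWorld:
    """Hypothetical toy environment: the state is a position on a short line."""

    def __init__(self, size: int = 5, start: int = 2) -> None:
        self.size = size
        self.start = start
        self.state = start

    def reset(self) -> int:
        self.state = self.start
        return self.state

    def step(self, action: int) -> StepResult:
        # action is -1 (left) or +1 (right); this is the s_t -> s_{t+1} update.
        self.state = min(max(self.state + action, 0), self.size - 1)
        at_goal = self.state == self.size - 1
        # Here the observation equals the state (a fully observed toy).
        return StepResult(self.state, 1.0 if at_goal else 0.0, at_goal)


def run_episode(env: LineWorld, policy: Callable[[int], int], max_steps: int = 50) -> float:
    """Let the agent act until the environment signals done; return the summed rewards."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)          # the agent picks a_t from what it sees
        result = env.step(action)             # the environment transitions
        total_reward += result.reward
        observation = result.observation
        if result.done:
            break
    return total_reward
```

For example, `run_episode(LineWorld(), lambda obs: +1)` walks right until the goal is reached, while a policy that always returns `-1` never collects any reward.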
919.6s
We'll talk about it a little more. The observation is sometimes equal to the state. Can someone guess why we might need two concepts instead of a single concept? Why is it important to have a state and an observation?
945.0s
The environment may not be fully transparent to the user. For example, in chess or in Go, the observation is actually equal to the state: you see everything on your board, all the information is available to you. If you play League of Legends or StarCraft,
964.2s
you know the concept of, I think in English it's called a cloud or a fog, I think it's a fog: you only see a certain part of the map until you have explored everything, or until your friends are visiting the other parts of the map. And so the observation
978.9s
is actually less information than the state of the environment. Okay. And then the last piece of vocabulary is a transition. When I refer to a transition, I refer to the process of getting from state t to state t plus one, which means we're in state s_t, the agent takes an action a_t, it observes o_t and a reward
999.6s
r_t, and it transitions to the next state s_{t+1}. Question. Are there examples of environments where the state is so large that the entire
1030.1s
reasons. Yeah. You might have games, I mean, look at open-world games. Truly, you could argue, I don't know, there are some games where you might press start and you see the entire environment, but who cares what's happening 20,000 kilometers west of you if you're in a certain location? That might not
1050.1s
influence your strategy. So you might actually put some sort of trust circle, some sort of circle in which you observe, which you think has 99% of the information you need, possibly for computational reasons. That's a good point. Okay, let's get to a practical example of a reinforcement learning algorithm
1070.6s
and develop it together. This example is called "Recycling is good", because recycling is good, but also because it's a simple example, illustrative of reinforcement learning. So let's say we have a small environment with five states. There is a starting state, marked in brown, which is state two. It's our
1094.4s
initial state. And then on the left side, you have state one, which is a garbage can. And it's great to get to the garbage can, because you're going to be able to put in the garbage the stuff that you have in your hands. You
1112.8s
know, you're trying to throw away some garbage, and the garbage can happens to be there, so we would expect there to be a reward. On the other side, if you actually go to the right, you might pass by state three, which is empty. You might pass by state four, where there is
1128.7s
a chocolate packaging that is left on the ground, which you can pick up, and it's good to pick it up. And then on state five, you have the recycling bin, which is more valuable than the garbage can because you can recycle, and you should get better rewards for that.
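The five-state layout described so far can be written down as a small data structure. This is only a sketch of the layout: the names `State` and `move` are hypothetical, and staying in place at the two ends of the line is an assumption made so that the helper is defined for every input.

```python
from enum import IntEnum


class State(IntEnum):
    GARBAGE_CAN = 1          # leftmost state
    START = 2                # initial state (marked in brown on the slide)
    EMPTY = 3                # nothing here
    CHOCOLATE_PACKAGING = 4  # packaging left on the ground
    RECYCLING_BIN = 5        # rightmost state, more valuable than the garbage can


def move(state: State, direction: int) -> State:
    """Move one step left (-1) or right (+1), staying on the five-state line."""
    if direction not in (-1, 1):
        raise ValueError("direction must be -1 (left) or +1 (right)")
    # Assumption: walking off either end leaves the state unchanged.
    return State(min(max(int(state) + direction, 1), 5))
```

Starting from `State.START`, one step left reaches the garbage can, and three steps right pass the empty state and the chocolate packaging before reaching the recycling bin.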
1146.6s
So that's our game. In this game, we define a reward that is associated with the type of behaviors that we want the agent to learn. And the reward is as follows. That's just one example. Plus two for throwing your garbage in the normal can, plus one for picking up the chocolate packaging, and plus 10 if
1166.0s
you manage to make it to the recycling bin. Is it clear? Now, the goal will be, and that's often the case in reinforcement learning, to maximize the return. We'll define the return formally, but think about it as maximizing the amount of rewards that you get as you go through
1167.7s
you manage to make it to the recycle bin.
1169.7s
Is it clear?
1172.0s
Now, the goal will be, and that's the
1175.2s
case in reinforcement learning often
1177.5s
time to maximize the return.
1181.5s
We'll define formally the return but
1183.2s
think about it as maximize the amount of
1185.1s
rewards that you get as you go through
1187.1s
this journey and you make your decisions. In this specific game, we have five states and there are three types of state. In brown is the initial state. We have normal states, and in blue we have terminal states. When you get to a terminal state in reinforcement learning, it will typically end the game. It will end one
1188.4s
decisions. In this specific game, we
1191.3s
have five states and there's three types
1193.5s
of state. In brown is the initial
1195.9s
state. We have normal states and we
1198.6s
have in blue terminal states. When you
1201.1s
get to a terminal state in
1202.8s
reinforcement learning, it will
1204.7s
typically end the game. it will end one
1209.4s
episode of the game. You move to another episode. You'll get back to the starting state or initial state and you'll redo another episode. The possible actions for our agent here are going to be fairly simple. Left and right. And we're going to add an additional
1212.9s
episode. You'll get back to the starting
1214.3s
state or initial state and you'll redo
1216.0s
another episode.
1218.5s
The possible actions for our agent here
1220.5s
are going to be fairly simple. Left and
1222.7s
right.
1225.2s
And we're going to add an additional
1226.8s
rule that is important, which is that the garbage collector comes in 3 minutes and it takes a minute to get from one state to the other. Why is that an important rule to add to the game? Can you guess? Yeah. Without it, you could go back and forth between state three and state four. You just collect a bunch of
1228.4s
the garbage collector comes in 3 minutes
1232.4s
and it takes a minute to get from one
1234.6s
state to the other. Why is that an
1237.0s
important rule to add to the game?
1241.9s
Can you guess?
1244.2s
Yeah.
1249.1s
forth between uh state three and state
1251.6s
four. You just collect a bunch of
1253.3s
chocolate packaging and you never make it to the bin. And so, it's not what we want. Yeah. Okay. So, how do we define the long-term return? The long-term return is going to be defined as capital R, which is the sum of rewards with a discount.
1254.8s
it to the bin. And so, it's not what we
1257.7s
want. Yeah.
1260.2s
Okay. So, how do we define the
1262.6s
long-term return? The long-term return
1264.4s
is going to be defined as capital R,
1268.0s
which is the sum of rewards with a
1272.9s
discount.
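Written out, the return described here is commonly expressed as the following sum. The symbols are assumptions chosen for illustration, since the slide itself is not in the transcript: $r_t$ is the reward collected at step $t$ and $\gamma$ is the discount (named gamma later in the lecture).

```latex
R = \sum_{t \ge 0} \gamma^{t} r_t = r_0 + \gamma\, r_1 + \gamma^{2} r_2 + \dots
```

Note that the first term $r_0$ carries a factor of $\gamma^0 = 1$, which matches the later remark that the immediate reward is not discounted.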
1275.0s
Discount is a very important concept in reinforcement learning. It's also a very natural concept to think about. Can you think of what the discount would represent for humans? Do you have an example of what it could be? Yeah. Huh? Money. Yeah. The value of money and time. Exactly. Or the energy that a robot
1278.2s
reinforcement learning. It's also a very
1280.6s
natural concept to think about. Can you
1283.2s
think of what the discount would
1285.4s
represent for humans? Do you have an
1288.0s
example of what it could be? Yeah.
1291.4s
Huh? money.
1293.4s
Yeah. The value of money and time.
1295.1s
Exactly. Or the energy that a robot
1297.9s
might have, things like that. Yeah. You would rather get, you know, a dollar now than a dollar in 10 years knowing that there's some inflation, for example. That's the example of a discount. In reinforcement learning it's the same. You know, let's say you have a strategy that takes so much time. You
1300.0s
you would rather get, you know, a dollar
1302.4s
now than a dollar in 10 years knowing
1304.2s
that there's some inflation, for
1306.2s
example. That's the example of a
1308.3s
discount. In reinforcement learning it's
1309.8s
the same. You know, let's say you have a
1312.0s
strategy that takes so much time. You
1314.1s
need to discount it because your robot might lose energy as you're going through it, for example. Discounts can vary, you know, but they stay between zero and one. Um, so what is the best strategy to follow if gamma, the discount, is equal to one, meaning, you know, time doesn't matter
1315.6s
might lose energy as you're going
1317.2s
through it, for example. Discounts
1319.9s
can vary, you know, but they stay
1322.1s
between zero and one.
1325.4s
Um, so what is the best strategy to
1327.9s
follow if gamma the discount is equal to
1331.4s
one meaning you know time doesn't matter
1335.1s
here; whether it's longer or shorter, you just want to maximize the return. Best strategy to follow? Someone who hasn't spoken yet.
1337.8s
to maximize the return.
1341.0s
Best strategy to follow.
1349.7s
Someone who hasn't spoken yet.
1369.5s
uh 3 minutes. You can't bounce around because you will not get to the terminal state before the time allotted is done. But that would be a good idea if this rule was not true. What else could you do? Too hard?
1371.5s
You can't bounce around because you you
1373.8s
will not get to the terminal state
1375.3s
before the time allotted is done. But
1379.2s
that would be a good idea if this rule
1381.0s
was not true.
1383.8s
What else could you do?
1394.3s
too hard.
1402.3s
give me also the maximum reward you would get.
1404.0s
would get.
1413.7s
People are sleepy today. Yeah. Recycle. Go to the recycle. So, right, right, right. Yeah, that's right. Thank you. Right, right, right. And then what's your, sorry, what's your total reward? Yeah, that's right. 11. So that's where we get to the terminal state and we grab our
1414.2s
Go to the recycle. So, right, right,
1417.2s
right.
1418.2s
Yeah, that's right. Thank you.
1422.6s
Right, right, right. And then what's
1424.4s
your Sorry. What's your um what's your
1426.6s
total reward?
1429.6s
Yeah, that's right. 11. So that's where
1432.5s
we get terminal state and we grab our
1434.9s
reward of 11. Very good. Now, assume 0.9 for gamma. We're going to complicate things a little bit. I'm going to walk you through a very simple algorithm that allows us to determine the best strategy, and we will put our numbers in a matrix. So for instance, um
1438.8s
0.9 for gamma. We're going to complexify
1442.6s
things a little bit. I'm going to walk
1444.3s
you through a very simple algorithm that
1446.7s
you know allows us to sort of determine
1448.7s
the best strategy and we will put our
1451.0s
numbers in a matrix. So for instance um
1454.2s
we'll define a Q table. Q, you know, is a value function. The names Q-learning and Q star, which you might have heard, all come from this. And so let's say we have a Q table which has the size of number of states times
1459.4s
know is a it's a it's a value function
1462.2s
um where the the the name Q-learning
1465.2s
Q star you might have heard um all of
1468.0s
these things come from Q-learning and so
1470.7s
let's say we have a Q table which has uh
1474.0s
the size of number of states times
1477.1s
number of actions, so five rows, two columns in our case. Every entry of the Q table is essentially representing how good it is to take action A in state S. Do you agree that if we had a table with these numbers, we would essentially have solved the problem, meaning at any point the agent
1480.6s
columns in our
1482.8s
Every entry of the Q table is
1486.4s
essentially representing how good it is
1489.4s
to take action
1491.8s
A in state S.
1495.8s
Do you agree that if we had a table with
1498.2s
these numbers essentially we solve the
1500.7s
problem meaning at any point the agent
1503.4s
can just look in the table: I am in state three, let's look at column one, that would tell me the value of action one, and let's look at column two, it would tell me the value of action two. So I have everything I need to make my decisions.
1506.3s
three let's look at column one that
1508.8s
would tell me the value of action one
1510.6s
and let's do that column two it would
1512.2s
tell me the value of action two so I
1514.2s
have everything I need to make my
1515.4s
decisions
1518.0s
So that table is really the thing you want to find in this exercise. Now, the way we will find the table is sort of using a backtracking algorithm where we might actually codify the environment as a tree and traverse the tree. So here's what it looks like. I start in S2 and I have two
1520.1s
you want to find in this exercise now
1523.7s
the way we will find the table is sort
1526.5s
of using a backtracking algorithm where
1529.4s
we might actually
1531.4s
codify the environment as a tree and
1533.5s
traverse the tree. So here's what it
1535.4s
looks like. I start in S2 and I have two
1538.2s
options ahead of me. I can go to the left where I will get a reward of two. It's an immediate reward. The immediate reward is not discounted. Remember the formula for R. The immediate reward R0 is not discounted. That would take me to S1.
1540.6s
left where I will get a reward of two.
1542.9s
It's an immediate reward. The immediate
1544.7s
reward is not discounted. It's an
1547.0s
immediate reward. Remember the the
1549.4s
formula for R. The immediate reward R0
1552.8s
is not discounted. That would take me to
1555.6s
S1.
1557.1s
It's a terminal state, so there's nothing to do after. Second option, I go to the right and I get a reward of zero. That's my immediate reward. And I end up in state three. State three is not a terminal state. So I can go and do the same
1559.1s
nothing to do after.
1562.2s
Second option, I go to the right and I
1565.0s
get a reward of zero. That's my
1567.2s
immediate reward. And I end up in state
1569.3s
three. State three is not a terminal
1571.4s
state. So I can go and do the same
1573.4s
exercise from state three. In state three, I have two options. I can go to the left where I would see a reward of zero and I will end up in S2 or I will go to the right and I will get an immediate reward of plus one. It's an immediate reward. We're not discounting
1575.6s
three, I have two options. I can go to
1577.6s
the left where I would see a reward of
1580.1s
zero and I will end up in S2 or I will
1583.4s
go to the right and I will get an
1585.2s
immediate reward of plus one. It's an
1588.1s
immediate reward. We're not discounting
1589.7s
it. I will end up in S4 and from S4 again I have two options. Back to the left to S3 with zero reward or to the right with the amazing reward of plus 10 and the terminal state of S5. So that's my map of immediate rewards. That's not my discounted return. So what we're
1592.8s
again I have two options. Back to the
1594.9s
left to S3 with zero reward or to the
1598.6s
right with the amazing reward of plus 10
1602.3s
and the terminal state of S5. So that's
1606.6s
my map of immediate rewards. That's not
1609.9s
my discounted return. So what we're
1611.8s
going to do now is we're going to backtrack up the tree in order to compute the discounted returns. Actually, if I'm in S3 right here, I see that I can get an immediate reward in S4 of one. And I want to compute my maximum return that I can get from when
1612.8s
backtrack up the tree in order to
1615.0s
compute the discounted returns.
1617.3s
Actually, if I'm in S3 right here,
1623.1s
I see that I can get an immediate reward
1626.7s
in S4 of one. And I want to compute my
1630.9s
maximum return that I can get from when
1633.4s
I'm in S3. My maximum return is that in S4 I could get a plus 10, right? But I need to discount that. My discount is 0.9. So I multiply 10 by 0.9. What it tells me is that from S4 I can expect nine plus one, which I get as an immediate reward from moving from S3 to
1636.6s
S4 I could get a plus 10, right? But I
1639.9s
need to discount that. My discount is
1642.9s
0.9. So I multiply 10 by 0.9. What it
1646.6s
tells me is that from S4 I can expect
1649.0s
nine plus one, which I get as an
1651.8s
immediate reward from moving from S3 to
1653.6s
S4, I can update this number to 10. Meaning from S3 the best you can hope for is a discounted return of 10, which is 1 plus 0.9 times 10. Everyone follows? Now let's do the same exercise one step before, in S2. In S2, I have an immediate reward of zero for
1656.3s
Meaning from S3 the best you can hope
1659.4s
for is a discounted return of 10 which
1662.7s
is 1 plus 0.9 times 10.
1666.3s
Everyone follows.
1668.5s
Now let's do the same exercise one step
1670.9s
before in S2. uh you know um in S2 um I
1676.7s
have um an immediate reward of zero for
1680.5s
going to S3 or an immediate reward of two for going to S1. Um S1 is not going to be worth it. We already know that because when I'm in S3 I can actually expect 10, which I have to discount. 0.9 times 10 gives me 9, plus 0 immediate reward from S2 to S3. That tells me that the
1683.8s
two for going to S1. Um S1 is not going
1687.8s
to be worth it. We already know that
1689.4s
because when I'm in S3 I can actually
1692.5s
expect 10 which I have to discount. 0.9
1697.0s
times 10 gives me 9 plus 0 immediate reward
1700.6s
from S2 to S3. That tells me that the
1703.2s
discounted return from state two, which is our initial state, is nine. Good? You follow? It's just a simple backtracking. Now I can copy this back. So S3, I know that when I'm in S3, um uh you know, I can expect zero immediate reward to um uh to sorry, if I'm in S2, I
1705.9s
is our initial state is nine.
1710.6s
Good? You follow? It's just a simple backtracking.
1714.8s
Now I can copy back this. So S3, I know
1717.5s
that when I'm in S3, um uh you know, I
1720.5s
can expect zero immediate reward to um
1724.3s
uh to sorry, if I if I'm if I'm in S2, I
1728.8s
can expect uh zero immediate reward plus a discount times the plus 9 that I could expect in S3. And so that gives me values that should cover everything that we have in this Q table. So I I do that backtracking. I copy paste all of that into my Q table all the way up here. And
1732.8s
a discount times the plus 9 that I could
1735.8s
expect in S3. And so that gives me
1739.6s
values that should cover everything that
1743.0s
we have in this Q table. So I I do that
1746.2s
backtracking. I copy paste all of that
1748.7s
into my Q table all the way up here. And
1751.5s
this is what I get. We have essentially finished the game at this point. We can look at a certain row. So let's say I'm in state number three. I look at the third row of that Q table and I see that I have two options. If I go back to S2, ultimately my discounted return will be
1754.2s
We essentially finish the game at this
1756.3s
point. We um can look uh at a certain
1760.8s
row. So let's say I'm in state number
1763.3s
three. I look on the third row of that Q
1766.4s
table and I see that I have two options.
1768.9s
If I go back to S2,
1771.4s
ultimately my discounted return will be
1775.4s
8.1, right? If I actually go to S4 on the right, I will get 10 because I will get 1 plus 0.9 times 10, which is 10. So this is a toy example, but it tells you that if you were able to backtrack through the entire environment, you will
1777.0s
right? If I actually go to S4 on the
1781.5s
right, I will get 10 because I will get
1783.9s
1 plus 0.9 times 10, which is 10.
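The backtracking described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: it assumes the reward is collected on entering a state (`ENTER_REWARD`), it ignores the three-minute limit, and it fills the table with repeated sweeps of "immediate reward plus discount times best next value" rather than an explicit tree traversal. All names are hypothetical.

```python
GAMMA = 0.9
STATES = [1, 2, 3, 4, 5]
TERMINAL = {1, 5}
# Assumed encoding: reward received when the agent enters each state.
ENTER_REWARD = {1: 2.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 10.0}
MOVES = {"left": -1, "right": +1}


def solve_q_table(gamma=GAMMA, sweeps=20):
    """Fill Q[state][action] by repeatedly backing up values from next states."""
    q = {s: {a: 0.0 for a in MOVES} for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if s in TERMINAL:
                continue  # no action is taken from a terminal state
            for action, delta in MOVES.items():
                nxt = s + delta
                # Nothing can be collected after reaching a terminal state.
                future = 0.0 if nxt in TERMINAL else max(q[nxt].values())
                q[s][action] = ENTER_REWARD[nxt] + gamma * future
    return q


q_table = solve_q_table()
for s in (2, 3, 4):
    print(s, q_table[s])
```

Under these assumptions the rows for S2 and S3 come out close to the numbers quoted in the lecture: about 2 and 9 for S2, and about 8.1 and 10 for S3.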
1789.0s
So this is a toy example, but it tells
1791.0s
you that if you were able to backtrack
1793.2s
through the entire environment, you will
1795.4s
be able to build a massive Q table and you will be able to give it to your agent to make its decisions. Yeah.
1798.2s
you will be able to give it to your
1799.8s
agent to make its decisions.
1803.0s
Yeah.
1812.2s
considering the time remaining. But in practice, if I remove the time component, so I remove the fact that there's a three-minute deadline before the garbage collector comes, then this would be slightly more difficult because you would have to do a time series, essentially, of adding the
1814.6s
in practice um you if if I remove the
1818.2s
time component so I remove the fact that
1820.4s
there's a three minute deadline before
1822.2s
the garbage collector comes then uh this
1825.0s
would uh be slightly more difficult
1826.8s
because you would have to do a time
1828.6s
series essentially of adding the
1831.0s
discount times the reward that you collect. Yeah, but I'm simplifying here and that's why I use the three-minute rule. Any question on the Q table?
1833.0s
collect. Yeah, but I'm simplifying here
1835.0s
and that's why I use the three-minute
1836.5s
rule.
1839.1s
Any question on the Q table?
1849.0s
and in fact we can put together our strategy for gamma equals 0.9. The best strategy is still the same. You go to the right and you can expect a return of 9. in reinforcement learning is this equation on the board called the Bellman optimality equation.
1850.8s
together our strategy for gamma equals
1853.0s
0.9.
1854.6s
The best strategy is still the same. You
1856.3s
go to the right and you can expect a
1859.1s
return of 9.
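As a quick check of the two answers given so far (11 when gamma is 1, 9 when gamma is 0.9), here is a small sketch that sums discounted rewards along a path. The function name and the reward lists are illustrative assumptions; the rewards 0, 1, 10 correspond to walking right three times from state two, and the single reward 2 to walking left once.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


RIGHT_PATH = [0.0, 1.0, 10.0]  # S2 -> S3 -> S4 -> S5
LEFT_PATH = [2.0]              # S2 -> S1

for gamma in (1.0, 0.9):
    print(gamma, discounted_return(RIGHT_PATH, gamma), discounted_return(LEFT_PATH, gamma))
```

With either discount the right-hand path beats the left-hand path, which is why the best strategy stays the same.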
1865.4s
in reinforcement learning is this
1867.4s
equation on the board called the Bellman
1870.9s
optimality equation.
1873.8s
Oftentimes you'll see it written as Q star of state S and action A equals R plus gamma times the max, over A prime, of that same function applied to S prime, A prime. Let me explain this equation for you because it's super important. This equation is called the optimality
1877.3s
of state S and action A
1881.4s
equals R plus gamma times the max of that
1886.0s
same function applied to S prime A
1889.3s
prime.
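For reference, the equation read aloud here is usually typeset as below. The notation is the standard one and is assumed rather than copied from the board: $r$ is the immediate reward for taking action $a$ in state $s$, $s'$ is the resulting next state, and $a'$ ranges over the actions available in $s'$.

```latex
Q^{*}(s, a) = r + \gamma \max_{a'} Q^{*}(s', a')
```

Plugging in the toy example from earlier: for $s = S3$ and $a = \text{right}$, $r = 1$ and the best value available from $S4$ is $10$, so $Q^{*}(S3, \text{right}) = 1 + 0.9 \times 10 = 10$.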
1891.2s
Let me explain this equation for you
1893.0s
because it's super important. This
1895.7s
equation is called the optimality
1898.4s
equation because your optimal Q table will follow this equation. If you have finished the game, this equation can be applied to any state-action pair and it will still be true. The intuition behind why the Bellman equation is the optimality equation is that if you have the
1901.9s
will follow this equation. If you have
1904.2s
finished the game, this equation can be
1907.1s
applied to any state action pair and it
1909.9s
will still be true.
1912.5s
The intuition behind why the Bellman
1915.3s
equation is the optimality equation is
1917.9s
that um if you're in a if you have the
1921.4s
perfect Q function, Q table, and you're in a certain state and you perform a certain action A, you will observe a reward. You have taken an action, so you would be in a new state, and from that new state you can repeat what you just
1925.4s
in a certain state and you perform a
1927.4s
certain action A you will observe a
1929.8s
reward and this reward will uh you know
1934.0s
you you have taken an action so you
1935.5s
would be in a new state and from that
1937.4s
new state you can repeat what you just
1939.1s
did, right? And because you've done the backtracking and stuff like that, you will get this equation to be true because it's the reward plus discount times the best next action that you could be taking. Does that make sense? Any question on that? That's exactly the backtracking that we did, by the way: immediate reward plus
1943.0s
backtracking and stuff like that, you
1944.5s
will uh get this equation to be true
1946.9s
because it's the reward plus discount
1949.5s
times the best next action that you
1951.2s
could be taking.
1957.4s
Does that make sense? Any question on that?
1958.9s
That's exactly the backtracking that we
1960.5s
did by the way. immediate reward plus
1964.6s
discount times the best possible action that you can take in the next state s prime. The last concept I cover in terms of vocabulary is the policy. The policy is the function that given your state is going to tell you what to do. And in
1968.1s
that you can take in the next state s
1970.2s
prime.
1976.0s
The last concept I cover in terms of
1978.2s
vocabulary is the policy. The policy is
1980.7s
the function that given your state is
1982.6s
going to tell you what to do. And in
1985.8s
Q-learning the way this policy is defined is the argmax of Q star across the actions. So essentially what it says is: look in the table at a certain state s. You want the policy, which is what you should do. It's the function that tells you our best strategy. You just look at the two
1988.1s
defined is argmax of Qstar um across the
1993.5s
action. So essentially what it says is
1995.4s
like look in the table and look at a
1998.3s
certain state s you want the policy
2001.0s
which is what you should do. It's the
2002.6s
function that tells you our best
2004.4s
strategy. You just look at the two
2006.5s
possible actions, see which one has the highest Q value, and select that action. That's it. It's the core of Q-learning. You know, later on you will use policies widely. There are a lot of reinforcement learning algorithms, but this concept of understanding that the policy is the function telling us our best strategy, in
2008.5s
highest Q value and select that action.
2012.6s
That's it.
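The lookup described here, picking the action with the highest Q value in the current state, can be sketched as follows. The table values for S2 and S3 are the ones quoted in the lecture; the row for S4 (9 for left, 10 for right) is an assumption derived by applying the same backtracking rule, and the function name is hypothetical.

```python
# Q values for the non-terminal states of the toy game (gamma = 0.9).
Q_TABLE = {
    2: {"left": 2.0, "right": 9.0},
    3: {"left": 8.1, "right": 10.0},
    4: {"left": 9.0, "right": 10.0},  # assumed, not read out in the lecture
}


def greedy_policy(q_table, state):
    """Return the action whose Q value is largest in the given state (argmax)."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


for state in sorted(Q_TABLE):
    print(state, greedy_policy(Q_TABLE, state))
```

In this toy table the policy picks "right" in every non-terminal state, which matches the strategy found by backtracking.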
2019.8s
it's the core of um Q-learning that you
2023.6s
know later on you will use policies
2025.6s
widely. There's a lot of reinforcement
2027.3s
learning algorithms but this concept of
2029.4s
understanding the policies the function
2031.0s
telling us our best strategy in
2032.9s
Q-learning it's the argmax of the Q value in the given state. It tells you which action to take. That's the core thing you need to understand. So remember this Bellman equation because we're going to reuse it in a bit. The main issue with this
2035.3s
value in the given state. It tells you
2037.2s
which action to take that's the core
2039.1s
thing you need to understand.
2041.4s
So remember this Bellman equation because
2043.4s
we're going to reuse it
2046.0s
in a bit. The main issue um with this
2050.6s
approach of a Q table is that state and action spaces can be super large, and having a matrix that you discover through backtracking, and where every time you want to do an action you have to look up the given state and the possible actions, becomes impossible. Imagine using this
2056.1s
and action spaces can be super large and
2060.2s
having a matrix that you discover
2063.3s
through backtracking
2065.1s
um and where every time you want to do
2066.8s
an action you have to look up the given
2069.0s
state the possible action it becomes
2072.0s
impossible. Like imagine you using this
2075.8s
algorithm for the game of Go, where there are so many states and so many possible actions: you can put your stone anywhere on the board. You can imagine how big this matrix becomes and how impossible it is to use. So that's our problem and that's the moment where deep learning comes into play.
2078.6s
there's so many states, there's so many
2080.7s
possible actions, you can put your stone
2082.4s
anywhere on the board. You can imagine
2085.0s
how big this matrix becomes and how
2087.0s
impossible it is to use. So that's our
2090.7s
problem and that's the moment where deep
2093.0s
learning comes into play.
2101.7s
Oh, actually, before I go there I'm just going to cover some vocabulary. We said the environment, the agent, the state, the action, the reward, the total return and the discount factor. We learned all of that. We saw that the Q table is the matrix of entries representing how good it is to
2103.9s
there I'm just going to cover some
2105.2s
vocabulary. We said the environment, the
2106.8s
agent, the state, the action, the
2108.1s
reward, the total return and the
2109.6s
discount factor. We learned all of that.
2111.8s
We saw that the Q table is the matrix of
2114.0s
entries representing how good is it to
2115.8s
take action A in state S. And the policy is the function that tells us what's the best strategy to adopt. And the Bellman equation is satisfied by the optimal Q table. What I was about to say is we are going to frame the problem slightly
2119.0s
is the function that tells us what's the
2120.4s
best strategy to adopt. And the Bellman
2122.0s
equation is satisfied by the optimal Q
2124.9s
table.
2131.4s
what I was about to say is we are going
2133.9s
to frame the problem slightly
2135.4s
differently. So instead of using a Q table, we're going to use the fact that neural networks are universal function approximators and we're going to define a Q function that's essentially a neural network. So that the function can take a state S and an action A and tell you how good that action is in state S. So
2138.6s
table, we're going to use the fact that
2140.7s
neural networks are universal function
2143.3s
approximators and we're going to define
2146.0s
a Q function that's essentially a neural
2148.3s
network. So that the function can take a
2151.3s
state S and an action A and tell you how
2155.5s
good that action is in state S. So
2159.5s
instead of a lookup in a matrix, you just run a forward pass in a neural network and it gives you the answer. That feels like a better solution for games where there's a lot of states and a lot of actions.
2161.8s
just run a forward pass in a neural
2164.2s
network and it gives you the answer.
2166.6s
That feels like a better solution for
2168.9s
games where there's a lot of states and
2170.6s
a lot of actions.
2177.2s
the past we looked for a Q table and this time we will look for a neural network. One of the things we're going to do is to define the output layer to have two outputs. So given a certain state as input, think about it as a one-hot vector encoding the state. So this
2179.1s
this time we will look for a neural
2181.4s
network. One of the things we're going
2183.9s
to do is to define the output layer to
2186.6s
have two outputs. So given a certain
2188.9s
state as input think about it as a one
2191.0s
hot vector encoding the state. So this
2193.5s
one is the example of state two: 0 1 0 0 0. If you pass state two into this Q function with multiple layers, it will give you two outputs. One output that corresponds to Q of S, action right, and the other one Q of S, action left, because those are the two actions.
2196.3s
0. If you pass state two in this Q
2199.7s
function with multiple layers, it will
2203.2s
give you two outputs. One output that
2205.5s
corresponds to Q of S
2210.3s
action right and the other one Q of S
2212.9s
action left because it's the two
2214.8s
actions.
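To make the shape of this network concrete, here is a minimal, untrained forward pass in plain Python. It is a sketch under assumptions not stated in the lecture: one hidden layer of 8 ReLU units, small random weights, and hypothetical helper names. It only shows the one-hot input for state two going in and two Q values coming out.

```python
import random

N_STATES, N_ACTIONS, HIDDEN = 5, 2, 8  # HIDDEN size is an arbitrary choice


def one_hot(state_index, size=N_STATES):
    """Encode a state as a vector with a single 1 at its index."""
    return [1.0 if i == state_index else 0.0 for i in range(size)]


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Affine layer: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def q_network(state_vector, params):
    """Forward pass: state in, one Q value per action out."""
    hidden = relu(dense(state_vector, params[0]))
    return dense(hidden, params[1])


rng = random.Random(0)
params = [init_layer(N_STATES, HIDDEN, rng), init_layer(HIDDEN, N_ACTIONS, rng)]
state_two = one_hot(1)  # state two is index 1: [0, 1, 0, 0, 0]
q_values = q_network(state_two, params)
print(state_two, q_values)
```

Because the weights are random, the two outputs are meaningless until the network is trained; adding more actions would only mean a larger `N_ACTIONS`.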
2216.3s
If we had more actions to take, we would just increase the output layer and we might have many more neurons in the output layer.
2218.3s
just increase the output layer and we
2220.4s
might have many more neurons in the
2222.2s
output layer.
2230.9s
we going to train that network? Because we're not in classic supervised learning. We don't have labels.
2233.4s
we're not in classic supervised
2235.0s
learning. We don't have labels.
2244.0s
What would you do, given we don't have traditional x and y pairs? How are you going to train this neural network? Because remember, at the beginning this neural network will give you garbage. It will take a state s and it might tell
2246.2s
we don't have traditional x and y pairs
2250.1s
how are you going to train this neural
2253.6s
network
2255.8s
because remember at the beginning this
2257.4s
neural network will give you garbage it
2259.1s
will take a state s and it might tell
2261.5s
you go to the left or to the right but it's completely random. So how are you going to tune it to the level where it makes really good decisions?
2263.0s
it's completely random so how are you
2264.9s
going to tune it to the level where it
2267.5s
makes really good decisions?
2281.0s
Tell me more. What? this problem right now? What are the rules of the game that we could use in order to... I'm seeing what you say. You say we could estimate what good looks like but
2291.4s
this problem right now? What are the the
2293.4s
the rules of the game that we could use
2297.2s
in order to
2299.4s
I'm I'm seeing what you say. You say we
2301.0s
could estimate what good looks like but
2303.1s
based on what? That's one thing we have in every game. We have a reward structure for every state that definitely should be used in order to estimate what a good decision looks like. Yeah. The problem is, not in every state will you see a reward. And if you look at many games of
2314.9s
that's one thing we have in every game.
2316.4s
We have a reward structure for every
2318.2s
state that definitely should be used in
2321.0s
order to estimate the good what a good
2322.9s
decision looks like. Yeah. The problem
2325.4s
is not in every state you will see a
2327.7s
reward. And if you look at many games of
2331.4s
like Go you might not see a reward until 50 moves in. So what do you do in this case? Yes. Can we run through a bunch of actions and space and see what the output is and get more data? Yeah. So you're actually bringing up a sort of a tree search,
2334.4s
50 moves.
2336.7s
So what do you do in this case?
2339.5s
Yes. Can we run through a bunch of
2343.6s
actions and space and see what the
2345.6s
output is and get more data?
2349.3s
Yeah. So you could you're actually um
2352.6s
bringing up a sort of a tree search,
2355.0s
right? You go down the tree, you do every possible action and then you backtrack. Not every possible action. So which actions? Trying to spread it out of the... Okay, we're getting there. So the first possibility is we just go down the tree. In the game of Go, you could
2356.7s
every possible action and then you
2358.7s
backtrack.
2359.8s
Not every possible action.
2361.7s
So which actions?
2363.1s
Trying to spread it out of the
2365.8s
Okay, that's that's we're getting there.
2367.5s
So first possibility is we just go down
2370.1s
the tree in the game of go. You could
2372.1s
put your stone everywhere. So the tree already starts with 13 by 13 options and then it grows exponentially. It's impossible. It's intractable. But what you said is what if there are certain actions that are more likely than others? Do we actually need to explore the entire tree? What's this like? What are you using when you're saying that?
2373.7s
already starts with 13 by 13 options and
2377.0s
then it grows exponentially. It's
2379.3s
impossible. It's intractable. But what
2381.9s
you said is what if there are certain
2383.3s
actions that are more likely than
2384.6s
others? Do we need actually to explore
2386.4s
the entire tree? What's this like? What
2388.9s
are you using when you're saying that?
2390.5s
How do you determine what action might be better than another one? Expected return. And we're getting close. Yeah. But you know, how do you know the expected return without going through the tree once? At least you can estimate both. Okay. You can estimate it using what?
2392.2s
be better than another one?
2395.9s
Expected return. And we're getting
2397.3s
close. Yeah. But you know, how do you
2398.9s
know the expected return without going
2401.0s
through the tree once? At least
2403.0s
you can estimate both.
2405.2s
Okay. You can estimate it using what?
2414.2s
exactly what we're going to do actually.
2416.0s
But we're going to use the the Bellman
2418.0s
equation because there are two things we
2420.2s
know about this problem. We know the
2422.0s
reward structure which you brought up
2424.1s
and we also know that the perfect Q
2426.5s
function will follow the Bellman
2428.6s
equation that we know as well. At the
2430.8s
end the Bellman equation should be
2432.6s
respected meaning for every state if you
2436.6s
want to know the Q value of that state
2440.2s
given an action. The way you will get
2442.5s
that is you will look at the immediate
2444.0s
reward plus the discount times the best
2447.3s
Q value from the next state across all
2449.2s
actions. that equation will be
2452.0s
respected. So those are the only
2454.2s
information we have and we're going to
2455.7s
use them drastically to define our
2458.4s
labels and sort of mimic a classic
2460.9s
supervised learning approach. So here's
2463.0s
what we have. We have our neural
2464.2s
network. We have Q S to the left and QS
2467.7s
to the right that represent how good it
2469.4s
is to go to the left in that state
2470.9s
versus the right. And then I've pasted
2474.2s
the Bellman equation on top right of the
2476.6s
screen. We're going to define a loss
2478.6s
function. So let's say for the sake of
2480.3s
simplicity because those are scalar
2482.6s
values that we'll use you know L2 loss
2487.3s
quadratic loss that compares a certain
2490.3s
label Y to um a certain Q value of a
2495.0s
state and a certain action. So what we
2497.8s
would like is to minimize this loss
2500.1s
function meaning Y and the Q value for a
2503.4s
given action in a given state is as
2505.4s
close as possible to each other. and
2508.5s
we're going to leverage the reward and
2510.1s
the Bellman equation. So let's do um two
2513.0s
things. Right now we don't have a Y. So
2515.5s
in supervised learning you will have a
2516.9s
picture of a cat. There's a cat. The Y
2518.9s
is one or zero. Here we don't have a Y.
2521.8s
So we have to come up with an estimate
2523.9s
of a good Y at least better than random.
2528.1s
So let's say at this point in time when
2531.6s
I send a state S in the network, it
2535.0s
turns out that Q of going to the left is
2537.6s
higher than Q of going to the right.
2540.3s
Which means that today at that moment
2542.9s
the Q function tells me it's better to
2544.8s
go to the left than to go to the right.
2547.6s
That is random at the beginning. It's
2549.4s
completely random. Right? So what I'm
2552.4s
going to do is I'm going to use as my
2555.0s
target value Y the immediate reward that
2558.8s
I observe on the left
2561.7s
plus gamma times the best Q value that I
2567.2s
can get. So the best action that I could
2569.8s
take in the next step based on my
2573.4s
current Q value.
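Written out, the loss and the target just described look roughly as follows. This is a sketch in the lecture's notation; the symbols r (immediate reward), gamma (discount), s' (next state), a' (next action) and theta (network parameters) are naming assumptions and are not copied from the slide.

```latex
L = \bigl(y - Q(s, a; \theta)\bigr)^2,
\qquad
y = r + \gamma \max_{a'} Q(s', a'; \theta)
```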
2580.7s
target is off. It's not a perfect
2582.8s
target, but it's better than nothing.
2586.0s
Meaning, not only it tells us, hey,
2589.3s
there is a good reward to the left. We
2591.4s
should consider that in saying that that
2593.7s
might be a good move because we're
2595.6s
seeing an immediate reward. But on top
2597.7s
of that, we also know that at the end of
2600.6s
training, the Q value should follow the
2602.9s
Bellman equation. So why don't we set
2605.0s
the target as the Bellman equation? So
2608.0s
we add the discounted maximum future
2610.3s
reward when you are in the next state.
2612.3s
So you were in state S. You go to the
2614.2s
left now you're in state S next left and
2617.8s
you look again at your Q values and you
2620.6s
select the best one. Then you add that
2622.8s
number here. So there are
2625.0s
actually two forward passes in that
2627.2s
process.
2629.9s
Right?
2632.7s
There's one forward pass where you send
2635.2s
the state S in Q and you look at the two
2638.2s
options left or right and you're like
2640.1s
okay I'm going to the left and then
2642.1s
you're like I'm going to compare that
2643.4s
value to a target Y but to get that
2646.0s
target Y I need to do another forward
2648.0s
pass. So I take my action left I perform
2651.0s
it I get an S prime state s next and I
2654.6s
send that S next into the Q network. I
2657.5s
look at the two options I have. I pick
2659.3s
the best one and I add it here with a
2661.8s
discount.
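The two forward passes can be sketched in a few lines of Python. This is only an illustration: `q_network` is a hypothetical lookup table standing in for the real network, and the state names, Q-values, reward of 1.0 and discount of 0.9 are made-up numbers, not values from the lecture.

```python
def q_network(state):
    # Hypothetical stand-in for the Q network: returns {action: Q-value}.
    table = {
        "s": {"left": 0.6, "right": 0.2},
        "s_next_left": {"left": 0.1, "right": 0.5},
    }
    return table[state]


def td_target(reward, next_state, gamma):
    # Second forward pass: evaluate the next state and keep the best Q-value.
    next_q = q_network(next_state)
    return reward + gamma * max(next_q.values())


# First forward pass: look at both options in state s and pick the best one.
q_s = q_network("s")
action = max(q_s, key=q_s.get)

# Assumed outcome of playing that action: reward 1.0, next state "s_next_left".
y = td_target(reward=1.0, next_state="s_next_left", gamma=0.9)

# Quadratic loss between the target and the Q-value of the chosen action.
loss = (y - q_s[action]) ** 2
```

The first pass decides which action to take; the second pass is only there to build the label y.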
2668.9s
the following is we have a Q network
2671.8s
that's random at the
2673.4s
beginning. It has never observed the
2675.4s
rewards. We just know that at some point
2677.9s
it will get to the Q um it will get to a
2681.1s
perfect you know policy. It will get to
2684.2s
a perfect Q function. But the best we
2686.3s
can do right now is to say as a guide
2690.6s
for our agent, we will look at the
2692.9s
immediate reward and we will look at the
2694.6s
Bellman equation which should tell us a
2696.8s
better estimate than where we are right
2698.5s
now and we will try to catch up to that
2701.6s
estimate and then we do that again and
2704.0s
again. So remember every time your Q
2706.5s
gets better it gets better for the next
2708.9s
state as well. So you know the Bellman
2711.0s
equation tells you estimate it with the
2713.1s
second forward pass and you just keep
2715.0s
getting better and better as you're
2716.6s
observing more rewards.
2732.0s
left and
2737.9s
uh how would it so describe the loop you
2739.3s
imag
2747.2s
for going to a right you again need the target
2748.0s
yeah you would stop at that point so
2749.8s
what you yeah this is a good question I
2751.7s
I'll show you how we fix certain things
2753.4s
but you do only one step meaning
2756.6s
you have your Q value at this point and
2760.1s
it tells you go to the left and you just
2762.4s
want to target Y. So what you do is you
2764.7s
put left and you look at your next
2766.8s
state. You forward propagate your next
2768.8s
state. You look at the two options. You
2770.5s
pick the best. You don't go further. You
2772.9s
just use that one step. You look one
2774.8s
step ahead essentially. You don't look
2776.6s
multiple steps ahead. You could, but it
2778.9s
would be more computationally heavy to
2780.5s
do one more step again and so on.
2783.7s
So yeah. Yeah.
2785.9s
Yeah. It seems like you're learning the
2788.0s
function locally like
2791.1s
most
2806.4s
environment the state space um how long
2809.3s
it will take to converge but you're
2811.3s
perfectly right that um as the Q
2815.2s
function gets better, the estimate Y
2817.5s
also gets better. So the two things get
2820.0s
better together, right? Because the Y is
2821.8s
based on the Q function. And if the
2824.2s
state space is massive, you might have a
2827.4s
very difficult time training this model.
2829.8s
There's better approaches that we'll see
2831.8s
later.
2833.4s
Yeah, correct. There was a question
2836.1s
there.
2845.8s
when you send state S in Q the left
2848.3s
happens to be higher than right. But the
2850.4s
same happens on the other side. Let's
2851.8s
say let's say the left is worse than
2854.2s
right. Then what you will do is you will
2856.2s
define your target Y as the reward that
2859.2s
you observe on the right plus from the
2862.3s
next state of having gone to the right
2864.4s
what's the best action and what's the Q
2866.5s
value for that pair and then it will
2868.7s
give you the target for that scenario.
2878.1s
is that when you want to differentiate
2881.0s
L. So you want to perform a back
2883.1s
propagation. You want to take the
2884.4s
derivative of L with respect to the
2886.1s
parameters of the network. You want Y to
2888.9s
be a fixed thing, right? Because in
2890.5s
supervised learning Y is not
2891.7s
differentiable. It's just a fixed number
2893.8s
zero or one or a certain number. So here
2896.2s
we're going to simplify and we're going
2897.5s
to say this term that is dependent on
2900.0s
the Q network. So technically this term
2901.9s
has parameters. So if you actually
2903.8s
differentiate it, it will give you a
2905.3s
value. We'll just hold it um fixed. So
2910.0s
we say we do use our Q network to
2913.0s
perform an estimate of our Y but we will
2915.5s
not differentiate it. We will say it's
2917.1s
not it's fixed. Yeah.
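Holding the target fixed can be illustrated with a tiny hand-written gradient. This is a sketch under assumptions: the "network" is a hypothetical linear model with one weight per action times a scalar feature, and the weights, feature values, reward, discount and learning rate are invented for the example.

```python
def q_value(w, x, action):
    # Hypothetical linear Q-function: one weight per action times a scalar feature.
    return w[action] * x


def semi_gradient_step(w, x, action, reward, x_next, gamma, lr):
    # The target is computed with the same weights but then treated as a constant.
    y = reward + gamma * max(q_value(w, x_next, a) for a in w)
    error = y - q_value(w, x, action)
    # Derivative of (y - w[action] * x)^2 with respect to w[action], y held fixed.
    grad = -2.0 * error * x
    new_w = dict(w)
    new_w[action] = w[action] - lr * grad
    return new_w, y


w = {"left": 0.5, "right": 0.0}
new_w, y = semi_gradient_step(
    w, x=1.0, action="left", reward=1.0, x_next=1.0, gamma=0.9, lr=0.1
)
```

Only the weight of the action that was taken moves; no gradient flows through the max term inside y, even though it was computed with the same weights.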
2919.7s
Can you explain why we discount this?
2923.2s
Yeah.
2924.2s
Because you know going back to the
2926.4s
reason we discount is like the value of
2928.2s
time. It's like you probably want to say
2931.0s
if you can win the game in 10 moves, win
2933.3s
it in 10 moves rather than 100 moves. Um
2936.4s
or if you can get $1 today, get $1 today
2939.3s
rather than in 10 years. All of that is
2941.5s
why we have a discount here. And the
2943.8s
discount is a hyperparameter that
2946.2s
you would define as well. That would
2948.9s
influence the strategy of your agent.
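A quick numerical illustration of the discount, assuming a single reward of 1 received at the end of the game and a discount of 0.9 (both values are chosen only for the example):

```python
def discounted_value(reward, gamma, steps):
    # Value today of a reward received `steps` moves in the future.
    return (gamma ** steps) * reward


gamma = 0.9
win_in_10 = discounted_value(1.0, gamma, 10)    # roughly 0.35
win_in_100 = discounted_value(1.0, gamma, 100)  # roughly 0.00003
```

With this discount, winning in 10 moves is worth far more than winning in 100, so the agent is pushed toward the shorter path. A discount closer to 1 would make the agent more patient.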
2952.4s
Once more pass.
2954.6s
We're going to see it after. Actually,
2955.7s
I'm going to do a concrete example
2956.9s
because it's a little complicated. Yeah.
2958.7s
Yeah.
2960.8s
Yeah. Q.
2963.7s
No, no, that that was it's a good point.
2966.1s
It's not that. It's just Q of it's it's
2969.2s
a 2 by two. It's a one by two. So you
2972.1s
have left and right. I was just going
2973.7s
down the first case. So I put the state
2975.7s
left. Yeah. Yep.
2986.6s
mean, in most games we're going to see
2987.9s
right now, the rewards are going to be
2989.1s
fixed by the designer of the game, the
2991.0s
human that's designing the game. Um, in
2993.7s
practice, you could have a separate
2995.0s
function that um actually comes up with
2999.0s
the reward. We're going to see an
3000.2s
example later in the lecture where the
3002.3s
reward might be different in different
3004.3s
scenarios and there's a function or
3006.6s
sometimes called a critic that
3008.7s
determines what's the reward in a
3010.1s
certain state.
3012.6s
Okay, this is yeah one last question and
3014.4s
then we move because we we're going to
3015.6s
see a concrete example is going to be
3017.0s
clear.
3017.6s
Um so when we like hold it fixed for
3021.2s
backprop is that what differentiates this
3023.4s
from iterating through all of the like
3027.4s
possible
3028.2s
Yeah. Yeah. So instead of doing the
3031.2s
backtracking down the tree and going
3033.8s
over everything, we're saying we're
3036.1s
going to limit ourselves to just picking
3038.2s
the best action based on our current
3040.0s
understanding of the network.
3042.6s
You see, like my network is kind of
3045.0s
intelligent, not great. We're in the
3046.4s
middle of training. It says that I
3048.4s
should go to the left and then if I look
3049.9s
at the next state when I'm in the left,
3052.4s
it says I should go to the right. I will
3054.6s
trust it because it's the best I have,
3056.2s
best estimate I have, but I will
3057.8s
discount that. And then if you keep
3060.0s
repeating that, it turns out that not
3062.0s
only your estimate gets better, but your
3063.9s
model gets trained and then ultimately
3065.9s
both together get to an optimality
3067.7s
equation.
3070.6s
So it's a it's a funky concept, right?
3073.2s
But you get it.
3075.8s
We're going to see examples. Um, okay.
3078.8s
So uh then once you have been able to
3082.3s
use the Bellman equation to estimate your
3085.4s
targets, you perform classic back
3088.7s
propagation and you update the
3090.9s
parameters of the network and you repeat
3092.8s
that process.
3095.2s
Okay.
3097.6s
Uh here is concretely if you were to
3100.5s
code it in pseudo code, here is what it
3102.6s
would look like to train an RL agent
3105.9s
using Q-learning. We start by
3108.4s
initializing our Q network parameters.
3111.7s
So initialization, it's random at first.
3115.4s
Then we will loop over episodes. As a
3117.4s
reminder, episodes are one full game
3119.4s
from start to terminal state. Um within
3123.4s
an episode, we're going to start from
3125.3s
the initial state s and we're going to
3127.1s
loop over time steps until we reach a
3130.1s
terminal state. So within one time step,
3134.7s
here's what we will do. We forward
3136.8s
propagate the state s in the Q network.
3140.1s
We will execute the action A that has
3142.3s
the maximum Q value.
3145.8s
We will observe a reward and we will
3148.0s
also observe a next state S prime. We
3151.8s
will use that S prime to compute our
3154.0s
target Y by forward propagating S prime
3156.4s
in the Q network and then computing our
3159.2s
loss function. And based on that we will
3162.7s
use gradient descent to update the
3164.3s
parameters of the network
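The training loop just described can be sketched as runnable Python on a toy problem. Everything concrete here is an assumption made for illustration: the five-cell corridor environment, the table of numbers standing in for the Q network, zero initialization instead of random (for reproducibility), the epsilon-greedy exploration (the loop as described always takes the best action), and the choice to drop the bootstrap term on terminal steps.

```python
import random

ACTIONS = ("left", "right")
TERMINAL = {0: 0.0, 4: 1.0}  # hypothetical corridor: terminal cell -> reward


def step(state, action):
    # Toy environment: move one cell; reward only when a terminal cell is reached.
    next_state = state - 1 if action == "left" else state + 1
    return TERMINAL.get(next_state, 0.0), next_state, next_state in TERMINAL


def train(episodes=500, gamma=0.9, lr=0.25, epsilon=0.5, seed=0):
    rng = random.Random(seed)
    # "Network parameters": one number per (state, action) pair.
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(5)}
    for _ in range(episodes):
        state, done = 2, False  # every episode starts in the middle cell
        while not done:
            # Forward pass on s, then act (exploration is an added assumption).
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(q[state], key=q[state].get)
            reward, next_state, done = step(state, action)
            # Forward pass on s' to build the target; y is held fixed.
            y = reward if done else reward + gamma * max(q[next_state].values())
            # Gradient of (y - Q(s, a))^2 with respect to Q(s, a).
            grad = -2.0 * (y - q[state][action])
            q[state][action] -= lr * grad
            state = next_state
    return q


q = train()
```

After training, `q[2]["right"]` should end up larger than `q[2]["left"]`, since the only reward sits at the right end of the corridor.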
3170.2s
should be simpler looked at like that right
3172.8s
okay so this is the vanilla Q-learning
3177.2s
so to summarize again the one the main
3179.8s
difference is that we don't have a
3181.3s
target and we use our own network to
3183.4s
estimate the target and the rewards are
3185.6s
what's going to help us get better over
3194.5s
understand everything. This is an entire
3195.9s
class at Stanford. Um, you know, an
3198.3s
entire quarter of studying that type of
3200.1s
stuff. So, we're trying to get the
3201.8s
basics within an hour and a half, two
3203.6s
hours.
3208.6s
together um
3212.5s
and apply that to an actual game. So,
3214.6s
here's the game. It's called Breakout.
3216.6s
We want to destroy all the bricks. Who
3218.6s
has played Breakout in the past? Buddy,
3220.6s
a few. Okay, good. So, you have a a
3223.2s
paddle that you control and you're
3225.7s
trying to destroy the bricks. If the
3229.3s
ball gets past your paddle, you lost.
3231.8s
And if the bricks are all destroyed, you
3233.6s
won. That's it.
3237.5s
Let's do it together.
3239.7s
What um what is the input of our Q
3242.8s
network?
3244.6s
What would you use as input
3247.6s
to remember? Yeah.
3258.0s
Yes.
3258.2s
Entire screen. Okay. Let's do that. So I
3260.6s
take I define that as the state S which
3263.1s
is the input to my Q network. Uh what's
3265.7s
the output of the Q network?
3269.7s
Yes.
3270.9s
Do we have to do that on screen?
3273.1s
Good question. We'll get there. I'm I'm
3275.0s
gonna ask you, but do we have to look at
3278.7s
the full screen? Answer is no, but we'll
3281.1s
see why. What's the output?
3283.9s
Yeah.
3284.5s
Game score.
3285.5s
The game score. Uh, no. But we're going
3288.8s
to talk about the game score in the
3290.1s
back.
3295.6s
first and then we'll talk about the
3297.0s
stuff we can get rid of on the inputs.
3299.6s
But what's the output? Yeah,
3301.3s
it's the
3304.4s
the actions.
3305.8s
Yeah, the actions. So, yeah, it will be
3308.7s
the Q values
3310.9s
associated with the actions in state S.
3313.0s
Remember, it's a Q function. So, the
3314.5s
output is we need one value for left,
3317.4s
one value for right, and one value for
3319.4s
idle. You could make this game
3321.3s
more complicated and say we have eight
3324.2s
actions. We have a little bit to the
3325.7s
left, a lot to the left, a lot more to
3328.0s
the left. You know, if you had multiple
3329.5s
buttons, but let's simplify and say
3331.4s
three actions. Either you don't move,
3333.1s
you move to the left or you move to the
3334.6s
right. So these are the outputs. So now
3336.6s
let's get to the question of the screen.
3338.6s
Do we need the entire screen?
3342.4s
So you were saying something earlier?
3354.2s
the bricks. I would argue uh you need
3356.5s
more because there's the walls. And I
3358.5s
guess that you could if you're an expert
3359.8s
player, you could know where the walls
3361.2s
are, but generally you need a little
3362.9s
more than that. What what what would be
3364.7s
obviously things we can get rid of? And
3366.3s
why would we do that?
3377.2s
Um, who would remove the score at the
3380.1s
top?
3386.3s
Why would you not remove it?
3402.1s
score doesn't matter. It's true. We
3404.3s
would remove the score. So you you could
3406.0s
actually crop the top. You could also
3408.4s
crop the bottom. I mean, if it passed
3410.1s
the paddle, you don't care about the few
3411.6s
pixels at the bottom. You could get rid
3413.5s
of them. Um this is not always true.
3416.3s
There are games where the score matters
3419.2s
and in fact you know I like football
3422.6s
soccer the the in soccer if you're one
3426.2s
zero up you can park the bus. So your
3428.6s
strategy is dependent on the score that
3431.7s
you have like you wouldn't park the bus
3433.6s
if you're losing one zero. Parking the
3435.6s
bus meaning you ask every player to come
3437.3s
back and defend. If you're losing you
3439.2s
would actually do the opposite. You will
3440.6s
go all out attack. So in certain games
3443.6s
you want the scores, in others you don't
3445.8s
want. And so it's part of the designer,
3447.7s
the the the AI engineer that's working
3449.6s
on that to determine what information we
3451.4s
need and what we don't need. What else
3453.0s
could we do to reduce the dimensionality
3454.9s
of the problem and make our computation
3456.7s
faster?
3461.8s
grayscale essentially. That's true. Here
3464.6s
you actually don't need the colors. It's
3467.0s
just nice as a user for
3468.4s
user experience purposes. you don't
3470.3s
need. I don't think there's different
3471.7s
points based on the bricks that you
3473.4s
destroy. It's all the same. Um there
3476.6s
actually, funny enough, this algorithm
3479.3s
was used by um DeepMind to play a lot of
3483.0s
Atari games and they did a single
3485.1s
pre-processing where they removed the
3487.1s
channels because they said it doesn't
3488.6s
matter. Turns out in one of the games, I
3491.2s
think it was Seaquest, I forgot which one,
3493.8s
the fish disappeared when you did that.
3496.6s
And so that game didn't work. the the
3498.7s
agent couldn't crack it uh because they
3501.3s
thought that the same pre-processing
3502.6s
could apply to every game, but actually
3504.1s
they had to make a slight tweak.
3519.4s
also correct
3520.8s
you. So just to recap, you could do it
3522.8s
even better by using a low-dimensional
3540.1s
working on only that game. Okay, so let's do that. We'll do pre-processing. There's one last thing that nobody mentioned, which is history, because if you get only one screen you don't know where the ball is going. So actually you can't solve the game, and the way you fix that is by
3556.3s
giving a history of multiple screens, for example four screens, so that you see the direction the ball is going in. So our pre-processing function is called, let's say, phi of S, and phi of S is a mix of steps: you might convert to grayscale, reduce the dimension, the
3574.1s
height and width, and also add the history of four frames, and that should be enough. Turns out in most games you will need a little bit of history to know where the ball is going. Or in this example, could you just encode the
3588.5s
velocity vector of the ball? Yeah, exactly, you could replace the history, so the multiple screens, by just adding the gradient or the velocity of where the ball is going. That's true. But would it scale to every game? Because we know humans look at the Atari machine and
3606.2s
they look at pixels, this would be more likely to scale to every game.
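The pre-processing described above (grayscale, smaller height and width, a history of four frames) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `to_grayscale`, `downsample`, and `FrameStacker` are made up for this example, frames are plain nested lists of RGB tuples, and grayscale is a simple channel average. A real pipeline would more likely use NumPy arrays and proper image resizing.

```python
from collections import deque


def to_grayscale(frame):
    """frame: rows of (r, g, b) pixels -> rows of single intensity values."""
    return [[sum(pixel) / 3.0 for pixel in row] for row in frame]


def downsample(frame, factor=2):
    """Keep every `factor`-th row and column to shrink height and width."""
    return [row[::factor] for row in frame[::factor]]


class FrameStacker:
    """Keeps the last `history` processed frames so that motion is visible."""

    def __init__(self, history=4, factor=2):
        self.history = history
        self.factor = factor
        self.frames = deque(maxlen=history)

    def reset(self, frame):
        """Start an episode: fill the history with copies of the first frame."""
        processed = downsample(to_grayscale(frame), self.factor)
        self.frames.clear()
        for _ in range(self.history):
            self.frames.append(processed)
        return list(self.frames)

    def phi(self, frame):
        """Push the newest raw frame and return the stacked state."""
        processed = downsample(to_grayscale(frame), self.factor)
        self.frames.append(processed)  # the oldest frame drops out automatically
        return list(self.frames)


# Tiny 4x4 RGB "screen" as a stand-in for a real Atari frame.
screen = [[(30, 60, 90) for _ in range(4)] for _ in range(4)]
stacker = FrameStacker(history=4, factor=2)
state = stacker.reset(screen)
print(len(state), len(state[0]), len(state[0][0]))  # prints: 4 2 2
```

The stacked output plays the role of the pre-processed state that is fed to the network; because only pixels go in, the same function could in principle be reused across games, which is the scaling argument made above.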
3615.6s
Seaquest is a good example, or Space Invaders, where you have multiple enemies coming at you: then you would need to change your pre-processing to take into account the velocity of all these enemies. So it wouldn't work the same way. Whereas if you give the pixels, you actually get from the pixels the velocity of all
3631.7s
your enemies and the directions they're going. Okay. So this is our pre-processing. I'm going to refer to it as phi of S. And our deep Q-network architecture, because we're working with pixels, is going to be a convolutional network. Don't worry if you haven't learned it yet in the
3648.2s
class, but it's a bunch of conv layers and ReLU activations. And then we end with a fully connected layer that gives us the three Q-values for the different actions. So nothing special here. Now, we're going to go back to our vanilla training, the one that we saw
3669.6s
together earlier, and we're going to look at tips to train reinforcement learning algorithms. Those tips are not specific to Q-learning. Some of them apply to a lot more than Q-learning, and they're very important to know. They're part of the reason reinforcement learning has worked better in the last few years. So one of the
3689.6s
things that's pretty simple that we forgot to do is the pre-processing that we just did. So anywhere I had an S, I'm going to instead run S through the pre-processing step: I'm going to use phi of S. So I initialize with phi of S instead of S. I start from the initial
3706.2s
state S, and then I forward propagate phi of S. I get the Q of that pre-processed state and action A, et cetera. And then when I get my next state, so let's say I look at my current pre-processed state, I forward propagate it once, I see the
3728.9s
three Q-values, in our case, the two Q-values. One of them was better than the other action, right? Then I get my next state S prime. I want to pre-process that state as well. So that's pretty straightforward. You just replace all of that. The second thing we forgot to do is to
3748.2s
keep track of the terminal state. In our pseudocode, there is no concept of terminal state. It's pretty easy to add. You would probably just use an if-else statement. You would create a boolean to detect the terminal state. So let's say your boolean is terminal equals false. And then as you loop over the time steps of a
3764.4s
single episode, every time you're going to check: is the state that I'm going into, based on the action I'm taking, a terminal state? If it's a terminal state, then get out of the loop. There's nothing else after. The one thing that you need to be careful of is that if it's a terminal state, then your
3781.8s
target is not the Bellman equation. It's just the immediate reward. Remember, you get to the terminal state, you get a reward of 10. There's no Bellman equation to apply. It's just 10. It's the immediate reward. There's no discount, etc.
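The terminal-state rule just described can be written as a small helper. This is a sketch, not the lecture's actual pseudocode: the function name `q_target`, the discount value `gamma=0.9`, and the hard-coded list of transitions are all assumptions made for illustration.

```python
def q_target(reward, next_q_values, terminal, gamma=0.9):
    """Training target y for one transition (s, a, r, s')."""
    if terminal:
        # Nothing comes after a terminal state: no discounted future term.
        return reward
    # Otherwise use the Bellman target: r + gamma * max over a' of Q(s', a').
    return reward + gamma * max(next_q_values)


# Hypothetical episode: (reward, Q-values of the next state, terminal flag).
# The last entry is never reached because the loop exits on the terminal state.
episode = [
    (0.0, [0.2, 0.1], False),
    (0.0, [0.4, 0.3], False),
    (10.0, [0.0, 0.0], True),
    (5.0, [1.0, 1.0], False),
]

targets = []
for reward, next_q_values, terminal in episode:
    targets.append(q_target(reward, next_q_values, terminal))
    if terminal:
        break  # get out of the loop, there is nothing else after

print(targets)
```

The third target is exactly the immediate reward of 10, matching the example in the lecture, and the fourth transition is never used.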
3803.3s
Now, we're going to look at a new method that will enable more data efficiency. It's called experience replay. There are a couple of issues with the way we've been training so far. One is the correlation of successive screens. Imagine in the Atari game you have the ball that's in the top left corner
3826.9s
and it's traveling to the bottom right of the screen. You have many, many time steps that are essentially the same. It's all the ball traveling in the same direction. So you're actually training repetitively on something that is not that meaningful. You don't need to just train on a batch. The equivalent
3845.7s
in supervised learning is, let's say you're trying to differentiate cats and dogs, and you train on a mini-batch of cats, then on a mini-batch of dogs, then on a mini-batch of cats, and it will never converge. It will just index too much on cats and
3859.0s
then index too much on dogs. So you want to add some sort of experience replay concept, which we'll see, in order to mix the data more and get more diversity. The other thing that is important is that in our current training process, we are not reusing our data. You experience
3880.6s
something, you immediately train on it. You never see it again unless you re-experience the same thing sometime in the future, which might or might not happen. Experience replay is going to help us keep experiences in memory and maybe retrain on them on a regular basis, so that one experience might be useful multiple times,
3900.4s
which intuitively makes sense. Maybe you have an experience, you get an amazing reward, and you don't want to forget it. You want to retrain the model on it on a regular basis. It's more data-efficient. So here's what it looks like. The current way we were training was as follows. I'm just
3915.7s
going to say state instead of pre-processed state, but it's pre-processed. We're in a state S. We perform action A, we get a reward R, and we get into the next state S prime. From that next state, we perform another action A prime, we get a reward R prime, and we get into S double prime, and so on,
3934.2s
and so on. Each of these would be called one experience. One experience is one iteration of gradient descent. So right now we're training on these experiences. The training looks like this: I train on E1, I update my parameters. Then I train on E2, I update my parameters. Then I train on E3,
3953.2s
update my parameters. Those are highly correlated because they're part of the same episode. And as I was saying with the ball traveling in one direction, it might not be that helpful to train on all of these. So instead, we'll use experience replay, where we will collect
3969.8s
our first experience, but instead of training on it we will put it in a memory called the replay memory D. We'll put it in there, and then at every step we will sample from that memory to decide what to train on. So of course at the beginning, if we just have one
3989.0s
experience in the memory, we will train on that experience. But over time you will see that we get more diversity and reuse out of our experiences. So for example, let's say I experience E2. I put it in the memory, and then instead of training on E2, I'm going to randomly sample from the memory. I might get E1
4007.8s
or I might get E2. Then I experience E3, I put it in the memory, and I might get one of the three. This is the vanilla experience replay. In practice, there are more methods, like prioritized sweeping, which might tell you which experiences you want to weight more. Maybe some experiences had a higher
4026.6s
gradient, so you want to prioritize them. Things like that. So, all in all, this is what our training looks like with experience replay. We experience E1, we train on E1. Then the next training iteration is not on E2, it's on a sample from E1 and E2. Either or. The third experience
4048.5s
is then put in the replay memory, but we don't train on it directly. We train on a sample from whatever is in the replay memory, and we repeat. That is more sample-efficient, allows more reuse, and gives less cross-correlation in our training batch. So that's called replay memory
4072.2s
and you can use it with mini-batch gradient descent. Note that you still experience in the direction that the game is played. We still go and take the action as expected. We just don't necessarily update our model parameters based on the action that we ended up taking. We put it in the replay
4089.2s
memory. We may train on it later. Okay. So here is how it modifies our vanilla setup. We've added an experience from state S to state S prime to the replay memory. Let me walk you through it again. Within one time step, we forward propagate our
4112.9s
state through the Q-network. We execute the best action given the Q-values. This gives us a reward and the next state. The next state is pre-processed. And then, instead of training on that, we just add that transition to the replay memory. And instead we randomly sample a mini-batch
4132.8s
of transitions from the replay memory, and we train on those, and we redo the same thing again and again. Yes? Isn't replay biased towards the start of the game, if you sample from everything? So aren't you more likely to train on the start of the game than you would sometimes want?
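As an aside on the replay-memory loop walked through above, here is a minimal sketch of what such a memory could look like. The class name `ReplayMemory`, the bounded `capacity`, the fixed `seed`, and uniform sampling without replacement are all assumptions made for this illustration; the lecture only specifies storing transitions and sampling a random mini-batch from them.

```python
import random
from collections import deque


class ReplayMemory:
    """Stores transitions and hands back random mini-batches of them."""

    def __init__(self, capacity=10000, seed=None):
        # Bounded memory (an assumption): the oldest transitions are dropped first.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, terminal):
        """Record one experience instead of training on it right away."""
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size):
        """Uniformly sample up to `batch_size` stored transitions."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3, seed=0)
for t in range(5):
    # Still play the game in order: step t leads to step t + 1.
    memory.add(state=t, action=0, reward=0.0, next_state=t + 1, terminal=(t == 4))
    # Train on a random sample of the memory, not on the newest transition.
    batch = memory.sample(batch_size=2)
    # A gradient descent step on `batch` would go here.

print(len(memory), [transition[0] for transition in memory.buffer])
```

With a capacity of 3, only the three most recent transitions survive the five steps. The prioritized variants mentioned in the lecture would replace the uniform `sample` with a weighted one.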
4156.6s
Yeah, you would within one episode, but if you play multiple chess games, your replay memory would already get bigger. So then you would see some endgames, some middle games, some early games. Yeah, good. And in practice, it's actually useful
4177.0s
because you might imagine that in chess, let's say if you're a beginner, you see a lot of beginnings of games. People who are beginners are actually good at openings, but they're bad at endgames because they don't get to play a lot of endgames. Well,
4194.4s
that type of approach could be useful. You can retrain on endgames more often. And a more advanced version of the replay memory would also weight the experiences in the replay memory based on how large the gradient is going to be. So if you have an experience that actually was super
4211.8s
insightful, you can weight it higher so that you prioritize grabbing it and retraining on it. So let's say you blunder in chess: you might actually want to revisit that blunder later so that you don't do it again. Okay, so these were all the different
4230.8s
methods. Another one that's very intuitive and very important is for when, during the training process, our agent gets stuck in a local minimum. Here is how it would work in practice. You start in initial state S1 and you have three states ahead of you. If you take action A1, you go to
4252.6s
state two, which is a terminal state, and you get a reward of zero. If you take action A2, you get to S3, also a terminal state, and you get a reward of one. And if you take action A3, you get to state four, a terminal state with a reward of 1,000. So of course to
4271.4s
us it's obvious that we would want to explore state number four. It's pretty obvious. In practice, let's say you initialize your network, and in the first forward pass this is what you get. On the first forward pass the network is random: you get a Q-value of 0.5 for action one, 0.4 for action two, and 0.3 for action three.
4295.9s
What does that mean? It means the agent is saying, I'm going to take action one. So I take action one and I see an immediate reward of zero. Because it's a terminal state, the Bellman equation doesn't apply. I just have the immediate reward, which becomes my target Y. And so I perform a gradient descent
4316.4s
update to say this Q-value should have been zero. So this Q-value goes to zero. Now, second try. This time the Q-values are saying take action two. It's the highest Q-value. I take action two. I have an immediate reward ahead of me of one. Because it's a terminal state, there's no second
4340.0s
discounted future reward term. So I just take Y equals 1. I perform my gradient descent update and this Q-value goes to one. And then the third time, the agent is still saying take A2. Take action A2, reward of one. Good. That's what you predicted. Nothing to do. Just keep going. We're done with
4365.0s
training. We're stuck. We never visit the state we actually wanted to visit. So that wouldn't work for us. We will never visit that state using our current algorithm. Does it make sense why we would never visit that state? In practice, this is a big issue. The analogy for this concept of
4387.1s
exploration versus exploitation is when every day you take your bike and you cross campus, and you have a favorite route. And it turns out that the more you take that route, the better you get every time. You get a little faster. Maybe your turn is faster or something, or you can predict how many
4405.1s
people are going to be at that roundabout and you know how to take it in the wide way so you go faster. We've all done that. That's exploitation. You exploit what you already know and you get better at it. But maybe there's another route that you're not thinking of: instead of going north
4421.0s
from campus, you go south, and maybe it would be better. You will never know, because you don't have the courage or the patience to try it. That's the difference between exploration and exploitation. In practice, a good model would be able to handle both: to exploit when it needs to
4437.0s
exploit, and to explore when it needs to explore. The way we do it in practice in our pseudocode is to inject some randomness. So for example, when we are looping over time steps: with probability epsilon, let's say 5%, take a random action. So from time to time, on average one time in every 20, you take a
4458.1s
random action, and it will allow you to maybe visit a new path. The analogy in chess is that you might play a creative move from time to time that might be worse today but might allow you to learn something and get better over time. Yeah? In the example we just
4477.4s
covered, couldn't we resolve that by just setting the initial Q-values to infinity? Couldn't we resolve this problem by setting the initial values to infinity? Well, here's the problem if you set the initial values to infinity. So you would say, instead of randomly
4496.9s
initializing your network, you initialize it in a way that the outputs are equal to infinity? Yeah, so that we wouldn't get the issue where the Q-value of action three in state one is... Well, in practice, if the three Q-values are infinity, then you can't make a decision on the spot. So you're saying
4514.4s
just pick one randomly, because if the three are infinity you can't decide which one to take, right? And also, if it's infinity and the reward is one, I mean if it's a really large number and the reward is one, your gradient is going to be massive, right? So
4516.6s
because if the three are infinity you
4518.2s
you can't decide which one to take right
4521.9s
and also if in if it's infinity and the
4524.4s
reward is one I mean if it's a really
4526.6s
large number and the reward is one your
4528.1s
gradient is going to be massive right so
4530.9s
it's going, I guess the loss function is going to be massive, and, I don't know, I imagine it would be really hard to train. But in practice you start with a random initialization, because this might be one example, but, you know, in the game of chess, actually,
4548.6s
the reward is one at the end and zero the rest of the time, or maybe the reward is a thousand at the end and when you lose your rook it's a negative reward. You can't predict what the reward structure is going to be. We want an agent that is able to adapt to it, and
4563.4s
it's better to find a method that can scale to different environments, essentially. Okay, so this was epsilon-greedy action selection, which adds some randomness: with probability epsilon, take a random action. Okay, so putting together all our techniques, because we're getting good at training reinforcement learning algorithms, this is what we have. We
4590.0s
initialize our Q-network parameters. We have a random network. We initialize our replay memory D. And then we loop over episodes. We start from an initial state. We create a boolean that allows us to detect terminal states. With probability epsilon, we're going to take a random action. Otherwise, we're going to follow what we know, which is forward
4608.4s
propagate the state through the Q-network and take the action that has the highest Q-value. That allows you to observe a reward and the next state. Take that next state, forward propagate it again...
4625.5s
Oh no, sorry: observe that next state, add it to the replay memory, sample from the replay memory, and then train on that sample. In the process you will need to do another forward pass, because you need to estimate your target using the immediate reward plus, per the Bellman equation, the discounted future reward. Okay, are you experts at Q-learning?
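The loop just described can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the lecture's implementation: the five-state chain environment, the constants, and the function names are all hypothetical, and a small table of Q-values stands in for the Q-network so that the example runs with the standard library alone.

```python
import random
from collections import deque

# Hypothetical toy setup: a chain of 5 states, action 0 = left, 1 = right.
N_STATES, N_ACTIONS = 5, 2
GAMMA, EPSILON, LR = 0.9, 0.2, 0.5

def step(state, action):
    """Toy environment: reward 1 only when the right end (terminal) is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def epsilon_greedy(q, state, epsilon, rng):
    """With probability epsilon take a random action, otherwise the argmax of Q."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: q[state][a])

def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Bellman target: immediate reward plus discounted best future Q-value."""
    return reward if done else reward + gamma * max(next_q_values)

def train(episodes=300, seed=0, max_steps=200):
    rng = random.Random(seed)
    # Random initialization (a table here, a neural network in the lecture).
    q = [[rng.uniform(-0.01, 0.01) for _ in range(N_ACTIONS)]
         for _ in range(N_STATES)]
    replay = deque(maxlen=1000)          # replay memory D
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < max_steps:
            action = epsilon_greedy(q, state, EPSILON, rng)
            next_state, reward, done = step(state, action)
            replay.append((state, action, reward, next_state, done))
            # Train on one transition sampled from the replay memory.
            s, a, r, s2, d = rng.choice(replay)
            q[s][a] += LR * (td_target(r, q[s2], d) - q[s][a])
            state, steps = next_state, steps + 1
    return q
```

Calling `train()` returns the learned table. In a real DQN the update line would be a gradient step on the squared difference between the network output and the target, with the target computed by a second forward pass on the next state.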
4657.0s
where we get at the end. You can claim proudly you have trained an Atari agent. It's not that complicated, as you can see, other than the Bellman equation piece. It turns out the agent has discovered that it can send the ball to the back, and it's actually much easier to finish the game like that, which is quite
4674.5s
interesting. You know, a good player would know that you can dig a tunnel and finish the game without too many issues. Yeah. How do you actually quantify when the game has ended? How would the model come out after the training?
4691.0s
Yeah. Well, first you would start seeing the model get good rewards as it plays; it manages to get really good rewards while earlier it might not, you know, and so that's probably your best guess for how good the model is. In practice, if you're AlphaGo, you can
4709.4s
also test it against the best humans in the world, and you can observe that they're losing against the model. I'm saying you have, like, a bunch of different chess engines, and some of them are way, way better than others. They have different... maybe they're both based on reinforcement learning and at the end
4727.3s
they maximize both of their rewards. Yeah. So how do you know which model is actually doing better? You can get them to play together. So you have no idea if you do that one. No, you could actually monitor the loss function and look at: is
4742.0s
the Bellman equation respected? If the Bellman equation is respected, then your model is really, really good. And then we're going to see an example of competitive self-play, where you get the model to play against other models, and then over time, as you watch them play thousands and thousands of times, you
4759.8s
can tell which model is ahead of another one. You can then sort of copy-paste the best model into the other models and then make them play again many times. And because you have the epsilon-greedy approach, one of the models is naturally going to get better than the others because of the randomness that you add.
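The evaluate-then-copy loop described here can be sketched as a toy. Everything in this example is an assumption made for illustration: each "model" is reduced to a single hypothetical skill number, a match is a biased coin flip, and `explore` is only a stand-in for the learning (with its randomness) that happens between rounds.

```python
import random

def play_match(skill_a, skill_b, rng):
    """Return 0 if A wins, 1 if B wins; the stronger side wins more often."""
    p_a = skill_a / (skill_a + skill_b)
    return 0 if rng.random() < p_a else 1

def round_robin(skills, games_per_pair, rng):
    """Let every pair of models play many games and count wins per model."""
    wins = [0] * len(skills)
    for i in range(len(skills)):
        for j in range(i + 1, len(skills)):
            for _ in range(games_per_pair):
                winner = (i, j)[play_match(skills[i], skills[j], rng)]
                wins[winner] += 1
    return wins

def copy_best(skills, wins):
    """Copy the model with the most wins into every slot of the population."""
    best = max(range(len(skills)), key=lambda k: wins[k])
    return [skills[best]] * len(skills)

def explore(skills, noise, rng):
    """Stand-in for further training: each copy drifts in a random direction."""
    return [max(1e-3, s + rng.gauss(0.0, noise)) for s in skills]

def self_play(rounds=5, population=4, seed=0):
    rng = random.Random(seed)
    skills = [1.0] * population
    for _ in range(rounds):
        skills = explore(skills, noise=0.1, rng=rng)
        wins = round_robin(skills, games_per_pair=200, rng=rng)
        skills = copy_best(skills, wins)
    return skills
```

With real agents, `skills` would be a list of network weights and `play_match` would run a full game between two agents; the selection and copy steps would keep the same shape.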
4783.0s
and then we'll spend 20 minutes on RLHF. Here are other examples. This is Pong, which is one-v-one; Seaquest;
4797.5s
the one that maybe more of you know, Space Invaders, a very popular game as well. So the impressive thing that they showed is that you can actually solve many games with the exact same algorithm, no tweaks, which is quite impressive. Let's go a little further and talk about advanced topics. Here is a game
4824.0s
called Montezuma's Revenge. This game is particular because you're controlling a little character right here. And this character is trying to go and grab, let's say, this key right here, and it has some obstacles or some enemies that it needs to take care of. What do you think is going to be an
4845.4s
issue if we apply what we just learned to this game, in comparison to, let's say, chess or Go? Yes. Yeah, the reward is very delayed. Like, if you start with a random network, what are the chances that the network is going to figure out that to get to the
4870.2s
key, it actually should go in the opposite direction. It could jump down here. It should catch the rope. The rope will probably allow the character to go to the ladder. It goes down the ladder. It has to jump over this enemy. My
4886.4s
guess is it's an enemy. I'm not sure, but I think it's an enemy because of the color. And I know that in gaming, if it was green it might not have been an enemy, but if it's gray or red it might be an enemy. And then go up the ladder
4899.0s
and grab the key. The chance is very low that the agent is going to make that many successive good decisions to get there.
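As a rough back-of-the-envelope illustration of how low that chance is, here is a deliberately simplified model. It assumes, purely for illustration, that the untrained agent picks uniformly at random among a fixed number of actions and that exactly one action is the right one at each of a fixed number of consecutive steps. The function name and the numbers used in the comments are hypothetical.

```python
def chance_of_random_success(n_actions, n_required_steps):
    """Probability that uniform random play picks the one right action
    at every one of n_required_steps consecutive steps."""
    return (1.0 / n_actions) ** n_required_steps

def expected_attempts(n_actions, n_required_steps):
    """Average number of independent attempts before the first success."""
    return 1.0 / chance_of_random_success(n_actions, n_required_steps)

# Assumed example: 18 possible actions and 20 steps that all have to be right.
p = chance_of_random_success(18, 20)
```

Under these assumptions `p` is vanishingly small, which is why an agent that only learns from the reward at the key almost never sees that reward in the first place.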
4907.5s
You're right. Why is it easier for a human to actually solve that game? Intuition. Prior knowledge. So, for example, when you look at this game,
4921.3s
even if you have never played it, my guess is you would know you can go down the ladder, because you know what a ladder is. Or you can see this little rope and you're like, I'm going to catch the rope. I'm going to jump and go to the other side. And you look at this
4933.3s
little monster and you're like, I better not touch this monster. Or if anything, I will jump on top of it, because you've played Mario, let's say. So all of this is human intuition. Sometimes, for a baby, you would call it survival instinct: you throw the baby in the water and
4948.6s
suddenly it flips and it can swim. Those are things that are, to a certain extent, encoded in our DNA, but at the very least encoded in our experience of doing other things that have nothing to do with this game. And so the problem here is called imitation
4962.7s
learning. Is there a better way to start our network than a random initialization, one that allows the network to, for example, guess that this is a ladder? And it turns out that if the network knows that, it will be more likely to get to the reward first, and then learn from that reward and then get better over
4979.0s
time. The other part that can also use human knowledge, which is what we're going to see together, is reinforcement learning from human feedback, where you have an analogy here, which is you can train a language model and it might be completely misaligned with what humans actually care about. How does
4997.5s
reinforcement learning help in those situations? That's going to be the next topic in the last part of the lecture. Okay, let me show you a few other results quickly. Today we talked about DQN and deep Q-learning. In practice there are a lot more reinforcement learning algorithms, but you
5014.6s
got the gist of it. You got the concept of making good sequences of decisions, epsilon-greedy, exploration versus exploitation, terminal state, starting state, all of that. The one algorithm that is very popular right now is called PPO, proximal policy optimization. There is one that is even more popular right now, that's actually
5035.3s
from a year ago at Stanford, called DPO, that we won't study in the class. One of the things to know about PPO, just to go over it really quickly, and I pasted two important papers from Schulman from a few years back, TRPO (trust region policy optimization) and PPO, is that it is not a value-based
5052.9s
algorithm. So in Q-learning you learn the Q-values and then you define your policy as the argmax of the Q-values. In PPO, you learn the policy directly, which is a more probabilistic method. It also works well with continuous spaces. If you look at Q-learning, we learned one output for
5072.4s
one action. If you actually have a game that has continuous actions, like autonomous driving, where it's not just turn the wheel to the right or to the left, it's what degree you turn it, it's continuous, then the DQN would not work well, or you would have to granularize the number of actions: a
5091.8s
little bit to the right, a little bit more, a little more, which would not be really useful. Instead you would use PPO.
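The contrast between the two options can be sketched as follows. This is an illustrative sketch under assumptions, not code from the lecture: the steering range of plus or minus 30 degrees, the function names, and the Gaussian form of the policy are all hypothetical choices. The first half shows what "granularizing" the action means for a Q-network with one output per action; the second half shows a policy that outputs a distribution over a continuous angle and samples from it.

```python
import math
import random

def discretize_steering(n_bins, max_angle=30.0):
    """Evenly spaced steering angles in [-max_angle, max_angle] degrees."""
    step = 2 * max_angle / (n_bins - 1)
    return [-max_angle + i * step for i in range(n_bins)]

def greedy_discrete_action(q_values, angles):
    """Q-learning style: one Q-value per discrete angle, take the argmax."""
    best = max(range(len(angles)), key=lambda i: q_values[i])
    return angles[best]

def sample_gaussian_policy(mean, std, rng):
    """Policy style: sample a continuous angle and return its log-probability."""
    angle = rng.gauss(mean, std)
    log_prob = (-0.5 * ((angle - mean) / std) ** 2
                - math.log(std) - 0.5 * math.log(2 * math.pi))
    return angle, log_prob
```

A finer grid gives the discrete approach more precision but also more outputs to learn, whereas the policy only has to output a mean and a spread whatever precision is needed.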
5110.0s
reward in DQN, different reward structures will lead to different types of, you know, agent strategies. But you're right, for the game of Go you could actually define the reward as one if you win and zero if you don't win. That's it. You know, every move will be
5126.3s
zero until the last move is a win. In chess you might actually do intermediate rewards, because you want to tell the agent that it's good to capture the opponent's pieces, to get rid of them. You could also do end-to-end and say, I don't give any intermediate
5140.2s
reward. I just give a final reward, which might be more complicated to train on, but it might actually lead to a more optimal strategy, because in fact you could actually win without taking any piece from your opponent. So other things about PPO: you know, it's more probabilistic. It has a
5158.0s
concept of an expected advantage, which at every step, instead of telling you how good that action is, would tell you how much better it is than the current state, like how much better it would be to do a certain thing versus what you would have done otherwise. I'm not going to go into the
5173.6s
details. It's all in the paper. But those are things that are important.
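For readers who want a concrete handle on the advantage idea, here is a small sketch. The lecture does not go through the formulas, so treat this as an assumption-laden illustration: the advantage is written as the Q-value of the action minus the value of the state, and the clipped term follows the surrogate objective from the PPO paper by Schulman et al. The function names and the default clip range of 0.2 are choices made for this example.

```python
def advantage(q_value, state_value):
    """How much better taking this action is than the state's baseline value."""
    return q_value - state_value

def ppo_clipped_term(ratio, adv, clip_eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is the new policy's probability of the action divided by the
    old policy's probability of the same action.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * adv, clipped_ratio * adv)
```

The clipping removes the incentive to move the policy far from the old one in a single update: once the ratio leaves the range around 1, pushing it further in the direction the advantage favors no longer increases the term.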
5177.4s
Here are a few examples of PPO. This example on the left is from OpenAI a few years back, where you can see it's a continuous space where the agent is being bullied a little bit, but
5192.7s
it's trying to grab the rewards, but it's also subject to external forces that are sort of throwing balls at it. It's a little bit mean, but you can imagine that this is a continuous space, meaning you're controlling the nodes, you're controlling the joints of the agent, and you're controlling the forces, the
5213.1s
angles, and so that's why PPO would be better in that case. Super. Here's competitive self-play, which I really like. And this is the sumo game. Push the opponent outside the ring and you get a reward. So actually it's interesting, because you're seeing some emergent behavior, which is they attack each other's feet
5239.4s
or they lower their center of gravity to be more stable, for example. Yeah. Yeah. It's versions, sometimes different initializations, for example. So no, but good question. Oftentimes what OpenAI would do back in that time is they would create copies of
5262.4s
the same model, they would initialize them differently, and they would let them learn, and it turns out one of the models will get better than the others, and then they will copy that model again to the rest and do the same thing again and again, pretty much. Oh yeah, it's kind of funny, isn't it? That's a good catch.
5285.8s
that's a good goal. laughter
5293.8s
they're a little awkward, you have to say, but it works. Okay, so this, you know, I'll let you watch the video, it's going to be shared, but here's another set of games that are even more complicated, that I mentioned early on. OpenAI Five, which you can think of as an
5309.6s
equivalent of League of Legends, Dota, where you have a 5v5 game, so you have to collaborate, etc., which adds like literally one additional degree of complexity. And StarCraft: AlphaStar from DeepMind is an example where the observation is not the entire state, you have fog, and so that adds
5333.6s
another layer of complexity. We're not going to see that together today. I would encourage you to look at the AlphaGo documentary on Netflix if you haven't. Who has seen it already? Nobody. Okay. Well, you can now watch it with a different eye, understanding reinforcement learning. And at some point in the
5357.6s
documentary, you will see that AlphaGo makes a very odd move, a very creative move. And people are like, I don't understand that move. Even the top researchers or the best players would say in the video they don't understand that move. It turns out that that move is very unintuitive for humans because,
5379.0s
as humans, we are trained to maximize our chances of winning. Like literally, if I can eat all your pieces in chess, I will eat all your pieces. And if I can surround your stones in Go as much as I can, I will do it. The agent is just programmed to win. So that move actually
5395.4s
looks counterintuitive because the agent doesn't care about winning by one or winning by, you know, 20 stones. It just cares about winning. And that move specifically put the agent in a good place to win by a small margin. Yeah. So that's an example of an insight that you will understand from this
5412.8s
class and you will see in the documentary. Okay, I think we have 10 minutes. I'm just going to introduce reinforcement learning from human feedback, because it's a more modern topic that is very trendy right now. It's important to know, and so let's look
5432.7s
at it together. We're going to start by recapping how language models are trained, in a nutshell, and then we'll see what supervised fine-tuning looks like. We'll talk about how we train a critique model, a reward model, and then finally what RLHF looks like and why it is so trending in the news. So
5452.5s
our training objective for language models is next-token prediction, right? We've already talked about it in a former lecture. The idea is that I will get some inputs. I'm reading Wikipedia, let's say, or some sort of text online, and I read a sentence and I predict the last token, and I do that again and
5472.6s
again. So for example, "deep learning", and then "deep learning is", "deep learning is so", "deep learning is so cool", and that's it. So you get the idea, right? You always predict the next token, and then over time it forces the model to exhibit emergent behaviors, and it understands the connections
5498.0s
between those concepts and it's really good at generating text. We compute a loss function. You're actually going to study this loss function in C5, so I'm not going to talk about it right now, but you perform, you know, a gradient descent loop. And this is how you get your first pre-trained language
5519.0s
model. You get a pre-trained language model. You can call it on a text or a prompt and it will continually generate, and you call it again and again and again, and it generates, generates, generates. Everybody's comfortable with that, right? Okay. So that's how we train a language model. But there are a couple of
5535.8s
problems. The first problem is that online data does not reflect online data does not reflect helpfulness. So to give you a concrete example um what you might find in the training set is something like deep learning is so cool when actually what you might find in practice is people asking what is deep learning.
5540.6s
online data does not reflect
5543.0s
helpfulness.
5545.7s
So to give you a concrete example um
5548.6s
what you might find in the training set
5550.4s
is something like deep learning is so
5552.2s
cool when actually what you might find
5555.0s
in practice is people asking what is
5557.4s
deep learning.
5559.5s
So the data is not really reflective of the fact that you want an agent to be helpful, and that's a problem because the model was trained to continue text rather than answer questions, and in practice you would see it's a big problem. Another problem is the model has no concept of good, polite, or
5561.9s
you want an agent to be helpful and
5564.0s
that's a problem because the model was
5566.9s
trained to continue text rather than
5568.7s
answer questions
5570.9s
and in practice you would see it's a big
5572.4s
problem. Another problem is the model
5574.7s
has no concept of good, polite, or
5577.4s
helpful yet. And to give you a concrete example, you might actually ask a pre-trained language model, "My laptop won't turn on. What should I do?" And then the model responds, because it has read it on Reddit or on Wikipedia, "Laptops sometimes don't turn on because of power issues." Um, which is not what
5580.4s
example, you might actually ask a
5582.6s
pre-trained language model, my laptop
5584.9s
won't turn on. What should I do? And
5587.4s
then the model responds because it has
5589.1s
read it on Reddit or on Wikipedia,
5591.8s
laptops sometimes don't turn on because
5593.8s
of power issues. Um, which is not what
5596.4s
you asked. You asked, "What should I do?" And in fact, a better answer would have been: "Check that your charger is properly connected and that the outlet works. If that's fine, try holding the power button for 10 seconds. If it still doesn't start, the battery or motherboard may..." blah blah blah blah. That's a better
5599.4s
in fact, a better answer would have been
5601.7s
check that your charger is properly
5603.6s
connected and that the outlet
5605.4s
works. If that's fine, try holding the
5607.0s
power button for 10 seconds. If it still
5608.9s
doesn't start, the battery or motherboard
5610.3s
may blah blah blah blah. That's a better
5611.9s
answer. That's what you want a language model to do nowadays. Um, and the model, um, can give you factual text because that's what it's been trained on, but it doesn't understand being helpful or, um, having an answer that looks like a humanlike answer. So our solution to it will start with
5613.4s
language model to do nowadays. Um, and
5616.2s
the model um can give you factual text
5619.4s
because that's what it's been trained
5620.5s
on, but it doesn't understand being
5622.2s
helpful or um having an answer that
5625.0s
looks like a humanlike answer.
5629.0s
So our solution to it will start with
5631.6s
using supervised fine-tuning, um, which is going to be learning from human-written demonstrations of helpful behavior, and then we go even further and use RLHF, which will optimize not only for human-written sentences or paragraphs but for preferences, and the word "preference" is the keyword. Let's talk about how we can improve our pre-trained model with
5634.6s
going to be learning from human written
5636.6s
demonstrations of helpful behavior and
5639.4s
then we go even further and use RLHF
5642.5s
which will optimize not only for human
5644.9s
written sentences or paragraphs but for
5647.4s
preferences and the word preference is
5649.7s
the keyword. Let's talk about how we can
5652.3s
improve our pre-trained model with
5653.6s
supervised fine-tuning. The idea is that we want to align models with human-written responses, and the first step that we're going to use is to build a data set. Let's build a data set of human prompt-response pairs. So what OpenAI is actually going to do, I'll explain it in a second, is it
5655.6s
The idea is that we want to align
5658.4s
models with human written responses
5662.3s
and the step one that we're going to
5664.7s
use is to build a data set. Let's build
5666.9s
a data set of human prompt response
5669.9s
pairs. So what OpenAI is actually going to
5673.0s
do I'll explain it in a second is it
5674.8s
might collect some of the prompts that we all use and then ask humans to respond to those prompts and put that in a data set. It might also separately ask experts to write really good prompts and then answer those prompts. It's a fully human-made data set. And then we use that data set to fine-tune
5677.4s
we all use and then ask humans to
5679.9s
respond to those prompts and put that in
5681.7s
a data set. It might also ask
5684.1s
separately experts to write really good
5686.1s
prompts and then answer those prompts.
5689.0s
It's a fully human-made data set. And
5691.8s
then we use that data set to fine-tune
5694.2s
our pre-trained model. And by now you've learned fine-tuning in the online video. So you know what I'm talking about using supervised learning. So what it looks like is I take my pre-trained model that I just told you how we train and then I give it a prompt explain deep learning to a beginner and I also will
5696.2s
learned fine-tuning in the online video.
5698.2s
So you know what I'm talking about using
5700.4s
supervised learning. So what it looks
5702.0s
like is I take my pre-trained model that
5705.0s
I just told you how we train and then I
5707.4s
give it a prompt explain deep learning
5709.9s
to a beginner and I also will
5711.9s
concatenate to it a response, a good response written by a human: "Deep learning is a type of machine learning that uses neural," and then I expect the model to come up with the word "networks." So it's literally doing whatever we did to train the pre-trained model, but we do it on human-written prompt-response pairs.
5716.9s
response written by a human. Deep
5719.2s
learning is a type of machine learning
5720.9s
that uses neural
5723.3s
and then I expect the model to come up
5725.0s
with the word networks.
5727.3s
So it's literally do whatever we did to
5730.4s
train the pre-trained model but we do it
5732.6s
on human-written prompt-response pairs.
5740.6s
We use, you know, the same loss function, which measures how far the model's response is from a human response. Um, you do that many times and you will get SFT, supervised fine-tuning. Um, but it has some shortcomings
5742.9s
how far the model's response is from a
5744.8s
human response um you do that many times
5748.1s
and you will get SFT supervised
5751.4s
fine-tuning,
5753.0s
um but it has some shortcomings
5760.6s
that is extremely costly to collect. In fact, I believe in the first version of that, InstructGPT, there were only 13,000 prompt-response pairs, and it turns out it did really well despite that.
5763.8s
fact, I believe in the first version of
5765.8s
that, InstructGPT, there were only 13,000
5770.1s
prompt response pairs and turns out it
5773.1s
did really well despite that.
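To make the SFT step described above concrete, here is a minimal sketch of the kind of loss involved: the average negative log-likelihood of the human-written response tokens. The function name `sft_loss`, the probabilities, and the choice to mask prompt tokens out of the loss are assumptions for illustration, not details given in the lecture.

```python
import math

def sft_loss(token_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_probs[i]: probability the model assigned to the i-th target token.
    is_response[i]: True if that token belongs to the human-written response.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    return sum(losses) / len(losses)

# Hypothetical probabilities: two prompt tokens (masked out), then the
# response targets "uses", "neural", "networks".
probs = [0.9, 0.8, 0.5, 0.25, 0.125]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
print(f"SFT loss on response tokens: {loss:.4f}")
```

The lower the probability the model gives to the human's next word, the larger the loss, which is the sense in which the loss measures how far the model's response is from the human one.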
5781.0s
generalize well because, again, you're not doing reinforcement learning here. You're doing supervised learning, and so you're just showing a set of examples, 13,000 examples, that you want to learn, but what tells you that it will generalize to an unseen prompt that will come up from your user base? And so this approach, SFT, really teaches the
5783.7s
you're not doing reinforcement learning
5785.3s
here. You're doing supervised learning
5788.8s
and so you're just showing a set of
5790.8s
examples 13,000 examples that you want
5793.3s
to learn, but what tells you that it
5795.8s
will generalize to an unseen prompt that
5798.6s
will come up from your user base. And so
5802.3s
this approach SFT really teaches the
5805.6s
model to imitate good behavior from humans. And that's the key. It's imitation. It is not preference optimization. To do preference optimization, that's where we're going to train a reward model and we're going to do proper RLHF. So let me talk to you about the RM, the reward model, and then I'll tell
5807.8s
humans. And that's the key. It's it's
5810.2s
imitation. It is not preference
5812.7s
optimization.
5815.6s
To do preference optimization,
5818.0s
that's where we're going to train a
5819.4s
reward model and we're going to do
5821.0s
proper RLHF. So let me talk to you about
5824.0s
the RM reward model and then I'll tell
5827.0s
you about RLHF in a nutshell. The goal of, um, RLHF is to align not with human responses but with human preferences. So what's going to happen is we're going to train a separate model to predict which responses humans prefer, and we're going to call that model the
5831.3s
The goal of, um, RLHF is to align not
5835.8s
with human responses but with human
5837.8s
preferences.
5839.7s
So what what's going to happen is we're
5842.1s
going to train a separate model to
5845.0s
predict which responses humans prefer, and
5848.6s
we're going to call that model the
5850.1s
reward model. It's a separate model from whatever we've trained before. The model um is going to use data from labelers. So you're going to show labelers two or more responses to the same prompt and those responses will be sampled from the SFT. So your best model right now is the SFT. You will sample three or four
5851.8s
whatever we've trained before. The model
5855.0s
um is going to use data from labelers.
5857.9s
So you're going to show labelers two or
5860.9s
more responses to the same prompt
5864.1s
and those responses will be sampled from
5867.6s
the SFT. So your best model right now is
5870.3s
the SFT. You will sample three or four
5873.7s
responses, and you know how we sample, right? You can tweak the temperature. You can select not only the top-probability word, the softmax layer's number one word, but you can sometimes sample differently, and you will get a variety of answers. And then you will ask a human labeler to say answer B is better
5875.6s
right you you can tweak the temperature.
5877.7s
You can select not only the top-probability
5880.2s
word, the softmax layer's number
5883.1s
one word, but you can sometimes sample
5885.4s
differently and you will get a variety
5887.3s
of answers. And then you will ask a
5889.7s
human labeler to say answer B is better
5892.0s
than answer C, and answer C is better than answer A, and answer A is sort of equal to answer D, let's say. Um, they will be asked which response they prefer, and it can get more complicated. It doesn't have to be just a simple ranking. You have multiple Likert
5894.3s
than answer A and answer A is sort of
5896.9s
equal to answer D. Let's say
5900.7s
um
5903.0s
they will be asked which response they
5904.6s
prefer and it can get more complicated.
5907.0s
It doesn't have to be just a simple
5908.4s
ranking. You have multiple Likert
5910.8s
scale methods and so on. But the point is that you will collect those pairwise comparisons that we call preference data, and you will use it to train a reward model, which is initialized from your SFT. So your SFT is here. It's your best model to date, and you're going to modify the last
5913.4s
point is that you will collect those
5914.8s
pairwise comparisons that we call
5916.6s
preference data and you will use it to
5918.7s
train a reward model which is
5920.7s
initialized from your SFT. So your SFT
5923.6s
is here. It's your best model to date
5926.5s
and you're going to modify the last
5928.6s
layer. So the softmax layer at the end of a language model that will tell you this is the token we should output or this is the word we should write. Um instead of that you'll get rid of that layer. You'll put a scalar value as output. You'll put a linear layer with a
5931.1s
of a language model that will tell you
5933.4s
this is the token we should output or
5935.4s
this is the word we should write. Um
5937.8s
instead of that you'll get rid of that
5939.5s
layer. You'll put a scalar value as
5941.9s
output. You'll put a linear layer with a
5943.9s
scalar value that will represent the reward head. It will predict the the reward you know uh which is a proxy for the preference of the human. The way you'll train that reward model is you'll give it um a batch of two um you know you give it a prompt X with a response A
5945.9s
reward head. It will predict the the
5948.2s
reward you know uh which is a proxy for
5950.9s
the preference of the human. The way
5952.6s
you'll train that reward model is you'll
5955.2s
give it um a batch of two um you know
5958.7s
you give it a prompt X with a response A
5962.1s
and the preference of the user and you'll give it the same prompt with a response B from the SFT with the preference of the user. So here the user is saying response A is better than response B. And so if you actually were sending that in this model, you will get a predicted reward for the preferred
5963.9s
you'll give it the same prompt with a
5965.9s
response B from the SFT with the
5968.5s
preference of the user. So here the user
5970.3s
is saying response A is better than
5972.7s
response B. And so if you actually were
5975.3s
sending that in this model, you will get
5978.0s
a predicted reward for the preferred
5980.2s
answer and a predicted reward for the dispreferred answer. That allows you to, uh, train using a loss function that I'm not going to cover given our time constraints. The loss function will encourage the model to assign higher rewards to preferred responses. So you're trying to separate the higher reward for the better preference from the lower reward for
5982.4s
dispreferred answer.
5985.8s
that allows you to uh train using a loss
5992.6s
function that I'm not going to cover
5993.7s
given our time constraints. The loss
5996.4s
function will encourage the model to
5998.0s
assign higher rewards to preferred
6000.8s
responses. So you're trying to
6002.9s
dissociate the higher reward better
6006.1s
preference from the lower reward for
6007.8s
lower preference. And it turns out that if you do that many times, you will have a reward model that, given a prompt and a response, will approximate human preference. So you've just trained a critic that represents your humans. It's a proxy for what humans prefer. It's been trained on a lot of human preferences.
6009.7s
if you do that many times, you will have
6012.0s
a reward model that given a prompt and a
6014.8s
response will be approximating human
6018.2s
preference. So you've just trained a
6020.6s
critic that represents your humans.
6023.8s
It's a proxy for what humans prefer.
6025.9s
It's been trained on a
6026.9s
lot of human preferences.
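The lecture skips the reward model's loss function, so the following is only a sketch of one commonly used choice, the pairwise logistic (Bradley-Terry style) loss used in the InstructGPT line of work. Treat it as an assumption rather than the lecture's own formula. The function name and the example reward values are hypothetical.

```python
import math

def pairwise_preference_loss(r_preferred, r_dispreferred):
    """Loss for one labeled pair: small when the preferred response
    gets the higher predicted reward, large when the order is flipped."""
    margin = r_preferred - r_dispreferred
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Hypothetical scalar outputs of the reward head for responses A and B
# to the same prompt, where the labeler preferred A.
good_order = pairwise_preference_loss(2.0, 0.0)
no_opinion = pairwise_preference_loss(0.0, 0.0)
bad_order = pairwise_preference_loss(0.0, 2.0)
print(good_order, no_opinion, bad_order)
```

Minimizing this over many labeled pairs pushes the predicted reward of the preferred response above that of the dispreferred one, which matches the behavior described above.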
6029.1s
The reason it's better to use a model than actual humans is because we can use it widely on all sorts of inputs and it can scale from a data standpoint. Also note that this method is better than SFT because it's way easier to ask humans what's your preference between those two things than to ask them to come
6030.7s
than actual humans is because we can use
6032.7s
it widely on all sorts of inputs and it
6035.7s
can scale from a data standpoint.
6038.8s
Also note that this method is better
6040.6s
than SFT because it's way easier to ask
6043.5s
humans what's your preference between
6044.9s
those two things than to ask them to come
6046.4s
up with answers to prompts. It takes way less time. And if you've used ChatGPT, you've probably been asked before to tell them which response you prefer. Yeah. So once trained, the reward model replaces the human as the evaluator during reinforcement learning from human feedback, and reinforcement learning from human feedback is very
6048.7s
less time. And if you've used ChatGPT,
6051.4s
you've probably been asked before to to
6053.2s
to tell them which response you prefer.
6057.4s
Yeah. So once trained the reward model
6060.1s
replaces the human as the evaluator
6062.1s
during reinforcement learning from
6064.6s
human feedback and reinforcement
6066.5s
learning from human feedback is very
6069.5s
comfortable for you. Now I will show you what it looks like given the Q-learning algorithm we learned. But essentially we have first taught the model what good behavior looks like with SFT, and then we built a reward model that can tell us how good an answer is according to human preferences, and the
6071.4s
show you what it looks like given the
6072.7s
Q-learning algorithm we learned. But
6074.7s
essentially we have first taught the
6077.0s
model what good behavior looks like with
6079.1s
SFT and then we built a reward model
6081.8s
that can tell us how good an answer is
6083.7s
according to human preferences and the
6086.2s
RLHF approach is where we will let this model practice, get scored by the reward model, or the critic, and update itself to produce higher-scoring answers, so more preferred answers. And it's the same as the games we've seen together, but some things differ. So I just pasted here the exact setup that we've learned together
6089.2s
model practice get scored by the reward
6092.2s
model or the critic and update itself to
6094.9s
produce higher scoring answers. So more
6097.9s
preferred answers and it's the same as
6101.6s
the games we've seen together but some
6104.6s
things differ. So I just pasted here the
6107.3s
exact setup that we've learned together
6110.5s
um, for reinforcement learning. The differences are the following. You know, our objective is still to maximize expected reward, which is produced, um, by the reward model aligned with human preferences. The agent is the language model being fine-tuned. The environment is the space of possible prompts and continuations.
6113.1s
differences are the following. You know
6115.4s
our objective is still to maximize
6117.2s
expected reward that is produced um by
6120.1s
the reward model aligned with human
6121.9s
preferences.
6127.4s
The agent is the language model being fine-tuned.
6130.4s
The environment is the space of possible
6132.6s
prompts and continuations.
6135.6s
It's any text that you can encounter. The state is the specific prompt plus the tokens that were generated so far. The next state is one more token added, and the action is the next token that is chosen by the agent, or the model, which is of course determined by the
6138.1s
encounter. The state is the specific
6141.6s
prompt plus the tokens that were
6143.9s
generated so far. The next state is one
6147.0s
more token added and the action is the
6150.8s
next token that is chosen by the agent
6152.6s
or the model.
6154.0s
which is of course determined by the
6156.2s
policy. And then the reward is estimated by the reward model that we trained to represent human preferences. In this case one episode is one full prompt. So imagine that you get a prompt and you start generating and you go through this reinforcement learning loop
6157.7s
and then the reward is
6161.0s
estimated by the reward model that we
6163.7s
trained to represent human preferences.
6171.0s
in this case one episode is one full
6174.2s
prompt. So imagine that you get a prompt
6177.4s
and you start generating and you go
6179.0s
through this reinforcement learning loop
6180.6s
and you observe the rewards and then you try to maximize the future rewards, and then at the end of training you end up with having your pre-trained model turn into an SFT and your SFT turn into a way better model using RLHF.
6182.2s
try to maximize the future rewards and
6184.5s
then at the end of training you end up
6186.2s
with having your pre-trained model turn
6188.3s
into an SFT and your SFT turn into a way
6191.0s
better model using RLHF.
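The mapping above (state, action, next state, reward, episode) can be sketched as a toy rollout. Everything here is hypothetical: the tiny vocabulary, the random `toy_policy`, and the keyword-based `toy_reward_model` stand in for a real language model and a trained reward model. Scoring only the finished completion is an assumption of this sketch.

```python
import random

VOCAB = ["check", "your", "charger", "<eos>"]

def toy_policy(state, rng):
    """Hypothetical policy: picks the next token (the action) at random."""
    return rng.choice(VOCAB)

def toy_reward_model(prompt, response):
    """Hypothetical reward model: prefers responses mentioning the charger."""
    return 1.0 if "charger" in response else 0.0

def run_episode(prompt, max_tokens=8, seed=0):
    rng = random.Random(seed)
    state = [prompt]              # state = prompt plus tokens generated so far
    rewards = []
    for step in range(max_tokens):
        action = toy_policy(state, rng)   # action = next token
        state = state + [action]          # next state = one more token added
        done = action == "<eos>" or step == max_tokens - 1
        # The reward model scores the prompt and the full response together,
        # so every step before the last one gets a reward of zero.
        rewards.append(toy_reward_model(prompt, state[1:]) if done else 0.0)
        if done:
            break
    return state, rewards

state, rewards = run_episode("My laptop won't turn on. What should I do?")
print(state[1:], rewards)
```

A real RLHF loop would then use these episode rewards to update the policy's parameters. That update step is left out here.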
6201.8s
this. The model does not get a reward at every single token. It gets a reward at the end of a sequence when the completion is finished because reward model was uh was asked to rate prompts and responses together. So you need to finish the generation in order to see what's the reward. And so again going
6204.9s
every single token. It gets a reward at
6208.7s
the end of a sequence when the
6210.5s
completion is finished because reward
6212.9s
model was uh was asked to rate prompts
6216.2s
and responses together. So you need to
6218.3s
finish the generation in order to see
6220.7s
what's the reward. And so again going
6223.0s
back to making good sequences of decisions, that's exactly it. You want the model to make enough good sequences of decisions so that the response is preferred by the critic, which represents a proxy for human preferences. So all intermediary rewards are typically zero, and that makes it a very
6224.6s
decision that's exactly it. You want the
6226.2s
model to make enough good sequences of
6227.8s
decision so that the response is
6230.6s
preferred by the critic, which
6232.3s
represents a proxy to the human
6234.6s
preferences.
6236.7s
So all intermediary rewards are
6238.6s
typically zero and that makes it a very
6241.5s
sparse-reward episodic task, just like a game of chess where you only get a reward when you finish, assuming you're not defining intermediary rewards. So you only know if you did well at the end, and you have to then use that information to update your network and get a better proxy for it. Super. There's a very nice
6246.1s
game of chess where you only get a
6247.7s
reward when you finish assuming you're
6249.2s
not defining intermediary reward. So you
6251.9s
only know if you did well at the end and
6254.4s
you have to then use that information to
6257.0s
update your network and get a better
6259.2s
proxy for it. Super. There's a very nice
6262.2s
video. We're not going to play it for the sake of time, but I will send it online. Uh, it's from 4 days ago. It's, uh, you know, former Stanford student Andrej Karpathy, who is very thoughtful and articulate and was explaining 4 days ago why reinforcement learning can be terrible at times and that human minds work way more
6263.3s
the sake of time, but I will send it
6264.8s
online. Uh it's from 4 days ago. It's uh
6268.1s
you know
6269.9s
former Stanford
6271.8s
student Andrej Karpathy, who is very
6274.5s
thoughtful and articulate and was
6276.6s
explaining 4 days ago why reinforcement
6279.4s
learning can be terrible at times and
6282.1s
that human minds work way more
6284.6s
efficiently. And so I would encourage you to watch this 4-minute video because he's very clearly outlining why reinforcement learning is still not great even if it's the best thing we can use in many ways.
6286.3s
you to watch this 4-minute video because
6288.6s
he's very clearly outlining why
6290.5s
reinforcement learning is still not
6293.0s
great even if it's the best thing we can
6294.6s
use in many ways.