Name: Stanford CS230 | Autumn 2025 | Lecture 5: Deep Reinforcement Learning
Uploaded: 2026-02-08T15:13:15.779
Duration: 6301 s
Description: Read the full transcript of "Stanford CS230 | Autumn 2025 | Lecture 5: Deep Reinforcement Learning" by Stanford Online. Practice English listening and readin...

9.7s for Stanford deep learning CS230. Today's lecture is going to be about deep reinforcement learning. I actually switched from the original plan of talking about neural network interpretability and LLM visualization, simply because you haven't had the chance to study attention maps and convolutional neural networks, so it would have been overkill to do

34.5s that in week five. So we're going to talk about neural network interpretability and visualization in a later lecture. But today our focus will be on deep reinforcement learning, which is probably my favorite lecture of the class. I feel like I say that every week, but it's okay. I like it.

54.8s The agenda is pretty packed. We're going to start with deep reinforcement learning, which you can think of as the marriage between deep learning and reinforcement learning. Together, the baby is called deep reinforcement learning. We're going to see how reinforcement learning works and how neural networks can play a part in building a reinforcement learning

81.3s agent. In the second half of the class we will focus on a very specific concept called reinforcement learning from human feedback, which you might have heard of. It's one of the core concepts that really made the difference between what you might remember as GPT-2 and ChatGPT. That's the leap.

109.1s That's really the technique that has democratized access to LLMs because of the performance improvements and the alignment with humans. So we're going to see what this concept of RLHF is, how it works, and why it allows us to align a language model to

131.3s human preferences. Ready to go? As always, let's try to make it interactive. So, the motivation behind deep reinforcement learning. As usual, you're going to have all the most important papers that are covered in the class listed at the bottom of each slide. Reinforcement learning has

151.3s grown in popularity. One of the very popular papers, called "Human-level control through deep reinforcement learning", a work from DeepMind, showed us that a single algorithm or training method can allow us to train AI that can play many, many Atari games better than

179.4s humans. A single algorithm, over 40 to 50 games where it exceeds human capability, which is quite impressive when you think about the fact that machine learning used to be niche, and you would have to train a really niche algorithm to perform each different task. Here's an algorithm that can just learn sort of every Atari game. A little later,

201.1s you might have heard of AlphaGo. AlphaGo is an algorithm that was developed to beat and exceed human performance in the game of Go. We'll talk about it a little more. The game of Go is a very complex game. Some would argue way more complex than chess from a decision-making standpoint and from the

223.2s possibilities that can happen on the board. And so it actually got solved in 2017, again by the DeepMind team and David Silver's lab. Later on, another great paper from DeepMind showed us that reinforcement learning can also be used for strategy games that might be a

248.3s touch more complex than chess or Go, that might actually involve multiple players playing with each other or against each other. Some of you might have played StarCraft, for example. That's an example of a game that requires a lot of long-term thinking and short-term thinking. Another one is

267.9s Dota. Some of you might have played Dota or League of Legends, where you have a team playing against another team. Those are examples of games that involve multiple agents playing collaboratively. And it's pretty hard to develop systems that can play with each other against multiple opponents. And finally, most recently, this is 2022, so alongside the

289.7s release of ChatGPT, this paper introduces the concept of reinforcement learning from human feedback applied to aligning language models with human preferences, and we'll talk about that later. All this to say that reinforcement learning has allowed us to exceed human performance in a variety of tasks. The first one I want us to

314.6s think about is the game of Go. So let's say that you were asked to solve the game of Go with classic supervised learning. Okay, everything we've seen together so far: labeled data. How would you solve the game of Go with classic supervised learning? ... be the label, etc.

356.6s plenty of games, hopefully from good players if you want the algorithm to work. You look at X, the input, being the current state of the board, and Y as the next state of the board. This would tell you what move was selected, and you learn the move, essentially. And hopefully if you do

376.2s that across many, many games, you might see the agent become more attuned to the game and develop better strategies. So really, hopefully it's a professional player. What are the disadvantages of that, or the shortcomings that you can anticipate?
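As a rough illustration of the supervised framing described above, the sketch below takes a pair (X, Y) of consecutive board states and recovers the move that would serve as the label. The integer encoding of the board (0 = empty, 1 = black, 2 = white), the tiny 3 by 3 board, and the function name `extract_move` are assumptions made for this example, not something from the lecture.

```python
from typing import List, Tuple

# Assumed encoding for illustration: 0 = empty, 1 = black stone, 2 = white stone.
Board = List[List[int]]


def extract_move(x: Board, y: Board) -> Tuple[int, int, int]:
    """Return (row, col, color) of the stone present in y but not in x.

    x is the current board state (the input) and y is the next board state.
    The difference between them is the move a supervised model would learn.
    """
    for r, (row_x, row_y) in enumerate(zip(x, y)):
        for c, (before, after) in enumerate(zip(row_x, row_y)):
            if before == 0 and after != 0:
                return r, c, after
    raise ValueError("no new stone found between the two boards")


# A toy 3x3 board: black already played in the center, white answers to its right.
x = [[0, 0, 0],
     [0, 1, 0],
     [0, 0, 0]]
y = [[0, 0, 0],
     [0, 1, 2],
     [0, 0, 0]]

move = extract_move(x, y)
print(move)  # (1, 2, 2): row 1, column 2, white stone
```

Each such pair is treated in isolation, which is worth keeping in mind when thinking about the shortcomings of this approach.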

397.4s Yes. ... see the entire space of possible states of the board, which is what you said. So you might miss out on a lot of different strategies. So the game of Go is actually a game with two players. One

425.0s player that uses the black stones and one player that uses the white stones. Iteratively, they're going to place those stones on the grid, a 13 by 13 grid that you can see on screen, with the goal of surrounding their opponent. So you're constantly trying to surround the stones of the opponent, and the opponent is

445.1s trying to surround your stones. So you can imagine that for every intersection on the grid there are multiple possibilities: either there's a black stone, or a white stone, or nothing. And on a 13 by 13 grid, you can imagine how many possible board states there are. It's impossible to capture

466.6s all of that with historical moves from professional players. It would just never cover that. The same thing could be said of chess as well. You know that even professional players can plan X number of steps in advance, but nobody knows where the game takes you, and in the late stages of the game, or

483.6s the endgame, players always find themselves playing a different game, and that's part of the magic of being good at chess. So yeah, that's the problem. What's another problem or shortcoming, beyond the fact that we can't possibly observe all the states?
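To put a number on the "too many board states" argument: if every intersection is black, white, or empty, a board with n by n intersections has at most 3^(n·n) configurations. The short computation below is a sketch of that counting argument only. It gives an upper bound, since not every configuration is a legal Go position, and the function name is made up for this example. It covers the 13 by 13 grid mentioned above as well as the standard 19 by 19 board.

```python
def upper_bound_states(size: int) -> int:
    """Upper bound on board configurations: 3 choices per intersection."""
    return 3 ** (size * size)


n13 = upper_bound_states(13)
n19 = upper_bound_states(19)

# Python integers are arbitrary precision, so the digits can be counted exactly.
print(len(str(n13)))  # 81 digits, i.e. on the order of 10^80
print(len(str(n19)))  # 173 digits, i.e. on the order of 10^172
```

No dataset of recorded games comes anywhere close to covering a space of that size.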

517.0s you said, well, first, you don't even know if this was a good move. So maybe it was not even a good move, and you're learning something that was not a good move and labeling it as a good move. And second, you're actually only getting partial information, meaning you don't have the

531.1s information of what's in the person's mind and what strategy they're trying to execute. So you're sort of looking at a single example within a long-term strategy, and you can't expect the model to guess what the long-term strategy is, because it was just trained on X and Y and on matching the inputs to a

550.4s possible output. So you don't really have any concept of a strategy at that point. Every decision of the model looks like a one-off. Okay, those are really good points. The other one is that the ground truth might be ill-defined. What I mean by that is, even the best humans in the world do not

574.4s play their best game every day, and even their best game is not the ground truth. And that creates an issue because you're essentially training against a target that is off by a certain margin. You're never going to get better than the best human, and the best human is not executing the best possible

592.6s strategy at every point. So you could argue: what if we get a panel of experts that we're monitoring, and those are the best players in the world? Even with a panel of experts that decides every move, you still have an ill-defined ground truth. So that's a big issue. Too many states

611.9s in the game, as you mentioned, and we will likely not generalize, which is what you said, meaning we're looking at one-off situations. We're not looking at entire strategies, and so when we face a board state that we've never seen before, because the model was not trained on strategy, we sort of get stuck. Yeah.

631.4s Okay. And this is an example of a perfect application for reinforcement learning, because reinforcement learning is all about delayed labels and making sequences of good decisions. So if you had to remember in one sentence what RL is: RL is making sequences of good decisions.

667.1s The difference between classic supervised learning and RL is that in classic supervised learning you teach by example. In reinforcement learning you teach by experience, which is a different concept. You're not just showing cats and non-cats to a model. You're actually letting the model experience an environment until it figures out what

691.8s were the best decisions it made and learns from them. Reinforcement learning applications: I'm going to mention them. We have gaming, of course, which we already covered. What are other applications of AI where we need good sequences of decisions? Yes.

715.1s Autonomous driving. Yeah, correct. I mean, in driving, you could argue RL could work, and there's some RL going on, but what you mean, I think, is you have some sort of a dynamic planning algorithm that allows you to strategize. If you see a red light ahead, you might start slowing down over time, but

735.1s maybe it will turn green, so you might not slow down completely. This is an example of a strategy that you need, of course. Yeah. Robotic control. That's a great example, also related to autonomous driving. But imagine you want to

750.5s teach a robot to move from point A to point B. The number of good decisions that the robot needs to make in terms of moving each of its joints is tremendous. It's actually super unlikely that a robot would move from A to B if it's not trained to make good sequences of decisions.

783.8s like it, but it happens to be the biggest one for reinforcement learning.

795.9s Marketing. You're right. So yeah, we talked about robotics. Advertisement is another example. Advertisement is a long game. Companies are showing you multiple ads before you buy. And in fact, the reason reinforcement learning is important is that they're planning a strategy that might lead a buyer to

817.6s execute a purchase over time, and it requires long-term thinking. So there's a lot of reinforcement learning applied to marketing, advertisement, real-time bidding processes, etc. Okay, clear on what RL is and how it differs from classic supervised learning? No? Okay. So let's put some vocabulary around that concept. In

843.1s reinforcement learning, you have an agent, and the agent interacts with an environment. In the environment, the agent will perform certain actions that we will denote a_t, where t is a time step. And the environment will show you states that transition from time step t to time step

869.4s t+1. So subject to an action a_t, the environment may transition from s_t to s_{t+1}. You can think of the game of Go. I take the action of putting my black stone on a certain grid intersection, and the environment has changed. It moved from the state at time step t

888.8s to the state at time step t+1, where my stone is on the grid. After that state update happens, there are two things that the agent observes: an observation that we will denote o_t and a reward r_t. And of course the goal of the agent will be to maximize the rewards.
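The vocabulary above (agent, environment, action a_t, state s_t, observation o_t, reward r_t) maps onto a simple interaction loop. The following is a minimal sketch under stated assumptions: `ToyEnvironment`, `RandomAgent`, and `run_episode` are hypothetical names invented here, the state is just an integer counter, and the reward rule is arbitrary. It is meant to show the shape of the loop, not any particular algorithm from the lecture.

```python
import random
from typing import Tuple


class ToyEnvironment:
    """Hypothetical environment: the state is a counter moved up or down by the action."""

    def __init__(self) -> None:
        self.state = 0  # s_0

    def step(self, action: int) -> Tuple[int, float]:
        """Apply action a_t, move from s_t to s_{t+1}, return (o_t, r_t)."""
        self.state += action
        observation = self.state  # fully observed here: the observation equals the state
        reward = 1.0 if action == 1 else 0.0  # arbitrary rule: moving up is rewarded
        return observation, reward


class RandomAgent:
    """Hypothetical agent that ignores the observation and acts at random."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)

    def act(self, observation: int) -> int:
        return self.rng.choice([-1, 1])


def run_episode(env: ToyEnvironment, agent, num_steps: int) -> float:
    """Run the agent-environment loop and return the sum of rewards collected."""
    total_reward = 0.0
    observation = env.state
    for _ in range(num_steps):
        action = agent.act(observation)         # agent picks a_t
        observation, reward = env.step(action)  # environment returns o_t and r_t
        total_reward += reward
    return total_reward


total = run_episode(ToyEnvironment(), RandomAgent(seed=0), num_steps=10)
print(total)
```

A learning agent would replace the random choice in `act` with a rule that is updated from the rewards it has seen so far.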

919.6s We'll talk about it a little more. The observation is sometimes equal to the state. Can someone guess why we might need two concepts instead of a single concept? Why is it important to have a state and an observation?

945.0s The environment may not be fully transparent to the user. So for example, in chess or in Go, the observation is actually equal to the state: you see everything on your board, all the information is available to you. If you play League of Legends or StarCraft,

964.2s you know the concept of, I think in English it's called a fog: you only see a certain part of the map until you have explored everything, or until your friends are visiting the other parts of the map. And so the observation

978.9s is actually less information than the state of the environment. Okay. And then the last piece of vocabulary is a transition. When I refer to a transition, I refer to the process of getting from state s_t to state s_{t+1}, which means we're in state s_t, the agent takes an action a_t, it observes o_t and a reward r_t, and it transitions to the next state s_{t+1}. Question?
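The two ideas just covered, a transition and an observation that carries less information than the state, can be written down as a small data structure. This is a sketch under assumptions: the `Transition` container, the one-dimensional "map", and the `observe` helper with its visibility radius are all hypothetical and chosen only to mimic the fog example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Transition:
    """One step of experience: s_t, a_t, r_t, o_t and the next state s_{t+1}."""
    state: List[int]
    action: int
    reward: float
    observation: List[Optional[int]]
    next_state: List[int]


def observe(state: List[int], position: int, radius: int) -> List[Optional[int]]:
    """Fog-style observation: cells farther than `radius` from `position` are hidden (None)."""
    return [
        cell if abs(index - position) <= radius else None
        for index, cell in enumerate(state)
    ]


full_state = [5, 0, 3, 0, 7]                       # what the environment actually contains
visible = observe(full_state, position=1, radius=1)
print(visible)  # [5, 0, 3, None, None]

step = Transition(
    state=full_state,
    action=1,               # hypothetical action code
    reward=0.0,
    observation=visible,
    next_state=full_state,  # nothing on the map changed in this toy step
)
```

In a fully observed game such as chess or Go, `observe` would simply return the whole state.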

1017.6s Are there examples of environments where the state is so large that the entire ...

1030.1s reasons. Yeah. You might have games. I mean, look at open-world games. You could argue, I don't know, there are some games where you might press start and you see the entire environment, but who cares about what's happening 20,000 kilometers west of you if you're in a certain location? That might not

1050.1s influence your strategy. So you might actually put some sort of a trust circle, some sort of a circle in which you observe, which you think has 99% of the information you need, possibly for computational reasons. That's a good point. Okay, let's get to a practical example of a reinforcement learning algorithm

1070.6s and develop it together. This example is called "Recycling is good", because recycling is good, but also because it's a simple example, illustrative of reinforcement learning. So let's say we have a small environment with five states. There is a starting state, marked in brown, which is state two. It's our

1094.4s initial state. And then on the left side, you have state one, which is a garbage can. It's great to get to the garbage can because you're going to be able to put in the garbage the stuff that you have in your hands. You

1112.8s know, you're trying to throw away some garbage and the garbage can happens to be there, so we would expect there to be a reward. On the other side, if you go to the right, you might pass by state three, which is empty. You might pass by state four, where there is

1128.7s a chocolate packaging that is left on the ground, which you can pick up, and it's good to pick it up. And then on state five, you have the recycling bin, which is more valuable than the garbage can because you can recycle, and you should get a better reward for that.

1146.6s So that's our game. In this game, we define a reward that is associated with the type of behaviors that we want the agents to learn. Um, and the reward is as followed. That's just one example. Plus two for throwing your garbage in the normal can, plus one for picking up the chocolate packaging, and plus 10 if

1148.8s define a reward that is associated with

1151.0s the type of behaviors that we want the

1152.8s agents to learn. Um, and the reward is

1155.5s as follows. That's just one example.

1157.8s Plus two for throwing your garbage in

1160.4s the normal can, plus one for picking up

1163.3s the chocolate packaging, and plus 10 if

1166.0s you manage to make it to the recycle bin. Is it clear? Now, the goal will be, and that's the case in reinforcement learning oftentimes, to maximize the return. We'll define the return formally, but think about it as maximizing the amount of rewards that you get as you go through

1167.7s you manage to make it to the recycle bin.

1169.7s Is it clear?

1172.0s Now, the goal will be, and that's the

1175.2s case in reinforcement learning oftentimes,

1177.5s to maximize the return.

1181.5s We'll define formally the return but

1183.2s think about it as maximize the amount of

1185.1s rewards that you get as you go through

1187.1s this journey and you make your decisions. In this specific game, we have five states and there's three types of state. In brown is the initial state. We have normal states and we have in blue terminal states. When you get to a terminal state in reinforcement learning, it will typically end the game. It will end one

1188.4s decisions. In this specific game, we

1191.3s have five states and there's three types

1193.5s of state. In brown is the initial

1195.9s state. We have normal states and we

1198.6s have in blue terminal states. When you

1201.1s get to a terminal state in

1202.8s reinforcement learning, it will

1204.7s typically end the game. it will end one

1209.4s episode of the game. You move to another episode. You'll get back to the starting state or initial state and you'll redo another episode. The possible actions for our agent here are going to be fairly simple. Left and right. And we're going to add an additional

1212.9s episode. You'll get back to the starting

1214.3s state or initial state and you'll redo

1216.0s another episode.

1218.5s The possible actions for our agent here

1220.5s are going to be fairly simple. Left and

1222.7s right.

1225.2s And we're going to add an additional

1226.8s rule that is important, which is that the garbage collector comes in 3 minutes and it takes a minute to get from one state to the other. Why is that an important rule to add to the game? Can you guess? Can you guess? Yeah. forth between uh state three and state four. You just collect a bunch of

1228.4s the garbage collector comes in 3 minutes

1232.4s and it takes a minute to get from one

1234.6s state to the other. Why is that an

1237.0s important rule to add to the game?

1241.9s Can you guess?

1244.2s Can you guess? Yeah.

1249.1s forth between uh state three and state

1251.6s four. You just collect a bunch of

1253.3s chocolate packaging and you never make it to the bin. And so, it's not what we want. Yeah. Okay. So, how do we define the long-term return? The long-term return is going to be defined as capital R, which is the sum of rewards with a discount.

1254.8s it to the bin. And so, it's not what we

1257.7s want. Yeah.

1260.2s Okay. So, how do we define the

1262.6s long-term return? The long-term return

1264.4s is going to be defined as capital R,

1268.0s which is the sum of rewards with a

1272.9s discount.

1275.0s Discount is a very important concept in reinforcement learning. It's also a very natural concept to think about. Can you think of what the discount would represent for humans? Do you have an example of what it could be? Yeah. Huh? Money. Yeah. The value of money and time. Exactly. Or the energy that a robot

1278.2s reinforcement learning. It's also a very

1280.6s natural concept to think about. Can you

1283.2s think of what the discount would

1285.4s represent for humans? Do you have an

1288.0s example of what it could be? Yeah.

1291.4s Huh? Money.

1293.4s Yeah. The value of money and time.

1295.1s Exactly. Or the energy that a robot

1297.9s might have, things like that. Yeah. You would rather get, you know, a dollar now than a dollar in 10 years knowing that there's some inflation, for example. That's the example of a discount. In reinforcement learning it's the same. You know, let's say you have a strategy that takes so much time. You

1300.0s you would rather get, you know, a dollar

1302.4s now than a dollar in 10 years knowing

1304.2s that there's some inflation, for

1306.2s example. That's the example of a

1308.3s discount. In reinforcement learning it's

1309.8s the same. You know, let's say you have a

1312.0s strategy that takes so much time. You

1314.1s need to discount it because your robot might lose energy as you're going through it, for example. Discounts can vary, you know, but they stay between zero and one. Um, so what is the best strategy to follow if gamma, the discount, is equal to one, meaning, you know, time doesn't matter

1315.6s might lose energy as you're going

1317.2s through it, for example. Discounts

1319.9s can vary, you know, but they stay

1322.1s between zero and one.
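
For reference, the discounted return described above can be written out as follows. This is only a sketch in standard notation: the symbols r_t (the reward collected at step t) and T (the last step of the episode) are assumed here, since the slide itself is not part of the transcript.

```latex
R = \sum_{t=0}^{T} \gamma^{t} r_t = r_0 + \gamma r_1 + \gamma^{2} r_2 + \dots + \gamma^{T} r_T, \qquad 0 \le \gamma \le 1
```

With gamma equal to one, every reward counts the same no matter when it arrives. With a smaller gamma, later rewards count for less.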

1325.4s Um, so what is the best strategy to

1327.9s follow if gamma the discount is equal to

1331.4s one meaning you know time doesn't matter

1335.1s here if it's longer or shorter just want to maximize the return. Best strategy to follow. Someone who hasn't spoken yet.

1337.8s to maximize the return.

1341.0s Best strategy to follow.

1349.7s Someone who hasn't spoken yet.

1369.5s uh 3 minutes. You can't bounce around because you you will not get to the terminal state before the time allotted is done. But that would be a good idea if this rule was not true. What else could you do? too hard.

1371.5s You can't bounce around because you you

1373.8s will not get to the terminal state

1375.3s before the time allotted is done. But

1379.2s that would be a good idea if this rule

1381.0s was not true.

1383.8s What else could you do?

1394.3s too hard.

1402.3s give me also the maximum reward you would get.

1404.0s would get.

1413.7s People are sleepy today. Yeah. Recycle. Go to the recycle. So, right, right, right. Yeah, that's right. Thank you. Right, right, right. And then what's your Sorry. What's your um what's your total reward? Yeah, that's right. 11. So that's where we get terminal state and we grab our

1414.2s Go to the recycle. So, right, right,

1417.2s right.

1418.2s Yeah, that's right. Thank you.

1422.6s Right, right, right. And then what's

1424.4s your Sorry. What's your um what's your

1426.6s total reward?

1429.6s Yeah, that's right. 11. So that's where

1432.5s we get terminal state and we grab our

1434.9s reward of 11. Very good. Now assuming 0.9 for gamma. We're going to complexify things a little bit. I'm going to walk you through a very simple algorithm that you know allows us to sort of determine the best strategy and we will put our numbers in a matrix. So for instance um

1438.8s 0.9 for gamma. We're going to complexify

1442.6s things a little bit. I'm going to walk

1444.3s you through a very simple algorithm that

1446.7s you know allows us to sort of determine

1448.7s the best strategy and we will put our

1451.0s numbers in a matrix. So for instance um

1454.2s we'll define a Q table and Q stands you know is a it's a it's a value function um where the the the name Q-learning Q star you might have heard um all of these things come from Q-learning and so let's say we have a Q table which has uh the size of number of states times

1459.4s know is a it's a it's a value function

1462.2s um where the the the name Q-learning

1465.2s Q star you might have heard um all of

1468.0s these things come from Q-learning and so

1470.7s let's say we have a Q table which has uh

1474.0s the size of number of states times

1477.1s number of actions so five rows two columns in our case. Every entry of the Q table is essentially representing how good it is to take action A in state S. Do you agree that if we had a table with these numbers essentially we solve the problem meaning at any point the agent

1480.6s columns in our case.

1482.8s Every entry of the Q table is

1486.4s essentially representing how good it is

1489.4s to take action

1491.8s A in state S.

1495.8s Do you agree that if we had a table with

1498.2s these numbers essentially we solve the

1500.7s problem meaning at any point the agent

1503.4s can just look in the table. I am in state three, let's look at column one, that would tell me the value of action one, and let's do that for column two, it would tell me the value of action two. So I have everything I need to make my decisions

1506.3s three let's look at column one that

1508.8s would tell me the value of action one

1510.6s and let's do that column two it would

1512.2s tell me the value of action two so I

1514.2s have everything I need to make my

1515.4s decisions

1518.0s so that table is really the the thing you want to find in this exercise now the way we will find the table is sort of using a backtracking algorithm where we might actually codify the environment as a tree and traverse the tree. So here's what it looks like. I start in S2 and I have two

1520.1s you want to find in this exercise now

1523.7s the way we will find the table is sort

1526.5s of using a backtracking algorithm where

1529.4s we might actually

1531.4s codify the environment as a tree and

1533.5s traverse the tree. So here's what it

1535.4s looks like. I start in S2 and I have two

1538.2s options ahead of me. I can go to the left where I will get a reward of two. It's an immediate reward. The immediate reward is not discounted. It's an immediate reward. Remember the formula for R. The immediate reward R0 is not discounted. That would take me to S1.

1540.6s left where I will get a reward of two.

1542.9s It's an immediate reward. The immediate

1544.7s reward is not discounted. It's an

1547.0s immediate reward. Remember the the

1549.4s formula for R. The immediate reward R0

1552.8s is not discounted. That would take me to

1555.6s S1.

1557.1s It's a terminal state, so there's nothing to do after. Second option, I go to the right and I get a reward of zero. That's my immediate reward. And I end up in state three. State three is not a terminal state. So I can go and do the same

1559.1s nothing to do after.

1562.2s Second option, I go to the right and I

1565.0s get a reward of zero. That's my

1567.2s immediate reward. And I end up in state

1569.3s three. State three is not a terminal

1571.4s state. So I can go and do the same

1573.4s exercise from state three. In state three, I have two options. I can go to the left where I would see a reward of zero and I will end up in S2 or I will go to the right and I will get an immediate reward of plus one. It's an immediate reward. We're not discounting

1575.6s three, I have two options. I can go to

1577.6s the left where I would see a reward of

1580.1s zero and I will end up in S2 or I will

1583.4s go to the right and I will get an

1585.2s immediate reward of plus one. It's an

1588.1s immediate reward. We're not discounting

1589.7s it. I will end up in S4 and from S4 again I have two options. Back to the left to S3 with zero reward or to the right with the amazing reward of plus 10 and the terminal state of S5. So that's my map of immediate rewards. That's not my discounted return. So what we're

1592.8s again I have two options. Back to the

1594.9s left to S3 with zero reward or to the

1598.6s right with the amazing reward of plus 10

1602.3s and the terminal state of S5. So that's

1606.6s my map of immediate rewards. That's not

1609.9s my discounted return. So what we're

1611.8s going to do now is we're going to backtrack up the tree in order to compute the discounted returns. Actually, if I'm in S3 right here, I see that I can get an immediate reward in S4 of one. And I want to compute my maximum return that I can get from when

1612.8s backtrack up the tree in order to

1615.0s compute the discounted returns.

1617.3s Actually, if I'm in S3 right here,

1623.1s I see that I can get an immediate reward

1626.7s in S4 of one. And I want to compute my

1630.9s maximum return that I can get from when

1633.4s I'm in S3. My maximum return is that in S4 I could get a plus 10, right? But I need to discount that. My discount is 0.9. So I multiply 10 by 0.9. What it tells me is that from S4 I can expect nine plus one, which I get as an immediate reward from moving from S3 to

1636.6s S4 I could get a plus 10, right? But I

1639.9s need to discount that. My discount is

1642.9s 0.9. So I multiply 10 by 0.9. What it

1646.6s tells me is that from S4 I can expect

1649.0s nine plus one, which I get as an

1651.8s immediate reward from moving from S3 to

1653.6s S4, I can update this number to 10. Meaning from S3 the best you can hope for is a discounted return of 10, which is 1 plus 0.9 times 10. Everyone follows? Now let's do the same exercise one step before in S2. uh you know um in S2 um I have um an immediate reward of zero for

1656.3s Meaning from S3 the best you can hope

1659.4s for is a discounted return of 10 which

1662.7s is 1 plus 0.9 times 10.

1666.3s Everyone follows?

1668.5s Now let's do the same exercise one step

1670.9s before in S2. uh you know um in S2 um I

1676.7s have um an immediate reward of zero for

1680.5s going to S3 or an immediate reward of two for going to S1. Um S1 is not going to be worth it. We already know that because when I'm in S3 I can actually expect 10, which I have to discount. 0.9 times 10 gives me 9, plus 0 immediate reward from S2 to S3. That tells me that the

1683.8s two for going to S1. Um S1 is not going

1687.8s to be worth it. We already know that

1689.4s because when I'm in S3 I can actually

1692.5s expect 10, which I have to discount. 0.9

1697.0s times 10 gives me 9, plus 0 immediate reward

1700.6s from S2 to S3. That tells me that the

1703.2s discounted return from state two which is our initial state is nine. Good follow just a simple backtracking. Now I can copy back this. So S3, I know that when I'm in S3, um uh you know, I can expect zero immediate reward to um uh to sorry, if I if I'm if I'm in S2, I

1705.9s is our initial state is nine.

1710.6s Good follow just a simple backtracking.

1714.8s Now I can copy back this. So S3, I know

1717.5s that when I'm in S3, um uh you know, I

1720.5s can expect zero immediate reward to um

1724.3s uh to sorry, if I if I'm if I'm in S2, I

1728.8s can expect uh zero immediate reward plus a discount times the plus 9 that I could expect in S3. And so that gives me values that should cover everything that we have in this Q table. So I I do that backtracking. I copy paste all of that into my Q table all the way up here. And

1732.8s a discount times the plus 9 that I could

1735.8s expect in S3. And so that gives me

1739.6s values that should cover everything that

1743.0s we have in this Q table. So I I do that

1746.2s backtracking. I copy paste all of that

1748.7s into my Q table all the way up here. And

1751.5s this is what I get. We essentially finish the game at this point. We um can look uh at a certain row. So let's say I'm in state number three. I look on the third row of that Q table and I see that I have two options. If I go back to S2, ultimately my discounted return will be

1754.2s We essentially finish the game at this

1756.3s point. We um can look uh at a certain

1760.8s row. So let's say I'm in state number

1763.3s three. I look on the third row of that Q

1766.4s table and I see that I have two options.

1768.9s If I go back to S2,

1771.4s ultimately my discounted return will be

1775.4s 8.1, right? If I actually go to S4 on the right, I will get 10 because I will get 1 plus 0.9 times 10, which is 10. So this is a toy example, but it tells you that if you were able to backtrack through the entire environment, you will

1777.0s right? If I actually go to S4 on the

1781.5s right, I will get 10 because I will get

1783.9s 1 plus 0.9 times 10, which is 10.
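
The backtracking walked through above can be sketched in a few lines of Python. This is an illustration, not the lecturer's code. It assumes the reward is earned on entering a state (plus 2 for S1, plus 1 for S4, plus 10 for S5), that the chocolate reward is available every time S4 is entered, and that the three-minute rule is not modelled, so the discount alone is what makes bouncing unattractive. The names `REWARD_ON_ENTER` and `build_q_table` are made up for this example.

```python
GAMMA = 0.9
STATES = [1, 2, 3, 4, 5]
TERMINAL = {1, 5}                      # garbage can and recycle bin end the episode
ACTIONS = {"left": -1, "right": +1}
# Assumed reward structure: the reward is collected when a state is entered.
REWARD_ON_ENTER = {1: 2.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 10.0}


def build_q_table(gamma=GAMMA, sweeps=50):
    """Repeat the backup 'immediate reward + gamma * best next value' until it settles."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in STATES if s not in TERMINAL}
    for _ in range(sweeps):
        for s in q:
            for action, step in ACTIONS.items():
                nxt = s + step
                # Nothing can be collected after a terminal state.
                best_next = 0.0 if nxt in TERMINAL else max(q[nxt].values())
                q[s][action] = REWARD_ON_ENTER[nxt] + gamma * best_next
    return q


q_table = build_q_table()
for s in sorted(q_table):
    print(f"S{s}", {a: round(v, 2) for a, v in q_table[s].items()})
```

Under these assumptions the S3 row comes out at roughly 8.1 for left and 10 for right, and S2 gets 2 for left and 9 for right, which are the figures quoted in the lecture.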

1789.0s So this is a toy example, but it tells

1791.0s you that if you were able to backtrack

1793.2s through the entire environment, you will

1795.4s be able to build a massive Q table and you will be able to give it to your agent to make its decisions. Yeah.

1798.2s you will be able to give it to your

1799.8s agent to make its decisions.

1803.0s Yeah.

1812.2s considering the time uh remaining. But in practice um you if if I remove the time component so I remove the fact that there's a three minute deadline before the garbage collector comes then uh this would uh be slightly more difficult because you would have to do a time series essentially of adding the

1814.6s in practice um you if if I remove the

1818.2s time component so I remove the fact that

1820.4s there's a three minute deadline before

1822.2s the garbage collector comes then uh this

1825.0s would uh be slightly more difficult

1826.8s because you would have to do a time

1828.6s series essentially of adding the

1831.0s discount times the reward that you collect. Yeah, but I'm simplifying here and that's why I use the three-minute rule. Any question on the Q table?

1833.0s collect. Yeah, but I'm simplifying here

1835.0s and that's why I use the three-minute

1836.5s rule.

1839.1s Any question on the Q table?

1849.0s and in fact we can put together our strategy for gamma equals 0.9. The best strategy is still the same. You go to the right and you can expect a return of 9. in reinforcement learning is this equation on the board called the Bellman optimality equation.

1850.8s together our strategy for gamma equals

1853.0s 0.9.

1854.6s The best strategy is still the same. You

1856.3s go to the right and you can expect a

1859.1s return of 9.
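
To check the two answers given so far, here is a small Python sketch that adds up the discounted rewards along a path. The helper name `discounted_return` and the reward lists are assumptions made for illustration. The lists follow the game as described: going right three times from S2 collects 0, then 1 for the chocolate packaging, then 10 for the recycle bin, while going left collects 2 straight away.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


right_path = [0.0, 1.0, 10.0]   # S2 -> S3 -> S4 -> S5
left_path = [2.0]               # S2 -> S1

for gamma in (1.0, 0.9):
    print(
        f"gamma={gamma}:",
        "right x3 ->", round(discounted_return(right_path, gamma), 2),
        "| left ->", round(discounted_return(left_path, gamma), 2),
    )
```

With gamma at 1 the right-hand path is worth 11, and with gamma at 0.9 it is worth about 9, so going right beats the immediate 2 on the left in both cases.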

1865.4s in reinforcement learning is this

1867.4s equation on the board called the Bellman

1870.9s optimality equation.

1873.8s Oftentimes you'll see it noted as Q star of state S and action A equals R plus gamma times the max of that same function applied to S prime, A prime. Let me explain this equation for you because it's super important. This equation is called the optimality

1877.3s of state S and action A

1881.4s equals R plus gamma times the max of that

1886.0s same function applied to S prime, A

1889.3s prime.
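
Written out, the equation as read aloud above looks like this. The notation is the usual one and is assumed here rather than copied from the board: s prime is the state reached after taking action a in state s, r is the immediate reward for that move, and a prime ranges over the actions available in s prime.

```latex
Q^{*}(s, a) = r + \gamma \max_{a'} Q^{*}(s', a')
```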

1891.2s Let me explain this equation for you

1893.0s because it's super important. This

1895.7s equation is called the optimality

1898.4s equation because your optimal Q table will follow this equation. If you have finished the game, this equation can be applied to any state action pair and it will still be true. The intuition behind why the Bellman equation is the optimality equation is that um if you're in a if you have the

1901.9s will follow this equation. If you have

1904.2s finished the game, this equation can be

1907.1s applied to any state action pair and it

1909.9s will still be true.

1912.5s The intuition behind why the Bellman

1915.3s equation is the optimality equation is

1917.9s that um if you're in a if you have the

1921.4s perfect Q function Q table um and you're in a certain state and you perform a certain action A you will observe a reward and this reward will uh you know you you have taken an action so you would be in a new state and from that new state you can repeat what you just

1925.4s in a certain state and you perform a

1927.4s certain action A you will observe a

1929.8s reward and this reward will uh you know

1934.0s you you have taken an action so you

1935.5s would be in a new state and from that

1937.4s new state you can repeat what you just

1939.1s did right and because uh you've done the backtracking and stuff like that, you will uh get this equation to be true because it's the reward plus discount times the best next action that you could be taking. Does that make sense? Any question on that? That's exactly the backtracking that we did by the way. immediate reward plus

1943.0s backtracking and stuff like that, you

1944.5s will uh get this equation to be true

1946.9s because it's the reward plus discount

1949.5s times the best next action that you

1951.2s could be taking.

1957.4s Does that make sense? Any question on that?

1958.9s That's exactly the backtracking that we

1960.5s did by the way. immediate reward plus

1964.6s discount times the best possible action that you can take in the next state s prime. The last concept I cover in terms of vocabulary is the policy. The policy is the function that given your state is going to tell you what to do. And in

1968.1s that you can take in the next state s

1970.2s prime.

1976.0s The last concept I cover in terms of

1978.2s vocabulary is the policy. The policy is

1980.7s the function that given your state is

1982.6s going to tell you what to do. And in

1985.8s Q-learning the way this policy is defined is argmax of Qstar um across the action. So essentially what it says is like look in the table and look at a certain state s you want the policy which is what you should do. It's the function that tells you our best strategy. You just look at the two

1988.1s defined is argmax of Qstar um across the

1993.5s action. So essentially what it says is

1995.4s like look in the table and look at a

1998.3s certain state s you want the policy

2001.0s which is what you should do. It's the

2002.6s function that tells you our best

2004.4s strategy. You just look at the two

2006.5s possible actions which one has the highest Q value and select that action. That's it. it's the core of um Q-learning that you know later on you will use policies widely. There's a lot of reinforcement learning algorithms but this concept of understanding that the policy is the function telling us our best strategy in

2008.5s highest Q value and select that action.

2012.6s That's it.
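
As a small illustration of that lookup, here is a hedged Python sketch of the argmax policy over a Q table. The table values for S2 and S3 are the ones quoted in the lecture. The value for going left from S4 (9.0, that is 0 plus 0.9 times 10) is not stated in the transcript and is an assumption, and the names `Q_STAR` and `policy` are made up for this example.

```python
# Q values for the three non-terminal states of the toy game (gamma = 0.9).
Q_STAR = {
    2: {"left": 2.0, "right": 9.0},
    3: {"left": 8.1, "right": 10.0},
    4: {"left": 9.0, "right": 10.0},   # "left" entry is an assumed value
}


def policy(state, q_table=Q_STAR):
    """Return the action with the highest Q value in the given state (the argmax)."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


for state in sorted(Q_STAR):
    print(f"In S{state}, go {policy(state)}")
```

In every row the right-hand action has the larger value, so this policy always heads for the recycle bin.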

2019.8s it's the core of um Q-learning that you

2023.6s know later on you will use policies

2025.6s widely. There's a lot of reinforcement

2027.3s learning algorithms but this concept of

2029.4s understanding that the policy is the function

2031.0s telling us our best strategy in

2032.9s Q-learning it's the argmax of the best Q value in the given state. It tells you which action to take that's the core thing you need to understand. So remember this Bellman equation because we're going to reuse it in a bit. The main issue um with this

2035.3s value in the given state. It tells you

2037.2s which action to take that's the core

2039.1s thing you need to understand.

2041.4s So remember this Bellman equation because

2043.4s we're going to reuse it

2046.0s in a bit. The main issue um with this

2050.6s approach um of a Q table is that state and action spaces can be super large and having a matrix that you discover through backtracking um and where every time you want to do an action you have to look up the given state the possible action it becomes impossible. Like imagine you using this

2056.1s and action spaces can be super large and

2060.2s having a matrix that you discover

2063.3s through backtracking

2065.1s um and where every time you want to do

2066.8s an action you have to look up the given

2069.0s state the possible action it becomes

2072.0s impossible. Like imagine you using this

2075.8s algorithm for the game of go where there's so many states, there's so many possible actions, you can put your stone anywhere on the board. You can imagine how big this matrix becomes and how impossible it is to use. So that's our problem and that's the moment where deep learning comes into play.

2078.6s there's so many states, there's so many

2080.7s possible actions, you can put your stone

2082.4s anywhere on the board. You can imagine

2085.0s how big this matrix becomes and how

2087.0s impossible it is to use. So that's our

2090.7s problem and that's the moment where deep

2093.0s learning comes into play.

2101.7s the the the oh actually before I go there I'm just going to cover some vocabulary. We said the environment, the agent, the state, the action, the reward, the total return and the discount factor. We learned all of that. We saw that the Q table is the matrix of entries representing how good is it to

2103.9s there I'm just going to cover some

2105.2s vocabulary. We said the environment, the

2106.8s agent, the state, the action, the

2108.1s reward, the total return and the

2109.6s discount factor. We learned all of that.

2111.8s We saw that the Q table is the matrix of

2114.0s entries representing how good is it to

2115.8s take action A in state S. And the policy is the function that tells us what's the best strategy to adopt. And the Bellman equation is satisfied by the optimal Q table. what I was about to say is we are going to frame the problem slightly

2119.0s is the function that tells us what's the

2120.4s best strategy to adopt. And the Bellman

2122.0s equation is satisfied by the optimal Q

2124.9s table.

2131.4s what I was about to say is we are going

2133.9s to frame the problem slightly

2135.4s differently. So instead of using a Q table, we're going to use the fact that neural networks are universal function approximators and we're going to define a Q function that's essentially a neural network. So that the function can take a state S and an action A and tell you how good that action is in state S. So

2138.6s table, we're going to use the fact that

2140.7s neural networks are universal function

2143.3s approximators and we're going to define

2146.0s a Q function that's essentially a neural

2148.3s network. So that the function can take a

2151.3s state S and an action A and tell you how

2155.5s good that action is in state S. So

2159.5s instead of a lookup in a matrix, you just run a forward pass in a neural network and it gives you the answer. That feels like a better solution for games where there's a lot of states and a lot of actions.

2161.8s just run a forward pass in a neural

2164.2s network and it gives you the answer.

2166.6s That feels like a better solution for

2168.9s games where there's a lot of states and

2170.6s a lot of actions.

2177.2s the past we looked for a Q table and this time we will look for a neural network. One of the things we're going to do is to define the output layer to have two outputs. So given a certain state as input think about it as a one hot vector encoding the state. So this

2179.1s this time we will look for a neural

2181.4s network. One of the things we're going

2183.9s to do is to define the output layer to

2186.6s have two outputs. So given a certain

2188.9s state as input think about it as a one

2191.0s hot vector encoding the state. So this

2193.5s one is the example of state two 0 1 0 0 0. If you pass state two in this Q function with multiple layers, it will give you two outputs. One output that corresponds to Q of S action right and the other one Q of S action left because it's the two actions.

2196.3s 0. If you pass state two in this Q

2199.7s function with multiple layers, it will

2203.2s give you two outputs. One output that

2205.5s corresponds to Q of S

2210.3s action right and the other one Q of S

2212.9s action left because it's the two

2214.8s actions.

2216.3s If we had more actions to take, we would just increase the output layer and we might have many more neurons in the output layer.

2218.3s just increase the output layer and we

2220.4s might have many more neurons in the

2222.2s output layer.
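
To make the shape of this Q function concrete, here is a minimal pure-Python sketch of a forward pass. It is an illustration under assumptions, not the network from the slides: the hidden size of 8, the ReLU activation, the random initial weights and the output order (left, then right) are all choices made for this example. The weights are untrained, so the two numbers it prints are meaningless, which is exactly the starting point described in the lecture.

```python
import random

NUM_STATES, HIDDEN, NUM_ACTIONS = 5, 8, 2   # hidden size is an arbitrary choice
ACTION_NAMES = ("left", "right")            # output order is an assumption


def one_hot(state):
    """Encode state k (1-based) as a vector with a single 1, e.g. state 2 -> 0 1 0 0 0."""
    vec = [0.0] * NUM_STATES
    vec[state - 1] = 1.0
    return vec


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def q_network(state, params):
    """Forward pass: one-hot state -> hidden layer with ReLU -> one Q value per action."""
    hidden = [max(0.0, h) for h in dense(one_hot(state), params[0])]
    return dense(hidden, params[1])


rng = random.Random(0)
params = [init_layer(NUM_STATES, HIDDEN, rng), init_layer(HIDDEN, NUM_ACTIONS, rng)]
q_values = q_network(2, params)
print(dict(zip(ACTION_NAMES, q_values)))
```

Adding more actions only means raising `NUM_ACTIONS`, which widens the output layer without touching the rest of the network.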

2230.9s we going to train that network? Because we're not in classic supervised learning. We don't have labels.

2233.4s we're not in classic supervised

2235.0s learning. We don't have labels.

2244.0s what do you what would you do given we we don't have traditional x and y pairs how are you going to train this neural network because remember at the beginning this neural network will give you garbage it will take a state s and it might tell

2246.2s we don't have traditional x and y pairs

2250.1s how are you going to train this neural

2253.6s network

2255.8s because remember at the beginning this

2257.4s neural network will give you garbage it

2259.1s will take a state s and it might tell

2261.5s you go to the left or to the right but it's completely random so how are you going to tune it to the level where it makes really good decisions.

2263.0s it's completely random so how are you

2264.9s going to tune it to the level where it

2267.5s makes really good decisions.

2281.0s Tell me more. What? this problem right now? What are the the the rules of the game that we could use in order to I'm I'm seeing what you say. You say we could estimate what good looks like but

2291.4s this problem right now? What are the the

2293.4s the rules of the game that we could use

2297.2s in order to

2299.4s I'm I'm seeing what you say. You say we

2301.0s could estimate what good looks like but

2303.1s based on what? that's one thing we have in every game. We have a reward structure for every state that definitely should be used in order to estimate the good what a good decision looks like. Yeah. The problem is not in every state you will see a reward. And if you look at many games of

2314.9s that's one thing we have in every game.

2316.4s We have a reward structure for every

2318.2s state that definitely should be used in

2321.0s order to estimate the good what a good

2322.9s decision looks like. Yeah. The problem

2325.4s is not in every state you will see a

2327.7s reward. And if you look at many games of

2331.4s like go you might not see a reward until 50 moves. So what do you do in this case? Yes. Can we run through a bunch of actions and space and see what the output is and get more data? Yeah. So you could you're actually um bringing up a sort of a tree search,

2334.4s 50 moves.

2336.7s So what do you do in this case?

2339.5s Yes. Can we run through a bunch of

2343.6s actions and space and see what the

2345.6s output is and get more data?

2349.3s Yeah. So you could you're actually um

2352.6s bringing up a sort of a tree search,

2355.0s right? You go down the tree, you do every possible action and then you backtrack. Not every possible action. So which actions? Trying to spread it out of the Okay, that's that's we're getting there. So first possibility is we just go down the tree in the game of go. You could

2356.7s every possible action and then you

2358.7s every possible action and then you backtrack.

2359.8s Not every possible action.

2361.7s So which actions?

2363.1s Trying to spread it out of the

2365.8s Okay, that's that's we're getting there.

2367.5s So first possibility is we just go down

2370.1s the tree in the game of go. You could

2372.1s put your stone everywhere. So the tree already starts with 13 by 13 options and then it grows exponentially. It's impossible. It's intractable. But what you said is what if there are certain actions that are more likely than others? Do we need actually to explore the entire tree? What's this like? What are you using when you're saying that?

2373.7s already starts with 13 by 13 options and

2377.0s then it grows exponentially. It's

2379.3s impossible. It's intractable. But what

2381.9s you said is what if there are certain

2383.3s actions that are more likely than

2384.6s others? Do we need actually to explore

2386.4s the entire tree? What's this like? What

2388.9s are you using when you're saying that?

2390.5s How do you determine what action might be better than another one? Expected return. And we're getting close. Yeah. But you know, how do you know the expected return without going through the tree once? At least you can estimate both. Okay. You can estimate it using what?

2392.2s be better than another one?

2395.9s Expected return. And we're getting

2397.3s close. Yeah. But you know, how do you

2398.9s know the expected return without going

2401.0s through the tree once? At least

2403.0s you can estimate both.

2405.2s Okay. You can estimate it using what?

2414.2s exactly what we're going to do actually. But we're going to use the the Bellman equation because there are two things we know about this problem. We know the reward structure which you brought up and we also know that the perfect Q function will follow the Bellman equation that we know as well. At the

2416.0s But we're going to use the the Bellman

2418.0s equation because there are two things we

2420.2s know about this problem. We know the

2422.0s reward structure which you brought up

2424.1s and we also know that the perfect Q

2426.5s function will follow the Bellman

2428.6s equation that we know as well. At the

2430.8s end the Bellman equation should be respected meaning for every state if you want to know the Q value of that state given an action. The way you will get that is you will look at the immediate reward plus the discount times the best Q value from the next state across all actions. that equation will be

2432.6s respected meaning for every state if you

2436.6s want to know the Q value of that state

2440.2s given an action. The way you will get

2442.5s that is you will look at the immediate

2444.0s reward plus the discount times the best

2447.3s Q value from the next state across all

2449.2s actions. that equation will be

2452.0s respected. So those are the only information we have and we're going to use them drastically to define our labels and sort of mimic a classic supervised learning approach. So here's what we have. We have our neural network. We have Q S to the left and QS to the right that represent how good it

2454.2s information we have and we're going to

2455.7s use them drastically to define our

2458.4s labels and sort of mimic a classic

2460.9s supervised learning approach. So here's

2463.0s what we have. We have our neural

2464.2s network. We have Q S to the left and QS

2467.7s to the right that represent how good it

2469.4s is to go to the left in that state versus the right. And then I've pasted the Bellman equation on the top right of the screen. We're going to define a loss function. So let's say, for the sake of simplicity, because those are scalar values, that we'll use, you know, an L2 loss, a quadratic loss, that compares a certain

2470.9s versus the right. And then I've pasted

2474.2s the Bellman equation on the top right of the

2476.6s screen. We're going to define a loss

2478.6s function. So let's say for the sake of

2480.3s simplicity because those are scalar

2482.6s values that we'll use, you know, an L2 loss,

2487.3s quadratic loss that compares a certain

2490.3s label Y to um a certain Q value of a state and a certain action. So what we would like is to minimize this loss function meaning Y and the Q value for a given action in a given state is as close as possible to each other. and we're going to leverage the reward and

2495.0s state and a certain action. So what we

2497.8s would like is to minimize this loss

2500.1s function meaning Y and the Q value for a

2503.4s given action in a given state is as

2505.4s close as possible to each other. and

2508.5s we're going to leverage the reward and

2510.1s the Bellman equation. So let's do um two things. Right now we don't have a Y. So in supervised learning you will have a picture of a cat. There's a cat. The Y is one or zero. Here we don't have a Y. So we have to come up with an estimate

2513.0s things. Right now we don't have a Y. So

2515.5s in supervised learning you will have a

2516.9s picture of a cat. There's a cat. The Y

2518.9s is one or zero. Here we don't have a Y.

2521.8s So we have to come up with an estimate

2523.9s of a good Y at least better than random. So let's say at this point in time when I send a state S in the network, it turns out that Q of going to the left is higher than Q of going to the right. Which means that today at that moment the Q function tells me it's better to

2528.1s So let's say at this point in time when

2531.6s I send a state S in the network, it

2535.0s turns out that Q of going to the left is

2537.6s higher than Q of going to the right.

2540.3s Which means that today at that moment

2542.9s the Q function tells me it's better to

2544.8s go to the left than to go to the right. That is random at the beginning. It's completely random. Right? So what I'm going to do is I'm going to use as my target value Y the immediate reward that I observe on the left plus gamma times the best Q value that I

2547.6s That is random at the beginning. It's

2549.4s completely random. Right? So what I'm

2552.4s going to do is I'm going to use as my

2555.0s target value Y the immediate reward that

2558.8s I observe on the left

2561.7s plus gamma times the best Q value that I

2567.2s can get. So the best action that I could take in the next step based on my current Q value.

2569.8s take in the next step based on my

2573.4s current Q value.
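Written out, the target and the loss described here take the following form. The symbols r, γ, s', a' and θ are the usual textbook choices and are assumptions here, since the slide itself is not reproduced in the transcript:

```latex
y = r + \gamma \max_{a'} Q(s', a'; \theta),
\qquad
L(\theta) = \bigl(y - Q(s, a; \theta)\bigr)^2
```

Here a is the action actually taken in state s (left, in the lecture's example), r is the reward observed for it, and s' is the next state.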

2580.7s target is off. It's not a perfect target, but it's better than nothing. Meaning, not only it tells us, hey, there is a good reward to the left. We should consider that in saying that that might be a good move because we're seeing an immediate reward. But on top of that, we also know that at the end of

2582.8s target, but it's better than nothing.

2586.0s Meaning, not only it tells us, hey,

2589.3s there is a good reward to the left. We

2591.4s should consider that in saying that that

2593.7s might be a good move because we're

2595.6s seeing an immediate reward. But on top

2597.7s of that, we also know that at the end of

2600.6s training, the Q value should follow the Bellman equation. So why don't we set the target as the Bellman equation? So we add the discounted maximum future reward when you are in the next state. So you were in state S. You go to the left now you're in state S next left and

2602.9s Bellman equation. So why don't we set

2605.0s the target as the Bellman equation? So

2608.0s we add the discounted maximum future

2610.3s reward when you are in the next state.

2612.3s So you were in state S. You go to the

2614.2s left now you're in state S next left and

2617.8s you look again at your Q values and you select the best one. Then you add that number here. So there are actually two forward passes in that process. Right? There's one forward pass where you send the state S into Q and you look at the two

2620.6s select the best one. Then you add that

2622.8s number here. So there are

2625.0s actually two forward passes in that

2627.2s process.

2629.9s Right?

2632.7s There's one forward pass where you send

2635.2s the state S in Q and you look at the two

2638.2s options left or right and you're like okay I'm going to the left and then you're like I'm going to compare that value to a target Y but to get that target Y I need to do another forward pass. So I take my action, left, I perform it, I get an S prime, the next state, and I

2640.1s okay I'm going to the left and then

2642.1s you're like I'm going to compare that

2643.4s value to a target Y but to get that

2646.0s target Y I need to do another forward

2648.0s pass. So I take my action left I perform

2651.0s it I get an S prime state s next and I

2654.6s send that S next into the Q network. I look at the two options I have. I pick the best one and I add it here with a discount.

2657.5s look at the two options I have. I pick

2659.3s the best one and I add it here with a

2661.8s discount.
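The two forward passes can be sketched in a few lines of Python. This is only an illustration: `q_network` is a made-up linear stand-in for the real Q network, `toy_step` is a hypothetical environment, and the weights, the state and `GAMMA = 0.9` are arbitrary numbers chosen so that left looks better at first.

```python
GAMMA = 0.9                          # discount factor (assumed value)
LEFT, RIGHT = 0, 1
WEIGHTS = [[0.5, 0.2], [0.1, 0.1]]   # made-up weights: WEIGHTS[action][feature]

def q_network(state):
    """Toy linear stand-in for the Q network: one Q-value per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in WEIGHTS]

def toy_step(state, action):
    """Hypothetical environment: moving left pays 1, moving right pays 0."""
    shift = -1.0 if action == LEFT else 1.0
    reward = 1.0 if action == LEFT else 0.0
    return reward, [state[0] + shift, state[1]]

state = [1.0, 2.0]

# Forward pass 1: Q-values of the current state, then pick the best action.
q_now = q_network(state)
action = max((LEFT, RIGHT), key=lambda a: q_now[a])

# Perform the action, observe the reward and the next state.
reward, next_state = toy_step(state, action)

# Forward pass 2: Q-values of the next state, used only to build the target.
q_next = q_network(next_state)
y = reward + GAMMA * max(q_next)

loss = (y - q_now[action]) ** 2
print(action, y, loss)
```

With these numbers the network prefers left, the target is the observed reward plus the discounted best next-state value, and the loss is the squared gap between that target and the current Q(s, left).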

2668.9s the following is we have a Q network that's random at the beginning. It has never observed the rewards. We just know that at some point it will get to the Q um it will get to a perfect you know policy. It will get to a perfect Q function. But the best we

2671.8s that's random at the

2673.4s beginning. It has never observed the

2675.4s rewards. We just know that at some point

2677.9s it will get to the Q um it will get to a

2681.1s perfect you know policy. It will get to

2684.2s a perfect Q function. But the best we

2686.3s can do right now is to say, as a guide for our agent, we will look at the immediate reward and we will look at the Bellman equation which should tell us a better estimate than where we are right now and we will try to catch up to that estimate and then we do that again and

2690.6s for our agent, we will look at the

2692.9s immediate reward and we will look at the

2694.6s Bellman equation which should tell us a

2696.8s better estimate than where we are right

2698.5s now and we will try to catch up to that

2701.6s estimate and then we do that again and

2704.0s again. So remember every time your Q gets better it gets better for the next state as well. So you know the Bellman equation tells you estimate it with the second forward pass and you just keep getting better and better as you're observing more rewards.

2706.5s gets better it gets better for the next

2708.9s state as well. So you know the Bellman

2711.0s equation tells you estimate it with the

2713.1s second forward pass and you just keep

2715.0s getting better and better as you're

2716.6s observing more rewards.

2732.0s left and

2737.9s uh how would it so describe the loop you imag

2747.2s for going to a right you again need the target yeah you would stop at that point so what you yeah this is a good question I I'll show you how we fix certain things but you do only one step meaning you have your Q value at this point and

2748.0s yeah you would stop at that point so

2749.8s what you yeah this is a good question I

2751.7s I'll show you how we fix certain things

2753.4s but you do only one step meaning

2756.6s you have your Q value at this point and

2760.1s it tells you go to the left and you just want to target Y. So what you do is you put left and you look at your next state. You forward propagate your next state. You look at the two options. You pick the best. You don't go further. You just use that one step. You look one

2762.4s want to target Y. So what you do is you

2764.7s put left and you look at your next

2766.8s state. You forward propagate your next

2768.8s state. You look at the two options. You

2770.5s pick the best. You don't go further. You

2772.9s just use that one step. You look one

2774.8s step ahead essentially. You don't look multiple steps ahead. You could, but it would be more computationally heavy to do one more step again and so on. So yeah. Yeah. Yeah. It seems like you're learning the function locally like most

2776.6s multiple steps ahead. You could, but it

2778.9s would be more computationally heavy to

2780.5s do one more step again and so on.

2783.7s So yeah. Yeah.

2785.9s Yeah. It seems like you're learning the

2788.0s function locally like

2791.1s most

2806.4s environment the state space um how long it will take to converge but you're perfectly right that um as the Q function gets better, the estimate Y also gets better. So the two things get better together, right? Because the Y is based on the Q function. And if the state space is massive, you might have a

2809.3s it will take to converge but you're

2811.3s perfectly right that um as the Q

2815.2s function gets better, the estimate Y

2817.5s also gets better. So the two things get

2820.0s better together, right? Because the Y is

2821.8s based on the Q function. And if the

2824.2s state space is massive, you might have a

2827.4s very difficult time training this model. There are better approaches that we'll see later. Yeah, correct. There was a question there. when you send state S in Q the left happens to be higher than right. But the same happens on the other side. Let's

2829.8s There are better approaches that we'll see

2831.8s later.

2833.4s Yeah, correct. There was a question

2836.1s there.

2845.8s when you send state S in Q the left

2848.3s happens to be higher than right. But the

2850.4s same happens on the other side. Let's

2851.8s say let's say the left is worse than right. Then what you will do is you will define your target Y as the reward that you observe on the right plus from the next state of having gone to the right what's the best action and what's the Q value for that pair and then it will

2854.2s right. Then what you will do is you will

2856.2s define your target Y as the reward that

2859.2s you observe on the right plus from the

2862.3s next state of having gone to the right

2864.4s what's the best action and what's the Q

2866.5s value for that pair and then it will

2868.7s give you the target for that scenario.

2878.1s is that when you want to differentiate L. So you want to perform a back propagation. You want to take the derivative of L with respect to the parameters of the network. You want Y to be a fixed thing, right? Because in supervised learning Y is not differentiable. It's just a fixed number

2881.0s L. So you want to perform a back

2883.1s propagation. You want to take the

2884.4s derivative of L with respect to the

2886.1s parameters of the network. You want Y to

2888.9s be a fixed thing, right? Because in

2890.5s supervised learning Y is not

2891.7s differentiable. It's just a fixed number

2893.8s zero or one or a certain number. So here we're going to simplify and we're going to say this term that is dependent on the Q network. So technically this term has parameters. So if you actually differentiate it, it will give you a value. We'll just hold it um fixed. So

2896.2s we're going to simplify and we're going

2897.5s to say this term that is dependent on

2900.0s the Q network. So technically this term

2901.9s has parameters. So if you actually

2903.8s differentiate it, it will give you a

2905.3s value. We'll just hold it um fixed. So

2910.0s we say we do use our Q network to perform an estimate of our Y but we will not differentiate it. We will say it's fixed. Yeah. Can you explain why we discount this? Yeah. Because you know going back to the reason we discount is like the value of

2913.0s perform an estimate of our Y but we will

2915.5s not differentiate it. We will say it's

2917.1s not it's fixed. Yeah.
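Holding the target fixed can be made concrete with a hand-written gradient step. The sketch below assumes a toy linear Q function (one weight row per action), a made-up transition and an arbitrary `LEARNING_RATE`; none of these come from the lecture. The point is only that y is computed from the current weights and then treated as a plain number, so the gradient flows through Q(s, a) alone.

```python
GAMMA = 0.9            # discount factor (assumed value)
LEARNING_RATE = 0.05   # step size (assumed value)

def q_values(weights, state):
    """Toy linear Q function: one Q-value per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]

def fixed_target_step(weights, state, action, reward, next_state):
    # The target uses the current weights but is then held constant.
    y = reward + GAMMA * max(q_values(weights, next_state))
    error = y - q_values(weights, state)[action]
    # Gradient of (y - Q(s, a))^2 with y constant is -2 * error * state,
    # and only the weight row of the chosen action is touched.
    new_weights = [row[:] for row in weights]
    new_weights[action] = [
        w + LEARNING_RATE * 2 * error * x
        for w, x in zip(weights[action], state)
    ]
    return new_weights, y

weights = [[0.5, 0.2], [0.1, 0.1]]
state, next_state = [1.0, 2.0], [0.0, 2.0]

before = q_values(weights, state)[0]
new_weights, y = fixed_target_step(weights, state, 0, 1.0, next_state)
after = q_values(new_weights, state)[0]
print(before, after, y)
```

After one step Q(s, left) has moved part of the way from its old value toward the target, and the weights of the action that was not taken are unchanged.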

2919.7s Can you explain why we discount this?

2923.2s Yeah.

2924.2s Because you know going back to the

2926.4s reason we discount is like the value of

2928.2s time. It's like you probably want to say if you can win the game in 10 moves, win it in 10 moves rather than 100 moves. Um or if you can get 1 today, get 1 today rather than in 10 years. All of that is why we have a discount here. And the

2931.0s if you can win the game in 10 moves, win

2933.3s it in 10 moves rather than 100 moves. Um

2936.4s or if you can get 1 today, get 1 today

2939.3s rather than in 10 years. All of that is

2941.5s why we have a discount here. And the

2943.8s discount is a hyperparameter that you would define as well. That would influence the strategy of your agent. Once more pass. We're going to see it after. Actually, I'm going to do a concrete example because it's a little complicated. Yeah. Yeah. Yeah. Q.

2946.2s you would define as well. That would

2948.9s influence the strategy of your agent.
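A quick calculation shows how the discount shapes the agent's preference for winning sooner. The function name `discounted` and the gamma values below are illustrative choices, not values from the lecture.

```python
def discounted(reward, gamma, steps):
    """Value today of a reward received `steps` moves in the future."""
    return reward * gamma ** steps

for gamma in (0.99, 0.9):
    in_10 = discounted(1.0, gamma, 10)
    in_100 = discounted(1.0, gamma, 100)
    print(f"gamma={gamma}: win in 10 moves -> {in_10:.4f}, in 100 moves -> {in_100:.6f}")
```

With gamma close to 1 the agent is fairly patient; with a smaller gamma a win 100 moves away is worth almost nothing compared with a win 10 moves away.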

2952.4s Once more pass.

2954.6s We're going to see it after. Actually,

2955.7s I'm going to do a concrete example

2956.9s because it's a little complicated. Yeah.

2958.7s Yeah.

2960.8s Yeah. Q.

2963.7s No, no, that that was it's a good point. It's not that. It's just Q of it's it's a 2 by two. It's a one by two. So you have left and right. I was just going down the first case. So I put the state left. Yeah. Yep. mean, in most games we're going to see

2966.1s It's not that. It's just Q of it's it's

2969.2s a 2 by two. It's a one by two. So you

2972.1s have left and right. I was just going

2973.7s down the first case. So I put the state

2975.7s left. Yeah. Yep.

2986.6s mean, in most games we're going to see

2987.9s right now, the rewards are going to be fixed by the designer of the game, the human that's designing the game. Um, in practice, you could have a separate function that um actually comes up with the reward. We're going to see an example later in the lecture where the reward might be different in different

2989.1s fixed by the designer of the game, the

2991.0s human that's designing the game. Um, in

2993.7s practice, you could have a separate

2995.0s function that um actually comes up with

2999.0s the reward. We're going to see an

3000.2s example later in the lecture where the

3002.3s reward might be different in different

3004.3s scenarios and there's a function, sometimes called a critic, that determines what the reward is in a certain state. Okay, this is, yeah, one last question and then we move on, because we're going to see a concrete example and it's going to be clear.

3006.6s sometimes called a critic that

3008.7s determines what's the reward in a

3010.1s certain state.

3012.6s Okay, this is yeah one last question and

3014.4s then we move because we we're going to

3015.6s see a concrete example is going to be

3017.0s clear.

3017.6s Um so when we like hold it fixed for backprop, is that what differentiates this from iterating through all of the like possible Yeah. Yeah. So instead of doing the backtracking down the tree and going over everything, we're saying we're going to limit ourselves to just picking

3021.2s backprop, is that what differentiates this

3023.4s from iterating through all of the like

3027.4s possible

3028.2s Yeah. Yeah. So instead of doing the

3031.2s backtracking down the tree and going

3033.8s over everything, we're saying we're

3036.1s going to limit ourselves to just picking

3038.2s the best action based on our current understanding of the network. You see, like my network is kind of intelligent, not great. We're in the middle of training. It says that I should go to the left and then if I look at the next state when I'm in the left,

3040.0s understanding of the network.

3042.6s You see, like my network is kind of

3045.0s intelligent, not great. We're in the

3046.4s middle of training. It says that I

3048.4s should go to the left and then if I look

3049.9s at the next state when I'm in the left,

3052.4s it says I should go to the right. I will trust it because it's the best I have, best estimate I have, but I will discount that. And then if you keep repeating that, it turns out that not only your estimate gets better, but your model gets trained and then ultimately both together get to an optimality

3054.6s trust it because it's the best I have,

3056.2s best estimate I have, but I will

3057.8s discount that. And then if you keep

3060.0s repeating that, it turns out that not

3062.0s only your estimate gets better, but your

3063.9s model gets trained and then ultimately

3065.9s both together get to an optimality

3067.7s equation. So it's a funky concept, right? But you get it. We're going to see examples. Um, okay. So uh then once you have been able to use the Bellman equation to estimate your targets, you perform classic back propagation and you update the parameters of the network and you repeat

3070.6s So it's a it's a funky concept, right?

3073.2s But you get it.

3075.8s We're going to see examples. Um, okay.

3078.8s So uh then once you have been able to

3082.3s use the Bellman equation to estimate your

3085.4s targets, you perform classic back

3088.7s propagation and you update the

3090.9s parameters of the network and you repeat

3092.8s that process. Okay. Uh here is, concretely, if you were to code it in pseudocode, what it would look like to train an RL agent using Q-learning. We start by initializing our Q network parameters. So initialization, it's random at first. Then we will loop over episodes. As a

3095.2s that process. Okay.

3097.6s Uh here is concretely if you were to

3100.5s code it in pseudo code, here is what it

3102.6s would look like to train an RL agent

3105.9s using Q-learning. We start by

3108.4s initializing our Q network parameters.

3111.7s So initialization, it's random at first.

3115.4s Then we will loop over episodes. As a

3117.4s reminder, episodes are one full game from start to terminal state. Um within an episode, we're going to start from the initial state s and we're going to loop over time steps until we reach a terminal state. So within one time step, here's what we will do. We forward propagate the state s in the Q network.

3119.4s from start to terminal state. Um within

3123.4s an episode, we're going to start from

3125.3s the initial state s and we're going to

3127.1s loop over time steps until we reach a

3130.1s terminal state. So within one time step,

3134.7s here's what we will do. We forward

3136.8s propagate the state s in the Q network.

3140.1s We will execute the action A that has the maximum Q value. We will observe a reward and we will also observe a next state S prime. We will use that S prime to compute our target Y by forward propagating S prime in the Q network and then computing our loss function. And based on that we will

3142.3s the maximum Q value.

3145.8s We will observe a reward and we will

3148.0s also observe a next state S prime. We

3151.8s will use that S prime to compute our

3154.0s target Y by forward propagating S prime

3156.4s in the Q network and then computing our

3159.2s loss function. And based on that we will

3162.7s use gradient descent to update the parameters of the network should be simpler looked at like that right okay so this is the vanilla Q-learning so to summarize again the one the main difference is that we don't have a target and we use our own network to estimate the target and the rewards are

3164.3s parameters of the network

3170.2s should be simpler looked at like that right
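The loop described above can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: the environment is a hypothetical five-cell corridor, the "network" is a single weight per (state, action) pair (equivalent to one linear layer on a one-hot state), the weights start at zero rather than random so the run is reproducible, and no bootstrapping is done from a terminal state.

```python
GAMMA = 0.9
LEARNING_RATE = 0.25
START, LEFT_END, RIGHT_END = 2, 0, 4
ACTIONS = (0, 1)  # 0 = left, 1 = right

def env_step(state, action):
    """Hypothetical corridor: +1 for reaching the left end, 0 otherwise."""
    next_state = state - 1 if action == 0 else state + 1
    reward = 1.0 if next_state == LEFT_END else 0.0
    done = next_state in (LEFT_END, RIGHT_END)
    return reward, next_state, done

def train(num_episodes):
    # Initialize the Q "network" parameters (zeros here, random in the lecture).
    q = [[0.0 for _ in ACTIONS] for _ in range(RIGHT_END + 1)]
    for _ in range(num_episodes):            # loop over episodes
        state, done = START, False
        while not done:                      # loop over time steps
            # Forward pass on s, then execute the action with the highest Q.
            action = max(ACTIONS, key=lambda a: q[state][a])
            reward, next_state, done = env_step(state, action)
            # Forward pass on s' to build the target y (held fixed).
            y = reward if done else reward + GAMMA * max(q[next_state])
            # Gradient descent step on the loss (y - Q(s, a))^2.
            error = y - q[state][action]
            q[state][action] += LEARNING_RATE * 2 * error
            state = next_state
    return q

q = train(50)
print(q[1], q[2])
```

In this toy run Q(1, left) approaches 1 and Q(2, left) approaches 0.9, which is exactly what the Bellman equation requires with a discount of 0.9. The purely greedy action choice works here only because ties are broken toward left; in general it can leave parts of the state space unvisited.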

3172.8s okay so this is the vanilla Q-learning

3177.2s so to summarize again the one the main

3179.8s difference is that we don't have a

3181.3s target and we use our own network to

3183.4s estimate the target and the rewards are

3185.6s what's going to help us get better over understand everything. This is an entire class at Stanford. Um, you know, an entire quarter of studying that type of stuff. So, we're trying to get the basics within an hour and a half, two hours. together um

3194.5s understand everything. This is an entire

3195.9s class at Stanford. Um, you know, an

3198.3s entire quarter of studying that type of

3200.1s stuff. So, we're trying to get the

3201.8s basics within an hour and a half, two

3203.6s hours.

3208.6s together um

3212.5s and apply that to an actual game. So, here's the game. It's called Breakout. We want to destroy all the bricks. Who has played Breakout in the past? Buddy, a few. Okay, good. So, you have a a paddle that you control and you're trying to destroy the bricks. If the ball gets past your paddle, you lost.

3214.6s here's the game. It's called Breakout.

3216.6s We want to destroy all the bricks. Who

3218.6s has played Breakout in the past? Buddy,

3220.6s a few. Okay, good. So, you have a a

3223.2s paddle that you control and you're

3225.7s trying to destroy the bricks. If the

3229.3s ball gets past your paddle, you lost.

3231.8s And if the bricks are all destroyed, you won. That's it. Let's do it together. What um what is the input of our Q network? What would you use as input to remember? Yeah. Yes. Entire screen. Okay. Let's do that. So I

3233.6s won. That's it.

3237.5s Let's do it together.

3239.7s What um what is the input of our Q

3242.8s network?

3244.6s What would you use as input

3247.6s to remember? Yeah.

3258.0s Yes.

3258.2s Entire screen. Okay. Let's do that. So I

3260.6s take I define that as the state S which is the input to my Q network. Uh what's the output of the Q network? Yes. Do we have to do that on screen? Good question. We'll get there. I'm I'm gonna ask you, but do we have to look at

3263.1s is the input to my Q network. Uh what's

3265.7s the output of the Q network?

3269.7s Yes.

3270.9s Do we have to do that on screen?

3273.1s Good question. We'll get there. I'm I'm

3275.0s gonna ask you, but do we have to look at

3278.7s the full screen? The answer is no, but we'll see why. What's the output? Yeah. Game score. The game score. Uh, no. But we're going to talk about the game score in the back. first and then we'll talk about the

3281.1s see why. What's the output?

3283.9s Yeah.

3284.5s Game score.

3285.5s The game score. Uh, no. But we're going

3288.8s to talk about the game score in the

3290.1s back.

3295.6s first and then we'll talk about the

3297.0s stuff we can get rid of on the inputs. But what's the output? Yeah, it's the the actions. Yeah, the actions. So, yeah, it will be the Q values associated with the actions in state S. Remember, it's a Q function. So, the output is we need one value for left, one value for right, and one value for

3299.6s But what's the output? Yeah,

3301.3s it's the

3304.4s the actions.

3305.8s Yeah, the actions. So, yeah, it will be

3308.7s the Q values

3310.9s associated with the actions in state S.

3313.0s Remember, it's a Q function. So, the

3314.5s output is we need one value for left,

3317.4s one value for right, and one value for

3319.4s idle. You could make this game more complicated and say we have eight actions. We have a little bit to the left, a lot to the left, a lot more to the left. You know, if you had multiple buttons, but let's simplify and say three actions. Either you don't move,

3321.3s more complicated and say we have eight

3324.2s actions. We have a little bit to the

3325.7s left, a lot to the left, a lot more to

3328.0s the left. You know, if you had multiple

3329.5s buttons, but let's simplify and say

3331.4s three actions. Either you don't move,

3333.1s you move to the left or you move to the right. So these are the outputs. So now let's get to the question of the screen. Do we need the entire screen? So you were saying something earlier?

3334.6s right. So these are the outputs. So now

3336.6s let's get to the question of the screen.

3338.6s Do we need the entire screen?

3342.4s So you were saying something earlier?
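For the three-action version of Breakout, the network's output is just three numbers, one Q-value per action, and acting greedily means taking the largest. The action names, their ordering and the example Q-values below are illustrative assumptions, not the layout of any particular emulator.

```python
ACTIONS = ("idle", "left", "right")  # assumed ordering of the three outputs

def greedy_action(q_values):
    """Return the name of the action whose Q-value is largest."""
    if len(q_values) != len(ACTIONS):
        raise ValueError("expected one Q-value per action")
    best = max(range(len(ACTIONS)), key=lambda i: q_values[i])
    return ACTIONS[best]

# Made-up network output for some screen s.
print(greedy_action([0.1, 0.7, 0.2]))
```

Adding finer-grained moves would only change the length of `ACTIONS` and of the output vector.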

3354.2s the bricks. I would argue uh you need more because there's the walls. And I guess that you could if you're an expert player, you could know where the walls are, but generally you need a little more than that. What what what would be obviously things we can get rid of? And why would we do that?

3356.5s more because there's the walls. And I

3358.5s guess that you could if you're an expert

3359.8s player, you could know where the walls

3361.2s are, but generally you need a little

3362.9s more than that. What what what would be

3364.7s obviously things we can get rid of? And

3366.3s why would we do that?

3377.2s Um, who would remove the score at the top?

3380.1s top?

3386.3s Why would you not remove it?

3402.1s score doesn't matter. It's true. We would remove the score. So you you could actually crop the top. You could also crop the bottom. I mean, if it passed the paddle, you don't care about the few pixels at the bottom. You could get rid of them. Um this is not always true. There are games where the score matters

3404.3s would remove the score. So you you could

3406.0s actually crop the top. You could also

3408.4s crop the bottom. I mean, if it passed

3410.1s the paddle, you don't care about the few

3411.6s pixels at the bottom. You could get rid

3413.5s of them. Um this is not always true.

3416.3s There are games where the score matters

3419.2s and in fact you know I like football soccer the the in soccer if you're one zero up you can park the bus. So your strategy is dependent of the score that you have like you wouldn't park the bus if you're losing one zero. Parking the bus meaning you ask every player to come

3422.6s soccer the the in soccer if you're one

3426.2s zero up you can park the bus. So your

3428.6s strategy is dependent of the score that

3431.7s you have like you wouldn't park the bus

3433.6s if you're losing one zero. Parking the

3435.6s bus meaning you ask every player to come

3437.3s back and defend. If you're losing you would actually do the opposite. You will go all out attack. So in certain games you want the scores, in others you don't want. And so it's part of the designer, the the the AI engineer that's working on that to determine what information we need and what we don't need. What else

3439.2s would actually do the opposite. You will

3440.6s go all out attack. So in certain games

3443.6s you want the scores, in others you don't

3445.8s want. And so it's part of the designer,

3447.7s the the the AI engineer that's working

3449.6s on that to determine what information we

3451.4s need and what we don't need. What else

3453.0s could we do to reduce the dimensionality of the problem and make our computation faster? grayscale essentially. That's true. Here you actually don't need the colors. It's just nice for user experience purposes. You don't need them. I don't think there are different points based on the bricks that you

3454.9s of the problem and make our computation

3456.7s faster?

3461.8s grayscale essentially. That's true. Here

3464.6s you actually don't need the colors. It's

3467.0s just nice as a user for

3468.4s user experience purposes. you don't

3470.3s need. I don't think there's different

3471.7s points based on the bricks that you

3473.4s destroy. It's all the same. Um there actually, funny enough, this algorithm was used by um DeepMind to play a lot of Atari games and they did a single pre-processing where they removed the color channels because they said it doesn't matter. Turns out in one of the games, I think it was Seaquest, I forgot which one,

3476.6s actually, funny enough, this algorithm

3479.3s was used by DeepMind to play a lot of

3483.0s Atari games and they did a single

3485.1s pre-processing where they removed the

3487.1s channels because they said it doesn't

3488.6s matter. Turns out in one of the games, I

3491.2s think it was Seaquest, I forgot which one,

3493.8s the fish disappeared when you did that. And so that game didn't work. the the agent couldn't crack it uh because they thought that the same pre-processing could apply to every game, but actually they had to make a slight tweak. also correct you. So just to recap, you could do it

3496.6s And so that game didn't work. the the

3498.7s agent couldn't crack it uh because they

3501.3s thought that the same pre-processing

3502.6s could apply to every game, but actually

3504.1s they had to make a slight tweak.

3519.4s also correct

3520.8s you. So just to recap, you could do it

3522.8s even better by using a a low dimensional representation of this game that describes the game. It's true, but because we want to use a single algorithm for 50 plus Atari games, we'll say the human sees the screen, we'll just give the screen and it will probably scale better essentially. But you're perfectly right if you were

3525.9s representation of this game that

3527.4s describes the game. It's true, but

3529.5s because we want to use a single

3531.0s algorithm for 50 plus Atari games, we'll

3533.6s say the human sees the screen, we'll

3535.7s just give the screen and it will

3537.1s probably scale better essentially. But

3538.9s you're perfectly right if you were

3540.1s working on only that game. Okay, so let's do that. We'll we'll do pre-processing. There's one last thing that nobody mentioned which is history because in fact if you get only one screen you don't know where the ball is going. So actually you can't solve the game and the way you fix that is by

3542.5s let's do that. We'll we'll do

3543.9s pre-processing. There's one last thing

3545.4s that nobody mentioned which is history

3547.8s because in fact if you get only one

3549.7s screen you don't know where the ball is

3551.4s going. So actually you can't solve the

3554.1s game and the way you fix that is by

3556.3s giving a history of multiple screens, for example four screens, so that you see the direction that the ball is going in. So our pre-processing function is called phi of s, let's say, and phi of s is a mix of, you know, convert to grayscale, reduce the dimension, the

3558.8s example four screens so that you see the

3560.6s direction that the ball is going in. So

3563.0s our pre-processing function is you know

3565.8s called phi of s let's say and phi of s is a

3569.0s mix of um you know you might do convert

3572.0s to grayscale reduce the dimension the

3574.1s height and width, and also add the history of four frames, and that should be enough. Turns out in most games you will need a little bit of history to know where the ball is going. Or in this example can just encode the uh

3575.7s history of four frames and that should

3578.4s be enough.

3580.2s Turns out in most games

3581.3s you will need a history a little bit of

3583.0s history to know where the ball is going

3584.6s or

3585.7s in this example can just encode the uh

3588.5s like velocity vector of the ball. Yeah, you could you could replace exactly you could replace um the history so multiple screen by just adding the gradient or the velocity of where the ball is going. That's true. But would it scale to every game? You know, turns out this because we know humans look at the Atari machine and

3590.7s you could you could replace exactly you

3592.8s could replace

3594.5s um the history so multiple screen by

3596.9s just adding the gradient or the velocity

3598.7s of where the ball is going. That's true.

3600.7s But would it scale to every game? You

3602.6s know, turns out this because we know

3604.6s humans look at the Atari machine and

3606.2s they look at pixels. This would be more likely to scale to every game.

3608.5s likely to scale to every game.

3615.6s Seaquest is a good game, or Space Invaders, where you have multiple enemies coming at you. Then you would need to change your pre-processing to take into account the velocity of all these enemies. So it wouldn't work the same way. While if you actually give the pixels, you get from the pixels the velocity of all

3617.8s where you have multiple enemies coming

3619.4s at you then you would need to change

3622.6s your pre-processing to take into account

3624.3s the velocity of all these enemies. So it

3626.2s wouldn't work the same way. While if you

3628.0s actually give the pixels you actually

3629.8s from the pixels get the velocity of all

3631.7s your enemies and the directions they're going. Okay. So this is our pre-processing. I'm going to refer to it as phi of s. And our deep Q network architecture, because we're working with pixels, is going to be a convolutional network. Don't worry if you haven't learned it yet in the

3633.2s going.
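The pre-processing described here (grayscale, smaller height and width, a history of four frames) can be sketched roughly as follows. This is only an illustration in plain Python, not the code used in the lecture or by DeepMind: the names `to_grayscale`, `downsample`, and `FrameStacker`, the luminance weights, the downsampling factor, and the choice to repeat the first frame at the start of an episode are all assumptions.

```python
from collections import deque


def to_grayscale(frame):
    """Convert an H x W grid of (r, g, b) tuples into an H x W grid of floats.

    The 0.299 / 0.587 / 0.114 weights are a common luminance formula,
    assumed here for illustration.
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]


def downsample(gray, factor=2):
    """Shrink height and width by keeping every `factor`-th row and column."""
    return [row[::factor] for row in gray[::factor]]


class FrameStacker:
    """Hypothetical helper that plays the role of phi(s)."""

    def __init__(self, history=4):
        # A deque with maxlen drops the oldest frame automatically.
        self.frames = deque(maxlen=history)

    def phi(self, frame):
        processed = downsample(to_grayscale(frame))
        if not self.frames:
            # Start of an episode: repeat the first frame to fill the history.
            self.frames.extend([processed] * self.frames.maxlen)
        else:
            self.frames.append(processed)
        # The stacked frames let the network infer direction and velocity.
        return list(self.frames)
```

Calling `phi` on each new screen returns the last four processed frames, which is what would be fed to the Q network instead of the raw screen.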

3635.0s Okay. So this is our pre-processing. I'm

3637.1s going to refer to it as phi of s. And um our

3639.9s deep Q network architecture because

3641.6s we're working with pixels is going to be

3644.6s um a convolutional network. Don't worry

3646.6s if you haven't learned it yet in the

3648.2s class, but it's a bunch of conv layers and ReLU activations. And then we end with a fully connected layer that gives us the three Q values for the different actions. So nothing special here. Now, we're going to go back to our vanilla training. So, this one that we saw

3650.9s activations. Um and then we end with a

3654.2s fully connected layer that gives us the

3658.1s three Q values for uh the different

3660.8s actions.

3663.2s So nothing special here. Now, uh, we're

3666.2s going to go back to our vanilla

3667.8s training. So, this one that we saw

3669.6s together earlier, and we're going to look at tips to train reinforcement learning algorithms. Those tips are not specific to Q-learning. Some of them are applied to a lot more than Q-learning, and they're very important to know. Um, and they're part of the reason reinforcement learning has worked better in the last few years. Um, so one of the

3671.6s look at tips to train reinforcement

3673.5s learning algorithms. Those tips are not

3675.5s specific to Q-learning. Some of them are

3677.4s applied to a lot more than Q-learning,

3680.0s and they're very important to know. Um,

3682.4s and they're part of the reason

3683.8s reinforcement learning has worked better

3685.9s in the last few years. Um, so one of the

3689.6s things that's pretty simple that we forgot to do is the pre-processing that we just did. So anywhere I had an s, I'm going to instead run s through the pre-processing step. I'm going to use phi of s. So I initialize with phi of s instead of s. I start from the initial

3691.0s forgot to do um is the pre-processing

3693.4s that we just did. So anywhere I had an

3695.8s S, I'm going to instead run S through

3698.0s the pre-processing step, I'm going to

3699.8s use phi of s. So I initialize instead of

3703.8s s with phi of s. I start from the initial

3706.2s state phi of s, and then I forward propagate phi of s. I get the Q of that pre-processed state and action a, etc. And then when I get my next state, so let's say I look at my current pre-processed state, I forward propagate it once. I see the

3710.2s I um I get the Q of that pre-processed

3714.8s state in action A and etc. etc. And then

3718.3s when I get my next state, so let's say I

3722.2s look at my current pre-processed state,

3724.9s I forward propagate it once. I see the

3728.9s three Q values, in our case, the two Q values. One of them was better than the other action, right? Then I get my next state S prime. I want to pre-process that state as well. Yeah, so that's pretty straightforward. You just replace all of that. The second thing we forgot to do is to

3731.4s values. One of them was better than the

3733.5s other action, right? Then I get my next

3736.4s state S prime. I want to pre-process

3738.5s that state as well. Yeah, so that's

3742.4s pretty straightforward. You just replace

3743.8s all of that.

3746.2s The second thing we forgot to do is to

3748.2s keep track of the terminal state. In our pseudo code, there is no concept of terminal state. It's pretty easy to add. You would probably just do an if else statement. You would create a boolean to detect terminal state. So let's say your boolean is terminal equals false. And then as you loop over the time step of a

3750.2s pseudo code, there is no concept of

3751.7s terminal state. It's pretty easy to add.

3754.1s You would probably just do an if else

3755.8s statement. You would create a boolean to

3758.2s detect terminal state. So let's say your

3759.9s boolean is terminal equals false. And

3762.3s then as you loop over the time step of a

3764.4s single episode, every time you're going to check, is the state that I'm going in based on the action I'm taking a terminal state? If it's a terminal state, then get out of the loop. You know, there's nothing else after. The one thing that you need to be careful of is if it's a terminal state, then your

3766.4s to check, is the state that I'm going in

3770.0s based on the action I'm taking a

3771.6s terminal state? If it's a terminal

3773.5s state, then get out of the loop. You

3775.1s know, there's nothing else after. The

3777.8s one thing that you need to be careful of

3779.3s is if it's a terminal state, then your

3781.8s target is not the Bellman equation. It's just the immediate reward. Remember, you get to the terminal state, you get a reward of 10. There's no Bellman equation to apply. It's just 10. It's immediate reward. There's no discount, etc.

3784.4s just the the immediate reward. Remember,

3786.6s you get to the terminal state, you get a

3788.4s reward of 10. There's no Bellman

3790.6s equation to apply. It's just 10. It's

3793.2s immediate reward. There's no discount,

3795.8s etc.
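The terminal-state rule can be written as a small function. This is a minimal sketch, assuming the usual Q-learning target of reward plus discounted best next Q value for non-terminal states; the function name `q_target` and the default discount `gamma=0.9` are placeholders, not values from the lecture.

```python
def q_target(reward, next_q_values, terminal, gamma=0.9):
    """Return the training target y for one transition.

    If the next state is terminal there is no future to discount,
    so the target is just the immediate reward.
    """
    if terminal:
        return reward
    # Otherwise: immediate reward plus discounted best Q value of the next state.
    return reward + gamma * max(next_q_values)
```

For example, reaching a terminal state with a reward of 10 gives a target of exactly 10, whatever the network predicts for the next state.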

3803.3s Now, we're going to look at a new method that will enable more data efficiency. It's called experience replay. One of the a couple of issues with the way we've been training so far is one the correlation of successive screens. It's like imagine in the Atari game you have the ball that's in the top left corner

3805.4s that will enable more data efficiency.

3809.1s It's called experience replay. One of

3811.7s the a couple of issues with the way

3814.1s we've been training so far is one the

3817.7s correlation of successive screens. It's

3822.2s like imagine in the Atari game you have

3824.6s the ball that's in the top left corner

3826.9s and it's traveling to the bottom right of the screen. You have like many many time step that are essentially the same. It's all the ball traveling in the same in the same place. So you're actually training repetitively on something that is not that meaningful. You don't need to just train on a batch. The equivalent

3829.4s of the screen. You have like many many

3831.9s time step that are essentially the same.

3833.8s It's all the ball traveling in the same

3836.6s in the same place. So you're actually

3839.3s training repetitively on something that

3842.2s is not that meaningful. You don't need

3843.8s to just train on a batch. The equivalent

3845.7s in supervised learning is let's say you're trying to differentiate cats and dogs and you train on a mini batch of cats, then you train on a mini batch of dogs, then you train on a mini batch of cats, and it will never converge. It will just index too much on cats and

3847.6s you're trying to differentiate cats and

3849.1s dogs and you train on a mini batch of

3851.2s cats, then you train on a mini batch of

3853.1s dogs, then you train on a mini batch of

3854.8s cats, and it will never converge. It

3856.9s will just index too much on cats and

3859.0s then index too much on dogs. So you want to add some sort of a experience replay concept that we'll see in order to create more mixes in the data and get more diversity. The other thing that is important is in our current training process, we are not reusing our data. Like you experience

3861.6s to add some sort of a

3864.4s experience replay concept that we'll see

3866.4s in order to create more mixes in the

3868.8s data and get more diversity. The other

3872.5s thing that is important is in our

3874.7s current training process, we are not

3878.2s reusing our data. Like you experience

3880.6s something, you immediately train on it. You never see it again unless you reexperience the same thing sometimes in the future, which might or might not happen. Experience replay is going to help us to keep experiences in memory and maybe retrain on them on a regular basis so that one experience might be useful multiple times

3882.2s You never see it again unless you

3884.1s reexperience the same thing sometimes in

3885.9s the future, which might or might not

3887.8s happen. Experience replay is going to

3890.1s help us to keep experiences in memory

3893.4s and maybe retrain on them on a regular

3895.7s basis so that one experience might be

3897.7s useful multiple times

3900.4s which intuitively makes sense. Like maybe you do an experience, you get an amazing reward and you don't want to forget it. You want to retrain the model on it on a regular basis. It's more data efficiency. So here's what it looks like. Um, the current way we were training was we're in a state I'm just

3901.8s maybe you do an experience, you get an

3903.3s amazing reward and you don't want to

3904.9s forget it. You want to retrain the model

3906.5s on it on a regular basis. It's more data

3909.0s efficiency. So here's what it looks

3911.2s like. Um, the current way we were

3913.7s training was we're in a state I'm just

3915.7s going to say state instead of pre-processed state, but it's pre-processed. We're in a state S. We perform action A, we get a reward R, and we get into the next state. From that next state, we perform another action A prime, we get a reward R prime, and we get into S double prime and so on, you know,

3916.7s pre-processed state, but it's

3918.0s pre-processed. We're in a state S. We

3920.6s perform action A, we get a reward R, and

3923.6s we get into the next state. From that

3925.6s next state, we perform another action A

3927.8s prime, we get a reward R prime, and we

3930.7s get into S double prime and so on, you know,

3934.2s and so on. And each of these would be called one experience. It's one iteration of gradient descent. It's one experience. So right now we're training on these experiences. So the training looks like I train on E1, I update my parameters. Then I train on E2, I update my parameters. Then I train on E3,

3936.0s called one experience. It's one

3937.9s iteration of gradient descent. It's one

3940.2s experience. So right now we're training

3943.1s on these experiences. So the training

3945.5s looks like I train on E1, I update my

3948.4s parameters. Then I train on E2, I update

3951.2s my parameters. Then I train on E3,

3953.2s update my parameters. Those are highly correlated because they're part of the same episode. And as I was saying with the ball traveling in one direction, that might actually not be that helpful to train on all of these, you know. So instead what we'll do is we'll use experience replay where we will collect

3955.4s correlated because they're part of the

3957.0s same episode. And as I was saying with

3959.4s the ball traveling in one direction,

3961.1s that might actually not be that helpful

3963.0s to train on all of these, you know. So

3965.5s instead what we'll do is we'll use

3967.4s experience replay where we will collect

3969.8s our first experience but instead of training on it we will put it in a memory called the replay memory D. We'll put it in there and then at every step we will sample from that memory to decide what to train on. So of course at the beginning if we just have one uh

3972.2s training on it we will put it in a

3974.2s memory called the replay memory D. We'll

3978.2s put it in there and then at every step

3980.9s we will sample from that memory to

3983.2s decide what to train on. So of course at

3985.8s the beginning if we just have one uh

3989.0s experience in the memory we will train on that experience. But over time you will see that we get more diversity and reuse out of our experiences. So for example, let's say I experience E2, I put it in the memory and then instead of training on E2, I'm going to randomly sample from the memory. I might get E1

3990.6s on that experience. But over time you

3993.0s will see that we get more diversity and

3996.3s reuse out of our experiences. So for

3998.4s example, let's say I experience E2, I

4001.4s put it in the memory and then instead of

4003.4s training on E2, I'm going to randomly

4005.4s sample from the memory. I might get E1

4007.8s or I might get E2. Then I experience E3 and I put it in the memory and I might get one of the three. You know, this is the vanilla experience replay. In practice, there are more methods, like prioritized sweeping, which might tell you which experiences you want to weigh more. Maybe some experiences had a higher

4011.0s and I put it in the memory and I might

4013.5s get one of the three. You know, this is

4016.6s the vanilla experience replay. In

4019.2s practice, there are more methods like

4021.3s prioritized sweeping which might tell you

4023.2s which experience you want to weigh.

4025.0s Maybe some experiences had a higher

4026.6s gradient, so you want to prioritize them more often. You know, things like that. So, all in all, this is what our training looks like with experience replay. We experience um E1, we train on E1. Then we the next training iteration is not on E2, it's on a sample from E1 and E2. Either or. The third experience

4028.6s more often. You know, things like that.

4032.6s So, all in all, this is what our

4034.2s training looks like with experience

4035.6s replay. We experience um E1, we train on

4039.8s E1. Then we the next training iteration

4043.5s is not on E2, it's on a sample from E1

4045.8s and E2. Either or. The third experience

4048.5s is then put in the replay memory but we don't train on it. We train on a sample from whatever is in the replay memory, and we repeat. That is more sample efficient, allows more reuse, and gives less correlation in our training batch. So that's called replay memory

4050.6s don't train on it. We train on a sample

4052.2s from whatever is in the replay memory

4053.8s and we repeat and that is more sample

4057.4s efficient, allows more reusability

4060.1s and less correlation in our

4063.0s training batch.

4069.4s So that's called replay memory

4072.2s and you can use it with mini batch gradient descent. Note that you still experience in the direction that the game is played. Like we still go and take the action as expected. We just don't necessarily update our model parameter based on the action that we ended up taking. We put it in the replay

4073.8s gradient descent. Note that you still

4077.4s experience in the direction that the

4079.2s game is played. Like we still go and

4081.4s take the action as expected. We just

4083.7s don't necessarily update our model

4085.4s parameter based on the action that we

4087.0s ended up taking. We put it in the replay

4089.2s memory. We may train on it later. Okay. So here is how it modifies our vanilla setup. We've added an experience from state S to state S prime to the replay memory. You know, like let me walk you through it again. Within one time step, we forward propagate our

4093.2s Okay.

4095.1s Um so here is how it modifies our

4098.0s vanilla setup. We've added an experience

4103.0s from state S to state S prime to the

4105.9s replay memory. You know, like let me

4108.7s walk you through it again. Within one

4110.7s time step, we forward propagate our

4112.9s state into the Q network. We execute the best action given the Q values. This gives us a reward and the next state. The next state is pre-processed. And then instead of training on that, instead of training, we just add that transition to the replay memory. And instead we sample randomly a mini batch

4114.9s best action given the Q values. This

4117.9s gives us a reward and the next state.

4120.2s The next state is pre-processed. And

4122.2s then instead of training on that,

4124.7s instead of training, we just add that

4127.4s transition to the replay memory. And

4130.5s instead we sample randomly a mini batch

4132.8s of transitions from the replay memory and we train on those, and we redo the same thing again and again. Yes. replay bias towards the start of the game sample from everything. So, aren't you more likely to trade start game just like you would sometimes want? Yeah.

4135.1s we train on those and we redo the same

4138.2s thing again and again.
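The replay memory D walked through above can be sketched with a bounded buffer and uniform random sampling. This is an illustrative version, not the lecture's code: the class name `ReplayMemory`, the `capacity` limit, and the exact contents of a transition tuple are assumptions.

```python
import random
from collections import deque


class ReplayMemory:
    """Stores transitions and hands back random mini-batches of them."""

    def __init__(self, capacity=10000):
        # Oldest transitions are dropped once the memory is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, terminal):
        """Store one experience instead of training on it right away."""
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size, rng=random):
        """Uniformly sample a mini-batch (smaller if the memory is not full yet)."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

In the training loop, each time step would call `add` with the new transition and then train on `sample(batch_size)` rather than on the transition that was just experienced.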

4141.1s Yes.

4142.6s replay bias towards the start of the

4145.6s game sample from

4147.7s everything. So, aren't you more likely

4149.6s to trade start game

4152.8s just like you would sometimes want?

4154.9s Yeah.

4156.6s Yeah, you you you would uh you would within one episode, but you know, if you if you play multiple chess game, uh your replay memory would get already bigger. So, then you would see some end game, some middle of the game, some early games. Yeah, good. And in practice, it's actually useful

4159.8s within one episode, but you know, if you

4162.7s if you play multiple chess game, uh your

4165.5s replay memory would get already bigger.

4167.6s So, then you would see some end game,

4169.4s some middle of the game, some early

4171.3s games. Yeah, good.

4175.3s And in practice, it's actually useful

4177.0s because uh you might imagine that in a chess game, you know, all of us, let's say if you're a beginner, you you see a lot of beginning of the games. You actually people that are beginners, they're good at openings, but they're bad at end games because they don't get to play a lot of end games. Uh well,

4179.7s chess game, you know, all of us, let's

4183.3s say if you're a beginner, you you see a

4185.5s lot of beginning of the games. You

4186.7s actually people that are beginners,

4188.4s they're good at openings, but they're

4190.2s bad at end games because they don't get

4191.8s to play a lot of end games. Uh well,

4194.4s that type of approach could be useful. You can retrain on end games more often. And you know the a more advanced version of the replay memory would also weigh the experience in the replay memory based on how much the gradient is going to be. So if you have an experience that actually was super

4196.5s You can retrain on end games more often.

4198.9s And you know the a more advanced version

4200.9s of the replay memory

4203.0s would also weigh the experience in the

4206.6s replay memory based on how much the

4208.4s gradient is going to be. So if you have

4210.3s an experience that actually was super

4211.8s insightful, you can weigh it higher so that you prioritize grabbing it and retraining on it, essentially. So let's say you blunder in chess, you might actually want to revisit that blunder later so that you don't do it again. Okay, so these were all the different

4214.2s that you you you prioritize grabbing it

4216.8s and retraining on it essentially.

4220.8s So let's say you blunder in chess, you

4222.7s might actually want to revisit that

4224.0s blunder later so that you don't do it

4225.8s again.
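The idea of weighing experiences so that the insightful ones are picked more often can be sketched as weighted sampling. This is a simplified illustration, not a full prioritized replay method: the function name `sample_prioritized` is hypothetical, and how the priorities are computed (for instance from the size of the gradient or error on each experience) is left as an assumption.

```python
import random


def sample_prioritized(experiences, priorities, batch_size, rng=random):
    """Sample with replacement, with probability proportional to priority.

    `priorities` holds one non-negative number per experience; a larger
    number means the experience is picked more often.
    """
    return rng.choices(experiences, weights=priorities, k=batch_size)
```

With priorities such as `[1, 1, 8]`, the third experience (say, the chess blunder) would be drawn about eight times as often as each of the others.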

4227.6s Um okay so these were all the different

4230.8s methods. Another one that's very intuitive and very important is when, during the training process, our agent gets stuck in a local minimum. Here is how it would work in practice. You start in initial state s1 and you have three states ahead of you. If you take action a1 you go to

4233.0s intuitive and very important is when

4235.8s during the training process our um agent

4239.7s gets stuck in a local minimum.

4243.4s Uh here is how it would work in

4245.0s practice. Uh you start in initial state

4247.7s s1 and you have three states ahead of

4249.9s you. If you take action a1 you go to

4252.6s state two, which is a terminal state, and you get a reward of zero. If you take action A2, you get to S3, also a terminal state, and you get a reward of one. And if you take action A3, you get to state four, a terminal state, with a reward of 1,000. So of course to

4255.1s and you get a reward of zero.

4257.6s If you take action A2, you get to S3

4260.0s also a terminal state and you get a

4261.8s reward of one. And if you take action

4264.7s A3, you get to state four, a terminal state

4268.2s with a reward of 1,000. So of course to

4271.4s us it's obvious that we would want to explore state number four. It's pretty obvious. In practice, let's say you initialize your network, and in the first forward pass that's what you get. First forward pass, the network is random, you get a Q value of 0.5 for action one, 0.4 for action two, 0.3 for action three.

4273.2s explore the state number four. It's

4275.4s pretty obvious in practice. Let's say

4277.9s you update you you initialize your

4280.2s network and in the first forward pass

4284.2s that's what you get first forward pass

4286.7s the network is random you get Q value

4289.8s 0.5 for action one,

4292.1s 0.4 for action two, 0.3 for action three.

4295.9s What does that mean? It means the agent is saying I'm going to go to action one. So I take action one and I see an immediate reward of zero. Right? Because it's a terminal state. The Bellman equation thing doesn't happen. I just have the immediate reward which becomes my target Y. And so I perform a gradient descent

4298.2s is saying I'm going to go to action one.

4300.7s So I take action one and I see an

4303.5s immediate reward of zero. Right? Because

4306.7s it's a terminal state. The Bellman equation

4308.9s thing doesn't happen. I just have the

4311.0s immediate reward which becomes my target

4313.9s Y. And so I perform a gradient descent

4316.4s update to say this Q value should have been zero. So I convert this Q value to zero. Now second try. This time the Q value is saying take action two. It's the highest Q value. I take action two. I have an immediate reward ahead of me. That's one because it's a terminal state. There's no second

4318.1s been zero.

4320.0s So I convert this Q value to zero.

4323.0s Now second try.

4325.8s This time the Q value is saying take

4328.6s action two. It's the highest Q value. I

4332.2s take action two. I have an immediate

4334.0s reward ahead of me. That's one because

4337.6s it's a terminal state. There's no second

4340.0s discounted future reward term. Um so I just take Y equals 1. I perform my gradient descent updates and this converts to one. And then third time the agent is still saying go to A2. Go to the take action A2 reward of one. Good. That's what you predicted. Nothing to do. Just keep going. We're done with

4344.2s just take Y equals 1. I perform my

4347.3s gradient descent updates and this

4348.7s converts to one. And then third time the

4353.6s agent is still saying go to A2. Go to

4356.6s the take action A2 reward of one. Good.

4361.2s That's what you predicted. Nothing to

4363.0s do. Just keep going. We're done with

4365.0s training. We're stuck. We never visit the state we actually wanted to visit. Okay. So that that that wouldn't work for us. We will never visit that state using our current algorithm. Does that make sense why we wouldn't ever visit that state? In practice, this is a big issue. The analogy of this uh concept of

4368.4s the state we actually wanted to visit.

4372.3s Okay. So that that that wouldn't work

4374.3s for us. We will never visit that state

4376.6s using our current algorithm. Does that

4378.2s make sense why we wouldn't ever visit

4380.0s that state?
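The three-action example can be replayed in a few lines to see why a purely greedy agent never reaches the 1,000 reward. This is a toy sketch under simplifying assumptions: the Q values are stored in a dictionary instead of a network, and each update sets the chosen Q value exactly to its target, as in the idealized walkthrough above.

```python
# Rewards of the three terminal states and the random initial Q values
# from the example: a1 -> 0, a2 -> 1, a3 -> 1000.
rewards = {"a1": 0.0, "a2": 1.0, "a3": 1000.0}
q = {"a1": 0.5, "a2": 0.4, "a3": 0.3}


def greedy_episode(q, rewards):
    """Play one episode by always taking the action with the highest Q value."""
    action = max(q, key=q.get)
    # Terminal state: the target is just the immediate reward.
    q[action] = rewards[action]
    return action


visited = [greedy_episode(q, rewards) for _ in range(10)]
print(visited)  # a1 once, then a2 every time; a3 is never tried
```

After the first two episodes the Q value for a2 is 1, which is already higher than the untouched 0.3 for a3, so the greedy choice never changes again.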

4381.8s In practice, this is a big issue. The

4384.4s analogy of this uh concept of

4387.1s exploration versus exploitation is when every day you take your bike and you cross um campus, you have a favorite route. And turns out that the more you take that route, the better you get every time. Like you get a little faster. Maybe your turn is faster or something or you can predict how many

4390.1s every day you take your bike and you

4392.0s cross um campus, you have a favorite

4395.4s route. And turns out that the more you

4398.3s take that route, the better you get

4399.9s every time. Like you get a little

4401.3s faster. Maybe your turn is faster or

4403.4s something or you can predict how many

4405.1s people are going to be at that roundabout and you know how to take it in the wide way so you go faster. We've all done that. Um that's exploitation. You exploit what you already know and you get better at it. But maybe there's another route that you're not thinking of that's pretty instead of going north

4406.1s roundabout and you know how to take it

4407.7s in the wide way so you go faster. We've

4409.8s all done that. Um that's exploitation.

4413.0s You exploit what you already know and

4414.7s you get better at it. But maybe there's

4416.9s another route that you're not thinking

4418.6s of that's pretty instead of going north

4421.0s from campus, you go south and maybe it might be better. You will never see because you don't have the courage or the patience to do it. That's the difference between exploration and exploitation. In practice, a good model would be able to handle both: to exploit when it needs to

4423.4s might be better. You will never see

4425.0s because you don't have the courage or

4426.7s the patience to do it. That's the

4429.0s difference between exploration and

4430.2s exploitation.

4432.2s In practice, a good model would be able

4434.5s to handle both to exploit when it needs to

4437.0s exploit, to explore when it needs to explore. The way we do it in practice in our pseudo code is to inject some randomness. So for example, when we are looping over time steps, with probability epsilon, let's say 0.05, take a random action. So from time to time, on average one time every 20 times, you take a

4438.6s explore. The way we do it in practice in

4441.3s our pseudo code is to inject some

4444.0s randomness. So for example, when we are

4447.7s looping over time step with probability

4450.2s epsilon, let's say 0.05, take a random

4452.6s action. So from time to time on average

4455.8s one time every 20 times you take a

4458.1s random action it will allow you to visit maybe a new path. The analogy in chess is you know you might use a creative move from time to time that might be worse today but might allow you to learn something and to get better over time. Yeah. Um in that uh the example we just

4460.3s maybe a new path. The analogy in chess

4463.2s is you know you might use a creative

4465.6s move from time to time that might be

4467.9s worse today but might allow you to learn

4470.0s something and to get better over time.
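The epsilon-greedy rule described here can be sketched in a few lines. This is a minimal illustration, not the lecture's own pseudo code: the function name `epsilon_greedy_action` and the plain-list representation of the Q-values are assumptions made for the example, and the default of 0.05 mirrors the "one time every 20" figure mentioned above.

```python
import random


def epsilon_greedy_action(q_values, epsilon=0.05, rng=random):
    """Pick an action index from a list of Q-values (hypothetical helper).

    With probability epsilon we explore (uniformly random action);
    otherwise we exploit (action with the highest Q-value).
    """
    if rng.random() < epsilon:
        # Explore: ignore the Q-values and pick any action.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value (first index wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the agent is purely greedy, and with `epsilon=1.0` it acts purely at random; in practice epsilon is often decayed over training, which is a design choice outside what the lecture states here.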

4473.3s Yeah. Um in that uh the example we just

4477.4s covered, couldn't we resolve that by just setting the initial um Q values to infinity? Setting the... Couldn't we resolve this problem by setting the initial values to infinity? Well, the problem if you set the initial values to infinity... So you would say instead of randomly

4479.9s just setting the initial um Q values to

4483.9s infinity?

4486.0s Setting the the Couldn't we resolve this

4488.6s problem by setting the initial values

4490.3s to infinity? Well, the problem if

4492.6s you set the initial values to infinity.

4494.8s So you would say instead of randomly

4496.9s initializing your network, you initialize it in a way that the outputs are equal to infinity. Yeah. so that we wouldn't get the issue of where like the Q value of action three state one is well in practice if the three Q values are infinity then you can't make a decision on the spot so you're saying

4498.4s initialize it in a way that the outputs

4500.2s are equal to infinity.

4501.7s Yeah. so that we wouldn't get the issue

4503.6s of where like the Q value of action

4507.1s three state one is

4509.2s well in practice if the three Q values

4511.3s are infinity then you can't make a

4512.9s decision on the spot so you're saying

4514.4s just pick one randomly because if the three are infinity you you can't decide which one to take right and also if in if it's infinity and the reward is one I mean if it's a really large number and the reward is one your gradient is going to be massive right so

4516.6s because if the three are infinity you

4518.2s you can't decide which one to take right

4521.9s and also if in if it's infinity and the

4524.4s reward is one I mean if it's a really

4526.6s large number and the reward is one your

4528.1s gradient is going to be massive right so

4530.9s it's going uh I guess the loss function is going to be massive and um I don't know I imagine it would be really hard to train it but in practice you start with the random initialization because in this might be one example but you know if in the game of chess um actually

4533.1s is going to be massive and um I don't

4536.2s know I imagine it would be really hard

4537.5s to train it but in practice you start

4539.8s with the random initialization because

4541.8s in this might be one example but you

4544.3s know if in the game of chess um actually

4548.6s the the reward is one at the end and zero all the time or maybe the reward is a thousand at the end and when you lose your rook it's uh it's a negative reward. You can't predict what the reward structure is going to be. We want an agent that is able to adapt to it and

4550.7s zero all the time or maybe the reward is

4553.8s a thousand at the end and when you lose

4555.4s your rook it's uh it's a negative

4557.5s reward. You can't predict what the

4559.3s reward structure is going to be. We want

4560.6s an agent that is able to adapt to it and

4563.4s um it's better to find a method that can scale to different environments essentially. Um okay so this was um epsilon-greedy action, which is adding some randomness: with probability epsilon take a random action. Okay, so adding all our techniques because we get good at training reinforcement learning algorithms. This is what we have. We

4565.8s scale to different environments

4567.6s essentially. Um

4571.5s okay so this was um epsilon greedy

4575.8s action which is adding some randomness

4578.6s with probability epsilon take a random

4580.5s action. Okay, so adding all our

4584.1s techniques because we get good at

4586.2s training reinforcement learning

4587.5s algorithms. This is what we have. We

4590.0s initialize our Q network parameters. We have a random network. We initialize our replay memory D. And then we loop over episodes. We start from an initial state. We create a boolean that allows us to detect terminal states. With probability epsilon, we're going to take a random action. Otherwise, we're going to follow what we know, which is forward

4592.0s have a random network. We initialize our

4594.2s replay memory D. And then we loop over

4596.9s episodes. We start from an initial

4598.5s state. We create a boolean that allows

4600.2s us to detect terminal states. With

4602.7s probability epsilon, we're going to take

4604.3s a random action. Otherwise, we're going

4606.4s to follow what we know, which is forward

4608.4s propagate the state in the Q network. take the action that has the highest Q value that allows you to observe a reward and the next state take that next state forward propagate it again

4610.4s take the action that has the highest Q

4612.1s value that allows you to observe a

4614.0s reward and the next state take that next

4616.8s state forward propagate it again

4625.5s uh oh no sorry, observe that next state, add it to the replay memory, sample from the replay memory, and then train on that sample. And in the process you will need to do another forward pass because you need to estimate your target by using the immediate reward plus the Bellman equation plus the discounted future

4628.4s add it to the replay memory sample from

4631.8s the replay memory and then train on that

4634.3s sample and in the process you will need

4636.6s to do another forward pass because you

4638.5s need to estimate your target by using

4640.6s the immediate reward plus the Bellman

4642.6s equation plus the discounted future

4645.0s reward. Okay, are you experts at Q-learning?

4647.7s Okay, are you experts at Q-learning?
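Two pieces of the loop just described, the replay memory and the training target built from the Bellman equation, can be sketched as follows. This is an illustration under stated assumptions rather than the lecture's implementation: the names `ReplayMemory`, `bellman_target`, and `squared_td_error`, the capacity, and the discount factor `gamma=0.9` are all hypothetical, and the Q-network itself is left out (its outputs are passed in as plain lists).

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10_000):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, capped at the buffer size.
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def bellman_target(reward, next_q_values, done, gamma=0.9):
    """Target y = r if the state is terminal, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value, target):
    """Per-sample loss that measures how far Q(s, a) is from the Bellman target."""
    return (target - q_value) ** 2
```

The extra forward pass mentioned above corresponds to computing `next_q_values` for the next state before calling `bellman_target`.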

4657.0s where we get at the end. You can claim proudly you have trained an Atari agent. It's not that complicated as you can see other than the Bellman equation piece. Turns out the agent has discovered that it can send the ball on the back and it's actually much easier to finish the game like that which is quite

4660.0s proudly you have trained an Atari agent. It's

4661.9s not that complicated as you can see

4663.3s other than the Bellman equation piece.

4665.9s Turns out the agent has discovered that

4668.5s it can send the ball on the back and

4670.2s it's actually much easier to finish the

4671.9s game like that which is quite

4674.5s interesting. You know, a good player would know that you can dig a tunnel and you can finish the game without too much issues. Yeah. How do you How do you actually How do you quantify when uh when the game has ended? How would the model come after the train?

4675.8s would know that you can dig a tunnel and

4677.8s you can finish the game without too much

4679.4s issues. Yeah.

4680.4s How do you

4682.6s How do you actually

4685.0s How do you quantify when uh when the

4687.1s game has ended?

4688.0s How would the model come after the

4689.8s train?

4691.0s Yeah. Well, uh, first you would you would you would start seeing um the model get to good rewards as it play like it manages to get really good rewards while earlier it might not, you know, and so that's probably your best guess for how good the model is. In practice, if you're AlphaGo, you can

4693.3s would you would start seeing um the

4696.4s model get to good rewards as it play

4698.6s like it manages to get really good

4700.2s rewards while earlier it might not, you

4703.6s know, and so that's probably your best

4705.8s guess for how good the model is. In

4707.8s practice, if you're AlphaGo, you can

4709.4s also test it against the best humans in the world and you can observe that they're losing against the model. I'm saying you have like a bunch of different chess engines and some of them are way way better than others. They have different... maybe they're both based on reinforcement learning and at the end

4711.0s the world and you can observe that

4712.7s they're losing against the model.

4715.4s I'm saying you have like a bunch of

4717.1s different chess engines

4718.2s and some of them are way way better than

4720.4s others. They have different

4724.6s maybe they're both based on

4725.8s reinforcement learning and at the end

4727.3s they maximize both of their rewards. Yeah. So how do you know which model is actually doing better? You can get them to play together. So you have no idea if you do that one. No, you you could you could actually monitor the loss function and look at is

4729.8s Yeah.

4730.2s So how do you know which model is

4731.9s actually doing better?

4733.5s You can get them to play together. So

4736.1s you have no idea if you do that one.

4738.2s No, you you could you could actually

4739.7s monitor the loss function and look at is

4742.0s the Bellman equation respected. If the Bellman equation is respected, then your model is really really good. And then we we're going to see an example of competitive self-play where you get the model to play against other models and then over time as you watch them play for thousands and thousands of times, you

4745.2s Bellman equation is respected, then your

4747.2s model is really really good. And then we

4750.2s we're going to see an example of

4751.3s competitive self-play where you get the

4753.8s model to play against other models and

4755.7s then over time as you watch them play

4757.7s for thousands and thousands of time, you

4759.8s can tell which model is ahead of another one. You can then sort of copy paste the best model into the other models and then make them play again for many times. And because you have the epsilon greedy approach, one of the model is naturally going to get better than the others because of the randomness that you add.

4761.4s one. You can then sort of copy paste the

4765.3s best model into the other models and

4767.3s then make them play again for many

4768.9s times. And because you have the epsilon

4770.8s greedy approach, one of the model is

4772.7s naturally going to get better than the

4773.9s others because of the randomness that

4776.6s you add.

4783.0s and then we'll spend 20 minutes on the RLHF. Um, here are other examples. This is Pong, uh, which is one v one, Seaquest,

4784.6s RLHF. Um, here are other examples. This

4788.1s is Pong, uh, which is one v one, Seaquest,

4797.5s the one that maybe more of you know, Space Invaders, very popular game as well. So the impressive thing that they showed is that you can um actually solve many games with the exact same algorithm, no tweaks, which is quite impressive. Let's go a little further and talk about um advanced topics. Um here is a game

4799.2s Space Invaders very popular game as

4802.3s well.

4803.9s So the impressive thing that they showed

4805.6s is that you can um actually solve many

4809.0s games with the exact same algorithm,

4812.0s no tweaks, which is quite impressive.

4818.3s Let's go a little further and talk about

4820.7s um advanced topics. Um here is a game

4824.0s called Montezuma's Revenge. Um this game is particular because you're controlling a little character right here. And this character is trying to go and grab, let's say, this key right here, and it has some obstacles or some enemies that it needs to take care of. What what do you think is going to be an

4827.6s is particular because you're controlling

4829.9s a little character right here. And this

4832.4s character is trying to go and grab,

4834.5s let's say, this key right here, and it

4837.5s has some obstacles or some enemies that

4840.0s it needs to take care of.

4842.5s What what do you think is going to be an

4845.4s issue if we apply what we just learned to this game? comparison to let's say chess or go? Yes. Yeah, the reward is very delayed. Like if you start with a random network, what are the chances that the network is going to figure out that to get to the

4848.0s to this game?

4854.6s comparison to let's say chess or go?

4858.2s Yes.

4860.7s Yeah, the reward is very delayed. Like

4863.4s if you start with a random network, what

4866.3s are the chances that the network is

4868.6s going to figure out that to get to the

4870.2s key, it actually should go in the opposite direction. It should go in the opposite direction. It could jump down here. It should catch the rope. The rope will probably allow the character to go to the ladder. It goes down the ladder. It has to go jump up this enemy. My

4872.8s opposite direction. It should go in the

4874.3s opposite direction. It could jump down

4876.2s here. It should catch the rope. The rope

4878.9s will probably allow the character to go

4880.6s to the ladder. It goes down the ladder.

4883.5s It has to go jump up this enemy. My

4886.4s guess is it's an enemy. I'm not sure, but I think it's an enemy because of the color. And I know that in gaming if it was green it might not have been an enemy but if it's gray or red it might be an enemy. And then go up the ladder

4888.5s but I think it's an enemy because of the

4890.3s color. And I know that in gaming if it

4892.7s was green it might not have been an

4894.2s enemy but if it's gray or red it might

4896.5s be an enemy. And then go up the ladder

4899.0s and grab the key. The chance is very low that the agent is going to make that successive good decisions to get there. You're right. Why is it use why is it easier for a human to actually solve that game? Intuition. Prior knowledge. So, for example, when you look at this game,

4902.7s that the agent is going to make that

4905.0s successive good decisions to get there.

4907.5s You're right. Why is it use why is it

4910.6s easier for a human to actually solve

4912.0s that game?

4916.6s Intuition. Prior knowledge. So, for

4919.7s example, when you look at this game,

4921.3s even if you have never played it, my guess is you would know you can go down the ladder because you know what a ladder is. Or you can see this little rope and you're like, I'm going to catch the rope. I'm going to jump and go to the other side. And you look at this

4923.4s guess is you would know you can go down

4924.8s the ladder because you know what a

4926.7s ladder is. Or you can see this little

4928.9s rope and you're like, I'm going to catch

4930.2s the rope. I'm going to jump and go to

4931.6s the other side. And you look at this

4933.3s little monster and you're like, I better not touch this monster. Or if anything, I will jump on top of it. Cuz you've played Mario, let's say. So, all of this is human intuition. uh sometimes you would call as a baby survival instinct like you throw the baby in the water and

4934.9s not touch this monster. Or if anything,

4936.5s I will jump on top of it. Cuz you've

4938.6s played Mario, let's say. So, all of this

4941.6s is human intuition. uh sometimes you

4945.0s would call as a baby survival instinct

4947.1s like you throw the baby in the water and

4948.6s suddenly it flips and it can um it can uh swim. Those are things that are to a certain extent encoded in our DNA but at the very least encoded in our experience of doing other things that have nothing to do with this game. And so the the problem here is called imitation

4951.5s uh swim. Those are things that are to a

4954.0s certain extent encoded in our DNA but at

4956.1s the very least encoded in our experience

4957.6s of doing other things that have nothing

4959.0s to do with this game. And so the the

4961.3s problem here is called imitation

4962.7s learning. Is is there a better way to start our network than a random initialization that allows the network to for example guess that this is a ladder and turns out that if the network knows that it will be more likely to get to the reward first and then learn from that reward and then get better over

4966.0s start our network than a random

4967.8s initialization that allows the network

4970.0s to for example guess that this is a

4971.6s ladder and turns out that if the network

4973.7s knows that it will be more likely to get

4975.5s to the reward first and then learn from

4977.5s that reward and then get better over

4979.0s time. The other part that can also use human knowledge which is what we're going to see together is reinforcement learning from human feedback where you have an analogy here which is you can train a language model and it might be completely misaligned with what actually humans care about. How does

4981.0s The other part that can also use human

4983.2s knowledge which is what we're going to

4984.4s see together is reinforcement learning

4986.6s from human feedback where you have an

4990.2s analogy here which is you can train a

4992.2s language model and it might be

4993.8s completely misaligned with what actually

4995.6s humans care about. How does

4997.5s reinforcement learning help in those situations? That's going to be the next topic in the last part of the lecture. Okay, let me show you a few other results um quickly. Today we talked about DQN, deep Q-learning. In practice there's a lot more reinforcement learning algorithms but you

4999.0s situations? That's going to be the next

5000.2s topic in the last part of the lecture.

5002.9s Okay, let me show you a few

5004.6s other results um quickly. Today we

5007.8s talked about DQN, deep Q-learning. In

5011.0s practice there's a lot more

5012.8s reinforcement learning algorithm but you

5014.6s got the gist of it. You got the concept of making good sequences of decisions, epsilon-greedy, exploration, exploitation, um, terminal state, starting state, all of that you got. The one algorithm that is very popular right now is called PPO, proximal policy optimization. There is one that is even more popular right now that's actually

5016.3s of making good sequences of decision

5018.3s epsilon greedy exploration exploitation

5021.8s um terminal state starting state all of

5023.8s that you you got the one the one

5025.8s algorithm that is very popular right now

5027.9s is called PPO, proximal policy

5031.4s optimization. There is one that is even

5033.8s more popular right now that's actually

5035.3s from a year ago at Stanford called DPO that we won't study in the class. One of the things to know about PPO just just to go over it really quickly and and I I pasted two important papers from Schulman a few years back, trust... TRPO and PPO, um is that it is not a value based

5038.4s that we won't study in the class. One of

5040.8s the things to know about PPO just just to

5042.6s go over it really quickly and and I I

5044.4s pasted two important papers from

5046.2s Schulman a few years back trust TRPO and

5049.5s PPO um is that it is not a value based

5052.9s algorithm. So in Q-learning you learn the Q values and then you define your policy as the arg max of the Q values. In PPO, you learn the policy directly, which is a more probabilistic method. Um, it also works well with continuous spaces. If you look at the Q-learning, we learned one output for

5054.8s the Q values and then you define your

5056.6s policy as the arg max of the Q values.

5060.6s In PPO, you learn the policy directly,

5063.4s which is a more probabilistic method.

5065.8s Um, it also works well with

5067.8s continuous spaces. If you look at the

5069.8s Q-learning, we learned one output for

5072.4s one action. If you actually have um a game that has continuous action like autonomous driving where it's not like just turn the wheel to the right or to the left, it's like what degree you turn it, it's continuous, then the DQN would not work well or you would have to granularize the number of actions: a

5076.6s game that has continuous action like

5078.4s autonomous driving where it's not like

5080.6s just turn the wheel to the right or to

5082.2s the left, it's like what degree you turn

5084.4s it, it's continuous. then the DQN would

5087.4s not work well or you would have to

5090.2s granularize the number of action a

5091.8s little bit to the right, a little bit more, a little more, which would not be really useful. Instead you would use PPO

5093.0s more a little more which would not be

5094.6s really useful. Instead you would use PPO
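To make the "granularize" workaround concrete, here is a small sketch of what discretizing a continuous steering angle would look like for a DQN-style network with one output per action. The function name `discretize_actions` and the plus or minus 30 degree range are assumptions chosen for illustration only.

```python
def discretize_actions(low, high, n_bins):
    """Return n_bins evenly spaced action values covering [low, high].

    A DQN would then need one output unit per returned value.
    """
    if n_bins < 2:
        raise ValueError("need at least two bins")
    step = (high - low) / (n_bins - 1)
    return [low + i * step for i in range(n_bins)]


# Hypothetical steering range in degrees, coarse grid of five actions.
steering_angles = discretize_actions(-30.0, 30.0, 5)
```

A finer grid means more outputs to learn, and several joints or controls multiply the counts together, which is one way to read the remark that this approach would not be really useful.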

5110.0s reward in DQN um different reward structure will lead to different uh types of you know agent strategies. Uh but you're right for the game of go you could actually define the reward as one if you win and zero if you don't win. That's it. You know every move will be

5113.0s structure will lead to different uh

5115.0s types of you know agent strategies. Uh

5117.9s but you're right for the game of go you

5119.9s could actually define the reward as one

5122.1s if you win and zero if you don't win.

5124.7s That's it. You know every move will be

5126.3s zero until the last move is a win. In chess you might actually do intermediate reward because you want to tell the you want to tell the agent that it's good to kill the opponent's pieces to get rid of them. You could also do end to end and say I don't give any intermediate

5128.5s chess you might actually do intermediate

5130.2s reward because you want to tell the you

5132.2s want to tell the agent that it's good to

5133.8s kill the opponent's pieces to get rid of

5136.6s them. You could also do end to end and

5138.6s say I don't give any intermediate

5140.2s reward. I just give a final reward which might be more complicated to train on but it might actually lead to a more optimal strategy because in fact you could actually win without taking any piece from your opponent. So other things about PPO is um you know it's more probabilistic. It has a

5142.2s might be more complicated to train on

5143.9s but it might actually lead to a more

5145.4s optimal strategy because in fact you

5147.9s could actually win without taking any

5149.5s piece from your opponent. So

5152.9s other things about PPO is um you

5155.5s know it's more probabilistic. It has a

5158.0s concept of an expected advantage which at every steps instead of telling you how good that action is, it would tell you how much better it is than random than than the current state like how much better would it be to do certain thing versus what you would have done otherwise. I'm not going to go into the

5160.5s at every steps instead of telling you

5162.7s how good that action is, it would tell

5165.0s you how much better it is than random

5167.4s than than the current state like how

5168.8s much better would it be to do certain

5170.9s thing versus what you would have done

5172.2s otherwise. I'm not going to go into the

5173.6s details. It's all in the paper. But those are things that are important. Here's a few examples of PPO. So this example on the left is from OpenAI a few years back where you can see it's a continuous space where uh the agent is being um bullied a little bit but um

5175.3s those are things that are important.
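Since the lecture defers the details to the papers, the following is only a rough sketch of the two ideas mentioned: the advantage, taken here as A(s, a) = Q(s, a) - V(s), and the clipped surrogate term from the PPO paper, min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the probability ratio between the new and old policy. The function names and the default `clip_eps=0.2` are assumptions for this example; a real implementation averages this term over a batch and maximizes it with a gradient-based optimizer.

```python
def advantage(q_value, state_value):
    """How much better an action is than the state's baseline value."""
    return q_value - state_value


def clipped_surrogate(ratio, adv, clip_eps=0.2):
    """Per-sample PPO clipped objective term (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); clipping keeps the new policy
    from being rewarded for moving too far from the old one in one update.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * adv, clipped_ratio * adv)
```

For a positive advantage the term stops growing once the ratio exceeds 1 + eps, and for a negative advantage it stops improving once the ratio drops below 1 - eps.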

5177.4s Here's a few examples of PPO. So this

5180.1s example on the left is from OpenAI a

5183.8s few years back where you can see it's a

5185.8s continuous space where uh the agent is

5189.3s being um bullied a little bit but um

5192.7s it's trying to grab the rewards but it's also subject to external forces that are sort of throwing balls at it. It's a little bit mean, but um you can imagine that this is a continuous space, meaning you're controlling the nodes, you're controlling the joints of the agent, and you're controlling the forces, the

5195.7s also subject to external forces that are

5198.1s sort of throwing balls at it. It's a

5200.4s little bit mean, but

5203.5s um you can imagine that this is a

5206.1s continuous space, meaning you're

5207.4s controlling the nodes, you're

5209.2s controlling the joints of the agent, and

5211.4s you're controlling the forces, the

5213.1s angles, and so it's a that's why PPO would be better in that case. Super. Here's a competitive self-play, which I really like other. And this is the sumo game. Push the opponent outside the ring and you get a reward. So actually it's interesting because you're seeing some emergent behavior which is they attack each other's feet

5215.8s would be better in that case.

5218.8s Super. Here's a competitive self-play,

5221.4s which I really like

5227.7s other. And this is the sumo game. Push

5229.7s the opponent outside the ring and you

5231.0s get a reward.

5233.1s So actually it's interesting because

5234.5s you're seeing some emergent behavior

5236.2s which is they attack each other's feet

5239.4s or they lower their center of gravity to be more stable for example. Yeah. laughter Yeah. Yeah. It's versions sometimes different initializations for example. So no but good question. So often time what OpenAI would do back, you know, back in that time is they would create copies of

5242.5s be more stable for example. Yeah.

5247.3s laughter

5248.3s Yeah. Yeah. It's versions sometimes

5250.4s different initializations for example.

5254.2s So no but good question. So often time

5256.9s what OpenAI would do back, you know, back in

5259.2s that time is they would create copies of

5262.4s the same model they would initialize them differently and they would let them learn and turns out one of the model will get better than the others and then they will copy again that model to the rest and do the same thing again and again pretty much. Oh yeah it's kind of funny isn't it? That's a good catch.

5264.6s them differently and they would let them

5266.4s learn and turns out one of the model

5268.3s will get better than the others and then

5270.2s they will copy again that model to the

5272.2s rest and do the same thing again and

5273.8s again pretty much.
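The copy-the-winner loop described here can be illustrated with a toy simulation. Everything in it is a stand-in: each "model" is reduced to a single hypothetical skill number, `play_match` is a noisy comparison rather than a real game, and the small random perturbation of the copies stands in for the further learning and randomness the lecture mentions. It shows only the selection and copying structure, not actual training.

```python
import random


def play_match(skill_a, skill_b, rng):
    """Return True if A wins; a noisy comparison standing in for a real game."""
    return skill_a + rng.gauss(0, 1) > skill_b + rng.gauss(0, 1)


def self_play_round(population, rng, games_per_pair=20):
    """Play a round-robin, then copy the best agent into every slot."""
    wins = [0] * len(population)
    for i in range(len(population)):
        for j in range(i + 1, len(population)):
            for _ in range(games_per_pair):
                if play_match(population[i], population[j], rng):
                    wins[i] += 1
                else:
                    wins[j] += 1
    best = max(range(len(population)), key=lambda k: wins[k])
    # Keep the winner unchanged and fill the other slots with perturbed copies.
    copies = [population[best] + rng.gauss(0, 0.1) for _ in population[1:]]
    return [population[best]] + copies


rng = random.Random(0)
population = [0.0, 1.0, 5.0]
for _ in range(3):
    population = self_play_round(population, rng)
```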

5276.2s Oh yeah it's kind of funny isn't it?

5279.4s That's a good catch.

5285.8s that's a good goal. laughter

5293.8s they're a little awkward, you have to say, but but it it works. Okay, so this, you know, I let you watch the video, it's going to be shared, but um here's another set of games that are even more complicated that I mentioned early on. OpenAI Five, which you can think of as an

5295.3s say, but but it it works. Okay, so this,

5299.3s you know, I let you watch the video,

5300.6s it's going to be shared, but um here's

5302.8s another set of games that are even more

5305.0s complicated that I mentioned early on.

5307.0s OpenAI Five, which you can think of as an

5309.6s equivalent of League of Legends, Dota, where you have um 5v5 game so you have to collaborate etc which makes it adds like literally one additional um degree of complexity. Uh and StarCraft, uh, AlphaStar from DeepMind is an example of where the observation is not the entire state, you have fog, and so that adds

5311.8s where you have um 5v5 game so you have

5315.8s to collaborate etc which makes it adds

5318.7s like literally one additional um degree

5321.8s of complexity. Uh and StarCraft, uh, AlphaStar

5325.8s from DeepMind is an example of

5328.6s where the observation is not the entire

5331.2s state you have fog and so that adds

5333.6s another layer of complexity. not going to see that together today. Um, I would encourage you to look at the AlphaGo documentary on Netflix if you haven't. Who has seen it already? Nobody. Okay. Well, um, you can now watch it with a different eye, understanding reinforcement learning. And at some point in the in the

5335.8s to see that together today. Um,

5339.4s I would encourage you to look at the

5342.1s AlphaGo documentary on Netflix if you

5344.4s haven't. Who has seen it already?

5346.7s Nobody. Okay. Well, um, you can now

5349.8s watch it with a different eye,

5351.4s understanding reinforcement learning.

5353.4s And at some point in the in the

5357.6s documentary, you will see that um AlphaGo makes a very odd move, a very creative move. And people are like, I don't understand that move. Even the top researchers or the best players would say in the video, they don't understand that move. It turns out that that move is very unintuitive for humans because

5361.4s Go makes a very odd move, a very

5365.0s creative move. And people are like, I

5367.8s don't understand that move. Even the top

5370.0s researchers or the best players would

5372.4s say in the video, they don't understand

5373.9s that move. It turns out that that move

5376.5s is very unintuitive for humans because

5379.0s as humans, we are trained to maximize our chances of winning. Like literally, if I can eat all your pieces in chess, I will eat all your pieces. And if I can surround your stones in go as much as I can, I will do it. The agent is just programmed to win. So that move actually

5381.6s our chances of winning. Like literally,

5383.7s if I can eat all your pieces in chess, I

5386.1s will eat all your pieces. And if I can

5388.5s surround your stones in go as much as I

5390.8s can, I will do it. The agent is just

5393.2s programmed to win. So that move actually

5395.4s looks counterintuitive because the agent doesn't care about winning by one or winning by a, you know, 20 stones. It just cares about winning. And that move specifically put the agent in a good place to win by a small margin. Yeah. So that's an example of an insight that you will learn. you you understand from this

5397.0s doesn't care about winning by one or

5399.4s winning by a, you know, 20 stones. It

5401.9s just cares about winning. And that move

5404.2s specifically put the agent in a good

5406.6s place to win by a small margin. Yeah. So

5409.2s that's an example of an insight that you

5410.9s will learn. you you understand from this

5412.8s class and you will see in the in the documentary. Okay, I think we have 10 minutes. I'm I'm just going to introduce uh reinforcement learning from human feedback because it's a more um modern topic that um is very trendy right now. It's important to know and so let's look

5415.4s documentary.

5418.2s Okay, I think we have 10 minutes. I'm

5420.4s I'm just going to introduce uh

5422.2s reinforcement learning from human

5423.7s feedback because it's a more um modern

5427.8s topic that um is very trendy right now.

5430.6s It's important to know and so let's look

5432.7s at it together. We're going to start by recapping how language models are trained in a nutshell and then we'll see what self what supervised fine-tuning looks like. We'll talk about how do we train a critique model, a reward model and then finally what RLHF looks like and why is it so trending in the news. So

5434.6s recapping how language models are

5436.1s trained in a nutshell and then we'll see

5439.0s what self what supervised fine-tuning

5441.3s looks like. We'll talk about how do we

5443.4s train a critique model, a reward model

5445.8s and then finally what RLHF looks like

5449.0s and why is it so trending in the news.

5451.4s So

5452.5s um our training objective for language models is next token prediction, right? We've already talked about it in a former lecture. The idea is that I will get some inputs. I'm reading Wikipedia, let's say, or some sort of a text online and I read a sentence and I predict the last token and I do that again and

5456.2s models is next token prediction, right?

5458.4s We've already talked about it in a

5459.8s former lecture. The idea is that I will

5462.7s get some inputs. I'm reading Wikipedia,

5464.7s let's say, or some sort of a text online

5467.4s and I read a sentence and I predict the

5470.9s last token and I do that again and

5472.6s again. So for example, "deep learning", and then "deep learning is", "deep learning is so", "deep learning is so cool", and that's it. Um, so you get the idea, right? You always predict the next token and then over time it forces the model to um exhibit emergent behaviors and it understands the connections

5476.0s then deep learning

5477.9s is deep learning is

5480.8s so deep learning is so cool

5485.3s and that's it. Um, so you get the idea,

5488.4s right? you always predict the next token

5490.1s and then over time it forces the model

5492.3s to um exhibit um emergent behaviors

5496.4s and it understands the connections

5498.0s between those concepts and it's really good at generating text. We compute a loss function. You're actually going to study this loss function in uh C 5. So I'm not going to talk about it right now, but you perform, you know, a gradient descent loop. And this is how you get your first pre-trained language

5500.2s good at generating text. We compute a

5503.8s loss function. You're actually going to

5505.0s study this loss function in uh C 5. So

5509.1s I'm not going to talk about it right

5510.2s now, but you perform, you know, a

5512.9s gradient descent loop. And this is how

5516.5s you get your first pre-trained language

5519.0s model. Get a pre-trained language model. You can call it on a text or a prompt and it will continually generate and you call it again and again and again and it generate generate generates. Everybody's comfortable with that, right? Okay. So, uh that's how we trained a language model. But there is a couple of

5520.9s You can call it on a text or a prompt

5523.0s and it will continually generate and you

5524.7s call it again and again and again and it

5526.3s generate generate generates. Everybody's

5528.8s comfortable with that, right? Okay. So,

5532.5s uh that's how we trained a language

5534.2s model. But there is a couple of

5535.8s problems. The first problem is that online data does not reflect helpfulness. So to give you a concrete example um what you might find in the training set is something like "deep learning is so cool" when actually what you might find in practice is people asking "what is deep learning?"

5540.6s online data does not reflect

5543.0s helpfulness.

5545.7s So to give you a concrete example um

5548.6s what you might find in the training set

5550.4s is something like deep learning is so

5552.2s cool when actually what you might find

5555.0s in practice is people asking what is

5557.4s deep learning.

5559.5s So the data is not really reflective of the fact that you want an agent to be helpful and that's a problem because the model was trained to continue text rather than answer questions and in practice you would see it's a big problem. Another problem is the model has no concept of good, polite, or

5561.9s you want an agent to be helpful and

5564.0s that's a problem because the model was

5566.9s trained to continue text rather than

5568.7s answer questions

5570.9s and in practice you would see it's a big

5572.4s problem. Another problem is the model

5574.7s has no concept of good, polite, or

5577.4s helpful yet. And to give you a concrete example, you might actually ask a pre-trained language model, "My laptop won't turn on. What should I do?" And then what the model responds, because it has read it on Reddit or on Wikipedia, is "Laptops sometimes don't turn on because of power issues." Um, which is not what

5580.4s example, you might actually ask a

5582.6s pre-trained language model, my laptop

5584.9s won't turn on. What should I do? And

5587.4s then the model responds because it has

5589.1s read it on Reddit or on Wikipedia, is

5591.8s laptops sometimes don't turn on because

5593.8s of power issues. Um, which is not what

5596.4s you asked. You asked, "What should I do?" And in fact, a better answer would have been, "Check if your charger is properly connected or the outlet works. If that's fine, try holding the power button for 10 seconds. If it still doesn't start, the battery or motherboard may," blah blah blah blah. That's a better

5599.4s in fact, a better answer would have been

5601.7s check if your charger is properly

5603.6s connected or the outlet

5605.4s works. If that's fine, try holding the

5607.0s power button for 10 seconds. If it still

5608.9s doesn't start, the battery or motherboard

5610.3s may blah blah blah blah. That's a better

5611.9s answer. That's what you want a language model to do nowadays. Um, and the model um can give you factual text because that's what it's been trained on, but it doesn't understand being helpful or um having an answer that looks like a humanlike answer. So our solution to it will start with

5613.4s language model to do nowadays. Um, and

5616.2s the model um can give you factual text

5619.4s because that's what it's been trained

5620.5s on, but it's it doesn't understand being

5622.2s helpful or um having an answer that

5625.0s looks like a humanlike answer.

5629.0s So our solution to it will start with

5631.6s using supervised fine-tuning um which is going to be learning from human-written demonstrations of helpful behavior and then we go even further and use RLHF which will optimize not only for human-written sentences or paragraphs but for preferences and the word preference is the keyword. Let's talk about how we can improve our pre-trained model with

5634.6s going to be learning from human written

5636.6s demonstrations of helpful behavior and

5639.4s then we go even further and use RLHF

5642.5s which will optimize not only for human

5644.9s written sentences or paragraphs but for

5647.4s preferences and the word preference is

5649.7s the keyword. Let's talk about how we can

5652.3s improve our pre-trained model with

5653.6s supervised fine-tuning. The idea um is that we want to align models with human-written responses and the step one that we're going to use is to build a data set. Let's build a data set of human prompt-response pairs. So what actually OpenAI is going to do, I'll explain it in a second, is it

5655.6s The idea um is that we want to align

5658.4s models with human written responses

5662.3s and the step one that we're going to

5664.7s use is to build a data set. Let's build

5666.9s a data set of human prompt response

5669.9s pairs. So what actually OpenAI is going to

5673.0s do I'll explain it in a second is it

5674.8s might collect some of the prompts that we all use and then ask humans to respond to those prompts and put that in a data set. It might also ask separately experts to write really good prompts and then answer those prompts. It's a fully human-made data set. And then we use that data set to fine-tune

5677.4s we all use and then ask humans to

5679.9s respond to those prompts and put that in

5681.7s a data set. It might also ask

5684.1s separately experts to write really good

5686.1s prompts and then answer those prompts.

5689.0s It's a fully human-made data set. And

5691.8s then we use that data set to fine-tune

5694.2s our pre-trained model. And by now you've learned fine-tuning in the online video. So you know what I'm talking about using supervised learning. So what it looks like is I take my pre-trained model that I just told you how we train and then I give it a prompt explain deep learning to a beginner and I also will

5696.2s learned fine-tuning in the online video.

5698.2s So you know what I'm talking about using

5700.4s supervised learning. So what it looks

5702.0s like is I take my pre-trained model that

5705.0s I just told you how we train and then I

5707.4s give it a prompt explain deep learning

5709.9s to a beginner and I also will

5711.9s concatenate to it a response, a good response written by a human: "Deep learning is a type of machine learning that uses neural", and then I expect the model to come up with the word "networks". So it's literally do whatever we did to train the pre-trained model but we do it on human-written prompt-response pairs

5716.9s response written by a human. Deep

5719.2s learning is a type of machine learning

5720.9s that uses neural

5723.3s and then I expect the model to come up

5725.0s with the word networks.
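
As a rough sketch of the prompt-plus-response setup just described, the toy code below splits a concatenated prompt and human-written response into next-token training examples. It is an illustration under stated assumptions, not the recipe of any particular lab: whitespace "tokens" stand in for a real tokenizer, the function name build_sft_examples is made up for this example, and only response tokens are used as targets.

```python
def build_sft_examples(prompt, response):
    """Return (context, target) pairs for next-token prediction on the response.

    Assumptions for illustration: whitespace splitting replaces a real
    tokenizer, and the prompt is used as context only (no targets from it).
    """
    prompt_tokens = prompt.split()
    response_tokens = response.split()
    examples = []
    for i, target in enumerate(response_tokens):
        # Context = full prompt + the response tokens written so far.
        context = prompt_tokens + response_tokens[:i]
        examples.append((context, target))
    return examples


examples = build_sft_examples(
    "Explain deep learning to a beginner.",
    "Deep learning is a type of machine learning that uses neural networks",
)
context, target = examples[-1]
print(" ".join(context), "->", target)
```

The last example is exactly the case from the lecture: given the prompt and "Deep learning is a type of machine learning that uses neural", the expected next token is "networks". Each pair would then be scored with the same next-token loss used in pre-training.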

5727.3s So it's literally do whatever we did to

5730.4s train the pre-trained model but we do it

5732.6s on human-written prompt-response pairs

5740.6s use the you know the same loss function, how far the model's response is from a human response. Um, you do that many times and you will get SFT, supervised fine-tuning. Um, but it has some shortcomings

5742.9s how far the model's response is from a

5744.8s human response um you do that many times

5748.1s and you will get SFT supervised

5751.4s fine-tuning

5753.0s um but it has some shortcomings

5760.6s that is extremely costly to collect. In fact, I believe in the first version of that InstructGPT there was only 13,000 prompt-response pairs and turns out it did really well despite that.

5763.8s fact, I believe in the first version of

5765.8s that InstructGPT there was only 13,000

5770.1s prompt response pairs and turns out it

5773.1s did really well despite that.

5781.0s generalize well because you're again you're not doing reinforcement learning here. You're doing supervised learning and so you're just showing a set of examples, 13,000 examples that you want to learn, but what tells you that it will generalize to an unseen prompt that will come up from your user base? And so this approach SFT really teaches the

5783.7s you're not doing reinforcement learning

5785.3s here. You're doing supervised learning

5788.8s and so you're just showing a set of

5790.8s examples 13,000 examples that you want

5793.3s to learn, but what tells you that it

5795.8s will generalize to an unseen prompt that

5798.6s will come up from your user base. And so

5802.3s this approach SFT really teaches the

5805.6s model to imitate good behavior from humans. And that's the key. It's imitation. It is not preference optimization. To do preference optimization, that's where we're going to train a reward model and we're going to do proper RLHF. So let me talk to you about the RM, the reward model, and then I'll tell

5807.8s humans. And that's the key. It's it's

5810.2s imitation. It is not preference

5812.7s optimization.

5815.6s To do preference optimization,

5818.0s that's where we're going to train a

5819.4s reward model and we're going to do

5821.0s proper RLHF. So let me talk to you about

5824.0s the RM reward model and then I'll tell

5827.0s you about RLHF in a nutshell. The problem of um RLHF is to align not with human responses but with human preferences. So what's going to happen is we're going to train a separate model to predict which responses humans prefer and we're going to call that model the

5831.3s The problem of um RLHF is to align not

5835.8s with human responses but with human

5837.8s preferences.

5839.7s So what what's going to happen is we're

5842.1s going to train a separate model to

5845.0s predict which responses humans prefer and

5848.6s we're going to call that model the

5850.1s reward model. It's a separate model from whatever we've trained before. The model um is going to use data from labelers. So you're going to show labelers two or more responses to the same prompt and those responses will be sampled from the SFT. So your best model right now is the SFT. You will sample three or four

5851.8s whatever we've trained before. The model

5855.0s um is going to use data from labelers.

5857.9s So you're going to show labelers two or

5860.9s more responses to the same prompt

5864.1s and those responses will be sampled from

5867.6s the SFT. So your best model right now is

5870.3s the SFT. You will sample three or four

5873.7s responses and you know how we sample, right? You can tweak the temperature. You can select not only the top-probability word, the softmax layer's number one word, but you can sometimes sample differently and you will get a variety of answers. And then you will ask a human labeler to say answer B is better

5875.6s right you you can tweak the temperature.

5877.7s You can select not only the top-probability

5880.2s word, the softmax layer's number

5883.1s one word, but you can sometimes sample

5885.4s differently and you will get a variety

5887.3s of answers. And then you will ask a

5889.7s human labeler to say answer B is better

5892.0s than answer C and answer C is better than answer A and answer A is sort of equal to answer D, let's say. Um they will be asked which response they prefer and it can get more complicated. It doesn't have to be just a simple ranking. You have multiple Likert

5894.3s than answer A and answer A is sort of

5896.9s equal to answer D. Let's say

5900.7s um

5903.0s they will be asked which response they

5904.6s prefer and it can get more complicated.

5907.0s It doesn't have to be just a simple

5908.4s ranking. You have multiple Likert

5910.8s scale methods and so on. But the point is that you will collect those pairwise comparisons that we call preference data and you will use it to train a reward model which is initialized from your SFT. So your SFT is here. It's your best model to date and you're going to modify the last

5913.4s point is that you will collect those

5914.8s pairwise comparisons that we call

5916.6s preference data and you will use it to

5918.7s train a reward model which is

5920.7s initialized from your SFT. So your SFT

5923.6s is here. It's your best model to date

5926.5s and you're going to modify the last

5928.6s layer. So the softmax layer at the end of a language model that will tell you this is the token we should output or this is the word we should write. Um instead of that you'll get rid of that layer. You'll put a scalar value as output. You'll put a linear layer with a

5931.1s of a language model that will tell you

5933.4s this is the token we should output or

5935.4s this is the word we should write. Um

5937.8s instead of that you'll get rid of that

5939.5s layer. You'll put a scalar value as

5941.9s output. You'll put a linear layer with a

5943.9s scalar value that will represent the reward head. It will predict the the reward you know uh which is a proxy for the preference of the human. The way you'll train that reward model is you'll give it um a batch of two um you know you give it a prompt X with a response A

5945.9s reward head. It will predict the the

5948.2s reward you know uh which is a proxy for

5950.9s the preference of the human. The way

5952.6s you'll train that reward model is you'll

5955.2s give it um a batch of two um you know

5958.7s you give it a prompt X with a response A

5962.1s and the preference of the user and you'll give it the same prompt with a response B from the SFT with the preference of the user. So here the user is saying response A is better than response B. And so if you actually were sending that in this model, you will get a predicted reward for the preferred

5963.9s you'll give it the same prompt with a

5965.9s response B from the SFT with the

5968.5s preference of the user. So here the user

5970.3s is saying response A is better than

5972.7s response B. And so if you actually were

5975.3s sending that in this model, you will get

5978.0s a predicted reward for the preferred

5980.2s answer and a predicted reward for the dispreferred answer. That allows you to uh train using a loss function that I'm not going to cover given our time sensitivity. The loss function will encourage the model to assign higher rewards to preferred responses. So you're trying to dissociate the higher reward, better preference, from the lower reward for

5982.4s dispreferred answer

5985.8s that allows you to uh train using a loss

5992.6s function that I'm not going to cover

5993.7s given our our time sensitivity. The loss

5996.4s function will encourage the model to

5998.0s assign higher rewards to preferred

6000.8s responses. So you're trying to

6002.9s dissociate the higher reward better

6006.1s preference from the lower reward for

6007.8s lower preference. And it turns out that if you do that many times, you will have a reward model that given a prompt and a response will be approximating human preference. So you've just trained a critic that represents your humans. It's a proxy for what humans prefer. It's been trained on a lot of human preferences.

6009.7s if you do that many times, you will have

6012.0s a reward model that given a prompt and a

6014.8s response will be approximating human

6018.2s preference. So you've just trained a

6020.6s critic that represents your humans.

6023.8s It's a proxy for what humans prefer.

6025.9s It's been trained on a

6026.9s lot of human preferences.
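
The lecture skips the reward model's loss function. One common choice, and it is an assumption that this is the one meant here, is a pairwise logistic loss: the negative log of the sigmoid of the difference between the reward predicted for the preferred response and the reward predicted for the dispreferred one. The sketch below computes it for plain floats; the function name and the example reward values are made up for illustration.

```python
import math


def pairwise_preference_loss(reward_preferred, reward_dispreferred):
    """Compute -log(sigmoid(r_preferred - r_dispreferred)) in a stable way.

    The loss is small when the reward model already scores the preferred
    response higher, and large when it ranks the pair the wrong way round.
    """
    margin = reward_preferred - reward_dispreferred
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scalar outputs of the reward head for responses A and B,
# where the human labeler said A is better than B.
loss_agrees = pairwise_preference_loss(2.0, -1.0)     # model agrees with human
loss_disagrees = pairwise_preference_loss(-1.0, 2.0)  # model disagrees
print(loss_agrees, loss_disagrees)
```

Minimizing this quantity over many labeled pairs pushes the two predicted rewards apart in the direction the human chose, which matches the lecture's description of assigning higher rewards to preferred responses.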

6029.1s The reason it's better to use a model than actual humans is because we can use it widely on all sorts of inputs and it can scale from a data standpoint. Also note that this method is better than SFT because it's way easier to ask humans what's your preference between those two things than to ask them to come

6030.7s than actual humans is because we can use

6032.7s it widely on all sorts of inputs and it

6035.7s can scale from a data standpoint.

6038.8s Also note that this method is better

6040.6s than SFT because it's way easier to ask

6043.5s humans what's your preference between

6044.9s those two things than to ask them to come

6046.4s up with answers to prompts. Takes way less time. And if you've used ChatGPT, you've probably been asked before to tell them which response you prefer. Yeah. So once trained the reward model replaces the human as the evaluator during reinforcement learning from human feedback and reinforcement learning from human feedback is very

6048.7s less time. And if you've used ChatGPT,

6051.4s you've probably been asked before to to

6053.2s to tell them which response you prefer.

6057.4s Yeah. So once trained the reward model

6060.1s replaces the human as the evaluator

6062.1s during reinforcement learning from

6064.6s human feedback and reinforcement

6066.5s learning from human feedback is very

6069.5s comfortable for you. Now I will show you what it looks like given the Q-learning algorithm we learned. But essentially we have first taught the model what good behavior looks like with SFT and then we built a reward model that can tell us how good an answer is according to human preferences and the

6071.4s show you what it looks like given the

6072.7s Q-learning algorithm we learned. But

6074.7s essentially we we have first taught the

6077.0s model what good behavior looks like with

6079.1s SFT and then we built a reward model

6081.8s that can tell us how good an answer is

6083.7s according to human preferences and the

6086.2s RLHF approach is where we will let this model practice, get scored by the reward model or the critic, and update itself to produce higher scoring answers. So more preferred answers and it's the same as the games we've seen together but some things differ. So I just pasted here the exact setup that we've learned together

6089.2s model practice get scored by the reward

6092.2s model or the critic and update itself to

6094.9s produce higher scoring answers. So more

6097.9s preferred answers and it's the same as

6101.6s the games we've seen together but some

6104.6s things differ. So I just pasted here the

6107.3s exact setup that we've learned together

6110.5s um for reinforcement learning. The differences are the following. You know our objective is still to maximize expected reward that is produced um by the reward model aligned with human preferences. The agent is the language model being fine-tuned. The environment is the space of possible prompts and continuations.

6113.1s differences are the following. You know

6115.4s our objective is still to maximize

6117.2s expected reward that is produced um by

6120.1s the reward model aligned with human

6121.9s preferences.

6127.4s The agent is the language model being fine-tuned.

6130.4s The environment is the space of possible

6132.6s prompts and continuations.

6135.6s It's any any text that you can encounter. The state is the specific prompt plus the tokens that were generated so far. The next state is one more token added and the action is the next token that is chosen by the agent or the model. which is of course determined by the

6138.1s encounter. The state is the specific

6141.6s prompt plus the tokens that were

6143.9s generated so far. The next state is one

6147.0s more token added and the action is the

6150.8s next token that is chosen by the agent

6152.6s or the model.

6154.0s which is of course determined by the

6156.2s policy, and then the reward is estimated by the reward model that we trained to represent human preferences. In this case one episode is one full prompt. So imagine that you get a prompt and you start generating and you go through this reinforcement learning loop

6157.7s and then the reward is

6161.0s estimated by the reward model that we

6163.7s trained to represent human preferences.
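
To make the mapping from text generation to states, actions, and rewards concrete, here is a toy episode loop. Everything in it is a stand-in chosen for illustration: toy_policy picks tokens at random instead of using a language model, toy_reward_model is a hard-coded rule instead of a trained network, and the sketch assumes the reward model scores the completion only once generation has finished.

```python
import random


def toy_policy(state, vocab, rng):
    """Stand-in for the language model: picks the next token at random."""
    return rng.choice(vocab)


def toy_reward_model(prompt, response_tokens):
    """Stand-in for a trained reward model: it simply likes the word 'check'."""
    return 1.0 if "check" in response_tokens else -1.0


def run_episode(prompt, vocab, max_tokens=5, seed=0):
    rng = random.Random(seed)
    state = prompt.split()   # state = prompt + tokens generated so far
    response = []
    rewards = []
    for _ in range(max_tokens):
        action = toy_policy(state, vocab, rng)   # action = next token
        response.append(action)
        state = state + [action]                 # next state = one more token
        done = action == "<eos>" or len(response) == max_tokens
        # Assumption: the reward model scores only the finished completion.
        rewards.append(toy_reward_model(prompt, response) if done else 0.0)
        if done:
            break
    return state, response, rewards


prompt = "My laptop won't turn on. What should I do?"
vocab = ["check", "the", "charger", "laptops", "<eos>"]
state, response, rewards = run_episode(prompt, vocab)
print(response, rewards)
```

One call to run_episode is one episode: it starts from a prompt and ends when the completion stops, at which point the single score from the reward stand-in is recorded.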

6171.0s in this case one episode is one full

6174.2s prompt. So imagine that you get a prompt

6177.4s and you start generating and you go

6179.0s through this reinforcement learning loop

6180.6s and you observe the rewards and then you try to maximize the future rewards and then at the end of training you end up with having your pre-trained model turn into an SFT and your SFT turn into a way better model using RLHF.

6182.2s try to maximize the future rewards and

6184.5s then at the end of training you end up

6186.2s with having your pre-trained model turn

6188.3s into an SFT and your SFT turn into a way

6191.0s better model using RLHF.

6201.8s this. The model does not get a reward at every single token. It gets a reward at the end of a sequence when the completion is finished because reward model was uh was asked to rate prompts and responses together. So you need to finish the generation in order to see what's the reward. And so again going

6204.9s every single token. It gets a reward at

6208.7s the end of a sequence when the

6210.5s completion is finished because reward

6212.9s model was uh was asked to rate prompts

6216.2s and responses together. So you need to

6218.3s finish the generation in order to see

6220.7s what's the reward. And so again going

6223.0s back to making good sequences of decisions, that's exactly it. You want the model to make enough good sequences of decisions so that the response is preferred by the critic which represents a proxy to the human preferences. So all intermediary rewards are typically zero and that makes it a very

6224.6s decision that's exactly it. You want the

6226.2s model to make enough good sequences of

6227.8s decision so that the response is

6230.6s preferred by the critic which

6232.3s represents a proxy to the human

6234.6s preferences.
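
Because the only nonzero reward arrives when the completion is finished, every earlier token choice has to be credited through that final score. A minimal sketch of that idea, assuming the usual discounted return from reinforcement learning (the discount factor gamma and the example numbers are illustrative, not taken from the lecture):

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed backwards."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Four generated tokens; only the last step carries the reward model's score.
rewards = [0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)  # [0.125, 0.25, 0.5, 1.0]
```

Even though the first three rewards are zero, each step ends up with a nonzero return, which is the signal an update rule can use to favor the token choices that led to a well-scored completion.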

6236.7s So all intermediary rewards are

6238.6s typically zero and that makes it a very

6241.5s sparse-reward episodic task, just like a game of chess where you only get a reward when you finish, assuming you're not defining intermediary rewards. So you only know if you did well at the end and you have to then use that information to update your network and get a better proxy for it. Super. There's a very nice

6246.1s game of chess where you only get a

6247.7s reward when you finish assuming you're

6249.2s not defining intermediary reward. So you

6251.9s only know if you did well at the end and

6254.4s you have to then use that information to

6257.0s update your network and get a better

6259.2s proxy for it. Super. There's a very nice

6262.2s video. We're not going to play it for the sake of time, but I will send it online. Uh it's from 4 days ago. It's uh you know former Stanford student Andrej Karpathy who is very thoughtful and articulate and was explaining 4 days ago why reinforcement learning can be terrible at times and that human minds work way more

6263.3s the sake of time, but I will send it

6264.8s online. Uh it's from 4 days ago. It's uh

6268.1s you know

6269.9s former Stanford

6271.8s student Andrej Karpathy who is very

6274.5s thoughtful and articulate and was

6276.6s explaining 4 days ago why reinforcement

6279.4s learning can be terrible at times and

6282.1s that human minds work way more

6284.6s efficiently. And so I would encourage you to watch this 4-minute video because he's very clearly outlining why reinforcement learning is still not great even if it's the best thing we can use in many ways.

6286.3s you to watch this 4-minute video because

6288.6s he's very clearly outlining why

6290.5s reinforcement learning is still not

6293.0s great even if it's the best thing we can

6294.6s use in many ways.
