Name: How Robots Can Backflip with RL (Sim-to-Real, Kinematic Retargetting, Isaac Lab vs Mujoco)
Uploaded: 2026-02-08T16:04:01.768
Duration: 3891 s
Description: Read the full transcript of "How Robots Can Backflip with RL (Sim-to-Real, Kinematic Retargetting, Isaac Lab vs Mujoco)" by Kevin Wood | Robotics & AI. Pract...

2.0s here, an expert in reinforcement learning. She's also the author behind this cool work that allows robots to do backflips. We're going to be covering a lot of cool and exciting topics like RL theory versus application, bridging the sim-to-real gap, simulation environments for RL, deploying RL on custom robots, position versus torque control,

21.4s resources for learning RL, and then in the second half we'll talk about kinematic retargeting applications, generalization in robotics, generalist versus specialist, data and training, and finally end it with traditional optimization versus RL. And if you're new here, my name is Kevin. I've been doing robotics and AI for 10 plus

39.6s years and have lots of resources on my channel. I also have a robotics builder membership where you get deep dive access to how I build my projects and also resources on my website to help you get started. Link down below. Can you give us a quick self-introduction? Hi everyone, I'm Lujie. I go by Lou. I'm

56.8s currently an applied scientist at the Frontier AI & Robotics lab at Amazon. I just obtained my PhD last October from MIT under the guidance of Professor Russ Tedrake. I have been working on optimization-based and learning-based methods for robotic manipulation and control. And right now I'm working on humanoid whole-body control.

79.6s Very nice. Very cool topic, because I know humanoid robots are probably one of the hottest topics right now. So it's very exciting to have a deeper dive into how some of the reinforcement learning techniques are actually applied in real life and to understand some of the existing challenges that people are

98.5s dealing with in industry. So the first topic we want to talk about: I know a lot of times when people are learning RL there's a lot of theory, but then when people try to apply that in an application, sometimes it can be quite different from theory. So, if you were to give

115.8s the audience a good overview of what that difference is, how would you describe that? Yeah. So, I think having a very solid background and foundation in RL theory is actually very rewarding and inspiring for RL practice. I think RL is more than just collecting rewards; it's also about how we balance the

138.9s exploration and exploitation that is actually embedded in the coefficients and the exploration formulation in the RL theory itself. So how we gain insights from RL theory for tuning the RL rewards or coefficients is, I think, a very interesting and rewarding topic.
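One common place where such an exploration coefficient shows up is the entropy bonus in a PPO-style policy loss. The following is a minimal sketch, not the speaker's code: the function names, the 1-D Gaussian policy, and the default values `clip_eps=0.2` and `entropy_coef=0.01` are illustrative assumptions.

```python
import math


def gaussian_entropy(std: float) -> float:
    """Differential entropy of a 1-D Gaussian action distribution."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def ppo_policy_loss(ratio: float, advantage: float, std: float,
                    clip_eps: float = 0.2, entropy_coef: float = 0.01) -> float:
    """Per-sample PPO-style policy loss (to be minimized).

    ratio        : new_policy_prob / old_policy_prob for the sampled action
    advantage    : advantage estimate for that action
    std          : standard deviation of the Gaussian policy (assumed 1-D)
    clip_eps     : clipping range, keeps the update close to the old policy
    entropy_coef : weight of the entropy bonus, i.e. the exploration knob
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Pessimistic (minimum) surrogate objective, as in clipped PPO.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # A larger entropy_coef rewards wider action distributions more strongly.
    return -surrogate - entropy_coef * gaussian_entropy(std)
```

With this form, raising `entropy_coef` makes a wide action distribution cheaper in the loss, which delays collapse to a single behavior; lowering it favors exploitation.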

163.5s Okay. So if someone is doing the tuning with all the coefficients, is there a specific example you can think of where they apply something in simulation, but then when they actually try to deploy it, it's very different?

184.5s Are you referring to the sim-to-real gap, or to what a systematic way is to tune the RL coefficients and rewards in sim? Yeah, I guess more so in terms of the theory, because a lot of times if someone were to

201.3s jump in and use a framework like Gymnasium, for example, they have everything set up where a lot of the function calls are kind of like a black box, right? They would maybe play with the reward structure, or they might change the exploration like you were saying. So, for example,

220.8s someone who's very used to the Gymnasium Python framework, that's what I would consider more of the application side. So where do you see that infrastructure start to break? For someone who has spent maybe a couple of years learning the theory, what edge does

242.2s someone who knows the theory have over someone who doesn't, if they're just playing with Gymnasium, for example? Yeah, that's a great question. During training we might see a phenomenon called mode collapse. So for example, if there is not enough

260.8s exploration from the agent, the agent will quickly collapse to a single behavior without exploring the environment further, and it can easily get stuck in a local minimum, which is not the ideal behavior we want. So in that case, one might have to increase the exploration coefficient

281.0s to encourage the agent to explore the environment before collapsing to a single mode and doing exploitation afterwards. Okay. So for someone who's just playing with the Gymnasium Python library, for example, is that something they would be able to figure out, or is it that if they had

304.6s no theory, they wouldn't be able to figure out something like that? I think if you do enough trial and error and change the coefficients here and there, you might be able to bump into a good coefficient tuning, or you can just do a grid search on all the important

323.2s coefficients and get a very good intuition afterwards about which are the most important coefficients or parameters you should change for RL training. But if we start from the fundamentals of RL training and RL theory itself, most of the time the very common algorithm

342.5s we're using for RL these days is PPO. There are only a small number of terms in PPO that are really important, that cause the training to stabilize or destabilize. So if we start from this formulation, from the fundamentals, we will be able to have

360.2s a big picture of which important parameters we should tune. So something like doing a grid search on all the coefficients: do you feel like that's more of an advanced technique that someone would learn maybe only through school, or is that something that someone who's more familiar with only the application part would also

379.9s know how to do? I think people who are very practical will also come up with this idea. Oh really? Okay. You mentioned PPO. I know that's one of the most common RL models that people are using right now for robotics. Would you say there's a specific reason why people are leaning towards

399.4s that versus another model, or how do they come to decide to use that one? Yeah. In locomotion, for example, people use PPO a lot because it's an on-policy algorithm that is very good at encouraging the training to stay within the distribution. So that means the

422.6s RL algorithm is actually experiencing the same state and action distribution as the agent itself as it explores the environment. Some other, more efficient RL algorithms, such as offline RL, essentially use a different strategy during the policy update than the one the agent is using to explore the

446.0s environment. That kind of algorithm is cheaper and can be more efficient, but because the policy update and the agent experience different state distributions, during deployment there will be a non-ideal effect called distribution shift, which can cause a big gap between deployment and training. So would you say, if you were

470.0s to give a percentage of users that are using PPO versus other models, how would you describe that distribution right now? I would say probably 90% of robot locomotion researchers are using PPO. Oh okay. We were talking about the sim-to-real gap. That's a very big

491.2s area that people are trying to tackle, and I know there are people who try to tackle that problem either by randomizing physical properties of their robots, or some people have zero-shot techniques, and there are also people who try to train on hardware, ignoring some

510.6s of the simulation. In your opinion, what is the best approach to sim-to-real? And maybe you could explain some of the tradeoffs between some of the methods I mentioned or any other methods that you're familiar with. Yeah, that's a great question. Sim-to-real is a very important procedure in our

530.9s entire robot deployment pipeline. So I would say one should start with a reasonably accurate system identification of the robot and then try to randomize a little bit around those nominal values in simulation during training. That's called domain randomization. Then deploy it on the robot and see if

553.8s there's any other mismatch between sim and real. So one can essentially do a loop where one identifies the system coefficients of the robot, randomizes them in sim, deploys on real, and collects the system rollouts to backprop to the sim parameters, modifying the sim

577.6s parameters to match the real behavior. That would be the most ideal case. But one can also train a base policy in simulation which performs reasonably well, and then, starting from that base policy, do real-world RL. I would say training RL from

597.4s scratch in the real world is very time consuming and can also do a lot of damage to the hardware, and the hardware can be expensive. So we want to minimize the computation time as well as the cost of the entire training and deployment loop. Yeah, that's a good point. On the

619.6s hardware note, I'm very curious: have you ever encountered any issues where your hardware randomly failed during your RL deployment and you couldn't figure it out, or maybe something that was hard for you to figure out? Yeah. Actually that's a very common issue when we're doing hardware

644.0s experiments. There can be cases where, as soon as we start our controller, the robot just starts waving its arms around wildly. There can be a couple of issues. First of all, it can be that the calibration of the IMU is not

662.1s accurate enough, so the robot doesn't have an accurate sense of where it is and what its angular and linear velocities are. There can also be a motor that is not producing enough torque. So for example, if we're doing very agile behavior

683.3s like climbing up a very high platform or jumping off a cliff or something like that, sometimes, although in simulation we are able to train a policy for the robot to do that, the motor curve is actually different in reality, and in the hardware experiments the motor on the robot will

705.9s not be able to produce as much torque as needed for such an agile behavior. So that might also cause some failures on hardware. So since motors typically have a rated torque and a peak torque, is that something you can cap accurately in simulation, or is it pretty hard to set those constraints?

729.8s Yeah, I think it's pretty hard. If we're just using the default parameters from the robot seller, I would say they're often not that accurate. So if one wants to do a very accurate calibration, I think one has to run a lot of experiments with the current and

750.3s torque and speed of the motor to record its behavior. And there's another caveat, which is that as we do more experiments the robot itself essentially wears out. So that relationship between motor current, speed, and torque will actually change over time, which is even harder to model in simulation. So I think the best we

775.9s can do is try to add more safeguards in simulation: penalize the torque limits a little bit more in simulation so that when the policy transfers to real it won't hit its actual limit. I see. So in theory, let's say someone were to

796.0s be very conservative and, say, assume the peak torque is 50% of the manufacturer's rated torque. Do you think they would still face any of these challenges, or would that be pretty safe? I think in that case it's more promising to transfer to real without hitting any hardware issues. But that

818.6s of course limits the range of motion the robot can do. If we wanted to do a backflip or jump very high, I think it would be more challenging. Yeah, that's definitely a good point, because you're pretty much over-constraining your robot. Mhm. Exactly.
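The idea discussed above, penalizing torques in simulation before they reach the real limit, could look roughly like the sketch below. It is an illustration under assumptions: the function name, the quadratic penalty shape, and the `soft_fraction` and `weight` values are hypothetical and not taken from the conversation.

```python
def torque_limit_penalty(torques, rated_limits, soft_fraction=0.8, weight=1.0):
    """Reward term that penalizes torques above a soft limit.

    torques       : commanded joint torques for one simulation step (N*m)
    rated_limits  : per-joint torque limits from the datasheet (N*m)
    soft_fraction : fraction of the rated limit where the penalty starts
                    (hypothetical safety margin)
    weight        : scale of the penalty inside the total reward
    """
    penalty = 0.0
    for tau, limit in zip(torques, rated_limits):
        soft_limit = soft_fraction * limit
        # Only the part of the torque beyond the soft limit is penalized.
        excess = max(0.0, abs(tau) - soft_limit)
        penalty += excess ** 2
    return -weight * penalty
```

A smaller `soft_fraction` gives a wider safety margin but, as noted in the discussion, also restricts how agile the learned motion can be.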

837.4s So far, I know a lot of your work has been on deploying on the Unitree robots. So if someone were to try to take these techniques and deploy on their own custom robot, what do you think would be some techniques or strategies they would have

854.3s to use or understand to do something like that? Yeah, that's a great question. Starting from the fundamental level, one has to do a very careful sys ID to measure, for example, the inertia and mass of the entire robot as well as the motors, and do a careful calibration of all the sensors, including

878.4s the IMU, the encoders, and, if one has depth cameras, the cameras as well. Then one has to have a very robust low-level controller which translates the higher-level RL policy into lower-level, higher-frequency torque control, and that has to be very reliable in terms of both the magnitude and the frequency at which

902.4s the torque command is sent to the motor. And then one has to build a reasonable enough model in simulation, so that one can start an RL training session with a good model and, later, when people want to transfer it to real, there's a smaller sim-to-real gap.

924.9s So for the sys ID that you talked about, are there generally common methods or open-source methods that people like to leverage, or do they try to make their own? I think it can be case dependent, because for the mass you can just take a scale or something like that to

944.4s measure it. I think the most important piece of the sys ID is actually the armature, which is the motor's rotor rotational inertia. That is actually very important. We found it to be one of the most important pieces for our sim-to-real pipeline. And for the others, I think there are standardized

966.2s procedures online one can resort to, but I think each robot has its own advantages and disadvantages, and one should be careful about characterizing whether it's the motor curve that is really important or the inertia that is causing a lot of trouble for the sim-to-real deployment.
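To see why the armature mentioned above can matter so much, recall that a rotor's inertia seen at the joint scales with the square of the gear ratio. The sketch below uses that standard relation; the function names and the example numbers are hypothetical, and this is not a description of the speaker's sys ID procedure.

```python
def reflected_armature(rotor_inertia: float, gear_ratio: float) -> float:
    """Rotor inertia as seen at the joint output (kg*m^2).

    A rotor spinning gear_ratio times faster than the joint contributes
    rotor_inertia * gear_ratio**2 to the joint-side inertia.
    """
    return rotor_inertia * gear_ratio ** 2


def effective_joint_inertia(link_inertia: float, rotor_inertia: float,
                            gear_ratio: float) -> float:
    """Total inertia the joint torque has to accelerate (kg*m^2)."""
    return link_inertia + reflected_armature(rotor_inertia, gear_ratio)


# Hypothetical numbers: a small rotor behind a 20:1 gearbox.
armature = reflected_armature(rotor_inertia=2e-5, gear_ratio=20.0)
total = effective_joint_inertia(link_inertia=0.01, rotor_inertia=2e-5,
                                gear_ratio=20.0)
```

In this made-up case the reflected rotor inertia is comparable to the link inertia itself, so leaving it out of the simulation model would noticeably change the joint dynamics. Simulators such as MuJoCo accept this quantity as a per-joint `armature` value.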

988.4s So in terms of the accuracy of your calibration, how accurate do you think one has to be to have good enough results when deploying RL models? Yeah, I think it's again case by case. I would have to say as accurate as

1010.8s one can get, so that during the training phase, when doing domain randomization, you can randomize only in a very small range. And a very important thing is that during your calibration or sys ID you do multiple rounds and take the average, and also know the standard deviation

1033.0s of your value so that you can use the mean and the standard deviation to characterize the range for your domain characterize the range for your domain randomization. Uh you mentioned uh frequency earlier. Can you dive into a little bit more details about what you were saying for the frequency part?

1036.0s mean and the standard deviation to

1037.8s characterize the range for your domain

1039.9s characterize the range for your domain randomization.
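
As a rough illustration of the idea just described (repeat the identification, then use the mean and standard deviation to set the randomization range), here is a minimal Python sketch. The measurement values, the parameter chosen (a joint armature), and the `k = 2` width are all assumptions made for the example, not numbers from the interview.

```python
import random
import statistics

def randomization_range(measurements, k=2.0):
    """Return (low, high) = mean -/+ k * sample std of repeated sys-ID results."""
    mean = statistics.mean(measurements)
    std = statistics.stdev(measurements)  # sample standard deviation
    return mean - k * std, mean + k * std

# Hypothetical results from five identification rounds of one joint's
# armature (kg*m^2); real values would come from your own calibration.
armature_runs = [0.0102, 0.0098, 0.0105, 0.0100, 0.0095]
low, high = randomization_range(armature_runs)

def sample_parameter(rng):
    """Draw one value for a training episode, uniformly inside the range."""
    return rng.uniform(low, high)

rng = random.Random(0)
print(f"randomize armature in [{low:.5f}, {high:.5f}]")
print("example draw:", sample_parameter(rng))
```

A tighter spread across rounds gives a narrower range, which is the point made above: better calibration lets you randomize less.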

1041.3s Uh you mentioned uh frequency earlier.

1043.9s Can you dive into a little bit more

1045.6s details about what you were saying for

1047.3s the frequency part?

1049.9s Sure. So the higher-level RL policy is running at 50 Hz and the lower-level torque command is running at 500 Hz. So the robot itself has to consume a relatively higher frequency of torque commands, while on the higher level the RL policy is spitting

1054.2s running at 50 Hz and the lower level uh

1058.1s torque command is running at 500 hertz.

1061.7s So the the robot itself has to consume

1065.8s like a relatively higher

1067.7s frequency of torque command while on the

1070.6s higher level the RL training is spitting

1073.2s out 50 Hz PD targets for the SDK to convert into motor commands. So how is that typically determined? These values, 50 and 500, is that something that was figured out through experiment or through theory? How did you guys come to this conclusion?

1077.8s the motor to for the SDK to be converted

1082.5s into the motor commands. So how how is

1085.2s that typically determined like these

1087.5s values 50 500 is that something uh that

1091.2s was figured out through experiment or

1093.4s was it through theory? How how did you

1095.4s guys come to this conclusion?

1098.6s I think it's most of the time by convention. So for example, the Unitree SDK would support something like a 500 Hz torque command, and for the RL policy I think it's mostly trial and error and what other people have been using and have been

1102.1s convention. So for example, Unitree SDK

1106.1s would uh support something like 500 uh

1109.4s uh 500 Hz torque command and for the RL

1114.8s policy I think it's uh mostly trial and

1117.3s error and uh what other people have been

1120.0s using and have been work uh have been

1122.2s working reasonably well for them. We will first try a value that has already been verified to work stably. So what behavior would you typically see if you went a little bit too high? Say your RL was running at maybe 100 or even 200. What sort of behavior might one see

1125.5s will first try the value that is already

1129.3s verified to be working stably. So what

1132.2s what behavior would you typically see if

1135.0s you went a little bit too high? Say your

1138.2s RL was running at maybe 100 or even like

1141.2s 200. What sort of behavior might one see

1144.2s if they were running too high? And what if they went too low, like maybe 10 Hz? Mhm. So I think there's a trade-off between the compute time and the RL policy frequency. Higher is probably better in the sense that we can react to the environment more

1146.6s if they went too low like maybe say like

1149.3s 10 hertz?

1151.5s 10 hertz?

1152.2s Mhm. Uh so I think there's a trade-off

1155.0s between the compute time and uh the RL

1159.4s like policy frequency. So the higher is

1163.7s probably better in some sense that we

1166.3s can react to the uh environment more

1169.9s quickly. But of course, if the policy is running at a higher frequency, then it consumes more memory and power on the GPU, and sometimes, if we're running too high, we will run into a latency problem where the policy command cannot actually be sent in real

1174.2s the policy is running at a higher

1176.1s frequency then it consumes more uh

1179.2s memory and power on the GPU and

1181.6s sometimes there will if it's we're

1183.7s running too high we will run into

1185.4s latency problem that the the policy

1189.0s command cannot actually be sent in real

1191.8s time to the lower-level motor command. If we're running at a very low frequency, then there's the problem that we're not reacting to the environment fast enough. If the robot is falling down and it cannot send the policy command fast enough, it might not be able to recover immediately.

1194.7s command. uh if we're running it on a

1197.8s very low frequency then there's the

1200.6s problem that we're not reacting to the

1202.6s environment fast enough. If the robot is

1204.9s falling down and it cannot send the

1207.4s policy command fast enough it might not

1209.4s be able to recover uh immediately.
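
The compute-versus-reactivity trade-off above comes down to simple arithmetic: the policy must finish inference within one control period. The sketch below checks a measured worst-case latency against the period at several policy rates. The 4 ms latency and the 50% safety margin are made-up values for illustration only.

```python
def period_ms(policy_hz):
    """Time available per policy step, in milliseconds."""
    return 1000.0 / policy_hz

def fits_budget(worst_latency_ms, policy_hz, margin=0.5):
    """True if worst-case inference latency uses at most `margin` of the period."""
    return worst_latency_ms <= margin * period_ms(policy_hz)

# Hypothetical worst-case inference plus I/O time measured on the robot's computer.
WORST_LATENCY_MS = 4.0

for hz in (10, 50, 100, 200):
    ok = fits_budget(WORST_LATENCY_MS, hz)
    print(f"{hz:>3} Hz: period {period_ms(hz):6.1f} ms, fits budget: {ok}")
```

A rate that fails this check is where the latency problem described above would show up; a rate that passes with a lot of slack may be slower than the robot needs to react.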

1212.4s How about in terms of, so you mentioned a lot about reaction time. How about in terms of the model or robot stability? Have you seen any trends in the stability, whether it's at higher or lower frequency? Yeah, that's a great question. Our intuition is that the lower the

1214.6s mentioned a lot about like reaction

1216.1s time. How about in terms of like the

1219.0s model or robot uh stability? Have you

1222.4s seen any trends in the stability whether

1225.8s it's higher or lower uh frequency?

1229.4s Uh yeah that's a great question. Uh our

1232.0s intuition is that the lower the

1234.3s frequency, the more stable the training is, because we are actually exploring on a smoother manifold, whereas at a higher frequency one might be sampling around a smooth curve in a noisy way. So the command is actually pretty zigzag and noisy. So that is actually a

1236.2s is because we are actually uh exploring

1240.0s on a uh smoother manifold while versus

1244.1s if it's a higher frequency uh one might

1247.7s be like sampling around a smooth curve

1250.9s uh in a noisy way. So the command is

1253.8s actually like pretty uh zigzag and

1256.0s noisy. So that is a an actually like a

1260.3s harder exploration problem in some sense to actually stabilize your robot. So I think finding the balance between smoothness and reactivity is what drives us to 50 Hz here. I see. How about in terms of, so at 50 Hz, you know, your desired points are kind of spaced apart. So in practice, do

1262.4s to actually stabilize your robot. So I

1265.8s think finding the balance between the

1268.3s smoothness and reactivity is what we uh

1271.6s like what drives us here for 50 Hz.

1274.3s I see. How about in terms of so like at

1277.7s 50 Hz, you know, your desired points are

1281.9s kind of spaced apart. So in practice, do

1285.8s you think it's good enough to send those points at 50 Hz, or do you need something in between your desired commands that goes to your low level and does some interpolation? Is there any interpolation that needs to happen, or is that happening in the lower level

1288.4s points at 50 Hz or do you need something

1292.1s that's in between your desired commands

1295.1s that goes to your low level that does

1297.0s any interpolation? Do you think is there

1300.0s any interpolation that needs to happen

1301.6s or is that happening in the lower level

1303.8s controllers? Yeah, so that interpolation is actually on the lower level. So we're essentially setting PD position targets with the higher-level RL policy, which is implicitly doing torque control through this PD target: we interpolate and translate the position command into a torque command using the PD

1305.8s Yeah, so that interpolation is actually

1307.8s on the lower level. So we're essentially

1310.7s uh setting PD position target for the

1314.4s higher level RL policy which is actually

1318.5s uh implicitly doing torque control using

1321.0s this PD target uh that we interpolate

1324.7s and translate the position command into

1329.0s torque command using the PD

1331.6s relationship. So their lower-level controller, would you say, is like a position P? So you're taking position as input and then the position P converts it to torque. Do they have a cascaded-type controller where there's position, velocity, current, or is it just position and current? What's your

1332.7s So their lower level controller would

1334.7s you say is like a position P. So you're

1338.0s taking a input as position and then the

1340.6s position P converts it to torque. Do

1343.0s they have like a cascaded type

1346.6s controller where there's a position

1348.2s velocity current or is it just position

1351.7s current? Is that some what's your

1354.5s understanding of their current setup? So, for example, the Unitree SDK is kind of a black box to us. Our best understanding is that they use the PD relationship to translate the position target into a torque command.

1357.8s understanding of their current setup?

1357.9s Um so it's so the for example the Unitree

1361.6s SDK is kind of like a

1365.2s black box to us. So our best

1368.5s understanding is that they are trying to

1370.9s use the PD relationship to translate the

1374.8s position uh target into torque command.
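
To make the two rates concrete, here is a toy Python sketch of a 50 Hz policy producing position targets that a 500 Hz PD loop turns into torque, using the usual relationship tau = kp * (q_target - q) - kd * qd. Everything here is an assumption for illustration: the single-joint model, the gains, the inertia, and the linear interpolation between targets. The SDK's internals are described above as a black box, so this is not a description of what it actually does.

```python
POLICY_HZ = 50
CONTROL_HZ = 500
DECIMATION = CONTROL_HZ // POLICY_HZ  # 10 low-level ticks per policy step
DT = 1.0 / CONTROL_HZ

def pd_torque(q_target, q, qd, kp, kd):
    """PD relationship: position error and velocity damping mapped to torque."""
    return kp * (q_target - q) - kd * qd

def run(policy, policy_steps, kp=20.0, kd=1.0, inertia=0.05):
    """Simulate one joint; `policy(q, qd)` returns a position target at 50 Hz."""
    q, qd = 0.0, 0.0
    prev_target = q
    for _ in range(policy_steps):
        target = policy(q, qd)  # one 50 Hz policy output
        for i in range(DECIMATION):
            # Assumed linear interpolation between consecutive policy targets.
            alpha = (i + 1) / DECIMATION
            q_des = prev_target + alpha * (target - prev_target)
            tau = pd_torque(q_des, q, qd, kp, kd)
            qd += (tau / inertia) * DT  # semi-implicit Euler integration
            q += qd * DT
        prev_target = target
    return q

# A constant target of 0.5 rad for 2 s of simulated time.
print("final joint angle:", run(lambda q, qd: 0.5, policy_steps=100))
```

Small gains like these make the joint respond gently to each new target, which is the property credited later in the conversation with shrinking the sim-to-real gap.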

1378.5s But sometimes there is weird behavior on the robot, so there might be something in the black box that is not performing as we expected, and that might also create some gap. So most of the experiments you have done were in position control. Is that right? Have you guys played with torque control

1381.1s on the robot. So that might be something

1383.8s that uh in the blackbox is not per uh

1387.1s performing as we expected. So that that

1390.1s might also create some gap. So most of

1393.0s the experiments you have done was in uh

1395.9s position control. Is that right?

1399.0s Have you guys played with torque control

1400.9s directly from your model? Not really. The reason is that torque control is much less forgiving, and it should be sent at a very high frequency. So if there are any imperfections in the command, they will actually be amplified by the torque command, whereas if it's just a PD

1403.8s Uh not really. The reason is torque

1406.7s control is uh much less forgiving and it

1410.6s it should be sent at a very high

1412.7s frequency. So if there is any like non

1417.7s like any imperfections uh in the command

1421.1s it will be actually amplified by torque

1423.5s uh torque command versus if it's just PD

1426.4s target, we do an interpolation at a slower frequency, so they will not be amplified as much as with higher-frequency torque control. So, I'm just curious, because when you have a robot arm that you're tuning, let's say your arm is in

1429.8s um at a slower frequency that will be

1432.7s that will not be uh amplified as much as

1435.2s the higher frequency torque control. So,

1437.8s I'm just curious because like when you

1439.4s have a robot arm that you're tuning, if

1441.8s it's like let's say if your arm is in

1444.4s the vertical position and swinging back and forth, versus the arm being horizontal and swinging up and down: the gains tuned for the different positions would be very different because of the load it's seeing. So if you're

1447.4s swinging back and forth versus if the

1449.8s arm is like horizontal and swing up and

1452.6s down, the range of the robot it's in

1456.0s would be very the gains the gains that

1458.9s were tuned for the different position

1461.1s would be very different because of the

1463.1s load it's seeing. So if you're

1465.4s controlling it in position, and I'm assuming you have the same gains for the different positions, how does your robot still behave with a similar response in the different positions? Yeah, that's a great question. So we are indeed using the same gains, but essentially very small PD gains for

1469.0s assuming you have the same gains for a

1471.3s different position, how does your robot

1474.4s usually still behave with similar

1477.9s response in the different positions?

1481.8s Yeah, that's a great question. So we are

1484.7s indeed using the same gain but uh

1487.8s essentially like very small PD gains for

1491.9s all the motors. That essentially decreases the sim-to-real gap because it gives a more gentle response to the commands we send. And I think for all of our experiments, even including the wall flip and more agile behaviors, these gains work perfectly fine for all the motions.

1494.6s essentially uh like decreases the sim to

1497.6s real gap because it's like a more gentle

1501.6s uh response to our command send. uh and

1504.8s I think for of our experiments even

1507.7s including the wall flip and more agile

1510.2s behavior these gains work perfectly fine

1513.0s for all the motions.

1514.6s So do you think the RL model can somehow help compensate for some of these differences? Do you think that's what's happening, or what's your thought on what's happening? I think it's both that the hardware is getting better, so that even with a small gain they can

1517.9s help compensate some of these

1520.8s differences is do you think that's

1523.0s what's happening because or what's

1525.3s what's your thought on what's happening?

1528.4s Um I think both the hardware is getting

1531.2s better that there if we have like a

1533.6s small gain that they can uh uh they can

1537.4s do reasonably well at sending the right command, and that during our training we're randomly pushing the robots for domain randomization. So even if the gains are not reflecting the actual torque on hardware, when we're randomly

1539.6s command as well as we during the our

1542.5s training we're like randomly pushing the

1544.6s robots uh for domain randomization. So

1548.1s uh so even if uh the gains are like uh

1552.9s not the are not reflecting the actual

1556.5s not the are not reflecting the actual torque

1557.5s on hardware when we're do randomly

1560.2s pushing the robots in simulation, the robot actually experiences a little bit of that variation in sim as well. So it has been trained to see different kinds of motor perturbations during simulation. So in your setup, when you're doing the simulations, what specific tool

1562.3s actually in simulation the robot

1564.2s actually experiences that a little bit

1566.6s that variation in sim as well. So it has

1569.8s been trained to see like sort of

1572.5s different motor uh perturbations uh

1576.1s during simulation.

1577.5s So in your setup when you're doing the

1580.2s simulations um what what specific tool

1583.2s sets or frameworks were you using to do your simulation? Yeah, so we are using Isaac Lab as the training framework, which runs on Isaac Sim as the lower-level simulation engine. Is there a reason why you guys went with Isaac Sim and Isaac Lab, or was it a choice that the team had already made?

1586.0s your simulation?

1588.2s Uh yeah so we are using Isaac lab as the

1590.9s training framework uh which is running

1593.0s on Isaac Sim for the uh lower level

1596.3s simulation engine. Is there a reason why

1598.5s you guys went with Isaac sim Isaac lab

1600.8s or was it like a choice that the team

1602.8s has made already?

1604.6s Yeah, so I think it's both, because Isaac Lab is highly parallelizable, it has support for distributed training and so on, and it has very good rendering. So later, if we want to move on to vision-based RL or locomotion and whole-body control, it will have relatively good

1607.8s Isaac Lab is highly parallelizable and it

1611.5s has support for distributed training and

1614.1s so on and it has very good rendering. So

1618.4s later if we want to move on to vision

1621.4s based RL or uh locomotion uh whole body

1625.6s control it will have uh relatively good

1629.0s support for rendering for vision as well. How about MuJoCo? Is that something you have played with, or any thoughts on that as a simulator? Yeah, so MuJoCo has higher fidelity for the simulation models, but I think it's relatively slower

1632.4s support for rendering for vision as well.

1633.0s How about how about with uh MuJoCo? Is

1635.3s that something you have played with or

1637.8s um any thoughts on that as a simulator?

1641.6s Uh yeah, so MuJoCo is uh higher

1645.7s in fidelity for the simulation models. Uh

1649.9s but I think it's relatively uh slower

1653.8s and it doesn't have as good rendering support as Isaac, but I want to note that we're using MuJoCo for sim-to-sim validation. Oh. Which is to say that we are training the RL policies in Isaac, and before deploying them directly on the real

1657.2s the rendering uh support as Isaac

1660.4s but uh I want to note that we're using

1663.0s MuJoCo as uh sim-to-sim validation. Oh

1666.6s uh which is saying that

1668.4s Yeah. Oh, so which is uh saying that we

1670.7s are training the RL policies in Isaac and

1675.0s before deploying it directly on the real

1677.7s robot, we actually run that policy in MuJoCo to test whether the policy performs well enough under the higher-fidelity dynamics, and if it's robust enough in MuJoCo, we then deploy it onto the real hardware. So can you go into detail about what you mean by higher fidelity? Is it like

1680.5s MuJoCo to test if it the dynamics per

1683.6s like the policy with the higher fidelity

1686.2s dynamics perform well enough in MuJoCo

1689.1s and if that's robust enough in MuJoCo then

1692.0s deploy it onto the real hardware.
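
The sim-to-sim step can be pictured as a gate: roll the same policy out in a second simulator and only move to hardware if it still performs well there. The Python sketch below uses two toy one-joint "simulators" with different mass and damping as stand-ins for the training and validation simulators. The dynamics, the PD-style stand-in policy, and the error threshold are all hypothetical and have nothing to do with the real Isaac or MuJoCo APIs.

```python
def make_sim(mass, damping, dt=0.002):
    """Toy 1-DoF simulator: returns a step function (state, force) -> state."""
    def step(state, force):
        q, qd = state
        qdd = (force - damping * qd) / mass
        qd = qd + qdd * dt  # semi-implicit Euler integration
        return (q + qd * dt, qd)
    return step

def mean_tracking_error(policy, step, target=1.0, steps=500):
    """Average absolute position error while the policy tracks `target`."""
    state, total = (0.0, 0.0), 0.0
    for _ in range(steps):
        state = step(state, policy(state, target))
        total += abs(target - state[0])
    return total / steps

def pd_policy(state, target):
    """Stand-in for a trained policy (a fixed PD law, for illustration)."""
    q, qd = state
    return 40.0 * (target - q) - 8.0 * qd

training_sim = make_sim(mass=1.0, damping=0.5)    # stand-in for the training simulator
validation_sim = make_sim(mass=1.2, damping=0.8)  # stand-in for the second simulator

def ready_for_hardware(policy, max_error=0.3):
    """Gate deployment on performance in the validation simulator."""
    train_err = mean_tracking_error(policy, training_sim)
    valid_err = mean_tracking_error(policy, validation_sim)
    return valid_err <= max_error, train_err, valid_err

ok, train_err, valid_err = ready_for_hardware(pd_policy)
print(f"train error {train_err:.3f}, validation error {valid_err:.3f}, deploy: {ok}")
```

Reporting both errors side by side also gives a rough read on the dynamics gap between the two simulators, which is discussed a little later in the conversation.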

1694.7s So uh can you go in detail about what

1697.1s you mean by higher fidelity? Is it like

1699.0s better physics calculation, or what do you mean by that? Yeah, so MuJoCo supposedly has more accurate contact modeling. There are different kinds of simulators with different trade-offs, mostly the trade-off between simulation accuracy and computation speed. So I would

1701.0s do you mean by that?

1702.9s Yeah, so MuJoCo supposedly have it has

1706.3s higher like has a more accurate contact

1709.7s modeling. So um there are different uh

1712.4s kinds of simulators which are have

1714.6s different trade-off like um mostly the

1717.4s trade-off between uh simulation accuracy

1720.6s versus the computation speed. So I would

1724.1s say Isaac is on the higher-throughput end of the spectrum, while MuJoCo is sort of in the middle, where it has high enough fidelity and a reasonable parallelization framework. Yes. So on the other end of the spectrum, I would say Drake is probably

1728.7s uh end of the spectrum while MuJoCo is

1733.2s sort of in the middle where it h has

1736.2s high enough fidelity uh and a reasonably

1740.3s a reasonable parallelization uh

1742.6s a reasonable parallelization uh framework.

1743.8s Yes. So uh on the other end end of the

1746.2s spectrum I would say Drake is probably

1748.6s one of the most accurate simulators, where it is actually solving optimization problems at each time step to simulate the contact dynamics, but it's not GPU-supported and it's hard to parallelize. So there is a spectrum of different simulators that have different trade-offs between computation time and simulation

1752.5s where uh it is actually solving

1754.9s optimization problems at each time step

1757.2s to simulate the contact dynamics but uh

1760.8s it's not GPU uh supported and uh it's

1765.4s hard to parallelize. So there is a a

1769.0s spectrum of different simulators which

1771.1s have different trade-offs um between

1773.8s computation time and simulation

1775.9s fidelity. So when you go from Isaac to MuJoCo, for example, have you had specific experiences where you did that sim-to-sim validation and were able to go back and update your model somehow based on how it performed? Yeah, so I would say, because we are

1779.5s MuJoCo for example,

1781.9s um have you had specific experiences

1786.0s where when you did do that sim-to-sim

1788.7s validation where you were able to go

1791.6s back and update your model somehow based

1793.8s on how it performed?

1797.2s Um yeah so I would say uh because we are

1801.8s doing sys ID well enough and we randomize reasonably during training in Isaac Sim, the dynamics gap between Isaac and MuJoCo in our current pipeline is not that huge. Okay. But the sim-to-sim pipeline also helps us debug some other problems.

1806.0s randomize uh reasonably in uh training

1810.2s in Isaac SIM the dynamics gap between

1813.5s Isaac and MuJoCo in our current

1816.2s pipeline is not that huge.

1817.8s pipeline is not that huge. Okay.

1818.2s But the SIM to SIM pipeline also helps

1821.5s uh us debug debug some other problems.

1824.0s So for example, what should the camera latency be if we're adding vision into the loop? Since we're doing sys ID carefully enough, for locomotion there is not too much of a dynamics gap between Isaac and MuJoCo. So most of the time the dynamics

1827.0s camera latency if we're adding the

1829.1s vision um into the loop? Since we're

1832.2s doing sys ID carefully enough and for uh

1836.4s locomotion there is not too much a gap

1838.9s of dynamics for uh between Isaac and

1842.6s MuJoCo. So most of the time the dynamics

1845.4s gap is not that big when we do the sim-to-sim validation; what shows up instead are some edge cases that we were not able to detect in Isaac. So for example, the vision latency in MuJoCo can be more realistic than in Isaac, so that we can detect what the

1848.2s sim-to-sim validation but rather some edge

1851.7s cases that we were not able to detect in

1854.3s Isaac. So for example like what if the

1858.0s vision latency uh in MuJoCo is

1861.8s uh can be more realistic than Isaac so

1864.8s that uh we can detect oh what is the

1867.8s right vision latency is that we should add in the training process to actually compensate for this discrepancy. So you're saying MuJoCo has longer latency? Is that what you mean by more accurate? So in training, when we render the vision in Isaac, the simulation will actually pause to let the

1870.4s the training process to actually

1872.1s compensate for this discrepancy.

1873.8s So you're saying like MuJoCo has uh

1876.9s longer latency. Is that what you mean by

1879.0s more accurate?

1881.0s Um so uh in training when we render the

1885.0s vision in Isaac the simulation will

1888.6s actually pause to uh let the simulation

1892.7s simulator render the vision, but during the deployment in MuJoCo, when we're actually running the policy, it will not wait for the simulator to render the image, so there will be some sort of latency in that regard. Okay. So is it possible to create a fake latency in Isaac to simulate that

1895.7s during the deployment in MuJoCo when

1898.6s we're actually running the policy it

1901.0s will actually not wait for the simulator

1903.2s to render the image so there will be

1905.5s some sort of latency in that regard.

1907.9s Okay. So is it possible to create a fake

1912.1s latency in Isaac to simulate that

1914.3s behavior, or is it pretty hard to? Yeah. So what we do is create a buffer of the sensor readings, and we do a kind of sys ID on real hardware to see what the latency is for each sensor, and then we choose the reading in the buffer that is around that time

1917.5s Yeah. So what we do is that uh we create

1921.2s a buffer of the sensor uh readings and

1926.6s it we we do kind of a sys ID on real to

1930.3s see what the latency is for like each

1933.4s sensor and then we chose the the reading

1938.2s in the buffer which is around that time

1940.8s range. So if you're able to do that, then technically would you still have to do this MuJoCo verification if you can create a very realistic latency in Isaac, or do you think that step would still be necessary? Yeah, so I think another advantage of MuJoCo, in addition to

1943.8s then technically would you still have to

1946.7s do this MuJoCo verification if you're

1949.1s able to create a very realistic latency

1952.7s in Isaac or do you think that step would

1955.4s still be necessary?
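
The sensor-latency buffer described a moment ago can be sketched as a fixed-length queue: every simulation step pushes the newest reading, and the policy is fed the reading from roughly one measured latency ago. This is a minimal Python sketch under assumed numbers (60 ms camera latency, 20 ms step). The class name and interface are hypothetical, not taken from any framework.

```python
from collections import deque

class DelayedSensor:
    """Returns readings delayed by a fixed, separately measured latency."""

    def __init__(self, latency_s, sim_dt, initial):
        # Number of simulation steps that best matches the measured latency.
        self.delay_steps = round(latency_s / sim_dt)
        size = self.delay_steps + 1
        # Pre-fill so the first few reads return the initial value.
        self.buffer = deque([initial] * size, maxlen=size)

    def push(self, reading):
        """Store the newest reading and return the one from `delay_steps` ago."""
        self.buffer.append(reading)  # the oldest entry is dropped automatically
        return self.buffer[0]

# Hypothetical numbers: 60 ms latency identified on hardware, 20 ms policy step.
camera = DelayedSensor(latency_s=0.06, sim_dt=0.02, initial=-1)
delayed = [camera.push(frame_id) for frame_id in range(6)]
print(delayed)
```

In training, the latency could also be drawn from a small range around the measured value rather than fixed, in the same spirit as the domain randomization discussed earlier.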

1957.7s Uh yeah so I think uh another advantage

1961.4s of lat uh of MuJoCo in addition to like

1964.7s verifying that latency, is running the deployment code in simulation first. It might actually be kind of complicated to run the deployment code in Isaac directly, but MuJoCo has a much more direct interface for us to deploy our inference code. So in addition to testing the

1968.2s deployment code in simulation first. So

1971.0s I it's actually might be kind of uh

1975.0s complicated to uh run the deployment

1977.2s code in Isaac directly but uh in MuJoCo

1980.2s it's a much more direct interface

1983.5s for us to uh deploy our uh inference

1986.6s code. So in addition to testing the

1989.4s vision discrepancy and the dynamics gap, there's also the layer of testing our deployment code in simulation. What's the main challenge with running inference in Isaac? I think there are just a lot of abstraction layers in Isaac, and it's a less direct interface than MuJoCo

1992.0s will there's also the layer of we are

1995.3s testing our deployment code in

1997.2s simulation. What what's the main

1998.9s challenge with um running inference in

2001.4s challenge with um running inference in Isaac?

2002.9s I think there is just a lot of

2005.0s abstraction layer in Isaac. Um, and it's

2009.0s less direct of an interface than MuJoCo

2012.2s for us to deploy our inference code. Because in MuJoCo they let you send direct position or torque commands just based on how you set up your XML file, right? So it's a pretty straightforward way to command it. Right. That's kind of what you mean.

2018.0s inference code

2019.4s because in MuJoCo they let you like send

2021.9s direct position or torque commands just

2025.4s based on you know how you set up your

2027.6s XML file, right? So it's like a pretty

2030.5s straightforward way to command it.

2033.9s Right. That's kind of what you mean.

2036.4s Mhm. Okay. Cool. So in general, if someone is trying to learn more about RL, whether it's the application side or the theory side, what would you say is a good starting point for someone to get into the

2037.2s Mhm. Okay.

2038.8s Okay. Cool. Um, so in general, if

2041.1s someone is like trying to learn more

2043.8s about RL, whether it's like the

2046.5s application side or theory side, what

2049.0s would you say is a good starting point

2051.1s for someone to kind of get into the

2053.5s topic? So for the theory side, Richard Sutton's Reinforcement Learning: An Introduction is one of the primer books. And for people coming from a controls perspective, for example, Dimitri Bertsekas's Dynamic Programming and Optimal Control would be a very

2056.2s So for the theory side, like Richard

2059.0s Sutton's introduction to reinforcement

2061.4s learning is uh one of the primer book.

2064.4s uh and for for from for example from

2067.0s people from a controls perspective a

2069.4s Dimitri Bertsekas's uh Dynamic Programming

2072.2s and optimal control would be a very like

2075.8s control-perspective way to explain the RL concepts, and there are also some Berkeley courses taught by Professor Sergey Levine on reinforcement learning and deep learning that one can also watch online. So I think for the theory side there are a lot of resources, whether they

2079.0s the RL concept and there are also some

2081.8s like Berkeley courses taught by

2083.6s professor Sergey Levine uh on uh

2086.2s reinforcement uh learning and deep

2088.2s learning that one can also like watch it

2090.7s online. So I think for the theory side

2093.4s there are a lot of resources either it

2095.2s be online videos or books one can refer to. And for the practical side, I think it's reading others' codebases for RL deployment and trying to adapt them to one's own use case, so that one can get more and more hands-on experience with training and deployment. Very nice. Those are some really good

2098.7s refer to and for the practical side I

2101.7s think it's reading others codebase for

2105.5s RL deployment and try to adapt them uh

2109.0s for one's own use case so that one can

2111.6s get more and more hands-on experience on

2114.5s the training and uh deployment.

2118.5s Very nice. Those are some really good

2120.0s useful resources, so definitely look into those. I know we'll probably spend the second half of this, or not the second half but this second part, talking about some of the specific applications that you worked on. So a lot of the new cutting-edge work is probably the research that you've done

2122.2s into those. Um I know we'll probably

2125.3s spend the second half of this or not

2128.1s second half but this second part talking

2130.1s about some of your specific applications

2132.2s that you worked on. So you know a lot of

2134.1s the new cutting edge work is probably

2136.6s you know the research that you've done

2139.1s on retargeting. So maybe we could start looking into that topic right now. And for those who have never heard of kinematic retargeting, can you give a high-level overview of exactly what that is and what problem you're trying to solve? So kinematic retargeting is basically transforming human motions onto robot

2142.2s start start looking into that topic

2144.8s right now. And maybe just for those that

2146.9s have never heard of kinematic

2148.7s retargeting, can you give like a high-level

2151.4s overview of exactly what that is and

2154.4s what problem you're trying to solve?

2156.6s So the kinematic retargeting is basically

2159.5s transforming human motions onto robot

2162.0s motions. Since the humanoid robot looks very much like a human, we want to reuse human motions to direct our search for how to command the robot. So say here's a task of a human picking up a box, and we want to transfer the same motions onto the robot picking up the

2165.4s very much like the human, we want to

2167.8s reuse the human motions to direct our

2170.5s search for how we command the robot. So

2174.9s say here's a task of human picking up

2177.7s the box and we want to transfer the same

2180.2s motions onto the robots picking up the

2182.2s box. There are some standard ways of doing so, such as defining some key points on the robot and the human and trying to match the absolute positions between the two. But there are some problems with doing so. For example, because the humanoid can be much shorter than the human, this direct

2184.8s doing so such as defining some key

2187.8s points on the rob on the robot and the

2190.6s human um and try to match the absolute

2193.8s position between the two. But there are

2196.6s some problems of doing so. Uh for

2199.0s example, because the humanoid can be

2202.0s much shorter than the human, this direct

2205.5s scaling and translation matching will result in some penetration. And here we're using some techniques to avoid this issue. Penetration, you mean like going into itself? Is that what you mean? Yes. So let me try to show a direct example, something like this. So keypoint matching is the technique I was

2208.4s will result in some penetration.

2211.0s And here we're using some technique to

2213.6s avoid this issue.

2215.0s Penetration you mean like

2217.1s uh in going into itself? Is that what

2219.1s you mean?

2220.7s Uh yes. So let me try to show a direct

2224.2s example. something like this. So the

2226.6s keypoint matching is the technique I was

2229.4s describing as the standardized technique for the uh human to humanoid kinematic retargeting pipeline which is essentially choosing some key points on the human and try to match the same set of semantic key points on the robot to the absolute position of these key points on the human. Um and this because

2231.8s for the uh human to humanoid

2235.1s kinematic retargeting pipeline which is

2237.8s essentially choosing some key points on

2239.8s the human and try to match the same set

2243.8s of semantic key points on the robot to

2246.3s the absolute position of these key

2248.5s points on the human. Um and this because

2253.0s uh the humanoid can be much shorter than the human directly matching this absolute position can result in some penetration with the object. So for example something like this and this is essentially um a very um like a very direct result of the different scale of human and the robot because uh say imagine the human is like

2255.6s the human directly matching this

2257.8s absolute position can result in some

2260.3s penetration with the object. So for

2262.1s example something like this

2264.0s and this is essentially um a very um

2268.8s like a very direct result of the

2271.5s different scale of human and the robot

2274.5s because uh say imagine the human is like

2278.5s say 1.8 m while the humanoid we're using is 1.3 m. Picking up the same box will actually result in different relative scales for the human and the robots. So directly doing this key point matching will result in some artifacts like penetration.

2281.9s using is 1.3 m.

2285.1s Picking up the same box will actually

2287.7s result in different relative

2290.4s scales for the human and the robots. So

2293.0s directly doing this key point matching

2295.0s will result in some artifacts like

2297.1s penetration.
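
The scale mismatch described here can be illustrated with a small numeric sketch. It is only an illustration under stated assumptions: the two heights are the rough figures quoted in the conversation, while the box width, the hand positions, and the helper name `scale_keypoints` are hypothetical and not taken from the speaker's pipeline.

```python
import numpy as np

HUMAN_HEIGHT = 1.8   # metres, rough figure quoted in the conversation
ROBOT_HEIGHT = 1.3   # metres, rough figure quoted in the conversation
BOX_WIDTH = 0.40     # metres, hypothetical box size (same box for human and robot)

def scale_keypoints(human_points, human_height, robot_height):
    """Uniformly scale human keypoints (N, 3) about the origin by the height ratio."""
    return np.asarray(human_points, dtype=float) * (robot_height / human_height)

# Hypothetical hand keypoints: the human grips the box on its left and right faces.
human_hands = np.array([
    [-BOX_WIDTH / 2, 0.0, 1.0],   # left hand
    [BOX_WIDTH / 2, 0.0, 1.0],    # right hand
])

robot_hands = scale_keypoints(human_hands, HUMAN_HEIGHT, ROBOT_HEIGHT)
hand_gap = robot_hands[1, 0] - robot_hands[0, 0]

# The box does not shrink with the robot, so a positive value means the
# scaled hand targets end up inside the box.
penetration = BOX_WIDTH - hand_gap

print(f"robot hand gap: {hand_gap:.3f} m, box width: {BOX_WIDTH:.2f} m")
print(f"total penetration: {penetration:.3f} m")
```

Because the keypoints shrink with the robot while the box keeps its size, the scaled hand targets land inside the box, which is the penetration artifact being discussed.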

2299.3s So uh you talk about going from um like key points from a human. Is the main idea to get videos of people or what's the main are you trying to utilize like the whole internet data to do some of this? Like what's the bigger picture idea that um this method would end up being used for?

2302.9s key points from a human. Is the main

2304.6s idea to get videos of people or what's

2307.9s the main are you trying to utilize like

2310.2s the whole internet data to do some of

2312.0s this? Like what's the bigger picture

2313.8s idea that um this method would end up

2316.4s being used for?

2319.2s Yeah, that's a great question. So currently we're using motion capture data which is essentially a human demonstrator wearing a very specialized mocap suits in a specialized um room with cameras that can accurately identify the position of the human. Um but this sort of data is very

2321.0s currently we're using motion capture

2323.0s data which is essentially a human

2326.1s demonstrator wearing a very specialized

2329.4s mocap suits in a specialized um room

2332.6s with cameras that can accurately

2335.0s identify the position of the human.

2338.3s Um but this sort of data is very

2341.4s expensive and uh ultimately we want to utilize the videos of the entire internet to teach the robots to do the things but there are some challenges uh in this regard. So uh the uh 3D reconstruction of human and objects from video is a very non-trivial research topic. Some of the

2344.6s utilize the videos of the entire

2347.5s internet to teach the

2350.3s robots to do the things but there are

2352.5s some challenges uh in this regard. So uh

2355.8s the uh 3D reconstruction of human and

2359.4s objects from video is a very

2362.0s non-trivial research topic. Some of the

2364.6s problems involve like the human root will kind of be floating in the air and uh going back and forth. So how to extract uh robust, reliable and realistic data from the video is a very um challenging but also interesting research topic. I see. So I guess you guys are kind of

2367.9s root will kind of be floating in the air

2370.2s and uh going back and forth. So how to

2374.0s extract uh robust, reliable and

2377.2s realistic data from the video is a very

2381.3s um challenging but also interesting

2384.2s research topic.

2385.0s I see. So I guess you guys are kind of

2387.5s assuming that the video to model part is handled, right? And you're just focusing more on if you already have the key points. Is that right? Uh exactly. Yeah. Okay. So um you were mentioning like oh if the robot is smaller then um you're trying to focus on having like a bigger

2393.4s handled, right? And you're just focusing

2395.4s more on if you already have the key

2397.5s points. Is that right?

2399.0s Uh exactly. Yeah.

2400.4s Okay. So um you were mentioning like oh

2403.2s if the robot is smaller then um you're

2407.1s trying to focus on having like a bigger

2409.4s human where the key points are bigger to something smaller. Do you think your current method can also work in reverse? If the human is smaller but the robot is bigger, can it also handle those cases? Like is it general enough to do that? Absolutely.

2411.6s something smaller. Do you think your

2413.3s current method can also work in reverse?

2416.2s If the human is smaller but the robot is

2418.6s bigger, can it also handle those cases?

2421.8s Like is it general enough to do that?

2423.4s Absolutely.

2425.1s Yeah, absolutely. Um so the way our method works is um let me try to share my screen again. We tried to build something called interaction mesh which is uh in addition to defining key points on the human we also define key points on the object. So let me give a more concrete example

2428.8s method works is um let me try to share

2432.5s my screen again. We tried to build

2435.1s something called interaction mesh which

2437.8s is uh in addition to defining key points

2441.6s on the human we also define key points

2444.6s on the object.

2446.1s So let me give a more concrete example

2448.9s here is a human picking up the box and we want to transfer its motion to the robot picking up the box. uh as I mentioned we select some key points semantically important key points on the human and the set of same key points on the robot. We also define key points on

2452.3s we want to transfer its motion to the

2454.6s robot picking up the box. uh as I

2457.4s mentioned we select some key points

2459.8s semantically important key points on the

2462.0s human and the set of same key points on

2464.6s the robot. We also define key points on

2467.7s the object and use the same set of key points on the uh on the object for the robot as well. Then we build uh an interaction mesh which is a volumetric structure that uh captures the relative uh information uh position information between the human and the object. So uh here is what the mesh looks like.

2470.2s points on the uh uh on the object for

2473.4s the robot as well. Then we build uh an

2477.4s interaction mesh which is a volumetric

2481.0s structure that uh captures the relative

2485.0s uh information uh position information

2488.2s between the human and the object.

2490.9s So uh here is what the mesh looks like.

2494.4s So as we can see it not only captures the information between the human joints themselves but also how it relates to the object. So uh in this example, the human's right hand uh is touching the right face of the object and we want the robot's right hand to also touch the right surface of the object

2497.3s the information between the human joints

2500.6s themselves but also how it relates to

2503.4s the object.

2504.9s So uh in this example, the human's right

2508.6s hand uh is touching the right face of

2510.8s the object and we want the robot's right

2513.5s hand to also touch the right surface of

2515.8s the object

2516.9s that is actually captured by the um graph structure that preserves the relative spatial information in this motion. Is there a minimum number of points you need for the object to have the full understanding of your object? Uh that's a really good question. So uh ideally we want the contact points

2521.0s graph structure that preserves the

2523.4s relative spatial information in this

2525.7s motion.
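
A rough sketch of how such a mesh could be built and used is given here. The conversation only says the interaction mesh is a volumetric structure over human and object key points that preserves relative spatial information; the specific choices in this sketch (a Delaunay tetrahedralization, uniform-weight Laplacian coordinates, and the names `build_neighbors`, `laplacian_coordinates`, and `deformation_energy`) are assumptions for illustration, and the key points are random placeholders.

```python
import numpy as np
from scipy.spatial import Delaunay

def build_neighbors(points):
    """Tetrahedralize the points; return for each point the indices it shares a tetrahedron with."""
    tets = Delaunay(points).simplices          # (n_tets, 4) vertex indices in 3D
    neighbors = [set() for _ in range(len(points))]
    for tet in tets:
        for i in tet:
            neighbors[i].update(int(j) for j in tet if j != i)
    return [sorted(n) for n in neighbors]

def laplacian_coordinates(points, neighbors):
    """Offset of each point from the centroid of its mesh neighbours (uniform weights, an assumption)."""
    return np.array([points[i] - points[nbrs].mean(axis=0)
                     for i, nbrs in enumerate(neighbors)])

def deformation_energy(source_points, target_points, neighbors):
    """Sum of squared differences between the Laplacian coordinates of two point sets."""
    diff = (laplacian_coordinates(source_points, neighbors)
            - laplacian_coordinates(target_points, neighbors))
    return float(np.sum(diff ** 2))

# Placeholder key points: 8 on the "human", 6 on the "object".
rng = np.random.default_rng(0)
human_keypoints = rng.uniform(-0.5, 0.5, size=(8, 3)) + np.array([0.0, 0.0, 1.0])
object_keypoints = rng.uniform(-0.2, 0.2, size=(6, 3)) + np.array([0.4, 0.0, 0.9])
source = np.vstack([human_keypoints, object_keypoints])
neighbors = build_neighbors(source)

# Moving everything together keeps all relative offsets, so the energy stays near zero.
shifted = source + np.array([0.3, -0.1, 0.0])
# Moving one human key point alone changes its relation to the object and to other joints.
distorted = source.copy()
distorted[0] += np.array([0.1, 0.0, 0.0])

energy_shifted = deformation_energy(source, shifted, neighbors)
energy_distorted = deformation_energy(source, distorted, neighbors)
print(energy_shifted, energy_distorted)
```

The point of the sketch is that the energy depends only on relative offsets within the mesh, not on absolute positions, which is the property that lets a hand stay on the same face of the object after retargeting.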

2526.5s Is there a minimum number of points you

2528.4s need for the object to have the full

2532.2s understanding of your object?

2535.0s Uh that's a really good question. So uh

2539.0s ideally we want the contact points

2543.5s between the human and the object. So say in this point there might be two key points which is like uh the left uh hand and the right hand uh touching the surface of the box. Um but in order to make the algorithm more robust uh we actually uh select more key points than

2546.9s in this point there might be two key

2550.2s points which is like uh the left uh hand

2553.5s and the right hand uh touching the

2556.1s surface of the box. Um but in order to

2558.8s make the algorithm more robust uh we

2561.2s actually uh select more key points than

2563.4s that. So uh we essentially randomly sample like say 20 to 50 points on the object to keep the relationship between the robot human uh and the object. Okay. So how how well because right now we see an example with a box. How well do you think your current method could extend to, you know, more deformable or

2567.2s sample like say 20 to 50 points on the

2571.4s object to keep the relationship between

2574.3s the robot human uh and the object.
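
The random sampling of object key points mentioned here could look roughly like the following for a box. This is a minimal sketch, assuming an axis-aligned box centred at the origin; the function name `sample_box_surface`, the box dimensions, and the count of 30 points are hypothetical choices within the 20 to 50 range quoted above.

```python
import numpy as np

def sample_box_surface(half_extents, n_points, rng):
    """Sample points uniformly by area on the surface of an axis-aligned box centred at the origin."""
    half = np.asarray(half_extents, dtype=float)
    hx, hy, hz = half
    # Faces are grouped by the axis they are perpendicular to; each axis has two equal faces.
    face_areas = np.array([hy * hz, hx * hz, hx * hy])
    axes = rng.choice(3, size=n_points, p=face_areas / face_areas.sum())
    signs = rng.choice([-1.0, 1.0], size=n_points)
    # Start from points inside the box, then push one coordinate onto the chosen face.
    points = rng.uniform(-1.0, 1.0, size=(n_points, 3)) * half
    points[np.arange(n_points), axes] = signs * half[axes]
    return points

box_half_extents = (0.20, 0.15, 0.10)   # metres, hypothetical box
object_points = sample_box_surface(box_half_extents, 30, np.random.default_rng(0))
print(object_points.shape)
```

These sampled points would then be added to the human or robot key points before building the interaction mesh.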

2578.5s Okay. So how how well because right now

2582.0s we see an example with a box. How well

2584.1s do you think your current method could

2586.2s extend to, you know, more deformable or

2590.1s organic looking objects like, you know, maybe a pillow, uh maybe like a teddy bear, like something or like a blanket even like how how would this method extend to something like that? Uh I think uh this retargeting method will be directly applicable to uh all kinds of different objects including deformables. Um, and because essentially

2592.4s maybe a pillow, uh maybe like a teddy

2595.0s bear, like something or like a blanket

2597.0s even like how how would this method

2598.8s extend to something like that?

2601.8s Uh I think uh this retargeting method

2604.6s will be directly applicable to uh all

2607.7s kinds of different objects including

2609.8s deformables. Um, and because essentially

2613.3s we're capturing the relationship between the human and some points on the object. As long as we can define the key points, either it be contact points or it be semantically meaningful key points, uh, our retargeting pipeline can be directly transferred. Okay, very cool.

2616.7s the human and some points on the object.

2619.8s As long as we can define the key points,

2622.5s either it be contact points or it be

2625.1s semantically meaningful key points, uh,

2627.4s our retargeting pipeline can be directly

2630.3s transferred.

2631.4s Okay, very cool.

2633.1s Um so I know a lot of like the general trend that we're seeing in robotics is um you know there's still a lot of companies where they have very special models that do very specific tasks but then there's also companies like Tesla and some other companies that are trying to do more of an end-to-end model

2636.1s trend that we're seeing in robotics is

2638.9s um you know there's still a lot of

2640.9s companies where they have very special

2643.6s models that do very specific tasks but

2646.2s then there's also companies like Tesla

2648.7s and some other companies that are trying

2650.5s to do more of an end-to-end model

2653.6s where you know they only get input is the video that they see of the world and the output is the motor actions like let's say a very high level task. Go clean the room or you know go get me coffee, right? Something that high of a level of a task. Do you feel like in the

2656.5s the video that they see of the world and

2658.7s the output is the motor actions like

2661.8s let's say a very high level task. Go

2663.7s clean the room or you know go get me

2666.0s coffee, right? Something that high of a

2668.6s level of a task. Do you feel like in the

2672.7s future is that the direction that robotics is headed where there's such a general model that can do anything or do you feel like we still need very specialized models that does very specialized tasks? So uh this is a great question and a very hot topic that uh both industry and academia has been

2676.0s robotics is headed where there's such a

2678.6s general model that can do anything or do

2681.1s you feel like we still need very

2683.4s specialized models that does very

2685.8s specialized tasks? So uh this is a great

2688.8s question and a very hot topic that uh

2691.4s both industry and academia has been

2693.7s debating a lot. So I personally think I would lean more towards a generalist policy. Uh the reason is that um multiple skills that are trained for the generalist policy might be able to um transfer and help each other generalize. Um and as sometimes uh the generalist policies uh the advantage people believe is that it can learn

2696.6s would lean more towards a generalist

2699.4s policy. Uh the reason is that um

2702.2s multiple skills that are trained for the

2704.5s generalist policy might be able to

2708.0s um transfer and help each other

2709.9s generalize. Um and as sometimes uh the

2713.7s generalist policies uh the advantage

2716.4s people believe is that it can learn

2718.5s something like a common sense or the intuition uh or intuitive physics which is uh roughly like a model of how the world will react that can actually transfer across different tasks. So once the model uh gains this common sense, it will be able to more easily transfer to a new task it has never seen

2721.3s intuition uh or intuitive physics

2725.1s which is uh roughly like a model of how

2728.8s the world will react that can actually

2731.4s transfer across different tasks. So

2734.1s once the model uh gains this common

2736.4s sense, it will be able to more easily

2739.4s transfer to a new task it has never seen

2742.2s it before. So, do you think these models eventually can get down to like millimeter precisions like for example if um like very hard tasks for example like surgery maybe or even like if they're trying to assemble PCB boards onto or put you know IC parts onto a PCB board. Do you think these models can

2744.9s eventually can get down to like

2748.2s millimeter precisions like for example

2750.9s if um like very hard tasks for example

2756.0s like surgery maybe or even like if

2759.4s they're trying to assemble PCB boards

2762.9s onto or put you know IC parts onto a PCB

2767.1s board. Do you think these models can

2769.6s eventually get to that level of precision or do you think they probably still need, you know, very like more of the typical robot programming that we think of where you program exact positions. What's your thought on that? Uh yeah, that's a great question. I think it might be hard for these

2771.6s precision or do you think they probably

2774.9s still need, you know, very like more of

2778.2s the typical robot programming that we

2780.7s think of where you program exact

2782.4s positions. What what's your thought on

2784.2s that?

2785.8s Uh yeah, that's a great question. I

2788.0s think it might be hard for these

2790.7s generalist policy to directly perform millimeter accuracy task uh directly out of the box. But if we just collect a small number of uh demonstrations on these specific tasks and do post training that is to refine the generalist policy on our specific task with the in-domain data and specific data I think we will be able to achieve

2794.0s millimeter accuracy task uh directly out

2797.8s of the box. But if we just collect a

2800.7s small number of uh demonstrations on

2803.9s these specific tasks and do post

2807.0s training that is to refine the

2809.4s generalist policy on our specific task

2812.1s with the in-domain data and specific

2814.5s data I think we will be able to achieve

2817.2s very high accuracy. Uh the old traditional like classical methods like scripting uh the robot arms can achieve very high fidelity but they are more brittle to uh long-tail uh problems and they might do less well in terms of like vision and uh the more semantic reasoning where these generalist policy might be able to learn

2820.1s traditional like classical methods like

2822.2s scripting uh the robot

2825.8s arms can achieve very high fidelity but

2829.0s they are more brittle to uh long-tail uh

2833.3s problems and they might do less well in

2836.4s terms of like vision and uh the more

2839.8s semantic reasoning where these

2841.6s generalist policy might be able to learn

2844.0s from failures or recovery from the other skills and try uh try to directly recover from something uh that uh it hasn't been seen in the classical scripting kind of method. So I know you mentioned like having more data for those specific cases. Um in general, do you feel like what's stopping us from having robots in our

2846.5s skills and try uh try to directly

2849.5s recover from something uh that uh it

2852.8s hasn't been seen in the classical

2855.4s scripting kind of method.

2857.4s So I know you mentioned like having more

2860.2s data for those specific cases. Um in

2863.0s general, do you feel like what's

2865.7s stopping us from having robots in our

2868.8s house that's working? Do you feel like that's more of a data problem or is it more of a model and architecture or even robot hand development problem? What's your current take on that? Yeah, I think um the current obstacles are multifold. uh I would say the humanoid hardware is uh relatively more

2870.6s that's more of a data problem or is it

2874.3s more of a model and architecture or even

2879.4s robot hand development problem? What's

2881.1s your current take on that?

2884.0s Yeah, I think um the current obstacles

2888.9s are multifold. uh I would say the

2892.6s humanoid hardware is uh relatively more

2896.9s robust than say the hand hardware. So the sim to real gap is smaller and the motor control is more precise. Uh but of course uh in addition to the hardware problem there is also the software problem and uh for a lot of researchers the software the core of the software problem is the data problem. Um so I

2900.3s the sim to real gap is smaller and the

2902.6s motor control is more precise. Uh but of

2905.4s course uh in addition to the hardware

2907.6s problem there is also the software

2909.4s problem and uh for a lot of researchers

2914.0s the software the core of the software

2915.9s problem is the data problem. Um so I

2919.1s think some of our hypothesis is that uh if we have uh enough high quality data maybe the training architecture and policy architecture doesn't matter as much. So we are essentially trying to control the quality of the policy output uh by controlling the uh data quality

2922.2s if we have uh enough high quality data

2926.2s maybe the training architecture and

2929.0s policy architecture doesn't matter as

2931.0s much.

2931.8s So we are essentially trying to control

2935.0s the quality of the policy output uh by

2937.9s controlling the uh data quality

2939.7s directly. So if we can get uh very high quality um data for the robots uh I think it will be a very important uh improvement for reliable deployment of these robots. So right now a lot of people are you know either manually getting data or using data from like simulations. Um I know Nvidia recently

2943.8s quality um data for the robots uh I

2948.2s think it will be a very important uh

2951.4s improvement for reliable deployment of

2954.5s these robots.

2955.6s So right now a lot of people are you

2959.1s know either manually getting

2962.4s data or using data from like

2965.8s simulations. Um I know Nvidia recently

2969.0s has been pushing Cosmos which is their uh AI data. Basically they're generating video data synthetically and they could augment and do like data transfer for different scenes. For example, if it's cloudy, sunny, they could augment all of that. um do you think that is the right approach to getting more data or do you think

2971.3s their uh AI data. Basically they're

2975.2s generating video data synthetically and

2977.8s they could augment and do like data

2981.0s transfer for different scenes. For

2983.2s example, if it's cloudy, sunny, they

2985.0s could augment all of that. um do you

2987.8s think that is the right approach to

2991.0s getting more data or do you think

2993.1s there's uh different ways someone should be focusing on to get more data? Yeah, that's a great question. So I think we should actually leverage data from all different kind of resources. uh either it be the most expensive but the uh arguably the highest quality data which is the teleoperation data on the

2995.5s be focusing on to get more data?

2998.7s Yeah, that's a great question. So I

3001.0s think we should actually leverage data

3003.8s from all different kind of resources. uh

3006.6s either it be the most expensive but the

3010.4s uh arguably the highest quality data

3013.7s which is the teleoperation data on the

3016.0s real robot or it be the simulation data which we can generate in large scale but always has the sim to real gap. Uh and the also there are also kinds of the other uh data which is for example internet video data uh world model data uh that has rich uh semantic and visual

3019.4s which we can generate in large scale but

3022.5s always has the sim to real gap. Uh and

3025.8s the also there are also kinds of the

3028.0s other uh data which is for example

3030.7s internet video data uh world model data

3034.1s uh that has rich uh semantic and visual

3037.1s features for like the uh video uh models but might have less action data. So I think different data has their own advantages uh and their own uh specialized um targeting area. Um and combining these different sources of data together to enable both control and dynamics accuracy as well as semantic

3041.8s but might have less action data. So I

3045.0s think different data has their own

3047.1s advantages uh and their own uh

3049.9s specialized um targeting area. Um and

3054.5s combining these different sources of

3056.9s data together to enable both control and

3061.4s dynamics accuracy as well as semantic

3065.4s and uh visual understanding is a very uh I think a very interesting and promising topic. So I guess if you were to put a allocation like if I imagine like if there's a pie chart and you were to allocate like the percent of each category that you mentioned just roughly you know roughly speaking how would you

3068.9s I think a very interesting and promising

3070.9s topic. So I guess if you were to put a

3074.1s allocation like if I imagine like if

3076.6s there's a pie chart and you were to

3078.2s allocate like the percent of each

3080.2s category that you mentioned just roughly

3083.7s you know roughly speaking how would you

3086.1s categorize each of the parts of the pie for the different types of data. Yeah, since I myself am working on data generation in sim and doing kinematic retargeting, uh my answer will obviously be skewed towards using simulation data. Uh it's uh very uh scalable and it can give us reasonably accurate uh dynamics

3089.4s for the different types of data.

3092.9s Yeah, since I myself am working on data

3095.2s generation in sim and doing kinematic

3097.6s retargeting, uh my answer will obviously

3100.3s be skewed towards using simulation data.

3103.4s Uh it's uh very uh scalable and it can

3106.9s give us reasonably accurate uh dynamics

3110.2s and action uh data. uh while so I would say I would allocate like half of the effort in uh doing uh in generating simulation data and the other half is split between uh video and uh real world teleoperation data. So I think video data is also more scalable than the real world teleoperation because for the tele

3114.7s say I would allocate like half of the

3117.0s effort in uh doing uh in generating

3120.3s simulation data and the other half is

3123.0s split between uh video and uh real world

3127.2s teleoperation data.

3129.0s So I think video data is also more

3132.7s scalable than the real world

3134.2s teleoperation because for the tele

3136.1s operation you always need a human operator to operate the robots. uh it can be time consuming uh it can cause fatigue for the human operator uh and wears the robot hardware right uh for the video we have the entire internet and we have these like you mentioned cosmos the generative uh

3138.1s operator to operate the robots. uh it

3141.0s can be time consuming uh it can

3144.6s cause fatigue for the human operator uh

3147.1s and wears the robot hardware right

3149.8s uh for the video we have the

3152.9s entire internet and we have these like

3155.0s you mentioned cosmos the generative uh

3157.8s video models world models that can generate like essentially endless video data for us to capture the uh visual dynamics uh semantic features and so on so I think that one is more scalable. So I think my personal take is to uh spend as much effort uh as possible on the scalable uh like methods including

3160.1s generate like essentially endless video

3163.3s data for us to capture the uh visual

3166.6s dynamics uh semantic features and so on

3169.7s so I think that one is more scalable. So

3172.4s I think my personal take is to uh spend

3175.8s as much effort uh as possible on the

3179.0s scalable uh like methods including

3182.0s simulation and video and also allocate a reasonable amount on the real data to actually close the sim to real gap. Uh there's been a lot of topic on you know traditional and newer ways of doing RL. Can you kind of dive into some of the details of that and maybe the differences between the two?

3185.6s reasonable amount on the real data to

3188.5s actually close the sim to real gap.

3190.2s Uh there's been a lot of topic on you

3193.2s know traditional and newer ways of doing

3196.1s RL. Can you kind of dive into some of

3198.2s the details of that and maybe the

3200.2s differences between the two?

3203.0s Uh yeah. So um I think I want to give the kinematic retargeting as an example of doing um things in uh both the classical optimization based and model based uh perspective and the more learning based perspective. So my very first background is actually in optimization and model-based control and that actually lays a foundation for me

3206.5s the kinematic retargeting as an example

3209.4s of doing um things in uh both the

3213.9s classical optimization based and model

3216.0s based uh perspective and the more

3218.9s learning based perspective. So my very

3223.3s first background is actually in

3225.6s optimization and model-based control and

3229.0s that actually lays a foundation for me

3231.5s to write OmniRetarget which is a constrained uh optimization based kinematic retargeting pipeline and this sort of optimization based uh pipeline will enable something that is not quite achievable by a learning based pipeline. So because we're reasoning about hard constraints kinematics in an optimization uh fashion we can enforce higher quality. So we can actually enforce hard

3234.9s constrained uh optimization based

3237.4s kinematic retargeting pipeline and this

3240.4s sort of optimization based uh pipeline

3243.3s will enable something that is not quite

3245.8s achievable by a learning based pipeline.

3248.1s So because we're reasoning about hard

3251.0s constraints kinematics in an optimization

3254.6s uh fashion we can enforce higher

3258.4s quality. So we can actually enforce hard

3261.0s constraints that learning based policy won't be able to enforce. So like say we don't want penetration of the object. We don't want the joint to exceed its hard limits. We don't want the velocity to exceed a certain threshold. For us, we can write it as hard constraints in the optimization

3263.0s won't be able to enforce. So like say

3265.5s we don't want penetration of the

3267.8s object. We don't want the joint to

3270.7s exceed its hard limits. We don't want

3272.9s the velocity to exceed a certain

3276.2s threshold. For us, we can write it as

3278.8s hard constraints in the optimization

3280.6s program versus uh in the more learning based method, people normally put it as soft penalty in the cost or reward and then try to optimize it. Uh sometimes it's not guaranteed that these hard constraints are actually enforced. So there might be a little bit of penetration or joint limit violation if

3284.3s based method, people normally put it as

3288.2s soft penalty in the cost or reward and

3291.7s then try to optimize it. Uh

3295.0s sometimes it's not guaranteed that these

3297.1s hard constraints are actually enforced.

3299.5s So there might be a little bit of

3301.0s penetration or joint limit violation if

3304.2s we're doing this kind of soft penalty. I see. But uh by doing the hard uh constraint uh optimization based uh pipeline, we're able to enforce these hard constraints very systematically and rigorously so that we can have higher quality data to then be consumed by downstream learning paradigms. So do you think it's possible to take both traditional optimization

3306.9s I see.

3307.5s But uh by doing the hard uh constraint

3310.8s uh optimization based uh pipeline, we're

3313.7s able to enforce these hard constraints

3315.7s very systematically and rigorously so

3318.1s that we can have higher quality data to

3321.0s then be consumed by downstream learning

3323.5s paradigms. So do you think it's possible

3325.4s to take both traditional optimization

3328.6s and RL-based methods together or do you think it's more of an either-or type of situation? So uh yeah so the combination of both is actually my goal. So upstream I'm using this optimization based hard constraint uh formulations to generate high quality data and downstream we're

3333.1s think it's more of an either-or type of

3335.3s situation?
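
The contrast drawn earlier in this exchange, between writing a joint limit as a hard constraint and folding it into the cost as a soft penalty, can be sketched on a one-joint toy problem. This is an illustrative sketch only: the target angle, the limit, the penalty weight, and the function names are hypothetical, and a generic SciPy optimizer stands in for both the speaker's constrained retargeting solver and a learning-based method.

```python
from scipy.optimize import minimize

Q_TARGET = 1.2         # radians, hypothetical joint angle that best matches the human pose
Q_MAX = 1.0            # radians, hypothetical upper joint limit
PENALTY_WEIGHT = 10.0  # hypothetical weight on the soft limit penalty

def tracking_cost(q):
    """Squared distance between the joint angle and the target angle."""
    return float((q[0] - Q_TARGET) ** 2)

def penalized_cost(q):
    """Tracking cost plus a quadratic penalty that only activates past the limit."""
    violation = max(0.0, q[0] - Q_MAX)
    return tracking_cost(q) + PENALTY_WEIGHT * violation ** 2

# Hard constraint: the limit is passed to the solver as a bound it must respect.
hard = minimize(tracking_cost, x0=[0.0], bounds=[(-Q_MAX, Q_MAX)])
# Soft penalty: the limit is only discouraged, so the optimum trades it off against tracking.
soft = minimize(penalized_cost, x0=[0.0])

q_hard = float(hard.x[0])
q_soft = float(soft.x[0])
print(f"hard constraint: q = {q_hard:.4f} (limit {Q_MAX})")
print(f"soft penalty:    q = {q_soft:.4f} (limit {Q_MAX})")
```

With the bound, the solution stays at the limit. With the penalty, the minimizer settles slightly past it, because a small violation is cheaper than the lost tracking accuracy; raising the weight shrinks the violation but does not remove it.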

3337.5s So uh yeah so the combination of both is

3340.5s actually my goal. So upstream I'm using

3343.9s this optimization based hard

3347.0s constraint uh formulations to generate

3349.4s high quality data and downstream we're

3352.6s training our policies to track this high quality data. So the combination of very rigorous uh high quality data generation plus the massively parallelizable RL training I think is a very promising paradigm. Hm. So you're saying the main hard constraint is more like the high-level loop closure in a way that's making sure the robot doesn't

3355.0s quality data. So the combination of very

3358.6s rigorous uh high quality data generation

3361.9s plus the massively parallelizable RL

3365.0s training I think is a very promising

3367.1s paradigm. Hmm. So you're saying the

3371.5s main hard constraint is more

3375.0s like the high-level loop closure in a way

3378.0s that's making sure the robot doesn't

3380.4s break these constraints. Is that how you

3382.2s would describe it? Or how for like the

3385.3s general audience, how would you kind of

3387.5s describe how that's able to keep

3391.0s everything, you know, under the main

3393.1s constraints that you want it?

3396.3s Um yeah, I would say uh if we want to

3399.4s enforce like uh important hard

3402.0s constraints uh and specifically for the

3404.7s army project at a kinematic level

3408.4s uh which is just the robot obeying its uh

3412.0s morphological constraints we can do it

3414.4s in a systematic optimization based way

3417.1s and later when we translate into

3419.9s physically uh plausible dynamically

3422.4s plausible uh behavior we want to

3425.3s leverage the uh large scale uh

3428.3s parallelization in simulation in Isaac to

3431.4s do this translation. So like on a

3434.5s higher level if we can split the

3437.1s problems into two phases where like uh

3440.8s smaller number of computation but but

3444.4s requires higher quality uh we can do it

3447.4s with the model-based strategy but uh for

3450.2s the uh lower fidelity requirement and uh

3454.6s and like massively parallelizable uh

3457.4s setup we can use the learning paradigm.

3459.8s But when you're trying to combine the

3461.8s two, um I guess if you were to try to

3464.7s describe the architecture of your

3467.7s program, like how would you lay out the

3471.2s pieces? So like for example, I'll just

3473.4s give you like a simple example of what I

3475.2s mean by that. Like when you have uh like

3477.7s an RL model for example of a robot

3479.8s walking, um the blocks I would describe

3483.1s might be like on the left I have like my

3486.6s RL. Okay, like on the very left I might

3488.7s have a block that's like my desired

3490.7s trajectory and the input of that goes to

3493.6s some RL inference model that's computing

3496.9s the desired torque and then to the right

3499.3s of that might be feeding it to the

3501.2s torques of the actuator. So if you were

3503.8s to kind of describe in like a block

3506.1s diagram level structure of a hybrid

3509.7s approach where you have your um

3512.6s traditional optimization technique and

3514.7s your RL uh techniques,

3518.6s how would you kind of describe that

3520.2s visual picture of how data is flowing?

3525.1s Yeah. So I would say um the model based

3529.7s I I'm I'm personally using the

3531.5s model-based approaches to generate higher

3533.7s quality data and then uh it will be

3537.5s used essentially as an initial guess for

3539.8s the RL policy. So uh we can train our

3543.0s policy from scratch but that is very

3545.8s time consuming requires a lot of reward

3548.6s tuning uh and uh like the uh the the

3552.5s behavior resulting behavior might be uh

3555.3s less natural for example but with the

3558.1s help of model-based methods we will be

3560.6s able to enable um more fluid motion uh

3566.9s from the uh initial guess provided by

3569.6s the model-based methods. And then

3572.0s the RL policy like just bootstraps or

3575.6s initializes from that to learn uh a

3578.6s better um controller.

3580.9s Oh,

3582.0s so you were talking about hard

3583.2s constraints using um like traditional

3586.9s optimization based techniques and also

3589.5s combining that with RL techniques. So

3592.4s can you kind of walk a little bit into

3594.3s more detail about how you take the two

3597.0s things and combine them together?

3599.8s Uh yeah sure. So um as I mentioned uh

3603.5s before we do the kinematic retargeting

3606.5s by building a graph that preserves the

3609.3s relative location between the human and

3612.0s the robot uh as well as the objects. So

3616.0s here is the optimization program I'm

3618.6s solving. There are uh

3621.4s components that are the objective as

3624.3s well as the constraints. So the

3626.3s objectives are encouraging this

3628.8s interaction to be preserved. Uh and the

3631.8s hard constraints include

3633.3s non-penetration

3634.8s where we don't want the robot hand to

3637.8s penetrate the object to go into the

3639.7s object. So we want to prevent that. We

3642.2s want the joint to stay within the limits

3645.0s and as well as the velocity to stay

3647.1s within the speed limit. And we also want

3649.8s a hard constraint that the foot doesn't

3652.9s skate while the robot is walking.

3655.4s Otherwise the robot will just be sliding

3657.7s all the time. So these constraints

3660.2s together with the interaction preserving

3662.8s objective uh will give us some very high

3666.0s quality data that uh preserves the human

3669.8s motion of picking up a box onto the

3672.6s robot uh as well as satisfying

3677.0s all the hard constraints including the

3679.4s robot hand not penetrating the object

3681.7s and the foot is not sliding while

3683.6s walking. So would you say you're using

3686.1s this um constraint here as the input to

3690.7s your RL or are you using this constraint

3693.3s to generate like a full series of like

3697.0s motion data like how how would you say

3699.1s it's the right way to understand?
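The retargeting program described above (an interaction-preserving objective plus hard constraints) can be written schematically as follows. This is a sketch under stated assumptions, not the exact formulation from the work: the symbols $q_t$ (robot configuration at frame $t$), $E_{\text{int}}$ (interaction-preservation cost against the human and object data), $\phi$ (signed distance between robot and object), $\dot q_{\max}$, $\Delta t$, and $p_{\text{foot}}$ are all notation introduced here for illustration.

```latex
\begin{aligned}
\min_{q_{1:T}} \quad
  & \sum_{t=1}^{T} E_{\text{int}}\!\left(q_t;\; x^{\text{human}}_t,\; x^{\text{obj}}_t\right)
  && \text{(preserve the interaction)} \\
\text{s.t.} \quad
  & \phi(q_t) \ge 0
  && \text{(non-penetration)} \\
  & q_{\min} \le q_t \le q_{\max}
  && \text{(joint limits)} \\
  & \left| q_{t+1} - q_t \right| \le \dot q_{\max}\, \Delta t
  && \text{(velocity limits)} \\
  & p_{\text{foot}}(q_{t+1}) = p_{\text{foot}}(q_t) \ \text{while the foot is in stance}
  && \text{(no foot skating)}
\end{aligned}
```

The objective is the only soft term; every line under "s.t." has to hold exactly in the generated motion data.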

3701.8s Uh yeah so this is actually kind of a

3704.3s hierarchical framework. So first we

3707.1s use this pipeline to generate data with

3709.8s hard constraints so that it can be used

3712.3s as initialization for the RL. So during

3715.8s RL training, we actually initialize the

3719.4s uh the agents to be in some random

3722.6s position or uh in some configurations at

3726.4s random time step in this uh in this

3728.7s motion data set and then from that the

3731.8s RL will be able to start from

3734.6s these configurations and bootstrap from

3737.0s that to come up with a dynamically

3739.5s feasible solution. So uh say let's

3742.8s compare it with a training from scratch

3745.7s paradigm where the uh the RL policy

3748.6s initializes the agent to be in a random

3750.9s location.

3752.1s In that kind of scenario

3755.0s there can be all kinds of penetrations

3757.7s uh joint limit violations, velocity limit

3760.5s violations, as well as foot skating in

3763.0s that kind of uh initialization from

3765.5s scratch. So uh with this optimization

3769.0s based uh high quality data

3772.7s generated the RL will be initialized

3775.2s from a much better configuration than

3778.6s just initializing from scratch.

3780.8s So how how do you know that the initial

3785.2s position if the initial position is in

3787.8s something that's like physically

3790.6s possible? How do you know the rest of

3793.1s the RL execution will also be physically

3796.8s possible? What is kind of enforcing

3799.9s that?

3801.6s Yeah, so that's actually just

3803.6s time-stepping the simulator in Isaac that

3806.2s is enforcing the dynamical constraints.

3808.9s So it's still like checking your

3810.6s optimization equation every time. Is

3812.7s that what you're saying? like your RL.

3815.5s Oh,

3817.3s so uh the uh data that comes from my

3820.9s optimization is simply used as the

3823.0s initialization for RL and then RL just

3826.6s does whatever uh it's supposed to do in

3829.0s Isaac.
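The initialization idea described above, starting each training episode from a random frame of the retargeted motion instead of a random configuration, can be sketched as below. This is a toy illustration only: `reference_motion`, `reset_from_reference`, and `reset_from_scratch` are hypothetical names, the motion is a single joint angle per frame, and no simulator API is being modeled.

```python
import random

# Hypothetical retargeted motion: one joint angle per frame, assumed to
# already satisfy the hard constraints from the optimization stage.
reference_motion = [0.0, 0.1, 0.25, 0.4, 0.5, 0.55]


def reset_from_reference(motion, rng):
    """Start an episode at a random frame of the reference motion."""
    t0 = rng.randrange(len(motion) - 1)  # keep at least one later frame to track
    return {"t": t0, "q": motion[t0]}


def reset_from_scratch(rng, low=-3.0, high=3.0):
    """Baseline: a random configuration that may violate joint limits or penetrate objects."""
    return {"t": 0, "q": rng.uniform(low, high)}


rng = random.Random(0)
state = reset_from_reference(reference_motion, rng)
print("episode starts at frame", state["t"], "with q =", state["q"])
```

After the reset, the rollout is entirely the simulator's job; the reference data only decides where the episode begins.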

3830.4s Okay. So as long as you're saying as

3832.9s long as the initialization when you say

3834.6s initialization I guess you mean like the

3836.6s initial position it's in for like each

3839.7s episode. Is that the correct way to

3841.9s describe it? Uh yes yes uh not only does

3845.7s the reference motion act as a

3849.4s good initialization for the RL policy

3851.8s but it also acts as

3854.5s guidance for the RL policy. So uh it

3857.9s tells the RL policy where the

3861.6s robot should go at the next time step

3863.9s and the RL tries to achieve that with

3866.9s the current dynamical constraints in

3869.8s Isaac Sim.
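The guidance role of the reference motion can be sketched as a tracking reward: at each step the reference supplies the next target, the dynamics decide what is actually reachable, and the reward measures the gap. This is a minimal sketch under assumptions. The exponential reward shape, the `sigma` value, and `toy_sim_step` (a stand-in for a physics step that only limits how far a joint can move per step) are all illustrative choices, and the "policy" here simply commands the reference target.

```python
import math


def tracking_reward(q, q_ref, sigma=0.25):
    """Return exp(-||q - q_ref||^2 / sigma^2): 1.0 on the reference, smaller as error grows."""
    err_sq = sum((a - b) ** 2 for a, b in zip(q, q_ref))
    return math.exp(-err_sq / sigma ** 2)


def toy_sim_step(q, q_target, max_delta=0.1):
    """Stand-in for one physics step: each joint moves toward q_target by at most max_delta."""
    return [qi + max(-max_delta, min(max_delta, ti - qi)) for qi, ti in zip(q, q_target)]


# Hypothetical reference motion: one joint, three frames.
reference = [[0.0], [0.05], [0.3]]

q = list(reference[0])
rewards = []
for t in range(len(reference) - 1):
    q_ref_next = reference[t + 1]    # where the reference says the robot should be next
    q = toy_sim_step(q, q_ref_next)  # what the toy dynamics actually allow
    rewards.append(tracking_reward(q, q_ref_next))

print(rewards)
```

The second step asks for a jump larger than the toy dynamics permit, so the state falls short of the reference and the reward drops below 1; in training, that gap is the signal the policy learns from.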

3871.0s All right. So, that's it for this

3872.3s episode. Uh, thank you Lou for coming on

3874.6s to this podcast show and I'll leave some

3877.3s links in the video description for some

3879.3s of her works so you guys can go ahead

3881.3s and check that out.

3882.5s Thank you for inviting me, Kevin.

How Robots Can Backflip with RL (Sim-to-Real, Kinematic Retargetting, Isaac Lab vs Mujoco)

Full Transcript