2.0s
here, an expert in reinforcement learning. She's also the author behind this cool work that allows robots to do backflips. We're going to be covering a lot of cool and exciting topics like RL theory versus application, bridging the sim-to-real gap, simulation environments for RL, deploying RL on custom robots, position versus torque control,
21.4s
resources for learning RL, and then in the second half we'll talk about kinematic retargeting applications, generalization in robotics, generalist versus specialist, data and training, and finally end with traditional optimization versus RL. And if you're new here, my name is Kevin. I've been doing robotics and AI for 10 plus
39.6s
years and have lots of resources on my channel. I also have a robotics builder membership where you get deep-dive access to how I build my projects, and also resources on my website to help you get started. Link down below.
Can you give us a quick self-introduction?
Hi everyone, I'm Lujie. I go by Lou. I'm
56.8s
currently an applied scientist at the Frontier AI & Robotics lab at Amazon. I just obtained my PhD last October from MIT under the guidance of Professor Russ Tedrake. I have been working on optimization-based and learning-based methods for robotic manipulation and control, and right now I'm working on humanoid whole-body control.
79.6s
Very nice. Very cool topic, because I know humanoid robots are probably one of the hottest topics right now. So it's very exciting to have a deeper dive into how some of the reinforcement learning techniques are actually applied in real life, and to understand some of the existing challenges that people are
98.5s
dealing with in industry. So the first topic we want to talk about: I know a lot of times when people are learning RL there's a lot of theory, but when they try to apply it in practice, it can be quite different from the theory. So, if you were to give
115.8s
the audience a good overview of what that difference is, how would you describe it?
Yeah. So, I think having a very solid foundation in RL theory is actually very rewarding and inspiring for RL practice. I think RL is more than just collecting rewards; it is also about how we balance the
138.9s
exploration and exploitation, which is actually embedded in the coefficients and the exploration formulation of the RL theory itself. So how we draw on RL theory for insight when tuning the RL rewards or coefficients is, I think, a very interesting and rewarding topic.
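One common way an exploration coefficient shows up in practice is as an entropy bonus subtracted from the policy loss. The sketch below is a minimal illustration under that assumption, not code from this conversation; the function names, the four-action distributions, and the coefficient values are all hypothetical. It shows that a larger coefficient penalizes a policy that has nearly collapsed onto one action more heavily than one that is still exploring.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(policy_loss, probs, entropy_coef):
    """Policy loss minus an entropy bonus.

    A larger entropy_coef rewards high-entropy (exploratory) policies more.
    """
    return policy_loss - entropy_coef * entropy(probs)


# Hypothetical action distributions over four actions.
uniform = [0.25, 0.25, 0.25, 0.25]    # still exploring every action
collapsed = [0.97, 0.01, 0.01, 0.01]  # nearly committed to one action

for coef in (0.0, 0.01, 0.1):
    gap = regularized_loss(1.0, collapsed, coef) - regularized_loss(1.0, uniform, coef)
    print(f"entropy_coef={coef}: extra loss for the collapsed policy = {gap:.4f}")
```

With the coefficient at zero the two policies are treated identically, which is the situation where early collapse is not discouraged at all.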
163.5s
Okay. So if someone is, for example, tuning all the coefficients, is there a specific example you can think of where they apply something in simulation, but when they actually try to deploy it, it's very different?
184.5s
Are you referring to the sim-to-real gap, or to what a systematic way is to tune the RL coefficients and rewards in sim?
Yeah, I guess more so in terms of the theory, because a lot of times, if someone were to
201.3s
jump in and use a framework like Gymnasium, for example, they have everything set up so that a lot of the function calls are kind of a black box, right? They would maybe play with the reward structure, or maybe change the exploration like you were saying. So, for example,
220.8s
someone who's very used to the Gymnasium Python framework: that's what I would consider more of the application side. So where do you see that infrastructure start to break? For someone who has spent maybe a couple of years learning the theory, what edge does
242.2s
someone who knows the theory have over someone who doesn't, if they're just playing with Gymnasium, for example?
Yeah, that's a great question. During training we might see a phenomenon known as mode collapse. For example, if there is not enough
260.8s
exploration from the agents, they will quickly collapse to a single behavior without exploring the environment further, and can easily get stuck in a local minimum, which is not the behavior we want. So in that case, one might have to increase the exploration coefficient
281.0s
to encourage the agent to explore the environment before collapsing to a single mode and exploiting afterwards.
Okay. So for someone who's just playing with the Gymnasium Python library, for example, is that something they would be able to figure out, or is it that if they had
304.6s
no theory, they wouldn't be able to figure out something like that?
I think if you do enough trial and error and change the coefficients here and there, you might bump into a coefficient setting that works, or you can just do a grid search over all the important
323.2s
coefficients and get a very good intuition afterwards about which are the most important coefficients or parameters to change for RL training. But suppose we start from the fundamentals of RL training and RL theory itself. Most of the time, the common algorithm
342.5s
we're using for RL these days is PPO, and there are only a small number of terms in PPO that are really important, the ones that cause the training to stabilize or destabilize. So if we start from this formulation, from the fundamentals, we will be able to have
360.2s
a big picture of which important parameters we should tune.
So something like doing a grid search over all the coefficients: do you feel like that's more of an advanced technique that someone would learn maybe only through school, or is that something that someone who's more familiar with only the application side would also
379.9s
know how to do?
I think people who are very practical will also come up with this idea.
Oh really? Okay. You mentioned PPO. I know that's one of the most common RL models that people are using right now for robotics. Would you say there's a specific reason why people are leaning towards
399.4s
that versus another model, or how do they come to decide to use that one?
Yeah. So in locomotion, for example, people use PPO a lot because it's an on-policy algorithm that is very good at encouraging the training to stay within the distribution. That means the
422.6s
RL algorithm is actually experiencing the same state and action distribution as the agent itself while it explores the environment. Some other, more efficient RL algorithms, such as off-policy RL, are essentially using different strategies during the policy update versus what the agent is using to explore the
446.0s
environment. That kind of algorithm is cheaper and can be more efficient, but because the policy update and the agent are experiencing different state distributions, during deployment there will be a non-ideal effect called distribution shift, which can cause a big gap between training and deployment.
So would you say, if you were
470.0s
to give a percentage of users that are using PPO versus other models, how would you describe that distribution right now?
I would say probably 90% of robot locomotion researchers are using PPO.
Oh, okay. We were talking about the sim-to-real gap. That's a very big
491.2s
area that people are trying to tackle, and I know there are people that try to tackle that problem by randomizing the physical properties of their robots, some people have zero-shot techniques, and there are also people that try to train on hardware, ignoring some
510.6s
of the simulation. In your opinion, what is the best approach to sim-to-real? And maybe you could explain some of the tradeoffs between the methods I mentioned, or any other methods that you're familiar with.
Yeah, that's a great question. Sim-to-real is a very important procedure in our
530.9s
entire robot deployment pipeline. I would say one should start with a reasonably accurate system identification of the robot, and then try to randomize a little bit around those nominal values in simulation during training. That's called domain randomization. Then deploy it on the robot and see if
553.8s
there's any other mismatch between sim and real. So one can essentially run a loop where one identifies the system coefficients of the robot, randomizes them in sim, deploys on real, and collects the system rollouts to back-prop the parameters to the sim, modifying the sim
577.6s
parameters to match the real behavior. That would be the ideal case. But one can also train a base policy in simulation that performs reasonably well and then, starting from that base policy, do real-world RL. I would say training RL from
597.4s
scratch in the real world is very time-consuming and can also do a lot of damage to the hardware, which can be expensive, so we want to minimize the computation time as well as the cost of the entire training and deployment loop.
Yeah, that's a good point. On the
619.6s
hardware note, I'm very curious: have you ever encountered any issues where your hardware randomly failed during your RL deployment and you couldn't figure it out, or it was hard for you to figure out?
Yeah. Actually, that's a very common issue when we're doing hardware
644.0s
experiments. There can be cases when, as soon as we start our controller, the robot just waves its arms and legs wildly. There can be a couple of issues. First of all, the calibration of the IMU may not be
662.1s
accurate enough, so the robot doesn't have an accurate sense of where it is or what its angular and linear velocities are. There can also be a motor that is not producing enough torque. For example, if we're doing a very agile behavior
683.3s
like climbing up a very high platform or jumping off a cliff, sometimes, although we are able to train a policy in simulation for the robot to do that, the motor curve is actually different in reality, and in the hardware experiments the motor on the robot will
705.9s
not be able to produce as much torque as such an agile behavior needs. That might also cause failures on hardware.
So since motors typically have a rated torque and a peak torque, is that something you can cap accurately in simulation, or is it pretty hard to set those constraints?
729.8s
Yeah, I think it's pretty hard. If we're just using the default parameters from the robot vendor, I would say they're often not that accurate. If one wants to do a very accurate calibration, I think one has to run a lot of experiments with the current,
750.3s
torque, and speed of the motor to record its behavior. There's also another caveat: as we do more experiments, the robot itself wears out, so the relationship between motor current, speed, and torque will actually change over time, which is even harder to model in simulation. So I think the best we
775.9s
can do is add more of a safety margin in simulation: penalize the torque limits a little more in simulation so that when the policy transfers to real, it won't hit its actual limit.
I see. So in theory, let's say someone were to
796.0s
be very conservative and assume the peak torque is, say, 50% of the manufacturer's rated torque. Do you think they would still face any of these challenges, or would that be pretty safe?
I think in that case it's more promising to transfer to real without hitting any hardware issue. But that
818.6s
of course limits the range of motions the robot can do. If we wanted to do a wall flip or jump very high, I think it would be more challenging.
Yeah, that's definitely a good point, because you're pretty much over-constraining your robot.
Mhm. Exactly.
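The idea of penalizing torque before the real limit is reached can be sketched as a reward term with a soft limit set below the rated one. This is a minimal sketch under stated assumptions, not the reward used in the work discussed here: the function name, the quadratic form, and the `margin` and `weight` parameters are all hypothetical. Setting `margin=0.5` would correspond to the very conservative 50% assumption raised in the conversation.

```python
def torque_limit_penalty(torques, rated_limits, margin=0.8, weight=1.0):
    """Return a non-positive reward term penalizing torque beyond a soft limit.

    The soft limit is margin * rated limit, so the policy is discouraged
    from approaching the rated limit while training in simulation.
    """
    penalty = 0.0
    for tau, limit in zip(torques, rated_limits):
        soft_limit = margin * limit
        excess = max(0.0, abs(tau) - soft_limit)  # zero inside the soft limit
        penalty += excess ** 2
    return -weight * penalty


# Hypothetical two-joint example: rated limits of 50 N*m, soft limit at 40 N*m.
reward_term = torque_limit_penalty([10.0, -45.0], [50.0, 50.0], margin=0.8)
print(reward_term)
```

A smaller margin makes hardware transfer safer but, as noted in the conversation, also shrinks the set of agile motions the policy can learn.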
837.4s
So far, I know a lot of your work has been on deploying on the Unitree robots. If someone were to try to take these techniques and deploy them on their own custom robot, what do you think would be some techniques or strategies they would have
854.3s
to use or understand to do something like that?
Yeah, that's a great question. Starting from the fundamental level, one has to do a very careful sys ID to measure, for example, the inertia and mass of the entire robot as well as the motors, and do a careful calibration of all the sensors, including
878.4s
the IMU, the encoders, and, if one has depth cameras, the cameras as well. Then one has to have a very robust low-level controller, which translates the higher-level RL policy into lower-level, higher-frequency torque control, and that has to be very reliable in terms of both the magnitude and the frequency
902.4s
at which the torque command is sent to the motor. Then one has to build a reasonable enough model in simulation, so that one can start an RL training session with a good model and, later, when people want to transfer it to real, there's a smaller sim-to-real gap.
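A low-level controller of the kind described above is often a PD law that turns the policy's joint position targets into clipped torque commands at a faster rate than the policy runs. The sketch below is a toy single-joint illustration under stated assumptions, not any vendor's SDK: the rates (a 50 Hz policy and a 500 Hz torque loop), the gains, the torque limit, and the point-mass joint model are all hypothetical values chosen for the example.

```python
def pd_torque(q_target, q, qd, kp, kd, torque_limit):
    """PD law: track a position target, damp velocity, clip to the torque limit."""
    tau = kp * (q_target - q) - kd * qd
    return max(-torque_limit, min(torque_limit, tau))


def run_control(policy, steps, policy_hz=50, control_hz=500):
    """Run a slow policy and a fast torque loop on a toy single joint."""
    decimation = control_hz // policy_hz  # torque steps per policy step
    dt = 1.0 / control_hz
    inertia = 0.05                        # hypothetical joint inertia, kg*m^2
    q, qd = 0.0, 0.0                      # joint position and velocity
    q_target = q
    log = []
    for step in range(steps):
        if step % decimation == 0:
            q_target = policy(q, qd)      # new position target at policy rate
        tau = pd_torque(q_target, q, qd, kp=20.0, kd=1.0, torque_limit=5.0)
        qd += (tau / inertia) * dt        # integrate the toy joint dynamics
        q += qd * dt
        log.append((q_target, tau, q))
    return log


# A stand-in "policy" that always asks for the same joint angle.
history = run_control(lambda q, qd: 0.5, steps=1000)
print(f"final joint angle: {history[-1][2]:.3f} rad")
```

The policy is queried once every `decimation` torque steps, which is the sense in which the low-level loop has to be reliable in both magnitude (the clipping) and timing (the fixed step).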
924.9s
So for the sys ID that you talked about, are there generally common methods or open-source methods that people like to leverage, or do they try to make their own?
I think it can be case-dependent, because for the mass you can just take a scale or something like that to
944.4s
measure the mass. I think the most important piece of the sys ID is actually the armature, which is the motor's rotor rotational inertia. That is actually very important; we found it to be one of the most important pieces of our sim-to-real pipeline. For the others, I think there are standardized
966.2s
procedures online one can resort to, but I think each robot has its own advantages and disadvantages, and one should be careful to characterize whether it's the motor curve that is really important or the inertia that is causing a lot of trouble for the sim-to-real deployment.
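One reason rotor inertia can matter so much for a geared actuator is that, seen from the joint side, it is scaled by the square of the gear ratio. The sketch below works through that standard relationship with hypothetical numbers; the rotor inertia, gear ratio, and link inertia are illustrative assumptions, not measurements from any robot mentioned here.

```python
def reflected_inertia(rotor_inertia, gear_ratio):
    """Rotor inertia as seen at the joint output: J_rotor * N^2."""
    return rotor_inertia * gear_ratio ** 2


# Hypothetical actuator: small rotor behind a 10:1 reduction.
rotor_inertia = 2e-5   # kg*m^2, assumed
gear_ratio = 10.0      # assumed
link_inertia = 5e-3    # kg*m^2, assumed inertia of the driven link

joint_side = reflected_inertia(rotor_inertia, gear_ratio)
share = joint_side / (joint_side + link_inertia)
print(f"reflected rotor inertia: {joint_side:.2e} kg*m^2")
print(f"share of total joint inertia: {share:.0%}")
```

Even a rotor that looks negligible on its own can account for a sizable share of the inertia at the joint once the reduction is included, so leaving it out of the simulation model changes the joint dynamics the policy is trained on.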
988.4s
So in terms of the accuracy of your calibration, how accurate do you think one has to be to get good enough results when deploying their RL models?
Yeah, I think it's again case by case. I would have to say as accurate as
1010.8s
one can get, so that during the training phase, when doing domain randomization, you can randomize within only a very small range. A very important thing is that during your calibration or sys ID you do multiple rounds, take the average, and also know the standard deviation
1033.0s
of your value, so that you can use the mean and the standard deviation to characterize the range for your domain randomization.
You mentioned frequency earlier. Can you dive into a little more detail about what you were saying for the frequency part?
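Returning to the calibration point above, turning repeated measurements into a domain randomization range can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the choice of a uniform distribution, the two-standard-deviation width, and the mass measurements are all hypothetical.

```python
import random
import statistics


def randomization_range(measurements, num_std=2.0):
    """Range centered on the measured mean, num_std standard deviations wide each side."""
    mean = statistics.mean(measurements)
    std = statistics.stdev(measurements)  # sample standard deviation
    return mean - num_std * std, mean + num_std * std


def sample_parameter(measurements, rng, num_std=2.0):
    """Draw one randomized parameter value for a training episode."""
    low, high = randomization_range(measurements, num_std)
    return rng.uniform(low, high)


# Hypothetical repeated measurements of one link mass, in kg.
mass_measurements = [1.02, 0.98, 1.00, 1.01, 0.99]
low, high = randomization_range(mass_measurements)
rng = random.Random(0)
print(f"randomize mass in [{low:.3f}, {high:.3f}] kg")
print(f"sampled mass: {sample_parameter(mass_measurements, rng):.3f} kg")
```

The tighter the repeated measurements agree, the narrower the range becomes, which matches the point that accurate calibration lets you randomize within only a small range.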
1049.9s
Sure. The higher-level RL policy is running at 50 Hz and the lower-level torque command is running at 500 Hz. So the robot itself has to consume torque commands at a relatively higher frequency, while on the higher level the RL policy is spitting
1054.2s
running at 50 HzT and the lower level uh
1058.1s
torque command is running at 500 hertz.
1061.7s
So the the robot itself has to consume
1065.8s
like a relatively higher snorts
1067.7s
frequency of torque command while on the
1070.6s
higher level the RL policy is spitting
1073.2s
out 50 Hz PD targets for
1077.8s
the SDK to convert
1082.5s
into the motor commands. So how how is
1085.2s
that typically determined like these
1087.5s
values 50 500 is that something uh that
1091.2s
was figured out through experiment or
1093.4s
was it through theory? How how did you
1095.4s
guys come to this conclusion?
1098.6s
Uh, I think it's, most of the time, by
1102.1s
convention. So for example, the Unitree SDK
1106.1s
would support something like a
1109.4s
500 Hz torque command, and for the RL
1114.8s
policy I think it's uh mostly trial and
1117.3s
error and uh what other people have been
1120.0s
using that has been
1122.2s
working reasonably well for them. We
1125.5s
will first try the value that is already
1129.3s
verified to be working stably. So what
1132.2s
what behavior would you typically see if
1135.0s
you went a little bit too high? Say your
1138.2s
RL was running at maybe 100 or even like
1141.2s
200. What sort of behavior might one see
1144.2s
if they were running too high? And what
1146.6s
if they went too low like maybe say like
1149.3s
10 hertz?
1152.2s
Mhm. Uh so I think there's a trade-off
1155.0s
between the compute time and uh the RL
1159.4s
like policy frequency. So the higher is
1163.7s
probably better in some sense that we
1166.3s
can react to the uh environment more
1169.9s
quickly. But of course, if
1174.2s
the policy is running at a higher
1176.1s
frequency then it consumes more uh
1179.2s
memory and power on the GPU and
1181.6s
sometimes, if we're
1183.7s
running too high, we will run into
1185.4s
a latency problem where the policy
1189.0s
command cannot actually be sent in real
1191.8s
time to the lower-level motor
1194.7s
command. If we're running it at a
1197.8s
very low frequency then there's the
1200.6s
problem that we're not reacting to the
1202.6s
environment fast enough. If the robot is
1204.9s
falling down and it cannot send the
1207.4s
policy command fast enough it might not
1209.4s
be able to recover uh immediately.
1212.4s
How about in terms of, so you
1214.6s
mentioned a lot about like reaction
1216.1s
time. How about in terms of like the
1219.0s
model or robot uh stability? Have you
1222.4s
seen any trends in the stability whether
1225.8s
it's higher or lower uh frequency?
1229.4s
Uh yeah that's a great question. Uh our
1232.0s
intuition is that the lower the
1234.3s
frequency the more stable the training
1236.2s
is because we are actually uh exploring
1240.0s
on a smoother manifold, whereas
1244.1s
if it's a higher frequency uh one might
1247.7s
be like sampling around a smooth curve
1250.9s
uh in a noisy way. So the command is
1253.8s
actually like pretty uh zigzag and
1256.0s
noisy. So that is actually a
1260.3s
harder exploration problem in some sense
1262.4s
to actually stabilize your robot. So I
1265.8s
think finding the balance between the
1268.3s
smoothness and reactivity is what
1271.6s
drives us to 50 Hz.
1274.3s
I see. How about in terms of so like at
1277.7s
50 Hz, you know, your desired points are
1281.9s
kind of spaced apart. So in practice, do
1285.8s
you think it's good enough to send those
1288.4s
points at 50 Hz or do you need something
1292.1s
that's in between your desired commands
1295.1s
that goes to your low level that does
1297.0s
any interpolation? Do you think is there
1300.0s
any interpolation that needs to happen
1301.6s
or is that happening in the lower level
1303.8s
controllers?
1305.8s
Yeah, so that interpolation is actually
1307.8s
on the lower level. So we're essentially
1310.7s
uh setting PD position target for the
1314.4s
higher level RL policy which is actually
1318.5s
uh implicitly doing torque control using
1321.0s
this PD target uh that we interpolate
1324.7s
and translate the position command into
1329.0s
torque command using the PD
1331.6s
relationship.
1332.7s
So their lower level controller would
1334.7s
you say is like a position P. So you're
1338.0s
taking a input as position and then the
1340.6s
position P converts it to torque. Do
1343.0s
they have like a cascaded type
1346.6s
controller where there's a position
1348.2s
velocity current or is it just position
1351.7s
current? Is that some what's your
1354.5s
understanding of their current setup?
1357.9s
Um, so for example, the Unitree
1361.6s
SDK is kind of like a
1365.2s
black box to us. So our best
1368.5s
understanding is that they are trying to
1370.9s
use the PD relationship to translate the
1374.8s
position uh target into torque command.
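A minimal sketch of that two-rate setup, assuming linear interpolation between policy targets and a plain PD law. The gains, the fixed joint state, and the function names are illustrative assumptions; the actual SDK internals are described in the interview as a black box.

```python
POLICY_HZ = 50      # high-level RL policy rate mentioned in the interview
CONTROL_HZ = 500    # low-level torque command rate mentioned in the interview
SUBSTEPS = CONTROL_HZ // POLICY_HZ  # 10 low-level ticks per policy step

def pd_torque(q_target, q, qd, kp, kd):
    """PD law: torque from position error minus velocity damping."""
    return kp * (q_target - q) - kd * qd

def interpolate_targets(prev_target, new_target, substeps=SUBSTEPS):
    """Linearly blend from the previous policy target to the new one."""
    return [prev_target + (new_target - prev_target) * (i + 1) / substeps
            for i in range(substeps)]

# One policy step for a single joint, with made-up gains.
# The joint state is held fixed here; a real loop would re-read q and qd each tick.
targets = interpolate_targets(prev_target=0.0, new_target=0.1)
torques = [pd_torque(t, q=0.0, qd=0.0, kp=20.0, kd=0.5) for t in targets]
```

With these rates, each 50 Hz policy output is spread over ten 500 Hz ticks, so the torque ramps up instead of jumping at every policy step.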
1378.5s
But sometimes there is weird behavior
1381.1s
on the robot. So that might be something
1383.8s
in the black box that is not
1387.1s
performing as we expected. So that that
1390.1s
might also create some gap. So most of
1393.0s
the experiments you have done was in uh
1395.9s
position control. Is that right?
1399.0s
Have you guys played with torque control
1400.9s
directly from your model?
1403.8s
Uh not really. The reason is torque
1406.7s
control is uh much less forgiving and it
1410.6s
it should be sent at a very high
1412.7s
frequency. So if there are any
1417.7s
imperfections in the command,
1421.1s
it will be actually amplified by torque
1423.5s
uh torque command versus if it's just PD
1426.4s
target, we do an interpolation at a
1429.8s
slower frequency, so that
1432.7s
will not be amplified as much as
1435.2s
the higher frequency torque control. So,
1437.8s
I'm just curious because like when you
1439.4s
have a robot arm that you're tuning, if
1441.8s
it's like let's say if your arm is in
1444.4s
the vertical position and it's like
1447.4s
swinging back and forth versus if the
1449.8s
arm is like horizontal and swing up and
1452.6s
down, the range of the robot it's in
1456.0s
would be very, the gains that
1458.9s
were tuned for the different position
1461.1s
would be very different because of the
1463.1s
load it's seeing. So if you're
1465.4s
controlling it in position and um I'm
1469.0s
assuming you have the same gains for a
1471.3s
different position, how does your robot
1474.4s
usually still behave with similar
1477.9s
response in the different positions?
1481.8s
Yeah, that's a great question. So we are
1484.7s
indeed using the same gain but uh
1487.8s
essentially like very small PD gains for
1491.9s
for all the motors. That
1494.6s
essentially uh like decreases the sim to
1497.6s
real gap because it's like a more gentle
1501.6s
response to the commands we send. And
1504.8s
I think for all of our experiments, even
1507.7s
including the wall flip and more agile
1510.2s
behavior these gains work perfectly fine
1513.0s
for all the motions.
1514.6s
So do you think the RL model somehow can
1517.9s
help compensate some of these
1520.8s
differences? Do you think that's
1523.0s
what's happening because or what's
1525.3s
what's your thought on what's happening?
1528.4s
Um I think both the hardware is getting
1531.2s
better that there if we have like a
1533.6s
small gain that they can uh uh they can
1537.4s
do reasonably well at sending the right
1539.6s
command, and also during RL
1542.5s
training we're like randomly pushing the
1544.6s
robots uh for domain randomization. So
1548.1s
so even if the gains are
1552.9s
not reflecting the actual
1556.5s
torque
1557.5s
on hardware, when we're randomly
1560.2s
pushing the robots
1562.3s
in simulation, the robot
1564.2s
actually experiences a little bit of
1566.6s
that variation in sim as well. So it has
1569.8s
been trained to see like sort of
1572.5s
different motor uh perturbations uh
1576.1s
during simulation.
1577.5s
So in your setup when you're doing the
1580.2s
simulations um what what specific tool
1583.2s
sets or framework were you using to do
1586.0s
your simulation?
1588.2s
Uh yeah, so we are using Isaac Lab as the
1590.9s
training framework, which is running
1593.0s
on Isaac Sim for the lower-level
1596.3s
simulation engine. Is there a reason why
1598.5s
you guys went with Isaac Sim and Isaac Lab
1600.8s
or was it like a choice that the team
1602.8s
has made already?
1604.6s
Uh yeah, so I think it's both, because
1607.8s
Isaac Lab is highly parallelizable and it
1611.5s
has support for distributed training and
1614.1s
so on and it has very good rendering. So
1618.4s
later if we want to move on to vision
1621.4s
based RL or uh locomotion uh whole body
1625.6s
control it will have uh relatively good
1629.0s
support for rendering for vision as
1632.4s
well.
1633.0s
How about with MuJoCo? Is
1635.3s
that something you have played with or
1637.8s
um any thoughts on that as a simulator?
1641.6s
Uh yeah, so MuJoCo is higher
1645.7s
in fidelity for the simulation models.
1649.9s
but I think it's relatively uh slower
1653.8s
and it doesn't have as good
1657.2s
rendering support as Isaac,
1660.4s
but I want to note that we're using
1663.0s
MuJoCo for sim-to-sim validation. Oh
1666.6s
uh which is saying that
1668.4s
Yeah. Oh, so which is uh saying that we
1670.7s
are training the RL policies in Isaac and
1675.0s
before deploying it directly on the real
1677.7s
robot, we actually run that policy in
1680.5s
MuJoCo to test if
1683.6s
the policy with the higher-fidelity
1686.2s
dynamics performs well enough in MuJoCo,
1689.1s
and if that's robust enough in MuJoCo, then we
1692.0s
deploy it onto the real hardware.
1694.7s
So uh can you go in detail about what
1697.1s
you mean by higher fidelity? Is it like
1699.0s
better physics calculation, or what
1701.0s
do you mean by that?
1702.9s
Yeah, so MuJoCo supposedly has
1706.3s
more accurate contact
1709.7s
modeling. So there are different
1712.4s
kinds of simulators which have
1714.6s
different trade-offs, mostly the
1717.4s
trade-off between uh simulation accuracy
1720.6s
versus the computation speed. So I would
1724.1s
say Isaac is on the higher-throughput
1728.7s
end of the spectrum, while MuJoCo is
1733.2s
sort of in the middle, where it has
1736.2s
high enough fidelity and
1740.3s
a reasonable parallelization
1742.6s
framework.
1743.8s
Yes. So on the other end of the
1746.2s
spectrum I would say Drake is probably
1748.6s
one of the most accurate simulators,
1752.5s
where uh it is actually solving
1754.9s
optimization problems at each time step
1757.2s
to simulate the contact dynamics but uh
1760.8s
it's not GPU uh supported and uh it's
1765.4s
hard to parallelize. So there is a a
1769.0s
spectrum of different simulators which
1771.1s
have different trade-offs um between
1773.8s
computation time and simulation
1775.9s
fidelity. So when you go from Isaac to
1779.5s
MuJoCo for example,
1781.9s
um have you had specific experiences
1786.0s
where when you did do that sim-to-sim
1788.7s
validation where you were able to go
1791.6s
back and update your model somehow based
1793.8s
on how it performed?
1797.2s
Um yeah so I would say uh because we are
1801.8s
doing sys ID well enough and we
1806.0s
randomize reasonably in training
1810.2s
in Isaac Sim, the dynamics gap between
1813.5s
Isaac and MuJoCo in our current
1816.2s
pipeline is not that huge.
1817.8s
Okay.
1818.2s
But the sim-to-sim pipeline also helps
1821.5s
us debug some other problems.
1824.0s
So for example, what should be the
1827.0s
camera latency if we're adding the
1829.1s
vision um into the loop? Since we're
1832.2s
doing sys ID carefully enough, and for
1836.4s
locomotion there is not too much of a gap
1838.9s
in dynamics between Isaac and
1842.6s
MuJoCo. So most of the time the dynamics
1845.4s
gap is not as big when we do the
1848.2s
sim-to-sim validation; rather, it's some edge
1851.7s
cases that we were not able to detect in
1854.3s
Isaac. So for example, the
1858.0s
vision latency in MuJoCo
1861.8s
can be more realistic than in Isaac, so
1864.8s
that we can detect what is the
1867.8s
right vision latency we should add in
1870.4s
the training process to actually
1872.1s
compensate for this discrepancy.
1873.8s
So you're saying like MuJoCo has uh
1876.9s
longer latency. Is that what you mean by
1879.0s
more accurate?
1881.0s
Um so uh in training when we render the
1885.0s
vision in Isaac, the simulation will
1888.6s
actually pause to let the
1892.7s
simulator render the vision, but
1895.7s
during the deployment in MuJoCo, when
1898.6s
we're actually running the policy it
1901.0s
will actually not wait for the simulator
1903.2s
to render the image so there will be
1905.5s
some sort of latency in that regard.
1907.9s
Okay. So is it possible to create a fake
1912.1s
latency in Isaac to simulate that
1914.3s
behavior or is it pretty hard to
1917.5s
Yeah. So what we do is that uh we create
1921.2s
a buffer of the sensor uh readings and
1926.6s
we do kind of a sys ID on the real robot to
1930.3s
see what the latency is for like each
1933.4s
sensor, and then we choose the reading
1938.2s
in the buffer which is around that time
1940.8s
range. Hm. So if you're able to do that
1943.8s
then technically would you still have to
1946.7s
do this MuJoCo verification if you're
1949.1s
able to create a very realistic latency
1952.7s
in Isaac or do you think that step would
1955.4s
still be necessary?
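The sensor buffer described a moment ago could look roughly like the sketch below. The class name `DelayedSensor`, the 60 ms latency, and the 50 Hz step are assumptions chosen for illustration, not details of the actual pipeline.

```python
from collections import deque

class DelayedSensor:
    """Keep recent readings and return the one closest to `latency_s` ago."""

    def __init__(self, latency_s, dt, max_len=64):
        # Number of simulation steps that best matches the measured latency.
        self.delay_steps = round(latency_s / dt)
        self.buffer = deque(maxlen=max_len)

    def push(self, reading):
        self.buffer.append(reading)

    def read(self):
        # Assumes at least one reading was pushed; falls back to the
        # oldest reading until the buffer holds enough history.
        index = max(len(self.buffer) - 1 - self.delay_steps, 0)
        return self.buffer[index]

# Hypothetical camera with 60 ms latency in a sim stepping at 50 Hz (dt = 0.02 s).
camera = DelayedSensor(latency_s=0.06, dt=0.02)
for step in range(10):
    camera.push(step)      # the "reading" is just the step index here
delayed = camera.read()    # reading from 3 steps before the latest one
```

The latency value itself would come from measurements on the real sensor, one value per sensor, as described in the answer.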
1957.7s
Uh yeah so I think uh another advantage
1961.4s
of MuJoCo, in addition to
1964.7s
verifying that latency, is running the
1968.2s
deployment code in simulation first. So
1971.0s
it actually might be kind of
1975.0s
complicated to run the deployment
1977.2s
code in Isaac directly, but in MuJoCo
1980.2s
it's a much more direct interface
1983.5s
for us to uh deploy our uh inference
1986.6s
code. So in addition to testing the
1989.4s
vision discrepancy and dynamics gap,
1992.0s
there's also the layer where we are
1995.3s
testing our deployment code in
1997.2s
simulation. What what's the main
1998.9s
challenge with um running inference in
2001.4s
Isaac?
2002.9s
I think there is just a lot of
2005.0s
abstraction layers in Isaac. Um, and it's
2009.0s
less direct of an interface than MuJoCo
2012.2s
for us to deploy our
2018.0s
inference code
2019.4s
because in MuJoCo they let you like send
2021.9s
direct position or torque commands just
2025.4s
based on you know how you set up your
2027.6s
XML file, right? So it's like a pretty
2030.5s
straightforward way to command it.
2033.9s
Right. That's kind of what you mean.
2037.2s
Mhm. Okay.
2038.8s
Okay. Cool. Um, so in general, if
2041.1s
someone is like trying to learn more
2043.8s
about RL, whether it's like the
2046.5s
application side or theory side, what
2049.0s
would you say is a good starting point
2051.1s
for someone to kind of get into the
2053.5s
topic?
2056.2s
So for the theory side, Richard
2059.0s
Sutton's Reinforcement Learning: An
2061.4s
Introduction is one of the primary books.
2064.4s
And, for example, for
2067.0s
people from a controls perspective,
2069.4s
Dimitri Bertsekas's Dynamic Programming
2072.2s
and Optimal Control would be a very
2075.8s
control-oriented way to explain
2079.0s
the RL concept and there are also some
2081.8s
like Berkeley courses taught by
2083.6s
Professor Sergey Levine on
2086.2s
reinforcement learning and deep
2088.2s
learning that one can also watch
2090.7s
online. So I think for the theory side
2093.4s
there are a lot of resources either it
2095.2s
be online videos or books, that one can
2098.7s
refer to, and for the practical side I
2101.7s
think it's reading others' codebases for
2105.5s
RL deployment and try to adapt them uh
2109.0s
for one's own use case so that one can
2111.6s
get more and more hands-on experience on
2114.5s
the training and uh deployment.
2118.5s
Very nice. Those are some really good
2120.0s
useful resources. So um definitely look
2122.2s
into those. Um I know we'll probably
2125.3s
spend the second half of this or not
2128.1s
second half but this second part talking
2130.1s
about some of your specific applications
2132.2s
that you worked on. So you know a lot of
2134.1s
the new cutting edge work is probably
2136.6s
you know the research that you've done
2139.1s
on like retargeting. So maybe we could
2142.2s
start looking into that topic
2144.8s
right now. And maybe just for those that
2146.9s
have never heard of kinematic
2148.7s
retargeting, can you give like a high-level
2151.4s
overview of exactly what that is and
2154.4s
what problem you're trying to solve?
2156.6s
So kinematic retargeting is basically
2159.5s
transforming human motions onto robot
2162.0s
motions. Since the humanoid robot looks
2165.4s
very much like the human, we want to
2167.8s
reuse the human motions to direct our
2170.5s
search for how we command the robot. So
2174.9s
say here's a task of human picking up
2177.7s
the box and we want to transfer the same
2180.2s
motions onto the robots picking up the
2182.2s
box. So there are some standard ways of
2184.8s
doing so such as defining some key
2187.8s
points on the robot and the
2190.6s
human um and try to match the absolute
2193.8s
position between the two. But there are
2196.6s
some problems of doing so. Uh for
2199.0s
example, because the humanoid can be
2202.0s
much shorter than the human, this direct
2205.5s
scaling and translation matching
2208.4s
will result in some penetration.
2211.0s
And here we're using some technique to
2213.6s
avoid this issue.
2215.0s
Penetration you mean like
2217.1s
uh in going into itself? Is that what
2219.1s
you mean?
2220.7s
Uh yes. So let me try to show a direct
2224.2s
example. something like this. So the
2226.6s
keypoint matching is the technique I was
2229.4s
describing as the standard technique
2231.8s
for the human-to-humanoid
2235.1s
kinematic retargeting pipeline, which is
2237.8s
essentially choosing some key points on
2239.8s
the human and try to match the same set
2243.8s
of semantic key points on the robot to
2246.3s
the absolute position of these key
2248.5s
points on the human. Um and this because
2253.0s
the humanoid can be much shorter than the human, directly matching these absolute positions can result in penetration with the object. For example, something like this. This is essentially a very direct result of the different scales of the human and the robot, because imagine the human is
2278.5s
say, 1.8 m, while the humanoid we're using is 1.3 m. Picking up the same box will actually result in different relative scales for the human and the robot. So directly doing this key point matching will result in artifacts like penetration.
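
The scale mismatch described above can be checked with a little arithmetic. This is only a sketch: the two heights come from the conversation, while the box width, the hand positions, and the one-dimensional setup are assumptions chosen for illustration, not the speaker's actual pipeline.

```python
# Heights are taken from the conversation; everything else is assumed.
HUMAN_HEIGHT_M = 1.8
ROBOT_HEIGHT_M = 1.3
BOX_HALF_WIDTH_M = 0.20  # assumed: a 40 cm wide box centered at x = 0

# The human's hands rest exactly on the left and right faces of the box.
human_hand_x = {"left_hand": -BOX_HALF_WIDTH_M, "right_hand": BOX_HALF_WIDTH_M}

# Naive retargeting: scale every human key point by the height ratio.
scale = ROBOT_HEIGHT_M / HUMAN_HEIGHT_M
robot_hand_x = {name: scale * x for name, x in human_hand_x.items()}


def penetration_depth(hand_x: float, half_width: float) -> float:
    """Distance a hand sits inside the box along x (0 if on or outside a face)."""
    return max(0.0, half_width - abs(hand_x))


# The box is the same physical box for the robot, so its size is not scaled.
depths = {
    name: penetration_depth(x, BOX_HALF_WIDTH_M) for name, x in robot_hand_x.items()
}
for name, depth in depths.items():
    print(f"{name}: target x = {robot_hand_x[name]:+.3f} m, penetration = {depth:.3f} m")
```

Because the box keeps its real size while the hand targets shrink with the robot, both hand targets end up inside the box, which is the penetration artifact being discussed.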
2299.3s
So you talk about going from key points on a human. Is the main idea to get videos of people? Are you trying to utilize the whole internet's data to do some of this? What's the bigger picture idea that this method would end up being used for?
2319.2s
Yeah, that's a great question. Currently we're using motion capture data, which is essentially a human demonstrator wearing a very specialized mocap suit in a specialized room with cameras that can accurately identify the position of the human. But this sort of data is very
2341.4s
expensive, and ultimately we want to utilize the videos of the entire internet to teach the robots how to do things. But there are some challenges in this regard. The 3D reconstruction of humans and objects from video is a very non-trivial research topic. Some of the
2364.6s
problems involve the human root kind of floating in the air and going back and forth. So how to extract robust, reliable and realistic data from video is a very challenging but also interesting research topic. I see. So I guess you guys are kind of
2387.5s
assuming that the video-to-model part is handled, right? And you're focusing more on the case where you already have the key points. Is that right? Exactly. Yeah. Okay. So you were mentioning that if the robot is smaller, then you're trying to focus on going from a bigger
2409.4s
human, where the key points are bigger, to something smaller. Do you think your current method can also work in reverse? If the human is smaller but the robot is bigger, can it also handle those cases? Is it general enough to do that? Absolutely.
2425.1s
Yeah, absolutely. So the way our method works is, let me try to share my screen again. We build something called an interaction mesh: in addition to defining key points on the human, we also define key points on the object. Let me give a more concrete example.
2448.9s
Here is a human picking up the box, and we want to transfer that motion to the robot picking up the box. As I mentioned, we select some semantically important key points on the human and the same set of key points on the robot. We also define key points on
2467.7s
the object and use the same set of object key points for the robot as well. Then we build an interaction mesh, which is a volumetric structure that captures the relative position information between the human and the object. Here is what the mesh looks like.
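
One common way to encode "relative position information" on such a mesh is with Laplacian coordinates: each key point is described by its offset from the average of its mesh neighbors. The conversation does not say which encoding is used, so the sketch below is an assumption for illustration. The key point names, coordinates, and the single-tetrahedron mesh are hypothetical; a real volumetric mesh would have many more points and would be built automatically, for example by a tetrahedralization.

```python
from itertools import combinations

Point = tuple[float, float, float]
Edges = set[frozenset[str]]


def laplacian_coordinates(points: dict[str, Point], edges: Edges) -> dict[str, Point]:
    """Offset of each key point from the centroid of its mesh neighbors."""
    neighbors: dict[str, list[str]] = {name: [] for name in points}
    for edge in edges:
        a, b = tuple(edge)
        neighbors[a].append(b)
        neighbors[b].append(a)
    coords = {}
    for name, p in points.items():
        nbrs = neighbors[name]
        centroid = [sum(points[m][i] for m in nbrs) / len(nbrs) for i in range(3)]
        coords[name] = tuple(p[i] - centroid[i] for i in range(3))
    return coords


def deformation_energy(source: dict[str, Point], target: dict[str, Point], edges: Edges) -> float:
    """Sum of squared differences between source and target Laplacian coordinates."""
    ls = laplacian_coordinates(source, edges)
    lt = laplacian_coordinates(target, edges)
    return sum((ls[k][i] - lt[k][i]) ** 2 for k in source for i in range(3))


# Hypothetical scene: two hands on the side faces of a box, plus two box key points.
human = {
    "left_hand": (-0.2, 0.0, 1.0),
    "right_hand": (0.2, 0.0, 1.0),
    "box_top": (0.0, 0.0, 1.1),
    "box_bottom": (0.0, 0.0, 0.9),
}
# Four points form one tetrahedron, so every pair of points shares an edge.
edges = {frozenset(pair) for pair in combinations(human, 2)}

scale = 1.3 / 1.8        # robot height / human height, from the conversation
box_z = scale * 1.0      # the box is held lower but keeps its real size

# Candidate 1: hands stay on the box faces (relative layout preserved).
contact_preserving = {
    "left_hand": (-0.2, 0.0, box_z),
    "right_hand": (0.2, 0.0, box_z),
    "box_top": (0.0, 0.0, box_z + 0.1),
    "box_bottom": (0.0, 0.0, box_z - 0.1),
}
# Candidate 2: hand positions are naively scaled, so they sink into the box.
naive_scaled = dict(contact_preserving)
naive_scaled["left_hand"] = (-0.2 * scale, 0.0, box_z)
naive_scaled["right_hand"] = (0.2 * scale, 0.0, box_z)

energy_contact = deformation_energy(human, contact_preserving, edges)
energy_naive = deformation_energy(human, naive_scaled, edges)
print(f"contact-preserving energy: {energy_contact:.6f}")
print(f"naively scaled energy:     {energy_naive:.6f}")
```

Under these assumptions, a pose that keeps the hands on the box faces has essentially zero energy, while the naively scaled pose is penalized, so minimizing this energy favors poses that keep the hand-to-object relationship intact.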
2494.4s
As we can see, it captures not only the relationships among the human joints themselves but also how they relate to the object. In this example, the human's right hand is touching the right face of the object, and we want the robot's right hand to also touch the right face of the object.
2516.9s
That is actually captured by the graph structure, which preserves the relative spatial information in this motion. Is there a minimum number of points you need for the object to get a full understanding of it? That's a really good question. Ideally we want the contact points
2543.5s
between the human and the object. So, say, in this case there might be two key points, which are the left hand and the right hand touching the surface of the box. But in order to make the algorithm more robust, we actually select more key points than
2563.4s
that. We essentially randomly sample, say, 20 to 50 points on the object to preserve the relationship between the human or robot and the object.
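
A minimal sketch of what randomly sampling object key points could look like for the box example. The box dimensions, the area-weighted sampling scheme, and the function name `sample_box_surface` are assumptions for illustration; the conversation only states that roughly 20 to 50 points are sampled on the object.

```python
import random

Point = tuple[float, float, float]


def sample_box_surface(
    half_extents: tuple[float, float, float], num_points: int, rng: random.Random
) -> list[Point]:
    """Sample points uniformly by area on an axis-aligned box centered at the origin."""
    hx, hy, hz = half_extents
    # The two faces perpendicular to each axis have area proportional to the
    # product of the other two half extents.
    face_pair_weights = [hy * hz, hx * hz, hx * hy]
    points = []
    for _ in range(num_points):
        axis = rng.choices([0, 1, 2], weights=face_pair_weights)[0]
        point = [rng.uniform(-h, h) for h in half_extents]
        # Snap the chosen axis onto one of the two opposite faces.
        point[axis] = rng.choice([-1.0, 1.0]) * half_extents[axis]
        points.append((point[0], point[1], point[2]))
    return points


rng = random.Random(0)             # fixed seed so the sketch is repeatable
half_extents = (0.20, 0.15, 0.10)  # assumed box of 40 x 30 x 20 cm
object_key_points = sample_box_surface(half_extents, num_points=30, rng=rng)
print(len(object_key_points), "object key points sampled")
```

These sampled points would then join the human (or robot) key points as vertices of the interaction mesh.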
2578.5s
Okay. So right now we see an example with a box. How well do you think your current method could extend to, you know, more deformable or
2590.1s
organic-looking objects like, you know, maybe a pillow, maybe a teddy bear, or even a blanket? How would this method extend to something like that? I think this retargeting method will be directly applicable to all kinds of different objects, including deformables, because essentially
2613.3s
we're capturing the relationship between the human and some points on the object. As long as we can define the key points, whether they are contact points or semantically meaningful key points, our retargeting pipeline can be directly transferred. Okay, very cool.
2633.1s
So I know the general trend that we're seeing in robotics is, you know, there are still a lot of companies that have very specialized models that do very specific tasks, but then there are also companies like Tesla and some others that are trying to do more of an end-to-end model
2653.6s
where, you know, the only input is the video they see of the world and the output is the motor actions for, let's say, a very high-level task. Go clean the room or, you know, go get me coffee, right? Something at that high a level. Do you feel like in the
2672.7s
future that is the direction robotics is headed, where there's such a general model that can do anything? Or do you feel like we still need very specialized models that do very specialized tasks? This is a great question and a very hot topic that both industry and academia have been
2693.7s
debating a lot. I personally would lean more towards a generalist policy. The reason is that multiple skills trained into the generalist policy might be able to transfer and help each other generalize. And the advantage people believe generalist policies have is that they can learn
2718.5s
something like common sense or intuitive physics, which is roughly a model of how the world will react, and that can actually transfer across different tasks. So once the model gains this common sense, it will be able to transfer more easily to a new task it has never seen
2742.2s
before. So, do you think these models can eventually get down to millimeter precision? For example, very hard tasks like surgery, maybe, or assembling PCBs, putting, you know, IC parts onto a PCB. Do you think these models can
2769.6s
eventually get to that level of precision, or do you think they probably still need more of the typical robot programming that we think of, where you program exact positions? What's your thought on that? Yeah, that's a great question. I think it might be hard for these
2790.7s
generalist policies to perform millimeter-accuracy tasks directly out of the box. But if we collect a small number of demonstrations on these specific tasks and do post-training, that is, refine the generalist policy on our specific task with in-domain, task-specific data, I think we will be able to achieve
2817.2s
very high accuracy. The traditional, classical methods, like scripting the robot arms, can achieve very high fidelity, but they are more brittle to long-tail problems, and they might do less well in terms of vision and more semantic reasoning, where these generalist policies might be able to learn
2844.0s
from failures or recoveries in the other skills and try to directly recover from something that hasn't been seen in the classical scripting kind of method. So I know you mentioned having more data for those specific cases. In general, what do you feel is stopping us from having robots in our
2868.8s
houses that work? Do you feel like that's more of a data problem, or is it more of a model and architecture or even a robot hand development problem? What's your current take on that? Yeah, I think the current obstacles are multifold. I would say the humanoid hardware is relatively more
2896.9s
robust than, say, the hand hardware. So the sim-to-real gap is smaller and the motor control is more precise. But of course, in addition to the hardware problem there is also the software problem, and for a lot of researchers the core of the software problem is the data problem. So I
2919.1s
think one of our hypotheses is that if we have enough high-quality data, maybe the training architecture and policy architecture don't matter as much. So we are essentially trying to control the quality of the policy output by controlling the data quality
2939.7s
directly. So if we can get very high-quality data for the robots, I think it will be a very important improvement for the reliable deployment of these robots. So right now a lot of people are, you know, either manually collecting data or using data from simulations. I know Nvidia recently
2969.0s
has been pushing Cosmos, which is their AI data. Basically they're generating video data synthetically, and they can augment it and transfer it to different scenes. For example, whether it's cloudy or sunny, they can augment all of that. Do you think that is the right approach to getting more data, or do you think
2993.1s
there are different ways someone should be focusing on to get more data? Yeah, that's a great question. I think we should actually leverage data from all different kinds of sources, whether it's the most expensive but arguably the highest-quality data, which is the teleoperation data on the
3016.0s
real robot, or the simulation data, which we can generate at large scale but which always has the sim-to-real gap. There are also other kinds of data, for example internet video data and world model data, that have rich semantic and visual
3037.1s
features for the video models but might have less action data. So I think different data sources have their own advantages and their own specialized target areas. And combining these different sources of data to enable both control and dynamics accuracy as well as semantic
3065.4s
and visual understanding is, I think, a very interesting and promising topic. So I guess if you were to put an allocation on it, like if you imagine a pie chart and you were to allocate the percentage for each category that you mentioned, just roughly speaking, how would you
3086.1s
categorize each part of the pie for the different types of data? Yeah, since I myself am working on data generation in sim and doing kinematic retargeting, my answer will obviously be skewed towards using simulation data. It's very scalable and it can give us reasonably accurate dynamics
3110.2s
and action data. So I would say I would allocate half of the effort to generating simulation data, and the other half is split between video and real-world teleoperation data. I think video data is also more scalable than real-world teleoperation, because for teleoperation
3136.1s
you always need a human operator to operate the robots. It can be time-consuming, it can cause fatigue for the human operator, and it wears the robot hardware, right? For video, we have the entire internet, and we have, like you mentioned, Cosmos, the generative
3157.8s
video models, the world models, that can generate essentially endless video data for us to capture the visual dynamics, semantic features and so on, so I think that one is more scalable. So my personal take is to spend as much effort as possible on the scalable methods, including
3182.0s
simulation and video, and also allocate a reasonable amount to real data to actually close the sim-to-real gap. There's been a lot of talk about, you know, traditional and newer ways of doing RL. Can you dive into some of the details of that and maybe the differences between the two?
3203.0s
Yeah. I think I want to give kinematic retargeting as an example of doing things from both the classical optimization-based, model-based perspective and the more learning-based perspective. My very first background is actually in optimization and model-based control, and that actually laid the foundation for me
3231.5s
to write OmniRetarget, which is a constrained, optimization-based kinematic retargeting pipeline. This sort of optimization-based pipeline enables something that is not quite achievable by a learning-based pipeline. Because we're reasoning about hard kinematic constraints in an optimization fashion, we can enforce higher quality. So we can actually enforce hard
3261.0s
constraints that a learning-based policy won't be able to enforce. So, say, we don't want penetration of the object. We don't want the joints to exceed their hard limits. We don't want the velocity to exceed a certain threshold. For us, we can write these as hard constraints in the optimization
3280.6s
program, whereas in the more learning-based methods, people normally put them as soft penalties in the cost or reward and then try to optimize it. Sometimes it's not guaranteed that these hard constraints are actually enforced. So there might be a little bit of penetration or joint limit violation if
3304.2s
we're doing this kind of soft penalty.
3306.9s
I see.
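
The difference between a soft penalty and a hard constraint can be seen in a one-joint toy problem. This is a sketch under stated assumptions, not the speaker's actual formulation: the reference angle, the joint limit, the penalty weights, and the function names `solve_soft` and `solve_hard` are all hypothetical.

```python
Q_REF = 2.0  # assumed joint angle (rad) requested by the human reference motion
Q_MAX = 1.5  # assumed upper joint limit (rad) of the robot


def solve_soft(q_ref: float, q_max: float, weight: float) -> float:
    """Minimize (q - q_ref)^2 + weight * max(0, q - q_max)^2 in closed form."""
    if q_ref <= q_max:
        return q_ref  # the reference is feasible, so the penalty is inactive
    # Setting the derivative to zero for q > q_max gives a weighted average.
    return (q_ref + weight * q_max) / (1.0 + weight)


def solve_hard(q_ref: float, q_max: float) -> float:
    """Minimize (q - q_ref)^2 subject to q <= q_max (projection onto the limit)."""
    return min(q_ref, q_max)


soft_solutions = {w: solve_soft(Q_REF, Q_MAX, w) for w in (1.0, 10.0, 100.0)}
for weight, q in soft_solutions.items():
    print(f"soft penalty, weight {weight:>5}: q = {q:.4f}, violation = {q - Q_MAX:.4f}")

q_hard = solve_hard(Q_REF, Q_MAX)
print(f"hard constraint:            q = {q_hard:.4f}, violation = {max(0.0, q_hard - Q_MAX):.4f}")
```

In this toy problem the soft penalty shrinks the violation as the weight grows but never removes it for any finite weight, while the hard-constrained solution sits exactly on the limit.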
3307.5s
But by using the hard-constrained, optimization-based pipeline, we're able to enforce these hard constraints very systematically and rigorously, so that we have higher-quality data to then be consumed by downstream learning paradigms. So do you think it's possible to take both traditional optimization
3328.6s
and RL-based methods together, or do you think it's more of an either-or type of situation? Yeah, so the combination of both is actually my goal. Upstream, I'm using this optimization-based, hard-constrained formulation to generate high-quality data, and downstream we're
3352.6s
training our policies to track this high-quality data. So the combination of very rigorous, high-quality data generation plus massively parallelizable RL training is, I think, a very promising paradigm. Hm. So you're saying the main hard constraint is more like a high-level loop closure, in a way, that's making sure the robot doesn't
3380.4s
break these constraints. Is that how you would describe it? Or, for the general audience, how would you describe how that's able to keep everything, you know, under the main constraints that you want? Yeah, I would say if we want to enforce important hard constraints, and specifically for the
3404.7s
army project in a kinematic level uh which is just the robots abain its uh morphological constraints we can do it in a systematic optimization based way and later when we translate into physically uh plausible dynamically plausible uh behavior we want to leverage the uh large scale uh parallelization in simulation in Isac to
3408.4s
uh which is just the robot obeying its uh
3412.0s
morphological constraints we can do it
3414.4s
in a systematic optimization based way
3417.1s
and later when we translate into
3419.9s
physically uh plausible dynamically
3422.4s
plausible uh behavior we want to
3425.3s
leverage the uh large scale uh
3428.3s
parallelization in simulation in Isaac to
3431.4s
do this translation. So it so like on a
3434.5s
higher level if we can split the
3437.1s
problems into two phases where like uh
3440.8s
smaller number of computation but but
3444.4s
requires higher quality uh we can do it
3447.4s
with the model-based strategy but uh for
3450.2s
the uh lower fidelity requirement and uh
3454.6s
and like massively parallelizable uh
3457.4s
setup we can use the learning paradigm.
3459.8s
But when you're trying to combine the
3461.8s
two, um I guess if you were to try to
3464.7s
describe the architecture of your
3467.7s
program, like how would you lay out the
3471.2s
pieces? So like for example, I'll just
3473.4s
give you like a simple example of what I
3475.2s
mean by that. Like when you have uh like
3477.7s
a RL model for example of a robot
3479.8s
walking, um the blocks I would describe
3483.1s
might be like on the left I have like my
3486.6s
RL. Okay, like on the very left I might
3488.7s
have a block that's like my desired
3490.7s
trajectory and the input of that goes to
3493.6s
some RL inference model that's computing
3496.9s
the desired torque and then to the right
3499.3s
of that might be feeding it to the
3501.2s
torques of the actuator. So if you were
3503.8s
to kind of describe in like a block
3506.1s
diagram level structure of a hybrid
3509.7s
approach where you have your um
3512.6s
traditional optimization technique and
3514.7s
your RL uh techniques,
3518.6s
how would you kind of describe that
3520.2s
visual picture of how data is flowing?
3525.1s
Yeah. So I would say um the model based
3529.7s
I I'm I'm personally using the
3531.5s
model-based approaches to generate higher
3533.7s
quality data and then uh it it will be
3537.5s
used essentially as an initial guess for
3539.8s
the RL policy. So uh we can train our
3543.0s
policy from scratch but that is very
3545.8s
time consuming requires a lot of reward
3548.6s
tuning uh and uh like the uh the the
3552.5s
behavior resulting behavior might be uh
3555.3s
less natural for example but with the
3558.1s
help of model-based methods we will be
3560.6s
able to enable um more fluid uh motion uh
3566.9s
from the uh initial guess provided by
3569.6s
the model-based methods. And then
3572.0s
the RL policy like just bootstraps or
3575.6s
initializes from that to learn uh a
3578.6s
better um controller.
3580.9s
Oh,
3582.0s
so you were talking about hard
3583.2s
constraints using um like traditional
3586.9s
optimization based techniques and also
3589.5s
combining that with RL techniques. So
3592.4s
can you kind of walk a little bit into
3594.3s
more detail about how you take the two
3597.0s
things and combine them together?
3599.8s
Uh yeah sure. So um as I mentioned uh
3603.5s
before we do the kinematic retargeting
3606.5s
by building a graph that preserves the
3609.3s
relative location between the human and
3612.0s
the robot uh as well as the objects. So
3616.0s
here is the optimization program I'm
3618.6s
solving. I'm trying to there are uh
3621.4s
components that are the objective as
3624.3s
well as the constraints. So the
3626.3s
objectives are encouraging this
3628.8s
interaction to be preserved. Uh and the
3631.8s
hard constraints including
3633.3s
non-penetration
3634.8s
where we don't want the robot hand to
3637.8s
penetrate the object to go into the
3639.7s
object. So we want to prevent that. We
3642.2s
want the joint to stay within the limits
3645.0s
and as well as the velocity to stay
3647.1s
within the speed limit. And we also want
3649.8s
a hard constraint that the foot doesn't
3652.9s
skate while the robot is walking.
3655.4s
Otherwise the robot will just be sliding
3657.7s
all the time. So these constraints
3660.2s
together with the interaction preserving
3662.8s
objective uh will give us some very high
3666.0s
quality data that uh preserves the human
3669.8s
motion of picking up a box onto the
3672.6s
robot uh as well as so uh as satisfying
3677.0s
all the hard constraints including the
3679.4s
robot hand not penetrating the object
3681.7s
and the foot is not sliding while
3683.6s
walking. So would you say you're using
3686.1s
this um constraint here as the input to
3690.7s
your RL or are you using this constraint
3693.3s
to generate like a full series of like
3697.0s
motion data like how how would you say
3699.1s
it's the right way to understand?
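A minimal sketch of the hard constraints listed above (joint limits, velocity limits, no foot skating, non-penetration), written as a per-frame check. This is only an illustration: the function name, argument names, and tolerance are assumptions, and in the pipeline described in the conversation these conditions would be enforced inside the optimization program rather than checked after the fact.

```python
import math


def check_frame(q, q_prev, dt, q_min, q_max, v_max,
                foot_xy, foot_xy_prev, foot_in_contact,
                hand_object_distance, skate_tol=1e-3):
    """Report which kinematic hard constraints one retargeted frame satisfies.

    All names and the tolerance are hypothetical, chosen for illustration.
    """
    # Joint limits: every joint angle stays inside [q_min, q_max].
    joint_ok = all(lo <= qi <= hi for qi, lo, hi in zip(q, q_min, q_max))
    # Velocity limits: finite-difference joint speed stays under v_max.
    velocity_ok = all(abs(qi - qp) / dt <= vm
                      for qi, qp, vm in zip(q, q_prev, v_max))
    # Foot skating: a foot in contact must not slide horizontally.
    slide = math.hypot(foot_xy[0] - foot_xy_prev[0],
                       foot_xy[1] - foot_xy_prev[1])
    no_skate_ok = (not foot_in_contact) or slide <= skate_tol
    # Non-penetration: signed distance from hand to object is non-negative.
    no_penetration_ok = hand_object_distance >= 0.0
    return {
        "joint_limits": joint_ok,
        "velocity_limits": velocity_ok,
        "no_foot_skating": no_skate_ok,
        "non_penetration": no_penetration_ok,
    }


frame_report = check_frame(
    q=[0.1, -0.2], q_prev=[0.0, -0.2], dt=0.02,
    q_min=[-1.0, -1.0], q_max=[1.0, 1.0], v_max=[10.0, 10.0],
    foot_xy=(0.3, 0.0), foot_xy_prev=(0.3, 0.0), foot_in_contact=True,
    hand_object_distance=0.01,
)
print(frame_report)
```

A frame where the stance foot moves between time steps, or where the hand's signed distance to the object goes negative, would fail the corresponding entry.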
3701.8s
Uh yeah so this is actually kind of a
3704.3s
hierarchical framework. So first we
3707.1s
use this pipeline to generate data with
3709.8s
hard constraints so that it can be used
3712.3s
as initialization for the RL. So during
3715.8s
RL training, we actually initialize the
3719.4s
uh the agents to be in some random
3722.6s
position or uh in some configurations at
3726.4s
random time step in this uh in this
3728.7s
motion data set and then from that the
3731.8s
RL will be able to start from
3734.6s
these configurations and bootstrap from
3737.0s
that to come up with a dynamically
3739.5s
feasible solution. So uh say let's
3742.8s
compare it with a training from scratch
3745.7s
paradigm where the uh the RL policy
3748.6s
initializes the agent to be in random
3750.9s
location.
3752.1s
In that kind of scenario there can be
3755.0s
there can be all kinds of penetrations
3757.7s
uh joint limits violation velocity limit
3760.5s
violation as well as foot skating in
3763.0s
that kind of uh initialization from
3765.5s
scratch. So uh with this optimization
3769.0s
based uh qual uh high quality data
3772.7s
generated the RL will be initialized
3775.2s
from a much better configuration than
3778.6s
just initializing from scratch.
3780.8s
So how how do you know that the initial
3785.2s
position if the initial position is in
3787.8s
something that's like physically
3790.6s
possible? How do you know the rest of
3793.1s
the RL execution will also be physically
3796.8s
possible? What is kind of enforcing
3799.9s
that?
3801.6s
Yeah, so that's actually just
3803.6s
time-stepping the simulator in Isaac that
3806.2s
is enforcing the dynamical constraints.
3808.9s
So it's still like checking your
3810.6s
optimization equation every time. Is
3812.7s
that what you're saying? like your RL.
3815.5s
Oh,
3817.3s
so uh the uh data that comes from my
3820.9s
optimization is simply used as the
3823.0s
initialization for RL and then RL just
3826.6s
does whatever uh it's supposed to do in
3829.0s
Isaac.
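The episode initialization described above, starting each RL episode from a configuration at a random time step of the motion dataset, can be sketched roughly as follows. The dataset layout and the function name are assumptions made for illustration; the actual reset logic in the simulator is not described in the conversation.

```python
import random


def sample_reset_state(motion_dataset, rng):
    """Pick a random clip and a random time step inside it.

    Returns (clip_index, time_step, reference_frame). The names and the
    list-of-frames layout are hypothetical.
    """
    clip_index = rng.randrange(len(motion_dataset))
    clip = motion_dataset[clip_index]
    time_step = rng.randrange(len(clip))
    return clip_index, time_step, clip[time_step]


# Toy dataset: each clip is a list of frames, each frame a list of joint angles.
motion_dataset = [
    [[0.00, 0.10], [0.05, 0.12], [0.10, 0.15]],
    [[0.20, -0.10], [0.22, -0.05]],
]

rng = random.Random(0)
clip_index, time_step, reference_frame = sample_reset_state(motion_dataset, rng)
print(clip_index, time_step, reference_frame)
```

Because every sampled frame comes from the optimized data, each episode starts from a configuration that already satisfies the kinematic hard constraints, unlike a random initialization from scratch.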
3830.4s
Okay. So as long as you're saying as
3832.9s
long as the initialization when you say
3834.6s
initialization I guess you mean like the
3836.6s
initial position it's in for like each
3839.7s
episode. Is that the correct way to
3841.9s
describe it? Uh yes yes uh not only does
3845.7s
the reference motion uh act as a
3849.4s
good initialization for the RL policy
3851.8s
but it also acts as a guidance, as a
3854.5s
guidance for the RL policy. So uh it uh
3857.9s
tells the uh RL policy where the
3861.6s
robot should go at the next time step
3863.9s
and the RL tries to achieve that with
3866.9s
the current dynamical constraints in
3869.8s
Isaac Sim.
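One common way to turn a reference motion into guidance for an RL policy is a tracking reward that is highest when the simulated robot matches the reference pose at the next time step. The sketch below assumes an exponential-of-squared-error form with a hypothetical scale `sigma`; the conversation does not say which reward the speaker actually uses.

```python
import math


def tracking_reward(q_sim, q_ref, sigma=0.25):
    """Reward in (0, 1] that peaks when simulated joints match the reference.

    The exponential form and sigma are illustrative assumptions.
    """
    squared_error = sum((a - b) ** 2 for a, b in zip(q_sim, q_ref))
    return math.exp(-squared_error / sigma ** 2)


q_ref_next = [0.05, 0.12]      # where the reference says the robot should be
close = tracking_reward([0.06, 0.12], q_ref_next)
far = tracking_reward([0.30, -0.20], q_ref_next)
print(close > far)
```

The simulator still decides what is dynamically feasible; the reward only tells the policy how close it came to the reference.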
3871.0s
All right. So, that's it for this
3872.3s
episode. Uh, thank you Lou for coming on
3874.6s
to this podcast show and I'll leave some
3877.3s
links in the video description for some
3879.3s
of her works so you guys can go ahead
3881.3s
and check that out.
3882.5s
Thank you for inviting me, Kevin.