The EWRL in Toulouse was an awesome event. Here are some of my favorite posters!
This year’s EWRL in Toulouse was a great event: well organized, an interesting program, awesome people. I can highly recommend attending the next edition! Here are the most interesting papers I discovered during the poster sessions - if yours isn’t here, it’s most likely because I got too caught up chatting or in the wine during lunch. The papers are ordered as on the EWRL website, so you shouldn’t take the order as an indicator of quality in any way.
The idea here is to use an exploration strategy that is trained to find high-error states of the actual policy. The fun thing: this exploration policy can apparently discover quite complex Lunar Lander behaviors (e.g. barrel rolls!) along the way. It struck me how much more complex the error behavior was than the actual solution; I’d be interested to know whether it’s actually easier to find complex errors than to do the right thing.
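I didn’t note the exact training setup, so here is only a minimal sketch of the flavor of the idea as I understood it: reward an exploration policy with the learner’s TD error at the states it visits. The `learner`, `explorer` and `env` objects are hypothetical placeholders with a Gymnasium-style step API, not the paper’s implementation.

```python
def explorer_reward(learner, obs, reward, next_obs, done, gamma=0.99):
    """Hypothetical error signal for the explorer: the learner's absolute TD error."""
    td_target = reward + (1.0 - done) * gamma * learner.value(next_obs)
    return abs(td_target - learner.value(obs))


def collect_error_seeking_rollout(env, explorer, learner, max_steps=1000):
    """Roll out the explorer and relabel its rewards with the learner's TD error."""
    obs, _ = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = explorer.act(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        # The explorer is trained on the error signal, not on the task reward.
        error = explorer_reward(learner, obs, reward, next_obs, float(terminated))
        trajectory.append((obs, action, error, next_obs))
        obs = next_obs
        if terminated or truncated:
            break
    return trajectory
```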
The interesting bit here is that we should probably treat entropy and value advantage estimation differently - put like that, it seems obvious, and the results here support it. Depending on the environment, we want more or less exploration decay, and with this method we can control that separately. Essentially, we expose a hyperparameter that so far was kept completely static and thus had to serve the same purpose for advantage and entropy estimation. Being able to adjust it is likely an advantage, especially if we can figure out how to do that on the fly.
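To make the decoupling concrete, here is a minimal sketch under my own assumptions - namely that the shared hyperparameter is the GAE lambda and that the entropy term gets its own value head; the paper may decouple something else entirely.

```python
import numpy as np


def gae(rewards, values, gamma, lam):
    """Generalized advantage estimation for one trajectory; values has length len(rewards) + 1."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def decoupled_advantages(rewards, entropies, reward_values, entropy_values,
                         gamma=0.99, lam_reward=0.95, lam_entropy=0.5,
                         entropy_coef=0.01):
    """Estimate the reward advantage and the entropy 'advantage' with separate lambdas.

    Assumes a separate value head for the entropy return (my assumption,
    not necessarily the paper's setup).
    """
    adv_reward = gae(rewards, reward_values, gamma, lam_reward)
    adv_entropy = gae(entropies, entropy_values, gamma, lam_entropy)
    return adv_reward + entropy_coef * adv_entropy
```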
This sounds like such a crazy idea: let’s translate RL policies into Python code! It’s even crazier that the authors figured out how to do this. I’m impressed and really want to try this now.
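To give a feeling for what “policies as Python code” could mean, here is a toy illustration I made up (it is not from the paper): a distilled controller for a CartPole-style observation, with invented thresholds.

```python
def distilled_policy(obs):
    """Toy example of a policy written as plain Python (not from the paper).

    Assumes a CartPole-style observation:
    [cart position, cart velocity, pole angle, pole angular velocity].
    The thresholds are made up for the sake of the example.
    """
    _, _, theta, theta_dot = obs
    # Push the cart in the direction the pole is falling, with a small lookahead term.
    if theta + 0.5 * theta_dot > 0.0:
        return 1  # push right
    return 0      # push left
```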
Yes, the title refers to Go-Explore, which is why I asked the authors about it. The idea here is much simpler, though: just by randomly finding new “starting states” at the beginning of an episode, we can improve generalization. Obviously, there are environments where random actions aren’t helpful, but the point stands: artificially increasing the diversity of starting states is important for learning. The fact that we can accomplish it so easily is pretty neat!
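As a sketch of one way to diversify starting states (assuming a Gymnasium-style API; not necessarily the paper’s exact mechanism), a wrapper could simply take a few random actions after each reset before handing control to the agent:

```python
import random

import gymnasium as gym


class RandomStartWrapper(gym.Wrapper):
    """Take a few random actions after reset to diversify the agent's starting states.

    A sketch of the general idea, not necessarily the paper's exact mechanism.
    """

    def __init__(self, env, max_random_steps=10):
        super().__init__(env)
        self.max_random_steps = max_random_steps

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        for _ in range(random.randint(0, self.max_random_steps)):
            action = self.env.action_space.sample()
            obs, _, terminated, truncated, info = self.env.step(action)
            if terminated or truncated:
                # If the random walk ends the episode, fall back to a fresh reset.
                obs, info = self.env.reset(**kwargs)
                break
        return obs, info
```

Wrapping an environment then looks like `env = RandomStartWrapper(gym.make("CartPole-v1"), max_random_steps=5)`.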
This paper ablates a few interesting design decisions in SAC and compares it to PPO. What I find most interesting (quite obviously, if you know me) is that some of these design decisions seem to be somewhat environment-agnostic, at least on MuJoCo, while others might be worth adapting, e.g. Momentum or Off-Policy. The differences are also only really visible on Inverted Double Pendulum, which is kind of the odd one out among the environments - meaning that their ON-SAC configuration seems pretty stable within the MuJoCo locomotion domain but might need adjustment elsewhere. These are interesting signals when thinking about connecting algorithm configurations to environment domains.
Another crazy-sounding idea: let’s compress an offline RL dataset into an image for performance prediction! This is a really creative paper and a great idea overall; I recommend taking a look.
This paper contains quite a few interesting ablations centered around the question of how well an agent generalizes across Frozen Lake variations. It reinforces that even in this simple environment we don’t really know how to do task scheduling, that different context features influence generalization differently, and that the state representation matters. The ablations are quite thorough and on a simple domain, so I think they’re a nice reference on the different factors in zero-shot generalization and task ordering.
Why do we use policy improvement operators and not value improvement operators (explicit ones, at least)? This paper tries the latter, and there are a few interesting results - maybe most importantly that value improvement operators allow more (or less, if you prefer) greedy improvement than before. This interacts nicely with over- vs. underestimation: the danger of being “too pessimistic” in the long run is reduced if we’re very greedy with respect to the pessimistic estimate, which in my opinion is better than trying to force a more optimistic estimate.
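For intuition on what “greedy with respect to a pessimistic estimate” can mean (my own illustration, not the paper’s value improvement operator): act greedily on the minimum over an ensemble of Q estimates.

```python
import numpy as np


def pessimistic_greedy_action(q_ensemble, obs):
    """Pick the action that is greedy w.r.t. a pessimistic value estimate.

    q_ensemble: list of functions mapping an observation to a vector of Q-values,
    one per action (a hypothetical interface). Taking the element-wise minimum
    over the ensemble gives a pessimistic estimate; acting greedily on it is one
    simple way to be 'very greedy with respect to the pessimistic estimate'.
    """
    q_values = np.stack([q(obs) for q in q_ensemble])  # shape: (ensemble, actions)
    pessimistic_q = q_values.min(axis=0)
    return int(np.argmax(pessimistic_q))
```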
I was almost convinced this method must have already existed when I came to the poster - that’s a compliment! Using the model’s uncertainty to gauge whether it’s better to generate samples with the model or to collect new experience feels so much more natural than scheduling the trade-off without taking the model into account.
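A minimal sketch of how such an uncertainty-gated choice could look, assuming ensemble disagreement as the uncertainty measure and a hypothetical `predict(obs, action)` interface (the paper may use a different criterion):

```python
import numpy as np


def should_use_model(model_ensemble, obs, action, disagreement_threshold=0.05):
    """Decide between model rollouts and real environment interaction.

    Uses disagreement between ensemble members' next-state predictions as a
    proxy for model uncertainty: if the members agree, the model is probably
    trustworthy enough to generate samples; otherwise, collect real experience.
    This is my own sketch of the idea, not the paper's exact criterion.
    """
    predictions = np.stack([m.predict(obs, action) for m in model_ensemble])
    disagreement = predictions.std(axis=0).mean()
    return disagreement < disagreement_threshold
```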
I left out papers I’d seen previously and that were presented by the AutoRL.org circle - you should still take a look at those, though: ARLBench, Relational Structure in Representations, Joint Optimization of Reward Shaping and Hyperparameters, and the wonderfully titled Dreaming of Many Worlds.
I really liked Elise van der Pol’s talk about symmetry. I was vaguely aware of the concepts she covered, but apparently it took until now for them to really click with me. Super interesting topic, and I hope to see more research here!
Thanks for reading, hopefully see you again next year!