This domain is based on the helicopter simulator from Andrew Ng's group. Agents must control a helicopter so that it hovers stably. Challenges include:
  • Dynamics: Wind effects and complicated nonlinear dynamics make this a challenging problem.
  • Explore / exploit: this domain includes the catastrophic event of crashing. Agents must explore carefully to avoid unrecoverable errors.

Competitors must code a general-purpose RL agent. Agents are tested on a variety of MDPs that share no systematic structure with one another. This forces the agent to learn quickly and reason flexibly about general MDPs. Challenges include:
  • Explore / exploit: in a general MDP, the explore/exploit dilemma is key. Although some theoretical analyses exist for different algorithms that navigate this tradeoff, which will perform best in practice?
  • Structure learning: is there structure in the space of rewards or state transitions that the agent can detect and use?
  • Aggregation: can states be aggregated, either to learn an improved model or accelerate planning?
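
As a rough illustration of the explore/exploit and aggregation questions above, here is a minimal sketch of a tabular Q-learning agent with epsilon-greedy exploration. It is not the competition's reference agent or API: the class name, the `aggregate` hook, the three-state chain MDP, and all parameter values are hypothetical, chosen only to keep the example self-contained.

```python
import random


class EpsilonGreedyQAgent:
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""

    def __init__(self, n_states, n_actions, epsilon=0.1, alpha=0.1,
                 gamma=0.95, aggregate=lambda s: s, seed=0):
        self.n_actions = n_actions
        self.epsilon = epsilon      # probability of taking a random action
        self.alpha = alpha          # learning rate
        self.gamma = gamma          # discount factor
        self.aggregate = aggregate  # hypothetical hook: raw state -> table row
        self.rng = random.Random(seed)
        self.q = [[0.0] * n_actions for _ in range(n_states)]

    def greedy(self, state):
        """Return the action with the highest estimated value (first on ties)."""
        row = self.q[self.aggregate(state)]
        return row.index(max(row))

    def act(self, state):
        """Explore with probability epsilon, otherwise exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        row = self.q[self.aggregate(state)]
        best = max(row)
        # Break ties at random so the untrained agent does not get stuck.
        return self.rng.choice([a for a, v in enumerate(row) if v == best])

    def update(self, state, action, reward, next_state):
        """One-step Q-learning update."""
        s, s2 = self.aggregate(state), self.aggregate(next_state)
        target = reward + self.gamma * max(self.q[s2])
        self.q[s][action] += self.alpha * (target - self.q[s][action])


def chain_step(state, action):
    """Toy MDP (assumed for this example): action 1 moves right, action 0
    moves left. Stepping right from state 2 pays reward 1 and resets to 0."""
    nxt = state + 1 if action == 1 else max(0, state - 1)
    if nxt == 3:
        return 0, 1.0
    return nxt, 0.0


def run_chain(steps=20000):
    """Train the agent on the toy chain and return it."""
    agent = EpsilonGreedyQAgent(n_states=3, n_actions=2)
    state = 0
    for _ in range(steps):
        action = agent.act(state)
        next_state, reward = chain_step(state, action)
        agent.update(state, action, reward, next_state)
        state = next_state
    return agent


if __name__ == "__main__":
    trained = run_chain()
    print([trained.greedy(s) for s in range(3)])
```

Passing a coarser `aggregate` function (for example `lambda s: s // 2`, with `n_states` reduced to match) makes several raw states share one table row. That can speed up learning when the merged states behave alike, at the cost of accuracy when they do not.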

