Multi-agent task allocation using reinforcement learning
This is the first paper (in review) in the Multi-agent learning in dynamic systems series. It develops new algorithms to optimise task allocation in multi-agent systems, using Q-learning, historical reward convolution, and dynamically adaptable risk-based system exploration.
Multi-agent learning in dynamic systems
This series focuses on applying reinforcement learning techniques to multi-agent systems where the environment is dynamic and realistic resource constraints exist. The work combines task-allocation optimisation, resource allocation, and self-organising hierarchical agent structures.
Abstract
In large-scale systems there are fundamental challenges when centralised techniques are used for task allocation. The number of interactions is limited by resource constraints such as limits on computation, storage, and network communication. We can increase scalability by implementing the system as a distributed task-allocation system, sharing tasks across many agents. However, distribution also increases the resource cost of communication and synchronisation, which in turn makes it difficult to scale.
In this paper we present four algorithms to solve these problems. The combination of these algorithms enables each agent to improve its task allocation strategy through reinforcement learning, while changing how much it explores the system in response to how close to optimal it believes its current strategy is, given its past experience. We focus on distributed agent systems where the agents' behaviours are constrained by resource usage limits, limiting agents to local rather than system-wide knowledge.
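The general idea of learning an allocation strategy while adapting exploration to recent performance can be sketched as follows. This is a minimal illustration, not the paper's four algorithms: the class `AllocatingAgent`, the helper `simulate`, the parameters `alpha`, `min_eps`, and `max_eps`, and the specific exploration rule are all assumptions made for the example. The update is a simplified single-step (stateless) Q-learning rule, and the agent only knows about its own list of neighbours, standing in for local rather than system-wide knowledge.

```python
import random


class AllocatingAgent:
    """Hypothetical agent that learns which neighbour to give each subtask type to."""

    def __init__(self, neighbours, alpha=0.1, min_eps=0.05, max_eps=0.5, seed=None):
        self.neighbours = list(neighbours)  # local knowledge only
        self.alpha = alpha                  # learning rate (assumed value)
        self.min_eps = min_eps              # exploration floor (assumed value)
        self.max_eps = max_eps              # exploration ceiling (assumed value)
        self.q = {}                         # (subtask_type, neighbour) -> value estimate
        self.avg_reward = 0.0               # running mean of recent rewards
        self.best_avg = 0.0                 # best running mean seen so far
        self.rng = random.Random(seed)

    def exploration_rate(self):
        # Explore more when recent rewards fall short of the best level seen,
        # i.e. when the agent has reason to doubt its current strategy.
        if self.best_avg <= 0.0:
            return self.max_eps
        shortfall = max(0.0, 1.0 - self.avg_reward / self.best_avg)
        return self.min_eps + (self.max_eps - self.min_eps) * shortfall

    def choose(self, subtask_type):
        # Epsilon-greedy choice with the adaptive rate above.
        if self.rng.random() < self.exploration_rate():
            return self.rng.choice(self.neighbours)
        return max(self.neighbours, key=lambda n: self.q.get((subtask_type, n), 0.0))

    def update(self, subtask_type, neighbour, reward):
        # Single-step Q-learning update: move the estimate towards the reward.
        key = (subtask_type, neighbour)
        old = self.q.get(key, 0.0)
        self.q[key] = old + self.alpha * (reward - old)
        # Track recent performance to drive the exploration rate.
        self.avg_reward += self.alpha * (reward - self.avg_reward)
        self.best_avg = max(self.best_avg, self.avg_reward)


def simulate(agent, capability, subtask_type, steps):
    """Repeatedly allocate one subtask type; capability[n] is neighbour n's reward."""
    for _ in range(steps):
        chosen = agent.choose(subtask_type)
        agent.update(subtask_type, chosen, capability[chosen])
```

In this sketch, a sustained drop in reward (for example, a capable neighbour becoming unreachable) lowers `avg_reward` relative to `best_avg`, which raises the exploration rate while the learned values in `q` are retained.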
We evaluate these algorithms in a simulated environment where agents are given a task composed of multiple subtasks that must be allocated to other agents with differing capabilities, which then carry out those subtasks. We also simulate real-life system effects such as networking instability. Our solution is shown to solve the task allocation problem to within 6.7% of the theoretical optimum for the system configurations considered. It provides 5x better performance recovery than no-knowledge-retention approaches when system connectivity is impacted, and is tested on systems of up to 100 agents with less than a 9% impact on the algorithms' performance.

