Improvement of Cooperative Action for Multi-Agent System by Rewards Distribution

The frequency of natural disasters is increasing everywhere in the world, which is a major impediment to sustainable development. One important issue for the international community is to reduce vulnerability to and damage from disasters. In addi-tion, a large number of injuries occur simultaneously in a large-scale disaster, and the condition of the injured will change over time. Efficient rescue activities are carried out using triage to determine the priority of injury treatment based on the severity of the persons’ conditions. In this chapter, we discuss acquiring cooperative behavior of rescuing the injured and clearing obstacles according to triage of the injured in a multi-agent system. We propose three methods of reward distribution: (1) reward distribution responding to the condition of the injured, (2) reward distribution based on the contribution degree, and (3) reward distribution by the contribution degree responding to the condition of the injured. We investigated the effectiveness of the three proposed methods for a disaster relief problem by an experiment. The results of the experiment showed that agents gained high rewards by rescuing those in most urgent need under the method having the reward distributed according to the contribution degree responding to the condition of the injured.


Introduction
The frequency of natural disasters is increasing everywhere in the world, which is a major impediment to sustainable development. In order to minimize the damage of disasters, the United Nations Office for Disaster Risk Reduction (UNISDR) calls for the promotion of disaster prevention and mitigation by local governments in each country. This is an important issue for the international community in order to reduce vulnerability to and damage from disasters.
In the case of a large-scale disaster, a large number of injuries occur simultaneously, and the condition of the injured changes with the lapse of time. This implies that, to conduct efficient treatment when resources are insufficient to immediately treat all the people who are injured, it is necessary to use triage, which is the process of determining the priority of treatment based on the severity of the injured person's condition [1].
To date, many different remote-controlled disaster relief robots have been developed. A further complication, besides the need for triage, is that these robots must work in environments in which communication is not always secure. For these reasons, there is a need for autonomous disaster relief robots, that is, robots which can learn from the conditions that they encounter and then take independent action [2]. Thus, efficient rescue needs to consider the condition of the injured, which changes with the lapse of time, even with the use of disaster rescue robots.
Reinforcement learning is one way that robots can acquire information about appropriate behavior in new environments. Under this learning system, robots can observe the environment, select and perform actions, and obtain rewards [3][4][5][6]. Each robot must learn what the best policy (i.e., the policy that obtains the largest amount of reward over time) is by itself.
Recent research on disaster relief robots has included consideration of multiagent systems, that is, systems that include two or more disaster relief robots. A multi-agent system in which multiple agents explore sections of damaged building with the goal of updating a topological map of the building with safe routes is discussed [7,8]. John et al. constructed a multi-agent systems approach to disaster situation management, which is a complex multidimensional process involving a large number of mobile interoperating agents [9]. However, to successfully interact in the real world, agents must be able to reason about their interactions with heterogeneous agents of widely varying properties and capabilities. It is necessary that agents are able to learn from the environment and implement independent actions by using perceptual and reasoning in order to carry out their task in the best possible way [10,11].
Numerous studies regarding learning in multi-agent systems have been conducted. Spychalski and Arendt proposed a methodology for implementing machine learning capability in multi-agent systems for aided design of selected control systems allowed to improve their performance by reducing the time spent processing requests that were previously acknowledged and stored in the learning module [12]. In [13], a new kind of multi-agent reinforcement learning algorithm, called TM_Qlearning, which combines traditional Q-learning with observation-based teammate modeling techniques, was proposed. Two multi-agent reinforcement learning methods, both consisting of promoting the selection of actions so that the chosen action not only relies on the present experience but also on an estimation of possible future ones, have been proposed to better solve the coordination problem and the exploration/exploitation dilemma in the case of nonstationary environments [14]. In [15], the construction of a multi-agent evacuation guidance simulation that consists of evacuee agents and instruction agents was reported, and the optimum evacuation guidance method was discussed through numerical simulations by using the multi-agent system for post-earthquake tsunami events. A simulation of a disaster relief problem that included multiple autonomous robots working as a multi-agent system has been reported [16].
In disaster relief problems, it is important to rescue the injured and remove obstacles according to conditions that are changing with the passage of time. However, conventional research on multiple agents targeted for disaster relief has not taken into consideration the condition of the injured, so it is insufficient for efficient rescue.
In this chapter, we discuss acquiring cooperative behavior of rescuing the injured and clearing obstacles according to triage of the injured in a multi-agent system. We propose three methods of reward distribution: (1) reward distribution responding to the condition of the injured, (2) reward distribution based on the contribution degree, and (3) reward distribution by the contribution degree responding to the condition of the injured. We investigated the effectiveness of Improvement of Cooperative Action for Multi-Agent System by Rewards Distribution DOI: http://dx.doi.org /10.5772/intechopen.85109 these proposed methods for a disaster relief problem by an experiment. The results of the experiment showed that agents gained high rewards by rescuing those in most urgent need under the method having the reward distributed according to the contribution degree responding to the condition of the injured.

Learning of multi-agent systems and representation of disaster relief problem 2.1 Learning of multi-agent systems
Agents are a computational mechanism that exist in some complex environment, sense and perform actions in its environment, and by doing so realize a set of tasks for which it is assigned. A multi-agent system consists of agents that interact with each other, situated in a common environment, which they perceive with sensors and upon which they act with actuators (Figure 1). Agent and environment are relationships of the interaction. In the meantime, when the environments are inaccessible, the information which can be perceived from the environment is limited, and it is inaccurate, and it entails delay [2,17,18].
In [19], the following major characteristics of multi-agent systems were identified: • Each agent has incomplete information and is restricted in its capabilities.
• The control of the system is distributed.
• The data are decentralized.
• The computation is asynchronous.
In multi-agent systems, individual agents are forced to engage with other agents that have varying goals, abilities, and composition. Reinforcement learning have been used to learn about other agents and adapt local behavior for the purpose of achieving coordination in multi-agent situations in which the individual agents are not aware of each another [20].
Various reinforcement learning strategies have been proposed that can be used by agents to develop a policy to maximizing rewards accumulated over time. A prominent algorithm in reinforcement learning is the Q-learning algorithm.
In Q-learning, the decision policy is represented by the Q-factors, which estimates long-term discounted rewards for each state-action pair. Let Q(s, a) denote the Q-factor for state s and action a. If an action a in state s produces a reinforcement of r and a transition to state s t + 1 , then the corresponding Q-factor is modified as follows: where α is a small constant called learning rate, which denotes a step-size parameter in the iteration, and γ denotes the discounting factor. Theoretically, the algorithm is allowed to run for infinitely many iterations until the Q-factors converge.

The target problem
In this chapter, we focus on a disaster relief as a target problem similar to previous research [16]. In the disaster relief problem, agents must rescue the injured as quickly as possible, and the injured with different severity and urgency of the condition are placed on a field of fixed size. Because there are multiple injured and obstacles, the disaster relief problem can be considered to be a multi-agent system. Each agent focuses on achieving its own target, and the task of system is to efficiently rescue all of the injured and remove obstacles (Figure 2).
Efficient rescue is performed at a disaster site using triage to assign priority of transport or treatment based on the severity and urgency of the condition of the injured. In the disaster rescue problem, it is thus necessary to reflect triage based on the condition of the injured. For this purpose, in this chapter, we designate the condition of the injured as red (requiring emergency treatment), yellow (requiring urgent treatment), green (light injury), or black (lifesaving is difficult) in descending order of urgency.
The disaster relief problem is represented as shown in Figure 3. The field is divided into an N × N lattice. Agents are indicated by circles, ◎; the injured are indicated by R, Y, G, and B; removable obstacles are indicated by white triangles, △; and nonremovable obstacles are indicated by black triangles, ▲. The destination of injures is indicated by a white square, □; and the collection site of movable obstacles is indicated by a black square, ■. A single step is defined such that each of the agents on the field completes a single action, and the field is re-initialized once all of the injured have been moved.
The agents considered in this chapter are depicted in Figure 4 and included two types, to obtain cooperative actions. The rescue agents have the primary function of rescuing the injured, and the removal agents have the primary function of removing obstacles, although either type can perform both functions. The agents  recognize the colors of the injured on the field and identify the condition of the injured in correspondence with those colors.
Each agent considers the overall tasks in the given environment, carries out the tasks in accordance with the assigned roles, and learns appropriate actions that bring high rewards.
An agent can recognize its circumstance within a prescribed field of vision and move one cell vertically or horizontally, but will stay in place without moving if a nonremovable obstacle, injured transport destination, obstacle transport destination, or other agent occupies the movement destination or if the movement destination is outside the field. Each agent has a constant but limited view of the field, and it can assess the surrounding environment.
The available actions of agents are (1) moving up, down, right, or left to an adjacent cell; (2) remaining in the present cell when adjacent cell is occupied by an obstacle that cannot be removed or by another agent; and (3) finding an injured person or a movable obstacle and taking it to the appropriate location.
If an agent is processing an injured person and next action is moving it to the appropriate destination, then the task of the agent is completed. The agent can begin a new task for rescuing or removing. When all of the injured on the field have been rescued, the overall task is completed.

Cooperative agents
In multi-agent system, the environment changes from static to dynamic because multiple autonomous agents exist. An agent engaged in cooperative action decides its actions by referring to not only its own information and purpose but to those of other agents as well [16]. The cooperative agents are acquired by sharing sensation, sharing episodes, and sharing learned policies [17,[19][20][21]. Cooperative actions are important not only in situations where multiple agents have to work together to accomplish a common goal but also in situations where each agents has its own goal [22].
In this chapter, the multi-agent systems are composed of agents' different behaviors. One is to perform relief (relief agents), and the other is to remove obstacles (removing agents). Cooperation was achieved by giving different rewards to different behaviors.

Reward distribution with consideration of condition of the injured
It is necessary to have the multi-agent systems learn efficient rescue with the condition of the injured taken into consideration.
Prior studies used reward distribution with the reward value differing in accordance with the agent action but gave no consideration to the condition of the injured.
In this chapter, we propose three types of reward distribution as methods for obtaining cooperative action of injured rescue and obstacle removal in accordance with the urgency of the condition of the injured.

Method 1: reward distribution responding to the condition of the injured
A conferred reward is high in value when an injured person in a condition of high urgency is rescued and decreases in value for those in less urgent conditions. Thus, R r > R y > R g > R b , where R r , R y , R g , and R b are the reward values for the rescue of injured persons in the red, yellow, green, and black condition categories, respectively.

Method 2: reward distribution based on the contribution degree
In Method 2, the reward value reflects the time spent by the rescue agent as the contribution degree.
With R as the basic reward value when the rescue agent completes the injured rescue, C as the contribution degree, and λ as a weighting factor, the reward r earned by the rescue agent in learning is as given by Eq. (2). A large λ results in a reward that is greatly augmented relative to the basic reward, according to contribution degree.
Assessed contribution degree C increases with decreasing time spent in rescuing the injured, as shown in Eq. (3), in which T e is the time of completion of rescue of all the injured by the rescue agents and T i is the time spent by an agent to rescue an injured person.

Method 3: reward distribution by the contribution degree responding to the condition of the injured
In Eq. (2), basic reward value R at the time of completion of each task takes on one of the values R r > R y > R g > R b according to the condition of the injured person.

Experimental conditions
In the study presented in this chapter, we experimented on obtaining cooperative action by agents for efficient rescue in accordance with the condition of the injured and obstacle removal, using the three proposed reward distributions. We assigned the injured and obstacle transport destinations to one cell each on the field shown in Figure 3 and numbers of agents, injured persons, and obstacles as listed in Table 1. The mean of five simulation trials was taken as the result.

Effects of reward distribution timing
We investigated the effects of the three patterns of reward distribution timing on task completion on the efficiency of learning injured rescue. The reward is given when an injured person or obstacle is discovered in Pattern 1. The reward is given when an injured person or object is taken for removal to the appropriate location in Pattern 2. Rewards are given twice in Pattern 3: at the stage of discovering an injured person or obstacle and at the stage where transportation is completed.
The results of an experiment to compare the three reward distribution timing patterns are shown in Figure 5. The horizontal axis represents the episodes, and the vertical axis represents the number of steps for task completion by all agents. These results indicate that Pattern 3 allowed completion of the tasks in a smaller number

The setting of field
The field size 10 × 10 The number of rescue agents 2 The number of clearing agents 2 The number of injured individuals    of steps than did Patterns 1 and 2, which in turn indicates that efficient rescue and removal was learned by conferring rewards in two stages and thus led the agent to regard the course from discovery to transport as one task. We therefore applied Pattern 3 in the subsequent experiments.

Obtaining cooperative action that considers the condition of the injured and the effects of reward distribution timing
We applied the three types of reward distributions in experiments for efficient rescue in accordance with the urgency of the condition of the injured. In the following descriptions of experimental results, the horizontal axis represents episodes, and the vertical axis represents the number of steps for task completion by all agents. Figure 6 shows the results of an experiment to investigate the effectiveness of a reward distribution in accordance with condition (Method 1) compared to the conventional method [16]. As shown, the number of steps is higher throughout with Method 1 than with the standard method, thus indicating learning to postpone rescue of the low urgency injured and prioritize rescue of the high urgency injured.
Finally, we performed an experiment to investigate the effectiveness of reward distribution based on contribution degree (Method 2) in comparison to Method 1 and another experiment to investigate the effectiveness of reward distribution by contribution degree in accordance with injured condition (Method 3) in Improvement of Cooperative Action for Multi-Agent System by Rewards Distribution DOI: http://dx.doi.org /10.5772/intechopen.85109 comparison with the other proposed methods (Methods 1 and 2). The results are shown in Figure 7 and Table 2.
As shown, Method 2 tends to yield a higher number of steps than Method 1 in and after around episode 6000. This indicates that, for efficient injured rescue with consideration for contribution degree, the rescue order learned was to first rescue those injured who were nearby and thus shorten the rescue time, leaving for later the rescue of those who were farther away.
Method 3 is approximately 2.2 and 3.4% superior to Methods 1 and 2, respectively. The agents were apparently able to learn rescue of the injured in accordance with urgency because a reward differing in accordance with injured condition was conferred on the agents. These results also show that the agents were able to learn efficient rescue action because the reward distribution reflected contribution degree.

Conclusion
In this chapter, we considered rescue robots as a multi-agent system and proposed three reward distributions for the agents to learn cooperative action with consideration given to the condition of the injured and obstacle removal in responding to our disaster rescue problem, as well as investigating the timing of reward conferral on the agents.
Comparative experiments showed that the timing of reward distribution enabling the agents to obtain the most efficient cooperative actions consisted of  © 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Author details
Mengchun Xie Department of Electrical and Computer Engineering, National Institute of Technology, Wakayama College, Wakayama-ken, Japan *Address all correspondence to: xie@wakayama-nct.ac.jp reward conferral at the stages in which the agent discovered the injured person or obstacle and at completion of their transport. The results also showed that the capability of cooperative actions for the most efficient injured rescue and obstacle removal could be acquired through reward distribution by contribution degree in accordance with condition. In this chapter, the multi-agent system corresponding to the disaster rescue problem, rescue simulations were performed with the condition of the injured determined in advance. In future studies, we plan to conduct simulations with dynamic changes over time in both the condition of the injured and removable versus nonremovable states of the obstacles.