Computers play the beer game: Can artificial agents manage supply chains?
Steven O. Kimbrough^{a}, D.J. Wu^{b}, Fang Zhong^{b}
^{a}The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
^{b}LeBow College of Business, Drexel University, Philadelphia, PA 19104, USA
Abstract
We model an electronic supply chain managed by artificial agents. We investigate whether artificial agents do better than humans when playing the MIT Beer Game. Can the artificial agents discover good and effective business strategies in supply chains, both in stationary and nonstationary environments? Can the artificial agents discover policies that mitigate the bullwhip effect? In particular, we study the following questions: Can agents learn reasonably good policies in the face of deterministic demand with fixed lead time? Can agents cope reasonably well in the face of stochastic demand with stochastic lead time? Can agents learn and adapt in various contexts to play the game? Can agents cooperate across the supply chain?
Keywords: Artificial agents; Automated supply chains; Beer game; Bullwhip effect; Genetic algorithms
1. Introduction
We consider the case in which an automated enterprise has a supply chain ‘manned’ by various artificial agents. In particular, we view such a virtual enterprise as a multi-agent system [10]. Our example of supply chain management is the MIT Beer Game [11], which has attracted much attention from supply chain management practitioners as well as academic researchers. There are four types of agents (or roles) in this chain, sequentially arranged: Retailer, Wholesaler, Distributor, and Manufacturer. In the MIT Beer Game, as played with human subjects, each agent tries to minimize the long-term, system-wide total inventory cost when ordering from its immediate supplier. Each agent's decision on how much to order may depend on its own prediction of future demand by its customer, which could be based on its own observations. Agents might also use other rules in deciding how much to order from a supplier.
The observed performance of human beings operating supply chains, whether in the field or in laboratory settings, is usually far from optimal from a system-wide point of view [11]. This may be due to a lack of incentives for information sharing, to bounded rationality, or possibly to individually rational behavior that works against the interests of the group. It would thus be interesting to see whether we can gain insights into the operations and dynamics of such supply chains by using artificial agents instead of humans. This paper makes a contribution in that direction. We differ from current research on supply chains in the OM/OR area in that the approach taken here is an agent-based, information-processing model in an automated marketplace. In the OM/OR literature the focus is usually on finding optimal solutions assuming a fixed environment, such as a fixed linear supply chain structure, a demand distribution known to all players, and fixed lead times for both information delay and physical delay. It is generally very difficult to derive and to generalize such optimal policies for changing environments.
This paper can be viewed as contributing to the understanding of how to design learning agents to discover insights for complicated systems, such as supply chains, which are intractable when using analytic methods. We believe our initial findings vindicate the promise of the agent-based approach. We do not essay here to contribute to management science or operations research analytic modeling, or to the theory of genetic algorithms. However, throughout this study, we use findings from the management science literature to benchmark the performance of our agent-based approach. The purpose of the comparison is to assess the effectiveness of an adaptable or dynamic order policy that is automatically managed by computer programs, that is, by artificial agents.
We have organized the paper as follows. Section 2 provides a brief literature review. Section 3 describes our methodology and implementation of the Beer Game using an agent-based approach. Section 4 summarizes results from various experiments, mainly on the MIT Beer Game. Section 5 provides a discussion of our findings in light of relevant literature. Section 6 summarizes our findings, offers some comments, and points to future research.
2. Literature review
A well-known phenomenon, both in industrial practice and when humans play the Beer Game, is the bullwhip or whiplash effect: the variance of orders amplifies upstream in the supply chain [4, 5]. An example of the bullwhip effect is shown in Fig. 1, using data from a group of undergraduates playing the MIT Beer Game (date played May 8, 2000).
Insert Fig. 1 here.
These results are typical and have often been replicated. As a consequence, much interest has been generated among researchers regarding how to eliminate or minimize the bullwhip effect in the supply chain. In the management science literature, sources of the bullwhip effect have been identified and countermeasures have been offered [5]. It is shown in [1] that the bullwhip effect can be eliminated under a base-stock installation policy, given the assumption that all divisions of the supply chain work as a team [8]. Base-stock is optimal when facing stochastic demand with fixed information and physical lead times [1]. When facing deterministic demand with penalty costs for every player (the MIT Beer Game), the optimal order for every player is the so-called “Pass Order,” or “One for one” (1-1), policy: order whatever is ordered by your own customer.
In the MIS literature, the idea of combining multi-agent systems and supply chain management has been proposed in [7, 9, 10, 12, 14]. Actual results, however, have mostly been limited to conceptualization. Our approach differs from previous work in that we focus on quantitative and agent computation methodologies, with agents learning via genetic algorithms.
It is well known that the optimality of 1-1 depends on the initial base-stock level, and that it is optimal only for limited cases. In particular, 1-1 requires the system to be stationary. We shall see that artificial agents not only can handle the stationary case by discovering the optimal policy, but also perform well in the nonstationary case.
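To make the notion of variance amplification concrete, here is a minimal Python sketch. The function names and the idea of storing one order series per stage (Retailer first, Manufacturer last) are assumptions for illustration, and any numbers passed in are hypothetical, not data from Fig. 1.

```python
from statistics import pvariance


def order_variances(orders_by_stage):
    """Population variance of the orders placed by each stage.

    orders_by_stage is a list of order series, downstream stage first.
    """
    return [pvariance(series) for series in orders_by_stage]


def amplifies_upstream(orders_by_stage):
    """True if order variance strictly increases at every step upstream."""
    variances = order_variances(orders_by_stage)
    return all(up > down for down, up in zip(variances, variances[1:]))
```

Under a pass-order policy every stage repeats the orders it receives, so all stages have the same order variance and `amplifies_upstream` returns `False`.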
In the next section, we describe in detail our methodology and implementation of the Beer Game.
3. Methodology and implementation
We adopt an evolutionary computation approach. Our main idea is to replace human players with artificial agents playing the Beer Game and to see if artificial agents can learn and discover good and efficient order policies in supply chains. We are also interested in knowing whether the agents can discover policies that mitigate the bullwhip effect.
The basic setup and temporal ordering of events in the MIT Beer Game [11] are as follows. New shipments arrive from the upstream player (in the case of the Manufacturer, from the factory floor); orders are received from the downstream player (for the Retailer, exogenously, from the end customer); the orders just received plus any backlog orders are filled if possible; the agent decides how much to order to replenish stock; inventory costs (holding and penalty costs) are calculated^{1}.
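The ordering of events within one period can be sketched for a single agent as follows. This is a simplified illustration, not the authors' implementation: `AgentState`, `play_period`, and `order_rule` are hypothetical names, information and shipping delays are left out, and the default cost rates are the $1 holding and $2 penalty figures quoted for the MIT Beer Game in Section 3.1.

```python
from dataclasses import dataclass


@dataclass
class AgentState:
    inventory: int  # cases on hand
    backlog: int    # cases owed to the downstream player


def play_period(state, incoming_shipment, incoming_order, order_rule,
                holding_cost=1.0, penalty_cost=2.0):
    """Advance one agent by one period; returns (shipped, placed, cost)."""
    # 1. The new shipment arrives from the upstream player.
    state.inventory += incoming_shipment
    # 2. The order from the downstream player is received.
    owed = incoming_order + state.backlog
    # 3. The new order plus any backlog is filled as far as stock allows.
    shipped = min(state.inventory, owed)
    state.inventory -= shipped
    state.backlog = owed - shipped
    # 4. The agent decides how much to order (never a negative amount).
    placed = max(0, order_rule(incoming_order))
    # 5. Holding and penalty costs are calculated for the period.
    cost = state.inventory * holding_cost + state.backlog * penalty_cost
    return shipped, placed, cost
```

For example, calling `play_period(state, 4, 8, lambda x: x)` applies a pass-order rule to a demand of 8 after a shipment of 4 arrives.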
3.1. Cost function
In the MIT Beer Game, each player incurs inventory holding costs, and also penalty costs if the player has a backlog. We now derive the total inventory cost function of the whole supply chain. As noted above, we depart from previous literature in that we are using an agent-based inductive approach, rather than a deductive/analytical approach.
We begin with the needed notation.
N = number of players, i = 1…N. In the MIT Beer Game, N = 4.
IN_{i}(t) = net inventory of player i at the beginning of period t.
C_{i}(t) = cost of player i in period t.
H_{i} = inventory holding cost of player i, per unit per period (e.g., in the MIT Beer Game, $1 per case per week).
P_{i} = penalty/backorder cost of player i, per unit per period (e.g., in the MIT Beer Game, $2 per case per week).
S_{i}(t) = new shipment player i received in period t.
D_{i}(t) = demand received from the downstream player in period t (for the Retailer, the demand from customers).
According to the temporal ordering of the MIT Beer Game, each player's cost for a given time period, e.g., a week, can be calculated as follows: if IN_{i}(t) >= 0 then C_{i}(t) = IN_{i}(t) * H_{i}, else C_{i}(t) = |IN_{i}(t)| * P_{i}, where IN_{i}(t) = IN_{i}(t-1) + S_{i}(t) - D_{i}(t) and S_{i}(t) is a function of both information lead time and physical lead time. The total cost for the supply chain (the whole team) after M periods (e.g., weeks) is \sum_{i=1}^{N} \sum_{t=1}^{M} C_{i}(t).
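The cost function can be illustrated with a short Python sketch. It assumes that the backlog penalty is charged on the magnitude of a negative net inventory (so that costs are never negative) and that the team cost is the sum of every player's cost over all periods. The function names and the nested-list layout are hypothetical.

```python
def period_cost(net_inventory, holding_cost=1.0, penalty_cost=2.0):
    """Cost C_i(t) of one player in one period, given net inventory IN_i(t)."""
    if net_inventory >= 0:
        return net_inventory * holding_cost
    # Negative net inventory is a backlog; charge the penalty on its size.
    return -net_inventory * penalty_cost


def total_cost(net_inventories, holding_costs, penalty_costs):
    """Team cost: sum of C_i(t) over all players i and periods t.

    net_inventories[i][t] holds IN_i(t) for player i in period t.
    """
    return sum(
        period_cost(inventory, h, p)
        for row, h, p in zip(net_inventories, holding_costs, penalty_costs)
        for inventory in row
    )
```

With the Beer Game rates, a player holding 5 cases pays 5 for the week, while a player with a backlog of 3 cases pays 6.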
We implemented a multiagent system for the Beer Game, and used it as a vehicle to perform a series of investigations that aim to test the performance of our agents. The following algorithm defines the procedure for our multiagent system to search for the optimal order policy.
3.2. Algorithm

1) Initialization. A certain number of rules are randomly generated to form generation 0.
2) Pick the first rule from the current generation.
3) Agents play the Beer Game according to their current rules.
4) Repeat step 3 until the game period (say, 35 weeks) is finished.
5) Calculate the total average cost for the whole team and assign a fitness value to the current rule.
6) Pick the next rule from the current generation and repeat steps 3 to 5 until the performance of all the rules in the current generation has been evaluated.
7) Use a genetic algorithm to generate a new generation of rules and repeat steps 2 to 6 until the maximum number of generations is reached.
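The search procedure above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the text does not specify the genetic operators, so truncation selection, one-point crossover, bit-flip mutation, and all default parameter values are assumptions. `placeholder_cost` is a hypothetical stand-in for playing the game and measuring the team's cost.

```python
import random


def placeholder_cost(rule):
    """Hypothetical stand-in for steps 3-5: a real fitness function would
    play the Beer Game with this rule and return the team's cost."""
    return rule.count("1")


def evolve(cost_fn, rule_length=6, pop_size=20, generations=30,
           mutation_rate=0.05, seed=0):
    """Search bit-string rules for low cost; returns (best_rule, best_cost)."""
    rng = random.Random(seed)
    # Step 1: random rules form generation 0.
    population = ["".join(rng.choice("01") for _ in range(rule_length))
                  for _ in range(pop_size)]
    best_rule, best_cost = None, float("inf")
    for generation in range(generations + 1):
        # Steps 2-6: evaluate every rule in the current generation.
        scored = sorted((cost_fn(rule), rule) for rule in population)
        if scored[0][0] < best_cost:
            best_cost, best_rule = scored[0]
        if generation == generations:
            break
        # Step 7 (assumed operators): keep the cheaper half as parents,
        # then build children by one-point crossover and bit-flip mutation.
        parents = [rule for _, rule in scored[:pop_size // 2]]
        population = []
        while len(population) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, rule_length)
            child = mother[:cut] + father[cut:]
            child = "".join(
                ("1" if bit == "0" else "0") if rng.random() < mutation_rate else bit
                for bit in child)
            population.append(child)
    return best_rule, best_cost
```

Passing the cost function as an argument keeps the search loop separate from the game simulation, so the same loop could be reused with a 24-bit team rule by setting `rule_length=24`.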
3.3. Rule representation
Now we describe our coding strategy for the rules. Each agent's rule is represented by a 6-bit binary string. The leftmost bit represents the sign (“+” or “-”) and the next five bits represent (in base 2) how much to order. For example, rule 101001 can be interpreted as “x+9”:
Sign bit: 1 = “+”
Magnitude bits: 0 1 0 0 1 = 9 in decimal
That is, if demand is x then order x+9. Technically, our agents are searching in a simple function space. Notice that if a rule suggests a negative order, then it is truncated to 0, since negative orders are infeasible. For example, the rule x-15 means that if demand is x then order Max[0, x-15]. Thus the size of the search space is 2^{6} for each agent.
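The encoding can be illustrated with a short Python sketch. It assumes, from the example in the text, that a leading 1 means “+” and therefore that a leading 0 means “-”. The function names are hypothetical.

```python
def decode_rule(bits):
    """Decode a 6-bit rule string into the signed offset y of the rule x+y."""
    if len(bits) != 6 or set(bits) - {"0", "1"}:
        raise ValueError("rule must be a 6-character string of 0s and 1s")
    sign = 1 if bits[0] == "1" else -1  # assumed: 1 = "+", 0 = "-"
    return sign * int(bits[1:], 2)      # last five bits are the magnitude


def order_quantity(bits, demand):
    """Order placed under the rule when the observed demand is x = demand."""
    return max(0, demand + decode_rule(bits))  # negative orders become 0
```

Here `decode_rule("101001")` gives 9, matching the “x+9” example, and the rule x-15 (`"001111"`) orders nothing when demand is 10.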
When all four agents are allowed to make their own decisions simultaneously, we will have an “Über” rule for the team, and the length of the binary string becomes 6 * 4 = 24. The size of the search space is considerably enlarged, but still fairly small at 2^{24}. Below is an example of such an Über rule:
110001 001001 101011 000000, where 110001 is the rule for Retailer; 001001 is the rule for Wholesaler; 101011 is the rule for Distributor; 000000 is the rule for Manufacturer.
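As a sketch of how such a team rule could be unpacked, the Python below splits a 24-bit string into the four 6-bit rules and decodes each one. The function names are hypothetical, and the sign convention (1 = “+”, 0 = “-”) is assumed from the earlier “x+9” example.

```python
ROLES = ("Retailer", "Wholesaler", "Distributor", "Manufacturer")


def split_uber_rule(uber):
    """Split a 24-bit team rule into one 6-bit rule per role."""
    bits = uber.replace(" ", "")
    if len(bits) != 24 or set(bits) - {"0", "1"}:
        raise ValueError("team rule must contain exactly 24 bits")
    return {role: bits[6 * k:6 * k + 6] for k, role in enumerate(ROLES)}


def decode_uber_rule(uber):
    """Signed offset y of the rule x+y for each role."""
    offsets = {}
    for role, rule in split_uber_rule(uber).items():
        sign = 1 if rule[0] == "1" else -1  # assumed: 1 = "+", 0 = "-"
        offsets[role] = sign * int(rule[1:], 2)
    return offsets
```

Under these assumptions, the example rule decodes to x+17 for the Retailer, x-9 for the Wholesaler, x+11 for the Distributor, and x+0 for the Manufacturer.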
Why use the “x+y” rule? Recent literature has identified the main cause of the bullwhip effect as the agents' overreaction to their immediate orders (see, e.g., [5]). While the optimality of the “x+y” rule depends on the initial inventories, so does the optimality of other inventory control policies, such as the (s, S) policy or the base-stock policy [2], which are optimal only for the stationary case. The “x+y” rule is more general in the sense that it handles both the stationary and the nonstationary case. We leave for future research the exploration of alternative coding strategies and function spaces.
Why not let everybody use the “order-up-to” policy? This is essentially the “1-1” policy our agents discover. It is, however, optimal only for the stationary case with the optimal initial configuration. There are three main reasons why this rule (everybody uses the “order-up-to” policy) is of limited use in the general case, where an adaptable rule is more useful. First, there is the coordination problem, the key issue we are trying to address here. It is well established in the literature (see, e.g., [1]) that if everyone uses the (s, S) policy (which is optimal in the stationary case), the total system performance need not be optimal, and indeed could be very bad. Second, we are interested in seeing whether the agents can discover an optimal policy when one exists. We will see that a simple GA learning agent, or a simple rule language, does discover the “1-1” policy (the winning rule), which is essentially equivalent to the base-stock policy. Third, the supply chain management literature has been able to provide analytic solutions only in very limited (some would argue unrealistic) cases. For the real-world problems we are interested in, the literature has little to offer directly, and we believe our agent-based approach may well be a good alternative.