Optimal Selections among Competing Projects: Multi-Armed Bandits
By Kemal Gürsoy
ABSTRACT
A society creates wealth by utilizing its resources, and it distributes that
wealth. Deciding how to create wealth and how to distribute it is critical
for the survival of a society. One approach to making these types of
decisions is the multi-armed bandit problem, which is concerned with the
optimal allocation of resources among competing activities in order to
generate benefits that depend upon the degree of evolution of those
activities. In this work, an optimal decision-making approach, based on the
multi-armed bandit problem, is considered for selecting several competing
projects at a time.
Keywords: Multi-armed bandits, index policies, ensemble of projects.
INTRODUCTION
The problem is to decide how to utilize available resources over time in
order to maximize the collective utility. Relevant facts are gathered and
processed against background knowledge in order to generate the necessary
and sufficient information for making the best possible decision. In the
rich and long history of decision-making methods, one approach is especially
relevant for this work: sequential analysis, Wald (1947), Wald (1950), and
Robbins (1952), which addresses when it is best to terminate an experiment
and to take actions accordingly.
Our goal is to allocate resources sequentially among several activities so
that the overall benefit collected during the life cycle of all activities
is maximized, Thompson (1933). The multi-armed bandit problem, Gittins
(1972) and Gittins (1979), addresses these types of problems. A multi-armed
bandit refers to a collection of independent binomial reward processes, and
the multi-armed bandit problem asks which process to select, and for how
long to activate it, in order to collect the maximum expected average
reward. It is assumed that the state of the activated process changes
according to a probability law, while the states of the inactive processes
do not change.
The state space is the Cartesian product of the individual bandits' state
spaces, and the σ-field is generated by the tensor product of the individual
bandits' σ-fields, parametrized by time. Thus, the evolution of a
multi-armed bandit process is governed by a semi-Markov decision process
with a binary action space. As one process is activated, the probability
law that shapes this process is learned, to some extent. Therefore, choosing
between bandits that are known and bandits that are unknown introduces a
dilemma. It may be tempting to activate the bandit with which we have the
most experience, for a lower-risk immediate gain, yet a bandit with which we
have not experimented may have a better chance of being the most profitable
one. This is the dilemma of taking actions that yield immediate reward
versus actions whose rewards would only appear in the future, Kiefer (1952)
and Lai (1985).
There is an approach to solving these classes of problems based on the
Hardy-Littlewood maximal function theorem. Among a set of measurable
functions, say {f_i | i ∈ I} over an index set I, the maximal function is
found by comparing the volume generated by each function with respect to a
positive measure, say µ > 0, relative to the volume generated by the measure
itself; the supremum of this ratio identifies the maximal function,

Mf = sup_{i∈I} { ∫ f_i dµ / ∫ dµ }    (1)
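As a discrete illustration of formula (1), the following sketch, assuming a
finite index set and a finite positive measure, computes the ratio of each
function's µ-weighted volume to the total mass of µ and takes the supremum
over the index set; the functions, labels, and numbers are made up for the
example.

```python
# A minimal discrete sketch of the ratio in formula (1):
# for each candidate f_i, compute (sum_x f_i(x) * mu(x)) / (sum_x mu(x))
# and take the supremum over the index set I.
# The measure mu and the family of functions are hypothetical data.

mu = {"x1": 0.5, "x2": 1.0, "x3": 1.5}            # a finite positive measure
family = {                                        # {f_i | i in I}
    "f1": {"x1": 2.0, "x2": 1.0, "x3": 0.5},
    "f2": {"x1": 0.5, "x2": 1.5, "x3": 2.0},
}

total_mass = sum(mu.values())                     # the denominator, integral of dmu
ratios = {
    name: sum(f[x] * mu[x] for x in mu) / total_mass
    for name, f in family.items()
}
best = max(ratios, key=ratios.get)                # index attaining the supremum
print(ratios, "maximal function:", best)
```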
An economic index based on ensemble averages of economic entities and on the
maximal function theorem was constructed by Wald (1939). Another application
of this theorem is to solve multi-armed bandit problems by constructing a
positive measure of performance for the bandits that depends upon their
history, and by allocating resources at each decision moment to a bandit
with the highest performance measure at that time.
This performance measure is known as the "Gittins' index," Gittins (1972).
If there is a finite number of bandits, say N, then let β ∈ (0,1) be a
discount factor, X^i_t be the state of bandit i at time t, and R(X^i_t) be
the bounded reward generated by activating bandit i at time t, starting at
an initial time s, for i = 1, 2, 3, ..., N.
The discounted expected reward would be

E[ Σ_{i=1}^{N} Σ_{t=s}^{T} β^t R(X^i_t) | F(X^i_s) ]    (2)

and we are trying to maximize it by selecting an activation order of the
projects.
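For intuition, a Monte Carlo sketch of the discounted reward of formula (2)
under one fixed activation rule could look as follows; it assumes a simple
round-robin activation order with state-independent Bernoulli rewards, so
the reward model and all parameters are illustrative assumptions, not part
of the paper's model.

```python
import random

# A hedged Monte Carlo sketch of the total discounted reward in formula (2)
# under one fixed activation order (here: round-robin).  Bernoulli rewards
# and all parameter values are hypothetical.

def discounted_reward(success_probs, beta=0.9, horizon=20, runs=2000, seed=0):
    rng = random.Random(seed)
    n = len(success_probs)
    total = 0.0
    for _ in range(runs):
        run_reward = 0.0
        for t in range(horizon):
            i = t % n                               # round-robin activation order
            reward = 1.0 if rng.random() < success_probs[i] else 0.0
            run_reward += (beta ** t) * reward      # beta^t R(X^i_t)
        total += run_reward
    return total / runs                             # Monte Carlo estimate of the expectation

print(discounted_reward([0.3, 0.6, 0.8]))
```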
Hence, the Gittins' index for bandit i at time s is

G_i(s) = sup_{τ>s} { E[ Σ_{t=s}^{τ−1} β^t R(X^i_t) | F(X^i_s) ] / E[ Σ_{t=s}^{τ−1} β^t | F(X^i_s) ] }    (3)

where F(X^i_s) is the σ-field generated by the history of X^i_s and τ is the
stopping time of the activated bandit i, such that

τ = argsup_{τ>s} { E[ Σ_{t=s}^{τ−1} β^t R(X^i_t) | F(X^i_s) ] / E[ Σ_{t=s}^{τ−1} β^t | F(X^i_s) ] }    (4)

This formulation makes it possible to compare the performance of the bandits
and then select the one that provides the maximum reward rate, together with
its activation time duration. Ordering bandits by their Gittins' index, from
the highest index to the lowest, provides an optimal policy for activating
them, Gittins (1979). For more comprehensive information,
please see Gittins (1989) and Gittins (1994).
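As a concrete, hedged sketch, the following evaluates formula (3) by brute
force in the special case of bandits whose reward streams are already known
(deterministic), so the conditional expectations reduce to plain sums, and
then orders the bandits from the highest index to the lowest, as the index
policy prescribes. The reward sequences are hypothetical; the general case
requires solving an optimal stopping problem rather than this enumeration.

```python
# A hedged sketch of the Gittins' index of formula (3) for a known
# (deterministic) reward stream, where the expectations reduce to sums:
#   G(s) = max over tau > s of  sum_{t=s}^{tau-1} beta^t r_t / sum_{t=s}^{tau-1} beta^t
# The reward sequences below are illustrative, not data from the paper.

def gittins_index_deterministic(rewards, beta=0.9, s=0):
    best = float("-inf")
    for tau in range(s + 1, len(rewards) + 1):        # candidate stopping times
        num = sum(beta ** t * rewards[t] for t in range(s, tau))
        den = sum(beta ** t for t in range(s, tau))
        best = max(best, num / den)
    return best

bandits = {
    "A": [1.0, 0.2, 0.2, 0.2],      # front-loaded rewards
    "B": [0.4, 0.5, 0.6, 0.7],      # rewards that improve over time
}
indices = {name: gittins_index_deterministic(r) for name, r in bandits.items()}
# Activate bandits from the highest index to the lowest, per the index policy.
order = sorted(indices, key=indices.get, reverse=True)
print(indices, "activation order:", order)
```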
A generalization of the multi-armed bandit problem is to let the bandits
influence each other, while keeping the independence assumption for the
bandits' state transition probabilities. A multiplicative influence on the
rewards was introduced by Nash (1980), with a performance measure different
from the Gittins' index, known as the Nash index. Let Q(X^j_t) be the
positive, bounded influence factor of bandit j, at a state X^j_t, when it is
not activated at time t.
Then the discounted influential expected reward would be

E[ Σ_{i=1}^{N} Σ_{t=s}^{T} β^t R(X^i_t) Π_{j=1, j≠i}^{N} Q_j(X^j_t) | F(X^i_s) ]    (5)

The Nash index was defined as

N_i(s) = sup_{τ∈T} { E[ Σ_{t=s}^{τ−1} β^t R(X^i_t) | F(X^i_s) ] / E[ Q(X^i_s) − β^τ Q(X^i_τ) | F(X^i_s) ] }    (6)

for bandit i, Nash (1980).
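In the same spirit as the earlier sketch, formula (6) can be evaluated by
brute force in the deterministic special case, assuming a known reward
stream r_t and a known influence stream q_t for one bandit; all numbers
below are illustrative assumptions.

```python
# A hedged sketch of the Nash index of formula (6) in the deterministic case:
#   N(s) = max over tau > s of  sum_{t=s}^{tau-1} beta^t r_t / (q_s - beta^tau * q_tau)
# The reward and influence sequences are hypothetical.

def nash_index_deterministic(rewards, influences, beta=0.9, s=0):
    best = float("-inf")
    for tau in range(s + 1, len(rewards)):            # candidate stopping times
        num = sum(beta ** t * rewards[t] for t in range(s, tau))
        den = influences[s] - beta ** tau * influences[tau]
        if den > 0:                                   # keep only well-defined positive denominators
            best = max(best, num / den)
    return best

rewards = [0.6, 0.8, 0.5, 0.3]        # hypothetical rewards of one bandit
influences = [0.9, 0.7, 0.6, 0.5]     # hypothetical influence factors Q in (0,1)
print(nash_index_deterministic(rewards, influences))
```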
SELECTING SEVERAL COMPETING PROJECTS SEQUENTIALLY
If one is capable of activating more than one project at a time, then
searching for an optimal policy that obtains the best possible expected
collective return from these choices is a reasonable task. The possibility
of finding such an optimal selection policy was conjectured by Whittle
(1988). By employing the multi-armed bandit problem, together with the Nash
index policy, we can construct an approach for choosing a subset of
competing projects out of a set of feasible projects. Hence, the model for
the collective activation of influential armed bandits is as follows.
Model Assumptions:
(I) There are finitely many, N, statistically independent processes
(projects) to choose from, and each one has a state of nature, X^i_t, at a
time t in a finite time horizon, for i = 1, 2, ..., N.
(II) Each selected project provides a positive reward on activation,
R(X^i_t) > 0, subject to a discount factor, β ∈ (0,1), which identifies the
present value of the reward.
(III) Each non-selected project influences the reward of the selected
projects with a multiplicative factor, Q(X^j_t) ∈ (0,1), at time t, for
j = 1, 2, ..., N.
(IV) The selected project changes its state, but the non-selected ones do
not change their states.
The above model expresses that each selected project provides a reward
according to a probability distribution, and that a project also influences
the rewards of the other projects through a multiplicative influence factor;
this is known as the generalized reward-decision process, Gittins (1994).
Here, as an extension of this model, an ensemble of processes is selected at
each decision time, and their collective time-discounted rewards provide the
decision-making criterion.
Let us select 1 ≤ k ≤ N projects at a time. Also, let I be the index set of
the activated bandits and J be the index set of the inactive bandits. Thus,
the collective expected return, subject to competition at time t, will be

Π_{j∈J} Q_j(X^j_t) Σ_{i∈I} β^t E[ R_i(X^i_t) | F(X^i_t) ]    (7)
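A minimal sketch of the quantity in formula (7) follows, assuming the
expected rewards of the active projects and the influence factors of the
inactive ones at time t are given as plain numbers; all values and project
names are hypothetical.

```python
import math

# A minimal sketch of the collective expected return in formula (7):
# the product of the influence factors of the inactive bandits (index set J)
# multiplies the discounted sum of the expected rewards of the active ones (index set I).

def collective_return(expected_rewards, influences, active, beta=0.9, t=0):
    # expected_rewards[i]: E[R_i(X^i_t) | F(X^i_t)] for each bandit i
    # influences[j]: Q_j(X^j_t) in (0,1) for each bandit j
    inactive = set(expected_rewards) - set(active)
    influence_product = math.prod(influences[j] for j in inactive)
    discounted_sum = sum(beta ** t * expected_rewards[i] for i in active)
    return influence_product * discounted_sum

expected_rewards = {"P1": 1.2, "P2": 0.8, "P3": 0.5}
influences = {"P1": 0.9, "P2": 0.85, "P3": 0.7}
print(collective_return(expected_rewards, influences, active={"P1", "P2"}))
```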
Let Family_1 be defined as the collection of bandits with the property

E[ Q(X^j_s) − β^τ Q(X^j_τ) | F(X^j_s) ] < 0, for some τ > s
(8)

where

T = { τ | E[ Q(X^j_s) − β^τ Q(X^j_τ) | F(X^j_s) ] < 0, τ > s }
(9)

Also, let Family_2 be the collection of bandits with the property

E[ Q(X^j_s) − β^τ Q(X^j_τ) | F(X^j_s) ] ≥ 0, for τ > s
(10)

where

T = { τ | τ > s }
(11)
This introduces a preference relation between two families such that
Family_1 is preferred to Family_2.
According to this construct, there is an optimal policy that maximizes
the overall expected discounted rewards, as follows:
a) Select a bandit in Family_1 with the maximum Nash index, find its
stopping time, set the initial time, s, to this stopping time for the next
project, activate it, finish it, and repeat.
b) When Family_1 is empty, apply the selection process to Family_2.
A proof of the optimality of this policy was provided by Nash (1980).
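For concreteness, a schematic sketch of the selection loop in steps a) and
b) follows. It assumes that each project's Nash index, family membership,
and activation duration have already been computed, and that each project is
activated once, up to its stopping time; these inputs are hypothetical
placeholders, not quantities taken from the paper.

```python
from dataclasses import dataclass

# A schematic, self-contained sketch of the selection policy in steps a) and b).
# Each bandit carries a hypothetical, precomputed Nash index, its family label,
# and the duration until its stopping time; in practice these would be derived
# from formulas (6) and (8)-(11) rather than supplied as constants.

@dataclass
class Bandit:
    name: str
    nash_index: float
    family: int        # 1 for Family_1, 2 for Family_2
    duration: int      # time until the stopping time once activated

def schedule(bandits):
    s = 0                                    # current decision time
    remaining = list(bandits)
    order = []
    while remaining:
        family_1 = [b for b in remaining if b.family == 1]
        pool = family_1 if family_1 else remaining       # fall back to Family_2
        chosen = max(pool, key=lambda b: b.nash_index)   # highest Nash index first
        order.append((s, chosen.name))
        s += chosen.duration                 # next project starts at the stopping time
        remaining.remove(chosen)
    return order

bandits = [
    Bandit("P1", nash_index=3.2, family=1, duration=2),
    Bandit("P2", nash_index=2.1, family=2, duration=3),
    Bandit("P3", nash_index=2.8, family=1, duration=1),
]
print(schedule(bandits))
```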
The finiteness of the state space and the time horizon enables us to
construct a combinatorial algorithm with finitely many steps. The rate of
growth is related to the dimension of the space, according to the
Hilbert-Noether theorem, see Schmid (1991). Thus, there is an upper bound on
the number of generators. Hence, the orbit space, or power set, of the
bandit collective is constructed b