본문으로 건너뛰기

IAI +006

· 약 8분

Markov process or Markov chain

the state sequence satisfies the Markov assumption

  • The current state depends on only a finite fixed number of previous states.
  • P(XtX0:t1)=P(XtXt1)P(X_t | X_{0:t-1}) = P(X_t | X_{t-1})

Decision Theory

  • Choosing actions based on the desirability of their immediate outcomes.
  • The outcome of taking action aa in state s0s_0 is deterministic then Result(s0,a)Result(s_0, a).
  • If the outcome is non-deterministic (stochastic), fully observable, the agent knows the current state s0s_0.
    • It is represented as a random variable whose values are the possible outcome states.
    • The transition model specifies the probabilities of possbile outcome states.
    • P(s,a,s)P(s, a, s')

MEU, Maximum Expected Utility

A rational agent should choose the action that maximizes its expcted utility.

  • The MEU principle could be seen as defining all of AI.
  • action=argmaxa(EU(ae))action = argmax_a(EU(a|e))
    • among this set of actions, give the highest expected utility.
  • EU(ae)=s(P(Result(a)=sa,e)U(s))EU(a|e) = \sum_{s'} (P(Result(a) = s' | a, e) * U(s'))

Sequential decision problem

type of problem or scenario where a decision-maker must make a series of choices or decisions over time in order to achieve a desired outcome.

  • a sequential decision-making problem or sequential decision-making task

MDP, Markov Decision Process

a sequential decision problem for a fully observable, stochastic environment environment with a Markovian transition model and additive rewards.

  • a formal framework for modeling sequential decision problems.
  • including states, actions, transition probabilities, and rewards
  • used in reinforcement learning and operations research.
ComponentDescription
Environmentfully observable, Stochastic
Statesa set of states (with an initial state S₀)
Transition modelMarkovian
Rewardadditive
AgentMake decisions at time step for actions
Interacts with environment through percepts and actions
  • A set of states: an initial state S0S_0, possible terminal state(s)
  • A set of actions in each state: ACTIONS(s)ACTIONS(s)
  • A stochastic Markov transition model: P(ss,a)P(s' | s, a)
  • A reward function: R(s)R(s) or R(s,a,s)R(s, a, s')
  • In MDP, Additive Reward is defined as a function that assigns a numerical value to each state-action pair or state-transition pair.
    • represents the immediate benefit or cost associated with taking a specific action in a particular state.
  • The solution must specify what the agent should do for any state that the agent might reach.
  • Policy: π(s)π(s), an action for each state
    • A solution to an MDP.
    • π(s)\pi(s) is the action recommended by the policy π\pi for state ss
  • Q state: Q(s,a)Q(s, a), the expected utility of doing action aa in state ss

RL, Reinforcement Learning

  • an agent interacts with an environment
  • takes actions to maximize a cumulative reward signal over time.
  • learns a policy that balances exploration and exploitation
    • exploration: trying new actions.
    • exploitation: choosing known good actions.

Sample MDP

Environment

  • The agent lives in a 4x3 fully observable stochastic environment
  • Walls block the agent's path; Begin in the start state, the agent must choose an action at each time step
  • Actions available to the agent in each state: Up, Down, Left, or Right
  • The interaction with the environment terminates when the agent reaches one of the goal states, marked +1 or -1

4x3 Grid World

1234
3+1
2🚫-1
1START
  • START: Initial position (1,1)
  • 🚫: Wall (impassable)
  • +1: Goal state with positive reward
  • -1: Goal state with negative reward

Transition Model

  • Stochastic environment with uncertainty
  • When agent chooses an action, there's:
    • 0.8 probability of moving in intended direction
    • 0.1 probability of moving perpendicular to intended direction (each side)

Reward Function

  • The agent receives rewards each time step:
    • Small "living" reward each step (-0.04 in all states except the terminal states)
    • Big rewards come at the end (+1 for good and -1 for bad terminal states)

Goal

  • To maximize sum of rewards from the start state to a terminate state

Additive Reward

  • a type of reward structure
  • the total reward earned by an agent in an MDP
  • calculated by summing up the individual rewards received at each time step as the agent interacts with the environment.

Discounted reward

  • a techinque used to calcluate the expected cumulative reward over time.
  • applying a discount factor to future rewards, prioritizing more immediate rewards over distant ones.

Utility

Uπ(s)=E[t=0γtR(St,π(St),St+1)]U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, \pi(S_t), S_{t+1})\right]

  • The expectation EE is with repect to the probability distribution over state sequences determined by ss and π\pi.
  • γ\gamma is the discount factor, 0γ<10 \leq \gamma < 1, which determines the present value of future rewards.
  • R(St,π(St),St+1)R(S_t, \pi(S_t), S_{t+1}) is the reward received when transitioning from state StS_t to state St+1S_{t+1} by taking action π(St)\pi(S_t).

Bellman equation

U(s)=maxaA(s)sP[ss,a](R(s,a,s)+γU(s))U(s) = max_{a \in A(s)} \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma U(s'))

Formula ComponentDescription
U(s)U(s)the utility of state ss
maxamax_athe maximum over all possible actions aACTIONS(s)a \in ACTIONS(s)
s\sum_{s'}the sum over all possible successor states ss'
P[ss,a]P[s'\vert s,a]the transition probability of reaching state ss'
R(s,a,s)R(s,a,s')the reward received after transitioning from state ss to state ss' by taking action aa
γ\gammathe discount factor, 0γ<10 \leq \gamma < 1
U(s)U(s')the utility of the successor state ss'
  • In optimal setting, the neighbor state meets the Bellman equation.
  • From s0s_0 to termination, each step in the sequence should match the Bellman equation.

πs=argmaxπUπ(s)\pi^{*}_s = argmax_{\pi}U^{\pi}(s)

Q function

  • the expected utility of taking a given action in a given state.
  • enables the agent to act optimally simply by choosing
    • U(s)=maxaQ(s,a)U(s) = max_a Q(s,a)
  • The optimal policy can be extracted from the Q-function.
    • π(s)=argmaxaQ(s,a)\pi^{*}(s) = argmax_a Q(s,a)

Q(s,a)=sP[ss,a](R(s,a,s)+γU(s)) =sP[ss,a](R(s,a,s)+γmaxaQ(s,a))Q(s,a) = \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma U(s')) \\ \space = \sum_{s'}P[s'|s,a](R(s,a,s') + \gamma max_{a'}Q(s',a'))

Reinforcement Learning

Key components of RL

  • Agent
  • Environment
  • State
  • Action
  • Reward
  • Policy

Application of RL

  • Robotics
  • Autonomous systems
  • Game playing
    • Atari video games from raw visual input
    • Playing poker
    • Alpha Go
  • Recommendation systems

RL Algorithms

  • Model-based RL
    • often learns a utility function U(s)U(s)
    • the sum of rewards from state ss onward from observing the effects of its actions on the environment.
  • Model-free RL
    • Action-utility learning, such as Q-learning
    • Policy learning
  • Passive RL
    • the agent's policy is fixed
    • the task is to learn the utilities of states.
  • Active RL
    • the agent must figure out what to do through exploration and exploitation.
      • Exploration: trying new actions to discover their effects.
      • Exploitation: leveraging known actions that yield high rewards.

Direct Utility Estimation

  • the utility of a state is the expected total reward from the state onward.
    • the expecgted reward-to-go from that state.
    • each trial provides a sample of this quantity for state visited.
  • end of each sequence, the algorighm calculates the observed reward-to-go for each state and updates the estimated utility for the state accordingly.
  • In the limit of infinitly many trials, the sample average will converge to the expectation in teh Bellman equation.
  • Ignores the connections between states.
    • misses opportunities to learning, it learns nothing until the end of the trial.
    • often converges very slowly.

Adaptive Dynamic Programming, ADP

Ui(s)=sP(ss,πi(s))[R(s,πi(s),s)+γUi(s)]U_i(s) = \sum_{s'}P(s'|s,\pi_{i}(s))[R(s,\pi_{i}(s),s') + \gamma U_{i}(s')]

  • advantage of the constraints among the utilities of states by learning the transition model
    • that connects them and solves the corresponding MDP
    • using dynamic programming to calculate the utilities of the states.
  • These Bellman equations are linear when the policy πi\pi_i is fixed
    • can be solved using any linear algebra package.

Learning a transition model

  • the environment is fully observable
  • the agent can learn the mapping from a state-action pair (s,a)(s, a) to the resulting state ss'.
  • The transition model P(ss,a)P(s'|s,a) is represented as a table and it is estimated directly from the counts that are accumulated in Nss,aN_{s'|s,a}.
  • The counts record how often state ss' is reached when executing aa in ss.
    • P(ss,a)=Nss,asNss,aP(s'|s,a) = \frac{N_{s'|s,a}}{\sum_{s''}N_{s''|s,a}}

Temporal difference, TD

Uπ(s)Uπ(s)+α[R(s,π(s),s)+γUπ(s)Uπ(s)]U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha[R(s,\pi(s),s') + \gamma U^{\pi}(s') - U^{\pi}(s)]

  • when a transition occurs from state ss to state ss' via action π(s)\pi(s)
  • α\alpha is the learning rate, 0<α10 < \alpha \leq 1
  • no need a transition model to perform updates
  • the environment is itself supplies the connection between neighboring states in the form of observed transitions.

Q learning

  • model-free reinforcement learning algorithm
  • the agent learns a Q-function, Q(s,a)Q(s,a)
  • avoids the need for a model by learning an action-utility function Q(s,a)Q(s,a) instead of utility function U(s)U(s).
  • a model-free Q-learning TD update
    • Q(s,a)Q(s,a)+α[R(s,a,s)+γmaxaQ(s,a)Q(s,a)]Q(s,a) \leftarrow Q(s,a) + \alpha[R(s,a,s') + \gamma max_{a'}Q(s',a') - Q(s,a)]
    • this update is calculated whenever action aa is executed in state ss leading to state ss'
  • TD Q-learning agent doesn't need a transition model P(ss,a)P(s'|s,a)
    • either for learning or for action selection.