Learn about the Markov decision process (MDP), a stochastic decision-making process that undergirds reinforcement learning, machine learning, and artificial intelligence.
![[Feature Image] Two learners discuss the topic, “What is a Markov decision process,” as they plan how to use it to aid in a project.](https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/https://images.ctfassets.net/wp1lcwdav1p1/hfjbnHNIx8xUkeL19Un8a/05610da062c1db553d48fadf3f98fbaf/GettyImages-2029000903.jpg?w=1500&h=680&q=60&fit=fill&f=faces&fm=jpg&fl=progressive&auto=format%2Ccompress&dpr=1&w=1000)
In machine learning, Markov decision processes (MDPs) are models for making sequential decisions when outcomes are partly random and partly under the decision-maker's control. The model builds on the theory of Markov chains, which are stochastic processes characterized by the Markov property. MDPs are the standard framework for reinforcement learning, a branch of machine learning used in robotics, autonomous vehicles, and other automated systems.
Explore Markov decision processes, their core concepts, uses, and applications in many industries.
Several components make up a Markov decision process. The first is the Markov property, which states that the next state depends only on the current state (and the action taken in it), not on the states that came before. In an MDP, an agent is a decision-maker that chooses the system's actions to optimize its performance. The agent makes decisions at specific points in time, known as decision epochs. At each decision epoch, the agent receives a reward or incurs a cost, and that feedback shapes the actions it chooses in the future.
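The Markov property can be written compactly. The notation here is an assumption added for illustration: $s_t$ is the state and $a_t$ the action at decision epoch $t$.

```latex
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```

In words, once you know the current state and action, the earlier history adds no further information about what happens next.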
The core components of an MDP take the form of a tuple (S, A, p, r):
States (S): The state space, or the set of all states the system can be in; for example, for a vehicle, it could be all positions the vehicle can occupy
Actions (A): The set of actions available to the agent; for example, all the ways a vehicle can move when you turn the wheel, move forward, reverse, or stop
Transition probabilities (p): The probability of moving to each possible next state, given the current state and the action taken in it
Reward function (r): The reward or cost of performing an action in a specific state
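As a rough illustration, the four parts of the tuple can be stored in plain Python dictionaries. The machine-maintenance example, its state and action names, and all of the numbers below are made up for this sketch.

```python
# Hypothetical toy MDP: a machine that is either working or broken.
STATES = ["working", "broken"]   # S
ACTIONS = ["run", "repair"]      # A

# p: P[state][action] maps each next state to its probability.
P = {
    "working": {
        "run": {"working": 0.9, "broken": 0.1},
        "repair": {"working": 1.0, "broken": 0.0},
    },
    "broken": {
        "run": {"working": 0.0, "broken": 1.0},
        "repair": {"working": 0.8, "broken": 0.2},
    },
}

# r: R[state][action] is the immediate reward (negative values are costs).
R = {
    "working": {"run": 10.0, "repair": -2.0},
    "broken": {"run": 0.0, "repair": -5.0},
}


def is_valid_mdp(states, actions, P, R, tol=1e-9):
    """Check that every (state, action) pair has a reward and a proper distribution."""
    for s in states:
        for a in actions:
            if a not in R[s]:
                return False
            dist = P[s][a]
            if any(nxt not in states or prob < 0 for nxt, prob in dist.items()):
                return False
            if abs(sum(dist.values()) - 1.0) > tol:
                return False
    return True


print(is_valid_mdp(STATES, ACTIONS, P, R))
```

The only structural requirement checked here is that the transition probabilities for each state and action form a distribution over S.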
When the agent decides which action to take in the Markov decision process framework, it does so according to a policy. In MDPs, a policy is a rule that tells the agent which action to take in each state. Two kinds of policy classifications exist in MDPs:
Stationary: A stationary policy is static in that the decision remains the same when a given state presents itself. For example, if you are playing poker, you could set a policy in which you always bet five dollars if you are dealt a pair.
Nonstationary: A nonstationary policy can prescribe different actions for the same state. The action taken depends on the decision epoch the system is in.
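The difference between the two kinds of policy can be sketched in a few lines. The states, actions, horizon, and the rule itself are hypothetical and chosen only to show the shape of each policy.

```python
HORIZON = 10  # hypothetical number of decision epochs

# Stationary policy: the action depends only on the state.
stationary_policy = {"working": "run", "broken": "repair"}


def act_stationary(state, epoch):
    """Return the same action for a given state at every epoch."""
    return stationary_policy[state]


def act_nonstationary(state, epoch):
    """Return an action that may change with the decision epoch.

    Made-up rule: stop repairing in the last two epochs, because a repair
    that late would not have time to pay for itself.
    """
    if state == "broken" and epoch < HORIZON - 2:
        return "repair"
    return "run"


for epoch in (0, 9):
    print(epoch, act_stationary("broken", epoch), act_nonstationary("broken", epoch))
```

Both functions take the epoch as an argument so they can be used interchangeably; the stationary one simply ignores it.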
As you progress through an MDP, each decision (an action taken under the policy in place) moves the system from its current state to a new state, until you reach the final decision epoch. The goal is to choose actions that maximize the total reward collected along the way; in problems where every step carries a cost, that amounts to reaching the final state in as few steps as possible.
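To make the progression from state to state concrete, here is a minimal simulation sketch. It assumes a hypothetical two-state machine-maintenance MDP with made-up probabilities and rewards, and a fixed policy stored as a dictionary.

```python
import random

# Hypothetical toy MDP: P[state][action] -> {next_state: probability}.
P = {
    "working": {"run": {"working": 0.9, "broken": 0.1},
                "repair": {"working": 1.0}},
    "broken": {"run": {"broken": 1.0},
               "repair": {"working": 0.8, "broken": 0.2}},
}
# R[state][action] is the immediate reward (negative values are costs).
R = {
    "working": {"run": 10.0, "repair": -2.0},
    "broken": {"run": 0.0, "repair": -5.0},
}
policy = {"working": "run", "broken": "repair"}


def rollout(start_state, policy, num_epochs, rng):
    """Follow `policy` for `num_epochs` decisions; return the trajectory and total reward."""
    state, total_reward, trajectory = start_state, 0.0, []
    for _ in range(num_epochs):
        action = policy[state]
        reward = R[state][action]
        dist = P[state][action]
        # Sample the next state according to the transition probabilities.
        next_state = rng.choices(list(dist), weights=list(dist.values()))[0]
        trajectory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
    return trajectory, total_reward


trajectory, total_reward = rollout("working", policy, num_epochs=5, rng=random.Random(0))
for step in trajectory:
    print(step)
print("total reward:", total_reward)
```

Each tuple in the trajectory records one decision epoch: the state, the action the policy chose, the reward received, and the state the system moved to.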
To solve a Markov decision process, you must find the optimal policy, which is the policy that yields the highest expected return (cumulative reward). To identify it, you need to know the expected return the agent can obtain from every state. The function that maps each state to this expected return is known as the value function. The Bellman equation expresses the value function recursively, which is what makes it computable. It splits the value function into two parts:
Immediate reward: The expected reward the agent receives for the action it takes in the current state
Discounted value: The value of the successor state that the agent moves to, weighted by a discount factor
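In symbols, the two-part split can be written as follows. The notation is an assumption layered on the text: $r_{t+1}$ is the reward received after acting in state $s_t$, $s_{t+1}$ is the successor state, and $\gamma$ is a discount factor between 0 and 1 that controls how much future value counts.

```latex
V(s) = \mathbb{E}\big[\, \underbrace{r_{t+1}}_{\text{immediate reward}}
       + \underbrace{\gamma \, V(s_{t+1})}_{\text{discounted value}}
       \;\big|\; s_t = s \,\big]
```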
Because the actions in the Bellman equation are chosen by a policy, the value function depends on the policy being followed. Solving this equation is a core task of dynamic programming, which solves multi-step optimization problems by breaking them into recursive subproblems.
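One common way to solve the Bellman equation for a fixed policy is to apply it repeatedly until the values stop changing. The sketch below assumes a hypothetical two-state machine-maintenance MDP and a discount factor `GAMMA`; none of these numbers come from the text above.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical toy MDP: P[state][action] -> {next_state: probability}.
P = {
    "working": {"run": {"working": 0.9, "broken": 0.1},
                "repair": {"working": 1.0}},
    "broken": {"run": {"broken": 1.0},
               "repair": {"working": 0.8, "broken": 0.2}},
}
# R[state][action] is the immediate reward (negative values are costs).
R = {
    "working": {"run": 10.0, "repair": -2.0},
    "broken": {"run": 0.0, "repair": -5.0},
}


def evaluate_policy(policy, tol=1e-10):
    """Repeat V(s) <- r(s, a) + GAMMA * sum over s' of p(s'|s, a) * V(s'), with a = policy[s]."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {}
        for s in P:
            a = policy[s]
            new_V[s] = R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a].items())
        # Stop once no state's value changed by more than the tolerance.
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V


always_run = {"working": "run", "broken": "run"}
run_then_repair = {"working": "run", "broken": "repair"}
print(evaluate_policy(always_run))
print(evaluate_policy(run_then_repair))
```

The two printed value functions differ even though the MDP is the same, which is the sense in which the value function depends on the policy.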
Two popular algorithms use the Bellman equation to find the optimal policy for a system. They are as follows:
Value iteration: Repeatedly applies the Bellman equation to compute the optimal value function, then derives the optimal policy from the converged result
Policy iteration: Starts from an arbitrary (often random) policy, then alternates between evaluating that policy's value function with the Bellman equation and improving the policy by choosing the best action in each state; it repeats until the policy stops changing
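Both algorithms can be sketched in a few dozen lines. This is a minimal illustration rather than a reference implementation; the two-state machine-maintenance MDP, its numbers, and the discount factor `GAMMA` are assumptions made for the example.

```python
GAMMA = 0.9  # assumed discount factor

STATES = ["working", "broken"]
ACTIONS = ["run", "repair"]
# Hypothetical toy MDP: P[state][action] -> {next_state: probability}.
P = {
    "working": {"run": {"working": 0.9, "broken": 0.1},
                "repair": {"working": 1.0}},
    "broken": {"run": {"broken": 1.0},
               "repair": {"working": 0.8, "broken": 0.2}},
}
R = {
    "working": {"run": 10.0, "repair": -2.0},
    "broken": {"run": 0.0, "repair": -5.0},
}


def q_value(s, a, V):
    """Immediate reward plus discounted expected value of the successor state."""
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a].items())


def greedy_policy(V):
    """Pick, in each state, the action with the highest q_value under V."""
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}


def value_iteration(tol=1e-10):
    V = {s: 0.0 for s in STATES}
    while True:
        # Bellman optimality backup: take the best action in every state.
        new_V = {s: max(q_value(s, a, V) for a in ACTIONS) for s in STATES}
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V, greedy_policy(new_V)
        V = new_V


def evaluate(policy, tol=1e-10):
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: q_value(s, policy[s], V) for s in STATES}
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V


def policy_iteration():
    policy = {s: ACTIONS[0] for s in STATES}  # arbitrary starting policy
    while True:
        V = evaluate(policy)          # policy evaluation
        improved = greedy_policy(V)   # policy improvement
        if improved == policy:        # stable policy: stop
            return V, policy
        policy = improved


V_vi, pi_vi = value_iteration()
V_pi, pi_pi = policy_iteration()
print(pi_vi, pi_pi)
```

On this small example the two methods arrive at the same policy; they differ in how they get there, with value iteration updating values directly and policy iteration alternating between evaluation and improvement.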
Reinforcement learning (RL) is a family of machine learning algorithms built on the Markov decision process. The reinforcement learning agent follows the process of the MDP as it explores the state space (consisting of all possible states) and the action space (made up of all possible actions it can take). As it explores, the RL agent receives rewards for good decisions and learns to repeat them when it finds itself in a similar state. Over time, the RL agent learns how to operate in its environment and meet its goals.
The RL agent comes to favor actions that have earned rewards but continues to explore new states and actions. Balancing exploitation of what it has already learned against exploration of the unknown is how it improves its decision-making.
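One widely used way to strike this balance is an epsilon-greedy rule paired with tabular Q-learning, a technique named here for illustration and not described in the text above. The sketch assumes a hypothetical two-state machine-maintenance environment; `GAMMA`, `ALPHA`, and `EPSILON` are assumed settings, not recommended values.

```python
import random

GAMMA = 0.9    # assumed discount factor
ALPHA = 0.1    # assumed learning rate
EPSILON = 0.1  # assumed exploration rate

ACTIONS = ["run", "repair"]
# Hypothetical environment, hidden from the agent, which only sees samples.
P = {
    "working": {"run": {"working": 0.9, "broken": 0.1},
                "repair": {"working": 1.0}},
    "broken": {"run": {"broken": 1.0},
               "repair": {"working": 0.8, "broken": 0.2}},
}
R = {
    "working": {"run": 10.0, "repair": -2.0},
    "broken": {"run": 0.0, "repair": -5.0},
}


def step(state, action, rng):
    """Environment: sample the next state and return it with the reward."""
    dist = P[state][action]
    next_state = rng.choices(list(dist), weights=list(dist.values()))[0]
    return next_state, R[state][action]


def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])


def q_update(Q, state, action, reward, next_state):
    """Move Q[state][action] a fraction ALPHA toward the one-step target."""
    target = reward + GAMMA * max(Q[next_state].values())
    Q[state][action] += ALPHA * (target - Q[state][action])


def train(num_steps, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in ACTIONS} for s in P}
    state = "working"
    for _ in range(num_steps):
        action = choose_action(Q, state, EPSILON, rng)
        next_state, reward = step(state, action, rng)
        q_update(Q, state, action, reward, next_state)
        state = next_state
    return Q


Q = train(num_steps=20000)
print({s: max(ACTIONS, key=lambda a: Q[s][a]) for s in P})
```

Setting `EPSILON` to 0 would make the agent purely exploitative, while a value of 1 would make it act at random; values in between trade off the two behaviors.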
MDPs and reinforcement learning have many real-world applications that use dynamic programming and recursive algorithms. Some of the applications include:
Robotics: Robotics uses deep reinforcement learning for complex movement, decision-making, and sensory input.
Natural language processing: Deep reinforcement learning helps fine-tune large language models, such as those behind chatbots.
Autonomous vehicle decision-making: Reinforcement learning trains autonomous vehicles to respond like a human driver.
Financial investment and insurance: MDPs help analyze current investment practices based on previous decisions.
Maintenance and repair of equipment: MDPs create models for evaluating the problem in a machine and prescribing whether maintenance or replacement is better over time.
Epidemics and public health: MDPs can help model epidemic outbreaks and make decisions based on the number of infections present at a given time.
Markov decision processes are a key component of reinforcement learning and underpin many machine learning and artificial intelligence systems. If you want to gain in-demand skills in machine learning, try the Machine Learning Specialization from Stanford and DeepLearning.AI on Coursera. Also, try the IBM AI Engineering Professional Certificate to help you build practical AI experience.
Editorial Team