Reinforcement Learning
Reinforcement learning is about taking suitable actions to maximize reward in a particular situation. It is employed by software and machines to find the best possible behavior or path to take in a specific situation. Unlike supervised learning, where the training data carries the correct answers, a reinforcement learning agent learns from its own experience.
Q-Learning is a basic form of Reinforcement Learning which uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.
The "q" in Q-learning stands for quality. Quality in this case represents how useful a given action is in gaining some future reward.
Notation:
s - the set of states
a - the set of actions
P(s' | s, a) - the transition probabilities
alpha - the starting-state distribution
gamma - the discount factor
r(s, a) - the reward for taking action a in state s
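As a concrete illustration, here is one way these pieces might be laid out in code for a hypothetical three-state, two-action problem (the transition table, rewards, and start distribution below are invented for illustration):

    import numpy as np

    # Hypothetical 3-state, 2-action MDP used only for illustration.
    n_states, n_actions = 3, 2

    # P[s, a, s'] = probability of landing in s' after taking action a in state s.
    P = np.zeros((n_states, n_actions, n_states))
    P[0, 0, 1] = 1.0            # action 0 in state 0 always leads to state 1
    P[0, 1, 2] = 1.0
    P[1, 0, 0] = 0.5            # action 0 in state 1 is stochastic
    P[1, 0, 2] = 0.5
    P[1, 1, 2] = 1.0
    P[2, :, 2] = 1.0            # state 2 is absorbing

    # r[s, a] = immediate reward for taking action a in state s.
    r = np.array([[0.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])

    alpha = np.array([1.0, 0.0, 0.0])   # starting-state distribution: always start in state 0
    gamma = 0.9                          # discount factor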
We want to obtain a function Q(s, a) that predicts the best action a in state s, in order to maximize the cumulative reward. This function can be estimated using Q-learning, which iteratively updates Q(s, a) using the Bellman equation.
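A minimal sketch of one such update step, assuming a tabular Q stored as a NumPy array (the learning rate is named lr here to avoid clashing with alpha, the starting-state distribution above):

    import numpy as np

    def q_update(Q, s, a, reward, s_next, lr=0.1, gamma=0.9):
        """One Q-learning step: move Q[s, a] toward the Bellman target."""
        target = reward + gamma * np.max(Q[s_next])   # immediate reward + best future value
        Q[s, a] += lr * (target - Q[s, a])            # nudge the estimate toward the target
        return Q

Repeating this update over many observed transitions (s, a, reward, s') makes Q converge toward the true action values under standard conditions.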
The exploration vs. exploitation dilemma asks whether the agent should trust the learnt Q-values enough to select actions based on them (exploitation), or try other actions in the hope of finding a better reward (exploration).
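A common compromise is an epsilon-greedy policy: exploit the current Q-values most of the time, but pick a random action with probability epsilon. A sketch (the value of epsilon is a tunable choice, not fixed by the theory):

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(Q, s, epsilon=0.1):
        """Pick a random action with probability epsilon, else the greedy one."""
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))   # explore: uniform random action
        return int(np.argmax(Q[s]))                # exploit: best action under current Q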
The agent will perform the sequence of actions that eventually generates the maximum total reward. This total reward is also called the Q-value, and we formalise our strategy as:

Q(s, a) = r(s, a) + gamma * max_{a'} Q(s', a')
The above equation states that the Q-value yielded from being at state s and performing action a is the immediate reward r(s, a) plus the highest Q-value possible from the next state s'. Gamma here is the discount factor, which controls the contribution of rewards further in the future.
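Putting the pieces together, a complete tabular Q-learning loop over the toy problem sketched earlier might look like this (the episode count, horizon, learning rate, and epsilon are arbitrary illustrative choices; n_states, n_actions, P, r, alpha, gamma, rng, and epsilon_greedy come from the sketches above):

    import numpy as np

    Q = np.zeros((n_states, n_actions))              # one Q-value per state-action pair

    for episode in range(500):
        s = int(rng.choice(n_states, p=alpha))       # sample a start state from alpha
        for t in range(20):                          # short horizon per episode
            a = epsilon_greedy(Q, s, epsilon=0.1)
            s_next = int(rng.choice(n_states, p=P[s, a]))   # sample s' from the transitions
            # Bellman update: immediate reward plus discounted best future Q-value
            Q[s, a] += 0.1 * (r[s, a] + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next

After training, the greedy policy argmax over Q[s] gives the action the agent believes maximizes total reward from each state.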