basic

Reference

Deep-Reinforcement-Learning-With-Python

Types


Supervised learning

In supervised learning, the machine learns from training data. The training data consists of labeled pairs of inputs and outputs. We train the model (agent) on this data in such a way that it can generalize its learning to new, unseen data. It is called supervised learning because the labeled training data acts as a supervisor, guiding the model in learning the given task.

Regression

Quantitative response
predict a quantitative variable from a set of features

Classification

Categorical response
predict a categorical variable
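
A minimal sketch of these two supervised settings, assuming scikit-learn as the library (the notes do not prescribe one) and toy data made up for illustration:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Training data: labeled pairs of inputs X and outputs y (the "supervisor").
X = [[1.0], [2.0], [3.0], [4.0]]

# Regression: the response is quantitative (a real number).
y_quantitative = [1.1, 2.0, 2.9, 4.2]
reg = LinearRegression().fit(X, y_quantitative)
print(reg.predict([[5.0]]))      # a real-valued prediction

# Classification: the response is categorical (a class label).
y_categorical = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y_categorical)
print(clf.predict([[5.0]]))      # a predicted class label
```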


Unsupervised learning

Similar to supervised learning, in unsupervised learning, we train the model (agent) based on the training data. But in the case of unsupervised learning, the training data does not contain any labels; that is, it consists of only inputs and not outputs. The goal of unsupervised learning is to determine hidden patterns in the input. There is a common misconception that RL is a kind of unsupervised learning, but it is not. In unsupervised learning, the model learns the hidden structure, whereas, in RL, the model learns by maximizing the reward.
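
As a concrete illustration of "inputs only, no labels", a minimal clustering sketch; the choice of scikit-learn's KMeans (and the toy data) is an assumption, just one way of uncovering hidden structure:

```python
from sklearn.cluster import KMeans

# Inputs only -- the training data has no output labels.
X = [[1.0], [1.2], [0.9], [8.0], [8.3], [7.9]]

# The model discovers a hidden grouping (two clusters) from the inputs alone.
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_)    # e.g. [0 0 0 1 1 1]
```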


Reinforcement learning

Action space

The set of all possible actions in the environment is called the action space. For example, in a grid world environment where the agent can move up, down, left, or right, the action space is [up, down, left, right]. We can categorize action spaces into two types:

  • Discrete action space
    When the action space consists of discrete actions, it is called a discrete action space. For instance, in the grid world environment the action space consists of four discrete actions: up, down, left, and right.
  • Continuous action space
    When the action space consists of continuous-valued actions, it is called a continuous action space. For instance, if we train an agent to drive a car, the actions take continuous values, such as the speed at which to drive the car, the number of degrees to rotate the steering wheel, and so on (see the code sketch after this list).
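
A minimal sketch of both kinds of action space, assuming Gymnasium's space objects (any RL toolkit exposes similar abstractions; the bounds below are made up):

```python
import numpy as np
from gymnasium.spaces import Discrete, Box

# Discrete action space: four actions, e.g. up/down/left/right in a grid world.
grid_actions = Discrete(4)
print(grid_actions.sample())          # an integer in {0, 1, 2, 3}

# Continuous action space: e.g. steering angle (degrees) and speed (km/h),
# each a real number within some bounds.
car_actions = Box(low=np.array([-30.0, 0.0]),
                  high=np.array([30.0, 120.0]),
                  dtype=np.float32)
print(car_actions.sample())           # a real-valued vector
```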

Policy

A policy defines the agent’s behavior in an environment. The policy tells the agent what action to perform in each state.
Over a series of iterations, the agent learns a good policy, that is, one that yields a high reward.
The optimal policy tells the agent to perform the correct action in each state so that the agent receives the maximum reward.

  • Deterministic policy
    A deterministic policy tells the agent to perform one particular action in a state. Thus, a deterministic policy maps each state to one particular action.

  • Stochastic policy
    A stochastic policy maps each state to a probability distribution over the action space.

    • Categorical policy
      Used when the action space is discrete.
      The stochastic policy uses a categorical probability distribution over the action space to select actions.
    • Gaussian policy
      Used when the action space is continuous.
      The stochastic policy uses a Gaussian probability distribution over the action space to select actions (both are sketched in code after this list).
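
A minimal sketch of these policy types in plain NumPy; the states, actions, probabilities, and Gaussian parameters are made up for illustration:

```python
import numpy as np

# Deterministic policy: maps each state to one particular action.
deterministic_policy = {"s0": "up", "s1": "right"}
action = deterministic_policy["s0"]          # always "up" in state s0

# Categorical (stochastic) policy: a probability distribution over a
# discrete action space; the probabilities would depend on the state.
actions = ["up", "down", "left", "right"]
probs = [0.7, 0.1, 0.1, 0.1]
action = np.random.choice(actions, p=probs)

# Gaussian (stochastic) policy: for a continuous action space, sample the
# action from a Gaussian whose mean and standard deviation depend on the state.
mean, std = 0.5, 0.1                         # e.g. a steering angle
action = np.random.normal(mean, std)
```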

Episode

The agent interacts with the environment by performing actions, starting from the initial state and reaching the final state. This agent-environment interaction from the initial state to the final state is called an episode. For instance, in a car racing video game, the agent plays the game starting from the initial state (the starting point of the race) and reaching the final state (the endpoint of the race); this is one episode. An episode is also often called a trajectory (the path taken by the agent).
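
A minimal sketch of one episode as an interaction loop, assuming Gymnasium and its CartPole environment as a stand-in (the example above is a car racing game), with a random policy:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()                 # initial state
done = False
while not done:
    action = env.action_space.sample()    # random policy, for illustration
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated        # final (terminal) state reached
env.close()                               # one complete episode (trajectory)
```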

  • Episodic task
    As the name suggests, an episodic task is one that has a terminal state. That is, episodic tasks are made up of episodes and thus have a terminal state. Example: a car racing game.
  • Continuous task
    Unlike episodic tasks, continuous tasks do not contain episodes, so they have no terminal state. For example, a personal assistance robot does not have a terminal state.

Horizon

The horizon is the time step up to which the agent interacts with the environment. We can classify the horizon into two types:

  • Finite horizon
    If the agent-environment interaction stops at a particular time step, it is called a finite horizon. For instance, in episodic tasks the agent interacts with the environment starting from the initial state at time step t = 0 and reaches the final state at time step T. Since the interaction stops at time step T, it is considered a finite horizon.
  • Infinite horizon
    If the agent-environment interaction never stops, it is called an infinite horizon. For instance, a continuous task has no terminal state, so the agent-environment interaction never stops, and it is considered an infinite horizon.

Return

Return is the sum of rewards received by the agent in an episode.
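
Written out for an episodic task, with one common indexing convention, the return of a trajectory $\tau$ is

$$R(\tau) = r_1 + r_2 + \dots + r_T = \sum_{t=1}^{T} r_t$$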

Value function

The value function, or the value of a state, is the expected return that the agent would obtain starting from state $s$ and following the policy $\pi$.
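
In symbols, this definition reads

$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[\, R(\tau) \mid s_0 = s \,\right]$$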

Q function

The Q function, or state-action value function, gives the expected return that the agent would obtain starting from state $s$, performing action $a$, and then following the policy $\pi$.
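
In symbols,

$$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\, R(\tau) \mid s_0 = s,\ a_0 = a \,\right]$$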