We can minimize the MSVE using gradient descent. The gradient of the MSVE with respect to w is:
$$\nabla \overline{VE}(\mathbf{w}) = -2 \sum_{s \in \mathcal{S}} \mu(s) \left[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \right] \nabla \hat{v}(s, \mathbf{w})$$
In practice, we use stochastic gradient descent: sample states $s$ according to $\mu(s)$ and update $\mathbf{w}$ incrementally after each sample,

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \right] \nabla \hat{v}(s, \mathbf{w})$$

where the unknown $v_\pi(s)$ is replaced by a sample target, such as the Monte Carlo return $G_t$.
Gradient Monte Carlo Algorithm
Input: the policy π to be evaluated
Input: a differentiable function v̂ : S × ℝᵈ → ℝ
Algorithm parameter: step size α > 0

Initialize value-function weights w ∈ ℝᵈ arbitrarily (e.g., w = 0)
Loop forever (for each episode):
    Generate an episode following π: S₀, A₀, R₁, S₁, A₁, R₂, …, S_{T−1}, A_{T−1}, R_T, S_T
    Loop for each step of episode, t = 0, 1, …, T−1:
        w ← w + α [G_t − v̂(S_t, w)] ∇v̂(S_t, w)
Semi-Gradient TD updates the value function approximation using the TD error but treats the target as a constant with respect to the weights when computing gradients:
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \, \delta_t \, \nabla \hat{v}(S_t, \mathbf{w})$$
TD Error: $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$
The method is called "semi-gradient" because it ignores the dependency of the target on w during differentiation.
Semi-Gradient TD(0) Algorithm
Input: the policy π to be evaluated
Input: a differentiable function v̂ : S⁺ × ℝᵈ → ℝ such that v̂(terminal, ·) = 0
Algorithm parameter: step size α > 0

Initialize value-function weights w ∈ ℝᵈ arbitrarily (e.g., w = 0)
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A ∼ π(·|S)
        Take action A, observe R, S′
        w ← w + α [R + γ v̂(S′, w) − v̂(S, w)] ∇v̂(S, w)
        S ← S′
    until S is terminal
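The same random-walk setup used for Monte Carlo can illustrate this loop; again the environment, step size, and one-hot features are illustrative assumptions. Note that the update uses the TD error but only the gradient of v̂(S, w), never of the target:

```python
import random

GAMMA = 1.0   # undiscounted episodic task (assumed for this example)
ALPHA = 0.05
N_STATES = 5  # hypothetical random walk: +1 reward on the right exit, 0 on the left

def features(s):
    # One-hot features, so v_hat(s, w) = w[s] and grad v_hat = x(s).
    x = [0.0] * N_STATES
    x[s] = 1.0
    return x

def v_hat(s, w):
    return sum(wi * xi for wi, xi in zip(w, features(s)))

def semi_gradient_td0(n_episodes=5000, seed=0):
    random.seed(seed)
    w = [0.0] * N_STATES
    for _ in range(n_episodes):
        s = N_STATES // 2                        # Initialize S (center state)
        while True:
            s_next = s + random.choice([-1, 1])  # equiprobable policy
            terminal = s_next < 0 or s_next >= N_STATES
            r = 1.0 if s_next >= N_STATES else 0.0
            v_next = 0.0 if terminal else v_hat(s_next, w)  # v_hat(terminal,.) = 0
            delta = r + GAMMA * v_next - v_hat(s, w)        # TD error
            for i, xi in enumerate(features(s)):
                w[i] += ALPHA * delta * xi       # semi-gradient update
            if terminal:
                break
            s = s_next
    return w
```

The target R + γ v̂(S′, w) is treated as a constant here: `delta` multiplies only the features of the current state, exactly as in the update rule above.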
Bias and Variance: TD methods introduce bias due to bootstrapping but have lower variance than Monte Carlo methods.
Convergence: TD learning with function approximation does not minimize the MSVE directly. Instead, it converges to the TD fixed point $\mathbf{w}_{TD}$, which minimizes the Projected Bellman Error. The MSVE at the TD fixed point is bounded relative to the best achievable MSVE:

$$\overline{VE}(\mathbf{w}_{TD}) \leq \frac{1}{1-\gamma} \min_{\mathbf{w}} \overline{VE}(\mathbf{w})$$
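A small worked example can make the gap between the two solutions concrete. The sketch below hand-builds a two-state Markov reward process with a single scalar feature (all numbers are illustrative assumptions), solves for the TD fixed point in closed form, and compares it with the direct MSVE minimizer:

```python
# Hypothetical 2-state Markov reward process, chosen so both solutions
# have closed forms: s0 -> s1 (reward 0), s1 -> terminal (reward 1).
GAMMA = 0.9
x = {"s0": 1.0, "s1": 0.5}                      # scalar feature of each state
mu = {"s0": 0.5, "s1": 0.5}                     # on-policy state distribution
nxt = {"s0": ("s1", 0.0), "s1": (None, 1.0)}    # (next state, reward)

# True values: v(s1) = 1, v(s0) = gamma * v(s1) (reward 0 then discounted v(s1))
v_true = {"s1": 1.0, "s0": GAMMA * 1.0}

# Linear TD(0) converges to the solution of A w = b, where
#   A = sum_s mu(s) x(s) (x(s) - gamma x(s')),  b = sum_s mu(s) r x(s)
A = sum(mu[s] * x[s] * (x[s] - GAMMA * (x[s2] if s2 else 0.0))
        for s, (s2, r) in nxt.items())
b = sum(mu[s] * r * x[s] for s, (s2, r) in nxt.items())
w_td = b / A

# Direct MSVE minimizer for linear v_hat: weighted least squares on v_true
w_star = (sum(mu[s] * v_true[s] * x[s] for s in x)
          / sum(mu[s] * x[s] ** 2 for s in x))

def ve(w):
    """MSVE of the linear approximation v_hat(s, w) = w * x(s)."""
    return sum(mu[s] * (v_true[s] - w * x[s]) ** 2 for s in x)
```

Here `ve(w_td)` exceeds `ve(w_star)` — the TD fixed point is not the MSVE minimum — but it stays within the 1/(1 − γ) factor of the bound above.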
A neural network is a parameterized function composed of layers of interconnected nodes (neurons):
Input Layer: Receives the input data (state representation).
Hidden Layers: Perform transformations using weights and activation functions.
Output Layer: Produces the final output (e.g., value estimate).
Each neuron computes a weighted sum of its inputs and passes the result through an activation function (e.g., ReLU, sigmoid).
Neural networks can approximate complex, non-linear functions, making them suitable for high-dimensional or continuous state spaces. They can learn useful representations (features) from raw inputs.
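The three layers and the per-neuron computation can be sketched as a single forward pass; the weights, biases, and the 2-feature input below are illustrative, and only ReLU is shown:

```python
def relu(z):
    # Rectified linear unit: the activation applied at each hidden neuron
    return max(0.0, z)

def forward(x, W1, b1, W2, b2):
    """Forward pass of a tiny value network: the input layer is the state
    features x, one hidden ReLU layer transforms them, and a single linear
    output neuron produces the value estimate."""
    # Hidden layer: each neuron computes a weighted sum of inputs plus a
    # bias, then passes the result through the activation function.
    h = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Output layer: a linear combination of the hidden activations.
    return sum(w * hi for w, hi in zip(W2, h)) + b2

# Example with 2 input features and 2 hidden neurons (weights are arbitrary):
value = forward([1.0, 2.0],
                W1=[[1.0, 0.0], [0.0, -1.0]], b1=[0.0, 0.0],
                W2=[2.0, 3.0], b2=0.5)
# Hidden pre-activations are 1 and -2; ReLU zeroes the second neuron,
# so value = 2*1 + 3*0 + 0.5 = 2.5
```

In a value-function setting, x would be the state representation and the output the estimate v̂(s, w), with w collecting all weights and biases.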