We study non-parametric estimation of the value function of an
infinite-horizon $\gamma$-discounted Markov reward process (MRP) using
observations from a single trajectory. We provide non-asymptotic guarantees for
a general family of kernel-based multi-step temporal difference (TD) estimates,
including canonical $K$-step look-ahead TD for $K=1,2,\dots$ and the
TD$(\lambda)$ family for $\lambda \in [0,1)$ as special cases. Our bounds
capture the dependence of the error on Bellman fluctuations, the mixing time
of the Markov chain, any mis-specification in the model, and the choice of
weight function defining the estimator itself, and they reveal some delicate
interactions between mixing time and model mis-specification. For a given TD
method applied to a well-specified model, the statistical error under
trajectory data is similar to that under i.i.d. sample transition pairs,
whereas under mis-specification, temporal dependence in the data inflates the
statistical error.
However, any such deterioration can be mitigated by increased look-ahead. We
complement our upper bounds by proving minimax lower bounds that establish
optimality of TD-based methods with appropriately chosen look-ahead and
weighting, and reveal some fundamental differences between value function
estimation and ordinary non-parametric regression.
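
As a rough illustration of the multi-step targets mentioned above, the sketch below computes a $K$-step look-ahead target along a single trajectory and combines such targets with geometric TD$(\lambda)$ weights. It is a minimal tabular sketch under stated assumptions, not the kernel-based estimator analyzed in the paper: the function names, the tabular `values` lookup, and the truncation at `K_max` with renormalized weights are all hypothetical choices made for illustration.

```python
def k_step_target(rewards, states, values, t, K, gamma):
    """K-step look-ahead target at time t (illustrative, tabular).

    Sums K discounted rewards along the trajectory, then bootstraps
    with the current value estimate at the state reached after K steps.
    """
    partial_return = sum(gamma ** k * rewards[t + k] for k in range(K))
    return partial_return + gamma ** K * values[states[t + K]]


def td_lambda_weights(lam, K_max):
    """Geometric weights proportional to (1 - lam) * lam**(K - 1), K = 1..K_max.

    Assumption: the infinite TD(lambda) mixture is truncated at K_max and
    the weights are renormalized to sum to one.
    """
    raw = [(1.0 - lam) * lam ** (K - 1) for K in range(1, K_max + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def td_lambda_target(rewards, states, values, t, lam, gamma, K_max):
    """Weighted mixture of K-step targets using truncated TD(lambda) weights."""
    weights = td_lambda_weights(lam, K_max)
    targets = [
        k_step_target(rewards, states, values, t, K, gamma)
        for K in range(1, K_max + 1)
    ]
    return sum(w * g for w, g in zip(weights, targets))
```

With `lam = 0` all the weight falls on the one-step target, while larger `lam` shifts weight toward longer look-ahead, which is the knob the abstract refers to when it speaks of the choice of weight function.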

PREPRINT

# Policy evaluation from a single path: Multi-step methods, mixing and mis-specification

Yaqi Duan and Martin J. Wainwright

Submitted on 7 November 2022

Subjects: Statistics - Machine Learning; Computer Science - Machine Learning; Mathematics - Statistics Theory