Talk Title: Clustered Cross-Validation and Cross-Fitting for Dependent Data
Abstract:
Traditional K-fold cross-validation, a widely used technique in machine learning for model selection and performance evaluation, relies on the assumption of independent and identically distributed (i.i.d.) data. However, in many empirical studies in economics, data often exhibit dependence structures that violate this assumption. Arbitrary partitioning of the dependent data into K folds can lead to undesirable clustering structures, thus compromising the effectiveness of cross-validation in estimating true risk. This paper explores the conditions under which a carefully designed clustering structure can be used to define the folds for cross-validation in the presence of dependent data. Our results focus on β-mixing dependence. When we impose no assumptions on the nonparametric estimator other than the usual rate condition, a buffer zone is shown to be useful in mitigating the effects of dependence and ensuring the validity of cross-validation and cross-fitting procedures. We present a special case of (asymptotically) linear learners, for which we show that a buffer zone is not necessary. Furthermore, we discuss clustering algorithms that can produce desirable clustering structures for dependent data. The paper also explores the potential extension to network data.
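
To make the buffer-zone idea concrete, here is a minimal sketch for the simplest case of observations ordered in time. It is an illustration under stated assumptions, not the procedure from the paper: the function name `blocked_folds_with_buffer`, the use of equal-sized contiguous blocks as folds, and the fixed buffer width are all hypothetical choices made for this example.

```python
import numpy as np

def blocked_folds_with_buffer(n_obs, n_folds, buffer):
    """Yield (train_idx, test_idx) pairs for contiguous folds.

    Hypothetical helper: observations are assumed to be ordered in time,
    and `buffer` is an assumed number of observations to discard on each
    side of the held-out block.
    """
    indices = np.arange(n_obs)
    # Contiguous blocks keep neighbouring observations in the same fold.
    blocks = np.array_split(indices, n_folds)
    for test_idx in blocks:
        lo = test_idx[0] - buffer
        hi = test_idx[-1] + buffer
        # Training data excludes the test block and its buffer zone.
        keep = (indices < lo) | (indices > hi)
        yield indices[keep], test_idx

folds = list(blocked_folds_with_buffer(n_obs=20, n_folds=4, buffer=2))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train size:", train_idx.size)
```

Each training set is separated from its held-out block by more than `buffer` time steps, so the fitted model and the evaluation points are only weakly dependent when the dependence decays with distance. The cost is fewer training observations per fold, which is why a case where the buffer can be dropped is of practical interest.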