Data Thinning to Avoid Double Dipping by Anna Neufeld, Fred Hutchinson Cancer Research Center, Thursday, November 2, 1:00 – 1:50pm, North Science Building 017, Wachenheim
We refer to the practice of using the same data to fit and validate a model as double dipping. Problems arise when standard statistical procedures are applied in settings that involve double dipping. One way to circumvent these problems is to fit a model on one dataset and then validate it on another, independent dataset. When we only have access to one dataset, we typically accomplish this via sample splitting. Unfortunately, in many unsupervised problems, sample splitting does not allow us to avoid double dipping. In this talk, we are motivated by unsupervised problems that arise in the analysis of single-cell RNA sequencing data. We first propose Poisson count splitting, which splits a single observation drawn from a Poisson distribution into two independent components. We show that Poisson count splitting allows us to avoid double dipping in unsupervised settings. We next generalize the count splitting framework to a variety of distributions, and refer to the generalized framework as data thinning. Data thinning is a general alternative to sample splitting that is useful far beyond the context of single-cell RNA sequencing data and, unlike sample splitting, can be applied in both supervised and unsupervised settings.
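To make the splitting step concrete, here is a minimal sketch of how a Poisson count might be divided into two independent parts. It relies on the standard binomial thinning property of the Poisson distribution: if X ~ Poisson(λ) and X_train | X ~ Binomial(X, ε), then X_train ~ Poisson(ελ) and X_test = X − X_train ~ Poisson((1 − ε)λ), and the two parts are independent. The function name `poisson_count_split`, the parameter `eps`, and the simulated data are illustrative assumptions, not code from the talk.

```python
import numpy as np

def poisson_count_split(counts, eps=0.5, seed=None):
    """Split nonnegative integer counts into two parts via binomial thinning.

    Hypothetical helper for illustration. If the counts are Poisson, the two
    returned arrays are independent Poisson counts with means scaled by
    eps and (1 - eps) respectively.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    # Each unit of count is assigned to the first part with probability eps.
    train = rng.binomial(counts, eps)
    # The remainder forms the second part, so the parts sum to the original.
    test = counts - train
    return train, test

# Simulated stand-in for a cells-by-genes count matrix (assumed Poisson, mean 5).
rng = np.random.default_rng(0)
X = rng.poisson(lam=5.0, size=(2000, 10))

X_train, X_test = poisson_count_split(X, eps=0.5, seed=1)

# Empirical check: the two parts should be nearly uncorrelated.
corr = np.corrcoef(X_train.ravel(), X_test.ravel())[0, 1]
print(f"mean(train)={X_train.mean():.2f}, mean(test)={X_test.mean():.2f}, corr={corr:.3f}")
```

In a workflow of this kind, one part would be used to fit the model (for example, to estimate clusters) and the other to validate it, so that no observation has to be held out entirely.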