Distribution Regression and its Application to Ecological Inference

Author: LIZETTE LEMUS GONZALEZ
Language: English
Publication year: 2019
Subject:
Source: Centro de Investigación en Matemáticas
CIMAT
Repositorio Institucional CIMAT
Description: Distribution Regression is the task of learning a relation between probability distributions and real-valued response variables. Frequently, each distribution is observed only through a sample. There is a wide variety of applications in which objects are represented as collections of their components. For example, an image can be represented as a set of local descriptors, a 3D object as a set of coordinates, and a text as a set of words. In this thesis we focus on a novel application of Distribution Regression, proposed by Flaxman et al. (Who supported Obama? Ecological Inference through Distribution Regression, KDD 2015): predicting voting behavior for demographic subgroups when only group-level data are available. This problem is known as Ecological Inference. Ecological Inference has been a subject of controversy ever since the problem was first recognized, and few sources give a clear account of the methods used to solve it. We provide a historical review of the solutions that have been proposed, starting with the classic solutions and including the most recent advances. To solve the Distribution Regression problem, we use a framework introduced by Szabo et al. (Two-stage sampled learning theory on distributions, AISTATS 2015), which combines kernel mean embeddings of distributions under the Gaussian kernel with ridge regression. The objectives of this thesis are to understand the similarity measure based on the Gaussian kernel, to propose alternative similarity measures that improve the prediction results, and to compare these methods experimentally. We propose three similarity functions: the pyramid match kernel, the marginal kernel and the Wasserstein kernel. We also present an alternative to kernel methods based on neural networks. We generate two synthetic datasets to compare these methods in terms of prediction quality, computational time and parameter selection. Finally, we select two methods to perform Ecological Inference for the 2016 US presidential election, and we show that Distribution Regression is a suitable approach to the Ecological Inference problem.
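The two-stage framework mentioned in the description can be illustrated with a minimal sketch. It is not the thesis code: it assumes the simplest variant, in which the similarity between two samples ("bags") is the inner product of their empirical Gaussian kernel mean embeddings, followed by kernel ridge regression on the resulting Gram matrix. The function names, the bandwidth, the regularization value `lam` and the toy data (bags drawn from normal distributions whose mean is the response) are all assumptions made for illustration.

```python
import numpy as np


def gaussian_gram(X, Y, bandwidth):
    """Gaussian kernel values between every row of X (n x d) and Y (m x d)."""
    sq_dist = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def embedding_kernel(bags_a, bags_b, bandwidth):
    """Inner products of empirical mean embeddings.

    Entry (i, j) is the average Gaussian kernel value over all pairs of
    points taken from bag i of bags_a and bag j of bags_b.
    """
    K = np.empty((len(bags_a), len(bags_b)))
    for i, A in enumerate(bags_a):
        for j, B in enumerate(bags_b):
            K[i, j] = gaussian_gram(A, B, bandwidth).mean()
    return K


def fit_ridge(K_train, y, lam):
    """Kernel ridge regression coefficients: solve (K + n * lam * I) alpha = y."""
    n = K_train.shape[0]
    return np.linalg.solve(K_train + n * lam * np.eye(n), y)


def predict(K_test_train, alpha):
    """Predicted responses for test bags from their kernel values to training bags."""
    return K_test_train @ alpha


# Toy data (assumption): each bag is a sample from N(m, 1); the response is m.
rng = np.random.default_rng(0)
means = rng.uniform(-2.0, 2.0, size=40)
bags = [rng.normal(m, 1.0, size=(50, 1)) for m in means]
train_bags, test_bags = bags[:30], bags[30:]
y_train = means[:30]

K = embedding_kernel(train_bags, train_bags, bandwidth=1.0)
alpha = fit_ridge(K, y_train, lam=1e-3)
pred = predict(embedding_kernel(test_bags, train_bags, bandwidth=1.0), alpha)
```

In practice the bandwidth and `lam` would be chosen by cross-validation, and the linear kernel on embeddings used here could be replaced by another similarity between bags, which is where alternative kernels such as those named in the description would enter.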
Database: OpenAIRE