We consider problems where an analyst aims to learn from strategically produced data. This arises in many applications, such as spam filtering, where the analyst wants to learn which e-mails are spam while spammers seek to avoid being recognized as such (here the input data is the e-mail itself), or the prediction of product quantities to allocate to franchised stores, where stores manipulate their data to obtain a larger quantity of popular products (here the input data is the store's sales history).

### Abstract (long)

In this thesis, we consider the problem of learning when data are strategically produced. This setting challenges the widely used assumption in machine learning that test data are independent of the learning process, an assumption that has been shown to fail in many applications where the outcome of the learning problem is of strategic interest to some agents. We study the two ubiquitous problems of classification and linear regression, focusing on how fundamental learning properties change relative to the classical setting where data are not strategically produced.

We first consider the problem of finding optimal classifiers in an adversarial setting where the class-1 data is generated by an attacker whose objective is not known to the defender---an aspect that is key to realistic applications but has so far been overlooked in the literature.

To model this situation, we propose a Bayesian game framework where the defender chooses a classifier with no a priori restriction on the set of possible classifiers. The key difficulty in the proposed framework is that the set of possible classifiers is exponential in the size of the set of possible data, which is itself exponential in the number of features used for classification. To overcome this, we first show that Bayesian Nash equilibria can be characterized completely via functional threshold classifiers with a small number of parameters. We then show that this low-dimensional characterization enables us to develop a training method that computes provably approximately optimal classifiers in a scalable manner, and a learning algorithm for the online setting with low regret (both independent of the dimension of the set of possible data).
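To fix ideas, the threshold structure mentioned above can be illustrated with a minimal sketch. All names here are hypothetical: we assume only that each data point is summarized by a scalar score and that the classifier flags points whose score exceeds a threshold `tau`, possibly randomizing with probability `sigma` exactly at the threshold (a common device for equilibrium classifiers); the thesis's actual functional threshold classifiers are richer than this.

```python
import numpy as np

def threshold_classifier(score, tau, sigma=0.5, rng=None):
    """Illustrative threshold classifier (hypothetical parameters):
    return 1 (flag as attacker-generated) when score > tau,
    randomize with probability sigma when score == tau, else return 0."""
    if score > tau:
        return 1
    if score == tau:
        rng = rng or np.random.default_rng(0)
        return int(rng.random() < sigma)
    return 0
```

The point of the characterization is that such a classifier is described by a handful of parameters (here `tau` and `sigma`), rather than by an assignment over the exponentially large data space.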

We then consider the problem of linear regression from strategic data sources. In the classical setting, where the precision of each data point is fixed, the famous Aitken/Gauss-Markov theorem in statistics states that generalized least squares (GLS) is a so-called ``Best Linear Unbiased Estimator'' (BLUE) and is consistent (the model is learned exactly as the amount of data grows). In modern data science, however, one often faces strategic data sources, namely individuals who incur a cost for providing high-precision data. We model this as learning from strategic data sources with a public-good component: data is provided by strategic agents who seek to minimize an individual provision cost for increasing their data's precision while benefiting from the model's overall precision. Our model tackles the case where there is uncertainty on the attributes characterizing the agents' data, a critical aspect of the problem when the number of agents is large.

We show that, in general, Aitken's theorem does not hold under strategic data sources, although it does hold if individuals have identical provision costs (up to a multiplicative factor). When individuals have non-identical costs, we derive a bound on the improvement of the equilibrium estimation cost that can be achieved by deviating from GLS, under mild assumptions on the provision cost functions and on the possible deviations from GLS. We also provide a characterization of the game's equilibrium, which reveals an interesting connection with optimal design.

Subsequently, we focus on the asymptotic behavior of the covariance of the linear regression parameters estimated via GLS as the number of data sources grows large. We provide upper and lower bounds on this covariance matrix and show that, when the agents' provision costs are superlinear, the model's covariance converges to zero, but at a slower rate than in virtually all learning problems with exogenous data. If the agents' provision costs are linear, on the other hand, this covariance fails to converge at all. Thus, even the basic property of consistency of GLS estimators is compromised when the data sources are strategic.
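For reference, the GLS estimator that the consistency results above concern can be sketched as follows. This is a standard textbook computation, not code from the thesis; the data here are exogenous (non-strategic) with a diagonal noise covariance, which is exactly the classical regime where consistency holds.

```python
import numpy as np

def gls(X, y, variances):
    """Generalized least squares with diagonal noise covariance:
    beta_hat = (X^T W X)^{-1} X^T W y, with W = diag(1 / variances)."""
    W = np.diag(1.0 / variances)
    A = X.T @ W @ X
    return np.linalg.solve(A, X.T @ W @ y)

# Exogenous data: with enough points, beta_hat approaches beta_true.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0])
n = 500
X = rng.normal(size=(n, 2))
variances = rng.uniform(0.5, 2.0, size=n)       # known per-point noise variances
y = X @ beta_true + rng.normal(scale=np.sqrt(variances))
beta_hat = gls(X, y, variances)
```

Under strategic provision, the variances themselves become equilibrium choices of the agents, which is what degrades (superlinear costs) or destroys (linear costs) the convergence of the estimator's covariance.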

### Date and place

Seminar room 1 and

### Organized by

**Benjamin ROUSSILLON**

### Jury members

**Mathias HUMBERT**

*Researcher, Cyber Defence Campus, Switzerland, Reviewer*

**Rachid EL AZOUZI**

*Full Professor, LIA, Université d’Avignon, France, Reviewer*

**Marie-Christine ROUSSET**

*Full Professor, LIG, Université Grenoble-Alpes, France, Examiner*

**Johanne COHEN**

*Research Director, LRI, CNRS, France, Examiner*

**Yann CHEVALEYRE**

*Full Professor, LAMSADE, Université Paris Dauphine-PSL, France, Examiner*

**Panayotis MERTIKOPOULOS**

*Research Scientist, LIG, CNRS, France, Thesis co-advisor*

**Patrick LOISEAU**

*Research Scientist, LIG, CNRS, France, Thesis advisor*