Trade-offs Between Query Difficulty and Sample Complexity in Crowdsourced Data Acquisition

Autor: Hye Won Chung, Doyeon Kim, Alfred O. Hero, Ji Oon Lee
Rok vydání: 2018
Předmět:
Zdroj: Allerton
DOI: 10.1109/allerton.2018.8636012
Popis: Consider a crowdsourcing system whose task is to classify $k$ objects in a database into two groups depending on the binary attributes of the objects. Here we propose a parity response model: the worker is asked to check whether the number of objects having a given attribute in the chosen subset is even or odd. A worker either responds with a correct binary answer or declines to respond. We propose a method for designing the sequence of subsets of objects to be queried so that the attributes of the objects can be identified with high probability using few (${n}$) answers. The method is based on an analogy to the design of Fountain codes for erasure channels. We define the query difficulty $\overline {d}$ as the average size of the query subsets and we define the sample complexity $n$ as the minimum number of collected answers required to attain a given recovery accuracy. We obtain fundamental tradeoffs between recovery accuracy, query difficulty, and sample complexity. In particular, the necessary and sufficient sample complexity required for recovering all $k$ attributes with high probability is $n = c_{0}\max\{k, (k\,\log\, k)/\overline {d}\}$ and the sample complexity for recovering a fixed proportion $(1-\delta )k$ of the attributes for $\delta =o(1)$ is $n=c_{1} \max \{k, (\mathrm {k}\log (1/\delta ))/\overline {d}\}$, where $c_{0},\, c_{1} >0.$
Databáze: OpenAIRE