Trade-offs Between Query Difficulty and Sample Complexity in Crowdsourced Data Acquisition
Autor: | Hye Won Chung, Doyeon Kim, Alfred O. Hero, Ji Oon Lee |
---|---|
Rok vydání: | 2018 |
Předmět: |
021103 operations research
Sample complexity Trade offs 0211 other engineering and technologies Binary number 020206 networking & telecommunications 02 engineering and technology Combinatorics Data acquisition Average size Fountain code 0202 electrical engineering electronic engineering information engineering Erasure Parity (mathematics) Mathematics |
Zdroj: | Allerton |
DOI: | 10.1109/allerton.2018.8636012 |
Popis: | Consider a crowdsourcing system whose task is to classify $k$ objects in a database into two groups depending on the binary attributes of the objects. Here we propose a parity response model: the worker is asked to check whether the number of objects having a given attribute in the chosen subset is even or odd. A worker either responds with a correct binary answer or declines to respond. We propose a method for designing the sequence of subsets of objects to be queried so that the attributes of the objects can be identified with high probability using few (${n}$) answers. The method is based on an analogy to the design of Fountain codes for erasure channels. We define the query difficulty $\overline {d}$ as the average size of the query subsets and we define the sample complexity $n$ as the minimum number of collected answers required to attain a given recovery accuracy. We obtain fundamental tradeoffs between recovery accuracy, query difficulty, and sample complexity. In particular, the necessary and sufficient sample complexity required for recovering all $k$ attributes with high probability is $n = c_{0}\max\{k, (k\,\log\, k)/\overline {d}\}$ and the sample complexity for recovering a fixed proportion $(1-\delta )k$ of the attributes for $\delta =o(1)$ is $n=c_{1} \max \{k, (\mathrm {k}\log (1/\delta ))/\overline {d}\}$, where $c_{0},\, c_{1} >0.$ |
Databáze: | OpenAIRE |
Externí odkaz: |