Recovering Labels from Crowdsourced Data

An Optimal and Polynomial Time Method

Introduction

Workers are given binary tasks, to each of which they must give a response: YES or NO.

Examples

  • Image Classification: “Does this image contain a dog?”
  • Text Moderation: “Is this comment toxic or offensive?”
  • Sentiment Analysis: “Does this review express positive sentiment?”

Questions

Three questions:

  • recover the true labels?
  • rank the workers?
  • estimate their abilities?

Main Question: How can we accurately recover the labels?

This Talk

  1. Introducing the non-parametric isotonic model and the main result
  2. Presenting two existing algorithms
  3. The Iterative Spectral Voting (ISV) algorithm and key insights

The Isotonic Model

Illustration

  • Binary tasks with labels in \(\{\color{green}{-1},\color{blue}{+1}\}\)
  • Unknown vector \(x^*\) of labels with \(d=9\) tasks \(x^*= (\color{blue}{+1},\color{blue}{+1},\color{green}{-1},\color{blue}{+1},\color{green}{-1},\color{blue}{+1},\color{blue}{+1},\color{blue}{+1},\color{blue}{+1})\)

\[ Y=\left(\begin{array}{ccccccccc} \color{green}{-1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} & \color{green}{-1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} \\ \color{blue}{+1} & \color{blue}{+1} & \color{blue}{+1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} \\ \color{blue}{+1} & \color{green}{-1} & \color{green}{-1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} \\ \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} & \color{blue}{+1} & \color{blue}{+1} \end{array}\right) \]

\[ Y=\left(\begin{array}{ccccccccc} \color{red}{0} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{red}{0} & \color{green}{-1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} \\ \color{red}{0} & \color{blue}{+1} & \color{red}{0} & \color{blue}{+1} & \color{green}{-1} & \color{red}{0} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} \\ \color{blue}{+1} & \color{green}{-1} & \color{green}{-1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{red}{0} & \color{blue}{+1} & \color{blue}{+1} \\ \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} & \color{red}{0} & \color{green}{-1} & \color{blue}{+1} & \color{red}{0} & \color{red}{0} & \color{red}{0} \end{array}\right) \]

  • Rate of observations: \(\color{red}{\lambda= 0.72}\)

Observation Model

We observe

\[Y = B\odot(M\mathrm{diag}(x^*) + E) \in \mathbb R^{n \times d}\]

where \(\odot\) is the Hadamard product.

  • \(x^*\in \{-1, 1\}^d\) is the vector of unknown labels
  • \(E\) is a matrix of independent sub-Gaussian noise
  • \(M \in [0,1]^{n \times d}\) is the unknown ability matrix
  • \(B\) is a Bernoulli \(\mathcal B(\lambda)\) “mask” matrix, modelling partial observations
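As a toy illustration, the observation model can be simulated directly. This is a minimal sketch with arbitrary parameter choices (noise scale, ability range) that are not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 4, 9, 0.7                    # workers, tasks, observation rate (toy values)

x_star = rng.choice([-1, 1], size=d)     # unknown label vector
M = rng.uniform(0.3, 1.0, size=(n, d))   # ability matrix in [0, 1]
E = 0.1 * rng.standard_normal((n, d))    # sub-Gaussian (here Gaussian) noise
B = rng.binomial(1, lam, size=(n, d))    # Bernoulli(lambda) mask

Y = B * (M @ np.diag(x_star) + E)        # observed response matrix
```

Unobserved entries of \(Y\) are exactly the zeros of the mask \(B\).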

Shape Constraints

  • \(M \in [0,1]^{n \times d}\) represents the ability matrix of worker/task pairs
  • We assume that \(M\) is isotonic up to a permutation \(\pi^*\) of its rows, that is

\[ M_{\pi^*}=\left(\begin{array}{cccc} \color{#000099}{0.9} & \color{#000099}{0.8} & \color{#000099}{0.9} & \color{#000099}{1\phantom{.0}} \\ \color{#4B0082}{0.8} & \color{#4B0082}{0.7} & \color{#4B0082}{0.9} & \color{#4B0082}{0.8} \\ \color{#990066}{0.6} & \color{#990066}{0.7} & \color{#990066}{0.7} & \color{#990066}{0.6} \\ \color{#CC0000}{0.5} & \color{#CC0000}{0.7} & \color{#CC0000}{0.5} & \color{#CC0000}{0.6} \end{array}\right) \]

\[ M_{\phantom{{\pi^*}}}=\left(\begin{array}{cccc} \color{#990066}{0.6} & \color{#990066}{0.7} & \color{#990066}{0.7} & \color{#990066}{0.6} \\ \color{#CC0000}{0.5} & \color{#CC0000}{0.7} & \color{#CC0000}{0.5} & \color{#CC0000}{0.6} \\ \color{#000099}{0.9} & \color{#000099}{0.8} & \color{#000099}{0.9} & \color{#000099}{1\phantom{.0}} \\ \color{#4B0082}{0.8} & \color{#4B0082}{0.7} & \color{#4B0082}{0.9} & \color{#4B0082}{0.8} \end{array}\right) \]
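A quick sketch of what the shape constraint means on the example above, reading "isotonic" as: after the row permutation, every column is nonincreasing (better workers sit in the upper rows). The helper name `is_isotonic` is ours:

```python
import numpy as np

def is_isotonic(M):
    """True if every column of M is nonincreasing down the rows."""
    return bool(np.all(np.diff(M, axis=0) <= 0))

M_pi = np.array([[0.9, 0.8, 0.9, 1.0],   # M_{pi*}: rows sorted by ability
                 [0.8, 0.7, 0.9, 0.8],
                 [0.6, 0.7, 0.7, 0.6],
                 [0.5, 0.7, 0.5, 0.6]])
M = M_pi[[2, 3, 0, 1]]                   # M: same rows, shuffled

is_isotonic(M_pi), is_isotonic(M)        # (True, False)
```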

Square Norm Loss vs Hamming Loss

Hamming Loss: \[ \sum_{k=1}^d \mathbf 1\{\hat x_k \neq x_k^*\}\]

Square Norm Loss: \[ \|M \mathrm{diag}(x^* \neq \hat x)\|_F^2\]

  • If workers are bad (\(M \approx 0\)), the square norm loss is small but the Hamming loss can be large (\(\approx d\))
  • The square norm loss evaluates the quality of \(\hat x\) rather than the performance of the workers
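A numerical version of these bullet points: with nearly useless workers, two labelling mistakes cost a full 2 in Hamming loss but almost nothing in square norm loss. The numbers below are purely illustrative:

```python
import numpy as np

x_star = np.array([+1, +1, -1, +1])
x_hat  = np.array([+1, -1, -1, -1])         # two mistakes
M = np.full((3, 4), 0.05)                   # bad workers: M ~ 0

hamming = int(np.sum(x_hat != x_star))
mistakes = (x_hat != x_star).astype(float)  # plays the role of diag(x* != x_hat)
square_norm = np.linalg.norm(M @ np.diag(mistakes), "fro") ** 2

hamming, round(square_norm, 4)              # (2, 0.015)
```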

Crowdsourcing Problems

Setting

\[ Y = B\odot(M \mathrm{diag}(x^*) + E)\] Shape constraint: \(\exists \pi^*\) s.t. \(M_{\pi^*} \in [0,1]^{n \times d}\) is isotonic

Objectives

Recovering the labels (\(x^*\))

Ranking the workers (\(\pi^*\))

Estimating their abilities (\(M\))

Losses

\(\|M \mathrm{diag}(x^* \neq \hat x)\|_F^2\)

\(\|M_{\pi^*} - M_{\hat \pi}\|_F^2\)

\(\|M - \hat M\|_F^2\)

Crowdsourcing Problems

Setting

\[ Y = B\odot(M \mathrm{diag}(x^*) + E)\] Shape constraint: \(\exists \pi^*\) s.t. \(M_{\pi^*} \in [0,1]^{n \times d}\) is isotonic

Objectives

Recovering the labels (\(x^*\))

Ranking the workers (\(\pi^*\))

Estimating their abilities (\(M\))

Risks

\(\color{purple}{\mathbb E}[\|M \mathrm{diag}(x^* \neq \hat x)\|_F^2]\)

\(\color{purple}{\mathbb E}[\|M_{\pi^*} - M_{\hat \pi}\|_F^2]\)

\(\color{purple}{\mathbb E}[\|M - \hat M\|_F^2]\)

Minimax Risk for Recovering Labels

\[\mathcal R^*_{\mathrm{reco}}(n,d,\lambda)=\min_{\hat x}\max_{M,\pi^*, x^*}{\mathbb E}[\|M \mathrm{diag}(x^* \neq \hat x)\|_F^2]\]

  • minimize over all estimators \(\hat x\)
  • maximize the square norm loss over all \(M\), \(\pi^*\), \(x^*\) such that \(M_{\pi^*}\) is isotonic

Main Results

Theorem

If \(\tfrac{1}{\lambda} \leq n \leq d\), there exists a polynomial time method \(\hat x\) achieving \(\mathcal R^*_{\mathrm{reco}}\) up to polylog factors, i.e. \[ \max_{M,\pi^*, x^*}{\mathbb E}[\|M \mathrm{diag}(x^* \neq \hat x)\|_F^2] \lesssim \mathcal R^*_{\mathrm{reco}}(n,d,\lambda) \enspace , \] where the maximum ranges over all \(M\in[0,1]^{n\times d}\), \(\pi^*\) and \(x^* \in \{-1,1\}^d\) such that \(M_{\pi^*}\) is isotonic.

Moreover, up to polylogs,

\[ \mathcal R^*_{\mathrm{reco}}(n,d,\lambda) \asymp \frac{d}{\lambda} \enspace . \]

Already Existing Methods

Majority Voting

\(\hat x^{(maj)}_k = \mathrm{sign} \left( \sum_{i=1}^n Y_{ik} \right)\)

Maximum risk of \(\hat x^{(maj)}\):

\[\max_{M,\pi^*, x^*}\mathbb E\|M\mathrm{diag}(\hat x^{(maj)} \neq x^*)\|_F^2 \asymp \tfrac{d \sqrt{n}}{\lambda}\]
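Majority voting is one line; the sketch below treats unobserved entries as 0, so they simply drop out of the column sums:

```python
import numpy as np

def majority_vote(Y):
    # sign of each column sum; a tie (sum == 0) yields 0 (abstain)
    return np.sign(Y.sum(axis=0))

Y = np.array([[-1, +1, -1],
              [+1, +1,  0],   # 0 = unobserved entry
              [+1, -1, -1]])
majority_vote(Y)              # array([ 1.,  1., -1.])
```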

OBI-WAN (Shah et al., 2020)

Idea: The population term \((M\mathrm{diag}(x^*))(M\mathrm{diag}(x^*))^T= MM^T\) does not depend on \(x^*\)

PCA Step:

\[\hat v = \underset{\|v\|=1}{\mathrm{argmax}}\|v^T Y\|^2\]

Result on Obi-Wan Method

Theorem 1 (Shah et al., 2020)

In the submodel where \(\mathrm{rk}(M)= 1\), \(\hat x^{(\mathrm{Obi-Wan})}\) achieves the minimax risk up to polylogs: \[\mathbb E\|M\mathrm{diag}(\hat x^{\text{Obi-Wan}} \neq x^*)\|_F^2 \lesssim \frac{d}{\lambda} \quad \textbf{(minimax)}\]

Theorem 2 (Shah et al., 2020)

In the model where \(M_{\pi^*\eta^*}\) is bi-isotonic, up to polylogs: \[\mathbb E\|M\mathrm{diag}(\hat x^{\text{Obi-Wan}} \neq x^*)\|_F^2 \lesssim \frac{d\sqrt{n}}{\lambda} \quad \textbf{(not minimax)}\]

Iterative Spectral Voting

PCA Step

\[\hat v = \underset{\|v\|=1}{\mathrm{argmax}}\|v^T Y\|^2\]

Main idea: Since \(M\) is isotonic, \(M\) is close to low rank

\[\|MM^T\|_{\mathrm{op}} \gtrsim \|M\|_F^2\]

Voting Step

Define the weighted vote vector (on second sample)

\[\hat w = \hat v^T Y\]

Define the estimated label as

\[\hat x_k^{(1)} = \mathrm{sign}(\hat w_k)\mathbf{1}\bigg\{|\hat w_k| \gg \sqrt{\sum_{i=1}^n \hat v_i B_{ik}}\bigg\}\]

Iterate these two Steps

Keep certain labels: if \(\hat x^{(t)}_k\neq 0\), set \(\hat x^{(t+1)}_k= \hat x^{(t)}_k\).

Restrict \(Y\) to the columns \(k\) with uncertain labels, i.e. \(\hat x^{(t)}_k=0\)

Repeat PCA Step + Voting Step on \(Y \mathrm{diag}(\hat x^{(t)} = 0)\) a polylogarithmic number of times.

Output last estimator \(\hat x^{(T)}\)
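The whole loop fits in a few lines. This is a schematic sketch only: it skips the sample splitting between the PCA and voting steps, uses \(|\hat v_i|\) inside the threshold, fixes an arbitrary constant \(c\) in place of the "\(\gg\)", and ignores the global sign ambiguity of the top singular vector, which a real implementation must resolve (e.g. by comparing with a majority vote):

```python
import numpy as np

def pca_voting_step(Y, B, c=2.0):
    """One PCA + voting step; c is a placeholder threshold constant."""
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    v = U[:, 0]                          # argmax_{||v||=1} ||v^T Y||^2
    w = v @ Y                            # weighted vote for each task
    scale = np.sqrt(np.abs(v) @ B)       # rough noise level of each vote
    return np.sign(w) * (np.abs(w) > c * scale)   # values in {-1, 0, +1}

def isv(Y, B, T=10):
    x_hat = np.zeros(Y.shape[1])
    for _ in range(T):
        undecided = x_hat == 0           # labels already decided are kept
        if not undecided.any():
            break
        # rerun both steps on the columns whose label is still uncertain
        x_hat[undecided] = pca_voting_step(Y[:, undecided], B[:, undecided])
    return x_hat
```

On easy instances (e.g. noiseless rank-one \(M\), full observations) a single pass already decides every label, up to the global sign.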


Ranking and Estimating Abilities

  1. Use ISV to recover some labels \(\hat x\)
  2. Restrict to columns \(k\) such that \(\hat x_k \in \{-1, 1\}\)
  3. Use methods from Pilliat et al. (2024) to estimate \(\pi^*\) and \(M\)

Conclusion

  1. Three connected problems: recovering labels, ranking workers and estimating their abilities
  2. The non-parametric isotonic model is very flexible for crowdsourcing problems
  3. No computational-statistical gap for recovering labels, ranking workers, or estimating abilities