Recovering Labels from Crowdsourced Data

An Optimal and Polynomial-Time Method

Crowdsourcing Problem with Binary Tasks

  • Image Classification: “Does this image contain a dog?”
  • Text Moderation: “Is this comment toxic or offensive?”
  • Sentiment Analysis: “Does this review express positive sentiment?”
  • Data Verification: “Is this information factually correct?”

Main Question

Given \(n\) workers and \(d\) binary tasks

Only a proportion \(\lambda\) of the worker/task pairs is observed

How can we accurately recover the labels?

Illustration

  • Binary tasks, each with a label in \(\{\color{green}{-1},\color{blue}{+1}\}\)
  • Unknown vector \(x^*\) of labels with \(d=9\) tasks \(x^*= (\color{blue}{+1},\color{blue}{+1},\color{green}{-1},\color{blue}{+1},\color{green}{-1},\color{blue}{+1},\color{blue}{+1},\color{blue}{+1},\color{blue}{+1})\)

\[ Y=\left(\begin{array}{ccccccccc} \color{green}{-1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} & \color{green}{-1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} \\ \color{blue}{+1} & \color{blue}{+1} & \color{blue}{+1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} \\ \color{blue}{+1} & \color{green}{-1} & \color{green}{-1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} \\ \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} & \color{blue}{+1} & \color{blue}{+1} \end{array}\right) \]

\[ Y=\left(\begin{array}{ccccccccc} \color{red}{0} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{red}{0} & \color{green}{-1} & \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} \\ \color{red}{0} & \color{blue}{+1} & \color{red}{0} & \color{blue}{+1} & \color{green}{-1} & \color{red}{0} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} \\ \color{blue}{+1} & \color{green}{-1} & \color{green}{-1} & \color{blue}{+1} & \color{green}{-1} & \color{blue}{+1} & \color{red}{0} & \color{blue}{+1} & \color{blue}{+1} \\ \color{green}{-1} & \color{blue}{+1} & \color{blue}{+1} & \color{red}{0} & \color{green}{-1} & \color{blue}{+1} & \color{red}{0} & \color{red}{0} & \color{red}{0} \end{array}\right) \]

  • Rate of observations: \(\color{red}{\lambda= 0.72}\)

Observation Model

Given workers \(i\in\{1, \dots, n\}\) and tasks \(k\in\{1, \dots, d\}\)

We observe

\[ Y_{ik} = B_{ik}(M_{ik}x_k^* + E_{ik}) \enspace \]

  • Where:

    1. \(M_{ik} \in [0,1]\)
    2. \(E_{ik}\) are independent and \(1\)-subGaussian
    3. \(B_{ik}\) are iid Bernoulli with parameter \(\lambda \in [0,1]\)

Observation Model

We observe

\[Y = B\odot(M\mathrm{diag}(x^*) + E) \in \mathbb R^{n \times d}\]

where \(\odot\) is the Hadamard product.

  • \(M \in [0,1]^{n \times d}\) is the ability matrix
  • \(x^*\) is the vector of unknown labels
  • \(E\) is a matrix of independent noise
  • \(B\) is a Bernoulli “mask” matrix, modelling partial observations
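
As a concrete illustration, here is a minimal sketch (Python/NumPy, not from the paper) drawing one dataset from this model, assuming Gaussian noise for \(E\), which is indeed \(1\)-subGaussian:

```python
import numpy as np

def sample_observations(M, x_star, lam, seed=None):
    """Draw Y = B ⊙ (M diag(x*) + E) with standard Gaussian (1-subGaussian) noise."""
    rng = np.random.default_rng(seed)
    n, d = M.shape
    E = rng.standard_normal((n, d))            # independent noise
    B = rng.binomial(1, lam, size=(n, d))      # iid Bernoulli(lambda) mask
    return B * (M * x_star[None, :] + E)       # M diag(x*): column k is multiplied by x*_k

# Small example: 4 workers, 9 tasks, 70% of the pairs observed
rng = np.random.default_rng(0)
M = rng.uniform(0.5, 1.0, size=(4, 9))
x_star = rng.choice([-1, 1], size=9)
Y = sample_observations(M, x_star, lam=0.7, seed=1)
```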

Bernoulli Observation Submodel

\[Y = B\odot(M\mathrm{diag}(x^*) + E) \in \mathbb R^{n \times d}\]

Bernoulli submodel (Shah et al., 2020)

\[ Y_{ik} = \begin{cases} x^*_k & \text{with probability } \lambda\left(\frac{1+M_{ik}}{2}\right) \\ -x^*_k & \text{with probability } \lambda\left(\frac{1-M_{ik}}{2}\right)\\ 0 & \text{with probability } 1-\lambda \end{cases} \]

\(\frac{1+M_{ik}}{2}\) is the probability that worker \(i\) answers task \(k\) correctly.

\(\lambda\) is the probability of observing the worker/task pair \((i,k)\)
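
In code, the Bernoulli submodel can be sampled directly from the three cases above (a minimal sketch; the helper name is ours):

```python
import numpy as np

def sample_bernoulli_submodel(M, x_star, lam, seed=None):
    """Y_ik = x*_k w.p. lam(1+M_ik)/2, -x*_k w.p. lam(1-M_ik)/2, and 0 w.p. 1-lam."""
    rng = np.random.default_rng(seed)
    n, d = M.shape
    observed = rng.random((n, d)) < lam            # pair (i,k) observed w.p. lambda
    correct = rng.random((n, d)) < (1 + M) / 2     # correct answer w.p. (1+M_ik)/2
    answers = np.where(correct, x_star[None, :], -x_star[None, :])
    return observed * answers                      # unobserved entries are set to 0
```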

Shape Constraints

  • \(M \in [0,1]^{n \times d}\) is the ability matrix: \(M_{ik}\) measures how well worker \(i\) performs on task \(k\)
  • We assume that \(M\) is isotonic up to a permutation \(\pi^*\), that is

\[ M_{\pi^*}=\left(\begin{array}{ccccccccc} \color{#000099}{0.9} & \color{#000099}{0.8} & \color{#000099}{0.9} & \color{#000099}{1\phantom{.0}} \\ \color{#4B0082}{0.8} & \color{#4B0082}{0.7} & \color{#4B0082}{0.9} & \color{#4B0082}{0.8} \\ \color{#990066}{0.6} & \color{#990066}{0.7} & \color{#990066}{0.7} & \color{#990066}{0.6} \\ \color{#CC0000}{0.5} & \color{#CC0000}{0.7} & \color{#CC0000}{0.5} & \color{#CC0000}{0.6} \end{array}\right) \]

\[ M_{\phantom{{\pi^*}}}=\left(\begin{array}{ccccccccc} \color{#990066}{0.6} & \color{#990066}{0.7} & \color{#990066}{0.7} & \color{#990066}{0.6} \\ \color{#CC0000}{0.5} & \color{#CC0000}{0.7} & \color{#CC0000}{0.5} & \color{#CC0000}{0.6} \\ \color{#000099}{0.9} & \color{#000099}{0.8} & \color{#000099}{0.9} & \color{#000099}{1\phantom{.0}} \\ \color{#4B0082}{0.8} & \color{#4B0082}{0.7} & \color{#4B0082}{0.9} & \color{#4B0082}{0.8} \end{array}\right) \]

Shape Constraints

  • For all \(k = 1, \dots, d\), \[ M_{\pi^*(1), k}\geq \dots \geq M_{\pi^*(n),k} \]
  • In words, the same permutation \(\pi^*\) sorts every column: each column of \(M_{\pi^*}\) is non-increasing
  • Worker \(i\) is ranked above worker \(j\) when \(i\) is at least as good as \(j\) on every task (see the sketch below)
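
A small sketch of what this constraint means in code (the helper is hypothetical): if some row permutation makes every column non-increasing, then the rows are totally ordered by entrywise dominance, so sorting the rows by their sums must exhibit such a permutation.

```python
import numpy as np

def isotonic_up_to_permutation(M, tol=1e-12):
    """Return (is_isotonic, candidate_pi): sort rows by row sum and check every column is non-increasing."""
    order = np.argsort(-M.sum(axis=1))                 # candidate pi*: strongest worker first
    ok = bool(np.all(np.diff(M[order], axis=0) <= tol))
    return ok, order

# The example matrix from the slide above
M = np.array([[0.6, 0.7, 0.7, 0.6],
              [0.5, 0.7, 0.5, 0.6],
              [0.9, 0.8, 0.9, 1.0],
              [0.8, 0.7, 0.9, 0.8]])
ok, pi = isotonic_up_to_permutation(M)                 # ok is True, pi = [2, 3, 0, 1]
```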

Illustration when \(x^*\) is Known

Crowdsourcing Problems

Setting

\[ Y = B\odot(M \mathrm{diag}(x^*) + E)\] Shape constraint: \(\exists \pi^*\) s.t. \(M_{\pi^*} \in [0,1]^{n \times d}\) is isotonic

Objectives and losses

  • Recovering the labels: \(\|M \mathrm{diag}(x^* \neq \hat x)\|_F^2\)
  • Ranking the workers: \(\|M_{\pi^*} - M_{\hat \pi}\|_F^2\)
  • Estimating their abilities: \(\|M - \hat M\|_F^2\)

These three objectives are closely intertwined!

Crowdsourcing Problems

Setting

\[ Y = B\odot(M \mathrm{diag}(x^*) + E)\] Shape constraint: \(\exists \pi^*\) s.t. \(M_{\pi^*} \in [0,1]^{n \times d}\) is isotonic

Objectives and risks

  • Recovering the labels: \(\color{purple}{\mathbb E}[\|M \mathrm{diag}(x^* \neq \hat x)\|_F^2]\)
  • Ranking the workers: \(\color{purple}{\mathbb E}[\|M_{\pi^*} - M_{\hat \pi}\|_F^2]\)
  • Estimating their abilities: \(\color{purple}{\mathbb E}[\|M - \hat M\|_F^2]\)

These three objectives are closely intertwined!

Square Norm Loss vs. Hamming Loss

Hamming Loss: \[ \sum_{k=1}^d \mathbf 1\{\hat x_k \neq x_k^*\}\]

Square Norm Loss: \[ \|M \mathrm{diag}(x^* \neq \hat x)\|_F^2\]

  • If all workers are poor, \(M\) is small but the Hamming loss remains large
  • If \(M \approx 0\), the observations are uninformative and the Hamming loss is \(\approx d\) for any estimator
  • The square norm loss weights each misclassified task by the workers' abilities on it, so it evaluates the quality of \(\hat x\) relative to what the workers allow, rather than penalizing intrinsically hard tasks (see the sketch below)
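
A minimal sketch (hypothetical helpers) computing the two losses side by side:

```python
import numpy as np

def hamming_loss(x_hat, x_star):
    """Number of misclassified tasks."""
    return int(np.sum(x_hat != x_star))

def square_norm_loss(M, x_hat, x_star):
    """||M diag(x* != x_hat)||_F^2: each mistake is weighted by the workers' abilities on that task."""
    mistakes = x_hat != x_star
    return float(np.sum(M[:, mistakes] ** 2))
```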

Other Parametric Models

  • DS: \(M_{ik} = q_i\) (Dawid & Skene, 1979), (Shah et al., 2020)
    • The abilities are independent of the tasks
  • BTL: \(M_{ik}=\phi(a_i-b_k)\) (Bradley & Terry, 1952)
    • \(a_i\): ability of worker \(i\)
    • \(b_k\): difficulty of task \(k\)
    • \(\phi\): a fixed link function (the logistic function in the classical Bradley–Terry model); see the sketch below
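
For concreteness, a sketch of the ability matrices these two parametric families allow (we take \(\phi\) to be the logistic function; the helper names are ours):

```python
import numpy as np

def dawid_skene_M(q, d):
    """DS model: M_ik = q_i, each worker has the same ability on every task."""
    return np.repeat(np.asarray(q, dtype=float)[:, None], d, axis=1)

def btl_M(a, b):
    """BTL-style model: M_ik = phi(a_i - b_k) with phi the logistic function."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 / (1.0 + np.exp(-(a[:, None] - b[None, :])))
```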

These parametric models often fail to fit data well

Non-Parametric Models

  • known labels \(x^*\), \(M_{\pi^*\eta^*}\) is bi-isotonic (Mao et al., 2020) (Liu & Moitra, 2020)
  • known labels \(x^*\), \(M_{\pi^*}\) is isotonic (Flammarion et al., 2019) (Pilliat et al., 2024)
  • unknown labels \(x^*\), \(M_{\pi^*\eta^*}\) is bi-isotonic (Shah et al., 2020)
  • unknown labels \(x^*\), \(M_{\pi^*}\) is isotonic (this paper)

Risks Recap


Objectives and risks

  • Recovering the labels: \(\color{purple}{\mathbb E}[\|M \mathrm{diag}(x^* \neq \hat x)\|_F^2]\)
  • Ranking the workers: \(\color{purple}{\mathbb E}[\|M_{\pi^*} - M_{\hat \pi}\|_F^2]\)
  • Estimating their abilities: \(\color{purple}{\mathbb E}[\|M - \hat M\|_F^2]\)

Minimax Risk for Recovering Labels

Max risk for recovering labels

\[\max_{M,\pi^*, x^*}{\mathbb E}[\|M \mathrm{diag}(x^* \neq \hat x)\|_F^2]\]

  • maximize over all \(M\), \(\pi^*\), \(x^*\) such that \(M_{\pi^*}\) is isotonic

Minimax risk for recovering labels

\[\mathcal R^*_{\mathrm{reco}}(n,d,\lambda)=\min_{\hat x}\max_{M,\pi^*, x^*}{\mathbb E}[\|M \mathrm{diag}(x^* \neq \hat x)\|_F^2]\]

  • minimize over all estimators \(\hat x\)

Minimax Risks

Recovering labels

\[\mathcal R^*_{\mathrm{reco}}(n,d,\lambda)=\min_{\hat x}\max_{M,\pi^*, x^*}{\mathbb E}[\|M \mathrm{diag}(x^* \neq \hat x)\|_F^2]\]

Ranking workers

\[\mathcal R^*_{\mathrm{rk}}(n,d,\lambda)=\min_{\hat \pi}\max_{M,\pi^*, x^*}{\mathbb E}[\|M_{\pi^*} - M_{\hat \pi}\|_F^2]\]

Estimating abilities

\[\mathcal R^*_{\mathrm{est}}(n,d,\lambda)=\min_{\hat M}\max_{M,\pi^*, x^*}{\mathbb E}[\|M- \hat M\|_F^2]\]

Short Story

(Shah et al., 2020): recovering \(x^*\) optimally using a least squares method, conjectured to be NP-hard (\(x^*\) unknown, \(M_{\pi^*\eta^*}\) bi-isotonic).

(Mao et al., 2020): estimating the workers' abilities \(M\) optimally with a least squares method (\(x^*\) known, \(M_{\pi^*\eta^*}\) bi-isotonic)

(Liu & Moitra, 2020): ranking \(\pi^*\) and estimating abilities \(M\): improved the state of the art among poly. time methods (\(x^*\) known, \(M_{\pi^*\eta^*}\) bi-isotonic)

(Pilliat et al., 2024): ranking \(\pi^*\) and estimating abilities \(M\): achieves the rates of Liu & Moitra (2020) without the bi-isotonicity assumption (\(x^*\) known, \(M_{\pi^*}\) isotonic)

This paper: recovering \(x^*\), ranking \(\pi^*\) and estimating abilities \(M\) in poly. time when \(n=d\) (\(x^*\) unknown, \(M_{\pi^*}\) isotonic)

Main Results

Optimal poly. time method

If \(\tfrac{1}{\lambda} \leq n \leq d\), there exists a poly. time method \(\hat x\) achieving \(\mathcal R^*_{\mathrm{reco}}\) up to polylog factors, i.e. \[ \mathcal R_{\mathrm{reco}}(n,d,\lambda, \hat x) \lesssim \mathcal R^*_{\mathrm{reco}}(n,d,\lambda) \]

Minimax Risk

If \(\tfrac{1}{\lambda} \leq n \leq d\), up to polylogs, \[ \mathcal R^*_{\mathrm{reco}}(n,d,\lambda) \asymp \frac{d}{\lambda} \]

Methods in the Literature

Majority Vote

\[ \hat x^{(maj)}_k = \mathrm{sign} \left( \sum_{i=1}^n Y_{ik} \right) \enspace .\]

Max risk of majority vote

\[\mathcal R_{\mathrm{reco}}(n, d, \lambda, \hat x^{(maj)}) \asymp \tfrac{d \sqrt{n}}{\lambda}\]

Worst case (\(\lambda=1\)): \(\asymp \sqrt{n}\) workers with \(M_{ik}=1\), the remaining workers with \(M_{ik}=0\)

In this case, the few competent workers are drowned out by the noise of the uninformative ones: \(\hat x^{(maj)}\) is no better than random labelling, each misclassified task costs \(\asymp\sqrt{n}\), and \(\|M\mathrm{diag}(\hat x \neq x^*)\|_F^2 \asymp d\sqrt{n}\)
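
A quick numerical illustration of this worst case (a sketch under the Bernoulli submodel with \(\lambda = 1\); the sizes and the instance with \(\sqrt n\) perfect workers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 1000
k = int(np.sqrt(n))                          # ~sqrt(n) competent workers

M = np.zeros((n, d))
M[:k, :] = 1.0                               # perfect workers; the others answer at random
x_star = rng.choice([-1, 1], size=d)

# Bernoulli submodel with lambda = 1: worker i answers task k correctly w.p. (1 + M_ik)/2
correct = rng.random((n, d)) < (1 + M) / 2
Y = np.where(correct, x_star[None, :], -x_star[None, :])

x_maj = np.sign(Y.sum(axis=0))               # majority vote
mistakes = x_maj != x_star
loss = np.sum(M[:, mistakes] ** 2)           # ||M diag(x* != x_hat)||_F^2
print(loss / (d * np.sqrt(n)))               # a constant fraction: the loss scales like d*sqrt(n)
```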

Least Squares (conjectured NP-hard)

Minimize

\[\|Y- \lambda M' \mathrm{diag}(x)\|_F^2\]

  • over all \(x \in \{-1, 1\}^d\)
  • and \(M'\) isotonic, up to a permutation \(\pi^*\)

The set of isotonic matrices is convex,

but the set of matrices that are isotonic up to an unknown permutation is not.

This least squares estimator is minimax optimal (Shah et al., 2020)

OBI-WAN (Shah et al., 2020)

Idea: the population term \((M\mathrm{diag}(x^*))(M\mathrm{diag}(x^*))^T = MM^T\) does not depend on \(x^*\), since \(\mathrm{diag}(x^*)^2 = I_d\)

PCA Step:

\[\hat v = \underset{\|v\|=1}{\mathrm{argmax}}\|v^T Y\|^2\]

(Shah et al., 2020) sort the entries of \(|\hat v|\) to get a partial ranking of the workers

Aggregation: majority vote over the top \(k\) workers according to \(|\hat v|\) (see the sketch below)
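
A minimal sketch of these two steps (assumptions: \(\hat v\) is computed as the top left singular vector of \(Y\), and `k_top` is a tuning parameter not specified here):

```python
import numpy as np

def obi_wan_labels(Y, k_top):
    """PCA step + aggregation step, sketched from the description above."""
    # PCA step: the maximizer of ||v^T Y||^2 over unit vectors is the top left singular vector of Y
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    v_hat = U[:, 0]
    # Workers with large |v_hat_i| are treated as the most reliable
    top = np.argsort(-np.abs(v_hat))[:k_top]
    # Aggregation: majority vote restricted to these top workers
    return np.sign(Y[top, :].sum(axis=0))
```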

Results on the OBI-WAN Method

Theorem 1 (Shah et al., 2020)

In the submodel where \(\mathrm{rk}(M)= 1\), \(\hat x^{(\mathrm{Obi-Wan})}\) achieves the minimax risk up to polylogs: \[\mathcal R(n,d, \lambda, \hat x^{(\mathrm{Obi-Wan})}) \lesssim \frac{d}{\lambda} \quad \textbf{(minimax)}\]

Theorem 2 (Shah et al., 2020)

In the model where \(M_{\pi^*\eta^*}\) is bi-isotonic, up to polylogs: \[\mathcal R(n,d, \lambda, \hat x^{(\mathrm{Obi-Wan})}) \lesssim \frac{\sqrt{n}d}{\lambda} \quad \textbf{(not minimax)}\]

Iterative Spectral Voting

Subsampling

Let \(T \geq 1\). We generate \(T\) samples \((Y^{(1)}, \dots, Y^{(T)})\) from \(Y\).

Put \(Y_{ik}\) uniformly at random into one of the \(Y^{(s)}\).

  • \(Y_{ik}^{(s)}= 0\) for all \(s\) except one, for which \(Y_{ik}^{(s)} = Y_{ik}\)
  • The \((Y^{(s)})\)’s are not independent!
  • Technical trick: condition on the sampling scheme (see the sketch below)
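
A sketch of the subsampling step: every entry of \(Y\) is routed to exactly one of the \(T\) copies, chosen uniformly at random.

```python
import numpy as np

def subsample(Y, T, seed=None):
    """Split Y into T matrices: each entry is kept in one uniformly chosen copy and zeroed in the others."""
    rng = np.random.default_rng(seed)
    assignment = rng.integers(0, T, size=Y.shape)        # which copy receives Y_ik
    return [np.where(assignment == s, Y, 0.0) for s in range(T)]
```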

PCA Step

\[\hat v = \underset{\|v\|=1}{\mathrm{argmax}}\|v^T Y^{(1)}\|^2 \quad \text{and} \quad \tilde v = \hat v \land \sqrt{\lambda / T}\]

Main idea: if \(M\) is isotonic, then up to a polylog

\[\|MM^T\|_{\mathrm{op}} \gtrsim \|M\|_F^2\]

Idea for the proof: if \(\|M\|_F^2 \gg \frac{d}{\lambda}\), then \(\|\tilde v^T M\|_2^2 \gtrsim \|M\|_F^2\)
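
A sketch of this PCA step, assuming that \(\hat v \land \sqrt{\lambda/T}\) means clipping the magnitude of each entry of \(\hat v\) at \(\sqrt{\lambda/T}\) (our reading of the notation):

```python
import numpy as np

def truncated_pca_direction(Y1, lam, T):
    """Top left singular vector of Y^(1), with entries clipped at sqrt(lambda/T) in absolute value."""
    U, _, _ = np.linalg.svd(Y1, full_matrices=False)
    v_hat = U[:, 0]
    cap = np.sqrt(lam / T)
    return np.clip(v_hat, -cap, cap)        # tilde_v; clipping limits the weight of any single worker
```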

Voting Step

Define the weighted vote vector

\[\hat w = \tilde v^T Y^{(2)}\]

Define the estimated label as

\[\hat x_k^{(1)} = \mathrm{sign}(\hat w_k)\mathbf{1}\bigg\{|\hat w_k| \gg \sqrt{\sum_{i=1}^n \tilde v_i B_{ik}^{(2)}}\bigg\}\]
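
A sketch of this voting step, with the constant hidden in “\(\gg\)” exposed as a tuning parameter `c`, and \(B^{(2)}\) read off as the support of \(Y^{(2)}\) (both are assumptions on our side):

```python
import numpy as np

def weighted_vote(Y2, v_tilde, c=3.0):
    """Label task k only when its weighted vote clearly exceeds its estimated noise level."""
    w = v_tilde @ Y2                                     # hat_w = tilde_v^T Y^(2)
    B2 = (Y2 != 0).astype(float)                         # observed entries of the second subsample
    noise = np.sqrt(np.abs(v_tilde) @ B2)                # ~ sqrt(sum_i tilde_v_i B_ik); |.| as a safeguard
    return np.sign(w) * (np.abs(w) > c * noise)          # 0 means "not labelled yet"
```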

Iterate

Repeat the PCA step + voting step on \(Y \mathrm{diag}(\hat x = 0)\), i.e. restricted to the tasks that are still unlabelled, a polylogarithmic number of times.

We get \(\hat x^{(1)}, \hat x^{(2)}, \dots, \hat x^{(T)}\)

Output \(\hat x^{(T)}\)
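
Putting the pieces together, a schematic sketch of the whole procedure, reusing the `subsample`, `truncated_pca_direction` and `weighted_vote` helpers sketched above (the pairing of subsamples with rounds and the handling of leftover tasks are assumptions):

```python
import numpy as np

def iterative_spectral_voting(Y, lam, T, c=3.0, seed=None):
    """Schematic sketch: subsample once, then alternate PCA steps and voting steps for T rounds."""
    pieces = subsample(Y, 2 * T, seed)            # one fresh (PCA, vote) pair of subsamples per round
    n, d = Y.shape
    x_hat = np.zeros(d)                           # 0 means "not labelled yet"
    for t in range(T):
        active = x_hat == 0                       # restrict to the still-unlabelled tasks
        if not active.any():
            break
        v_tilde = truncated_pca_direction(pieces[2 * t][:, active], lam, 2 * T)
        x_hat[active] = weighted_vote(pieces[2 * t + 1][:, active], v_tilde, c)
    return x_hat                                  # tasks still at 0 can be labelled arbitrarily
```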


Proof Idea

Let \(M(t) = M\,\mathrm{diag}(\hat x^{(t-1)} = 0)\) be the ability matrix restricted to the tasks still unlabelled after step \(t-1\)

While \(\|M(t)\|_F^2 \gg d/\lambda\), we prove that

\[\|\tilde v^TM(t)\|_2^2 \gtrsim \|M(t)\|_F^2 \quad \text{and} \quad \|\tilde v^TM(t+1)\|_2^2 \lesssim d/\lambda\]

By the Pythagorean theorem, we have

\[\|M(t)\|_F^2 - \|M(t+1)\|_F^2 \geq \|\tilde v^TM(t)\|_2^2 - \|\tilde v^TM(t+1)\|_2^2\]

This leads to an exponential decay of \(\|M(t)\|_F^2\) until \(\|M(t)\|_F^2 \lesssim d/\lambda\)
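
Spelling out the recursion (assuming the two displayed bounds hold with constants \(c \in (0,1)\) and \(C\)): combining them with the Pythagorean inequality gives

\[ \|M(t+1)\|_F^2 \;\leq\; \|M(t)\|_F^2 - \|\tilde v^TM(t)\|_2^2 + \|\tilde v^TM(t+1)\|_2^2 \;\leq\; (1-c)\,\|M(t)\|_F^2 + C\,\frac{d}{\lambda} \enspace ,\]

so as long as \(\|M(t)\|_F^2 \geq \frac{2C}{c}\cdot\frac{d}{\lambda}\), we get \(\|M(t+1)\|_F^2 \leq (1-\frac{c}{2})\|M(t)\|_F^2\), i.e. geometric decay down to the level \(d/\lambda\).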

Simulations

Synthetic Data

 

[Figure: synthetic ability matrix \(M\) — black: \(M_{ik}=0\), blue: \(M_{ik} = h\)]