# Deep Research: Can a phone webcam "read your mind" from your eyes?

This document grounds the prototype in the actual science. **Short answer: you cannot
decode the specific words or thoughts in someone's head from their eyes alone — that is
pseudoscience. But you *can* decode coarse, noisy signals: arousal / mental effort,
recognition ("have I seen this?"), and the *direction* of a binary choice or preference.**
This app is built to do the honest, real version of that — a *per-user-trained eye-signal
decoder* — not a magic mind reader.

---

## 1. In-browser eye tracking on phones

| Option | What it gives | Verdict |
|---|---|---|
| **MediaPipe Face Landmarker** (`@mediapipe/tasks-vision`) | 478 3D landmarks incl. **10 iris points**, plus 52 **blendshapes** (eyeLookIn/Out/Up/Down, eyeBlink…). ~30+ FPS on a modern phone with GPU. Actively maintained. | **Chosen engine.** Iris center → gaze ratio; blendshapes → gaze/blink; iris diameter → relative pupil proxy. |
| TensorFlow.js `face-landmarks-detection` | Wrapper around the same FaceMesh/iris model (`refineLandmarks:true` → 478 pts). | Equivalent; we use tasks-vision directly. |
| **WebGazer.js** | Regression from eye-patch → screen (x,y). ~**4° accuracy** (several cm), laggy on Android, click-calibration ill-suited to touch, **maintenance ended ~Feb 2026**. | Not used as primary. |

**Important:** MediaPipe iris gives landmark *positions*, **not** a true gaze vector — Google
states "iris tracking does not infer the location at which people are looking." We derive a
gaze signal ourselves from iris-position-within-the-eye ratios + blendshapes.
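
A minimal TypeScript sketch of the iris-position-within-the-eye ratio described above. It assumes the iris center and the two eye-corner landmarks have already been pulled out of the Face Landmarker result. The names `irisRatio` and `horizontalGaze` are illustrative, not part of MediaPipe, and the landmark indices for corners and iris centers should be checked against the model's landmark map.

```typescript
// Minimal 2D point. MediaPipe landmarks also carry z, which this sketch ignores.
interface Point { x: number; y: number; }

/**
 * Position of the iris center along the line between the two eye corners.
 * 0 = at cornerA, 1 = at cornerB, 0.5 = centered. Clamped to [0, 1].
 */
function irisRatio(iris: Point, cornerA: Point, cornerB: Point): number {
  const ax = cornerB.x - cornerA.x;
  const ay = cornerB.y - cornerA.y;
  const len2 = ax * ax + ay * ay;
  if (len2 === 0) return 0.5; // degenerate eye box: report "centered"
  // Scalar projection of (iris - cornerA) onto the corner-to-corner axis.
  const t = ((iris.x - cornerA.x) * ax + (iris.y - cornerA.y) * ay) / len2;
  return Math.min(1, Math.max(0, t));
}

/**
 * Blend both eyes into one horizontal signal in [-1, 1]; 0 means centered.
 * Assumes cornerA was the image-left corner for both eyes, so both ratios
 * grow in the same direction.
 */
function horizontalGaze(leftEyeRatio: number, rightEyeRatio: number): number {
  return leftEyeRatio + rightEyeRatio - 1;
}
```

The sign of the result depends on the corner ordering and on whether the video is mirrored, so it is worth checking once against a known "look left" sample before combining it with the blendshape values.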

**"Auto-zoom and lock on the eyes":** front-camera **hardware** zoom (`applyConstraints({zoom})`)
is **unavailable on iPhone and most phones** — it's only present if the driver advertises it
(mostly Android rear cams). So the app does **software crop + upscale**: take the eye bounding
box from the landmarks, crop it out of the video frame, and draw it enlarged on a canvas. This
works on every device and is the reliable "locked & zoomed eye" view.
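
The crop step can be sketched as a pure function that turns eye landmarks into a source rectangle for the canvas. This is an illustration under assumptions: landmarks are normalized to [0, 1], and `eyeCropRect` and its `pad` parameter are hypothetical names rather than the app's actual code.

```typescript
interface Point { x: number; y: number; } // normalized [0, 1] landmark coordinates
interface Rect { sx: number; sy: number; sw: number; sh: number; } // pixels in the video frame

/** Padded bounding box around the eye landmarks, clamped to the frame. */
function eyeCropRect(eyePoints: Point[], frameW: number, frameH: number, pad = 0.4): Rect {
  if (eyePoints.length === 0) throw new Error("no eye landmarks");
  let minX = Infinity, minY = Infinity, maxX = -Infinity, maxY = -Infinity;
  for (const p of eyePoints) {
    minX = Math.min(minX, p.x * frameW);
    maxX = Math.max(maxX, p.x * frameW);
    minY = Math.min(minY, p.y * frameH);
    maxY = Math.max(maxY, p.y * frameH);
  }
  // Pad proportionally so the lids and some context stay in view.
  const padX = (maxX - minX) * pad;
  const padY = (maxY - minY) * pad;
  const sx = Math.max(0, minX - padX);
  const sy = Math.max(0, minY - padY);
  const ex = Math.min(frameW, maxX + padX);
  const ey = Math.min(frameH, maxY + padY);
  return { sx, sy, sw: ex - sx, sh: ey - sy };
}

// In the browser, the rectangle feeds the 9-argument form of drawImage,
// which crops and upscales in one call:
//   const r = eyeCropRect(points, video.videoWidth, video.videoHeight);
//   ctx.drawImage(video, r.sx, r.sy, r.sw, r.sh, 0, 0, canvas.width, canvas.height);
```

Keeping the geometry separate from the drawing call makes the crop testable without a camera, and smoothing the rectangle over a few frames would reduce jitter in the zoomed view.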

## 2. What eyes can and cannot reveal (the science)

**Robustly supported:**
- **Pupil = cognitive load / arousal** (task-evoked pupillary response; Hess & Polt 1964, Kahneman &
  Beatty 1966). Effect is *small* (~0.1–0.5 mm), latency ~200–400 ms, peak ~1–2 s, driven by the
  locus-coeruleus/noradrenergic system. Load classification ~75%.
- **Pupil predicts an upcoming binary choice / bias** before the overt response (de Gee, Knapen &
  Donner, PNAS 2014; Strauch et al., Sci Reports 2018 — *pupil* reveals decision formation,
  microsaccade rate does not). It reveals the *direction*, not the reasoning.
- **Pupil old/new effect** — recognized "old" items dilate the pupil more than correctly-rejected
  "new" ones; oddball/surprising targets dilate the pupil.
- **Gaze cascade** (Shimojo et al., Nature Neuroscience 2003) — when choosing the more attractive
  of two items, gaze progressively shifts toward the to-be-chosen one in the last ~0.6–1.5 s. It
  both *reflects and biases* preference. Predicting the chosen item from gaze alone reaches
  ~**68–81%** on 2-alternative choices (Bee et al. 2006).
- **Reading vs skimming** is classifiable from fixation/saccade patterns (~82%).
- **Concealed/recognition ("guilty knowledge")** via pupil + oculomotor inhibition (Cook et al.
  2012) — recognized items get larger pupil responses and *less* re-fixation. ML binary
  concealment ~74%.

**NOT supported (don't claim it):**
- Decoding **specific words, sentences, or arbitrary silent thoughts** from eye movements. There
  is no validated method. The "eye–mind link" in reading works only because the eyes are
  *physically scanning external text*; remove the stimulus and meaning-tracking breaks down.
- The NLP "looking left/right reveals lying / recall modality" eye-accessing-cues claim is
  **debunked** (Wiseman et al. 2012).
- Note: the classic "Guilty Knowledge" **P300 is an EEG signal, not eye tracking** — pupil/ocular
  CITs are weaker analogues, and the two are dissociable.

**Bottom line:** eyes are a *low-bandwidth window into states* (aroused/not, recognized/novel,
prefer-A/prefer-B) — decodable at *modest* accuracy under controlled conditions — not a readout
of propositional content.

## 3. Paradigms that actually work on a noisy webcam

| Paradigm | Signal | Accuracy | Webcam-robust? |
|---|---|---|---|
| **Dwell selection** | gaze position | >98% | **High** (best at low FPS) |
| **Gaze-cascade choice prediction** | gaze dwell ratio | ~68–81% (2-AFC) | **High** (slow accumulation) |
| Mind-writing pupil (Mathôt 2016, brightness-oscillation oddball) | pupil light-response phase | ~87–91% (2–8 opts) | Medium (timing-encoded → robust to absolute-size noise) |
| Mental-arithmetic yes/no (Stoll 2013, locked-in patients, *ran on a bedside camera*) | pupil dilation slope, 2 intervals | high, no calibration | Medium |
| Target-flash pupil oddball | stimulus-locked dilation | improves with averaging | Medium |

→ The prototype uses **gaze-cascade / dwell (Mode A: "Focus Duel")** as the reliable core, and a
**flash-the-words oddball (Mode B: "Word Oracle")** as the literal "words flashed on screen" mode,
with pupil-timing + recognition features and **trial averaging**.
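
A rough sketch of the gaze-cascade idea behind Mode A: count how much of the final window the gaze spent on each side and predict the side with more dwell. It assumes two options split at the screen midline and gaze `x` normalized to [0, 1]; `GazeSample` and `predictChoice` are illustrative names, and the 1500 ms default simply sits inside the ~0.6–1.5 s range quoted above.

```typescript
interface GazeSample { t: number; x: number; valid: boolean; } // t in ms, x in [0, 1]

interface ChoicePrediction { choice: "left" | "right" | "none"; dwellRatio: number; }

/** Predict a 2-alternative choice from dwell in the last `windowMs` of the trial. */
function predictChoice(samples: GazeSample[], windowMs = 1500): ChoicePrediction {
  const valid = samples.filter((s) => s.valid); // drop blink / track-loss frames
  if (valid.length === 0) return { choice: "none", dwellRatio: 0.5 };
  let tEnd = -Infinity;
  for (const s of valid) tEnd = Math.max(tEnd, s.t);
  let left = 0;
  let right = 0;
  for (const s of valid) {
    if (s.t < tEnd - windowMs) continue; // only the final window counts
    if (s.x < 0.5) left++;
    else right++;
  }
  const dwellRatio = right / (left + right); // fraction of final-window samples on the right
  const choice = dwellRatio > 0.5 ? "right" : dwellRatio < 0.5 ? "left" : "none";
  return { choice, dwellRatio };
}
```

Counting samples equals measuring dwell time only if frames are evenly spaced, so jittery webcam frames would need time-weighting or resampling first.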

## 4. ML pipeline (per-user, in the browser)

1. **Collect** raw gaze (x,y) + relative pupil per frame; mark invalid/blink samples.
2. **Clean**: deblink, smooth (EMA), **subtractive** baseline correction (divisive is artifact-prone,
   Mathôt 2018), reject implausible-baseline trials, and ignore pupil responses that start earlier than
   ~220 ms after the stimulus (faster than the pupil can react, so likely artifact).
3. **Features per trial** (~10–25 scalars): dwell ratios, gaze mean/half-difference (cascade slope),
   final-window gaze, fixation/saccade proxies, blink rate, pupil peak / time-to-peak / AUC / slope.
4. **Model**: start with **logistic regression / tiny 1-hidden-layer net** (L2 + dropout). At
   per-user data scale, simple models + good features beat deep nets. Standardize features (fit on
   train only).
5. **Personalize**: per-user training is **mandatory** (individual gaze offset & decision bias);
   optionally pre-train then fine-tune.
6. **Average N repetitions per decision** — the single biggest accuracy lever (≈log-linear gains).
7. **Persist** model to **IndexedDB**, scaler to localStorage; retrain as data accumulates.
8. **Evaluate** honestly: never split within a trial; baseline = chance (50% / 1-of-K); report
   single-trial *and* N-averaged accuracy with a binomial confidence interval.
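
The smoothing and baseline parts of the cleaning step can be sketched as below. This is a simplified illustration: it assumes blink samples are already marked as `NaN`, and `emaSmooth`, `subtractBaseline`, and the `alpha` default are hypothetical choices rather than the app's tuned values.

```typescript
/** Exponential moving average; NaN (blink) samples stay NaN and do not update the state. */
function emaSmooth(values: number[], alpha = 0.3): number[] {
  const out: number[] = [];
  let prev: number | undefined;
  for (const v of values) {
    if (Number.isNaN(v)) {
      out.push(NaN);
      continue;
    }
    prev = prev === undefined ? v : alpha * v + (1 - alpha) * prev;
    out.push(prev);
  }
  return out;
}

/**
 * Subtractive baseline correction: subtract the mean of the first
 * `baselineSamples` valid samples from every sample in the trial.
 */
function subtractBaseline(trial: number[], baselineSamples: number): number[] {
  const base = trial.slice(0, baselineSamples).filter((v) => !Number.isNaN(v));
  if (base.length === 0) throw new Error("no valid baseline samples");
  const mean = base.reduce((a, b) => a + b, 0) / base.length;
  return trial.map((v) => v - mean);
}
```

Throwing on an empty baseline is one way to implement "reject implausible-baseline trials": the caller can catch the error and drop the trial instead of silently producing a biased trace.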

**Realistic accuracy:** single-trial binary ~**60–75%**; with 3–7 averaged repeats it can climb into
the 80s. This is *above chance*, not clairvoyance.
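
To make these numbers concrete, here is a hedged sketch of two helpers: a Wilson score interval for a measured accuracy, and the accuracy a majority vote over N repeats would reach. The majority vote is a stand-in for signal averaging and assumes independent repeats with equal single-trial accuracy, which real trials only approximate. Function names are illustrative.

```typescript
/** Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%). */
function wilsonInterval(correct: number, total: number, z = 1.96): { low: number; high: number } {
  if (total <= 0) throw new Error("total must be positive");
  const p = correct / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}

/** Binomial coefficient C(n, k), computed iteratively. */
function binomial(n: number, k: number): number {
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

/** Probability that a majority of n independent trials is correct, each with accuracy p. */
function majorityAccuracy(p: number, n: number): number {
  if (n % 2 === 0) throw new Error("use an odd n to avoid ties");
  let total = 0;
  for (let k = (n + 1) / 2; k <= n; k++) {
    total += binomial(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
  }
  return total;
}
```

Under these assumptions, `majorityAccuracy(0.65, 5)` comes out near 0.76, and `wilsonInterval(70, 100)` gives roughly 0.60 to 0.78, which shows how wide the uncertainty still is after 100 trials.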

## 5. Front-camera limits & mitigations

- ~720p/30fps typical; the **pupil spans only a handful-to-dozens of pixels** → **no absolute pupil
  diameter** on an RGB cam (IR lab trackers get a sharp high-contrast pupil). Use **relative**
  within-subject signals only.
- Ambient light / screen brightness drive the pupil light reflex and **contaminate** the cognitive
  signal → keep lighting and screen brightness constant; use timing/phase-encoded designs.
- iOS Safari: **no ImageCapture**, **no front-cam zoom**; needs HTTPS; `<video>` must be `muted` +
  `playsinline`; WebGPU only since iOS 26 (WebGL/WASM otherwise). → canvas crop, WebGL path.
- Mitigations: hold head still & fixed distance, per-session calibration, large well-separated
  targets, **average many trials**.

## 6. Privacy & honest framing (built into the app)

- **On-device only.** Video frames are **never uploaded**; all inference runs locally. Only derived
  numeric features + the model are stored, in your browser (IndexedDB/localStorage), and can be wiped.
- Eye data is **sensitive biometric data** (can encode identity, health, traits). Treated as such:
  explicit **consent** screen, local processing, easy reset. (cf. BIPA — iris scans are biometric
  identifiers requiring written consent; GDPR Art. 9 — biometrics are special-category data.)
- **Not "mind reading."** The **Barnum/Forer effect** (people rate generic feedback ~4.3/5 as
  personally accurate) and **cold reading** make such demos *feel* far more accurate than they are.
  The app states its real, measured accuracy vs chance and labels the ambitious mode "experimental."

---

## 7. Frontier upgrade — what makes this build genuinely SOTA

A second deep-research pass (pupillary frequency tagging, sequential decoding, robust webcam DSP)
pushed the design past typical hobby/demo projects:

**A. Pupil Resonance — calibration-free covert-attention decoding (the flagship).**
Implements **pupillary frequency tagging** (Naber, Alvarez & Nakayama 2013): each on-screen option
flickers in luminance at a distinct frequency; *covertly* attending one (eyes fixed on a central
cross) amplifies the pupil's entrainment at that option's frequency. We decode with the **Goertzel**
algorithm and an **SNR-per-tag-bin** metric (the SSVEP robustness trick, which also cancels the
pupil's 1/f amplitude bias). Parameters are taken straight from the literature:
- 4-target frequency set **1.5 / 1.75 / 2.0 / 2.25 Hz** (Naber Exp 2); 2-target **1.0 / 2.0 Hz**.
- Tags are kept in the pupil's responsive band (**0.8–2.5 Hz**); the pupil is a low-pass system and does not follow flicker above ~4 Hz.
- **Gamma-corrected sinusoidal** luminance, high modulation depth, **constant mean luminance**
  background; **central fixation + covert attention**.
- Pipeline: deblink (cubic/linear, ~66 ms pad) → drop first 1 s (onset PLR) → detrend/high-pass →
  z-score → Hann → Goertzel SNR → argmax. ~6 s windows; **coherent accumulation** across windows.
- A 30 fps webcam has a 15 Hz Nyquist limit, far above the ≤2.5 Hz tags, so sample rate is *not* the
  limiter; amplitude noise is. Lab accuracy ranges from 73% (4-opt FFT, Naber) to ~88–91% (Mathôt
  likelihood); webcam will be lower, so we lean on longer windows + accumulation. **Decoding a webcam
  PFT signal appears to be novel**: we found no published study that demonstrates it, so there is no
  prior webcam benchmark to compare against.
- *Note:* Mathôt's famous "Mind-Writing Pupil" is often mislabeled as frequency tagging — it's
  single-frequency (0.8 Hz) **phase**-multiplexing with a time-domain likelihood decoder. The true
  FFT paradigm we implement is **Naber et al. 2013**.
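
The core of the decoder (Hann window, Goertzel power, SNR per tag, argmax) can be sketched as follows. This is a simplified illustration, not the full pipeline: mean removal stands in for detrend/high-pass, and deblinking, the onset drop, and coherent accumulation are omitted. The function names and the `neighborOffsetHz` default are assumptions; the offset must be chosen so neighbor frequencies do not land on another tag, which the 0.5 Hz default only satisfies for the 1.0 / 2.0 Hz set.

```typescript
/** Goertzel algorithm: signal power at one frequency, without a full FFT. */
function goertzelPower(x: number[], freqHz: number, sampleRateHz: number): number {
  const coeff = 2 * Math.cos((2 * Math.PI * freqHz) / sampleRateHz);
  let s1 = 0;
  let s2 = 0;
  for (const v of x) {
    const s0 = v + coeff * s1 - s2;
    s2 = s1;
    s1 = s0;
  }
  return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}

/** Remove the mean, then apply a Hann window to limit spectral leakage. */
function hannDetrended(x: number[]): number[] {
  const n = x.length;
  if (n < 2) return x.slice();
  const mean = x.reduce((a, b) => a + b, 0) / n;
  return x.map((v, i) => (v - mean) * (0.5 - 0.5 * Math.cos((2 * Math.PI * i) / (n - 1))));
}

/** Power at the tag frequency relative to the mean power at two neighbor frequencies. */
function tagSnr(x: number[], tagHz: number, sampleRateHz: number, neighborOffsetHz: number): number {
  const signal = goertzelPower(x, tagHz, sampleRateHz);
  const noise =
    (goertzelPower(x, tagHz - neighborOffsetHz, sampleRateHz) +
      goertzelPower(x, tagHz + neighborOffsetHz, sampleRateHz)) / 2;
  return signal / (noise + 1e-12); // small constant avoids division by zero
}

/** Index of the tag frequency with the highest SNR in this window. */
function decodeTag(pupil: number[], tagsHz: number[], sampleRateHz: number, neighborOffsetHz = 0.5): number {
  const w = hannDetrended(pupil);
  let best = 0;
  let bestSnr = -Infinity;
  tagsHz.forEach((f, i) => {
    const snr = tagSnr(w, f, sampleRateHz, neighborOffsetHz);
    if (snr > bestSnr) {
      bestSnr = snr;
      best = i;
    }
  });
  return best;
}
```

Goertzel evaluates only the few frequencies of interest and works at arbitrary (non-bin) frequencies, which suits a handful of tags on short windows better than a full FFT.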

**B. Sequential Bayesian stopping (θ=0.9) — across every mode.**
Instead of a fixed number of repeats, we accumulate a posterior over options and **stop when the
leader crosses 0.9** (or a max-window cap). In the BCI literature this yields **100–550% bit-rate
gains at equal accuracy** (Mainsah et al.; SPRT 84% vs fixed 72–76%). Focus Duel and Pupil Resonance
both use it, with live confidence feedback and ITR (Wolpaw bits/min) reporting.
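
A minimal sketch of the stopping rule and the ITR figure. It assumes each window yields one likelihood per option (for example, calibrated classifier probabilities) and that windows are independent; `updatePosterior`, `decideSequentially`, and `wolpawBitsPerSelection` are illustrative names.

```typescript
/** One Bayes update: multiply the prior by per-option likelihoods and renormalize. */
function updatePosterior(prior: number[], likelihood: number[]): number[] {
  const unnorm = prior.map((p, i) => p * (likelihood[i] ?? 0));
  const z = unnorm.reduce((a, b) => a + b, 0);
  if (z <= 0) return prior.slice(); // uninformative window: keep the prior
  return unnorm.map((u) => u / z);
}

/** Accumulate windows until the leading option reaches theta or the windows run out. */
function decideSequentially(
  likelihoodsPerWindow: number[][],
  numOptions: number,
  theta = 0.9,
): { choice: number; confidence: number; windowsUsed: number } {
  let posterior: number[] = new Array<number>(numOptions).fill(1 / numOptions);
  let windowsUsed = 0;
  for (const lik of likelihoodsPerWindow) {
    posterior = updatePosterior(posterior, lik);
    windowsUsed++;
    if (Math.max(...posterior) >= theta) break; // confident enough: stop early
  }
  const confidence = Math.max(...posterior);
  return { choice: posterior.indexOf(confidence), confidence, windowsUsed };
}

/** Wolpaw information transfer per selection, in bits, for N options at accuracy P. */
function wolpawBitsPerSelection(numOptions: number, accuracy: number): number {
  if (numOptions < 2) return 0;
  // q * log2(q / scale), with the 0 * log(0) case defined as 0.
  const term = (q: number, scale: number) => (q > 0 ? q * Math.log2(q / scale) : 0);
  return Math.log2(numOptions) + term(accuracy, 1) + term(1 - accuracy, numOptions - 1);
}
```

Multiplying the bits per selection by selections per minute gives bits/min. The length of `likelihoodsPerWindow` acts as the max-window cap, so the function always returns a best guess along with the confidence it actually reached.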

**C. Honest validation built into the UI.**
A one-tap **Validate** panel runs **leave-one-out cross-validation + a 300-run label-permutation
test** on your own data and reports accuracy ± CI, chance, and a **p-value** — the accepted standard
(Combrisson & Jerbi) for claiming above-chance decoding, and the antidote to the Barnum illusion.
Per-modality classifiers can be **temperature/Platt-calibrated** so the Bayesian threshold is valid.
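
The validation logic can be sketched with a deliberately tiny classifier. This is an illustration under assumptions: one scalar feature per trial, binary labels 0/1, and a nearest-class-mean rule standing in for the app's real model; the seeded generator is only there to make the shuffles reproducible. All names are hypothetical.

```typescript
interface Trial { feature: number; label: number; } // label is 0 or 1

/** Small seeded PRNG (mulberry32-style) so permutation runs are reproducible. */
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function mean(v: number[]): number {
  return v.reduce((a, b) => a + b, 0) / v.length; // NaN for an empty class
}

/** Leave-one-out accuracy of a nearest-class-mean classifier. */
function looAccuracy(trials: Trial[]): number {
  let correct = 0;
  trials.forEach((held, i) => {
    const rest = trials.filter((_, j) => j !== i); // never train on the held-out trial
    const m0 = mean(rest.filter((t) => t.label === 0).map((t) => t.feature));
    const m1 = mean(rest.filter((t) => t.label === 1).map((t) => t.feature));
    const predicted = Math.abs(held.feature - m1) < Math.abs(held.feature - m0) ? 1 : 0;
    if (predicted === held.label) correct++;
  });
  return correct / trials.length;
}

/** Label-permutation test: how often do shuffled labels score at least as well? */
function permutationTest(trials: Trial[], runs = 300, seed = 1): { accuracy: number; pValue: number } {
  const rand = seededRandom(seed);
  const observed = looAccuracy(trials);
  let atLeast = 0;
  for (let r = 0; r < runs; r++) {
    const labels = trials.map((t) => t.label);
    for (let i = labels.length - 1; i > 0; i--) { // Fisher-Yates shuffle
      const j = Math.floor(rand() * (i + 1));
      [labels[i], labels[j]] = [labels[j], labels[i]];
    }
    const shuffled = trials.map((t, i) => ({ feature: t.feature, label: labels[i] }));
    if (looAccuracy(shuffled) >= observed) atLeast++;
  }
  return { accuracy: observed, pValue: (atLeast + 1) / (runs + 1) }; // +1 keeps p above zero
}
```

With 300 runs the smallest reportable p-value is 1/301, so the panel can support "p < 0.01" but not much stronger claims without more permutations.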

**D. Signal-processing rigor** (Mathôt 2018; mPRF 2023; Shah et al. 2024): subtractive baseline,
deblink with ~66 ms pad, **uniform resampling** of jittery webcam frames (use
`requestVideoFrameCallback` `mediaTime`), per-window z-scoring, edge trimming, and per-trial
**quality gating** (reject >30% track-loss), with relative pupil normalized by inter-ocular distance.
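
The uniform-resampling step can be sketched with linear interpolation. This assumes per-frame timestamps in seconds (the unit `mediaTime` uses), strictly increasing and already free of blink gaps; `resampleUniform` is an illustrative name.

```typescript
/** Linearly interpolate irregularly timed samples onto a uniform grid at `rateHz`. */
function resampleUniform(
  times: number[],
  values: number[],
  rateHz: number,
): { t: number[]; v: number[] } {
  if (times.length !== values.length || times.length < 2) {
    throw new Error("need at least two samples with matching timestamps");
  }
  const step = 1 / rateHz;
  const t0 = times[0];
  const tEnd = times[times.length - 1];
  const outT: number[] = [];
  const outV: number[] = [];
  let j = 0; // index of the segment [times[j], times[j + 1]] containing t
  for (let k = 0; t0 + k * step <= tEnd + 1e-9; k++) {
    const t = t0 + k * step;
    while (j < times.length - 2 && times[j + 1] < t) j++;
    const ta = times[j];
    const tb = times[j + 1];
    // Clamp to [0, 1] so rounding at the last grid point cannot extrapolate.
    const frac = tb > ta ? Math.min(1, Math.max(0, (t - ta) / (tb - ta))) : 0;
    outT.push(t);
    outV.push(values[j] + frac * (values[j + 1] - values[j]));
  }
  return { t: outT, v: outV };
}
```

A fixed grid matters because spectral methods such as Goertzel assume evenly spaced samples; feeding them raw jittery frames smears energy across frequencies.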

### Frontier sources
Naber, Alvarez & Nakayama 2013 "Tracking the allocation of attention using human pupillary oscillations" (Front. Psychol.) ·
Mathôt et al. 2016 "The Mind-Writing Pupil" (PLOS ONE) + `smathot/mind-writing-pupil` ·
Mathôt et al. 2013 "Pupillary light response reveals the focus of covert visual attention" ·
Schütz et al. (pupil frequency tagging in binocular rivalry, F1/phase method) ·
Mainsah et al. (P300 dynamic stopping, bit-rate gains) · SPRT-for-BCI (PMC5734001) · θ=0.9 Bayesian dynamic stopping (PMC3798004) ·
Krajbich & Rangel aDDM (PNAS 2011); GLAM/GLAMbox; Papadopoulos et al. (gaze→preference, last-fixation 76%, arXiv 2603.24849) ·
Kappel et al. JOV 2021 (pupil single-trial AUC 0.76) · Mathôt 2018 preprocessing (PMC5809553); mPRF (PMC11036301) ·
Shah et al. 2024 webcam pupil upscaling (arXiv 2408.10397); PupilSense (arXiv 2407.11204) ·
Combrisson & Jerbi (permutation testing of CV accuracy, PMC4053638); temperature scaling (arXiv 1909.10155).

### Selected sources
MediaPipe Face Landmarker (ai.google.dev) · TF.js face-landmarks-detection · WebGazer (webgazer.cs.brown.edu) ·
Kahneman & Beatty 1966; Mathôt 2018 "preprocessing & baseline correction of pupil data" (PMC5809553) ·
de Gee et al. PNAS 2014; Strauch et al. Sci Reports 2018 · Shimojo et al. Nat Neurosci 2003; Bee et al. 2006 ·
Mathôt et al. "The Mind-Writing Pupil" PLOS ONE 2016 · Stoll et al. Current Biology 2013 ·
PupilSense arXiv:2407.11204; "Pupil diameter prediction benefits from upscaling" arXiv:2408.10397 ·
Forer 1948 / Barnum effect · BIPA; GDPR Art. 9.
*(Full URL list in the research transcript that generated this summary.)*