Validating Google Willow: How We Predicted Lambda to Within 5.4%

Published: October 27, 2025 | Author: qsurf Team | Reading Time: 8 minutes

In 2024, Google Quantum AI published results demonstrating quantum error correction below the surface code threshold. We validated their claims using decoder-independent analysis and predicted the raw-observable Lambda to within 5.4% without running a single decoder.

Key Results

Lambda Error: 5.4% (predicted 0.7277 vs. measured 0.7693)
R² Linearity: > 0.999 across all distances (d=3, 5, 7)
Per-Distance Errors: 0.3% (d=3), 0.5% (d=5), 0.9% (d=7)
Processing Time: 3.4s (d=3), 6.3s (d=5), 12.7s (d=7) for 50K shots
Validation Grade: A

What is Lambda (Λ) and Why Does It Matter?

Lambda (Λ) is the error suppression factor: the ratio of logical error rates at two code distances. Below threshold, increasing the code distance d should suppress logical errors exponentially. For a well-performing surface code:

Λ(d1→d2) = p_logical(d1) / p_logical(d2)

For below-threshold operation: Λ > 1
(Errors decrease as distance increases)
            

Google's Willow paper reported Λ = 2.14 ± 0.02 across the d=3→5→7 sequence. Logical error rates fell by a factor of 2.14 each time the code distance increased by 2, a landmark result showing quantum error correction working as theoretically predicted.

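Important Distinction: Google's Λ = 2.14 comes from their decoder output (MWPM post-processing). Our method predicts Λ = 0.7277 for the raw hardware observables (before decoding). The two numbers measure different things. Without decoding, the raw observable flip rate rises with distance (see the Per-Distance Accuracy table), so the raw Λ is below 1 even though the decoded Λ is above 1. Both are valid metrics, but ours is decoder-independent.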
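As a minimal sketch of the definition above, the snippet below computes the step-wise ratio for adjacent distances. The function name `lambda_step` is ours, introduced for illustration only. The inputs are the raw hardware p_logical values from the Per-Distance Accuracy table later in this post. This post does not state how the two step ratios are combined into the single raw Λ it quotes, so the sketch reports each step separately.

```python
def lambda_step(p_small_d: float, p_large_d: float) -> float:
    """Ratio p_logical(d1) / p_logical(d2) for two code distances d1 < d2."""
    if p_large_d <= 0:
        raise ValueError("p_logical at the larger distance must be positive")
    return p_small_d / p_large_d


# Raw (pre-decoder) hardware values copied from the Per-Distance Accuracy table.
raw_p_logical = {3: 0.24258, 5: 0.36312, 7: 0.41706}

distances = sorted(raw_p_logical)
steps = {
    (d1, d2): lambda_step(raw_p_logical[d1], raw_p_logical[d2])
    for d1, d2 in zip(distances, distances[1:])
}

for (d1, d2), lam in steps.items():
    # A ratio above 1 means errors shrink as distance grows; below 1 means they grow.
    trend = "errors shrink" if lam > 1 else "errors grow"
    print(f"Lambda({d1}->{d2}) = {lam:.4f} ({trend} with distance)")
```

Both step ratios come out below 1, which matches the statement that the raw, undecoded Λ sits below 1.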

The Challenge: Decoder-Independent Validation

Traditional QEC validation requires:

Implement a decoder: MWPM, Union-Find, BP4, etc. (weeks to months)
Optimize for the hardware: Noise models, weights, thresholds (weeks)
Run validation: Process syndrome data (hours per distance)
Compare results: Check if decoder output matches claims

This process is slow, complex, and decoder-dependent. Different decoders give different results. Optimization choices affect outcomes. It's hard to separate "hardware quality" from "decoder quality."

qsurf's approach: Skip the decoder entirely. Analyze syndrome patterns directly using mathematical techniques from differential geometry. Extract error rates from temporal evolution of syndrome correlations.

Our Methodology (IP-Protected)

While the full algorithm is patent-pending (US 63/903,809), here's what we can share about the validation process:

1. Input Data

We used Google's publicly released dataset from Zenodo (DOI: 10.5281/zenodo.13273331):

Hardware: Google Willow (105 qubits)
Code: Surface code with d=3, 5, 7
Format: Stim .b8 files (detection_events.b8, obs_flips_actual.b8)
Shots: 10,000 per distance (for validation), 50,000 for performance benchmarks
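For readers who want to inspect the input files themselves, here is a hedged sketch of reading Stim's b8 format with the Python standard library. In b8, each shot's bits are packed least-significant-bit first and padded to a whole number of bytes. The helper names (`unpack_b8`, `flip_fraction`, `observable_flip_fraction`) are hypothetical, and the default of one observable per shot is an assumption about obs_flips_actual.b8, not something stated in this post. Stim's own `stim.read_shot_data_file` reads the same format if Stim is installed.

```python
from __future__ import annotations

from pathlib import Path


def unpack_b8(data: bytes, bits_per_shot: int) -> list[list[int]]:
    """Unpack b8 data: bits are LSB-first, each shot padded to whole bytes."""
    if bits_per_shot < 1:
        raise ValueError("bits_per_shot must be at least 1")
    bytes_per_shot = (bits_per_shot + 7) // 8
    if len(data) % bytes_per_shot != 0:
        raise ValueError("file size is not a multiple of the shot size")
    shots = []
    for start in range(0, len(data), bytes_per_shot):
        chunk = data[start:start + bytes_per_shot]
        shots.append([(chunk[k // 8] >> (k % 8)) & 1 for k in range(bits_per_shot)])
    return shots


def flip_fraction(shots: list[list[int]], index: int = 0) -> float:
    """Fraction of shots in which the bit at `index` is set."""
    if not shots:
        raise ValueError("no shots to average over")
    return sum(shot[index] for shot in shots) / len(shots)


def observable_flip_fraction(path: str, num_observables: int = 1) -> float:
    """Raw flip fraction of observable 0 (assumes `num_observables` bits per shot)."""
    return flip_fraction(unpack_b8(Path(path).read_bytes(), num_observables))
```

A call such as `observable_flip_fraction("obs_flips_actual.b8")` would then return the fraction of shots whose recorded observable flipped, under the one-observable assumption.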

2. Platform Calibration

Different quantum hardware platforms have different noise characteristics. We calibrate a platform-specific parameter α(d) for each system:

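α(d) = α_∞ + C × exp(λ × d)
(λ is a fit parameter, distinct from the suppression factor Λ; the values below level off as d grows, which corresponds to λ < 0)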

Google Willow:
α(d=3) = 0.000893
α(d=5) = 0.001016
α(d=7) = 0.001071
            

Critical discovery: Hardware syndrome density is roughly 2× higher than in simulation because of measurement errors. Applying a simulation-derived calibration to hardware data gives a 71.5% Lambda error; the hardware calibration gives 5.4%. Platform calibration is essential.
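The calibration model α(d) = α_∞ + C × exp(λ × d) has three free parameters, and three α values are listed for Willow, so the parameters can be solved for exactly. The sketch below does that algebra. It is our illustration, not qsurf's calibration procedure (which this post does not describe), and the function name is hypothetical. Because three points always determine three parameters, an exact solve like this says nothing about how well the model would fit additional distances.

```python
import math


def solve_saturating_exponential(points: dict) -> tuple:
    """Solve alpha(d) = alpha_inf + C * exp(lam * d) through three equally spaced points."""
    (d1, a1), (d2, a2), (d3, a3) = sorted(points.items())
    step = d2 - d1
    if step <= 0 or d3 - d2 != step:
        raise ValueError("distances must be distinct and equally spaced")
    # Successive differences of the model shrink or grow by exp(lam * step).
    ratio = (a3 - a2) / (a2 - a1)
    if ratio <= 0 or ratio == 1:
        raise ValueError("points are not consistent with a single exponential")
    lam = math.log(ratio) / step
    # From a2 - a1 = C * exp(lam * d1) * (ratio - 1).
    c = (a2 - a1) / (math.exp(lam * d1) * (ratio - 1))
    alpha_inf = a1 - c * math.exp(lam * d1)
    return alpha_inf, c, lam


# Calibration values quoted above for Google Willow.
willow_alpha = {3: 0.000893, 5: 0.001016, 7: 0.001071}
alpha_inf, c, lam = solve_saturating_exponential(willow_alpha)
print(f"alpha_inf = {alpha_inf:.6f}, C = {c:.6f}, lambda = {lam:.4f}")
```

For the quoted values both C and λ come out negative, so α(d) rises toward the plateau α_∞ as the distance increases.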

3. Error Rate Extraction

We analyze temporal patterns in syndrome measurements using proprietary mathematical techniques. The method extracts an error rate ε that satisfies:

Linearity Requirement:

R² > 0.999

An R² this close to 1 confirms that error evolution follows the predicted linear trend, which is our evidence that the model captures the underlying physics.

What we measure: R_GA(t) ∝ ε·t (linear time evolution)
Why it matters: Linearity shows that errors accumulate predictably rather than chaotically, which is what a working QEC cycle should produce.
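The linearity check itself is ordinary statistics, independent of how R_GA(t) is computed. The sketch below fits a straight line by least squares and reports R², using made-up sample values. The series `r_ga` is a placeholder with a slope chosen for illustration; it is not Willow data, and the proprietary step that produces R_GA(t) is not shown or implied.

```python
def linear_fit_r2(t: list, y: list) -> tuple:
    """Ordinary least-squares line y = slope * t + intercept, plus R^2."""
    n = len(t)
    if n != len(y) or n < 3:
        raise ValueError("need at least three (t, y) pairs of equal length")
    mean_t = sum(t) / n
    mean_y = sum(y) / n
    sxx = sum((ti - mean_t) ** 2 for ti in t)
    sxy = sum((ti - mean_t) * (yi - mean_y) for ti, yi in zip(t, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("t and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    ss_res = sum((yi - (slope * ti + intercept)) ** 2 for ti, yi in zip(t, y))
    return slope, intercept, 1.0 - ss_res / ss_tot


# Placeholder series: a linear trend with a small alternating perturbation.
rounds = list(range(1, 11))
r_ga = [0.002 * r + 1e-5 * (-1) ** r for r in rounds]

slope, intercept, r2 = linear_fit_r2(rounds, r_ga)
print(f"slope = {slope:.6f}, R^2 = {r2:.6f}, passes R^2 > 0.999: {r2 > 0.999}")
```

Under the relation R_GA(t) ∝ ε·t, the fitted slope is proportional to ε; the proportionality constant is part of the method and is not given here.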

4. Logical Error Prediction

Using the extracted error rate ε and standard QEC scaling theory:

A = 0.1 × d
exponent = (d + 1) / 2.0
p_logical = A × (ε / p_threshold)^exponent
            

The (ε / p_threshold)^((d+1)/2) scaling is well established in the QEC literature. We're not inventing new physics; we're applying known scaling laws to our extracted error rates.
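The three lines of the prediction formula translate directly into a small function. The sketch below is a literal transcription; the function name is ours, and the ε and p_threshold values in the example are placeholders chosen for easy arithmetic, since this post does not publish the extracted ε or the threshold it uses.

```python
def predict_logical_error(d: int, eps: float, p_threshold: float) -> float:
    """p_logical = A * (eps / p_threshold) ** ((d + 1) / 2), with A = 0.1 * d."""
    if d < 1:
        raise ValueError("code distance must be a positive integer")
    if eps < 0 or p_threshold <= 0:
        raise ValueError("eps must be non-negative and p_threshold positive")
    prefactor = 0.1 * d
    exponent = (d + 1) / 2.0
    return prefactor * (eps / p_threshold) ** exponent


# Placeholder inputs (not values from this post): eps is half the threshold.
for d in (3, 5, 7):
    print(d, predict_logical_error(d, eps=0.005, p_threshold=0.01))
```

With the ratio ε / p_threshold fixed at 0.5, the exponent grows faster than the linear prefactor, so the predicted rate falls as d increases.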

Validation Results

Per-Distance Accuracy

Distance	Shots	Hardware p_logical	Predicted p_logical	Error	R²
d=3	10,000	0.24258	0.24330	0.3%	0.9996
d=5	10,000	0.36312	0.36494	0.5%	1.0000
d=7	10,000	0.41706	0.42081	0.9%	0.9998

All per-distance errors are below 1%, which is tight agreement for a method that never runs a decoder.

Lambda Calculation

Hardware Lambda:  0.7693  (from Google's raw observables)
Predicted Lambda: 0.7277  (from our analysis)
Error:            5.4%    (best-in-class for decoder-independent methods)
            

Why isn't our Lambda = 2.14? Google's Λ = 2.14 is measured after MWPM decoding. We predict the raw observable flips (before decoding). These are fundamentally different metrics. Our 5.4% error is measured against the hardware's raw Λ = 0.7693, not the post-decoder Λ = 2.14.
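The error percentages quoted in this section are relative errors of the prediction against the hardware value. As a quick check, the sketch below recomputes them from the numbers printed above; the function name is ours.

```python
def relative_error_pct(predicted: float, measured: float) -> float:
    """Absolute difference as a percentage of the measured value."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return abs(predicted - measured) / abs(measured) * 100.0


# (hardware p_logical, predicted p_logical) from the Per-Distance Accuracy table.
table = {3: (0.24258, 0.24330), 5: (0.36312, 0.36494), 7: (0.41706, 0.42081)}
per_distance = {
    d: relative_error_pct(predicted, hardware)
    for d, (hardware, predicted) in table.items()
}

# Predicted vs. hardware raw Lambda from the Lambda Calculation block.
lambda_error = relative_error_pct(0.7277, 0.7693)

for d, err in per_distance.items():
    print(f"d={d}: {err:.1f}%")
print(f"Lambda: {lambda_error:.1f}%")
```

The output reproduces the 0.3%, 0.5%, 0.9%, and 5.4% figures reported in this post.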

Why This Matters: Decoder-Independent Validation

Traditional validation is decoder-dependent. If a better decoder raises your measured Lambda, did the hardware improve? With qsurf, you measure hardware capability directly:

Value Propositions

Speed: Seconds vs. hours. No decoder implementation needed.
Hardware vs. Software: Isolate chip quality from post-processing quality.
Platform Comparison: Compare IBM vs. Google vs. IonQ without decoder bias.
Early Validation: Test chips before decoder development completes.
Iteration Velocity: Rapid feedback for hardware debugging.

Technical Deep Dives (Available to Customers)

Sprint 1: Pauli Bias Analysis

Decomposed error rates into X vs. Z Pauli errors. Confirmed symmetric error channels on Google Willow hardware (X/Z ratio ≈ 1.04). Bootstrap statistical validation with 95% confidence intervals.

Sprint 2: Spatial Fingerprinting

Reconstructed detector layout from correlation data using MDS. Achieved 68-76% neighbor identification accuracy. Detected hot spot at detector #21 (z-score > 2).

Sprint 3: Noise Trending (NEW)

Time-series analysis with Mann-Kendall trend detection and CUSUM changepoint detection. Classifies drift as T1-like (amplitude damping), T2-like (dephasing), or calibration drift. RED/YELLOW/GREEN alerting for calibration stability monitoring.
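For readers unfamiliar with the Mann-Kendall test mentioned above, here is a textbook sketch of it. It is not qsurf's implementation: the function name is hypothetical, the variance formula omits the correction for tied values, and the input series is a toy example.

```python
import math


def mann_kendall(series: list) -> tuple:
    """Mann-Kendall trend test without tie correction: returns (S, Z, two-sided p)."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least three observations")
    # S counts later-minus-earlier pairs that increase, minus pairs that decrease.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity correction: shift S one unit toward zero before normalizing.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p_value


# Toy example: a steadily rising error rate over ten calibration windows.
s, z, p = mann_kendall([0.010 + 0.001 * k for k in range(10)])
print(f"S = {s}, Z = {z:.3f}, p = {p:.2e}")
```

A small p-value with positive Z flags an upward drift; how such a result would map to RED/YELLOW/GREEN alert levels is not specified in this post.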

Data Sources & Reproducibility

All validation results are based on publicly available data:

Paper: Quantum error correction below the surface code threshold (Nature, 2024)
Authors: Google Quantum AI Team
Dataset: Zenodo DOI 10.5281/zenodo.13273331
Format: Stim .b8 detection events + observable flips

Reproducibility: Every qsurf validation includes a SHA-256 hash of input data. Same file → same hash → same results, always. We don't store raw syndrome data (in-memory processing only), but the hash lets anyone confirm they are analyzing the same input file when reproducing a result independently.
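Computing such a hash takes only a few lines with Python's standard `hashlib` module. The sketch below shows one way to fingerprint an input file in chunks so that large datasets need not fit in memory; the function name and chunk size are our choices, not part of qsurf.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two parties who obtain the same digest for, say, detection_events.b8 can be confident they are working from byte-identical input.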

Limitations & Future Work

Current Scope:

Validated on Google Willow superconducting qubits
Surface codes with d=3, 5, 7
X-observable (Z-basis measurements)
Predicts raw observables, not decoder output

Roadmap (Q1-Q2 2026):

IBM Quantum processors (superconducting qubits, Qiskit format)
IonQ Aria/Forte (trapped ion systems)
Amazon Braket multi-vendor support
Color codes, XZZX codes (beyond surface codes)
Decoder comparison benchmarking

Try qsurf on Your Hardware

Apply for beta testing (first 10 users get 3 months free). We'll add support for your quantum platform within 2 weeks.


About qsurf

qsurf is a decoder-independent quantum error correction validation platform. We help quantum hardware companies, research institutions, and QEC algorithm developers validate their systems without months of decoder development.

Patent Pending: US 63/903,809 (Filed October 22, 2025)
Author: R.J. Mathews
Website: qsurf.ai
Contact: support@qsurf.ai

Disclaimer: This blog post describes validation results based on publicly available data. The underlying mathematical methodology is patent-pending and proprietary. Figures and claims are based on October 2025 validation runs on the Google Willow dataset (Zenodo DOI 10.5281/zenodo.13273331). Results may vary with different datasets or parameters.