the thesis · 01

The data problem is existential for AI.

Compute keeps scaling. Architectures keep improving. But the fuel is running out: high-quality, legally clean, physically grounded training data. For embodied AI and robotics it never existed at scale, which leaves vision-only models physics-blind. Whoever controls the supply of new data controls the next generation of models.

$1T+

Enterprise value exposed to data-provenance litigation

90%

Of usable public web text already in training sets

10×

Gap between projected data demand and supply by 2030

the legal minefield · 02

Scraped data is a liability you can’t insure.

Four exposures turn yesterday’s training advantage into tomorrow’s legal risk.

provenance

Unprovable provenance

Scraped training sets can’t prove where each frame came from. Under emerging regulation, that uncertainty is now a balance-sheet liability.

copyright

Copyright litigation

Rights-holders are litigating the use of their work in training corpora. Settlements and injunctions can halt a model mid-deployment.

consent

Consent & privacy

Faces, locations and private moments scraped without consent expose labs to privacy actions across multiple jurisdictions.

contamination

Dataset contamination

A single poisoned or mislabelled source propagates silently through a model. Scraped data offers no chain of custody to audit.

the comparison · 03

Three ways to feed a model.

capability	scraped web	simulation only	ARRAY
Copyright-safe provenance	✕	✔ yes	✔ yes
Real-world physics (inertia, torque)	partial	✕	✔ yes
Ground-truth kinetic labels	✕	partial	✔ yes
Controllable edge cases	✕	partial	✔ yes
Scale (datasets / year)	manual	limited	200M+
API-native delivery	✕	✕	✔ yes

how we got here · 04

The closing of the open web.

In five years the supply of training data flipped from abundant and free to scarce and contested. ARRAY exists because the old playbook stopped working.

2021

High-quality public web data is still abundant. Labs scale models by scaling scraped tokens — the cheap-data era.

2022

The first copyright suits are filed over training data. The legal status of scraped corpora moves from assumed to contested.

2023

Major platforms wall off their data and sign exclusive licensing deals. The open web begins to close.

2024

Frontier labs report diminishing returns: usable public text is nearly exhausted. Synthetic data moves from hack to strategy.

2025

Embodied AI hits the wall. Vision-only models prove physics-blind and brittle in the real world — and the ground-truth physical data they need was never on the web to scrape.

2026

ARRAY scales supply. A physical data factory captures real-world physics and multiplies one real action into hundreds of millions of proprietary, physically grounded datasets.

why both matter · 05

Quality vs. scale is a false choice.

approach a

Volume alone fails

Billions of noisy, unlabelled frames teach a model correlations, not understanding. Throughput without depth plateaus fast.

approach b

Quality alone fails

A small set of meticulously hand-labelled clips is rich but tiny. Manual annotation can’t reach the scale frontier models need.

ARRAY

ARRAY delivers both

Capture real-world physics once, then let the engine multiply it — ground-truth depth and industrial volume in the same dataset.

WHY

See the supply for yourself.

Request a sample and benchmark ARRAY datasets against your current corpus.

Request a Data Sample

See the hardware