The Production Data Problem
Why Industrial AI fails without a unified data foundation, and the architecture that fixes it.
6 Chapters
The Illusion of Data Readiness
Every large industrial organization believes it has enough data, and in volume terms it usually does. What it lacks is coherence: the data problem is really a coherence problem.
A filling machine on Line 3 is called FL-003 in the SCADA system, FILLER_L3 in the MES, Equipment #4471 in SAP, and Llenadora Linea 3 in the maintenance logs. Same physical asset. Four identities. Zero interoperability.
This is not a technology gap. It is an ontological one. The data exists, but it lacks shared identity, taxonomy, and context. And every downstream initiative — automated production reporting, control towers, predictive maintenance, AI-driven quality — is blocked by the same root cause: the factory has no coherent model of itself.
Anatomy of the Production Data Stack
Production data does not move in a single pipeline. It flows through a stack, and each layer serves a distinct purpose. Skip one, and the entire system produces unreliable outputs.
Layer 1: PLCs, SCADA, HMIs, and sensors. The critical design decision is how to extract data without disrupting operations.
Layer 2: Raw OT data needs temporal and operational context. A temperature reading of 72.3 is a bare number until you know when it was taken and which asset, process step, and product it belongs to.
Layer 3: This is the layer most organizations skip. Without it, a 'Line Stop' in Plant A and a 'Paro de Linea' in Plant B cannot be recognized as the same event.
Layer 4: The canonical identity layer. It knows that FL-003, FILLER_L3, and Equipment #4471 are the same physical asset.
Layer 5: Only with Layers 1-4 in place can you reliably build intelligence.
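The middle layers of the stack can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the canonical ID `line3.filler`, the event code `LINE_STOP`, the `ContextualReading` fields, and the unit are all assumptions introduced here for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical alias table: source-system name -> canonical asset ID.
ASSET_ALIASES = {
    "FL-003": "line3.filler",             # SCADA
    "FILLER_L3": "line3.filler",          # MES
    "Equipment #4471": "line3.filler",    # SAP
    "Llenadora Linea 3": "line3.filler",  # maintenance logs
}

# Hypothetical shared vocabulary: local event label -> common event code.
EVENT_CODES = {
    "line stop": "LINE_STOP",
    "paro de linea": "LINE_STOP",
}


@dataclass(frozen=True)
class ContextualReading:
    """A raw value plus the context that makes it interpretable."""
    asset_id: str
    process_step: str
    product: str
    value: float
    unit: str
    timestamp: datetime


def resolve_asset(source_name: str) -> str:
    """Identity layer: map a system-specific name to the canonical asset ID."""
    return ASSET_ALIASES[source_name]


def normalize_event(label: str) -> str:
    """Normalization layer: map a plant-local label to the shared event code."""
    return EVENT_CODES[label.strip().lower()]


def contextualize(tag, value, unit, process_step, product, timestamp):
    """Context layer: attach asset, process step, product, and time to a raw value."""
    return ContextualReading(resolve_asset(tag), process_step, product,
                             value, unit, timestamp)


reading = contextualize("FL-003", 72.3, "degC", "filling", "SKU-A",
                        datetime(2024, 1, 1, tzinfo=timezone.utc))
```

With these tables in place, `normalize_event("Line Stop")` and `normalize_event("Paro de Linea")` return the same code, which is what makes cross-plant comparison possible.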
The Identity Resolution Problem
Identity resolution is the process of determining that multiple records across multiple systems refer to the same real-world entity. In consumer tech, this problem has been solved at scale. In industrial environments, it remains the primary blocker.
Deterministic matching works where naming conventions are enforced, and typically resolves 40-60% of entities. The rest require non-deterministic methods: fuzzy string matching, NLP on maintenance descriptions, and ML-assisted entity resolution.
The practical approach combines both: deterministic first (fast, high confidence), non-deterministic second (slower, requires validation), human-in-the-loop for ambiguous cases.
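The tiered approach described above can be sketched as follows. This is a simplified illustration under stated assumptions: the lookup tables, the canonical descriptions, and both thresholds are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever fuzzy or ML-assisted matcher a real deployment would use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical exact-match table built from enforced naming conventions.
EXACT = {"FL-003": "line3.filler", "FILLER_L3": "line3.filler"}

# Hypothetical canonical assets with a normalized description for fuzzy matching.
CANONICAL = {
    "line3.filler": "filler line 3",
    "line3.capper": "capper line 3",
}


def _norm(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def resolve(name: str, auto_threshold: float = 0.85, review_threshold: float = 0.6):
    """Return (canonical_id, tier, score) for a source-system name.

    Tier 1: deterministic lookup. Tier 2: fuzzy match accepted automatically.
    Tier 3: fuzzy match queued for human review. Otherwise: unmatched.
    """
    if name in EXACT:
        return EXACT[name], "deterministic", 1.0

    best_id, best_score = None, 0.0
    for canonical_id, description in CANONICAL.items():
        score = SequenceMatcher(None, _norm(name), description).ratio()
        if score > best_score:
            best_id, best_score = canonical_id, score

    if best_score >= auto_threshold:
        return best_id, "fuzzy", best_score
    if best_score >= review_threshold:
        return best_id, "needs_review", best_score
    return None, "unmatched", best_score
```

The thresholds control how much work reaches the human reviewer. Raising `auto_threshold` sends more borderline names, such as translated or abbreviated labels, to review instead of accepting them automatically.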
Architecture Patterns That Scale
Pattern 1: A software-based gateway connects to PLCs via OPC UA, S7, or Modbus. It adds no hardware to the OT network and requires no changes to PLC programs.
Pattern 2: For mature OT infrastructure, direct streaming via MQTT or an HTTPS API eliminates the gateway entirely.
Pattern 3: Edge processing handles time-critical logic, while the cloud handles analytics, ML training, and cross-plant benchmarking.
The right pattern depends on OT maturity, not aspiration. Start with Pattern 1 for the first plant. Validate end-to-end. Then evaluate Pattern 2 or 3 for scale-out.
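For direct streaming, most of the design work lies in agreeing on a message shape. The sketch below shows one way a tag reading might be serialized before it is handed to an MQTT client or posted to an HTTPS endpoint. The topic hierarchy (`site/line/asset/tag`) and the payload fields are assumptions made for this example, not a standard, and no broker connection is shown.

```python
import json
from datetime import datetime, timezone


def build_message(site: str, line: str, asset: str, tag: str,
                  value: float, ts: datetime) -> tuple[str, str]:
    """Return a (topic, payload) pair for one tag reading.

    The topic layout and payload keys are hypothetical conventions.
    """
    topic = f"{site}/{line}/{asset}/{tag}"
    payload = json.dumps({"value": value, "ts": ts.isoformat()}, sort_keys=True)
    return topic, payload


topic, payload = build_message(
    "plant-a", "line3", "filler", "temperature",
    72.3, datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Putting the asset path in the topic lets subscribers filter by plant or line without parsing payloads. It only works across plants if the asset names in the topic come from the canonical identity layer.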
From Foundation to Intelligence
Once the foundation is operational, the application layer comes together quickly. Each use case is impossible without resolved data and straightforward to build with it.
Real-Time OEE That Operators Trust — the shift from 'yesterday's OEE was 67%' to 'OEE dropped 4 points in the last 20 minutes because of filler jams on Line 3' changes how operators make decisions.
Predictive Models That Actually Predict — without resolved data, ML models learn noise. They correlate artifacts of inconsistent labeling with outcomes, producing predictions operators learn to ignore.
Control Towers That Route, Not Just Display — a control tower is not a dashboard with a bigger screen. It is a decision-routing system.
Conversational AI requires a deterministic data foundation beneath a probabilistic language layer.
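As an illustration of the real-time OEE use case, the standard calculation (availability × performance × quality) can be applied to a short rolling window instead of a full day. The `Window` fields and the 20-minute example figures below are assumptions chosen for the sketch. In practice the inputs would come from the resolved, contextualized data described earlier.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Production counters for one time window on one line (hypothetical shape)."""
    planned_minutes: float
    downtime_minutes: float
    ideal_rate_per_minute: float  # units per minute at ideal speed
    total_units: int
    good_units: int


def oee(w: Window) -> float:
    """OEE = availability * performance * quality, as a fraction between 0 and 1."""
    run_minutes = w.planned_minutes - w.downtime_minutes
    if (w.planned_minutes <= 0 or run_minutes <= 0
            or w.ideal_rate_per_minute <= 0 or w.total_units <= 0):
        return 0.0
    availability = run_minutes / w.planned_minutes
    performance = w.total_units / (run_minutes * w.ideal_rate_per_minute)
    quality = w.good_units / w.total_units
    return availability * performance * quality


def oee_change_points(previous: Window, current: Window) -> float:
    """Change in OEE between two windows, in percentage points."""
    return (oee(current) - oee(previous)) * 100


# Example: a 20-minute window with 4 minutes of filler jams.
last_20 = Window(planned_minutes=20, downtime_minutes=4,
                 ideal_rate_per_minute=100, total_units=1440, good_units=1368)
```

Here availability is 0.80, performance 0.90, and quality 0.95, giving an OEE of about 68.4% for the window. Attributing a drop to a specific cause, such as filler jams, additionally requires the stop events to be normalized and tied to the right asset.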
The Playbook: How to Start
Audit the data landscape — map every source system per plant.
Pick one high-impact line, not twenty plants.
Build the identity layer first — before dashboards, before ML models.
Deploy a secure virtual gateway — validate data accuracy at the edge.
Execute a Data Accuracy Signature — key stakeholders validate that data reflects physical reality.
Launch the first application layer — real-time OEE, stoppage Pareto, shift reporting.
Scale with templates — same architecture, parameterized for the next line, next plant, next country.
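A stoppage Pareto, one of the first applications named in the playbook, reduces to a small aggregation once stop reasons share a common vocabulary. The sketch below assumes stop events arrive as `(reason, minutes)` pairs with already-normalized reason codes. The reason names are hypothetical.

```python
from collections import defaultdict


def stoppage_pareto(stops):
    """Aggregate stop minutes per reason, largest first.

    Returns a list of (reason, minutes, cumulative_share) tuples, where
    cumulative_share is the running fraction of total stop time.
    """
    totals = defaultdict(float)
    for reason, minutes in stops:
        totals[reason] += minutes

    grand_total = sum(totals.values())
    rows, running = [], 0.0
    for reason, minutes in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        running += minutes
        share = running / grand_total if grand_total else 0.0
        rows.append((reason, minutes, share))
    return rows


stops = [
    ("FILLER_JAM", 30),
    ("CHANGEOVER", 15),
    ("FILLER_JAM", 10),
    ("NO_MATERIAL", 5),
]
rows = stoppage_pareto(stops)
```

Because the function depends only on normalized reason codes, the same code can run unchanged for the next line or plant, which is the point of scaling with templates.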