
The Difference Between 'Working' and 'Reliable'

A system can work in testing and fail in production. The gap is stress, fatigue, and distraction. Reliable means it works when everything else is going wrong.

In testing, the audio system worked perfectly.

On show day, the wireless microphone dropped out during the CEO’s keynote. Nothing had changed — same equipment, same venue, same configuration. But there were 400 phones in the room creating RF interference. That variable wasn’t present during the sound check.

This is the gap between “working” and “reliable.”

What “Working” Means

When you test a system — any system — you’re testing under controlled conditions. The environment is calm. People are focused. There’s time to troubleshoot if something goes wrong.

In these conditions, most things work. The audio routes correctly. The projector displays the slides. The network handles the traffic. The process produces the expected output.

“Working” is the minimum bar. It’s necessary but not sufficient.

What “Reliable” Means

Reliability is a different standard. A reliable system works when:

The operator is tired. They’ve been running audio for twelve hours. Their attention span is gone. They’re working from muscle memory.

The environment is hostile. RF interference. Temperature extremes. Unpredictable loads. Network congestion. Conditions that weren’t present in testing.

Something else is broken. The reliable system keeps functioning while people scramble to fix the other problem. It doesn’t require attention it can’t get.

The user is distracted. They’re not focused on this system — they’re focused on their job. The system needs to work despite inattention.

Reliability means working when circumstances are against you. Not just when everything is perfect.

The Variables That Don’t Get Tested

Here’s what changes between testing and production:

Stress. People under pressure make different errors than people who are calm. Buttons get pressed in the wrong sequence. Steps get skipped. Assumptions go unchecked.

Fatigue. Capacity degrades over time. The configuration that was easy at 9 AM is error-prone at 9 PM. Reliable systems account for diminished human capacity.

Distraction. In testing, the system has everyone’s attention. In production, it has to compete with everything else happening. The operator is answering questions while managing the board. The user is watching the presentation, not the interface.

Environmental variance. Temperature, humidity, RF, electrical noise, network congestion — production environments contain variables that didn’t exist in the lab.

Scale. Testing usually happens with simulated loads, not actual loads. The system works for 10 users. Does it work for 1,000? The video looks fine in preview. Does it look fine when every seat is filled and the HVAC is running full blast?

Edge cases. Testing covers expected scenarios. Production includes the scenarios no one expected. The slide deck with the unusual font. The speaker who brings their own laptop with non-standard video output. The file that’s slightly larger than the buffer.

The Failure Modes

Systems that work but aren’t reliable fail in predictable ways.

Operator error under pressure. The system requires precise inputs, and the operator under stress makes imprecise inputs. The training assumed focused attention that isn’t available during show conditions.

Environmental sensitivity. The system works within a narrow range of conditions. Production pushes it outside that range. Not dramatically — just enough to cause intermittent problems that are hard to diagnose.

Resource exhaustion. The system works until it runs out of something — memory, bandwidth, attention, patience. Testing didn’t run long enough or hard enough to hit the limit.

Hidden dependencies. The system works because something else is working. When that something else fails — or just degrades — the cascade begins. No one documented the dependency because in testing, it was always there.

Recovery failures. The system works until it doesn’t, and then it can’t recover. Rebooting takes too long. The reset procedure requires steps that can’t be performed under show conditions.

What Makes Systems Reliable

Reliable systems share common characteristics.

Margins. They don’t operate at the edge of their capacity. If the spec says the wireless microphone works at 200 feet, it’s used at 100 feet. If the network can handle 1,000 concurrent users, you design for 500. Margins absorb the unexpected.
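The margin rule can be expressed as a one-line capacity check. A minimal sketch, assuming a 50% headroom policy to match the 1,000-user network designed for 500 above; the function name and threshold are illustrative:

```python
def within_margin(expected_load: int, rated_capacity: int, headroom: float = 0.5) -> bool:
    """Return True if the expected load stays inside the design margin.

    headroom=0.5 means we plan to use at most half of rated capacity,
    so the unexpected has somewhere to go.
    """
    return expected_load <= rated_capacity * (1 - headroom)

# A network rated for 1,000 concurrent users, designed for 500:
within_margin(500, 1000)   # inside the margin
within_margin(800, 1000)   # working, but no room for surprises
```

The point of writing it down is that the margin becomes a checkable number instead of a feeling.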

Simplicity. Fewer components means fewer failure points. The clever solution with four integrations is less reliable than the obvious solution with one. Complexity is the enemy of reliability.

Fallbacks. When something fails, there’s a backup. Not documented somewhere — physically present and tested. The spare microphone is on the table. The backup presentation is loaded on a second laptop. The redundant path is already configured.
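The fallback pattern has a direct software analogue: try the pre-configured sources in order and fail only when every one of them fails. A sketch with hypothetical source functions standing in for the real primary and backup:

```python
from typing import Callable, Sequence

def first_working(sources: Sequence[Callable[[], str]]) -> str:
    """Try each pre-configured source in order.

    The backups are in the list before the failure, not improvised after it.
    """
    errors = []
    for source in sources:
        try:
            return source()
        except Exception as exc:
            errors.append(exc)
    raise RuntimeError(f"all sources failed: {errors}")

# Illustrative stand-ins for a real primary and its tested backup:
def primary() -> str:
    raise ConnectionError("wireless mic dropped out")

def backup() -> str:
    return "wired mic"

first_working([primary, backup])  # falls through to "wired mic"
```

Note that the list is built before the show; the code only decides the order, never scrambles for a backup that isn't there.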

Observable state. You can tell what’s happening without opening a panel or running diagnostics. The status lights work. The monitoring actually monitors. Problems announce themselves instead of hiding.
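Observable state can be as simple as a heartbeat with an age check: the system reports OK or STALE on its own, instead of waiting to be inspected. A minimal sketch; the class name and timeout are illustrative:

```python
import time

class LinkMonitor:
    """Report link health from heartbeat age, so problems announce themselves."""

    def __init__(self, timeout_s: float = 5.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = time.monotonic()

    def heartbeat(self) -> None:
        """Called by the monitored system whenever it is alive."""
        self.last_heartbeat = time.monotonic()

    def status(self) -> str:
        age = time.monotonic() - self.last_heartbeat
        return "OK" if age < self.timeout_s else "STALE"

mon = LinkMonitor(timeout_s=5.0)
print(mon.status())  # "OK" right after a heartbeat; "STALE" once it ages out
```

The design choice is that staleness is computed, not set: a crashed sender can't report itself healthy.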

Graceful degradation. When something fails, the system gets worse but doesn’t stop working. Audio drops to mono instead of cutting out. Video switches to a lower resolution instead of freezing. The fallback is automatic, not manual.
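The video example above can be sketched as an automatic quality ladder: each bandwidth tier maps to a worse but working mode, and only the floor is audio-only. The thresholds are illustrative, not real codec limits:

```python
def stream_mode(bandwidth_kbps: int) -> str:
    """Pick the best mode the link can sustain; degrade, don't stop."""
    if bandwidth_kbps >= 5000:
        return "1080p"
    if bandwidth_kbps >= 1500:
        return "720p"
    if bandwidth_kbps >= 400:
        return "480p"
    return "audio-only"  # worse, but still working

stream_mode(6000)  # full quality
stream_mode(800)   # degraded video
stream_mode(100)   # last resort, never a frozen frame
```

Because the choice is a pure function of measured bandwidth, the fallback happens automatically on every check, with no operator in the loop.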

Recovery procedures. When things break, there’s a documented way to get back to functional. The procedure is written for someone who’s tired and stressed, not someone reading a manual in a quiet room.

The Testing Principle

The test that matters isn’t whether the system works. It’s whether it works when you’re tired, distracted, and something else is already broken.

This means testing under realistic conditions. Not just during the scheduled sound check when everyone is fresh. Also during the load-in chaos. Also after eight hours. Also when the network is congested with everyone’s phones.

It means testing recovery, not just operation. Can you restart the system in under two minutes? Can the backup be activated without documentation? Do people know what to do when the primary fails?
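The two-minute question can be made an executable check rather than a hope. A sketch with a simulated restart standing in for the real procedure; the budget constant and function names are assumptions:

```python
import time

RECOVERY_BUDGET_S = 120  # "under two minutes"

def restart_system() -> bool:
    """Stand-in for the real restart procedure; replace with your own."""
    time.sleep(0.01)  # simulated restart work
    return True

def test_recovery_within_budget() -> None:
    start = time.monotonic()
    assert restart_system(), "system did not come back"
    elapsed = time.monotonic() - start
    assert elapsed < RECOVERY_BUDGET_S, f"recovery took {elapsed:.1f}s"

test_recovery_within_budget()
```

Run against the real procedure, this test fails the first time recovery quietly drifts past the budget, instead of on show day.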

It means testing people, not just equipment. Does the operator know the system well enough to diagnose problems by ear? Have they practiced the failure scenarios? Is the knowledge distributed or concentrated in one person?

The Standard

“Working” is table stakes. Any system can work under ideal conditions.

“Reliable” is the standard that matters. The system that works when the operator is exhausted. When the environment is hostile. When something else is demanding attention. When the conditions aren’t what you planned for.

Reliable means it works when everything else is going wrong.

That’s the difference.


Ivan Boban

Systems Architect
