The Difference Between 'Working' and 'Reliable'
A system can work in testing and fail in production. The gap is stress, fatigue, and distraction. Reliable means it works when everything else is going wrong.
In testing, the audio system worked perfectly.
On show day, the wireless microphone dropped out during the CEO’s keynote. Nothing in the setup had changed: same equipment, same venue, same configuration. But 400 phones in the room were creating RF interference, a variable that wasn’t present during the sound check.
This is the gap between “working” and “reliable.”
What “Working” Means
When you test a system — any system — you’re testing under controlled conditions. The environment is calm. People are focused. There’s time to troubleshoot if something goes wrong.
In these conditions, most things work. The audio routes correctly. The projector displays the slides. The network handles the traffic. The process produces the expected output.
“Working” is the minimum bar. It’s necessary but not sufficient.
What “Reliable” Means
Reliability is a different standard. A reliable system works when:
The operator is tired. They’ve been running audio for twelve hours. Their attention span is gone. They’re working from muscle memory.
The environment is hostile. RF interference. Temperature extremes. Unpredictable loads. Network congestion. Conditions that weren’t present in testing.
Something else is broken. The reliable system keeps functioning while people scramble to fix the other problem. It doesn’t require attention it can’t get.
The user is distracted. They’re not focused on this system — they’re focused on their job. The system needs to work despite inattention.
Reliability means working when circumstances are against you. Not just when everything is perfect.
The Variables That Don’t Get Tested
Here’s what changes between testing and production:
Stress. People under pressure make different errors than people who are calm. Buttons get pressed in the wrong sequence. Steps get skipped. Assumptions go unchecked.
Fatigue. Capacity degrades over time. The configuration that was easy at 9 AM is error-prone at 9 PM. Reliable systems account for diminished human capacity.
Distraction. In testing, the system has everyone’s attention. In production, it has to compete with everything else happening. The operator is answering questions while managing the board. The user is watching the presentation, not the interface.
Environmental variance. Temperature, humidity, RF, electrical noise, network congestion — production environments contain variables that didn’t exist in the lab.
Scale. Testing usually happens with scaled-down loads, not actual loads. The system works for 10 users. Does it work for 1,000? The audio sounds fine in the empty room. Does it sound fine when every seat is filled and the HVAC is running full blast?
Edge cases. Testing covers expected scenarios. Production includes the scenarios no one expected. The slide deck with the unusual font. The speaker who brings their own laptop with non-standard video output. The file that’s slightly larger than the buffer.
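The buffer example lends itself to a small boundary probe. The sketch below is only an illustration: `boundary_sizes`, `probe_limit`, and `fixed_buffer_handler` are hypothetical names, and the 1024-byte capacity is an assumed figure, not something taken from a real system. It feeds a component payloads just under, at, and just over its stated limit and records what happens at each size.

```python
from typing import Callable, Dict, List


def boundary_sizes(limit: int) -> List[int]:
    """Return input sizes clustered around a capacity limit."""
    candidates = [0, 1, limit - 1, limit, limit + 1, limit * 2]
    # Drop negatives and duplicates (possible for tiny limits); keep sorted order.
    return sorted({size for size in candidates if size >= 0})


def probe_limit(handler: Callable[[bytes], None], limit: int) -> Dict[int, str]:
    """Feed a payload of each boundary size to handler and record the outcome."""
    results: Dict[int, str] = {}
    for size in boundary_sizes(limit):
        try:
            handler(bytes(size))  # zero-filled payload of the given length
            results[size] = "ok"
        except Exception as exc:  # record the failure instead of stopping the probe
            results[size] = f"failed: {type(exc).__name__}"
    return results


def fixed_buffer_handler(payload: bytes, capacity: int = 1024) -> None:
    """Hypothetical stand-in for a component with a hard 1024-byte buffer."""
    if len(payload) > capacity:
        raise OverflowError("payload exceeds buffer")


if __name__ == "__main__":
    for size, outcome in probe_limit(fixed_buffer_handler, 1024).items():
        print(size, outcome)
```

A probe like this only covers the limits someone thought to write down. It does not replace testing with the speaker's actual laptop or the actual slide deck.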
The Failure Modes
Systems that work but aren’t reliable fail in predictable ways.
Operator error under pressure. The system requires precise inputs, and the operator under stress makes imprecise inputs. The training assumed focused attention that isn’t available during show conditions.
Environmental sensitivity. The system works within a narrow range of conditions. Production pushes it outside that range. Not dramatically — just enough to cause intermittent problems that are hard to diagnose.
Resource exhaustion. The system works until it runs out of something — memory, bandwidth, attention, patience. Testing didn’t run long enough or hard enough to hit the limit.
Hidden dependencies. The system works because something else is working. When that something else fails — or just degrades — the cascade begins. No one documented the dependency because in testing, it was always there.
Recovery failures. The system works until it doesn’t, and then it can’t recover. Rebooting takes too long. The reset procedure requires steps that can’t be performed under show conditions.
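Resource exhaustion in particular can often be spotted before it happens. As a rough sketch, assuming you can sample some usage figure (memory, disk, queue depth) at known times, a straight-line fit gives an estimate of how long the system has left. The function name `hours_until_exhausted` and the linear-growth assumption are illustrative choices. Real usage may grow in steps or bursts.

```python
from typing import Optional, Sequence, Tuple


def hours_until_exhausted(
    samples: Sequence[Tuple[float, float]], limit: float
) -> Optional[float]:
    """Estimate hours remaining until usage reaches limit.

    samples holds (elapsed_hours, usage) pairs in time order. Returns None
    when usage is flat or falling, since a straight-line fit then never
    reaches the limit.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a trend")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        raise ValueError("samples must cover more than one point in time")
    # Ordinary least-squares slope: usage gained per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / spread
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    # Project from the fitted value at the most recent sample time.
    fitted_now = intercept + slope * samples[-1][0]
    return max(0.0, (limit - fitted_now) / slope)
```

For example, with hypothetical readings of 200, 250, 300, and 350 MB at hours 0 through 3 and a 1,000 MB limit, the fit gives 50 MB per hour and 13 hours remaining. A one-hour test would never see that limit. A multi-day run would.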
What Makes Systems Reliable
Reliable systems share common characteristics.
Margins. They don’t operate at the edge of their capacity. If the spec says the wireless microphone works at 200 feet, use it at 100. If the network can handle 1,000 concurrent users, design for 500. Margins absorb the unexpected.
Simplicity. Fewer components means fewer failure points. The clever solution with four integrations is less reliable than the obvious solution with one. Complexity is the enemy of reliability.
Fallbacks. When something fails, there’s a backup. Not documented somewhere — physically present and tested. The spare microphone is on the table. The backup presentation is loaded on a second laptop. The redundant path is already configured.
Observable state. You can tell what’s happening without opening a panel or running diagnostics. The status lights work. The monitoring actually monitors. Problems announce themselves instead of hiding.
Graceful degradation. When something fails, the system gets worse but doesn’t stop working. Audio drops to mono instead of cutting out. Video switches to a lower resolution instead of freezing. The fallback is automatic, not manual.
Recovery procedures. When things break, there’s a documented way to get back to functional. The procedure is written for someone who’s tired and stressed, not someone reading a manual in a quiet room.
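Two of these characteristics, margins and graceful degradation, are simple enough to express in a few lines. This is a minimal sketch with hypothetical names (`within_margin`, `best_available_mode`). The 50% default margin mirrors the 200-feet and 1,000-user examples above and is not a universal rule.

```python
from typing import Callable, Optional, Sequence


def within_margin(rated: float, planned: float, margin: float = 0.5) -> bool:
    """True when planned use stays at or below margin * rated capacity."""
    if not 0 < margin <= 1:
        raise ValueError("margin must be in the range (0, 1]")
    return planned <= rated * margin


def best_available_mode(
    modes: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first healthy mode from a best-to-worst list, or None."""
    for mode in modes:
        if is_healthy(mode):
            return mode
    return None


if __name__ == "__main__":
    print(within_margin(rated=200, planned=100))    # mic rated 200 ft, used at 100 ft
    print(within_margin(rated=1000, planned=800))   # network planned too close to its limit
    # Assume the stereo path has failed; the selector falls back to mono.
    print(best_available_mode(["stereo", "mono"], lambda mode: mode != "stereo"))
```

The selector runs without operator input, which is the point: the fallback happens automatically, and a `None` result is the signal that a manual recovery procedure is needed.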
The Testing Principle
The test that matters isn’t whether the system works. It’s whether it works when you’re tired, distracted, and something else is already broken.
This means testing under realistic conditions. Not just during the scheduled sound check when everyone is fresh. Also during the load-in chaos. Also after eight hours. Also when the network is congested with everyone’s phones.
It means testing recovery, not just operation. Can you restart the system in under two minutes? Can the backup be activated without documentation? Do people know what to do when the primary fails?
It means testing people, not just equipment. Does the operator know the system well enough to diagnose problems by ear? Have they practiced the failure scenarios? Is the knowledge distributed or concentrated in one person?
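The two-minute restart question can be turned into a repeatable drill. The sketch below assumes you can supply two callables of your own: `restart`, which triggers the restart, and `is_healthy`, which reports whether the system is usable again. Both are placeholders, as is the function name `timed_recovery`. The clock and sleep functions are parameters only so the drill can be tested without waiting.

```python
import time
from typing import Callable


def timed_recovery(
    restart: Callable[[], None],
    is_healthy: Callable[[], bool],
    budget_s: float = 120.0,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Restart the system and return the seconds until it reports healthy.

    Raises TimeoutError if it is still unhealthy once budget_s has elapsed.
    """
    start = clock()
    restart()
    while True:
        healthy = is_healthy()
        elapsed = clock() - start
        if healthy:
            return elapsed
        if elapsed >= budget_s:
            raise TimeoutError(
                f"not healthy after {elapsed:.0f}s (budget {budget_s:.0f}s)"
            )
        sleep(poll_s)
```

Run the drill at the end of a long day as well as at the start, and log the result each time. A recovery time that creeps upward is worth knowing about before show day.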
The Standard
“Working” is table stakes. Any system can work under ideal conditions.
“Reliable” is the standard that matters. The system that works when the operator is exhausted. When the environment is hostile. When something else is demanding attention. When the conditions aren’t what you planned for.
Reliable means it works when everything else is going wrong.
That’s the difference.