The problem
In production distributed systems, security often breaks when things are only half-working:
auth services degrade → retries explode
fallback paths widen access
recovery logic becomes the attack surface
Nothing is “exploited”, yet the system becomes unsafe.
Most security models assume stable components and clean failures. Real systems don’t behave that way.
Design assumptions
We assume:
failures are correlated
retries are adversarial
timeouts are unsafe defaults
recovery paths matter as much as steady-state logic
We don’t assume:
global consistency
perfect identity
reliable clocks
centralized enforcement
Framework ideas (high level)
This work explores four ideas:
1. Failure-aware trust
Trust degrades under failure, not just compromise
Access narrows automatically during partial outages
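A minimal sketch of what this could look like, assuming a coarse trust scale and a couple of hypothetical health signals (the names HealthSignals, trustFromHealth, and allowed are ours, not an established API): trust is computed from observed failure signals, and the set of allowed operations narrows as trust drops.

```go
package main

import "fmt"

// TrustLevel is a coarse, hypothetical scale; a real policy would be richer.
type TrustLevel int

const (
	TrustFull TrustLevel = iota
	TrustDegraded
	TrustMinimal
)

// HealthSignals is an assumed shape for what we can observe about dependencies.
type HealthSignals struct {
	AuthErrorRate float64 // fraction of failed auth-service calls in the last window
	PartitionSeen bool    // a network partition has been detected
}

// trustFromHealth degrades trust as failure signals accumulate,
// not only on evidence of compromise.
func trustFromHealth(h HealthSignals) TrustLevel {
	switch {
	case h.PartitionSeen || h.AuthErrorRate > 0.5:
		return TrustMinimal
	case h.AuthErrorRate > 0.05:
		return TrustDegraded
	default:
		return TrustFull
	}
}

// allowed narrows access automatically: writes need full trust,
// reads survive degradation, and only health checks pass at minimal trust.
func allowed(level TrustLevel, op string) bool {
	switch level {
	case TrustFull:
		return true
	case TrustDegraded:
		return op == "read" || op == "healthcheck"
	default:
		return op == "healthcheck"
	}
}

func main() {
	level := trustFromHealth(HealthSignals{AuthErrorRate: 0.2})
	fmt.Println(level, allowed(level, "write"), allowed(level, "read")) // degraded: writes blocked, reads allowed
}
```

The point is that the narrowing is automatic and driven by failure signals, not by someone flipping a switch mid-incident.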
2. Security invariants at runtime
Invariants are continuously enforced
Violations trigger containment, not alerts
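As a sketch, an invariant can be a predicate over whatever state the enforcement point can observe, re-checked on a loop; the Invariant, State, and contain names below are hypothetical placeholders, and contain stands in for a real action such as revoking credentials or closing a fallback path.

```go
package main

import (
	"fmt"
	"time"
)

// State is an assumed snapshot of what the enforcement point can observe.
type State struct {
	ActiveSessions  int
	FallbackEnabled bool
	AuthHealthy     bool
}

// Invariant is a predicate that must hold at all times, not only at login.
type Invariant struct {
	Name  string
	Holds func(State) bool
}

// contain is a placeholder for a real containment action
// (revoking credentials, closing a fallback path, fencing a node).
func contain(inv Invariant) {
	fmt.Printf("containment triggered: invariant %q violated\n", inv.Name)
}

// enforce continuously re-checks invariants; a violation triggers
// containment rather than only emitting an alert.
func enforce(invariants []Invariant, snapshot func() State, interval time.Duration, rounds int) {
	for i := 0; i < rounds; i++ {
		s := snapshot()
		for _, inv := range invariants {
			if !inv.Holds(s) {
				contain(inv)
			}
		}
		time.Sleep(interval)
	}
}

func main() {
	invariants := []Invariant{
		{Name: "no fallback while auth is unhealthy", Holds: func(s State) bool {
			return !(s.FallbackEnabled && !s.AuthHealthy)
		}},
	}
	snapshot := func() State { return State{ActiveSessions: 3, FallbackEnabled: true, AuthHealthy: false} }
	enforce(invariants, snapshot, 10*time.Millisecond, 2)
}
```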
3. Retry-safe security primitives
Operations are idempotent, monotonic, and bounded in their side effects
Retries can’t escalate privilege
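One way to sketch a retry-safe grant primitive (the Issuer type and Issue method are illustrative, not a real library): grants are keyed by request ID so retries are idempotent, and a retried request can only keep or narrow the granted scope, never widen it.

```go
package main

import (
	"fmt"
	"sync"
)

// Grant is an access grant with an explicit scope.
type Grant struct {
	RequestID string
	Scopes    []string
}

// Issuer hands out grants in a retry-safe way:
//   - idempotent: the same RequestID always maps to the same grant
//   - monotonic: repeated calls can only keep or narrow scope, never widen it
//   - side-effect bounded: at most one stored grant per RequestID
type Issuer struct {
	mu     sync.Mutex
	grants map[string]Grant
}

func NewIssuer() *Issuer {
	return &Issuer{grants: make(map[string]Grant)}
}

// intersect keeps only the scopes present in both slices.
func intersect(a, b []string) []string {
	set := make(map[string]bool, len(a))
	for _, s := range a {
		set[s] = true
	}
	var out []string
	for _, s := range b {
		if set[s] {
			out = append(out, s)
		}
	}
	return out
}

// Issue returns the stored grant on retry; a retry that asks for more
// scope gets the intersection, so it can never escalate privilege.
// (A real issuer would also policy-check the first request.)
func (i *Issuer) Issue(requestID string, requested []string) Grant {
	i.mu.Lock()
	defer i.mu.Unlock()
	if g, ok := i.grants[requestID]; ok {
		g.Scopes = intersect(g.Scopes, requested)
		i.grants[requestID] = g
		return g
	}
	g := Grant{RequestID: requestID, Scopes: requested}
	i.grants[requestID] = g
	return g
}

func main() {
	iss := NewIssuer()
	fmt.Println(iss.Issue("req-1", []string{"read"}))           // first call
	fmt.Println(iss.Issue("req-1", []string{"read", "write"})) // retry cannot widen scope
}
```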
4. Security as observable state
Trust level, degradation, and containment are visible
If you can’t observe it, you can’t secure it
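A minimal sketch, assuming the service can publish its security posture the same way it publishes health: a small struct of trust level, degradation reasons, and active containments, served from a hypothetical /securityz endpoint (the SecurityState and Store names are ours).

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
)

// SecurityState is the assumed set of security facts we want operators
// (and other services) to be able to observe directly.
type SecurityState struct {
	TrustLevel         string   `json:"trust_level"` // e.g. "full", "degraded", "minimal"
	DegradationReasons []string `json:"degradation_reasons"`
	ActiveContainments []string `json:"active_containments"`
}

// Store holds the current state behind a lock so handlers and the
// enforcement logic can share it safely.
type Store struct {
	mu    sync.RWMutex
	state SecurityState
}

func (s *Store) Set(state SecurityState) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state = state
}

// Handler exposes the current security state as JSON, alongside
// whatever normal health endpoints the service already has.
func (s *Store) Handler(w http.ResponseWriter, r *http.Request) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(s.state)
}

func main() {
	store := &Store{}
	store.Set(SecurityState{
		TrustLevel:         "degraded",
		DegradationReasons: []string{"auth error rate above threshold"},
		ActiveContainments: []string{"fallback path disabled"},
	})
	http.HandleFunc("/securityz", store.Handler)
	_ = http.ListenAndServe(":8080", nil) // blocks; sketch only
}
```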
What this is not
Not zero-trust marketing
Not a compliance exercise
Not a finished system
It’s an attempt to treat failure as the normal case, not the exception.
Why publish this early?
Because many real failures:
don’t fit clean research papers
happen during incidents, not attacks
are invisible outside production systems
We’re sharing design notes to get feedback before formalizing or evaluating further.
Feedback welcome
If you’ve seen security regressions during outages, or retries causing unsafe behavior, we’d like to hear about it.
This is ongoing work. No claims of novelty or completeness.