The insight: don't teach agents to resist attacks. Virtualize their perceived reality so attacks never enter their world. Just as a VM never exposes physical RAM to the guest, an agent shouldn't see raw, dangerous inputs.
ARCHITECTURE:
- Input virtualization: strip attacks at the boundary, not after the agent sees them (rough sketch below)
- Provenance tracking: prevents contaminated learning (critical with continuous learning coming in 1-2 years per Amodei)
- Taint propagation: deterministic "physics laws" prevent data exfiltration
- No LLM in the critical path: fully deterministic, testable
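To make the boundary concrete, here is a minimal Python sketch under my own assumptions: VirtualizedInput, virtualize, and INJECTION_PATTERNS are illustrative names, not the repo's actual API, and a real boundary would do far more than match two regexes.

    # Minimal sketch of the virtualization boundary (illustrative names,
    # not the PoC's actual API). Untrusted input is rewritten at the
    # boundary; the agent never perceives the raw bytes.
    import re
    from dataclasses import dataclass

    INJECTION_PATTERNS = [
        re.compile(r"ignore (all )?previous instructions", re.I),
        re.compile(r"reveal (the )?system prompt", re.I),
    ]

    @dataclass(frozen=True)
    class VirtualizedInput:
        text: str              # the only thing the agent ever perceives
        provenance: str        # where the raw data came from
        tainted: bool = True   # untrusted until explicitly marked otherwise

    def virtualize(raw: str, source: str) -> VirtualizedInput:
        """Strip known attack markers deterministically, before the agent sees anything."""
        cleaned = raw
        for pattern in INJECTION_PATTERNS:
            cleaned = pattern.sub("[removed]", cleaned)
        return VirtualizedInput(text=cleaned, provenance=source)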
Working PoC demonstrates:
- Prompt injection prevention (attacks stripped at the virtualization boundary)
- Taint containment (untrusted data can't escape the system)
- Deterministic decisions (same input = same output, always)
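Continuing the hypothetical VirtualizedInput type from the sketch above, the taint "physics laws" and the egress decision can be expressed as pure functions, which is what makes them deterministic and testable. combine and allow_egress are again my own names, not the PoC's API.

    def combine(a: VirtualizedInput, b: VirtualizedInput) -> VirtualizedInput:
        # Physics law: anything derived from tainted data is itself tainted.
        return VirtualizedInput(
            text=a.text + b.text,
            provenance=f"{a.provenance}+{b.provenance}",
            tainted=a.tainted or b.tainted,
        )

    def allow_egress(value: VirtualizedInput) -> bool:
        # Deterministic policy: tainted data never leaves the system.
        # Same input, same decision, every time; no LLM involved.
        return not value.tainted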
CRITICAL TIMING:
- Dario Amodei (Anthropic CEO, Feb 13): continuous learning in 1-2 years [1]
- Problem: memory poisoning + continuous learning = permanent compromise
- Solution: provenance tracking prevents untrusted data from entering the learning loop
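Here is a rough sketch of what a provenance gate in front of a learning/memory store could look like, reusing the hypothetical VirtualizedInput type from above. TRUSTED_SOURCES and LearningStore are assumptions of mine, not the PoC's actual design.

    TRUSTED_SOURCES = {"operator", "signed_config"}

    class LearningStore:
        """Only provenance-checked, untainted records may enter the learning loop."""
        def __init__(self):
            self.records = []

        def ingest(self, item: VirtualizedInput) -> bool:
            if item.tainted or item.provenance not in TRUSTED_SOURCES:
                return False   # rejected: a poisoned web page never becomes a permanent "memory"
            self.records.append(item.text)
            return True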
Research context:
- OpenAI: prompt injection is "unlikely to ever be fully solved" [2]
- Anthropic: a 1% attack success rate (ASR) is a "meaningful risk"
- Academic research: 90-100% bypass rates against published defenses [3]
Seeking feedback on whether ontological security ("does X exist for the agent?") beats permission security ("can the agent do X?") for agent systems.
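To illustrate the question in code (a toy contrast, not taken from the repo): permission security keeps the dangerous tool in the agent's world and checks an ACL at call time; ontological security builds a world in which the tool was never registered, so an injected instruction has nothing to invoke.

    # Permission security: "can the agent do X?" X exists; a check decides.
    def call_tool(agent_role, tool, registry, acl):
        if tool not in acl.get(agent_role, set()):
            raise PermissionError(f"{agent_role} may not use {tool}")
        return registry[tool]()

    # Ontological security: "does X exist?" The agent's registry simply
    # never contains tools outside its virtualized reality.
    def build_virtual_world(agent_role, registry, visible):
        return {name: fn for name, fn in registry.items()
                if name in visible.get(agent_role, set())}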
Practical workarounds are available in the repo for immediate use while the PoC matures.
Disclaimer: Personal project, not Radware-endorsed. References to published work only.
Happy to answer questions!
[1] https://www.dwarkesh.com/p/dario-amodei-2
[2] https://simonwillison.net/2024/Dec/9/openai-prompt-injection...
[3] https://arxiv.org/abs/2310.12815