Every prompt passes through three detection layers before it reaches the LLM.

--- detection pipeline -------------------------------------------------

NodeGuarder inspects each prompt in three stages:

1. Regex (built-in) — 3 categories, ~15 patterns
    Matches API keys, database connection strings, and PII
    with specific redaction tags ([REDACTED_AWS_KEY], etc.).
    Runtime: ~2 ms.

2. ATR Community Rules — 652 rules across 7 categories
    Detects prompt injection, code execution, social engineering,
    skill compromise, excessive autonomy, model abuse, data poisoning.
    Rules auto-update from the community registry every 7 days.

3. Semantic Verification — DeBERTa-v3 (184M params)
    An ONNX model that confirms each flag before action is taken.
    The false-positive marker system overturns flags raised on
    documentation examples, tutorial code, placeholders, and
    security discussions.

If a flag survives all three checks, the configured action mode takes effect (modal, auto-redact, or auto-block).
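
The three stages can be pictured as two detectors feeding one verifier. The following is a minimal sketch of that flow only, not NodeGuarder's actual code: the `Flag` type, the function names, the two toy patterns, and the keyword-based verifier are all assumptions made for illustration (the real stage 3 is the DeBERTa-v3 ONNX model).

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Flag:
    source: str    # which stage raised the flag (hypothetical labels)
    category: str  # e.g. "api_keys" or "injection"
    match: str     # the text that triggered the flag


Detector = Callable[[str], List[Flag]]
Verifier = Callable[[str, Flag], bool]


def builtin_regex(prompt: str) -> List[Flag]:
    # Stand-in for stage 1: a single simplified AWS access key pattern.
    return [Flag("regex", "api_keys", m.group(0))
            for m in re.finditer(r"AKIA[0-9A-Z]{16}", prompt)]


def atr_rules(prompt: str) -> List[Flag]:
    # Stand-in for stage 2: one toy injection pattern, not a real ATR rule.
    pattern = r"ignore (?:all )?previous instructions"
    return [Flag("atr", "injection", m.group(0))
            for m in re.finditer(pattern, prompt, re.IGNORECASE)]


def toy_verifier(prompt: str, flag: Flag) -> bool:
    # Stand-in for stage 3: rejects every flag if the prompt says "tutorial".
    return "tutorial" not in prompt.lower()


def run_pipeline(prompt: str, detectors: List[Detector], verify: Verifier) -> List[Flag]:
    """Collect flags from each detector and keep those the verifier confirms."""
    return [flag
            for detect in detectors
            for flag in detect(prompt)
            if verify(prompt, flag)]
```

An empty result means the prompt goes to the LLM untouched; any surviving flag would be handed to the configured action mode.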

--- built-in categories (regex) ---------------------------------------

* api_keys: AWS (AKIA...), GitHub (ghp_...), Stripe (sk_live_/pk_live_), generic secrets
* db_credentials: MongoDB, MySQL, PostgreSQL, Redis connection strings
* pii: Email addresses, SSNs (XXX-XX-XXXX), credit card numbers
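
To make the tag-based redaction concrete, here is a rough sketch of how a few of these categories could be matched and replaced. The patterns are simplified guesses, not the built-in set, and only the `[REDACTED_AWS_KEY]` tag comes from this document; the other tag names and the `redact` helper are hypothetical.

```python
import re

# Hypothetical, simplified patterns; the real built-in set is not shown here.
# Order matters: connection strings are redacted before emails so that
# "user:pass@host" inside a URI is not reported as an email address.
PATTERNS = [
    ("REDACTED_AWS_KEY", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("REDACTED_GITHUB_TOKEN", re.compile(r"\bghp_[A-Za-z0-9]{36}\b")),
    ("REDACTED_DB_URI", re.compile(
        r"\b(?:mongodb(?:\+srv)?|mysql|postgres(?:ql)?|redis)://\S+")),
    ("REDACTED_SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("REDACTED_EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
]


def redact(text):
    """Replace every match with its tag; return the new text and the tags used."""
    hits = []
    for tag, pattern in PATTERNS:
        text, count = pattern.subn(f"[{tag}]", text)
        if count:
            hits.append(tag)
    return text, hits
```

Calling `redact(prompt)` returns the rewritten prompt together with the list of tags that fired, which is the kind of information an audit log entry would need.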

--- atr community rules ------------------------------------------------

The ATR (Agent Threat Rules) community maintains 652+ regex-based rules covering 7 categories of agentic threats. NodeGuarder ships the full set and refreshes it automatically from the community registry every 7 days.

* injection (219 rules): prompt injection, jailbreaks, system prompt overrides
* code_execution (211 rules): shell commands, eval abuse, reverse shells
* social_engineering (106 rules): goal hijacking, authority escalation, consent bypass
* skill_compromise (41 rules): supply chain attacks, skill impersonation, hidden capabilities
* model_abuse (39 rules): model extraction, malicious fine-tuning, security boundary violations
* excessive_autonomy (30 rules): runaway loops, resource exhaustion, unauthorized agent actions
* data_poisoning (6 rules): training data contamination, memory manipulation

Total: 652 rules

Each rule includes a severity level (critical, high, medium), a human-readable title, and one or more regex patterns scoped to specific message fields (user input, tool response, tool args, content).
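
One way to picture a rule with field-scoped patterns is the sketch below. It is an illustration under assumptions, not the ATR schema: the `Rule` class, the `rule_id` values, the field keys (`user_input`, `tool_args`, and so on), and both demo patterns are made up for this example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Lower rank sorts first, so critical hits are reported before medium ones.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2}


@dataclass(frozen=True)
class Rule:
    rule_id: str    # hypothetical identifier
    title: str      # human-readable title
    severity: str   # "critical", "high", or "medium"
    category: str   # one of the 7 ATR categories
    patterns: Tuple[Tuple[str, re.Pattern], ...]  # (field name, compiled regex)


def match_rules(message: Dict[str, str], rules: List[Rule]) -> List[Rule]:
    """Return rules with at least one pattern matching its own scoped field."""
    hits = [
        rule for rule in rules
        if any(field in message and rx.search(message[field])
               for field, rx in rule.patterns)
    ]
    return sorted(hits, key=lambda rule: SEVERITY_RANK[rule.severity])


# Two toy rules for demonstration only; they are not taken from the ATR set.
DEMO_RULES = [
    Rule("demo-001", "System prompt override attempt", "high", "injection",
         (("user_input",
           re.compile(r"ignore (?:all )?previous instructions", re.IGNORECASE)),)),
    Rule("demo-002", "Reverse shell in tool arguments", "critical", "code_execution",
         (("tool_args", re.compile(r"bash -i >& /dev/tcp/")),)),
]
```

Because each pattern is tied to a field, the same text that triggers a rule in `user_input` is ignored when it appears in a field the rule does not cover.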

--- false-positive marker system ---------------------------------------

To avoid interrupting legitimate workflows, NodeGuarder includes a two-level false-positive overturn system:

Level 1 — Strong markers (always override without model):
    Documentation examples, placeholder values (localhost, 127.0.0.1,
    password123, ****), tutorial markers (tutorial, guide, quick start).

Level 2 — Weak markers (gated by DeBERTa confidence):
    Code documentation ("npm install", "in your terminal"),
    security discussions (CVE-, vulnerability, educational purposes),
    creative context (story, novel, roleplay), code review (TODO, FIXME).

When a false positive is detected, the audit log records detection_method: "FP_OVERTURN" and the prompt passes through unmodified.
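
The two-level decision can be sketched roughly as follows. Several things here are assumptions: the model score is treated as the confidence that the flag is benign, the 0.8 threshold is invented, markers are matched as plain case-insensitive substrings, and the function names are hypothetical. Only the marker strings and the `FP_OVERTURN` value come from the text above.

```python
import re
from typing import Optional

# Level 1: strong markers overturn a flag without consulting the model.
STRONG_MARKERS = re.compile(
    r"localhost|127\.0\.0\.1|password123|\*{4}|tutorial|guide|quick start",
    re.IGNORECASE,
)

# Level 2: weak markers only count when the model agrees.
WEAK_MARKERS = re.compile(
    r"npm install|in your terminal|CVE-|vulnerability|educational purposes"
    r"|story|novel|roleplay|TODO|FIXME",
    re.IGNORECASE,
)

BENIGN_THRESHOLD = 0.8  # assumed value; the real gate is not documented here


def should_overturn(prompt: str, benign_confidence: float) -> bool:
    """Decide whether a flag on this prompt is treated as a false positive."""
    if STRONG_MARKERS.search(prompt):
        return True
    if WEAK_MARKERS.search(prompt):
        return benign_confidence >= BENIGN_THRESHOLD
    return False


def audit_method(prompt: str, benign_confidence: float) -> Optional[str]:
    """Return the detection_method to log for an overturn, else None."""
    return "FP_OVERTURN" if should_overturn(prompt, benign_confidence) else None
```

The asymmetry is the point of the design: a strong marker is enough on its own, while a weak marker is only a hint that still needs the model's agreement.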

--- action modes -------------------------------------------------------

When a prompt is flagged, NodeGuarder follows the configured action mode:

Mode             Behavior
permissive       Show modal; user chooses Allow, Redact, or Block
enforced_redact  Show modal with Redact/Block only (no Allow)
enforced_block   Show modal with Block only
auto_redact      No modal; auto-redact and continue
auto_block       No modal; return 403 Forbidden

The default mode is permissive. When NodeGuarder is enrolled in an enterprise portal, the mode auto-escalates to enforced_redact.

The human-in-the-loop (HITL) modal shows a 15-second countdown. On timeout, text is auto-redacted and attachments are auto-blocked.
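
The mode table and the timeout rule can be combined into a small decision function. This is a sketch under assumptions: the `Mode` enum, the function names, and the string outcomes are hypothetical; it assumes enrollment only escalates the permissive default and leaves stricter modes alone; and it applies the timeout rule to every modal mode as written, since timeout behavior for enforced_block is not spelled out.

```python
from enum import Enum
from typing import Optional


class Mode(str, Enum):
    PERMISSIVE = "permissive"
    ENFORCED_REDACT = "enforced_redact"
    ENFORCED_BLOCK = "enforced_block"
    AUTO_REDACT = "auto_redact"
    AUTO_BLOCK = "auto_block"


# Buttons offered by the modal in each interactive mode.
MODAL_CHOICES = {
    Mode.PERMISSIVE: {"allow", "redact", "block"},
    Mode.ENFORCED_REDACT: {"redact", "block"},
    Mode.ENFORCED_BLOCK: {"block"},
}


def effective_mode(configured: Mode, enterprise_enrolled: bool) -> Mode:
    # Assumption: only the permissive default is escalated on enrollment.
    if enterprise_enrolled and configured is Mode.PERMISSIVE:
        return Mode.ENFORCED_REDACT
    return configured


def resolve(mode: Mode, user_choice: Optional[str], is_attachment: bool = False) -> str:
    """Return "allow", "redact", or "block" for a flagged prompt."""
    if mode is Mode.AUTO_REDACT:
        return "redact"
    if mode is Mode.AUTO_BLOCK:
        return "block"  # the caller would answer with 403 Forbidden
    if user_choice is None:
        # The 15-second countdown expired without a choice.
        return "block" if is_attachment else "redact"
    if user_choice not in MODAL_CHOICES[mode]:
        raise ValueError(f"{user_choice!r} is not offered in {mode.value} mode")
    return user_choice
```

For example, `resolve(effective_mode(Mode.PERMISSIVE, True), "allow")` raises `ValueError`, because an enrolled installation no longer offers Allow.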