Address Daria's review: tree-sitter, path matching, MCP injection

Changes based on reviewer feedback:

- Promote tree-sitter-bash to primary parser recommendation (over bash
  Hex package). Battle-tested grammar, robust against adversarial input.
- Fix path matching: directory-boundary matching instead of string prefix.
  reads_file("~/.ssh") no longer matches ~/.ssh_backup/key.
- Tilde expansion uses HOME from hook payload, not daemon environment.
  Correct for containers, remote SSH.
- MCP parameter injection: replaced regex with McpParameterInjection
  validator that runs string values through BashAnalyzer for consistency.
- Regex safety: add 1ms evaluation timeout to prevent catastrophic
  backtracking. Timed-out regex falls through to AST pass.
- Document that regex rules can only deny/suspect, never allow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Flo
2026-03-30 10:31:44 +02:00
parent 8be22dac33
commit a133271a6c

View File

@@ -178,7 +178,8 @@ security-hooks/
│ │ │ └── validators/
│ │ │ ├── unknown_executable.ex
│ │ │ ├── dependency_mutation.ex
│ │ │ ── secret_access.ex
│ │ │ ── secret_access.ex
│ │ │ └── mcp_parameter_injection.ex
│ │ └── security_hooks.ex
│ ├── mix.exs
│ └── test/
@@ -374,7 +375,9 @@ Matches when a path appears in any of these AST positions:
- Source/dot commands: `source ~/.bashrc`, `. ~/.profile`
- Here-string file references: `command <<< $(cat ~/.ssh/id_rsa)`
Paths are prefix-matched: `reads_file("~/.ssh")` matches `~/.ssh/id_rsa`, `~/.ssh/config`, etc. Tilde expansion is applied before matching.
Paths are matched by **directory boundary**, not string prefix: `reads_file("~/.ssh")` matches `~/.ssh/id_rsa` and `~/.ssh/config`, but does not match `~/.ssh_backup/key` or `~/.sshrc`. The rule path is treated as a directory — the candidate path must either equal it exactly or have it as an ancestor with a `/` separator.
Tilde expansion uses the `HOME` value from the hook payload (the user's environment), not the daemon's process environment. This ensures correct resolution in containerized or remote SSH scenarios.
#### `writes_file` semantics
@@ -383,7 +386,7 @@ Matches when a path appears in any of these AST positions:
- Command arguments to known file-writing commands: `tee /etc/hosts`, `cp src /etc/`
- The `dd` `of=` argument: `dd of=/dev/sda`
Same prefix-matching and tilde expansion as `reads_file`.
Same directory-boundary matching and tilde expansion as `reads_file`.
#### `sets_env` semantics
@@ -471,7 +474,9 @@ For bash rules, `{base_command}` is extracted as the first whitespace-delimited
Rules are evaluated in **two passes**, grouped by matching strategy. First match within either pass wins.
**Pass 1: Regex rules** — all regex-based rules are checked in file order (fast, microsecond-level). If any matches, the verdict is returned immediately and the AST parser is never invoked.
**Pass 1: Regex rules** — all regex-based rules are checked in file order (fast, microsecond-level). If any matches, the verdict is returned immediately and the AST parser is never invoked. Note: regex rules can only deny or flag as suspicious — they never produce an `allow` verdict. A regex false positive blocks a safe command (annoying but not a security hole); it can never greenlight a dangerous one.
All regex patterns are compiled with a 1ms evaluation timeout to prevent catastrophic backtracking from becoming a denial-of-service vector. If a regex times out, the rule is skipped (not matched), and the command falls through to the AST pass.
**Pass 2: AST rules** — if no regex rule matched, the command is parsed into an AST (once, cached for the request). All AST-based rules are then checked in file order against the parsed tree.
@@ -600,11 +605,7 @@ MCP rules match against `tool_name` (format: `mcp__servername__toolname`). The s
```
# Tier: block — injection patterns in MCP tool parameters
block "mcp-parameter-injection"
match_any
\$\(.*\)
`.*`
;\s*\w+
\|\s*\w+
validator SecurityHooks.Validators.McpParameterInjection
nudge "MCP tool parameters contain shell injection patterns"
# Tier: suspicious — external resource access
@@ -925,6 +926,10 @@ defmodule SecurityHooks.Validators.DependencyMutation do
end
```
### `McpParameterInjection` validator
Scans all string values in MCP `tool_input` for shell injection patterns. Rather than using regex on serialized JSON (which is fragile), it iterates over parameter values and runs any string that looks like it could be shell-executed through the BashAnalyzer. If the AST contains command substitutions, pipes, or semicolons, it denies. This gives MCP parameter checking the same evasion resistance as bash command checking.
## Bash Parser Strategy
The AST matching layer requires a bash parser that produces a traversable tree of command nodes. This is the highest-risk technical dependency in the system.
@@ -942,15 +947,15 @@ The parser must handle:
- Variable assignments: `FOO=bar cmd`, `export FOO=bar`
- Quoting: single quotes, double quotes, `$"..."`, `$'...'`
### Parser evaluation (to be validated in implementation spike)
### Parser options
**Option A: `bash` Hex package** — if it exposes a stable parse API with the node types above, this is the simplest integration. Risk: the package is primarily an interpreter and its internal AST may not be a stable public API.
**Option A: tree-sitter-bash via Rust NIF (recommended)** — the tree-sitter bash grammar is battle-tested and widely used (ShellCheck, GitHub syntax highlighting, every major editor). A Rust NIF wrapping `tree-sitter` + `tree-sitter-bash` provides a robust, well-documented AST with concrete node types (command, pipeline, subshell, redirected_statement, variable_assignment, etc.). The grammar is maintained by the tree-sitter community and handles adversarial inputs well. The Rust NIF compiles cleanly into the Burrito binary.
**Option B: tree-sitter-bash via NIF** — the tree-sitter bash grammar is battle-tested and widely used (ShellCheck, GitHub syntax highlighting). A Rust NIF wrapping `tree-sitter` + `tree-sitter-bash` would provide a robust, well-documented AST. Higher implementation effort but lower risk for adversarial inputs.
**Option B: `bash` Hex package** — simpler integration if it exposes a stable parse API. Risk: the package is primarily a bash interpreter and its internal AST may not be a stable public API. Suitable as a fallback if the tree-sitter NIF proves too complex to build.
**Option C: `shlex` for tokenization + custom parser** — use `shlex` (Rust NIF) for POSIX-compliant tokenization, then build a lightweight parser on top for the structural features we need (pipelines, redirections, subshells). Middle ground: less effort than tree-sitter, more robust than depending on `bash` internals.
**Option C: `shlex` for tokenization + custom parser** — use `shlex` (Rust NIF) for POSIX-compliant tokenization, then build a lightweight parser for structural features. Less robust than tree-sitter for deeply nested or adversarial inputs.
**Decision:** Validate Option A first during the implementation spike. If the `bash` package's parse API is insufficient or unstable, fall back to Option B (tree-sitter-bash). The rule engine's interface to the parser is abstracted behind `BashAnalyzer`, so the parser can be swapped without affecting rules or the rest of the system.
**Decision:** Use tree-sitter-bash via Rust NIF as the primary parser. The tree-sitter grammar is the most battle-tested option and is the standard choice for security-sensitive shell analysis. The rule engine's interface to the parser is abstracted behind `BashAnalyzer`, so the parser can be swapped if needed.
### Security considerations for the parser
@@ -975,4 +980,6 @@ Elixir/Hex packages required by the daemon:
- `toml` — TOML config parsing
- `file_system` — cross-platform file watcher for hot-reload
- `burrito` — compile to single-binary for distribution
- Bash parser — one of: `bash` (Hex), tree-sitter-bash (via Rust NIF), or `shlex` + custom parser (see Bash Parser Strategy)
Rust NIF (compiled into the Burrito binary):
- `tree-sitter` + `tree-sitter-bash` — bash AST parser for structural command analysis (primary parser, see Bash Parser Strategy)