From e5629454cf1e61cb6beb94e72d1f0bae8027b9aa Mon Sep 17 00:00:00 2001 From: Vladimir Moushkov Date: Fri, 31 Oct 2025 19:17:41 +0200 Subject: [PATCH] Create scanning-engine.md --- docs/dev/scanning-engine.md | 525 ++++++++++++++++++++++++++++++++++++ 1 file changed, 525 insertions(+) create mode 100644 docs/dev/scanning-engine.md diff --git a/docs/dev/scanning-engine.md b/docs/dev/scanning-engine.md new file mode 100644 index 00000000..cb3bba56 --- /dev/null +++ b/docs/dev/scanning-engine.md @@ -0,0 +1,525 @@ +## 0) Scope at a glance + +**Scan surfaces** + +* **Images (static):** every file in every layer, plus Dockerfile metadata (ENV/ARG/LABEL, history). +* **Runtime (live containers):** env vars, process args, mounted volumes (e.g., `/run/secrets`), logs, selected files created at runtime. + +**Detection methods** + +1. **Deterministic patterns (regex)** for known secret types. +2. **Heuristics**: entropy scoring for unknown/random secrets. +3. **Contextual signals**: filename/path, key names, nearby keywords, file type hints. +4. **Structural checks**: e.g., JWT decodable, cloud key prefix/length. +5. **(Optional) Lightweight validation**: local checksum/format (no network calls by default). + +**Reporting** + +* JSON (and optionally SARIF) with: *where*, *what rule matched*, *snippet masked*, *confidence*, *severity*, *layer/container process*, and *remediation hint*. + +--- + +## 1) Docker‑aware discovery workflow + +### A. Images (static, pre‑runtime) + +1. **Obtain filesystem + metadata** + + * Prefer **API**: Docker Engine (Docker.DotNet) to `Images.GetImageAsync` and **export/tar** (`docker save`) in memory. + * Parse `manifest.json` + `config.json`; capture: + + * `config.Env` (final env), + * **history**/`created_by` for `ENV`/`ARG`/`RUN` strings, + * labels. +2. **Scan every layer** + + * Stream‑extract each layer tar (e.g., SharpCompress). + * Track **added/modified paths** per layer (so you can report: *layer N, file X*). 
+ * **Text‑only filter**: skip clearly binary files (e.g., sample N bytes; if >30% non‑printables, skip or downrank). +3. **File content & name/path analysis** + + * Apply **regex detectors** (Section 3) and **entropy** (Section 4). + * Weigh findings with **context** (Section 5). +4. **Dockerfile/History checks** + + * Flag secrets in `ENV`/`ARG`/`RUN` strings (e.g., `ENV MYSQL_ROOT_PASSWORD=...`). + * Flag **deleted‑later files** that were present in earlier layers (common leak). + * Highlight missing `.dockerignore` patterns when suspicious files (.env, .pem, .tfstate) entered any layer. + +### B. Running containers (runtime) + +1. **Enumerate** containers and **inspect**: + + * `InspectContainerAsync` → `Config.Env`, `HostConfig.Binds`, `Mounts`, image id. +2. **Env var scan** + + * Scan all `key=value` pairs with the same detectors (regex + entropy + context on the key name). +3. **Process args** + + * `docker top` or `/proc/<pid>/cmdline` via `Exec` → scan args for `--password=...`, `--api-key=...`. +4. **Mounted secret paths** + + * Default locations: `/run/secrets/*`, `/var/run/secrets/*`, K8s secret volumes, config maps that may contain creds. + * Retrieve via `GetArchiveFromContainerAsync` and scan. +5. **Logs (optional but valuable)** + + * Attach/stream logs; scan lines for secret patterns; provide **live redaction** option. + +> **Note**: Memory forensics is possible but heavy; treat as optional/IR-only. 
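**Text-only filter (sketch)**

The text-only filter from step A.2 could look roughly like the following. This is a minimal sketch, not a definitive implementation: the class name `TextSniffer`, the 4096-byte sample size, and the early exit on NUL bytes are assumptions added for illustration; only the 30% non-printable threshold comes from the workflow above. Bytes at or above 0x80 are treated as printable so that UTF-8 text is not penalized.

```csharp
using System;

public static class TextSniffer
{
    // Returns true when the sampled prefix looks like text: no NUL bytes and
    // at most maxNonPrintableRatio of the sampled bytes are control characters.
    public static bool LooksLikeText(ReadOnlySpan<byte> data, int sampleSize = 4096,
                                     double maxNonPrintableRatio = 0.30)
    {
        ReadOnlySpan<byte> sample = data.Length > sampleSize ? data.Slice(0, sampleSize) : data;
        if (sample.IsEmpty) return true; // nothing to judge; treat empty files as text

        int nonPrintable = 0;
        foreach (byte b in sample)
        {
            if (b == 0) return false; // assumption: a NUL byte marks the file as binary
            bool isWhitespace = b == (byte)'\t' || b == (byte)'\n' || b == (byte)'\r';
            if ((b < 0x20 && !isWhitespace) || b == 0x7F) nonPrintable++;
        }
        return nonPrintable / (double)sample.Length <= maxNonPrintableRatio;
    }
}
```

A caller would read the first few kilobytes of each tar entry and skip (or downrank) the entry when `LooksLikeText` returns `false`.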
+ +--- + +## 2) High‑value filename/path heuristics (fast wins) + +Run these **glob/name** checks before content scanning to prioritize files: + +**Generic secret indicators** + +``` +**/*.env **/.env* **/*secret*.* **/*secr*.* +**/*credential*.* **/*creds*.* **/*passwd* +**/password* **/*token*.* **/*apikey*.* **/*api_key*.* +**/*.pem **/*.key **/*.pfx **/*.p12 +**/*.jks **/*.keystore **/id_rsa **/id_dsa +**/id_ecdsa **/id_ed25519 **/private.pem **/server.key +**/tls.key **/jwt*.key +``` + +**Common app/config** + +``` +**/appsettings*.json **/secrets*.json +**/application.{yml,yaml,properties} +**/application-*.{yml,yaml,properties} +**/config.yaml **/settings.yml **/settings.py +**/wp-config.php **/config.php **/settings.php +**/nuget.config **/settings.xml (Maven) **/gradle.properties +**/docker-compose*.yml **/compose*.yml +**/PublishProfiles/*.pubxml +``` + +**Cloud/CLI creds** + +``` +**/.aws/credentials **/.aws/config +**/gcloud/application_default_credentials.json +**/.azure/** **/doctl/config.yaml **/.oci/config +**/.docker/config.json **/.dockercfg +**/.npmrc **/.yarnrc **/.pypirc **/.gem/credentials **/.netrc +``` + +**Infra/IaC** + +``` +**/*.tfstate **/*.tfvars* **/kube/config **/.kube/config **/*kubeconfig* +**/service-account*.json **/*-sa.json **/*-key.json +``` + +**Orchestrator runtime** + +``` +/run/secrets/* /var/run/secrets/* +``` + +--- + +## 3) **Regex detector catalog** (battle‑tested patterns) + +> Use `RegexOptions.Compiled | RegexOptions.IgnoreCase` (case‑sensitive where needed). +> Always **mask** values in reports (e.g., show first 4 + last 4 chars). + +### 3.1 Private keys / certificates + +* **OpenSSH private key** + `@"-----BEGIN OPENSSH PRIVATE KEY-----"` +* **Generic PEM private key** + `@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----"` +* **PGP private key** + `@"-----BEGIN PGP PRIVATE KEY BLOCK-----"` + +> (Public keys/certificates are *not* secrets: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE` → downrank/ignore.) 
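**Masking (sketch)**

The masking rule stated at the top of this section (show the first 4 and last 4 characters) might be implemented as below. The class name `SecretMasker` is hypothetical, and fully masking short values is an added assumption so that an 8-character password is not printed in full.

```csharp
using System;

public static class SecretMasker
{
    // Keeps the first and last `keep` characters and replaces the middle with '*'.
    public static string Mask(string value, int keep = 4)
    {
        if (string.IsNullOrEmpty(value)) return string.Empty;

        // Assumption: values too short to hide anything meaningful are masked entirely.
        if (value.Length <= keep * 2) return new string('*', value.Length);

        return value.Substring(0, keep)
             + new string('*', value.Length - keep * 2)
             + value.Substring(value.Length - keep);
    }
}
```

For example, `SecretMasker.Mask("AKIAIOSFODNN7EXAMPLE")` (the AWS documentation sample key) returns `AKIA************MPLE`.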
+ +### 3.2 Cloud: AWS + +* **Access Key ID** + `@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b"` +* **Secret Access Key (context‑aided)** + `@"\b[A-Za-z0-9/\+=]{40}\b"` + *Boost only if near `aws|secret|access[_-]?key|AWS_SECRET_ACCESS_KEY` within ~50 chars.* +* **Credentials file lines** + + * `@"aws_access_key_id\s*=\s*[A-Z0-9]{20}"` + * `@"aws_secret_access_key\s*=\s*[A-Za-z0-9/\+=]{40}"` + +### 3.3 Cloud: GCP / Google + +* **API key** + `@"\bAIza[0-9A-Za-z\-_]{35}\b"` +* **Service Account JSON** (two‑term signature) + + * `@"""type""\s*:\s*""service_account"""` + * `@"""private_key""\s*:\s*""-----BEGIN PRIVATE KEY-----"` + +### 3.4 Cloud: Azure + +* **Storage connection string** + `@"DefaultEndpointsProtocol=https;AccountName=[^;]+;AccountKey=[A-Za-z0-9\+/=]{88};EndpointSuffix=core\.windows\.net"` +* **SAS token (simplified)** + `@"\bsv=\d{4}-\d{2}-\d{2}[^ ]*?&sig=[A-Za-z0-9%/\+=]{40,}\b"` + +### 3.5 Dev platforms / SCM + +* **GitHub PAT** + `@"\bgh[prusoa]_[A-Za-z0-9]{36}\b"` +* **GitLab PAT** + `@"\bglpat-[A-Za-z0-9\-_]{20,}\b"` +* **NPM token** + + * in `.npmrc`: `@"//registry\.npmjs\.org/:_authToken=\s*(npm_[A-Za-z0-9]{36})"` + * raw form: `@"\bnpm_[A-Za-z0-9]{36}\b"` +* **PyPI token** + `@"\bpypi-AgEIcHlwaS5vcmc[A-Za-z0-9\-_]{50,}\b"` + +### 3.6 Messaging / SaaS + +* **Slack tokens (broad)** + `@"\bxox[a-z]-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}(?:-[A-Za-z0-9-]{8,})?\b"` +* **Stripe** + `@"\bsk_(?:live|test)_[0-9a-zA-Z]{24}\b"` +* **SendGrid** + `@"\bSG\.[A-Za-z0-9\-_]{16,32}\.[A-Za-z0-9\-_]{16,64}\b"` +* **Mailgun** + `@"\bkey-[0-9a-zA-Z]{32}\b"` +* **Twilio** + + * SID: `@"\bAC[0-9a-f]{32}\b"` + * Auth token (context aided): `@"\b[0-9a-f]{32}\b"` near `twilio|auth[_-]?token` +* **Discord bot** + `@"\b[A-Za-z\d]{24}\.[A-Za-z\d\-_]{6}\.[A-Za-z\d\-_]{27}\b"` + +### 3.7 Database / service connection strings + +* **PostgreSQL** + `@"\bpostgres(?:ql)?://[^:\s]+:[^@\s]+@[^/\s]+"` +* **MySQL** + `@"\bmysql://[^:\s]+:[^@\s]+@[^/\s]+"` +* 
**MongoDB** + `@"\bmongodb(?:\+srv)?://[^:\s]+:[^@\s]+@[^/\s]+"` +* **SQL Server (ADO.NET)** + `@"\bData Source=[^;]+;Initial Catalog=[^;]+;User ID=[^;]+;Password=[^;]+;"` +* **Redis** + `@"\bredis(?:\+ssl)?://(?::[^@]+@)?[^/\s]+"` +* **Basic auth in URL (generic)** + `@"[a-zA-Z][a-zA-Z0-9+\-.]*://[^:/\s]+:[^@/\s]+@[^/\s]+"` + +### 3.8 Docker / CLI auth artifacts + +* **Docker config.json auth** + `@"""auth""\s*:\s*""[A-Za-z0-9\+/=]{20,}"""` +* **.netrc auth** + `@"(?mi)^machine\s+\S+\s+login\s+\S+\s+password\s+\S+"` + +### 3.9 Tokens / JWT + +* **JWT (structural)** + `@"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\b"` + +### 3.10 Build tools / package managers + +* **NuGet (cleartext)** + `@"\s*[^<]+\s*[^<]+\s*[^<]+"` +* **Gradle** + `@"(?i)\bsigning\.password\s*=\s*.+"` + +> Keep regexes modular; associate each with: +> `{ Id, Name, Pattern, Severity, Examples, RecommendedRemediation }`. + +--- + +## 4) Entropy detector (catches “unknown” secrets) + +**Why:** Many org‑specific tokens won’t match known regexes. + +**Implementation** + +* Extract candidate tokens by character class: + + * base64/base64url: `[A-Za-z0-9/_\-\+=]{20,}` + * hex: `[A-Fa-f0-9]{32,}` + * general mixed: `[A-Za-z0-9]{24,}` +* Compute **Shannon entropy** per candidate. Use **alphabet‑aware thresholds**: + + * **base64/url**: ≥ **4.0** bits/char & length ≥ 24 + * **hex**: ≥ **3.0** bits/char & length ≥ 32 + * **alnum**: ≥ **4.0** bits/char & length ≥ 24 +* **Context boosts** (raise confidence) if **within 64 chars** of: + `password|passwd|pwd|secret|token|apikey|api_key|api-key|client[_-]?secret|private[_-]?key|connectionstring|conn[_-]?str|bearer` +* **Context suppressors** (lower confidence/ignore): + + * File/path contains: `example|sample|test|fixture|dummy` + * Surrounding line contains: `REDACTED||changeme` + * Known non‑secret blocks: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE` +* Cap **N findings per file** (e.g., 50) to avoid log floods. 
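**Threshold and context gate (sketch)**

The alphabet-aware thresholds and the 64-character context boost described in this section could be combined as follows. Treat this as a sketch under assumptions: the class name `EntropyGate` and the `kind` strings (`"hex"`, `"b64"`, `"alnum"`) are hypothetical, and entropy is computed over the characters actually present in the token, which is a simplification. The thresholds and the keyword list are copied from the bullets above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EntropyGate
{
    // Boost keywords copied from the "Context boosts" bullet.
    static readonly Regex Boost = new(
        @"password|passwd|pwd|secret|token|apikey|api_key|api-key|client[_-]?secret|private[_-]?key|connectionstring|conn[_-]?str|bearer",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Shannon entropy in bits per character.
    public static double Shannon(string s)
    {
        if (s.Length == 0) return 0.0;
        var counts = new Dictionary<char, int>();
        foreach (char c in s) counts[c] = counts.TryGetValue(c, out int n) ? n + 1 : 1;

        double h = 0.0;
        foreach (int count in counts.Values)
        {
            double p = count / (double)s.Length;
            h -= p * Math.Log2(p);
        }
        return h;
    }

    // hex: >= 3.0 bits/char and length >= 32; b64/alnum: >= 4.0 bits/char and length >= 24.
    public static bool PassesThreshold(string token, string kind)
    {
        (double minBits, int minLen) = kind == "hex" ? (3.0, 32) : (4.0, 24);
        return token.Length >= minLen && Shannon(token) >= minBits;
    }

    // True when a boost keyword appears within `window` chars before or after the token.
    public static bool HasBoostKeywordNearby(string line, int index, int length, int window = 64)
    {
        int start = Math.Max(0, index - window);
        int end = Math.Min(line.Length, index + length + window);
        string before = line.Substring(start, index - start);
        string after = line.Substring(index + length, end - (index + length));
        return Boost.IsMatch(before) || Boost.IsMatch(after);
    }
}
```

A candidate that passes `PassesThreshold` would contribute the entropy signal, and `HasBoostKeywordNearby` would raise its confidence; suppressors would be checked the same way with a second keyword list.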
+ +--- + +## 5) Scoring & de‑duping + +Combine signals into a **confidence score**: + +* +0.9 Regex “hard” match (e.g., OpenSSH private key) +* +0.7 Regex “soft” match (e.g., AWS secret 40‑char near keyword) +* +0.4 Entropy pass +* +0.2 Suspicious filename/path +* –0.5 Suppressor keyword/file +* +0.2 Structural check passes (e.g., JWT decodes) + +**Severity** + +* **Critical**: private keys, cloud root creds, Docker auth, DB creds in URLs, verified JWT signing keys. +* **High**: API tokens (GitHub/GitLab/Slack/Stripe), secrets in ENV/ARG history. +* **Medium**: high‑entropy candidates with strong context. +* **Low**: weak context/entropy only, or likely sample values. + +**De‑dupe** same value across files/layers/envs; keep a single canonical record with **occurrence list**. + +--- + +## 6) Docker‑specific checks you must implement + +* **ENV/ARG leakage in history** + Parse `config.History[].CreatedBy` or `docker history --no-trunc`. + Flag any `ENV/ARG` with suspicious key names or values matching detectors. +* **Deleted‑later files** + If a file existed in an earlier layer and got deleted later (common `.env` mishap), still flag it and report **layer** + **instruction** that introduced it. +* **`.dockerignore` advisory** + If high‑risk files (.env, .pem, .tfstate, credentials) entered the build context once, suggest `.dockerignore` entries. + +--- + +## 7) Runtime inspection rules + +* **Environment** + + * Scan all `Env` pairs; **boost** hits for keys containing: + `PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|CLIENT_SECRET|SAS|CONNECTIONSTRING` +* **Process args** + + * Flag `--password`, `--api-key`, `--token`, `--secret`, `--connection-string`. +* **Mounted secrets** + + * Enumerate `/run/secrets/*`, `/var/run/secrets/*` (Swarm/K8s). + * Ensure permissions are restrictive; still **scan contents** (apps sometimes copy them elsewhere). +* **Logs** + + * Tail & scan. Provide **optional redaction** pipeline. 
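**Runtime key and flag checks (sketch)**

The environment and process-argument rules above could be expressed as two small helpers. This is an illustrative sketch: `RuntimeRules` and its method names are hypothetical, while the key substrings and flag names are the ones listed in this section. Substring matching is deliberately loose (for example `KEY` also matches `KEYBOARD_LAYOUT`), so a hit should only boost a finding, not create one on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RuntimeRules
{
    // Key-name substrings that boost a finding, as listed under "Environment".
    static readonly Regex SensitiveKey = new(
        "PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|CLIENT_SECRET|SAS|CONNECTIONSTRING",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Flags listed under "Process args", with or without an inline "=value".
    static readonly Regex SensitiveFlag = new(
        @"^--(?:password|api-key|token|secret|connection-string)(?:=.*)?$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Env entries arrive as "KEY=value"; only the key name decides the boost.
    public static bool IsSensitiveEnvKey(string envEntry)
    {
        int eq = envEntry.IndexOf('=');
        string key = eq >= 0 ? envEntry.Substring(0, eq) : envEntry;
        return SensitiveKey.IsMatch(key);
    }

    // Returns the names of sensitive flags found in the argument list.
    public static List<string> FindSensitiveFlags(IReadOnlyList<string> args)
    {
        var hits = new List<string>();
        foreach (string arg in args)
        {
            if (SensitiveFlag.IsMatch(arg)) hits.Add(arg.Split('=')[0]);
        }
        return hits;
    }
}
```

The `Env` list from container inspection and the split command line from `docker top` would be fed to these helpers before the regular detectors run on the values.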
+ +--- + +## 8) Reporting format (JSON) + +Example JSON for one finding: + +```json +{ + "detectorId": "aws.accessKeyId", + "name": "AWS Access Key ID", + "severity": "HIGH", + "confidence": 0.92, + "valueSample": "AKIA************WXYZ", + "locations": [ + { + "type": "image-layer-file", + "image": "repo/app:1.4.2", + "layerDigest": "sha256:...abc", + "path": "/app/.env", + "line": 12 + }, + { + "type": "container-env", + "containerId": "f3e9d...", + "envKey": "AWS_ACCESS_KEY_ID" + } + ], + "context": { + "filePathScore": 0.2, + "regexMatch": true, + "entropy": null, + "nearbyKeywords": ["AWS_ACCESS_KEY_ID"] + }, + "remediation": "Remove from image; inject via secrets manager or runtime mount; rotate the key." +} +``` + +> Optionally also emit **SARIF** to plug into code‑scanning dashboards. + +--- + +## 9) C# implementation sketch + +### Project layout + +``` +SecretsScanner/ + Core/ + IDetector.cs // interface: Detect(stream|text, path, context) -> Findings + RegexDetector.cs // holds Pattern, Hints, Confidence rules + EntropyDetector.cs // Shannon entropy + JwtDetector.cs // structural decoding check + FileClassifier.cs // text/binary check, ext-based hints + Scoring.cs // combine signals; severity + PathsHeuristics.cs // globs & filename rules + ReportModel.cs // JSON schema / SARIF + Docker/ + ImageReader.cs // reads image tars, layers via Docker.DotNet or stream + HistoryParser.cs // extracts ENV/ARG from history + ContainerInspector.cs // env, args, mounts, logs (Docker.DotNet) + Catalog/ + RegexCatalog.cs // patterns (section 3), per-detector metadata + Keywords.cs // boost/suppress lists + Cli/ + Program.cs // options: image, container, path; json output; fail-on +``` + +### C# snippets (illustrative) + +**Regex catalog** + +```csharp +public static class RegexCatalog +{ + public static readonly (string Id, string Name, Regex Rx, string Severity, string Hint)[] Rules = + { + ("pem.openssh", "OpenSSH Private Key", + new Regex(@"-----BEGIN OPENSSH PRIVATE 
KEY-----", RegexOptions.Compiled), + "CRITICAL", "Remove private keys from images; use mounts or vault."), + ("pem.private", "PEM Private Key", + new Regex(@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----", RegexOptions.Compiled), + "CRITICAL", "Remove private keys; rotate credentials."), + ("aws.akid", "AWS Access Key ID", + new Regex(@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b", RegexOptions.Compiled), + "HIGH", "Rotate; use IAM roles/STS; remove from code/config."), + ("github.pat", "GitHub Personal Access Token", + new Regex(@"\bgh[prusoa]_[A-Za-z0-9]{36}\b", RegexOptions.Compiled), + "HIGH", "Revoke PAT; use fine-grained tokens; remove from image."), + // ... add remaining patterns from Section 3 + }; +} +``` + +**Entropy** + +```csharp +public static class Entropy +{ + // Shannon entropy (bits/char) over the characters of s that belong to the given ASCII alphabet. + public static double Shannon(ReadOnlySpan<char> s, ReadOnlySpan<char> alphabet) + { + Span<int> counts = stackalloc int[256]; + int n = 0; + foreach (var ch in s) + { + // ch < 256 keeps the index inside the 256-slot counts table + if (ch < 256 && alphabet.IndexOf(ch) >= 0) { counts[ch]++; n++; } + } + if (n == 0) return 0.0; + double H = 0.0; + for (int i = 0; i < counts.Length; i++) + { + if (counts[i] == 0) continue; + double p = counts[i] / (double)n; + H -= p * Math.Log(p, 2); + } + return H; + } +} +``` + +**Candidate extraction (simplified)** + +```csharp +static readonly Regex Base64Token = new(@"[A-Za-z0-9/_\-\+=]{20,}", RegexOptions.Compiled); +static readonly Regex HexToken = new(@"[A-Fa-f0-9]{32,}", RegexOptions.Compiled); + +IEnumerable<Candidate> ExtractCandidates(string line) +{ + foreach (Match m in Base64Token.Matches(line)) yield return new Candidate(m.Value, "b64", line); + foreach (Match m in HexToken.Matches(line)) yield return new Candidate(m.Value, "hex", line); +} +``` + +**Scoring** + +```csharp +double Score(DetectionSignals s) +{ + double score = 0; + if (s.RegexHard) score += 0.9; + if (s.RegexSoft) score += 0.7; + if (s.EntropyHit) score += 0.4; + if (s.SuspiciousPath) score += 0.2; + if (s.StructuralOk) score += 0.2; + if (s.Suppressor) score -= 
0.5; + return Math.Clamp(score, 0, 1); +} +``` + +**Docker (Docker.DotNet)** + +* Images: `IImageOperations.GetImageHistoryAsync`, `Images.GetImageAsync` + tar unpack. +* Containers: `Containers.InspectContainerAsync`, `Exec.ExecCreateContainerAsync` + `ExecStart`, `GetArchiveFromContainerAsync`, `Containers.GetContainerLogsAsync`. + +--- + +## 10) False‑positive control & hygiene + +* **Ignore lists**: file globs (`test/**`, `**/*.example.*`), value lists (`REDACTED`, `example`, `dummy`, `changeme`). +* **Public materials**: downrank matches inside `BEGIN PUBLIC KEY`/`BEGIN CERTIFICATE`. +* **Thresholds**: tune entropy and minimum lengths to your codebase; keep per‑detector knobs in config. +* **Masking**: never print full values; keep secure logs. +* **Rate‑limits**: cap per‑file matches; cap per‑container to avoid spam. + +--- + +## 11) CI/CD and policy + +* **Build step**: after `docker build`, run image scan; **fail** on High/Critical (configurable). +* **Pre‑deploy**: scan the target runtime's env/args/mounts (read‑only). +* **Baselining**: allow a first pass to **baseline known leftovers**, then block any **new** secrets. +* **Rotation**: auto‑emit per‑type remediation (e.g., rotate PAT, revoke AWS AK/SK, move to secret manager). + +--- + +## 12) Optional enhancements + +* **SBOM‑guided scanning**: use SBOM/file inventory to prioritize text/config assets; cache base layers. +* **JWT structural checks**: base64url‑decode header/payload; verify JSON; flag if plausible. +* **Checksum checks**: Luhn for CCNs (if in scope); simple format checks for cloud tokens. +* **Interactive audit**: CLI `--audit` mode to triage and write an “allowlist/baseline”. 
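**JWT structural check (sketch)**

The JWT structural check listed above (base64url-decode header and payload, verify JSON) might be written like this. It is a sketch under assumptions: `JwtStructure` is a hypothetical name, the check uses `System.Text.Json`, and it only confirms that both segments decode to JSON objects. It does not verify the signature or any claims.

```csharp
using System;
using System.Text.Json;

public static class JwtStructure
{
    // base64url -> bytes: restore the standard alphabet and padding first.
    static byte[] DecodeBase64Url(string segment)
    {
        string s = segment.Replace('-', '+').Replace('_', '/');
        switch (s.Length % 4)
        {
            case 2: s += "=="; break;
            case 3: s += "="; break;
            case 1: throw new FormatException("Invalid base64url length.");
        }
        return Convert.FromBase64String(s);
    }

    // True when the segment decodes to a JSON object.
    static bool IsJsonObject(string segment)
    {
        ReadOnlyMemory<byte> bytes = DecodeBase64Url(segment);
        using JsonDocument doc = JsonDocument.Parse(bytes);
        return doc.RootElement.ValueKind == JsonValueKind.Object;
    }

    // Structural plausibility only: three dot-separated parts, header and payload are JSON objects.
    public static bool LooksLikeJwt(string candidate)
    {
        string[] parts = candidate.Split('.');
        if (parts.Length != 3) return false;
        try
        {
            return IsJsonObject(parts[0]) && IsJsonObject(parts[1]);
        }
        catch (Exception e) when (e is FormatException || e is JsonException)
        {
            return false;
        }
    }
}
```

A passing result would set the structural signal used in scoring; a regex hit that fails this check is more likely to be a lookalike string than a real token.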
+ +--- + +## 13) Minimal “first list” your dev can paste today + +**Start with these detectors (high ROI):** + +* PEM/OPENSSH private keys +* AWS AKID + secret (context‑aided) +* GitHub PAT, GitLab PAT, NPM, PyPI +* Slack, Stripe, SendGrid, Twilio +* Docker config `auth` field +* DB connection strings (Postgres/MySQL/Mongo/SQLServer) +* JWT +* `.aws/credentials`, `.npmrc`, `.docker/config.json`, `appsettings*.json`, `.env*`, `*.tfstate`, `*kubeconfig*` (path heuristics) +* Entropy (base64/hex/alnum) with context boosts/suppressors + +That set alone catches the overwhelming majority of real‑world leaks. + +--- + +### Final note + +This blueprint keeps everything **offline** (no external calls), so it’s safe in CI and reproducible. If you later want to add **credential validation** (e.g., confirm an AWS key via STS), make it opt‑in and heavily rate‑limited. + +If you want, I can package these regexes and the scaffolding into a **starter C# repo** with a CLI (`scan image | scan container | scan path `) and JSON output.
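**Starter pass (sketch)**

As a starting point for the first list above, a single pass over text with a handful of detectors could look like the following. This is a minimal sketch, not the packaged scanner: `StarterScan` is a hypothetical name, only three patterns from Section 3 are wired in, and findings carry the detector id and line number rather than the matched value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StarterScan
{
    // Three patterns copied from Section 3; the remaining detectors plug in the same way.
    static readonly (string Id, Regex Rx)[] Rules =
    {
        ("pem.private", new Regex(@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----", RegexOptions.Compiled)),
        ("aws.akid", new Regex(@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b", RegexOptions.Compiled)),
        ("github.pat", new Regex(@"\bgh[prusoa]_[A-Za-z0-9]{36}\b", RegexOptions.Compiled)),
    };

    // Yields (detector id, 1-based line number) per matching line; values are never returned.
    public static IEnumerable<(string Id, int Line)> Scan(string text)
    {
        string[] lines = text.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            foreach (var rule in Rules)
            {
                if (rule.Rx.IsMatch(lines[i])) yield return (rule.Id, i + 1);
            }
        }
    }
}
```

Scoring, masking, and de-duplication would sit on top of this loop; keeping the loop free of secret values makes it safe to log its output directly.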