From e5629454cf1e61cb6beb94e72d1f0bae8027b9aa Mon Sep 17 00:00:00 2001
From: Vladimir Moushkov <vladimir.moushkov@ablera.com>
Date: Fri, 31 Oct 2025 19:17:41 +0200
Subject: [PATCH] Create scanning-engine.md

---
 docs/dev/scanning-engine.md | 525 ++++++++++++++++++++++++++++++++++++
 1 file changed, 525 insertions(+)
 create mode 100644 docs/dev/scanning-engine.md
diff --git a/docs/dev/scanning-engine.md b/docs/dev/scanning-engine.md
new file mode 100644
index 00000000..cb3bba56
--- /dev/null
+++ b/docs/dev/scanning-engine.md
@@ -0,0 +1,525 @@
+## 0) Scope at a glance
+
+**Scan surfaces**
+
+* **Images (static):** every file in every layer, plus Dockerfile metadata (ENV/ARG/LABEL, history).
+* **Runtime (live containers):** env vars, process args, mounted volumes (e.g., `/run/secrets`), logs, selected files created at runtime.
+
+**Detection methods**
+
+1. **Deterministic patterns (regex)** for known secret types.
+2. **Heuristics**: entropy scoring for unknown/random secrets.
+3. **Contextual signals**: filename/path, key names, nearby keywords, file type hints.
+4. **Structural checks**: e.g., JWT decodable, cloud key prefix/length.
+5. **(Optional) Lightweight validation**: local checksum/format (no network calls by default).
+
+**Reporting**
+
+* JSON (and optionally SARIF) with: *where*, *what rule matched*, *snippet masked*, *confidence*, *severity*, *layer/container process*, and *remediation hint*.
+
+---
+
+## 1) Docker‑aware discovery workflow
+
+### A. Images (static, pre‑runtime)
+
+1. **Obtain filesystem + metadata**
+
+   * Prefer the **API**: use Docker Engine through Docker.DotNet (`Images.SaveImageAsync`, the equivalent of `docker save`) and process the exported tar as a stream, without writing it to disk.
+   * Parse `manifest.json` and the image config JSON it references; capture:
+
+     * `config.Env` (final env),
+     * **history**/`created_by` for `ENV`/`ARG`/`RUN` strings,
+     * labels.
+2. **Scan every layer**
+
+   * Stream‑extract each layer tar (e.g., SharpCompress).
+   * Track **added/modified paths** per layer (so you can report: *layer N, file X*).
+   * **Text‑only filter**: skip clearly binary files (e.g., sample N bytes; if >30% non‑printables, skip or downrank).
+3. **File content & name/path analysis**
+
+   * Apply **regex detectors** (Section 3) and **entropy** (Section 4).
+   * Weigh findings with **context** (Section 5).
+4. **Dockerfile/History checks**
+
+   * Flag secrets in `ENV`/`ARG`/`RUN` strings (e.g., `ENV MYSQL_ROOT_PASSWORD=...`).
+   * Flag **deleted‑later files** that were present in earlier layers (common leak).
+   * Highlight missing `.dockerignore` patterns when suspicious files (.env, .pem, .tfstate) entered any layer.
+
+### B. Running containers (runtime)
+
+1. **Enumerate** containers and **inspect**:
+
+   * `InspectContainerAsync` → `Config.Env`, `HostConfig.Binds`, `Mounts`, image id.
+2. **Env var scan**
+
+   * Scan all `key=value` pairs with the same detectors (regex + entropy + context on the key name).
+3. **Process args**
+
+   * `docker top` or `/proc/<pid>/cmdline` via `Exec` → scan args for `--password=...`, `--api-key=...`.
+4. **Mounted secret paths**
+
+   * Default locations: `/run/secrets/*`, `/var/run/secrets/*`, K8s secret volumes, config maps that may contain creds.
+   * Retrieve via `GetArchiveFromContainerAsync` and scan.
+5. **Logs (optional but valuable)**
+
+   * Attach/stream logs; scan lines for secret patterns; provide **live redaction** option.
+
+> **Note**: Memory forensics is possible but heavy; treat as optional/IR-only.
+
+---
+
+## 2) High‑value filename/path heuristics (fast wins)
+
+Run these **glob/name** checks before content scanning to prioritize files:
+
+**Generic secret indicators**
+
+```
+**/*.env        **/.env*           **/*secret*.*      **/*secr*.* 
+**/*credential*.*                 **/*creds*.*       **/*passwd* 
+**/password*   **/*token*.*       **/*apikey*.*      **/*api_key*.* 
+**/*.pem       **/*.key           **/*.pfx           **/*.p12 
+**/*.jks       **/*.keystore      **/id_rsa          **/id_dsa 
+**/id_ecdsa    **/id_ed25519      **/private.pem     **/server.key 
+**/tls.key     **/jwt*.key
+```
+
+**Common app/config**
+
+```
+**/appsettings*.json              **/secrets*.json
+**/application.{yml,yaml,properties}
+**/application-*.{yml,yaml,properties}
+**/config.yaml  **/settings.yml   **/settings.py
+**/wp-config.php **/config.php     **/settings.php
+**/nuget.config  **/settings.xml (Maven)  **/gradle.properties
+**/docker-compose*.yml   **/compose*.yml
+**/PublishProfiles/*.pubxml
+```
+
+**Cloud/CLI creds**
+
+```
+**/.aws/credentials  **/.aws/config
+**/gcloud/application_default_credentials.json
+**/.azure/**         **/doctl/config.yaml     **/.oci/config
+**/.docker/config.json  **/.dockercfg
+**/.npmrc  **/.yarnrc  **/.pypirc  **/.gem/credentials  **/.netrc
+```
+
+**Infra/IaC**
+
+```
+**/*.tfstate  **/*.tfvars*   **/kube/config  **/.kube/config  **/*kubeconfig*
+**/service-account*.json     **/*-sa.json    **/*-key.json
+```
+
+**Orchestrator runtime**
+
+```
+/run/secrets/*     /var/run/secrets/*
+```
+
+---
+
+## 3) Regex detector catalog (starting patterns)
+
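+> Patterns are written as C# verbatim strings (`@"..."`, with `""` for a literal quote). Use `RegexOptions.Compiled`; add `RegexOptions.IgnoreCase` only to keyword‑style patterns, because token prefixes such as `AKIA` or `ghp_` are case‑sensitive.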
+> Always **mask** values in reports (e.g., show first 4 + last 4 chars).
+
+### 3.1 Private keys / certificates
+
+* **OpenSSH private key**
+  `@"-----BEGIN OPENSSH PRIVATE KEY-----"`
+* **Generic PEM private key**
+  `@"-----BEGIN (?:RSA |DSA |EC |ENCRYPTED )?PRIVATE KEY-----"`
+* **PGP private key**
+  `@"-----BEGIN PGP PRIVATE KEY BLOCK-----"`
+
+> (Public keys/certificates are *not* secrets: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE` → downrank/ignore.)
+
+### 3.2 Cloud: AWS
+
+* **Access Key ID**
+  `@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b"`
+* **Secret Access Key (context‑aided)**
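+  `@"(?<![A-Za-z0-9/+=])[A-Za-z0-9/+=]{40}(?![A-Za-z0-9/+=])"`
+  *Lookarounds replace `\b`, which fails when the key starts or ends with `/` or `+`. Boost only if near `aws|secret|access[_-]?key|AWS_SECRET_ACCESS_KEY` within ~50 chars.*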
+* **Credentials file lines**
+
+  * `@"aws_access_key_id\s*=\s*[A-Z0-9]{20}"`
+  * `@"aws_secret_access_key\s*=\s*[A-Za-z0-9/\+=]{40}"`
+
+### 3.3 Cloud: GCP / Google
+
+* **API key**
+  `@"\bAIza[0-9A-Za-z\-_]{35}\b"`
+* **Service Account JSON** (two‑term signature)
+
+  * `@"""type""\s*:\s*""service_account"""`
+  * `@"""private_key""\s*:\s*""-----BEGIN PRIVATE KEY-----"`
+
+### 3.4 Cloud: Azure
+
+* **Storage connection string**
+  `@"DefaultEndpointsProtocol=https;AccountName=[^;]+;AccountKey=[A-Za-z0-9\+/=]{88};EndpointSuffix=core\.windows\.net"`
+* **SAS token (simplified)**
+  `@"\bsv=\d{4}-\d{2}-\d{2}[^ ]*?&sig=[A-Za-z0-9%/\+=]{40,}\b"`
+
+### 3.5 Dev platforms / SCM
+
+* **GitHub PAT**
+  `@"\bgh[prusoa]_[A-Za-z0-9]{36}\b"`
+* **GitLab PAT**
+  `@"\bglpat-[A-Za-z0-9\-_]{20,}\b"`
+* **NPM token**
+
+  * in `.npmrc`: `@"//registry\.npmjs\.org/:_authToken=\s*(npm_[A-Za-z0-9]{36})"`
+  * raw form: `@"\bnpm_[A-Za-z0-9]{36}\b"`
+* **PyPI token**
+  `@"\bpypi-AgEIcHlwaS5vcmc[A-Za-z0-9\-_]{50,}\b"`
+
+### 3.6 Messaging / SaaS
+
+* **Slack tokens (broad)**
+  `@"\bxox[a-z]-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}(?:-[A-Za-z0-9-]{8,})?\b"`
+* **Stripe**
+  `@"\bsk_(?:live|test)_[0-9a-zA-Z]{24}\b"`
+* **SendGrid**
+  `@"\bSG\.[A-Za-z0-9\-_]{16,32}\.[A-Za-z0-9\-_]{16,64}\b"`
+* **Mailgun**
+  `@"\bkey-[0-9a-zA-Z]{32}\b"`
+* **Twilio**
+
+  * SID: `@"\bAC[0-9a-f]{32}\b"`
+  * Auth token (context aided): `@"\b[0-9a-f]{32}\b"` near `twilio|auth[_-]?token`
+* **Discord bot**
+  `@"\b[A-Za-z\d]{24}\.[A-Za-z\d\-_]{6}\.[A-Za-z\d\-_]{27}\b"`
+
+### 3.7 Database / service connection strings
+
+* **PostgreSQL**
+  `@"\bpostgres(?:ql)?://[^:\s]+:[^@\s]+@[^/\s]+"`
+* **MySQL**
+  `@"\bmysql://[^:\s]+:[^@\s]+@[^/\s]+"`
+* **MongoDB**
+  `@"\bmongodb(?:\+srv)?://[^:\s]+:[^@\s]+@[^/\s]+"`
+* **SQL Server (ADO.NET)**
+  `@"\bData Source=[^;]+;Initial Catalog=[^;]+;User ID=[^;]+;Password=[^;]+;"`
+* **Redis** (`redis://` or TLS `rediss://`; a password is required, so credential‑free URLs are not flagged)
+  `@"\brediss?://[^:@\s]*:[^@\s]+@[^/\s]+"`
+* **Basic auth in URL (generic)**
+  `@"[a-zA-Z][a-zA-Z0-9+\-.]*://[^:/\s]+:[^@/\s]+@[^/\s]+"`
+
+### 3.8 Docker / CLI auth artifacts
+
+* **Docker config.json auth**
+  `@"""auth""\s*:\s*""[A-Za-z0-9\+/=]{20,}"""`
+* **.netrc auth**
+  `@"(?mi)^machine\s+\S+\s+login\s+\S+\s+password\s+\S+"`
+
+### 3.9 Tokens / JWT
+
+* **JWT (structural)**
+  `@"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\b"`
+
+### 3.10 Build tools / package managers
+
+* **NuGet (cleartext)**
+  `@"<add\s+key=""ClearTextPassword""\s+value=""[^""]+"""`
+  `@"<add\s+key=""Password""\s+value=""[^""]+"""`  *(encrypted and base64‑encoded; still treat as a secret)*
+* **Maven settings.xml**
+  `@"<server>\s*<id>[^<]+</id>\s*<username>[^<]+</username>\s*<password>[^<]+</password>"`
+* **Gradle**
+  `@"(?i)\bsigning\.password\s*=\s*.+"`
+
+> Keep regexes modular; associate each with:
+> `{ Id, Name, Pattern, Severity, Examples, RecommendedRemediation }`.
+
+---
+
+## 4) Entropy detector (catches “unknown” secrets)
+
+**Why:** Many org‑specific tokens won’t match known regexes.
+
+**Implementation**
+
+* Extract candidate tokens by character class:
+
+  * base64/base64url: `[A-Za-z0-9/_\-\+=]{20,}`
+  * hex: `[A-Fa-f0-9]{32,}`
+  * general mixed: `[A-Za-z0-9]{24,}`
+* Compute **Shannon entropy** per candidate. Use **alphabet‑aware thresholds**:
+
+  * **base64/url**: ≥ **4.0** bits/char & length ≥ 24
+  * **hex**: ≥ **3.0** bits/char & length ≥ 32
+  * **alnum**: ≥ **4.0** bits/char & length ≥ 24
+* **Context boosts** (raise confidence) if **within 64 chars** of:
+  `password|passwd|pwd|secret|token|apikey|api_key|api-key|client[_-]?secret|private[_-]?key|connectionstring|conn[_-]?str|bearer`
+* **Context suppressors** (lower confidence/ignore):
+
+  * File/path contains: `example|sample|test|fixture|dummy`
+  * Surrounding line contains: `REDACTED|<redacted>|changeme`
+  * Known non‑secret blocks: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE`
+* Cap **N findings per file** (e.g., 50) to avoid log floods.
+
+---
+
+## 5) Scoring & de‑duping
+
+Combine signals into a **confidence score** (sum the weights below, then clamp to [0, 1]):
+
+* +0.9 Regex “hard” match (e.g., OpenSSH private key)
+* +0.7 Regex “soft” match (e.g., AWS secret 40‑char near keyword)
+* +0.4 Entropy pass
+* +0.2 Suspicious filename/path
+* –0.5 Suppressor keyword/file
+* +0.2 Structural check passes (e.g., JWT decodes)
+
+**Severity**
+
+* **Critical**: private keys, cloud root creds, Docker auth, DB creds in URLs, verified JWT signing keys.
+* **High**: API tokens (GitHub/GitLab/Slack/Stripe), secrets in ENV/ARG history.
+* **Medium**: high‑entropy candidates with strong context.
+* **Low**: weak context/entropy only, or likely sample values.
+
+**De‑dupe** the same value across files/layers/envs; keep a single canonical record with an **occurrence list**.
+
+---
+
+## 6) Docker‑specific checks you must implement
+
+* **ENV/ARG leakage in history**
+  Parse `config.History[].CreatedBy` or `docker history --no-trunc`.
+  Flag any `ENV/ARG` with suspicious key names or values matching detectors.
+* **Deleted‑later files**
+  If a file existed in an earlier layer and got deleted later (common `.env` mishap), still flag it and report **layer** + **instruction** that introduced it.
+* **`.dockerignore` advisory**
+  If high‑risk files (.env, .pem, .tfstate, credentials) appear in any layer, they came in through the build context; suggest matching `.dockerignore` entries.
+
+---
+
+## 7) Runtime inspection rules
+
+* **Environment**
+
+  * Scan all `Env` pairs; **boost** hits for keys containing:
+    `PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|CLIENT_SECRET|SAS|CONNECTIONSTRING`
+* **Process args**
+
+  * Flag `--password`, `--api-key`, `--token`, `--secret`, `--connection-string`.
+* **Mounted secrets**
+
+  * Enumerate `/run/secrets/*`, `/var/run/secrets/*` (Swarm/K8s).
+  * Ensure permissions are restrictive; still **scan contents** (apps sometimes copy them elsewhere).
+* **Logs**
+
+  * Tail & scan. Provide **optional redaction** pipeline.
+
+---
+
+## 8) Reporting format (JSON)
+
+Example JSON for one finding:
+
+```json
+{
+  "detectorId": "aws.accessKeyId",
+  "name": "AWS Access Key ID",
+  "severity": "HIGH",
+  "confidence": 0.92,
+  "valueSample": "AKIA************WXYZ",
+  "locations": [
+    {
+      "type": "image-layer-file",
+      "image": "repo/app:1.4.2",
+      "layerDigest": "sha256:...abc",
+      "path": "/app/.env",
+      "line": 12
+    },
+    {
+      "type": "container-env",
+      "containerId": "f3e9d...",
+      "envKey": "AWS_ACCESS_KEY_ID"
+    }
+  ],
+  "context": {
+    "filePathScore": 0.2,
+    "regexMatch": true,
+    "entropy": null,
+    "nearbyKeywords": ["AWS_ACCESS_KEY_ID"]
+  },
+  "remediation": "Remove from image; inject via secrets manager or runtime mount; rotate the key."
+}
+```
+
+> Optionally also emit **SARIF** to plug into code‑scanning dashboards.
+
+---
+
+## 9) C# implementation sketch
+
+### Project layout
+
+```
+SecretsScanner/
+  Core/
+    IDetector.cs                 // interface: Detect(stream|text, path, context) -> Findings
+    RegexDetector.cs             // holds Pattern, Hints, Confidence rules
+    EntropyDetector.cs           // Shannon entropy
+    JwtDetector.cs               // structural decoding check
+    FileClassifier.cs            // text/binary check, ext-based hints
+    Scoring.cs                   // combine signals; severity
+    PathsHeuristics.cs           // globs & filename rules
+    ReportModel.cs               // JSON schema / SARIF
+  Docker/
+    ImageReader.cs               // reads image tars, layers via Docker.DotNet or stream
+    HistoryParser.cs             // extracts ENV/ARG from history
+    ContainerInspector.cs        // env, args, mounts, logs (Docker.DotNet)
+  Catalog/
+    RegexCatalog.cs              // patterns (section 3), per-detector metadata
+    Keywords.cs                  // boost/suppress lists
+  Cli/
+    Program.cs                   // options: image, container, path; json output; fail-on
+```
+
+### C# snippets (illustrative)
+
+**Regex catalog** (needs `using System.Text.RegularExpressions;`)
+
+```csharp
+public static class RegexCatalog
+{
+    public static readonly (string Id, string Name, Regex Rx, string Severity, string Hint)[] Rules =
+    {
+        ("pem.openssh", "OpenSSH Private Key",
+            new Regex(@"-----BEGIN OPENSSH PRIVATE KEY-----", RegexOptions.Compiled),
+            "CRITICAL", "Remove private keys from images; use mounts or vault."),
+        ("pem.private", "PEM Private Key",
+            new Regex(@"-----BEGIN (?:RSA |DSA |EC |ENCRYPTED )?PRIVATE KEY-----", RegexOptions.Compiled),
+            "CRITICAL", "Remove private keys; rotate credentials."),
+        ("aws.akid", "AWS Access Key ID",
+            new Regex(@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b", RegexOptions.Compiled),
+            "HIGH", "Rotate; use IAM roles/STS; remove from code/config."),
+        ("github.pat", "GitHub Personal Access Token",
+            new Regex(@"\bgh[prusoa]_[A-Za-z0-9]{36}\b", RegexOptions.Compiled),
+            "HIGH", "Revoke PAT; use fine-grained tokens; remove from image."),
+        // ... add remaining patterns from Section 3
+    };
+}
+```
+
+**Entropy**
+
+```csharp
+public static class Entropy
+{
+    public static double Shannon(ReadOnlySpan<char> s, ReadOnlySpan<char> alphabet)
+    {
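+        Span<int> counts = stackalloc int[256];
+        counts.Clear(); // stackalloc memory is not guaranteed to be zeroed
+        int n = 0;
+        foreach (var ch in s)
+            // Count only alphabet members; ch < 256 keeps the index inside counts.
+            if (ch < 256 && alphabet.IndexOf(ch) >= 0) { counts[ch]++; n++; }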
+        if (n == 0) return 0.0;
+        double H = 0.0;
+        for (int i = 0; i < counts.Length; i++)
+        {
+            if (counts[i] == 0) continue;
+            double p = counts[i] / (double)n;
+            H -= p * Math.Log(p, 2);
+        }
+        return H;
+    }
+}
+```
+
+**Candidate extraction (simplified)**
+
+```csharp
+static readonly Regex Base64Token = new(@"[A-Za-z0-9/_\-\+=]{20,}", RegexOptions.Compiled);
+static readonly Regex HexToken    = new(@"[A-Fa-f0-9]{32,}", RegexOptions.Compiled);
+
+IEnumerable<Candidate> ExtractCandidates(string line)
+{
+    foreach (Match m in Base64Token.Matches(line)) yield return new Candidate(m.Value, "b64", line);
+    foreach (Match m in HexToken.Matches(line))    yield return new Candidate(m.Value, "hex", line);
+}
+```
+
+**Scoring**
+
+```csharp
+double Score(DetectionSignals s)
+{
+    double score = 0;
+    if (s.RegexHard) score += 0.9;
+    if (s.RegexSoft) score += 0.7;
+    if (s.EntropyHit) score += 0.4;
+    if (s.SuspiciousPath) score += 0.2;
+    if (s.StructuralOk) score += 0.2;
+    if (s.Suppressor) score -= 0.5;
+    return Math.Clamp(score, 0, 1);
+}
+```
+
+**Docker (Docker.DotNet)**
+
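+* Images: `Images.GetImageHistoryAsync`, `Images.SaveImageAsync` + tar unpack.
+* Containers: `Containers.InspectContainerAsync`, `Containers.ListProcessesAsync` (the `docker top` equivalent), `Exec.ExecCreateContainerAsync` + `Exec.StartAndAttachContainerExecAsync`, `Containers.GetArchiveFromContainerAsync`, `Containers.GetContainerLogsAsync`. Verify the names against the Docker.DotNet version you use.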
+
+---
+
+## 10) False‑positive control & hygiene
+
+* **Ignore lists**: file globs (`test/**`, `**/*.example.*`), value lists (`REDACTED`, `example`, `dummy`, `changeme`).
+* **Public materials**: downrank matches inside `BEGIN PUBLIC KEY`/`BEGIN CERTIFICATE`.
+* **Thresholds**: tune entropy and minimum lengths to your codebase; keep per‑detector knobs in config.
+* **Masking**: never print full values; keep secure logs.
+* **Rate‑limits**: cap per‑file matches; cap per‑container to avoid spam.
+
+---
+
+## 11) CI/CD and policy
+
+* **Build step**: after `docker build`, run image scan; **fail** on High/Critical (configurable).
+* **Pre‑deploy**: scan the running containers' env vars, process args, and mounts (read‑only).
+* **Baselining**: allow a first pass to **baseline known leftovers**, then block any **new** secrets.
+* **Rotation**: auto‑emit per‑type remediation (e.g., rotate PAT, revoke AWS AK/SK, move to secret manager).
+
+---
+
+## 12) Optional enhancements
+
+* **SBOM‑guided scanning**: use SBOM/file inventory to prioritize text/config assets; cache base layers.
+* **JWT structural checks**: base64url‑decode header/payload; verify JSON; flag if plausible.
+* **Checksum checks**: Luhn for CCNs (if in scope); simple format checks for cloud tokens.
+* **Interactive audit**: CLI `--audit` mode to triage and write an “allowlist/baseline”.
+
+---
+
+## 13) Minimal “first list” your dev can paste today
+
+**Start with these detectors (high ROI):**
+
+* PEM/OPENSSH private keys
+* AWS AKID + secret (context‑aided)
+* GitHub PAT, GitLab PAT, NPM, PyPI
+* Slack, Stripe, SendGrid, Twilio
+* Docker config `auth` field
+* DB connection strings (Postgres/MySQL/Mongo/SQLServer)
+* JWT
+* `.aws/credentials`, `.npmrc`, `.docker/config.json`, `appsettings*.json`, `.env*`, `*.tfstate`, `*kubeconfig*` (path heuristics)
+* Entropy (base64/hex/alnum) with context boosts/suppressors
+
+That set targets the most common leak types; treat it as a starting point and tune it against your own images.
+
+---
+
+### Final note
+
+This blueprint keeps everything **offline** (no external calls), so it’s safe in CI and reproducible. If you later want to add **credential validation** (e.g., confirm an AWS key via STS), make it opt‑in and heavily rate‑limited.
+
+A natural follow‑up is to package these regexes and the scaffolding into a **starter C# repo** with a CLI (`scan image <ref> | scan container <id> | scan path <dir>`) and JSON output.