## 0) Scope at a glance

**Scan surfaces**

* **Images (static):** every file in every layer, plus Dockerfile metadata (ENV/ARG/LABEL, history).
* **Runtime (live containers):** env vars, process args, mounted volumes (e.g., `/run/secrets`), logs, selected files created at runtime.

**Detection methods**

1. **Deterministic patterns (regex)** for known secret types.
2. **Heuristics**: entropy scoring for unknown/random secrets.
3. **Contextual signals**: filename/path, key names, nearby keywords, file type hints.
4. **Structural checks**: e.g., JWT decodable, cloud key prefix/length.
5. **(Optional) Lightweight validation**: local checksum/format (no network calls by default).

**Reporting**

* JSON (and optionally SARIF) with: *where*, *what rule matched*, *snippet masked*, *confidence*, *severity*, *layer/container process*, and *remediation hint*.

---

## 1) Docker‑aware discovery workflow

### A. Images (static, pre‑runtime)

1. **Obtain filesystem + metadata**

   * Prefer **API**: Docker Engine (Docker.DotNet) to `Images.GetImageAsync` and **export/tar** (`docker save`) in memory.
   * Parse `manifest.json` + `config.json`; capture:

     * `config.Env` (final env),
     * **history**/`created_by` for `ENV`/`ARG`/`RUN` strings,
     * labels.
2. **Scan every layer**

   * Stream‑extract each layer tar (e.g., SharpCompress).
   * Track **added/modified paths** per layer (so you can report: *layer N, file X*).
   * **Text‑only filter**: skip clearly binary files (e.g., sample N bytes; if >30% non‑printables, skip or downrank).
3. **File content & name/path analysis**

   * Apply **regex detectors** (Section 3) and **entropy** (Section 4).
   * Weigh findings with **context** (Section 5).
4. **Dockerfile/History checks**

   * Flag secrets in `ENV`/`ARG`/`RUN` strings (e.g., `ENV MYSQL_ROOT_PASSWORD=...`).
   * Flag **deleted‑later files** that were present in earlier layers (common leak).
   * Highlight missing `.dockerignore` patterns when suspicious files (.env, .pem, .tfstate) entered any layer.

### B. Running containers (runtime)

1. **Enumerate** containers and **inspect**:

   * `InspectContainerAsync` → `Config.Env`, `HostConfig.Binds`, `Mounts`, image id.
2. **Env var scan**

   * Scan all `key=value` pairs with the same detectors (regex + entropy + context on the key name).
3. **Process args**

   * `docker top` or `/proc/<pid>/cmdline` via `Exec` → scan args for `--password=...`, `--api-key=...`.
4. **Mounted secret paths**

   * Default locations: `/run/secrets/*`, `/var/run/secrets/*`, K8s secret volumes, config maps that may contain creds.
   * Retrieve via `GetArchiveFromContainerAsync` and scan.
5. **Logs (optional but valuable)**

   * Attach/stream logs; scan lines for secret patterns; provide **live redaction** option.

> **Note**: Memory forensics is possible but heavy; treat as optional/IR-only.

---

## 2) High‑value filename/path heuristics (fast wins)

Run these **glob/name** checks before content scanning to prioritize files:

**Generic secret indicators**

```
**/*.env
**/.env*
**/*secret*.*
**/*secr*.*
**/*credential*.*
**/*creds*.*
**/*passwd*
**/password*
**/*token*.*
**/*apikey*.*
**/*api_key*.*
**/*.pem
**/*.key
**/*.pfx
**/*.p12
**/*.jks
**/*.keystore
**/id_rsa
**/id_dsa
**/id_ecdsa
**/id_ed25519
**/private.pem
**/server.key
**/tls.key
**/jwt*.key
```

**Common app/config**

```
**/appsettings*.json
**/secrets*.json
**/application.{yml,yaml,properties}
**/application-*.{yml,yaml,properties}
**/config.yaml
**/settings.yml
**/settings.py
**/wp-config.php
**/config.php
**/settings.php
**/nuget.config
**/settings.xml (Maven)
**/gradle.properties
**/docker-compose*.yml
**/compose*.yml
**/PublishProfiles/*.pubxml
```

**Cloud/CLI creds**

```
**/.aws/credentials
**/.aws/config
**/gcloud/application_default_credentials.json
**/.azure/**
**/doctl/config.yaml
**/.oci/config
**/.docker/config.json
**/.dockercfg
**/.npmrc
**/.yarnrc
**/.pypirc
**/.gem/credentials
**/.netrc
```

**Infra/IaC**

```
**/*.tfstate
**/*.tfvars*
**/kube/config
**/.kube/config
**/*kubeconfig*
**/service-account*.json
**/*-sa.json
**/*-key.json
```

**Orchestrator runtime**

```
/run/secrets/*
/var/run/secrets/*
```

---

## 3) **Regex detector catalog** (battle‑tested patterns)

> Use `RegexOptions.Compiled | RegexOptions.IgnoreCase` (case‑sensitive where needed).
> Always **mask** values in reports (e.g., show first 4 + last 4 chars).

### 3.1 Private keys / certificates

* **OpenSSH private key**
  `@"-----BEGIN OPENSSH PRIVATE KEY-----"`
* **Generic PEM private key**
  `@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----"`
* **PGP private key**
  `@"-----BEGIN PGP PRIVATE KEY BLOCK-----"`

> (Public keys/certificates are *not* secrets: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE` → downrank/ignore.)

### 3.2 Cloud: AWS

* **Access Key ID**
  `@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b"`
* **Secret Access Key (context‑aided)**
  `@"\b[A-Za-z0-9/\+=]{40}\b"`
  *Boost only if near `aws|secret|access[_-]?key|AWS_SECRET_ACCESS_KEY` within ~50 chars.*
* **Credentials file lines**

  * `@"aws_access_key_id\s*=\s*[A-Z0-9]{20}"`
  * `@"aws_secret_access_key\s*=\s*[A-Za-z0-9/\+=]{40}"`

### 3.3 Cloud: GCP / Google

* **API key**
  `@"\bAIza[0-9A-Za-z\-_]{35}\b"`
* **Service Account JSON** (two‑term signature)

  * `@"""type""\s*:\s*""service_account"""`
  * `@"""private_key""\s*:\s*""-----BEGIN PRIVATE KEY-----"`

### 3.4 Cloud: Azure

* **Storage connection string**
  `@"DefaultEndpointsProtocol=https;AccountName=[^;]+;AccountKey=[A-Za-z0-9\+/=]{88};EndpointSuffix=core\.windows\.net"`
* **SAS token (simplified)**
  `@"\bsv=\d{4}-\d{2}-\d{2}[^ ]*?&sig=[A-Za-z0-9%/\+=]{40,}\b"`

### 3.5 Dev platforms / SCM

* **GitHub PAT**
  `@"\bgh[prusoa]_[A-Za-z0-9]{36}\b"`
* **GitLab PAT**
  `@"\bglpat-[A-Za-z0-9\-_]{20,}\b"`
* **NPM token**

  * in `.npmrc`: `@"//registry\.npmjs\.org/:_authToken=\s*(npm_[A-Za-z0-9]{36})"`
  * raw form: `@"\bnpm_[A-Za-z0-9]{36}\b"`
* **PyPI token**
  `@"\bpypi-AgEIcHlwaS5vcmc[A-Za-z0-9\-_]{50,}\b"`

### 3.6 Messaging / SaaS

* **Slack tokens (broad)**
  `@"\bxox[a-z]-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}(?:-[A-Za-z0-9-]{8,})?\b"`
* **Stripe**
  `@"\bsk_(?:live|test)_[0-9a-zA-Z]{24}\b"`
* **SendGrid**
  `@"\bSG\.[A-Za-z0-9\-_]{16,32}\.[A-Za-z0-9\-_]{16,64}\b"`
* **Mailgun**
  `@"\bkey-[0-9a-zA-Z]{32}\b"`
* **Twilio**

  * SID: `@"\bAC[0-9a-f]{32}\b"`
  * Auth token (context aided): `@"\b[0-9a-f]{32}\b"` near `twilio|auth[_-]?token`
* **Discord bot**
  `@"\b[A-Za-z\d]{24}\.[A-Za-z\d\-_]{6}\.[A-Za-z\d\-_]{27}\b"`

### 3.7 Database / service connection strings

* **PostgreSQL**
  `@"\bpostgres(?:ql)?://[^:\s]+:[^@\s]+@[^/\s]+"`
* **MySQL**
  `@"\bmysql://[^:\s]+:[^@\s]+@[^/\s]+"`
* **MongoDB**
  `@"\bmongodb(?:\+srv)?://[^:\s]+:[^@\s]+@[^/\s]+"`
* **SQL Server (ADO.NET)**
  `@"\bData Source=[^;]+;Initial Catalog=[^;]+;User ID=[^;]+;Password=[^;]+;"`
* **Redis**
  `@"\bredis(?:\+ssl)?://(?::[^@]+@)?[^/\s]+"`
* **Basic auth in URL (generic)**
  `@"[a-zA-Z][a-zA-Z0-9+\-.]*://[^:/\s]+:[^@/\s]+@[^/\s]+"`

### 3.8 Docker / CLI auth artifacts

* **Docker config.json auth**
  `@"""auth""\s*:\s*""[A-Za-z0-9\+/=]{20,}"""`
* **.netrc auth**
  `@"(?mi)^machine\s+\S+\s+login\s+\S+\s+password\s+\S+"`

### 3.9 Tokens / JWT

* **JWT (structural)**
  `@"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\b"`

### 3.10 Build tools / package managers

* **NuGet (cleartext)**
  `@"\s*[^<]+\s*[^<]+\s*[^<]+"`
* **Gradle**
  `@"(?i)\bsigning\.password\s*=\s*.+"`

> Keep regexes modular; associate each with:
> `{ Id, Name, Pattern, Severity, Examples, RecommendedRemediation }`.

---

## 4) Entropy detector (catches “unknown” secrets)

**Why:** Many org‑specific tokens won’t match known regexes.

**Implementation**

* Extract candidate tokens by character class:

  * base64/base64url: `[A-Za-z0-9/_\-\+=]{20,}`
  * hex: `[A-Fa-f0-9]{32,}`
  * general mixed: `[A-Za-z0-9]{24,}`
* Compute **Shannon entropy** per candidate. Use **alphabet‑aware thresholds**:

  * **base64/url**: ≥ **4.0** bits/char & length ≥ 24
  * **hex**: ≥ **3.0** bits/char & length ≥ 32
  * **alnum**: ≥ **4.0** bits/char & length ≥ 24
* **Context boosts** (raise confidence) if **within 64 chars** of:
  `password|passwd|pwd|secret|token|apikey|api_key|api-key|client[_-]?secret|private[_-]?key|connectionstring|conn[_-]?str|bearer`
* **Context suppressors** (lower confidence/ignore):

  * File/path contains: `example|sample|test|fixture|dummy`
  * Surrounding line contains: `REDACTED|changeme`
  * Known non‑secret blocks: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE`
* Cap **N findings per file** (e.g., 50) to avoid log floods.

---

## 5) Scoring & de‑duping

Combine signals into a **confidence score**:

* +0.9 Regex “hard” match (e.g., OpenSSH private key)
* +0.7 Regex “soft” match (e.g., AWS secret 40‑char near keyword)
* +0.4 Entropy pass
* +0.2 Suspicious filename/path
* –0.5 Suppressor keyword/file
* +0.2 Structural check passes (e.g., JWT decodes)

**Severity**

* **Critical**: private keys, cloud root creds, Docker auth, DB creds in URLs, verified JWT signing keys.
* **High**: API tokens (GitHub/GitLab/Slack/Stripe), secrets in ENV/ARG history.
* **Medium**: high‑entropy candidates with strong context.
* **Low**: weak context/entropy only, or likely sample values.

**De‑dupe** the same value across files/layers/envs; keep a single canonical record with an **occurrence list**.

---

## 6) Docker‑specific checks you must implement

* **ENV/ARG leakage in history**
  Parse `config.History[].CreatedBy` or `docker history --no-trunc`. Flag any `ENV/ARG` with suspicious key names or values matching detectors.
* **Deleted‑later files**
  If a file existed in an earlier layer and got deleted later (common `.env` mishap), still flag it and report **layer** + **instruction** that introduced it.
* **`.dockerignore` advisory**
  If high‑risk files (.env, .pem, .tfstate, credentials) entered the build context once, suggest `.dockerignore` entries.
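
The **deleted‑later files** check can be sketched as follows. This is a minimal illustration, not the scanner's actual implementation: it assumes the layer tars have already been read into plain lists of entry paths, and it relies on the OCI/Docker whiteout convention (a `.wh.<name>` entry hides `<name>` from lower layers; `.wh..wh..opq` hides everything a lower layer put in that directory). `Layer`, `DeletedLaterFile`, and `WhiteoutTracker` are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model: one image layer and the tar entry paths it contains.
public sealed record Layer(int Index, IReadOnlyList<string> EntryPaths);

// A path added in one layer and hidden by a whiteout in a later layer.
public sealed record DeletedLaterFile(string Path, int AddedInLayer, int DeletedInLayer);

public static class WhiteoutTracker
{
    const string WhiteoutPrefix = ".wh.";
    const string OpaqueMarker = ".wh..wh..opq";

    public static List<DeletedLaterFile> FindDeletedLater(IEnumerable<Layer> layers)
    {
        var firstSeen = new Dictionary<string, int>(StringComparer.Ordinal); // path -> layer that added it
        var results = new List<DeletedLaterFile>();

        foreach (var layer in layers.OrderBy(l => l.Index))
        {
            foreach (var raw in layer.EntryPaths)
            {
                string path = Normalize(raw);
                int slash = path.LastIndexOf('/');
                string dir = slash < 0 ? "" : path.Substring(0, slash + 1);
                string name = path.Substring(slash + 1);

                if (!name.StartsWith(WhiteoutPrefix, StringComparison.Ordinal))
                {
                    firstSeen.TryAdd(path, layer.Index); // keep the earliest layer
                    continue;
                }

                // Whiteouts only affect entries that came from lower layers.
                bool opaque = name == OpaqueMarker;
                string target = opaque ? dir : dir + name.Substring(WhiteoutPrefix.Length);
                var hidden = firstSeen
                    .Where(kv => kv.Value < layer.Index &&
                        (opaque
                            ? kv.Key.StartsWith(target, StringComparison.Ordinal)
                            : kv.Key == target || kv.Key.StartsWith(target + "/", StringComparison.Ordinal)))
                    .ToList();

                foreach (var kv in hidden)
                {
                    results.Add(new DeletedLaterFile(kv.Key, kv.Value, layer.Index));
                    firstSeen.Remove(kv.Key);
                }
            }
        }
        return results;
    }

    // Strip a leading "./" and surrounding slashes so paths compare consistently.
    static string Normalize(string p)
    {
        if (p.StartsWith("./", StringComparison.Ordinal)) p = p.Substring(2);
        return p.Trim('/');
    }
}
```

Each returned record names the layer that introduced the file, so the scanner can still open that layer's tar, run the detectors on the file, and map the layer index back to the history entry that created it.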
---

## 7) Runtime inspection rules

* **Environment**

  * Scan all `Env` pairs; **boost** hits for keys containing:
    `PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|CLIENT_SECRET|SAS|CONNECTIONSTRING`
* **Process args**

  * Flag `--password`, `--api-key`, `--token`, `--secret`, `--connection-string`.
* **Mounted secrets**

  * Enumerate `/run/secrets/*`, `/var/run/secrets/*` (Swarm/K8s).
  * Ensure permissions are restrictive; still **scan contents** (apps sometimes copy them elsewhere).
* **Logs**

  * Tail & scan. Provide **optional redaction** pipeline.

---

## 8) Reporting format (JSON)

Example JSON for one finding:

```json
{
  "detectorId": "aws.accessKeyId",
  "name": "AWS Access Key ID",
  "severity": "HIGH",
  "confidence": 0.92,
  "valueSample": "AKIA************WXYZ",
  "locations": [
    {
      "type": "image-layer-file",
      "image": "repo/app:1.4.2",
      "layerDigest": "sha256:...abc",
      "path": "/app/.env",
      "line": 12
    },
    {
      "type": "container-env",
      "containerId": "f3e9d...",
      "envKey": "AWS_ACCESS_KEY_ID"
    }
  ],
  "context": {
    "filePathScore": 0.2,
    "regexMatch": true,
    "entropy": null,
    "nearbyKeywords": ["AWS_ACCESS_KEY_ID"]
  },
  "remediation": "Remove from image; inject via secrets manager or runtime mount; rotate the key."
}
```

> Optionally also emit **SARIF** to plug into code‑scanning dashboards.
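
A minimal sketch of producing such a record, assuming `System.Text.Json` and a trimmed-down version of the schema above. `Finding`, `Location`, and `FindingReport` are hypothetical names, and the masking rule (keep the first 4 and last 4 characters) follows the note in Section 3.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical, reduced report model; the full schema above has more fields.
public sealed record Location(string Type, string Path, int? Line);

public sealed record Finding(
    string DetectorId,
    string Name,
    string Severity,
    double Confidence,
    string ValueSample,
    IReadOnlyList<Location> Locations,
    string Remediation);

public static class FindingReport
{
    static readonly JsonSerializerOptions Options = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase, // detectorId, valueSample, ...
        WriteIndented = true
    };

    // Keep the first and last `keep` characters and replace the middle with '*'.
    // Values too short to mask safely are fully starred out.
    public static string Mask(string value, int keep = 4)
    {
        if (string.IsNullOrEmpty(value)) return string.Empty;
        if (value.Length <= keep * 2) return new string('*', value.Length);
        return value.Substring(0, keep)
             + new string('*', value.Length - keep * 2)
             + value.Substring(value.Length - keep);
    }

    public static string ToJson(Finding finding) => JsonSerializer.Serialize(finding, Options);
}
```

Masking at construction time (passing `FindingReport.Mask(rawValue)` as `ValueSample`) means the raw secret never reaches the serializer or the logs.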
---

## 9) C# implementation sketch

### Project layout

```
SecretsScanner/
  Core/
    IDetector.cs            // interface: Detect(stream|text, path, context) -> Findings
    RegexDetector.cs        // holds Pattern, Hints, Confidence rules
    EntropyDetector.cs      // Shannon entropy
    JwtDetector.cs          // structural decoding check
    FileClassifier.cs       // text/binary check, ext-based hints
    Scoring.cs              // combine signals; severity
    PathsHeuristics.cs      // globs & filename rules
    ReportModel.cs          // JSON schema / SARIF
  Docker/
    ImageReader.cs          // reads image tars, layers via Docker.DotNet or stream
    HistoryParser.cs        // extracts ENV/ARG from history
    ContainerInspector.cs   // env, args, mounts, logs (Docker.DotNet)
  Catalog/
    RegexCatalog.cs         // patterns (section 3), per-detector metadata
    Keywords.cs             // boost/suppress lists
  Cli/
    Program.cs              // options: image, container, path; json output; fail-on
```

### C# snippets (illustrative)

**Regex catalog**

```csharp
using System.Text.RegularExpressions;

public static class RegexCatalog
{
    public static readonly (string Id, string Name, Regex Rx, string Severity, string Hint)[] Rules =
    {
        ("pem.openssh", "OpenSSH Private Key",
            new Regex(@"-----BEGIN OPENSSH PRIVATE KEY-----", RegexOptions.Compiled),
            "CRITICAL", "Remove private keys from images; use mounts or vault."),

        ("pem.private", "PEM Private Key",
            new Regex(@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----", RegexOptions.Compiled),
            "CRITICAL", "Remove private keys; rotate credentials."),

        ("aws.akid", "AWS Access Key ID",
            new Regex(@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b", RegexOptions.Compiled),
            "HIGH", "Rotate; use IAM roles/STS; remove from code/config."),

        ("github.pat", "GitHub Personal Access Token",
            new Regex(@"\bgh[prusoa]_[A-Za-z0-9]{36}\b", RegexOptions.Compiled),
            "HIGH", "Revoke PAT; use fine-grained tokens; remove from image."),

        // ... add remaining patterns from Section 3
    };
}
```

**Entropy**

```csharp
using System;

public static class Entropy
{
    // Shannon entropy in bits per character, counting only characters that occur in `alphabet`.
    // The alphabet is expected to be ASCII; the `ch < 256` guard keeps the lookup table in bounds.
    public static double Shannon(ReadOnlySpan<char> s, ReadOnlySpan<char> alphabet)
    {
        Span<int> counts = stackalloc int[256];
        counts.Clear();
        int n = 0;
        foreach (var ch in s)
        {
            if (ch < 256 && alphabet.IndexOf(ch) >= 0) { counts[ch]++; n++; }
        }
        if (n == 0) return 0.0;

        double H = 0.0;
        for (int i = 0; i < counts.Length; i++)
        {
            if (counts[i] == 0) continue;
            double p = counts[i] / (double)n;
            H -= p * Math.Log(p, 2);
        }
        return H;
    }
}
```

**Candidate extraction (simplified)**

```csharp
// `Candidate` is assumed to be a simple record such as Candidate(string Value, string Kind, string Line).
static readonly Regex Base64Token = new(@"[A-Za-z0-9/_\-\+=]{20,}", RegexOptions.Compiled);
static readonly Regex HexToken = new(@"[A-Fa-f0-9]{32,}", RegexOptions.Compiled);

IEnumerable<Candidate> ExtractCandidates(string line)
{
    foreach (Match m in Base64Token.Matches(line))
        yield return new Candidate(m.Value, "b64", line);
    foreach (Match m in HexToken.Matches(line))
        yield return new Candidate(m.Value, "hex", line);
}
```

**Scoring**

```csharp
// Weights mirror Section 5; the result is clamped to [0, 1].
double Score(DetectionSignals s)
{
    double score = 0;
    if (s.RegexHard) score += 0.9;
    if (s.RegexSoft) score += 0.7;
    if (s.EntropyHit) score += 0.4;
    if (s.SuspiciousPath) score += 0.2;
    if (s.StructuralOk) score += 0.2;
    if (s.Suppressor) score -= 0.5;
    return Math.Clamp(score, 0, 1);
}
```

**Docker (Docker.DotNet)**

* Images: `IImageOperations.GetImageHistoryAsync`, `Images.GetImageAsync` + tar unpack.
* Containers: `Containers.InspectContainerAsync`, `Exec.ExecCreateContainerAsync` + `ExecStart`, `GetArchiveFromContainerAsync`, `Logs.GetContainerLogsAsync`.

---

## 10) False‑positive control & hygiene

* **Ignore lists**: file globs (`test/**`, `**/*.example.*`), value lists (`REDACTED`, `example`, `dummy`, `changeme`).
* **Public materials**: downrank matches inside `BEGIN PUBLIC KEY`/`BEGIN CERTIFICATE`.
* **Thresholds**: tune entropy and minimum lengths to your codebase; keep per‑detector knobs in config.
* **Masking**: never print full values; keep secure logs.
* **Rate‑limits**: cap per‑file matches; cap per‑container to avoid spam.

---

## 11) CI/CD and policy

* **Build step**: after `docker build`, run image scan; **fail** on High/Critical (configurable).
* **Pre‑deploy**: scan the runtime environment (env vars, args, mounts) in read‑only mode.
* **Baselining**: allow a first pass to **baseline known leftovers**, then block any **new** secrets.
* **Rotation**: auto‑emit per‑type remediation (e.g., rotate PAT, revoke AWS AK/SK, move to secret manager).

---

## 12) Optional enhancements

* **SBOM‑guided scanning**: use SBOM/file inventory to prioritize text/config assets; cache base layers.
* **JWT structural checks**: base64url‑decode header/payload; verify JSON; flag if plausible.
* **Checksum checks**: Luhn for CCNs (if in scope); simple format checks for cloud tokens.
* **Interactive audit**: CLI `--audit` mode to triage and write an “allowlist/baseline”.

---

## 13) Minimal “first list” your dev can paste today

**Start with these detectors (high ROI):**

* PEM/OPENSSH private keys
* AWS AKID + secret (context‑aided)
* GitHub PAT, GitLab PAT, NPM, PyPI
* Slack, Stripe, SendGrid, Twilio
* Docker config `auth` field
* DB connection strings (Postgres/MySQL/Mongo/SQLServer)
* JWT
* `.aws/credentials`, `.npmrc`, `.docker/config.json`, `appsettings*.json`, `.env*`, `*.tfstate`, `*kubeconfig*` (path heuristics)
* Entropy (base64/hex/alnum) with context boosts/suppressors

That set alone should catch a large share of the leaks commonly found in images and containers.

---

### Final note

This blueprint keeps everything **offline** (no external calls), so it’s safe in CI and reproducible. If you later want to add **credential validation** (e.g., confirm an AWS key via STS), make it opt‑in and heavily rate‑limited.
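
As an appendix to the JWT structural check listed in Section 12, here is one way it might look. This is a sketch under stated assumptions: `JwtStructure` is a hypothetical helper, the check is purely local (no signature verification, no network), and "plausible" is taken to mean three dot-separated segments where the first two base64url-decode to JSON objects and the header carries an `alg` field.

```csharp
using System;
using System.Text.Json;

public static class JwtStructure
{
    // Returns true when the token has a JWT-like shape:
    // header.payload.signature, with a JSON-object header (containing "alg") and payload.
    public static bool LooksLikeJwt(string token)
    {
        if (string.IsNullOrEmpty(token)) return false;
        var parts = token.Split('.');
        if (parts.Length != 3) return false;

        return TryDecodeJsonObject(parts[0], out var header)
            && header.TryGetProperty("alg", out _)
            && TryDecodeJsonObject(parts[1], out _);
    }

    // base64url -> bytes -> JSON object; any decoding or parsing failure yields false.
    static bool TryDecodeJsonObject(string segment, out JsonElement obj)
    {
        obj = default;
        try
        {
            // Convert base64url to standard base64 and restore the stripped padding.
            string b64 = segment.Replace('-', '+').Replace('_', '/');
            b64 = b64.PadRight(b64.Length + (4 - b64.Length % 4) % 4, '=');

            byte[] bytes = Convert.FromBase64String(b64);
            using var doc = JsonDocument.Parse(bytes);
            if (doc.RootElement.ValueKind != JsonValueKind.Object) return false;

            obj = doc.RootElement.Clone(); // detach from the disposed document
            return true;
        }
        catch (FormatException) { return false; } // not valid base64
        catch (JsonException) { return false; }   // not valid JSON
    }
}
```

A regex hit from Section 3.9 that also passes `LooksLikeJwt` would earn the +0.2 structural bonus from Section 5; a hit that fails it is more likely a random string that merely starts with `eyJ`.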