git.stella-ops.org/scanning-engine.md at 1d962ee6fc98ad8c53715bbff6d513246f5b1361

Vladimir Moushkov e5629454cf

Create scanning-engine.md

2025-10-31 19:17:41 +02:00

0) Scope at a glance

Scan surfaces

Images (static): every file in every layer, plus Dockerfile metadata (ENV/ARG/LABEL, history).
Runtime (live containers): env vars, process args, mounted volumes (e.g., /run/secrets), logs, selected files created at runtime.

Detection methods

Deterministic patterns (regex) for known secret types.
Heuristics: entropy scoring for unknown/random secrets.
Contextual signals: filename/path, key names, nearby keywords, file type hints.
Structural checks: e.g., JWT decodable, cloud key prefix/length.
(Optional) Lightweight validation: local checksum/format (no network calls by default).

Reporting

JSON (and optionally SARIF) with: location, matched rule, masked snippet, confidence, severity, layer or container/process, and remediation hint.

1) Docker‑aware discovery workflow

A. Images (static, pre‑runtime)

Obtain filesystem + metadata
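- Prefer the Docker Engine API (Docker.DotNet): export the image as a tar stream (the docker save equivalent; Images.SaveImageAsync in recent Docker.DotNet releases, verify against the version you pin) and process it as a stream without writing to disk.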
- Parse manifest.json + config.json; capture:
  - config.Env (final env),
  - history/created_by for ENV/ARG/RUN strings,
  - labels.
Scan every layer
- Stream‑extract each layer tar (e.g., SharpCompress).
- Track added/modified paths per layer (so you can report: layer N, file X).
- Text‑only filter: skip clearly binary files (e.g., sample N bytes; if >30% non‑printables, skip or downrank).
File content & name/path analysis
- Apply regex detectors (Section 3) and entropy (Section 4).
- Weigh findings with context (Section 5).
Dockerfile/History checks
- Flag secrets in ENV/ARG/RUN strings (e.g., ENV MYSQL_ROOT_PASSWORD=...).
- Flag deleted‑later files that were present in earlier layers (common leak).
- Highlight missing .dockerignore patterns when suspicious files (.env, .pem, .tfstate) entered any layer.
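The deleted-later check can be sketched by replaying layer entries in order. This is a minimal sketch, assuming .NET 6 or later and assuming layer tars mark deletions with whiteout entries (a `.wh.` prefix on the file name, as in OCI/Docker layers). `LayerReplay` and its input shape are hypothetical names for illustration; opaque-directory markers (`.wh..wh..opq`) and whiteouts that remove a whole directory are not handled here.

```csharp
using System;
using System.Collections.Generic;

public static class LayerReplay
{
    // Hypothetical helper: layers[i] holds the tar entry paths of layer i, bottom layer first.
    // Returns every file that an earlier layer added and a later layer removed.
    public static List<(string Path, int AddedIn, int DeletedIn)> FindDeletedLater(
        IReadOnlyList<IReadOnlyList<string>> layers)
    {
        var firstSeen = new Dictionary<string, int>(StringComparer.Ordinal);
        var deleted = new List<(string Path, int AddedIn, int DeletedIn)>();

        for (int i = 0; i < layers.Count; i++)
        {
            foreach (string raw in layers[i])
            {
                string entry = raw.TrimStart('/');
                int slash = entry.LastIndexOf('/');
                string dir = slash >= 0 ? entry[..(slash + 1)] : "";
                string name = entry[(slash + 1)..];

                if (name.StartsWith(".wh.", StringComparison.Ordinal))
                {
                    if (name == ".wh..wh..opq") continue; // opaque marker: out of scope for this sketch
                    // "dir/.wh.file" removes "dir/file" from the layers below.
                    string target = dir + name[4..];
                    if (firstSeen.Remove(target, out int addedIn))
                        deleted.Add((target, addedIn, i));
                }
                else
                {
                    firstSeen.TryAdd(entry, i);
                }
            }
        }
        return deleted;
    }
}
```

Each returned tuple still points at recoverable content: the file body remains in the layer where it was added, so that layer's copy should be scanned and reported.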

B. Running containers (runtime)

Enumerate containers and inspect:
- InspectContainerAsync → Config.Env, HostConfig.Binds, Mounts, image id.
Env var scan
- Scan all key=value pairs with the same detectors (regex + entropy + context on the key name).
Process args
- docker top or /proc/<pid>/cmdline via Exec → scan args for --password=..., --api-key=....
Mounted secret paths
- Default locations: /run/secrets/*, /var/run/secrets/*, K8s secret volumes, config maps that may contain creds.
- Retrieve via GetArchiveFromContainerAsync and scan.
Logs (optional but valuable)
- Attach/stream logs; scan lines for secret patterns; provide live redaction option.

Note: Memory forensics is possible but heavy; treat as optional/IR-only.

2) High‑value filename/path heuristics (fast wins)

Run these glob/name checks before content scanning to prioritize files:

Generic secret indicators

**/*.env        **/.env*           **/*secret*.*      **/*secr*.* 
**/*credential*.*                 **/*creds*.*       **/*passwd* 
**/password*   **/*token*.*       **/*apikey*.*      **/*api_key*.* 
**/*.pem       **/*.key           **/*.pfx           **/*.p12 
**/*.jks       **/*.keystore      **/id_rsa          **/id_dsa 
**/id_ecdsa    **/id_ed25519      **/private.pem     **/server.key 
**/tls.key     **/jwt*.key

Common app/config

**/appsettings*.json              **/secrets*.json
**/application.{yml,yaml,properties}
**/application-*.{yml,yaml,properties}
**/config.yaml  **/settings.yml   **/settings.py
**/wp-config.php **/config.php     **/settings.php
**/nuget.config  **/settings.xml (Maven)  **/gradle.properties
**/docker-compose*.yml   **/compose*.yml
**/PublishProfiles/*.pubxml

Cloud/CLI creds

**/.aws/credentials  **/.aws/config
**/gcloud/application_default_credentials.json
**/.azure/**         **/doctl/config.yaml     **/.oci/config
**/.docker/config.json  **/.dockercfg
**/.npmrc  **/.yarnrc  **/.pypirc  **/.gem/credentials  **/.netrc

Infra/IaC

**/*.tfstate  **/*.tfvars*   **/kube/config  **/.kube/config  **/*kubeconfig*
**/service-account*.json     **/*-sa.json    **/*-key.json

Orchestrator runtime

/run/secrets/*     /var/run/secrets/*
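The glob checks above could be evaluated with a small glob-to-regex translation. This is a minimal sketch, assuming .NET 6 or later; `PathHeuristics`, `GlobToRegex`, and `IsSuspicious` are hypothetical names. It supports only `**/`, a trailing `**`, and `*`; brace alternatives such as `{yml,yaml,properties}` would need to be expanded into separate globs first.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class PathHeuristics
{
    // Translates a glob into an anchored, case-insensitive regex.
    // "**/" = any directory depth (including none), "**" = anything, "*" = anything except '/'.
    public static Regex GlobToRegex(string glob)
    {
        var sb = new StringBuilder("^");
        for (int i = 0; i < glob.Length; i++)
        {
            bool doubleStar = glob[i] == '*' && i + 1 < glob.Length && glob[i + 1] == '*';
            if (doubleStar && i + 2 < glob.Length && glob[i + 2] == '/')
            {
                sb.Append("(?:.*/)?");
                i += 2;
            }
            else if (doubleStar)
            {
                sb.Append(".*");
                i += 1;
            }
            else if (glob[i] == '*')
            {
                sb.Append("[^/]*");
            }
            else
            {
                sb.Append(Regex.Escape(glob[i].ToString()));
            }
        }
        sb.Append('$');
        return new Regex(sb.ToString(), RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    // Paths and globs are compared without a leading '/', since tar entries are relative.
    public static bool IsSuspicious(string path, params string[] globs)
    {
        string normalized = path.TrimStart('/');
        foreach (string glob in globs)
            if (GlobToRegex(glob.TrimStart('/')).IsMatch(normalized)) return true;
        return false;
    }
}
```

In a real scanner the translated regexes would be built once and cached rather than rebuilt per path.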

3) Regex detector catalog (battle‑tested patterns)

Use RegexOptions.Compiled. Most token formats are case‑sensitive (e.g., AKIA…, ghp_…), so add RegexOptions.IgnoreCase only to keyword‑style patterns such as aws_secret_access_key. Always mask values in reports (e.g., show first 4 + last 4 chars).
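The masking rule can be sketched as follows. This is an illustrative helper, assuming .NET 6 or later; `Masking.Mask` and its `keep` parameter are hypothetical names.

```csharp
using System;

public static class Masking
{
    // Keeps the first and last `keep` characters and stars out the middle.
    // Values of length <= 2 * keep are starred entirely, so nothing leaks.
    public static string Mask(string value, int keep = 4)
    {
        if (string.IsNullOrEmpty(value)) return string.Empty;
        if (value.Length <= keep * 2) return new string('*', value.Length);
        return value[..keep] + new string('*', value.Length - keep * 2) + value[^keep..];
    }
}
```

For example, `Masking.Mask("AKIAIOSFODNN7EXAMPLE")` yields `AKIA************MPLE`. For short secrets a stricter cutoff may be preferable, since eight visible characters can be most of the value.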

3.1 Private keys / certificates

OpenSSH private key @"-----BEGIN OPENSSH PRIVATE KEY-----"
Generic PEM private key @"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----"
PGP private key @"-----BEGIN PGP PRIVATE KEY BLOCK-----"

(Public keys/certificates are not secrets: BEGIN PUBLIC KEY, BEGIN CERTIFICATE → downrank/ignore.)

3.2 Cloud: AWS

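Access Key ID @"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b" Only AKIA (long‑term) and ASIA (temporary STS) prefix access key IDs; the other prefixes identify IAM resources (groups, users, roles, instance profiles, managed policies), so downrank those matches.
Secret Access Key (context‑aided) @"\b[A-Za-z0-9/\+=]{40}\b" Boost only if within ~50 chars of aws|secret|access[_-]?key|AWS_SECRET_ACCESS_KEY.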
Credentials file lines
- @"aws_access_key_id\s*=\s*[A-Z0-9]{20}"
- @"aws_secret_access_key\s*=\s*[A-Za-z0-9/\+=]{40}"
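The context-aided rule for the secret access key might look like the sketch below: a 40-character candidate counts only when a keyword appears within a window around it. It assumes .NET 6 or later; `ContextAided` and `FindNearKeyword` are hypothetical names, and the keyword list is a shortened version of the one above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContextAided
{
    // Candidate pattern copied from the catalog entry above.
    static readonly Regex Candidate = new(@"\b[A-Za-z0-9/\+=]{40}\b", RegexOptions.Compiled);
    static readonly Regex Keyword = new(@"aws|secret|access[_-]?key",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Yields candidates that have a keyword within `window` chars before or after them.
    public static IEnumerable<string> FindNearKeyword(string text, int window = 50)
    {
        foreach (Match m in Candidate.Matches(text))
        {
            int start = Math.Max(0, m.Index - window);
            int end = Math.Min(text.Length, m.Index + m.Length + window);
            // Search the surrounding text only, not the candidate itself.
            string before = text[start..m.Index];
            string after = text[(m.Index + m.Length)..end];
            if (Keyword.IsMatch(before) || Keyword.IsMatch(after))
                yield return m.Value;
        }
    }
}
```

Excluding the candidate from the keyword search keeps a random token that happens to contain "aws" from boosting itself.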

3.3 Cloud: GCP / Google

API key @"\bAIza[0-9A-Za-z\-_]{35}\b"
Service Account JSON (two‑term signature)
- @"""type""\s*:\s*""service_account"""
- @"""private_key""\s*:\s*""-----BEGIN PRIVATE KEY-----"

3.4 Cloud: Azure

Storage connection string @"DefaultEndpointsProtocol=https;AccountName=[^;]+;AccountKey=[A-Za-z0-9\+/=]{88};EndpointSuffix=core\.windows\.net"
SAS token (simplified) @"\bsv=\d{4}-\d{2}-\d{2}[^ ]*?&sig=[A-Za-z0-9%/\+=]{40,}\b"

3.5 Dev platforms / SCM

GitHub PAT @"\bgh[prusoa]_[A-Za-z0-9]{36}\b"
GitLab PAT @"\bglpat-[A-Za-z0-9\-_]{20,}\b"
NPM token
- in .npmrc: @"//registry\.npmjs\.org/:_authToken=\s*(npm_[A-Za-z0-9]{36})"
- raw form: @"\bnpm_[A-Za-z0-9]{36}\b"
PyPI token @"\bpypi-AgEIcHlwaS5vcmc[A-Za-z0-9\-_]{50,}\b"

3.6 Messaging / SaaS

Slack tokens (broad) @"\bxox[a-z]-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}(?:-[A-Za-z0-9-]{8,})?\b"
Stripe @"\bsk_(?:live|test)_[0-9a-zA-Z]{24}\b"
SendGrid @"\bSG\.[A-Za-z0-9\-_]{16,32}\.[A-Za-z0-9\-_]{16,64}\b"
Mailgun @"\bkey-[0-9a-zA-Z]{32}\b"
Twilio
- SID: @"\bAC[0-9a-f]{32}\b"
- Auth token (context aided): @"\b[0-9a-f]{32}\b" near twilio|auth[_-]?token
Discord bot @"\b[A-Za-z\d]{24}\.[A-Za-z\d\-_]{6}\.[A-Za-z\d\-_]{27}\b"

3.7 Database / service connection strings

PostgreSQL @"\bpostgres(?:ql)?://[^:\s]+:[^@\s]+@[^/\s]+"
MySQL @"\bmysql://[^:\s]+:[^@\s]+@[^/\s]+"
MongoDB @"\bmongodb(?:\+srv)?://[^:\s]+:[^@\s]+@[^/\s]+"
SQL Server (ADO.NET) @"\bData Source=[^;]+;Initial Catalog=[^;]+;User ID=[^;]+;Password=[^;]+;"
Redis @"\bredis(?:s|\+ssl)?://[^:@/\s]*:[^@/\s]+@[^/\s]+" (matches only URLs that embed a password; also covers the rediss:// TLS scheme)
Basic auth in URL (generic) @"[a-zA-Z][a-zA-Z0-9+\-.]*://[^:/\s]+:[^@/\s]+@[^/\s]+"

3.8 Docker / CLI auth artifacts

Docker config.json auth @"""auth""\s*:\s*""[A-Za-z0-9\+/=]{20,}"""
.netrc auth @"(?mi)^machine\s+\S+\s+login\s+\S+\s+password\s+\S+"

3.9 Tokens / JWT

JWT (structural) @"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\b"
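A regex hit on this pattern can be confirmed with a structural check: base64url-decode the first two segments and verify that each is a JSON object. This is a minimal sketch, assuming .NET 6 or later with System.Text.Json; `JwtCheck` and `LooksLikeJwt` are hypothetical names. It does not verify the signature.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class JwtCheck
{
    // base64url -> bytes: restore '+' and '/', then pad to a multiple of 4.
    static byte[] DecodeBase64Url(string s)
    {
        string b64 = s.Replace('-', '+').Replace('_', '/');
        b64 = b64.PadRight(b64.Length + (4 - b64.Length % 4) % 4, '=');
        return Convert.FromBase64String(b64);
    }

    static bool IsJsonObject(string segment)
    {
        try
        {
            string json = Encoding.UTF8.GetString(DecodeBase64Url(segment));
            using JsonDocument doc = JsonDocument.Parse(json);
            return doc.RootElement.ValueKind == JsonValueKind.Object;
        }
        catch (Exception e) when (e is FormatException || e is JsonException)
        {
            return false; // not base64url, or not JSON
        }
    }

    // True when the token has three segments and header + payload decode to JSON objects.
    public static bool LooksLikeJwt(string token)
    {
        string[] parts = token.Split('.');
        return parts.Length == 3 && IsJsonObject(parts[0]) && IsJsonObject(parts[1]);
    }
}
```

A passing check is what the scoring section calls a structural signal; a failing one suggests the regex hit is just a dotted identifier.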

3.10 Build tools / package managers

NuGet
- cleartext: @"<add\s+key=""ClearTextPassword""\s+value=""[^""]+"""
- encrypted (base64, still secret): @"<add\s+key=""Password""\s+value=""[^""]+"""
Maven settings.xml @"<server>\s*<id>[^<]+</id>\s*<username>[^<]+</username>\s*<password>[^<]+</password>"
Gradle @"(?i)\bsigning\.password\s*=\s*.+"

Keep regexes modular; associate each with: { Id, Name, Pattern, Severity, Examples, RecommendedRemediation }.

4) Entropy detector (catches “unknown” secrets)

Why: Many org‑specific tokens won’t match known regexes.

Implementation

Extract candidate tokens by character class:
- base64/base64url: [A-Za-z0-9/_\-\+=]{20,}
- hex: [A-Fa-f0-9]{32,}
- general mixed: [A-Za-z0-9]{24,}
Compute Shannon entropy per candidate. Use alphabet‑aware thresholds:
- base64/url: ≥ 4.0 bits/char & length ≥ 24
- hex: ≥ 3.0 bits/char & length ≥ 32
- alnum: ≥ 4.0 bits/char & length ≥ 24
Context boosts (raise confidence) if within 64 chars of: password|passwd|pwd|secret|token|apikey|api_key|api-key|client[_-]?secret|private[_-]?key|connectionstring|conn[_-]?str|bearer
Context suppressors (lower confidence/ignore):
- File/path contains: example|sample|test|fixture|dummy
- Surrounding line contains: REDACTED|<redacted>|changeme
- Known non‑secret blocks: BEGIN PUBLIC KEY, BEGIN CERTIFICATE
Cap N findings per file (e.g., 50) to avoid log floods.
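The alphabet-aware thresholds above can be applied with a small gate function. This is a minimal sketch, assuming .NET 6 or later; `EntropyGate`, `Passes`, and the kind labels `"hex"`, `"b64"`, `"alnum"` are hypothetical names, and the thresholds are the ones listed above.

```csharp
using System;
using System.Collections.Generic;

public static class EntropyGate
{
    // Shannon entropy in bits per character over all characters of s.
    public static double Shannon(string s)
    {
        if (s.Length == 0) return 0.0;
        var counts = new Dictionary<char, int>();
        foreach (char c in s)
        {
            counts.TryGetValue(c, out int n);
            counts[c] = n + 1;
        }
        double h = 0.0;
        foreach (int n in counts.Values)
        {
            double p = n / (double)s.Length;
            h -= p * Math.Log2(p);
        }
        return h;
    }

    // Minimum length and minimum bits/char per candidate class.
    public static bool Passes(string kind, string value) => kind switch
    {
        "hex"   => value.Length >= 32 && Shannon(value) >= 3.0,
        "b64"   => value.Length >= 24 && Shannon(value) >= 4.0,
        "alnum" => value.Length >= 24 && Shannon(value) >= 4.0,
        _       => false,
    };
}
```

The hex threshold is lower because a 16-symbol alphabet tops out at log2(16) = 4 bits per character. Length also bounds the measurement: a 24-character string cannot exceed log2(24) ≈ 4.58 bits per character, so thresholds much above 4.0 would reject nearly every short token.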

5) Scoring & de‑duping

Combine signals into a confidence score (sum the weights below, then clamp to [0, 1]):

+0.9 Regex “hard” match (e.g., OpenSSH private key)
+0.7 Regex “soft” match (e.g., AWS secret 40‑char near keyword)
+0.4 Entropy pass
+0.2 Suspicious filename/path
–0.5 Suppressor keyword/file
+0.2 Structural check passes (e.g., JWT decodes)

Severity

Critical: private keys, cloud root creds, Docker auth, DB creds in URLs, verified JWT signing keys.
High: API tokens (GitHub/GitLab/Slack/Stripe), secrets in ENV/ARG history.
Medium: high‑entropy candidates with strong context.
Low: weak context/entropy only, or likely sample values.

De‑dupe same value across files/layers/envs; keep a single canonical record with occurrence list.
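De-duplication could key each finding on a hash of the value, so the raw secret is never stored as a dictionary key or written to logs. This is a minimal sketch, assuming .NET 6 or later; `FindingIndex` and its members are hypothetical names, and locations are plain strings for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public sealed class FindingIndex
{
    // Key: SHA-256 fingerprint of the value. Value: every location where it was seen.
    private readonly Dictionary<string, List<string>> _occurrences = new();

    public static string Fingerprint(string value) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(value)));

    // Records one occurrence; returns true when the value is seen for the first time.
    public bool Add(string value, string location)
    {
        string key = Fingerprint(value);
        bool isNew = !_occurrences.TryGetValue(key, out List<string>? list);
        if (list is null)
        {
            list = new List<string>();
            _occurrences[key] = list;
        }
        list.Add(location);
        return isNew;
    }

    public IReadOnlyDictionary<string, List<string>> Occurrences => _occurrences;
}
```

The same fingerprint can double as the identifier stored in a baseline or allowlist file.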

6) Docker‑specific checks you must implement

ENV/ARG leakage in history: parse config.History[].CreatedBy or docker history --no-trunc. Flag any ENV/ARG with suspicious key names or values matching detectors.
Deleted‑later files: if a file existed in an earlier layer and was deleted later (a common .env mishap), still flag it and report the layer and the instruction that introduced it.
.dockerignore advisory: if high‑risk files (.env, .pem, .tfstate, credentials) entered the build context at any point, suggest .dockerignore entries.
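The ENV/ARG history check might be sketched as below. It assumes .NET 6 or later and assumes created_by strings contain `ENV KEY=value` or `ARG KEY=value` somewhere in the line, which is what the classic builder and BuildKit commonly emit; the exact format varies by builder. `HistoryParser` and `Suspicious` are hypothetical names. Quoted values containing spaces and the legacy `ENV KEY value` form are not handled.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HistoryParser
{
    // Finds "ENV KEY=value" / "ARG KEY=value" anywhere in a created_by string.
    static readonly Regex EnvArg = new(
        @"\b(?<kind>ENV|ARG)\s+(?<key>[A-Za-z_][A-Za-z0-9_]*)=(?<value>\S+)",
        RegexOptions.Compiled);

    // Shortened key-name list, for illustration only.
    static readonly Regex SensitiveKey = new(
        @"PASSWORD|PASS|PWD|SECRET|TOKEN|KEY",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Yields ENV/ARG assignments whose key name looks sensitive.
    public static IEnumerable<(string Kind, string Key, string Value)> Suspicious(
        IEnumerable<string> createdBy)
    {
        foreach (string line in createdBy)
            foreach (Match m in EnvArg.Matches(line))
                if (SensitiveKey.IsMatch(m.Groups["key"].Value))
                    yield return (m.Groups["kind"].Value, m.Groups["key"].Value, m.Groups["value"].Value);
    }
}
```

Values of non-sensitive keys would still go through the regex and entropy detectors; the key-name check only adds a boost. Returned values need masking before they reach a report.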

7) Runtime inspection rules

Environment
- Scan all Env pairs; boost hits for keys containing: PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|CLIENT_SECRET|SAS|CONNECTIONSTRING
Process args
- Flag --password, --api-key, --token, --secret, --connection-string.
Mounted secrets
- Enumerate /run/secrets/*, /var/run/secrets/* (Swarm/K8s).
- Ensure permissions are restrictive; still scan contents (apps sometimes copy them elsewhere).
Logs
- Tail & scan. Provide optional redaction pipeline.
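The environment and process-argument rules can be sketched as two small filters. This is a minimal sketch, assuming .NET 6 or later; `RuntimeRules` and its methods are hypothetical names. It assumes env entries arrive as `KEY=value` strings and that the command line is read from /proc/<pid>/cmdline, where arguments are separated by NUL bytes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RuntimeRules
{
    static readonly Regex SensitiveKey = new(
        @"PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|SAS|CONNECTIONSTRING",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    static readonly Regex SecretFlag = new(
        @"^--(?<flag>password|api-key|token|secret|connection-string)(?:=(?<v>.*))?$",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Returns the keys of non-empty env entries whose name looks sensitive.
    public static IEnumerable<string> SensitiveEnvKeys(IEnumerable<string> env)
    {
        foreach (string pair in env)
        {
            int eq = pair.IndexOf('=');
            if (eq <= 0 || eq == pair.Length - 1) continue; // malformed or empty value
            string key = pair[..eq];
            if (SensitiveKey.IsMatch(key)) yield return key;
        }
    }

    // Returns secret-bearing flags found in a NUL-separated command line.
    public static IEnumerable<string> SecretFlags(string cmdline)
    {
        string[] args = cmdline.Split('\0', StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < args.Length; i++)
        {
            Match m = SecretFlag.Match(args[i]);
            if (!m.Success) continue;
            // The value is inline (--token=abc) or the next argument (--token abc).
            bool hasValue = m.Groups["v"].Success ? m.Groups["v"].Length > 0 : i + 1 < args.Length;
            if (hasValue) yield return "--" + m.Groups["flag"].Value;
        }
    }
}
```

Both filters return names only; the values themselves would be passed to the detectors and masked before reporting.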

8) Reporting format (JSON)

Example JSON for one finding:

{
  "detectorId": "aws.accessKeyId",
  "name": "AWS Access Key ID",
  "severity": "HIGH",
  "confidence": 0.92,
  "valueSample": "AKIA************WXYZ",
  "locations": [
    {
      "type": "image-layer-file",
      "image": "repo/app:1.4.2",
      "layerDigest": "sha256:...abc",
      "path": "/app/.env",
      "line": 12
    },
    {
      "type": "container-env",
      "containerId": "f3e9d...",
      "envKey": "AWS_ACCESS_KEY_ID"
    }
  ],
  "context": {
    "filePathScore": 0.2,
    "regexMatch": true,
    "entropy": null,
    "nearbyKeywords": ["AWS_ACCESS_KEY_ID"]
  },
  "remediation": "Remove from image; inject via secrets manager or runtime mount; rotate the key."
}

Optionally also emit SARIF to plug into code‑scanning dashboards.

9) C# implementation sketch

Project layout

SecretsScanner/
  Core/
    IDetector.cs                 // interface: Detect(stream|text, path, context) -> Findings
    RegexDetector.cs             // holds Pattern, Hints, Confidence rules
    EntropyDetector.cs           // Shannon entropy
    JwtDetector.cs               // structural decoding check
    FileClassifier.cs            // text/binary check, ext-based hints
    Scoring.cs                   // combine signals; severity
    PathsHeuristics.cs           // globs & filename rules
    ReportModel.cs               // JSON schema / SARIF
  Docker/
    ImageReader.cs               // reads image tars, layers via Docker.DotNet or stream
    HistoryParser.cs             // extracts ENV/ARG from history
    ContainerInspector.cs        // env, args, mounts, logs (Docker.DotNet)
  Catalog/
    RegexCatalog.cs              // patterns (section 3), per-detector metadata
    Keywords.cs                  // boost/suppress lists
  Cli/
    Program.cs                   // options: image, container, path; json output; fail-on

C# snippets (illustrative)

Regex catalog

using System.Text.RegularExpressions;

public static class RegexCatalog
{
    // One entry per detector. Patterns are case-sensitive (no IgnoreCase) because
    // these token formats have fixed casing.
    public static readonly (string Id, string Name, Regex Rx, string Severity, string Hint)[] Rules =
    {
        ("pem.openssh", "OpenSSH Private Key",
            new Regex(@"-----BEGIN OPENSSH PRIVATE KEY-----", RegexOptions.Compiled),
            "CRITICAL", "Remove private keys from images; use mounts or vault."),
        ("pem.private", "PEM Private Key",
            new Regex(@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----", RegexOptions.Compiled),
            "CRITICAL", "Remove private keys; rotate credentials."),
        // Id matches the detectorId used in the JSON example in Section 8.
        ("aws.accessKeyId", "AWS Access Key ID",
            new Regex(@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b", RegexOptions.Compiled),
            "HIGH", "Rotate; use IAM roles/STS; remove from code/config."),
        ("github.pat", "GitHub Personal Access Token",
            new Regex(@"\bgh[prusoa]_[A-Za-z0-9]{36}\b", RegexOptions.Compiled),
            "HIGH", "Revoke PAT; use fine-grained tokens; remove from image."),
        // ... add remaining patterns from Section 3
    };
}

Entropy

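using System;

public static class Entropy
{
    // Shannon entropy (bits per character) over the characters of s that belong to alphabet.
    // Assumes an ASCII/Latin-1 alphabet: characters >= 256 are skipped so counts[] cannot overflow.
    public static double Shannon(ReadOnlySpan<char> s, ReadOnlySpan<char> alphabet)
    {
        Span<int> counts = stackalloc int[256];
        counts.Clear(); // do not rely on stackalloc zero-initialization
        int n = 0;
        foreach (var ch in s)
        {
            if (ch < counts.Length && alphabet.IndexOf(ch) >= 0) { counts[ch]++; n++; }
        }
        if (n == 0) return 0.0;
        double H = 0.0;
        for (int i = 0; i < counts.Length; i++)
        {
            if (counts[i] == 0) continue;
            double p = counts[i] / (double)n;
            H -= p * Math.Log(p, 2);
        }
        return H;
    }
}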

Candidate extraction (simplified)

using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hex is a subset of the base64 class, so match once and classify afterwards;
// two separate passes would report every long hex token twice.
static readonly Regex Token = new(@"[A-Za-z0-9/_\-\+=]{20,}", RegexOptions.Compiled);

record Candidate(string Value, string Kind, string Line);

IEnumerable<Candidate> ExtractCandidates(string line)
{
    foreach (Match m in Token.Matches(line))
    {
        bool isHex = m.Length >= 32 && m.Value.All(Uri.IsHexDigit);
        yield return new Candidate(m.Value, isHex ? "hex" : "b64", line);
    }
}

Scoring

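// Signals collected for one candidate; weights mirror the list in Section 5.
record DetectionSignals(
    bool RegexHard, bool RegexSoft, bool EntropyHit,
    bool SuspiciousPath, bool StructuralOk, bool Suppressor);

double Score(DetectionSignals s)
{
    double score = 0;
    if (s.RegexHard) score += 0.9;
    if (s.RegexSoft) score += 0.7;
    if (s.EntropyHit) score += 0.4;
    if (s.SuspiciousPath) score += 0.2;
    if (s.StructuralOk) score += 0.2;
    if (s.Suppressor) score -= 0.5;
    // Several signals can sum past 1.0 (or below 0), so clamp to a valid confidence.
    return Math.Clamp(score, 0, 1);
}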

Docker (Docker.DotNet)

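Images: Images.GetImageHistoryAsync, Images.SaveImageAsync (the docker save tar stream) + tar unpack.
Containers: Containers.InspectContainerAsync, Exec.ExecCreateContainerAsync + Exec.StartAndAttachContainerExecAsync, Containers.GetArchiveFromContainerAsync, Containers.GetContainerLogsAsync.
Method names follow recent Docker.DotNet releases; verify them against the version you pin.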

10) False‑positive control & hygiene

Ignore lists: file globs (test/**, **/*.example.*), value lists (REDACTED, example, dummy, changeme).
Public materials: downrank matches inside BEGIN PUBLIC KEY/BEGIN CERTIFICATE.
Thresholds: tune entropy and minimum lengths to your codebase; keep per‑detector knobs in config.
Masking: never print full values; keep secure logs.
Rate‑limits: cap per‑file matches; cap per‑container to avoid spam.

11) CI/CD and policy

Build step: after docker build, run image scan; fail on High/Critical (configurable).
Pre‑deploy: scan the runtime environment (env vars, process args, mounts) in read‑only mode.
Baselining: allow a first pass to baseline known leftovers, then block any new secrets.
Rotation: auto‑emit per‑type remediation (e.g., rotate PAT, revoke AWS AK/SK, move to secret manager).
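The build-step gate and the baseline rule can be combined into a single exit-code decision. This is a minimal sketch, assuming .NET 6 or later; `Severity`, `Gate.Evaluate`, and the idea of identifying findings by a fingerprint string are hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Ordered so that comparisons like `>= Severity.High` work.
public enum Severity { Low, Medium, High, Critical }

public static class Gate
{
    // Returns a process exit code: 0 = pass, 1 = at least one finding at or above
    // failOn that is not already recorded in the baseline.
    public static int Evaluate(
        IEnumerable<(string Fingerprint, Severity Severity)> findings,
        ISet<string> baseline,
        Severity failOn = Severity.High)
    {
        bool blocked = findings.Any(f =>
            f.Severity >= failOn && !baseline.Contains(f.Fingerprint));
        return blocked ? 1 : 0;
    }
}
```

A CLI would return this value from its entry point so the CI job fails when a new High or Critical finding appears, while baselined leftovers keep passing until they are cleaned up.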

12) Optional enhancements

SBOM‑guided scanning: use SBOM/file inventory to prioritize text/config assets; cache base layers.
JWT structural checks: base64url‑decode header/payload; verify JSON; flag if plausible.
Checksum checks: Luhn for CCNs (if in scope); simple format checks for cloud tokens.
Interactive audit: CLI --audit mode to triage and write an “allowlist/baseline”.

13) Minimal “first list” your dev can paste today

Start with these detectors (high ROI):

PEM/OPENSSH private keys
AWS AKID + secret (context‑aided)
GitHub PAT, GitLab PAT, NPM, PyPI
Slack, Stripe, SendGrid, Twilio
Docker config auth field
DB connection strings (Postgres/MySQL/Mongo/SQLServer)
JWT
.aws/credentials, .npmrc, .docker/config.json, appsettings*.json, .env*, *.tfstate, *kubeconfig* (path heuristics)
Entropy (base64/hex/alnum) with context boosts/suppressors

That set alone covers the most common leak types found in container images.

Final note

This blueprint keeps everything offline (no external calls), so it’s safe in CI and reproducible. If you later want to add credential validation (e.g., confirm an AWS key via STS), make it opt‑in and heavily rate‑limited.

