
## 0) Scope at a glance

**Scan surfaces**

* **Images (static):** every file in every layer, plus Dockerfile metadata (ENV/ARG/LABEL, history).
* **Runtime (live containers):** env vars, process args, mounted volumes (e.g., `/run/secrets`), logs, selected files created at runtime.

**Detection methods**

1. **Deterministic patterns (regex)** for known secret types.
2. **Heuristics**: entropy scoring for unknown/random secrets.
3. **Contextual signals**: filename/path, key names, nearby keywords, file type hints.
4. **Structural checks**: e.g., JWT decodable, cloud key prefix/length.
5. **(Optional) Lightweight validation**: local checksum/format (no network calls by default).

**Reporting**

* JSON (and optionally SARIF) with: *where*, *what rule matched*, *snippet masked*, *confidence*, *severity*, *layer/container process*, and *remediation hint*.

---

## 1) Docker‑aware discovery workflow

### A. Images (static, pre‑runtime)

1. **Obtain filesystem + metadata**

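   * Prefer the **API**: use the Docker Engine API (e.g., Docker.DotNet; `Images.SaveImageAsync` is the `docker save` equivalent in recent releases) to export the image tar, and process it as a stream rather than writing it to disk.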
   * Parse `manifest.json` + `config.json`; capture:

     * `config.Env` (final env),
     * **history**/`created_by` for `ENV`/`ARG`/`RUN` strings,
     * labels.
2. **Scan every layer**

   * Stream‑extract each layer tar (e.g., SharpCompress).
   * Track **added/modified paths** per layer (so you can report: *layer N, file X*).
   * **Text‑only filter**: skip clearly binary files (e.g., sample N bytes; if >30% non‑printables, skip or downrank).
3. **File content & name/path analysis**

   * Apply **regex detectors** (Section 3) and **entropy** (Section 4).
   * Weigh findings with **context** (Section 5).
4. **Dockerfile/History checks**

   * Flag secrets in `ENV`/`ARG`/`RUN` strings (e.g., `ENV MYSQL_ROOT_PASSWORD=...`).
   * Flag **deleted‑later files** that were present in earlier layers (common leak).
   * Highlight missing `.dockerignore` patterns when suspicious files (.env, .pem, .tfstate) entered any layer.
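
The history check in step 4 can be sketched as follows. This is a minimal illustration, not a full Dockerfile parser: `HistoryScan` and its two regexes are hypothetical names, quoted values containing spaces are not handled, and the extracted values should still be passed through the regex and entropy detectors.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HistoryScan
{
    // Matches "ENV KEY=value", "ENV KEY value", and "ARG KEY=value" inside a created_by string.
    static readonly Regex Assignment = new(
        @"\b(ENV|ARG)\s+([A-Za-z_][A-Za-z0-9_]*)(?:=|\s+)(\S+)", RegexOptions.Compiled);

    // Key names that make a value worth flagging even when no detector matches it (assumed list).
    static readonly Regex SuspiciousKey = new(
        @"PASSWORD|PASSWD|PWD|SECRET|TOKEN|API_?KEY", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    public static IEnumerable<(string Instruction, string Key, string Value, bool KeyLooksSecret)>
        Extract(string createdBy)
    {
        foreach (Match m in Assignment.Matches(createdBy))
        {
            string key = m.Groups[2].Value;
            yield return (m.Groups[1].Value, key, m.Groups[3].Value, SuspiciousKey.IsMatch(key));
        }
    }
}
```

Feed it each `created_by` entry from the image config; an `ARG` without a default value produces no hit because there is nothing to leak.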

### B. Running containers (runtime)

1. **Enumerate** containers and **inspect**:

   * `InspectContainerAsync` → `Config.Env`, `HostConfig.Binds`, `Mounts`, image id.
2. **Env var scan**

   * Scan all `key=value` pairs with the same detectors (regex + entropy + context on the key name).
3. **Process args**

   * `docker top` or `/proc/<pid>/cmdline` via `Exec` → scan args for `--password=...`, `--api-key=...`.
4. **Mounted secret paths**

   * Default locations: `/run/secrets/*`, `/var/run/secrets/*`, K8s secret volumes, config maps that may contain creds.
   * Retrieve via `GetArchiveFromContainerAsync` and scan.
5. **Logs (optional but valuable)**

   * Attach/stream logs; scan lines for secret patterns; provide **live redaction** option.

> **Note**: Memory forensics is possible but heavy; treat as optional/IR-only.

---

## 2) High‑value filename/path heuristics (fast wins)

Run these **glob/name** checks before content scanning to prioritize files:

**Generic secret indicators**

```
**/*.env        **/.env*           **/*secret*.*      **/*secr*.*
**/*credential*.*                 **/*creds*.*       **/*passwd*
**/password*   **/*token*.*       **/*apikey*.*      **/*api_key*.*
**/*.pem       **/*.key           **/*.pfx           **/*.p12
**/*.jks       **/*.keystore      **/id_rsa          **/id_dsa
**/id_ecdsa    **/id_ed25519      **/private.pem     **/server.key
**/tls.key     **/jwt*.key
```

**Common app/config**

```
**/appsettings*.json              **/secrets*.json
**/application.{yml,yaml,properties}
**/application-*.{yml,yaml,properties}
**/config.yaml  **/settings.yml   **/settings.py
**/wp-config.php **/config.php     **/settings.php
**/nuget.config  **/settings.xml (Maven)  **/gradle.properties
**/docker-compose*.yml   **/compose*.yml
**/PublishProfiles/*.pubxml
```

**Cloud/CLI creds**

```
**/.aws/credentials  **/.aws/config
**/gcloud/application_default_credentials.json
**/.azure/**         **/doctl/config.yaml     **/.oci/config
**/.docker/config.json  **/.dockercfg
**/.npmrc  **/.yarnrc  **/.pypirc  **/.gem/credentials  **/.netrc
```

**Infra/IaC**

```
**/*.tfstate  **/*.tfvars*   **/kube/config  **/.kube/config  **/*kubeconfig*
**/service-account*.json     **/*-sa.json    **/*-key.json
```

**Orchestrator runtime**

```
/run/secrets/*     /var/run/secrets/*
```
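
A minimal sketch of how a subset of these globs could be applied without a globbing library. `PathHeuristics.IsSuspicious` and the four lookup tables are hypothetical; they cover only some of the patterns above and assume .NET 6 or later.

```csharp
using System;
using System.Linq;

public static class PathHeuristics
{
    static readonly string[] Extensions = { ".env", ".pem", ".key", ".pfx", ".p12", ".jks", ".keystore", ".tfstate" };
    static readonly string[] ExactNames = { "id_rsa", "id_dsa", "id_ecdsa", "id_ed25519", ".npmrc", ".netrc", ".pypirc", ".dockercfg" };
    static readonly string[] NameFragments = { "secret", "credential", "creds", "passwd", "password", "token", "apikey", "api_key", "kubeconfig" };
    static readonly string[] PathSuffixes = { ".aws/credentials", ".docker/config.json", ".kube/config" };

    // True when the file name or the tail of the path matches one of the high-value patterns.
    public static bool IsSuspicious(string path)
    {
        string normalized = path.Replace('\\', '/');
        string name = normalized.Substring(normalized.LastIndexOf('/') + 1);
        return name.StartsWith(".env", StringComparison.OrdinalIgnoreCase)
            || Extensions.Any(e => name.EndsWith(e, StringComparison.OrdinalIgnoreCase))
            || ExactNames.Contains(name, StringComparer.OrdinalIgnoreCase)
            || NameFragments.Any(f => name.Contains(f, StringComparison.OrdinalIgnoreCase))
            || PathSuffixes.Any(s => normalized.EndsWith(s, StringComparison.OrdinalIgnoreCase));
    }
}
```

A hit here only prioritizes the file and feeds the path signal in scoring; it is not a finding on its own.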

---

## 3) **Regex detector catalog** (battle‑tested patterns)

> Use `RegexOptions.Compiled`. Keep patterns with fixed-case prefixes (e.g., `AKIA`, `ghp_`) case-sensitive, and add `RegexOptions.IgnoreCase` only for keyword and context patterns.
> Always **mask** values in reports (e.g., show first 4 + last 4 chars).
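
The masking rule can be sketched like this. `Masking.Mask` is a hypothetical helper; hiding short values completely is an added assumption so that masking never reveals most of a short secret.

```csharp
using System;

public static class Masking
{
    // Keeps the first and last `keep` characters and replaces the middle with '*'.
    // Values too short to mask meaningfully are hidden completely.
    public static string Mask(string value, int keep = 4)
    {
        if (string.IsNullOrEmpty(value)) return string.Empty;
        if (value.Length <= keep * 2) return new string('*', value.Length);
        return value.Substring(0, keep)
             + new string('*', value.Length - keep * 2)
             + value.Substring(value.Length - keep);
    }
}
```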

### 3.1 Private keys / certificates

* **OpenSSH private key**
  `@"-----BEGIN OPENSSH PRIVATE KEY-----"`
* **Generic PEM private key**
  `@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----"`
* **PGP private key**
  `@"-----BEGIN PGP PRIVATE KEY BLOCK-----"`

> (Public keys/certificates are *not* secrets: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE` → downrank/ignore.)

### 3.2 Cloud: AWS

* **Access Key ID**
  `@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b"`
* **Secret Access Key (context‑aided)**
  `@"(?<![A-Za-z0-9/+])[A-Za-z0-9/+]{40}(?![A-Za-z0-9/+])"`
  *Boost only if near `aws|secret|access[_-]?key|AWS_SECRET_ACCESS_KEY` within ~50 chars.*
* **Credentials file lines**

  * `@"aws_access_key_id\s*=\s*[A-Z0-9]{20}"`
  * `@"aws_secret_access_key\s*=\s*[A-Za-z0-9/\+=]{40}"`
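
The "near a keyword" rule for the secret access key can be sketched as a window check. This is an illustration under assumptions: `AwsSecretDetector` is a hypothetical name, the 50-character window follows the note above, and the candidate pattern uses lookarounds instead of `\b` so that keys starting or ending in `/` or `+` are still matched.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AwsSecretDetector
{
    static readonly Regex Candidate = new(
        @"(?<![A-Za-z0-9/+])[A-Za-z0-9/+]{40}(?![A-Za-z0-9/+])", RegexOptions.Compiled);
    static readonly Regex Keyword = new(
        @"aws|secret|access[_-]?key", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Returns each 40-char candidate plus whether a keyword occurs within `window` chars of it.
    public static IEnumerable<(string Value, bool NearKeyword)> Find(string text, int window = 50)
    {
        foreach (Match m in Candidate.Matches(text))
        {
            int start = Math.Max(0, m.Index - window);
            int tail = m.Index + m.Length;
            int end = Math.Min(text.Length, tail + window);
            // Inspect the text before and after the candidate, not the candidate itself.
            string before = text.Substring(start, m.Index - start);
            string after = text.Substring(tail, end - tail);
            yield return (m.Value, Keyword.IsMatch(before) || Keyword.IsMatch(after));
        }
    }
}
```

Candidates with `NearKeyword == false` are best left to the entropy detector rather than reported as AWS secrets.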

### 3.3 Cloud: GCP / Google

* **API key**
  `@"\bAIza[0-9A-Za-z\-_]{35}\b"`
* **Service Account JSON** (two‑term signature)

  * `@"""type""\s*:\s*""service_account"""`
  * `@"""private_key""\s*:\s*""-----BEGIN PRIVATE KEY-----"`

### 3.4 Cloud: Azure

* **Storage connection string**
  `@"DefaultEndpointsProtocol=https;AccountName=[^;]+;AccountKey=[A-Za-z0-9\+/=]{88};EndpointSuffix=core\.windows\.net"`
* **SAS token (simplified)**
  `@"\bsv=\d{4}-\d{2}-\d{2}[^ ]*?&sig=[A-Za-z0-9%/\+=]{40,}\b"`

### 3.5 Dev platforms / SCM

* **GitHub PAT**
  `@"\bgh[prusoa]_[A-Za-z0-9]{36}\b"`
* **GitLab PAT**
  `@"\bglpat-[A-Za-z0-9\-_]{20,}\b"`
* **NPM token**

  * in `.npmrc`: `@"//registry\.npmjs\.org/:_authToken=\s*(npm_[A-Za-z0-9]{36})"`
  * raw form: `@"\bnpm_[A-Za-z0-9]{36}\b"`
* **PyPI token**
  `@"\bpypi-AgEIcHlwaS5vcmc[A-Za-z0-9\-_]{50,}\b"`

### 3.6 Messaging / SaaS

* **Slack tokens (broad)**
  `@"\bxox[a-z]-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}(?:-[A-Za-z0-9-]{8,})?\b"`
* **Stripe**
  `@"\bsk_(?:live|test)_[0-9a-zA-Z]{24,}\b"`
* **SendGrid**
  `@"\bSG\.[A-Za-z0-9\-_]{16,32}\.[A-Za-z0-9\-_]{16,64}\b"`
* **Mailgun**
  `@"\bkey-[0-9a-zA-Z]{32}\b"`
* **Twilio**

  * SID: `@"\bAC[0-9a-f]{32}\b"`
  * Auth token (context aided): `@"\b[0-9a-f]{32}\b"` near `twilio|auth[_-]?token`
* **Discord bot**
  `@"\b[A-Za-z\d]{24}\.[A-Za-z\d\-_]{6}\.[A-Za-z\d\-_]{27}\b"`

### 3.7 Database / service connection strings

* **PostgreSQL**
  `@"\bpostgres(?:ql)?://[^:\s]+:[^@\s]+@[^/\s]+"`
* **MySQL**
  `@"\bmysql://[^:\s]+:[^@\s]+@[^/\s]+"`
* **MongoDB**
  `@"\bmongodb(?:\+srv)?://[^:\s]+:[^@\s]+@[^/\s]+"`
* **SQL Server (ADO.NET)**
  `@"\bData Source=[^;]+;Initial Catalog=[^;]+;User ID=[^;]+;Password=[^;]+;"`
* **Redis**
  `@"\bredis(?:s|\+ssl)?://[^:@\s]*:[^@\s]+@[^/\s]+"`
* **Basic auth in URL (generic)**
  `@"[a-zA-Z][a-zA-Z0-9+\-.]*://[^:/\s]+:[^@/\s]+@[^/\s]+"`

### 3.8 Docker / CLI auth artifacts

* **Docker config.json auth**
  `@"""auth""\s*:\s*""[A-Za-z0-9\+/=]{20,}"""`
* **.netrc auth**
  `@"(?mi)^machine\s+\S+\s+login\s+\S+\s+password\s+\S+"`

### 3.9 Tokens / JWT

* **JWT (structural)**
  `@"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\b"`
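
A regex hit alone only shows the right shape. The structural check can be sketched as below, assuming `System.Text.Json` is available; `JwtCheck.LooksLikeJwt` is a hypothetical helper that decodes the first two segments and requires an `alg` field in the header. It does not verify the signature.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class JwtCheck
{
    // True when header and payload base64url-decode to JSON objects and the header names an "alg".
    public static bool LooksLikeJwt(string token)
    {
        string[] parts = token.Split('.');
        if (parts.Length != 3) return false;
        return TryDecodeObject(parts[0], out JsonElement header)
            && header.TryGetProperty("alg", out _)
            && TryDecodeObject(parts[1], out _);
    }

    static bool TryDecodeObject(string segment, out JsonElement root)
    {
        root = default;
        try
        {
            // base64url -> base64: swap the URL-safe characters and restore padding.
            string b64 = segment.Replace('-', '+').Replace('_', '/');
            b64 = b64.PadRight(b64.Length + (4 - b64.Length % 4) % 4, '=');
            string json = Encoding.UTF8.GetString(Convert.FromBase64String(b64));
            using JsonDocument doc = JsonDocument.Parse(json);
            if (doc.RootElement.ValueKind != JsonValueKind.Object) return false;
            root = doc.RootElement.Clone();   // Clone so the element outlives the document
            return true;
        }
        catch (Exception e) when (e is FormatException || e is JsonException)
        {
            return false;
        }
    }
}
```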

### 3.10 Build tools / package managers

* **NuGet (cleartext)**
  `@"<add\s+key=""ClearTextPassword""\s+value=""[^""]+"""`
  `@"<add\s+key=""Password""\s+value=""[^""]+"""`  *(base64 ‑ still secret)*
* **Maven settings.xml**
  `@"<server>\s*<id>[^<]+</id>\s*<username>[^<]+</username>\s*<password>[^<]+</password>"`
* **Gradle**
  `@"(?i)\bsigning\.password\s*=\s*.+"`

> Keep regexes modular; associate each with:
> `{ Id, Name, Pattern, Severity, Examples, RecommendedRemediation }`.
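
One possible shape for that metadata, as a sketch: the `DetectorRule` record mirrors the fields listed above, while `DetectorRules.ExamplesMatch` and the placeholder example value are assumptions added for illustration (C# 9 or later).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public sealed record DetectorRule(
    string Id,
    string Name,
    Regex Pattern,
    string Severity,
    IReadOnlyList<string> Examples,
    string RecommendedRemediation);

public static class DetectorRules
{
    public static readonly DetectorRule GitLabPat = new(
        Id: "gitlab.pat",
        Name: "GitLab Personal Access Token",
        Pattern: new Regex(@"\bglpat-[A-Za-z0-9\-_]{20,}\b", RegexOptions.Compiled),
        Severity: "HIGH",
        Examples: new[] { "glpat-" + new string('x', 20) },   // synthetic placeholder, not a real token
        RecommendedRemediation: "Revoke the token and remove it from the image.");

    // Self-check: every documented example must match its own pattern.
    public static bool ExamplesMatch(DetectorRule rule)
    {
        foreach (string example in rule.Examples)
            if (!rule.Pattern.IsMatch(example)) return false;
        return true;
    }
}
```

Running `ExamplesMatch` over the whole catalog in a unit test catches patterns that drift away from their own examples.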

---

## 4) Entropy detector (catches “unknown” secrets)

**Why:** Many org‑specific tokens won’t match known regexes.

**Implementation**

* Extract candidate tokens by character class:

  * base64/base64url: `[A-Za-z0-9/_\-\+=]{20,}`
  * hex: `[A-Fa-f0-9]{32,}`
  * general mixed: `[A-Za-z0-9]{24,}`
* Compute **Shannon entropy** per candidate. Use **alphabet‑aware thresholds**:

  * **base64/url**: ≥ **4.0** bits/char & length ≥ 24
  * **hex**: ≥ **3.0** bits/char & length ≥ 32
  * **alnum**: ≥ **4.0** bits/char & length ≥ 24
* **Context boosts** (raise confidence) if **within 64 chars** of:
  `password|passwd|pwd|secret|token|apikey|api_key|api-key|client[_-]?secret|private[_-]?key|connectionstring|conn[_-]?str|bearer`
* **Context suppressors** (lower confidence/ignore):

  * File/path contains: `example|sample|test|fixture|dummy`
  * Surrounding line contains: `REDACTED|<redacted>|changeme`
  * Known non‑secret blocks: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE`
* Cap **N findings per file** (e.g., 50) to avoid log floods.
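
The alphabet-aware thresholds can be sketched as a lookup plus a gate. `EntropyGate` and the `"b64"`/`"hex"`/`"alnum"` kind labels are hypothetical; the numbers are the ones listed above, and the code assumes .NET 5 or later.

```csharp
using System;
using System.Collections.Generic;

public static class EntropyGate
{
    // (minimum bits per char, minimum length) per candidate class, from the thresholds above.
    static readonly Dictionary<string, (double MinBits, int MinLength)> Thresholds = new()
    {
        ["b64"]   = (4.0, 24),
        ["hex"]   = (3.0, 32),
        ["alnum"] = (4.0, 24),
    };

    // Shannon entropy in bits per character over the observed character frequencies.
    public static double Shannon(string s)
    {
        if (s.Length == 0) return 0.0;
        var counts = new Dictionary<char, int>();
        foreach (char c in s) counts[c] = counts.GetValueOrDefault(c) + 1;
        double h = 0.0;
        foreach (int n in counts.Values)
        {
            double p = n / (double)s.Length;
            h -= p * Math.Log2(p);
        }
        return h;
    }

    public static bool Passes(string token, string kind) =>
        Thresholds.TryGetValue(kind, out var t)
        && token.Length >= t.MinLength
        && Shannon(token) >= t.MinBits;
}
```

The hex threshold is lower because a 16-symbol alphabet tops out at 4.0 bits per character, whereas base64 can reach 6.0.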

---

## 5) Scoring & de‑duping

Combine signals into a **confidence score**:

* +0.9 Regex “hard” match (e.g., OpenSSH private key)
* +0.7 Regex “soft” match (e.g., AWS secret 40‑char near keyword)
* +0.4 Entropy pass
* +0.2 Suspicious filename/path
* –0.5 Suppressor keyword/file
* +0.2 Structural check passes (e.g., JWT decodes)

**Severity**

* **Critical**: private keys, cloud root creds, Docker auth, DB creds in URLs, verified JWT signing keys.
* **High**: API tokens (GitHub/GitLab/Slack/Stripe), secrets in ENV/ARG history.
* **Medium**: high‑entropy candidates with strong context.
* **Low**: weak context/entropy only, or likely sample values.

**De‑dupe** same value across files/layers/envs; keep a single canonical record with **occurrence list**.
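
A minimal de-dupe sketch. The record types and the choice to key on detector ID plus a SHA-256 fingerprint of the value are assumptions; the fingerprint keeps the raw secret out of the canonical record (.NET 5 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public sealed record Occurrence(string Type, string Location);
public sealed record RawFinding(string DetectorId, string Value, Occurrence Where);
public sealed record MergedFinding(string DetectorId, string Fingerprint, IReadOnlyList<Occurrence> Occurrences);

public static class Dedupe
{
    // SHA-256 of the value, so the merged record never stores the secret itself.
    public static string Fingerprint(string value) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(value)));

    // Groups findings with the same detector and value; keeps every distinct occurrence.
    public static List<MergedFinding> Merge(IEnumerable<RawFinding> findings) =>
        findings
            .GroupBy(f => (f.DetectorId, Fingerprint: Fingerprint(f.Value)))
            .Select(g => new MergedFinding(
                g.Key.DetectorId,
                g.Key.Fingerprint,
                g.Select(f => f.Where).Distinct().ToList()))
            .ToList();
}
```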

---

## 6) Docker‑specific checks you must implement

* **ENV/ARG leakage in history**
  Parse `config.History[].CreatedBy` or `docker history --no-trunc`.
  Flag any `ENV/ARG` with suspicious key names or values matching detectors.
* **Deleted‑later files**
  If a file existed in an earlier layer and got deleted later (common `.env` mishap), still flag it and report **layer** + **instruction** that introduced it.
* **`.dockerignore` advisory**
  If high‑risk files (.env, .pem, .tfstate, credentials) ended up in any layer, suggest matching `.dockerignore` entries.
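
Deleted-later files can be found from whiteout entries: in Docker/OCI layer tars, an entry named `dir/.wh.name` removes `dir/name` from the layers below. The sketch below assumes the per-layer entry paths have already been collected; `DeletedLaterFiles` is a hypothetical name, and directory deletions and opaque whiteouts are not expanded to their children.

```csharp
using System;
using System.Collections.Generic;

public static class DeletedLaterFiles
{
    const string WhiteoutPrefix = ".wh.";

    // layers[i] holds the tar entry paths of layer i, lowest layer first.
    // Returns (path, layer that added it, layer that deleted it).
    public static List<(string Path, int AddedIn, int DeletedIn)> Find(
        IReadOnlyList<IReadOnlyList<string>> layers)
    {
        var addedIn = new Dictionary<string, int>();   // live path -> layer that last added it
        var result = new List<(string Path, int AddedIn, int DeletedIn)>();
        for (int i = 0; i < layers.Count; i++)
        {
            foreach (string entry in layers[i])
            {
                int slash = entry.LastIndexOf('/');
                string dir = entry.Substring(0, slash + 1);
                string name = entry.Substring(slash + 1);
                if (name.StartsWith(WhiteoutPrefix, StringComparison.Ordinal))
                {
                    // "dir/.wh.name" removes "dir/name" from the layers below.
                    string target = dir + name.Substring(WhiteoutPrefix.Length);
                    if (addedIn.Remove(target, out int layer)) result.Add((target, layer, i));
                }
                else
                {
                    addedIn[entry] = i;
                }
            }
        }
        return result;
    }
}
```

The `AddedIn` index maps back to the history entry that introduced the file, which gives the instruction to report.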

---

## 7) Runtime inspection rules

* **Environment**

  * Scan all `Env` pairs; **boost** hits for keys containing:
    `PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|CLIENT_SECRET|SAS|CONNECTIONSTRING`
* **Process args**

  * Flag `--password`, `--api-key`, `--token`, `--secret`, `--connection-string`.
* **Mounted secrets**

  * Enumerate `/run/secrets/*`, `/var/run/secrets/*` (Swarm/K8s).
  * Ensure permissions are restrictive; still **scan contents** (apps sometimes copy them elsewhere).
* **Logs**

  * Tail & scan. Provide **optional redaction** pipeline.
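
The environment and process-argument rules can be sketched as two small passes. `RuntimeRules` is a hypothetical helper; the key and flag lists are the ones above, and the values it returns should still go through the regular detectors.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RuntimeRules
{
    static readonly Regex SensitiveKey = new(
        @"PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|CLIENT_SECRET|SAS|CONNECTIONSTRING",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    static readonly Regex SensitiveFlag = new(
        @"^--(password|api-key|token|secret|connection-string)(?:=(.*))?$",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Splits "KEY=value" env entries and marks keys that deserve a confidence boost.
    public static IEnumerable<(string Key, string Value, bool Boost)> ScanEnv(IEnumerable<string> env)
    {
        foreach (string pair in env)
        {
            int eq = pair.IndexOf('=');
            if (eq <= 0) continue;                       // no key, or no '=' at all
            string key = pair.Substring(0, eq);
            yield return (key, pair.Substring(eq + 1), SensitiveKey.IsMatch(key));
        }
    }

    // Handles both "--flag=value" and "--flag value" forms.
    public static IEnumerable<(string Flag, string Value)> ScanArgs(IReadOnlyList<string> args)
    {
        for (int i = 0; i < args.Count; i++)
        {
            Match m = SensitiveFlag.Match(args[i]);
            if (!m.Success) continue;
            if (m.Groups[2].Success) yield return (m.Groups[1].Value, m.Groups[2].Value);
            else if (i + 1 < args.Count) yield return (m.Groups[1].Value, args[++i]);
        }
    }
}
```

Substring matching on keys is deliberately loose (`KEY` also matches `MONKEY`), which is acceptable because it only boosts confidence and never creates a finding by itself.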

---

## 8) Reporting format (JSON)

Example JSON for one finding:

```json
{
  "detectorId": "aws.accessKeyId",
  "name": "AWS Access Key ID",
  "severity": "HIGH",
  "confidence": 0.92,
  "valueSample": "AKIA************WXYZ",
  "locations": [
    {
      "type": "image-layer-file",
      "image": "repo/app:1.4.2",
      "layerDigest": "sha256:...abc",
      "path": "/app/.env",
      "line": 12
    },
    {
      "type": "container-env",
      "containerId": "f3e9d...",
      "envKey": "AWS_ACCESS_KEY_ID"
    }
  ],
  "context": {
    "filePathScore": 0.2,
    "regexMatch": true,
    "entropy": null,
    "nearbyKeywords": ["AWS_ACCESS_KEY_ID"]
  },
  "remediation": "Remove from image; inject via secrets manager or runtime mount; rotate the key."
}
```

> Optionally also emit **SARIF** to plug into code‑scanning dashboards.
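
A sketch of a report model that serializes to the shape above with `System.Text.Json`. The record names are hypothetical and the `context` object is omitted for brevity; null location fields are dropped so each location type carries only its own keys (.NET 5 or later).

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

public sealed record Location(
    string Type,
    string? Image = null,
    string? LayerDigest = null,
    string? Path = null,
    int? Line = null,
    string? ContainerId = null,
    string? EnvKey = null);

public sealed record Finding(
    string DetectorId,
    string Name,
    string Severity,
    double Confidence,
    string ValueSample,
    IReadOnlyList<Location> Locations,
    string Remediation);

public static class Report
{
    static readonly JsonSerializerOptions Options = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,          // detectorId, valueSample, ...
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
        WriteIndented = true,
    };

    public static string ToJson(Finding finding) => JsonSerializer.Serialize(finding, Options);
}
```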

---

## 9) C# implementation sketch

### Project layout

```
SecretsScanner/
  Core/
    IDetector.cs                 // interface: Detect(stream|text, path, context) -> Findings
    RegexDetector.cs             // holds Pattern, Hints, Confidence rules
    EntropyDetector.cs           // Shannon entropy
    JwtDetector.cs               // structural decoding check
    FileClassifier.cs            // text/binary check, ext-based hints
    Scoring.cs                   // combine signals; severity
    PathsHeuristics.cs           // globs & filename rules
    ReportModel.cs               // JSON schema / SARIF
  Docker/
    ImageReader.cs               // reads image tars, layers via Docker.DotNet or stream
    HistoryParser.cs             // extracts ENV/ARG from history
    ContainerInspector.cs        // env, args, mounts, logs (Docker.DotNet)
  Catalog/
    RegexCatalog.cs              // patterns (section 3), per-detector metadata
    Keywords.cs                  // boost/suppress lists
  Cli/
    Program.cs                   // options: image, container, path; json output; fail-on
```
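
As one example from the layout, the text/binary check in `FileClassifier.cs` could look like the sketch below. The 30% ratio comes from the image-scanning step earlier in this document; the 4096-byte sample size and the "any NUL byte means binary" shortcut are assumptions (the latter will misclassify UTF-16 text).

```csharp
using System;
using System.IO;

public static class FileClassifier
{
    // Reads up to sampleSize bytes and reports whether the sample looks like text.
    // A NUL byte, or more than maxNonPrintableRatio control bytes, marks it as binary.
    public static bool LooksLikeText(Stream stream, int sampleSize = 4096, double maxNonPrintableRatio = 0.30)
    {
        byte[] buffer = new byte[sampleSize];
        int read = 0, n;
        while (read < sampleSize && (n = stream.Read(buffer, read, sampleSize - read)) > 0) read += n;
        if (read == 0) return true;                       // empty file: nothing to skip

        int nonPrintable = 0;
        for (int i = 0; i < read; i++)
        {
            byte b = buffer[i];
            if (b == 0) return false;                     // NUL is a strong binary signal
            // Bytes >= 0x80 are treated as printable so UTF-8 text is not penalized.
            bool printable = (b >= 0x20 && b != 0x7F) || b == '\t' || b == '\n' || b == '\r';
            if (!printable) nonPrintable++;
        }
        return nonPrintable <= read * maxNonPrintableRatio;
    }
}
```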

### C# snippets (illustrative)

**Regex catalog**

```csharp
using System.Text.RegularExpressions;

public static class RegexCatalog
{
    public static readonly (string Id, string Name, Regex Rx, string Severity, string Hint)[] Rules =
    {
        ("pem.openssh", "OpenSSH Private Key",
            new Regex(@"-----BEGIN OPENSSH PRIVATE KEY-----", RegexOptions.Compiled),
            "CRITICAL", "Remove private keys from images; use mounts or vault."),
        ("pem.private", "PEM Private Key",
            new Regex(@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----", RegexOptions.Compiled),
            "CRITICAL", "Remove private keys; rotate credentials."),
        ("aws.akid", "AWS Access Key ID",
            new Regex(@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b", RegexOptions.Compiled),
            "HIGH", "Rotate; use IAM roles/STS; remove from code/config."),
        ("github.pat", "GitHub Personal Access Token",
            new Regex(@"\bgh[prusoa]_[A-Za-z0-9]{36}\b", RegexOptions.Compiled),
            "HIGH", "Revoke PAT; use fine-grained tokens; remove from image."),
        // ... add remaining patterns from Section 3
    };
}
```

**Entropy**

```csharp
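using System;

public static class Entropy
{
    // Shannon entropy (bits per char) over the characters of s that belong to alphabet.
    // Assumes an ASCII alphabet; characters outside the count table are ignored.
    public static double Shannon(ReadOnlySpan<char> s, ReadOnlySpan<char> alphabet)
    {
        Span<int> counts = stackalloc int[256];
        counts.Clear(); // stackalloc memory is not guaranteed to be zeroed
        int n = 0;
        foreach (var ch in s)
        {
            // The bounds check prevents an IndexOutOfRangeException for chars >= 256.
            if (ch < counts.Length && alphabet.IndexOf(ch) >= 0) { counts[ch]++; n++; }
        }
        if (n == 0) return 0.0;
        double H = 0.0;
        for (int i = 0; i < counts.Length; i++)
        {
            if (counts[i] == 0) continue;
            double p = counts[i] / (double)n;
            H -= p * Math.Log(p, 2);
        }
        return H;
    }
}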
```

**Candidate extraction (simplified)**

```csharp
// Assumed shape: record Candidate(string Value, string Kind, string Line);
static readonly Regex Base64Token = new(@"[A-Za-z0-9/_\-\+=]{20,}", RegexOptions.Compiled);
static readonly Regex HexToken    = new(@"^[A-Fa-f0-9]{32,}$", RegexOptions.Compiled);

IEnumerable<Candidate> ExtractCandidates(string line)
{
    // Every hex run also matches the base64 class, so match once and classify
    // afterwards; otherwise each hex token would be reported twice.
    foreach (Match m in Base64Token.Matches(line))
    {
        string kind = HexToken.IsMatch(m.Value) ? "hex" : "b64";
        yield return new Candidate(m.Value, kind, line);
    }
}
```

**Scoring**

```csharp
double Score(DetectionSignals s)
{
    double score = 0;
    if (s.RegexHard) score += 0.9;
    if (s.RegexSoft) score += 0.7;
    if (s.EntropyHit) score += 0.4;
    if (s.SuspiciousPath) score += 0.2;
    if (s.StructuralOk) score += 0.2;
    if (s.Suppressor) score -= 0.5;
    return Math.Clamp(score, 0, 1);
}
```

**Docker (Docker.DotNet)**

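* Images: `Images.GetImageHistoryAsync`, `Images.SaveImageAsync` + tar unpack.
* Containers: `Containers.InspectContainerAsync`, `Exec.ExecCreateContainerAsync` + `Exec.StartAndAttachContainerExecAsync`, `Containers.GetArchiveFromContainerAsync`, `Containers.GetContainerLogsAsync`.
* Method names vary between Docker.DotNet releases; check them against the version you target.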

---

## 10) False‑positive control & hygiene

* **Ignore lists**: file globs (`test/**`, `**/*.example.*`), value lists (`REDACTED`, `example`, `dummy`, `changeme`).
* **Public materials**: downrank matches inside `BEGIN PUBLIC KEY`/`BEGIN CERTIFICATE`.
* **Thresholds**: tune entropy and minimum lengths to your codebase; keep per‑detector knobs in config.
* **Masking**: never print full values; keep secure logs.
* **Rate‑limits**: cap per‑file matches; cap per‑container to avoid spam.

---

## 11) CI/CD and policy

* **Build step**: after `docker build`, run image scan; **fail** on High/Critical (configurable).
* **Pre‑deploy**: scan the target runtime's env vars, process args, and mounts (read‑only).
* **Baselining**: allow a first pass to **baseline known leftovers**, then block any **new** secrets.
* **Rotation**: auto‑emit per‑type remediation (e.g., rotate PAT, revoke AWS AK/SK, move to secret manager).
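
The fail-on and baselining rules combine into a simple gate. The sketch below assumes findings are identified by a stable fingerprint (for example a hash of detector ID and value); `CiGate`, the `Severity` enum, and the exit-code convention are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum Severity { Low, Medium, High, Critical }

public static class CiGate
{
    // Returns a process exit code: 1 when any finding at or above failOn is not in the baseline.
    public static int ExitCode(
        IEnumerable<(string Fingerprint, Severity Severity)> findings,
        Severity failOn,
        ISet<string> baseline)
    {
        bool blocked = findings.Any(f => f.Severity >= failOn && !baseline.Contains(f.Fingerprint));
        return blocked ? 1 : 0;
    }
}
```

Baselined findings still appear in the report; they just stop failing the build, so only new secrets block a merge.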

---

## 12) Optional enhancements

* **SBOM‑guided scanning**: use SBOM/file inventory to prioritize text/config assets; cache base layers.
* **JWT structural checks**: base64url‑decode header/payload; verify JSON; flag if plausible.
* **Checksum checks**: Luhn for CCNs (if in scope); simple format checks for cloud tokens.
* **Interactive audit**: CLI `--audit` mode to triage and write an “allowlist/baseline”.
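
The checksum idea is easiest to show with Luhn. A minimal sketch follows; the 12-digit minimum is an assumption to avoid flagging short numeric strings.

```csharp
public static class Luhn
{
    // Returns true when the digits (spaces and dashes ignored) pass the Luhn checksum.
    public static bool IsValid(string input)
    {
        int sum = 0, digits = 0;
        bool doubleIt = false;                 // every second digit from the right is doubled
        for (int i = input.Length - 1; i >= 0; i--)
        {
            char c = input[i];
            if (c == ' ' || c == '-') continue;
            if (c < '0' || c > '9') return false;
            int d = c - '0';
            if (doubleIt) { d *= 2; if (d > 9) d -= 9; }
            sum += d;
            doubleIt = !doubleIt;
            digits++;
        }
        return digits >= 12 && sum % 10 == 0;
    }
}
```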

---

## 13) Minimal “first list” your dev can paste today

**Start with these detectors (high ROI):**

* PEM/OPENSSH private keys
* AWS AKID + secret (context‑aided)
* GitHub PAT, GitLab PAT, NPM, PyPI
* Slack, Stripe, SendGrid, Twilio
* Docker config `auth` field
* DB connection strings (Postgres/MySQL/Mongo/SQLServer)
* JWT
* `.aws/credentials`, `.npmrc`, `.docker/config.json`, `appsettings*.json`, `.env*`, `*.tfstate`, `*kubeconfig*` (path heuristics)
* Entropy (base64/hex/alnum) with context boosts/suppressors

This set targets the most common leak types; add further detectors from Section 3 as needed.

---

### Final note

This blueprint keeps everything **offline** (no external calls), so it’s safe in CI and reproducible. If you later want to add **credential validation** (e.g., confirm an AWS key via STS), make it opt‑in and heavily rate‑limited.
