git.stella-ops.org/docs/dev/scanning-engine.md
Vladimir Moushkov e5629454cf
Create scanning-engine.md
2025-10-31 19:17:41 +02:00

## 0) Scope at a glance
**Scan surfaces**
* **Images (static):** every file in every layer, plus Dockerfile metadata (ENV/ARG/LABEL, history).
* **Runtime (live containers):** env vars, process args, mounted volumes (e.g., `/run/secrets`), logs, selected files created at runtime.
**Detection methods**
1. **Deterministic patterns (regex)** for known secret types.
2. **Heuristics**: entropy scoring for unknown/random secrets.
3. **Contextual signals**: filename/path, key names, nearby keywords, file type hints.
4. **Structural checks**: e.g., JWT decodable, cloud key prefix/length.
5. **(Optional) Lightweight validation**: local checksum/format (no network calls by default).
**Reporting**
* JSON (and optionally SARIF) with: *where*, *what rule matched*, *snippet masked*, *confidence*, *severity*, *layer/container process*, and *remediation hint*.
---
## 1) Docker-aware discovery workflow
### A. Images (static, pre-runtime)
1. **Obtain filesystem + metadata**
   * Prefer the **API**: use the Docker Engine via Docker.DotNet to export the image tar (the `docker save` equivalent; `Images.SaveImageAsync` in current Docker.DotNet releases, so check the name against your version) and read it in memory.
   * Parse `manifest.json` + `config.json`; capture:
     * `config.Env` (final env),
     * **history**/`created_by` for `ENV`/`ARG`/`RUN` strings,
     * labels.
2. **Scan every layer**
   * Stream-extract each layer tar (e.g., SharpCompress).
   * Track **added/modified paths** per layer (so you can report: *layer N, file X*).
   * **Text-only filter**: skip clearly binary files (e.g., sample N bytes; if >30% are non-printable, skip or down-rank).
3. **File content & name/path analysis**
   * Apply **regex detectors** (Section 3) and **entropy** (Section 4).
   * Weigh findings with **context** (Section 5).
4. **Dockerfile/History checks**
   * Flag secrets in `ENV`/`ARG`/`RUN` strings (e.g., `ENV MYSQL_ROOT_PASSWORD=...`).
   * Flag **deleted-later files** that were present in earlier layers (common leak).
   * Highlight missing `.dockerignore` patterns when suspicious files (.env, .pem, .tfstate) entered any layer.
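
The text-only filter in step 2 can be approximated with a byte-level check. This is a minimal sketch, not the scanner's actual `FileClassifier`: the class name `FileClassifierSketch`, the choice to treat bytes at or above 0x80 as printable (so UTF-8 text passes), and the NUL-byte shortcut are assumptions added for illustration.

```csharp
using System;

public static class FileClassifierSketch
{
    // Returns true when the sampled prefix of a file looks like text.
    // maxNonPrintableRatio = 0.30 mirrors the ">30% non-printable" rule above.
    public static bool LooksLikeText(ReadOnlySpan<byte> sample, double maxNonPrintableRatio = 0.30)
    {
        if (sample.IsEmpty) return true; // nothing to classify; scanning it is free

        int nonPrintable = 0;
        foreach (byte b in sample)
        {
            // Assumption: a NUL byte is treated as a definitive binary signal.
            if (b == 0) return false;

            // Tab, LF, CR and everything from 0x20 upward except DEL count as printable.
            bool printable = b == (byte)'\t' || b == (byte)'\n' || b == (byte)'\r'
                             || (b >= 0x20 && b != 0x7F);
            if (!printable) nonPrintable++;
        }
        return (double)nonPrintable / sample.Length <= maxNonPrintableRatio;
    }
}
```

A caller would typically pass the first few kilobytes of each tar entry. The NUL shortcut also rejects UTF-16 text, so it should be dropped if such files matter.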
### B. Running containers (runtime)
1. **Enumerate** containers and **inspect**:
   * `InspectContainerAsync` → `Config.Env`, `HostConfig.Binds`, `Mounts`, image id.
2. **Env var scan**
   * Scan all `key=value` pairs with the same detectors (regex + entropy + context on the key name).
3. **Process args**
   * `docker top` or `/proc/<pid>/cmdline` via `Exec` → scan args for `--password=...`, `--api-key=...`.
4. **Mounted secret paths**
   * Default locations: `/run/secrets/*`, `/var/run/secrets/*`, K8s secret volumes, config maps that may contain creds.
   * Retrieve via `GetArchiveFromContainerAsync` and scan.
5. **Logs (optional but valuable)**
   * Attach/stream logs; scan lines for secret patterns; provide **live redaction** option.

> **Note**: Memory forensics is possible but heavy; treat as optional/IR-only.
---
## 2) High-value filename/path heuristics (fast wins)
Run these **glob/name** checks before content scanning to prioritize files:
**Generic secret indicators**
```
**/*.env **/.env* **/*secret*.* **/*secr*.*
**/*credential*.* **/*creds*.* **/*passwd*
**/password* **/*token*.* **/*apikey*.* **/*api_key*.*
**/*.pem **/*.key **/*.pfx **/*.p12
**/*.jks **/*.keystore **/id_rsa **/id_dsa
**/id_ecdsa **/id_ed25519 **/private.pem **/server.key
**/tls.key **/jwt*.key
```
**Common app/config**
```
**/appsettings*.json **/secrets*.json
**/application.{yml,yaml,properties}
**/application-*.{yml,yaml,properties}
**/config.yaml **/settings.yml **/settings.py
**/wp-config.php **/config.php **/settings.php
**/nuget.config **/settings.xml (Maven) **/gradle.properties
**/docker-compose*.yml **/compose*.yml
**/PublishProfiles/*.pubxml
```
**Cloud/CLI creds**
```
**/.aws/credentials **/.aws/config
**/gcloud/application_default_credentials.json
**/.azure/** **/doctl/config.yaml **/.oci/config
**/.docker/config.json **/.dockercfg
**/.npmrc **/.yarnrc **/.pypirc **/.gem/credentials **/.netrc
```
**Infra/IaC**
```
**/*.tfstate **/*.tfvars* **/kube/config **/.kube/config **/*kubeconfig*
**/service-account*.json **/*-sa.json **/*-key.json
```
**Orchestrator runtime**
```
/run/secrets/* /var/run/secrets/*
```
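
As a rough illustration of how the name checks above might run before content scanning, here is a minimal sketch. It covers only a subset of the globs and approximates them with string tests (it is deliberately looser than the globs themselves); `PathHeuristicsSketch` and `IsSuspicious` are hypothetical names, not part of any library.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PathHeuristicsSketch
{
    // Exact file names taken from the lists above (subset).
    private static readonly string[] ExactNames =
        { "id_rsa", "id_dsa", "id_ecdsa", "id_ed25519", ".npmrc", ".pypirc", ".netrc", ".dockercfg", "wp-config.php" };

    // Extensions taken from the lists above (subset).
    private static readonly string[] Extensions =
        { ".pem", ".key", ".pfx", ".p12", ".jks", ".keystore", ".tfstate", ".env" };

    // Path tails that identify CLI credential files.
    private static readonly string[] PathSuffixes =
        { "/.aws/credentials", "/.docker/config.json", "/.kube/config" };

    // Loose stand-in for globs such as **/*secret*.* and **/*token*.*
    private static readonly Regex NameKeywords = new(
        @"secret|credential|creds|passwd|password|token|api_?key|kubeconfig",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static bool IsSuspicious(string path)
    {
        // Normalize to a rooted, forward-slash path so tar-relative entries work too.
        string normalized = "/" + path.Replace('\\', '/').TrimStart('/');
        string name = normalized.Substring(normalized.LastIndexOf('/') + 1);

        if (name.StartsWith(".env", StringComparison.OrdinalIgnoreCase)) return true;
        if (ExactNames.Contains(name, StringComparer.OrdinalIgnoreCase)) return true;
        if (Extensions.Any(e => name.EndsWith(e, StringComparison.OrdinalIgnoreCase))) return true;
        if (PathSuffixes.Any(s => normalized.EndsWith(s, StringComparison.OrdinalIgnoreCase))) return true;
        if (normalized.StartsWith("/run/secrets/", StringComparison.Ordinal)
            || normalized.StartsWith("/var/run/secrets/", StringComparison.Ordinal)) return true;

        return NameKeywords.IsMatch(name);
    }
}
```

Because the keyword test is broad (a file such as `tokenizer.py` would match), the result is best used to prioritize files for content scanning rather than to raise findings on its own.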
---
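## 3) **Regex detector catalog** (battle-tested patterns)
> Use `RegexOptions.Compiled`. Add `RegexOptions.IgnoreCase` only to keyword-style patterns (config keys such as `aws_access_key_id`); keep patterns with fixed-case prefixes or character classes (e.g., `AKIA…`, `[A-Z0-9]{16}`) case-sensitive, otherwise they over-match.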
> Always **mask** values in reports (e.g., show first 4 + last 4 chars).
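
The masking rule above (first 4 + last 4 characters) could look like the following minimal sketch. `MaskingSketch.Mask` is a hypothetical helper, and fully hiding short values is an added assumption, since a 4 + 4 window would reveal most of a short secret.

```csharp
using System;

public static class MaskingSketch
{
    // Keeps the first and last `keep` characters and replaces the rest with '*'.
    public static string Mask(string value, int keep = 4)
    {
        if (string.IsNullOrEmpty(value)) return string.Empty;

        // Assumption: values too short to mask meaningfully are hidden entirely.
        if (value.Length <= keep * 2 + 4) return new string('*', value.Length);

        return value.Substring(0, keep)
             + new string('*', value.Length - keep * 2)
             + value.Substring(value.Length - keep);
    }
}
```

The masked string keeps the original length, which helps when comparing occurrences by eye but does leak the secret's length; a fixed-width mask is an alternative if that matters.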
### 3.1 Private keys / certificates
* **OpenSSH private key**
`@"-----BEGIN OPENSSH PRIVATE KEY-----"`
* **Generic PEM private key**
`@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----"`
* **PGP private key**
`@"-----BEGIN PGP PRIVATE KEY BLOCK-----"`
> (Public keys/certificates are *not* secrets: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE` → down-rank/ignore.)
### 3.2 Cloud: AWS
* **Access Key ID**
`@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b"`
* **Secret Access Key (context-aided)**
`@"\b[A-Za-z0-9/\+=]{40}\b"`
*Boost only if near `aws|secret|access[_-]?key|AWS_SECRET_ACCESS_KEY` within ~50 chars.*
* **Credentials file lines**
* `@"aws_access_key_id\s*=\s*[A-Z0-9]{20}"`
* `@"aws_secret_access_key\s*=\s*[A-Za-z0-9/\+=]{40}"`
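
The context-aided secret-key rule can be sketched as a two-step check: find 40-character candidates, then look for a keyword within roughly 50 characters on either side. This sketch swaps the `\b` anchors of the pattern above for lookarounds, because `/`, `+` and `=` are non-word characters and `\b` behaves inconsistently next to them; that substitution, and the names `AwsSecretContextSketch` and `Find`, are assumptions rather than part of the catalog.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AwsSecretContextSketch
{
    // Exactly 40 characters from the secret-key alphabet, not embedded in a longer run.
    private static readonly Regex Candidate = new(
        @"(?<![A-Za-z0-9/+=])[A-Za-z0-9/+=]{40}(?![A-Za-z0-9/+=])",
        RegexOptions.Compiled);

    private static readonly Regex Keyword = new(
        @"aws|secret|access[_-]?key",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Yields each candidate's offset and whether a keyword sits within `window` chars of it.
    public static IEnumerable<(int Index, bool NearKeyword)> Find(string text, int window = 50)
    {
        foreach (Match m in Candidate.Matches(text))
        {
            int matchEnd = m.Index + m.Length;
            int start = Math.Max(0, m.Index - window);
            int end = Math.Min(text.Length, matchEnd + window);

            // Only the surrounding text is searched, never the candidate itself.
            string before = text.Substring(start, m.Index - start);
            string after = text.Substring(matchEnd, end - matchEnd);

            yield return (m.Index, Keyword.IsMatch(before) || Keyword.IsMatch(after));
        }
    }
}
```

Candidates with `NearKeyword == false` would stay as low-confidence entropy candidates instead of being reported as AWS secrets.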
### 3.3 Cloud: GCP / Google
* **API key**
`@"\bAIza[0-9A-Za-z\-_]{35}\b"`
* **Service Account JSON** (two-term signature)
* `@"""type""\s*:\s*""service_account"""`
* `@"""private_key""\s*:\s*""-----BEGIN PRIVATE KEY-----"`
### 3.4 Cloud: Azure
* **Storage connection string**
`@"DefaultEndpointsProtocol=https;AccountName=[^;]+;AccountKey=[A-Za-z0-9\+/=]{88};EndpointSuffix=core\.windows\.net"`
* **SAS token (simplified)**
`@"\bsv=\d{4}-\d{2}-\d{2}[^ ]*?&sig=[A-Za-z0-9%/\+=]{40,}\b"`
### 3.5 Dev platforms / SCM
* **GitHub PAT**
`@"\bgh[prusoa]_[A-Za-z0-9]{36}\b"`
* **GitLab PAT**
`@"\bglpat-[A-Za-z0-9\-_]{20,}\b"`
* **NPM token**
* in `.npmrc`: `@"//registry\.npmjs\.org/:_authToken=\s*(npm_[A-Za-z0-9]{36})"`
* raw form: `@"\bnpm_[A-Za-z0-9]{36}\b"`
* **PyPI token**
`@"\bpypi-AgEIcHlwaS5vcmc[A-Za-z0-9\-_]{50,}\b"`
### 3.6 Messaging / SaaS
* **Slack tokens (broad)**
`@"\bxox[a-z]-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}(?:-[A-Za-z0-9-]{8,})?\b"`
* **Stripe**
`@"\bsk_(?:live|test)_[0-9a-zA-Z]{24}\b"`
* **SendGrid**
`@"\bSG\.[A-Za-z0-9\-_]{16,32}\.[A-Za-z0-9\-_]{16,64}\b"`
* **Mailgun**
`@"\bkey-[0-9a-zA-Z]{32}\b"`
* **Twilio**
* SID: `@"\bAC[0-9a-f]{32}\b"`
* Auth token (context aided): `@"\b[0-9a-f]{32}\b"` near `twilio|auth[_-]?token`
* **Discord bot**
`@"\b[A-Za-z\d]{24}\.[A-Za-z\d\-_]{6}\.[A-Za-z\d\-_]{27}\b"`
### 3.7 Database / service connection strings
* **PostgreSQL**
`@"\bpostgres(?:ql)?://[^:\s]+:[^@\s]+@[^/\s]+"`
* **MySQL**
`@"\bmysql://[^:\s]+:[^@\s]+@[^/\s]+"`
* **MongoDB**
`@"\bmongodb(?:\+srv)?://[^:\s]+:[^@\s]+@[^/\s]+"`
* **SQL Server (ADO.NET)**
`@"\bData Source=[^;]+;Initial Catalog=[^;]+;User ID=[^;]+;Password=[^;]+;"`
* **Redis**
`@"\bredis(?:\+ssl)?://(?::[^@]+@)?[^/\s]+"`
* **Basic auth in URL (generic)**
`@"[a-zA-Z][a-zA-Z0-9+\-.]*://[^:/\s]+:[^@/\s]+@[^/\s]+"`
### 3.8 Docker / CLI auth artifacts
* **Docker config.json auth**
`@"""auth""\s*:\s*""[A-Za-z0-9\+/=]{20,}"""`
* **.netrc auth**
`@"(?mi)^machine\s+\S+\s+login\s+\S+\s+password\s+\S+"`
### 3.9 Tokens / JWT
* **JWT (structural)**
`@"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\b"`
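
A regex hit on the JWT shape can be confirmed structurally by decoding the first two segments and checking that each is a JSON object. The sketch below uses only `System.Text.Json`; `JwtStructureSketch` is a hypothetical name, and the check verifies structure only, not signature or claims.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class JwtStructureSketch
{
    // True when the token has three dot-separated parts and both the header
    // and the payload base64url-decode to JSON objects.
    public static bool LooksLikeJwt(string token)
    {
        string[] parts = token.Split('.');
        if (parts.Length != 3) return false;
        return IsJsonObject(parts[0]) && IsJsonObject(parts[1]);
    }

    private static bool IsJsonObject(string base64Url)
    {
        try
        {
            // Convert base64url to standard base64 and restore the stripped padding.
            string b64 = base64Url.Replace('-', '+').Replace('_', '/');
            switch (b64.Length % 4)
            {
                case 2: b64 += "=="; break;
                case 3: b64 += "="; break;
                case 1: return false; // not a valid base64 length
            }

            string json = Encoding.UTF8.GetString(Convert.FromBase64String(b64));
            using JsonDocument doc = JsonDocument.Parse(json);
            return doc.RootElement.ValueKind == JsonValueKind.Object;
        }
        catch (FormatException) { return false; } // invalid base64
        catch (JsonException) { return false; }   // decoded bytes are not JSON
    }
}
```

A token that passes this check would earn the structural bonus from Section 5; one that fails would fall back to being scored as a plain entropy candidate.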
### 3.10 Build tools / package managers
* **NuGet (cleartext)**
`@"<add\s+key=""ClearTextPassword""\s+value=""[^""]+"""`
`@"<add\s+key=""Password""\s+value=""[^""]+"""` *(base64 still secret)*
* **Maven settings.xml**
`@"<server>\s*<id>[^<]+</id>\s*<username>[^<]+</username>\s*<password>[^<]+</password>"`
* **Gradle**
`@"(?i)\bsigning\.password\s*=\s*.+"`
> Keep regexes modular; associate each with:
> `{ Id, Name, Pattern, Severity, Examples, RecommendedRemediation }`.
---
## 4) Entropy detector (catches “unknown” secrets)
**Why:** Many org-specific tokens won't match known regexes.
**Implementation**
* Extract candidate tokens by character class:
  * base64/base64url: `[A-Za-z0-9/_\-\+=]{20,}`
  * hex: `[A-Fa-f0-9]{32,}`
  * general mixed: `[A-Za-z0-9]{24,}`
* Compute **Shannon entropy** per candidate. Use **alphabet-aware thresholds**:
  * **base64/url**: ≥ **4.0** bits/char & length ≥ 24
  * **hex**: ≥ **3.0** bits/char & length ≥ 32
  * **alnum**: ≥ **4.0** bits/char & length ≥ 24
* **Context boosts** (raise confidence) if **within 64 chars** of:
  `password|passwd|pwd|secret|token|apikey|api_key|api-key|client[_-]?secret|private[_-]?key|connectionstring|conn[_-]?str|bearer`
* **Context suppressors** (lower confidence/ignore):
  * File/path contains: `example|sample|test|fixture|dummy`
  * Surrounding line contains: `REDACTED|<redacted>|changeme`
  * Known non-secret blocks: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE`
* Cap **N findings per file** (e.g., 50) to avoid log floods.
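
To make the thresholds concrete, here is a minimal sketch that applies the per-alphabet length and entropy limits listed above. `EntropyGateSketch`, `BitsPerChar` and the `kind` strings (`"b64"`, `"hex"`, `"alnum"`) are hypothetical; the numbers are the ones from the list and are meant to be tuned.

```csharp
using System;
using System.Collections.Generic;

public static class EntropyGateSketch
{
    // Shannon entropy of the token in bits per character.
    public static double BitsPerChar(string token)
    {
        if (token.Length == 0) return 0.0;

        var counts = new Dictionary<char, int>();
        foreach (char c in token)
            counts[c] = counts.TryGetValue(c, out int seen) ? seen + 1 : 1;

        double h = 0.0;
        foreach (int count in counts.Values)
        {
            double p = (double)count / token.Length;
            h -= p * Math.Log(p, 2);
        }
        return h;
    }

    // Applies the alphabet-aware thresholds: minimum length and minimum bits/char.
    public static bool Passes(string token, string kind) => kind switch
    {
        "hex"   => token.Length >= 32 && BitsPerChar(token) >= 3.0,
        "b64"   => token.Length >= 24 && BitsPerChar(token) >= 4.0,
        "alnum" => token.Length >= 24 && BitsPerChar(token) >= 4.0,
        _       => false, // unknown alphabet: never an entropy hit
    };
}
```

Hex gets a lower bar because its alphabet has only 16 symbols, which caps entropy at 4 bits per character; the 64-symbol base64 alphabet can reach 6.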
---
## 5) Scoring & deduping
Combine signals into a **confidence score**:
* +0.9 Regex “hard” match (e.g., OpenSSH private key)
* +0.7 Regex “soft” match (e.g., AWS secret 40-char near keyword)
* +0.4 Entropy pass
* +0.2 Suspicious filename/path
* -0.5 Suppressor keyword/file
* +0.2 Structural check passes (e.g., JWT decodes)
**Severity**
* **Critical**: private keys, cloud root creds, Docker auth, DB creds in URLs, verified JWT signing keys.
* **High**: API tokens (GitHub/GitLab/Slack/Stripe), secrets in ENV/ARG history.
* **Medium**: high-entropy candidates with strong context.
* **Low**: weak context/entropy only, or likely sample values.
**Dedupe** same value across files/layers/envs; keep a single canonical record with **occurrence list**.
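
One way to dedupe without keeping raw secrets in memory longer than needed is to group findings by detector and a hash of the value. The record types and `DedupeSketch.Merge` below are hypothetical shapes for illustration, and using SHA-256 as the grouping key is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public sealed record Occurrence(string Type, string Location);
public sealed record RawFinding(string DetectorId, string Value, Occurrence Where);
public sealed record MergedFinding(string DetectorId, string ValueHash, IReadOnlyList<Occurrence> Occurrences);

public static class DedupeSketch
{
    // Collapses findings that share a detector and value into one record
    // carrying the list of distinct places the value was seen.
    public static IReadOnlyList<MergedFinding> Merge(IEnumerable<RawFinding> findings) =>
        findings
            .GroupBy(f => (f.DetectorId, Hash: Sha256Hex(f.Value)))
            .Select(g => new MergedFinding(
                g.Key.DetectorId,
                g.Key.Hash,
                g.Select(f => f.Where).Distinct().ToList()))
            .ToList();

    private static string Sha256Hex(string value)
    {
        using var sha = SHA256.Create();
        byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(value));
        return BitConverter.ToString(hash).Replace("-", "");
    }
}
```

The hash doubles as a stable fingerprint for baselines and allowlists, so the report never needs to store the unmasked value.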
---
## 6) Docker-specific checks you must implement
* **ENV/ARG leakage in history**
  Parse `config.History[].CreatedBy` or `docker history --no-trunc`.
  Flag any `ENV/ARG` with suspicious key names or values matching detectors.
* **Deleted-later files**
  If a file existed in an earlier layer and got deleted later (common `.env` mishap), still flag it and report **layer** + **instruction** that introduced it.
* **`.dockerignore` advisory**
  If high-risk files (.env, .pem, .tfstate, credentials) entered the build context once, suggest `.dockerignore` entries.
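
Deleted-later files can be found from layer contents alone: image layer tars record a deletion as a whiteout entry named `.wh.<name>` next to the removed path, and `.wh..wh..opq` marks a directory whose lower-layer contents are all hidden. The sketch below tracks which layer last wrote each path and reports what a later whiteout removes. `WhiteoutSketch` and the list-of-paths input shape are assumptions; a real reader would feed it tar entry names per layer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WhiteoutSketch
{
    private const string WhiteoutPrefix = ".wh.";
    private const string OpaqueMarker = ".wh..wh..opq";

    // layers[i] lists the tar entry paths of layer i (0 = base layer), e.g. "app/.env".
    // Returns every path that one layer wrote and a later layer deleted.
    public static List<(string Path, int AddedIn, int DeletedIn)> FindDeletedLater(
        IReadOnlyList<IReadOnlyList<string>> layers)
    {
        var live = new Dictionary<string, int>(StringComparer.Ordinal); // path -> layer that last wrote it
        var deleted = new List<(string Path, int AddedIn, int DeletedIn)>();

        for (int i = 0; i < layers.Count; i++)
        {
            // Pass 1: whiteouts hide content from lower layers only.
            foreach (string entry in layers[i])
            {
                int slash = entry.LastIndexOf('/');
                string dir = entry.Substring(0, slash + 1); // "" for top-level entries
                string name = entry.Substring(slash + 1);

                Func<string, bool> hidden;
                if (name == OpaqueMarker)
                {
                    hidden = p => p != dir && p.StartsWith(dir, StringComparison.Ordinal);
                }
                else if (name.StartsWith(WhiteoutPrefix, StringComparison.Ordinal))
                {
                    string target = dir + name.Substring(WhiteoutPrefix.Length);
                    hidden = p => p == target || p.StartsWith(target + "/", StringComparison.Ordinal);
                }
                else continue;

                foreach (string path in live.Keys.Where(hidden).ToList())
                {
                    deleted.Add((path, live[path], i));
                    live.Remove(path);
                }
            }

            // Pass 2: record what this layer adds or overwrites.
            foreach (string entry in layers[i])
            {
                string name = entry.Substring(entry.LastIndexOf('/') + 1);
                if (!name.StartsWith(WhiteoutPrefix, StringComparison.Ordinal)) live[entry] = i;
            }
        }
        return deleted;
    }
}
```

Each reported path still exists in the tar of layer `AddedIn`, so the scanner can read it from there and map that layer to its `created_by` instruction.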
---
## 7) Runtime inspection rules
* **Environment**
  * Scan all `Env` pairs; **boost** hits for keys containing:
    `PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|CLIENT_SECRET|SAS|CONNECTIONSTRING`
* **Process args**
  * Flag `--password`, `--api-key`, `--token`, `--secret`, `--connection-string`.
* **Mounted secrets**
  * Enumerate `/run/secrets/*`, `/var/run/secrets/*` (Swarm/K8s).
  * Ensure permissions are restrictive; still **scan contents** (apps sometimes copy them elsewhere).
* **Logs**
  * Tail & scan. Provide **optional redaction** pipeline.
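
The environment and process-argument rules above reduce to two small checks, sketched here under assumed names (`RuntimeRulesSketch`, `IsSensitiveEnvKey`, `SensitiveFlags`). The inputs are plain strings such as the `KEY=value` pairs from a container's `Env` list and an argv array.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RuntimeRulesSketch
{
    // Key-name keywords from the list above (CLIENT_SECRET is already covered by SECRET).
    private static readonly Regex SensitiveKey = new(
        @"PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|SAS|CONNECTIONSTRING",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // The flagged options, with or without an inline "=value".
    private static readonly Regex SensitiveFlag = new(
        @"^--(password|api-key|token|secret|connection-string)(?:=.*)?$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // True when the name part of a "KEY=value" pair contains a sensitive keyword.
    // The value is deliberately ignored here; detectors score it separately.
    public static bool IsSensitiveEnvKey(string envPair)
    {
        int eq = envPair.IndexOf('=');
        string key = eq < 0 ? envPair : envPair.Substring(0, eq);
        return SensitiveKey.IsMatch(key);
    }

    // Yields the normalized name of every sensitive flag found in argv.
    public static IEnumerable<string> SensitiveFlags(IEnumerable<string> argv)
    {
        foreach (string arg in argv)
        {
            Match m = SensitiveFlag.Match(arg);
            if (m.Success) yield return "--" + m.Groups[1].Value.ToLowerInvariant();
        }
    }
}
```

Substring keywords such as `KEY`, `PASS` and `SAS` also match harmless names (for example `KEYBOARD_LAYOUT`), which is why the rule is a confidence boost rather than a finding in itself.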
---
## 8) Reporting format (JSON)
Example JSON for one finding:
```json
{
  "detectorId": "aws.accessKeyId",
  "name": "AWS Access Key ID",
  "severity": "HIGH",
  "confidence": 0.92,
  "valueSample": "AKIA************WXYZ",
  "locations": [
    {
      "type": "image-layer-file",
      "image": "repo/app:1.4.2",
      "layerDigest": "sha256:...abc",
      "path": "/app/.env",
      "line": 12
    },
    {
      "type": "container-env",
      "containerId": "f3e9d...",
      "envKey": "AWS_ACCESS_KEY_ID"
    }
  ],
  "context": {
    "filePathScore": 0.2,
    "regexMatch": true,
    "entropy": null,
    "nearbyKeywords": ["AWS_ACCESS_KEY_ID"]
  },
  "remediation": "Remove from image; inject via secrets manager or runtime mount; rotate the key."
}
```
> Optionally also emit **SARIF** to plug into code-scanning dashboards.
---
## 9) C# implementation sketch
### Project layout
```
SecretsScanner/
  Core/
    IDetector.cs            // interface: Detect(stream|text, path, context) -> Findings
    RegexDetector.cs        // holds Pattern, Hints, Confidence rules
    EntropyDetector.cs      // Shannon entropy
    JwtDetector.cs          // structural decoding check
    FileClassifier.cs       // text/binary check, ext-based hints
    Scoring.cs              // combine signals; severity
    PathsHeuristics.cs      // globs & filename rules
    ReportModel.cs          // JSON schema / SARIF
  Docker/
    ImageReader.cs          // reads image tars, layers via Docker.DotNet or stream
    HistoryParser.cs        // extracts ENV/ARG from history
    ContainerInspector.cs   // env, args, mounts, logs (Docker.DotNet)
  Catalog/
    RegexCatalog.cs         // patterns (section 3), per-detector metadata
    Keywords.cs             // boost/suppress lists
  Cli/
    Program.cs              // options: image, container, path; json output; fail-on
```
### C# snippets (illustrative)
**Regex catalog**
```csharp
using System.Text.RegularExpressions;

public static class RegexCatalog
{
    // Each rule: stable id, display name, compiled pattern, severity, remediation hint.
    public static readonly (string Id, string Name, Regex Rx, string Severity, string Hint)[] Rules =
    {
        ("pem.openssh", "OpenSSH Private Key",
            new Regex(@"-----BEGIN OPENSSH PRIVATE KEY-----", RegexOptions.Compiled),
            "CRITICAL", "Remove private keys from images; use mounts or vault."),
        ("pem.private", "PEM Private Key",
            new Regex(@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----", RegexOptions.Compiled),
            "CRITICAL", "Remove private keys; rotate credentials."),
        ("aws.akid", "AWS Access Key ID",
            new Regex(@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b", RegexOptions.Compiled),
            "HIGH", "Rotate; use IAM roles/STS; remove from code/config."),
        ("github.pat", "GitHub Personal Access Token",
            new Regex(@"\bgh[prusoa]_[A-Za-z0-9]{36}\b", RegexOptions.Compiled),
            "HIGH", "Revoke PAT; use fine-grained tokens; remove from image."),
        // ... add remaining patterns from Section 3
    };
}
```
**Entropy**
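```csharp
using System;

public static class Entropy
{
    // Shannon entropy in bits per character, counting only characters that
    // belong to `alphabet` (expected to be ASCII, e.g. the base64 or hex set).
    public static double Shannon(ReadOnlySpan<char> s, ReadOnlySpan<char> alphabet)
    {
        Span<int> counts = stackalloc int[256];
        counts.Clear(); // do not rely on stackalloc zero-initialization
        int n = 0;
        foreach (var ch in s)
        {
            // ch < 256 keeps the index in range even if the alphabet holds non-ASCII chars.
            if (ch < 256 && alphabet.IndexOf(ch) >= 0) { counts[ch]++; n++; }
        }
        if (n == 0) return 0.0;
        double H = 0.0;
        for (int i = 0; i < counts.Length; i++)
        {
            if (counts[i] == 0) continue;
            double p = counts[i] / (double)n;
            H -= p * Math.Log(p, 2);
        }
        return H;
    }
}
```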
**Candidate extraction (simplified)**
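```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
// Hypothetical carrier type: matched value, alphabet kind ("b64"/"hex"), source line.
public sealed record Candidate(string Value, string Kind, string Line);
public static class CandidateExtractor
{
    static readonly Regex Base64Token = new(@"[A-Za-z0-9/_\-\+=]{20,}", RegexOptions.Compiled);
    static readonly Regex HexToken = new(@"[A-Fa-f0-9]{32,}", RegexOptions.Compiled);
    // A hex run also matches the base64 class, so it is yielded under both kinds.
    public static IEnumerable<Candidate> ExtractCandidates(string line)
    {
        foreach (Match m in Base64Token.Matches(line)) yield return new Candidate(m.Value, "b64", line);
        foreach (Match m in HexToken.Matches(line)) yield return new Candidate(m.Value, "hex", line);
    }
}
```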
**Scoring**
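```csharp
using System;

// Hypothetical signal bundle: one flag per bullet in Section 5.
public sealed record DetectionSignals(
    bool RegexHard, bool RegexSoft, bool EntropyHit,
    bool SuspiciousPath, bool StructuralOk, bool Suppressor);

public static class Scoring
{
    // Sums the Section 5 weights and clamps the result to [0, 1].
    public static double Score(DetectionSignals s)
    {
        double score = 0;
        if (s.RegexHard) score += 0.9;
        if (s.RegexSoft) score += 0.7;
        if (s.EntropyHit) score += 0.4;
        if (s.SuspiciousPath) score += 0.2;
        if (s.StructuralOk) score += 0.2;
        if (s.Suppressor) score -= 0.5;
        return Math.Clamp(score, 0, 1);
    }
}
```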
**Docker (Docker.DotNet)**
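* Images: `Images.GetImageHistoryAsync`, plus a tar export (`Images.SaveImageAsync`) and unpack.
* Containers: `Containers.InspectContainerAsync`, `Exec.ExecCreateContainerAsync` + `Exec.StartAndAttachContainerExecAsync`, `Containers.GetArchiveFromContainerAsync`, `Containers.GetContainerLogsAsync`.
* Method names have shifted between Docker.DotNet releases; confirm them against the version you reference.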
---
## 10) False-positive control & hygiene
* **Ignore lists**: file globs (`test/**`, `**/*.example.*`), value lists (`REDACTED`, `example`, `dummy`, `changeme`).
* **Public materials**: down-rank matches inside `BEGIN PUBLIC KEY`/`BEGIN CERTIFICATE`.
* **Thresholds**: tune entropy and minimum lengths to your codebase; keep per-detector knobs in config.
* **Masking**: never print full values; keep secure logs.
* **Rate limits**: cap matches per file and per container to avoid spam.
---
## 11) CI/CD and policy
* **Build step**: after `docker build`, run image scan; **fail** on High/Critical (configurable).
* **Pre-deploy**: scan the runtime environment (env vars, args, mounts) in read-only mode.
* **Baselining**: allow a first pass to **baseline known leftovers**, then block any **new** secrets.
* **Rotation**: auto-emit per-type remediation (e.g., rotate PAT, revoke AWS AK/SK, move to secret manager).
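
The build-step and baselining rules combine naturally into a single exit-code decision. This is a minimal sketch under assumed names (`Severity`, `CiGateSketch.ExitCode`); the fingerprint is whatever stable identifier the baseline stores, for example a hash of detector id and value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum Severity { Low = 0, Medium = 1, High = 2, Critical = 3 }

public static class CiGateSketch
{
    // Returns a process exit code: 0 when the build may pass, 1 when at least one
    // finding at or above `failOn` is not already recorded in the baseline.
    public static int ExitCode(
        IEnumerable<(string Fingerprint, Severity Severity)> findings,
        Severity failOn,
        ISet<string> baseline)
    {
        bool blocking = findings.Any(f =>
            f.Severity >= failOn && !baseline.Contains(f.Fingerprint));
        return blocking ? 1 : 0;
    }
}
```

With `failOn` set to `Severity.High`, medium and low findings are still reported but never break the build, and baselined leftovers stay visible without blocking.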
---
## 12) Optional enhancements
* **SBOM-guided scanning**: use SBOM/file inventory to prioritize text/config assets; cache base layers.
* **JWT structural checks**: base64url-decode header/payload; verify JSON; flag if plausible.
* **Checksum checks**: Luhn for CCNs (if in scope); simple format checks for cloud tokens.
* **Interactive audit**: CLI `--audit` mode to triage and write an “allowlist/baseline”.
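
For the checksum item above, the Luhn check is short enough to show in full. `LuhnSketch.IsValid` is a hypothetical helper that expects a digits-only string; stripping spaces or dashes is left to the caller.

```csharp
public static class LuhnSketch
{
    // Standard Luhn checksum: walking from the right, double every second digit,
    // subtract 9 from any result above 9, and require the total to be divisible by 10.
    public static bool IsValid(string digits)
    {
        if (string.IsNullOrEmpty(digits)) return false;

        int sum = 0;
        bool doubleIt = false; // the rightmost digit is not doubled
        for (int i = digits.Length - 1; i >= 0; i--)
        {
            char c = digits[i];
            if (c < '0' || c > '9') return false; // digits only

            int d = c - '0';
            if (doubleIt)
            {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

A passing checksum only shows that a number is well-formed, so it works best as a filter that removes random digit runs before any card-number finding is raised.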
---
## 13) Minimal “first list” your dev can paste today
**Start with these detectors (high ROI):**
* PEM/OPENSSH private keys
* AWS AKID + secret (context-aided)
* GitHub PAT, GitLab PAT, NPM, PyPI
* Slack, Stripe, SendGrid, Twilio
* Docker config `auth` field
* DB connection strings (Postgres/MySQL/Mongo/SQLServer)
* JWT
* `.aws/credentials`, `.npmrc`, `.docker/config.json`, `appsettings*.json`, `.env*`, `*.tfstate`, `*kubeconfig*` (path heuristics)
* Entropy (base64/hex/alnum) with context boosts/suppressors
That set alone covers the most common classes of real-world leaks.
---
### Final note
This blueprint keeps everything **offline** (no external calls), so it's safe in CI and reproducible. If you later want to add **credential validation** (e.g., confirm an AWS key via STS), make it opt-in and heavily rate-limited.