Create scanning-engine.md
Vladimir Moushkov
2025-10-31 19:17:41 +02:00
parent c63749d535
commit e5629454cf

docs/dev/scanning-engine.md Normal file

@@ -0,0 +1,525 @@
## 0) Scope at a glance
**Scan surfaces**
* **Images (static):** every file in every layer, plus Dockerfile metadata (ENV/ARG/LABEL, history).
* **Runtime (live containers):** env vars, process args, mounted volumes (e.g., `/run/secrets`), logs, selected files created at runtime.
**Detection methods**
1. **Deterministic patterns (regex)** for known secret types.
2. **Heuristics**: entropy scoring for unknown/random secrets.
3. **Contextual signals**: filename/path, key names, nearby keywords, file type hints.
4. **Structural checks**: e.g., JWT decodable, cloud key prefix/length.
5. **(Optional) Lightweight validation**: local checksum/format (no network calls by default).
**Reporting**
* JSON (and optionally SARIF) with: *where*, *what rule matched*, *snippet masked*, *confidence*, *severity*, *layer/container process*, and *remediation hint*.
---
## 1) Docker-aware discovery workflow
### A. Images (static, pre-runtime)
1. **Obtain filesystem + metadata**
* Prefer the **API**: use the Docker Engine API (Docker.DotNet), e.g. `Images.GetImageAsync`, to **export the image as a tar** stream (the equivalent of `docker save`) and process it in memory.
* Parse `manifest.json` + `config.json`; capture:
* `config.Env` (final env),
* **history**/`created_by` for `ENV`/`ARG`/`RUN` strings,
* labels.
2. **Scan every layer**
* Stream-extract each layer tar (e.g., SharpCompress).
* Track **added/modified paths** per layer (so you can report: *layer N, file X*).
* **Text-only filter**: skip clearly binary files (e.g., sample N bytes; if >30% are non-printable, skip or down-rank).
3. **File content & name/path analysis**
* Apply **regex detectors** (Section 3) and **entropy** (Section 4).
* Weigh findings with **context** (Section 5).
4. **Dockerfile/History checks**
* Flag secrets in `ENV`/`ARG`/`RUN` strings (e.g., `ENV MYSQL_ROOT_PASSWORD=...`).
* Flag **deleted-later files** that were present in earlier layers (common leak).
* Highlight missing `.dockerignore` patterns when suspicious files (.env, .pem, .tfstate) entered any layer.
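
The text-only filter from step 2 could be sketched as below. This is a minimal illustration, not a fixed design: `TextSniffer` and `LooksLikeText` are hypothetical names, the 30% ratio comes from the step above, and treating any NUL byte as "binary" is an added assumption. Bytes at or above 0x80 are counted as printable so UTF-8 text is not rejected.

```csharp
using System;

public static class TextSniffer
{
    // Hypothetical helper: returns true when a sampled prefix of a file looks like text.
    // Assumption: a NUL byte, or too many ASCII control bytes, marks the sample as binary.
    public static bool LooksLikeText(ReadOnlySpan<byte> sample, double maxNonPrintableRatio = 0.30)
    {
        if (sample.IsEmpty) return true; // nothing to scan, nothing to skip

        int nonPrintable = 0;
        foreach (byte b in sample)
        {
            if (b == 0) return false; // NUL is a strong binary signal
            bool isWhitespace = b == (byte)'\t' || b == (byte)'\n' || b == (byte)'\r';
            if ((b < 0x20 && !isWhitespace) || b == 0x7F) nonPrintable++;
        }
        return (double)nonPrintable / sample.Length <= maxNonPrintableRatio;
    }
}
```

In practice the sample would be the first few KB of each tar entry, read before deciding whether to run the content detectors.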
### B. Running containers (runtime)
1. **Enumerate** containers and **inspect**:
* `InspectContainerAsync` → `Config.Env`, `HostConfig.Binds`, `Mounts`, image id.
2. **Env var scan**
* Scan all `key=value` pairs with the same detectors (regex + entropy + context on the key name).
3. **Process args**
* `docker top` or `/proc/<pid>/cmdline` via `Exec` → scan args for `--password=...`, `--api-key=...`.
4. **Mounted secret paths**
* Default locations: `/run/secrets/*`, `/var/run/secrets/*`, K8s secret volumes, config maps that may contain creds.
* Retrieve via `GetArchiveFromContainerAsync` and scan.
5. **Logs (optional but valuable)**
* Attach/stream logs; scan lines for secret patterns; provide **live redaction** option.
> **Note**: Memory forensics is possible but heavy; treat as optional/IR-only.
---
## 2) High-value filename/path heuristics (fast wins)
Run these **glob/name** checks before content scanning to prioritize files:
**Generic secret indicators**
```
**/*.env **/.env* **/*secret*.* **/*secr*.*
**/*credential*.* **/*creds*.* **/*passwd*
**/password* **/*token*.* **/*apikey*.* **/*api_key*.*
**/*.pem **/*.key **/*.pfx **/*.p12
**/*.jks **/*.keystore **/id_rsa **/id_dsa
**/id_ecdsa **/id_ed25519 **/private.pem **/server.key
**/tls.key **/jwt*.key
```
**Common app/config**
```
**/appsettings*.json **/secrets*.json
**/application.{yml,yaml,properties}
**/application-*.{yml,yaml,properties}
**/config.yaml **/settings.yml **/settings.py
**/wp-config.php **/config.php **/settings.php
**/nuget.config **/settings.xml (Maven) **/gradle.properties
**/docker-compose*.yml **/compose*.yml
**/PublishProfiles/*.pubxml
```
**Cloud/CLI creds**
```
**/.aws/credentials **/.aws/config
**/gcloud/application_default_credentials.json
**/.azure/** **/doctl/config.yaml **/.oci/config
**/.docker/config.json **/.dockercfg
**/.npmrc **/.yarnrc **/.pypirc **/.gem/credentials **/.netrc
```
**Infra/IaC**
```
**/*.tfstate **/*.tfvars* **/kube/config **/.kube/config **/*kubeconfig*
**/service-account*.json **/*-sa.json **/*-key.json
```
**Orchestrator runtime**
```
/run/secrets/* /var/run/secrets/*
```
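
One way to apply these globs without extra packages is to translate them to regexes. The sketch below is illustrative only: `PathHeuristics`, `GlobToRegex`, and `IsSuspicious` are hypothetical names, only a handful of the globs above are included, and brace groups such as `{yml,yaml}` are deliberately not handled.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class PathHeuristics
{
    // A few globs copied from the lists above; extend with the rest as needed.
    static readonly string[] Globs =
    {
        "**/.env*", "**/*.pem", "**/*.key", "**/id_rsa",
        "**/.aws/credentials", "**/.docker/config.json", "**/*.tfstate",
    };

    static readonly Regex[] Compiled = Globs.Select(GlobToRegex).ToArray();

    // Supports "**/" (any directory depth), "*" and "?" (within one path segment).
    public static Regex GlobToRegex(string glob)
    {
        var sb = new StringBuilder("^");
        for (int i = 0; i < glob.Length; i++)
        {
            if (string.CompareOrdinal(glob, i, "**/", 0, 3) == 0) { sb.Append("(?:.*/)?"); i += 2; }
            else if (glob[i] == '*') sb.Append("[^/]*");
            else if (glob[i] == '?') sb.Append("[^/]");
            else sb.Append(Regex.Escape(glob[i].ToString()));
        }
        sb.Append('$');
        return new Regex(sb.ToString(),
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    // Normalizes Windows separators so the same globs work on any host.
    public static bool IsSuspicious(string path)
    {
        string normalized = path.Replace('\\', '/');
        return Compiled.Any(rx => rx.IsMatch(normalized));
    }
}
```

A hit here only raises priority and contributes the filename/path score; the content detectors still decide whether a finding is reported.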
---
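## 3) **Regex detector catalog** (battle-tested patterns)
> Use `RegexOptions.Compiled | RegexOptions.IgnoreCase` by default, but keep patterns case-sensitive where the token format requires it (e.g., fixed prefixes such as `AKIA` or `ghp_`).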
> Always **mask** values in reports (e.g., show first 4 + last 4 chars).
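
The masking rule above (first 4 + last 4 characters) might look like this. `Masking.Mask` is a hypothetical helper, and fully masking short values is an added assumption so that most of a short secret is never revealed.

```csharp
using System;

public static class Masking
{
    // Keeps the first and last `keep` characters and replaces the middle with '*'.
    // Assumption: values of length <= 3 * keep are masked entirely.
    public static string Mask(string value, int keep = 4)
    {
        if (string.IsNullOrEmpty(value)) return string.Empty;
        if (value.Length <= keep * 3) return new string('*', value.Length);
        return value[..keep] + new string('*', value.Length - 2 * keep) + value[^keep..];
    }
}
```

For a 20-character AWS access key ID this produces the same shape as the `valueSample` field in the report example in Section 8.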
### 3.1 Private keys / certificates
* **OpenSSH private key**
`@"-----BEGIN OPENSSH PRIVATE KEY-----"`
* **Generic PEM private key**
`@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----"`
* **PGP private key**
`@"-----BEGIN PGP PRIVATE KEY BLOCK-----"`
> (Public keys/certificates are *not* secrets: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE` → down-rank/ignore.)
### 3.2 Cloud: AWS
* **Access Key ID**
`@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b"`
* **Secret Access Key (context-aided)**
`@"\b[A-Za-z0-9/\+=]{40}\b"`
*Boost only if near `aws|secret|access[_-]?key|AWS_SECRET_ACCESS_KEY` within ~50 chars.*
* **Credentials file lines**
* `@"aws_access_key_id\s*=\s*[A-Z0-9]{20}"`
* `@"aws_secret_access_key\s*=\s*[A-Za-z0-9/\+=]{40}"`
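
A sketch of the context-aided check for the 40-character secret: a candidate is reported only when a keyword appears within about 50 characters on either side. The names `AwsSecretDetector` and `FindLikelySecrets` are hypothetical. Note one deliberate deviation from the pattern above: the sketch uses lookarounds instead of `\b`, because `\b` does not behave as a boundary next to `/`, `+`, or `=`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AwsSecretDetector
{
    // Exactly 40 characters from the secret-key alphabet, not embedded in a longer run.
    static readonly Regex Candidate = new(
        @"(?<![A-Za-z0-9/+=])[A-Za-z0-9/+=]{40}(?![A-Za-z0-9/+=])", RegexOptions.Compiled);

    static readonly Regex Keyword = new(
        @"aws|secret|access[_-]?key", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Yields candidates that have a keyword within `window` characters before or after.
    public static IEnumerable<string> FindLikelySecrets(string text, int window = 50)
    {
        foreach (Match m in Candidate.Matches(text))
        {
            int matchEnd = m.Index + m.Length;
            int start = Math.Max(0, m.Index - window);
            int end = Math.Min(text.Length, matchEnd + window);

            string before = text.Substring(start, m.Index - start);
            string after = text.Substring(matchEnd, end - matchEnd);

            if (Keyword.IsMatch(before) || Keyword.IsMatch(after))
                yield return m.Value;
        }
    }
}
```

Candidates without a nearby keyword are not discarded outright in the overall design; they can still be picked up by the entropy detector with a lower score.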
### 3.3 Cloud: GCP / Google
* **API key**
`@"\bAIza[0-9A-Za-z\-_]{35}\b"`
* **Service Account JSON** (two-term signature)
* `@"""type""\s*:\s*""service_account"""`
* `@"""private_key""\s*:\s*""-----BEGIN PRIVATE KEY-----"`
### 3.4 Cloud: Azure
* **Storage connection string**
`@"DefaultEndpointsProtocol=https;AccountName=[^;]+;AccountKey=[A-Za-z0-9\+/=]{88};EndpointSuffix=core\.windows\.net"`
* **SAS token (simplified)**
`@"\bsv=\d{4}-\d{2}-\d{2}[^ ]*?&sig=[A-Za-z0-9%/\+=]{40,}\b"`
### 3.5 Dev platforms / SCM
* **GitHub PAT**
`@"\bgh[prusoa]_[A-Za-z0-9]{36}\b"`
* **GitLab PAT**
`@"\bglpat-[A-Za-z0-9\-_]{20,}\b"`
* **NPM token**
* in `.npmrc`: `@"//registry\.npmjs\.org/:_authToken=\s*(npm_[A-Za-z0-9]{36})"`
* raw form: `@"\bnpm_[A-Za-z0-9]{36}\b"`
* **PyPI token**
`@"\bpypi-AgEIcHlwaS5vcmc[A-Za-z0-9\-_]{50,}\b"`
### 3.6 Messaging / SaaS
* **Slack tokens (broad)**
`@"\bxox[a-z]-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}-[A-Za-z0-9-]{8,}(?:-[A-Za-z0-9-]{8,})?\b"`
* **Stripe**
`@"\bsk_(?:live|test)_[0-9a-zA-Z]{24}\b"`
* **SendGrid**
`@"\bSG\.[A-Za-z0-9\-_]{16,32}\.[A-Za-z0-9\-_]{16,64}\b"`
* **Mailgun**
`@"\bkey-[0-9a-zA-Z]{32}\b"`
* **Twilio**
* SID: `@"\bAC[0-9a-f]{32}\b"`
* Auth token (context aided): `@"\b[0-9a-f]{32}\b"` near `twilio|auth[_-]?token`
* **Discord bot**
`@"\b[A-Za-z\d]{24}\.[A-Za-z\d\-_]{6}\.[A-Za-z\d\-_]{27}\b"`
### 3.7 Database / service connection strings
* **PostgreSQL**
`@"\bpostgres(?:ql)?://[^:\s]+:[^@\s]+@[^/\s]+"`
* **MySQL**
`@"\bmysql://[^:\s]+:[^@\s]+@[^/\s]+"`
* **MongoDB**
`@"\bmongodb(?:\+srv)?://[^:\s]+:[^@\s]+@[^/\s]+"`
* **SQL Server (ADO.NET)**
`@"\bData Source=[^;]+;Initial Catalog=[^;]+;User ID=[^;]+;Password=[^;]+;"`
* **Redis**
`@"\bredis(?:\+ssl)?://(?::[^@]+@)?[^/\s]+"`
* **Basic auth in URL (generic)**
`@"[a-zA-Z][a-zA-Z0-9+\-.]*://[^:/\s]+:[^@/\s]+@[^/\s]+"`
### 3.8 Docker / CLI auth artifacts
* **Docker config.json auth**
`@"""auth""\s*:\s*""[A-Za-z0-9\+/=]{20,}"""`
* **.netrc auth**
`@"(?mi)^machine\s+\S+\s+login\s+\S+\s+password\s+\S+"`
### 3.9 Tokens / JWT
* **JWT (structural)**
`@"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\b"`
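
The regex only checks the shape. A structural check can then confirm that the first two segments base64url-decode to JSON objects, which removes most look-alikes. The sketch below assumes `System.Text.Json` is available; `JwtCheck.LooksLikeJwt` is a hypothetical name, and the signature segment is intentionally not verified.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class JwtCheck
{
    // True when the token has three dot-separated parts and both the header
    // and the payload decode to JSON objects. The signature is not validated.
    public static bool LooksLikeJwt(string token)
    {
        string[] parts = token.Split('.');
        return parts.Length == 3 && IsJsonObject(parts[0]) && IsJsonObject(parts[1]);
    }

    static bool IsJsonObject(string base64Url)
    {
        try
        {
            // base64url -> base64: restore the standard alphabet and the '=' padding.
            string s = base64Url.Replace('-', '+').Replace('_', '/');
            s = s.PadRight(s.Length + (4 - s.Length % 4) % 4, '=');

            string json = Encoding.UTF8.GetString(Convert.FromBase64String(s));
            using JsonDocument doc = JsonDocument.Parse(json);
            return doc.RootElement.ValueKind == JsonValueKind.Object;
        }
        catch (Exception e) when (e is FormatException or JsonException)
        {
            return false; // not base64, or not JSON
        }
    }
}
```

A passing check maps to the "structural check passes" signal in the scoring section.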
### 3.10 Build tools / package managers
* **NuGet (cleartext)**
`@"<add\s+key=""ClearTextPassword""\s+value=""[^""]+"""`
`@"<add\s+key=""Password""\s+value=""[^""]+"""` *(base64 still secret)*
* **Maven settings.xml**
`@"<server>\s*<id>[^<]+</id>\s*<username>[^<]+</username>\s*<password>[^<]+</password>"`
* **Gradle**
`@"(?i)\bsigning\.password\s*=\s*.+"`
> Keep regexes modular; associate each with:
> `{ Id, Name, Pattern, Severity, Examples, RecommendedRemediation }`.
---
## 4) Entropy detector (catches “unknown” secrets)
**Why:** Many org-specific tokens won't match known regexes.
**Implementation**
* Extract candidate tokens by character class:
* base64/base64url: `[A-Za-z0-9/_\-\+=]{20,}`
* hex: `[A-Fa-f0-9]{32,}`
* general mixed: `[A-Za-z0-9]{24,}`
* Compute **Shannon entropy** per candidate. Use **alphabet-aware thresholds**:
* **base64/url**: ≥ **4.0** bits/char & length ≥ 24
* **hex**: ≥ **3.0** bits/char & length ≥ 32
* **alnum**: ≥ **4.0** bits/char & length ≥ 24
* **Context boosts** (raise confidence) if **within 64 chars** of:
`password|passwd|pwd|secret|token|apikey|api_key|api-key|client[_-]?secret|private[_-]?key|connectionstring|conn[_-]?str|bearer`
* **Context suppressors** (lower confidence/ignore):
* File/path contains: `example|sample|test|fixture|dummy`
* Surrounding line contains: `REDACTED|<redacted>|changeme`
* Known non-secret blocks: `BEGIN PUBLIC KEY`, `BEGIN CERTIFICATE`
* Cap **N findings per file** (e.g., 50) to avoid log floods.
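
For reference, the per-candidate entropy used above is the standard Shannon formula over character frequencies, where $c_i$ is the count of symbol $i$ in a candidate of length $n$:

$$
H = -\sum_{i} p_i \log_2 p_i, \qquad p_i = \frac{c_i}{n}
$$

As a worked example, the string `aabbccdd` has four symbols with $p_i = 1/4$ each:

$$
H = -4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2 \text{ bits/char}
$$

The upper bounds explain why the thresholds differ per alphabet: a hex string cannot exceed $\log_2 16 = 4$ bits/char and a base64 string cannot exceed $\log_2 64 = 6$ bits/char. A candidate of length $n$ also cannot exceed $\log_2 n$, so a 24-character token tops out at about 4.58 bits/char regardless of alphabet.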
---
## 5) Scoring & deduping
Combine signals into a **confidence score**:
* +0.9 Regex “hard” match (e.g., OpenSSH private key)
* +0.7 Regex “soft” match (e.g., AWS secret 40-char value near a keyword)
* +0.4 Entropy pass
* +0.2 Suspicious filename/path
* -0.5 Suppressor keyword/file
* +0.2 Structural check passes (e.g., JWT decodes)
**Severity**
* **Critical**: private keys, cloud root creds, Docker auth, DB creds in URLs, verified JWT signing keys.
* **High**: API tokens (GitHub/GitLab/Slack/Stripe), secrets in ENV/ARG history.
* **Medium**: high-entropy candidates with strong context.
* **Low**: weak context/entropy only, or likely sample values.
**Dedupe** same value across files/layers/envs; keep a single canonical record with **occurrence list**.
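
Deduplication could be sketched as grouping by detector and a hash of the value, so the raw secret never has to serve as a dictionary key or appear in the canonical record. All type names here (`Occurrence`, `RawFinding`, `CanonicalFinding`, `Deduper`) are hypothetical, and hashing with SHA-256 is an assumption rather than a requirement of the design. The sketch assumes .NET 5 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public record Occurrence(string Type, string Location);
public record RawFinding(string DetectorId, string Value, Occurrence Where);
public record CanonicalFinding(string DetectorId, string ValueHash, IReadOnlyList<Occurrence> Occurrences);

public static class Deduper
{
    // One canonical record per (detector, value); occurrences are collected and de-duplicated.
    public static List<CanonicalFinding> Dedupe(IEnumerable<RawFinding> findings) =>
        findings
            .GroupBy(f => (f.DetectorId, Hash: Sha256Hex(f.Value)))
            .Select(g => new CanonicalFinding(
                g.Key.DetectorId,
                g.Key.Hash,
                g.Select(f => f.Where).Distinct().ToList()))
            .ToList();

    static string Sha256Hex(string value) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(value)));
}
```

The occurrence list corresponds to the `locations` array in the JSON report.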
---
## 6) Docker-specific checks you must implement
* **ENV/ARG leakage in history**
Parse `config.History[].CreatedBy` or `docker history --no-trunc`.
Flag any `ENV/ARG` with suspicious key names or values matching detectors.
* **Deleted-later files**
If a file existed in an earlier layer and got deleted later (common `.env` mishap), still flag it and report **layer** + **instruction** that introduced it.
* **`.dockerignore` advisory**
If high-risk files (.env, .pem, .tfstate, credentials) entered the build context once, suggest `.dockerignore` entries.
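
Deleted-later detection can lean on how image layers record deletions: a removed file shows up in a later layer as a whiteout entry named `.wh.<name>`, and `.wh..wh..opq` marks a directory whose lower-layer contents are all hidden. The sketch below works on plain lists of tar entry names, ordered bottom layer first. `WhiteoutTracker` and `FindDeletedLater` are hypothetical names, and mapping the layer index back to the build instruction is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WhiteoutTracker
{
    const string WhiteoutPrefix = ".wh.";
    const string OpaqueMarker = ".wh..wh..opq";

    // Returns files that were added in one layer and hidden by a whiteout in a later one.
    public static List<(string Path, int AddedIn, int DeletedIn)> FindDeletedLater(
        IReadOnlyList<IReadOnlyList<string>> layers)
    {
        var firstSeen = new Dictionary<string, int>(StringComparer.Ordinal);
        var deleted = new List<(string Path, int AddedIn, int DeletedIn)>();

        for (int i = 0; i < layers.Count; i++)
        {
            // Whiteouts only affect entries from lower layers, hence the "< i" filter.
            void Remove(Func<string, bool> match)
            {
                foreach (string key in firstSeen.Keys.Where(k => firstSeen[k] < i && match(k)).ToList())
                {
                    deleted.Add((key, firstSeen[key], i));
                    firstSeen.Remove(key);
                }
            }

            foreach (string raw in layers[i])
            {
                string entry = Normalize(raw);
                if (entry.Length == 0 || entry.EndsWith('/')) continue; // directory entry

                int slash = entry.LastIndexOf('/');
                string dir = entry[..(slash + 1)];   // "" or e.g. "app/"
                string name = entry[(slash + 1)..];

                if (name == OpaqueMarker)
                {
                    Remove(k => k.StartsWith(dir, StringComparison.Ordinal));
                }
                else if (name.StartsWith(WhiteoutPrefix, StringComparison.Ordinal))
                {
                    string target = dir + name[WhiteoutPrefix.Length..];
                    Remove(k => k == target || k.StartsWith(target + "/", StringComparison.Ordinal));
                }
                else
                {
                    firstSeen.TryAdd(entry, i); // keep the layer that introduced the file
                }
            }
        }
        return deleted;
    }

    // Tar entries may be written as "./app/x", "/app/x", or "app/x".
    static string Normalize(string path)
    {
        if (path.StartsWith("./", StringComparison.Ordinal)) path = path[2..];
        return path.TrimStart('/');
    }
}
```

Each returned path should still be scanned from the layer that added it, since the content remains retrievable from the image even though the final filesystem no longer shows it.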
---
## 7) Runtime inspection rules
* **Environment**
* Scan all `Env` pairs; **boost** hits for keys containing:
`PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|CLIENT_SECRET|SAS|CONNECTIONSTRING`
* **Process args**
* Flag `--password`, `--api-key`, `--token`, `--secret`, `--connection-string`.
* **Mounted secrets**
* Enumerate `/run/secrets/*`, `/var/run/secrets/*` (Swarm/K8s).
* Ensure permissions are restrictive; still **scan contents** (apps sometimes copy them elsewhere).
* **Logs**
* Tail & scan. Provide **optional redaction** pipeline.
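
The environment and process-argument rules above could be sketched as follows. `RuntimeRules` and its methods are hypothetical names; the input is assumed to be the `KEY=value` strings from `Config.Env` and an already-split argument list. Substring matching on key names is intentionally broad, so a hit here is a confidence boost and not a finding on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RuntimeRules
{
    static readonly Regex SensitiveKey = new(
        @"PASSWORD|PASS|PWD|SECRET|TOKEN|KEY|CLIENT_SECRET|SAS|CONNECTIONSTRING",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    static readonly Regex SensitiveArg = new(
        @"^(?<flag>--(?:password|api-key|token|secret|connection-string))(?:=(?<value>.*))?$",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Returns env keys whose name suggests a secret and whose value is non-empty.
    public static IEnumerable<string> SensitiveEnvKeys(IEnumerable<string> env)
    {
        foreach (string pair in env)
        {
            int eq = pair.IndexOf('=');
            if (eq <= 0 || eq == pair.Length - 1) continue; // no key, or empty value
            string key = pair[..eq];
            if (SensitiveKey.IsMatch(key)) yield return key;
        }
    }

    // Returns flags that carry a value, either inline (--flag=value) or as the next argument.
    public static IEnumerable<string> SensitiveArgs(IReadOnlyList<string> args)
    {
        for (int i = 0; i < args.Count; i++)
        {
            Match m = SensitiveArg.Match(args[i]);
            if (!m.Success) continue;

            bool inlineValue = m.Groups["value"].Success && m.Groups["value"].Length > 0;
            bool nextValue = !m.Groups["value"].Success
                && i + 1 < args.Count
                && !args[i + 1].StartsWith("--", StringComparison.Ordinal);

            if (inlineValue || nextValue) yield return m.Groups["flag"].Value;
        }
    }
}
```

The values themselves should still go through the regex and entropy detectors; these helpers only decide where to look harder.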
---
## 8) Reporting format (JSON)
Example JSON for one finding:
```json
{
"detectorId": "aws.accessKeyId",
"name": "AWS Access Key ID",
"severity": "HIGH",
"confidence": 0.92,
"valueSample": "AKIA************WXYZ",
"locations": [
{
"type": "image-layer-file",
"image": "repo/app:1.4.2",
"layerDigest": "sha256:...abc",
"path": "/app/.env",
"line": 12
},
{
"type": "container-env",
"containerId": "f3e9d...",
"envKey": "AWS_ACCESS_KEY_ID"
}
],
"context": {
"filePathScore": 0.2,
"regexMatch": true,
"entropy": null,
"nearbyKeywords": ["AWS_ACCESS_KEY_ID"]
},
"remediation": "Remove from image; inject via secrets manager or runtime mount; rotate the key."
}
```
> Optionally also emit **SARIF** to plug into code-scanning dashboards.
---
## 9) C# implementation sketch
### Project layout
```
SecretsScanner/
Core/
IDetector.cs // interface: Detect(stream|text, path, context) -> Findings
RegexDetector.cs // holds Pattern, Hints, Confidence rules
EntropyDetector.cs // Shannon entropy
JwtDetector.cs // structural decoding check
FileClassifier.cs // text/binary check, ext-based hints
Scoring.cs // combine signals; severity
PathsHeuristics.cs // globs & filename rules
ReportModel.cs // JSON schema / SARIF
Docker/
ImageReader.cs // reads image tars, layers via Docker.DotNet or stream
HistoryParser.cs // extracts ENV/ARG from history
ContainerInspector.cs // env, args, mounts, logs (Docker.DotNet)
Catalog/
RegexCatalog.cs // patterns (section 3), per-detector metadata
Keywords.cs // boost/suppress lists
Cli/
Program.cs // options: image, container, path; json output; fail-on
```
### C# snippets (illustrative)
**Regex catalog**
```csharp
using System.Text.RegularExpressions;

public static class RegexCatalog
{
public static readonly (string Id, string Name, Regex Rx, string Severity, string Hint)[] Rules =
{
("pem.openssh", "OpenSSH Private Key",
new Regex(@"-----BEGIN OPENSSH PRIVATE KEY-----", RegexOptions.Compiled),
"CRITICAL", "Remove private keys from images; use mounts or vault."),
("pem.private", "PEM Private Key",
new Regex(@"-----BEGIN (?:RSA |DSA |EC |PGP )?PRIVATE KEY-----", RegexOptions.Compiled),
"CRITICAL", "Remove private keys; rotate credentials."),
("aws.akid", "AWS Access Key ID",
new Regex(@"\b(?:AKIA|ASIA|AGPA|AIDA|AROA|AIPA|ANPA)[A-Z0-9]{16}\b", RegexOptions.Compiled),
"HIGH", "Rotate; use IAM roles/STS; remove from code/config."),
("github.pat", "GitHub Personal Access Token",
new Regex(@"\bgh[prusoa]_[A-Za-z0-9]{36}\b", RegexOptions.Compiled),
"HIGH", "Revoke PAT; use fine-grained tokens; remove from image."),
// ... add remaining patterns from Section 3
};
}
```
**Entropy**
```csharp
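using System;

public static class Entropy
{
    // Shannon entropy in bits per character, counting only characters found in `alphabet`.
    // Assumes an ASCII alphabet; characters outside 0..255 are skipped to stay within `counts`.
    public static double Shannon(ReadOnlySpan<char> s, ReadOnlySpan<char> alphabet)
    {
        Span<int> counts = stackalloc int[256];
        counts.Clear();
        int n = 0;
        foreach (var ch in s)
        {
            if (ch < 256 && alphabet.IndexOf(ch) >= 0) { counts[ch]++; n++; }
        }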
if (n == 0) return 0.0;
double H = 0.0;
for (int i = 0; i < counts.Length; i++)
{
if (counts[i] == 0) continue;
double p = counts[i] / (double)n;
H -= p * Math.Log(p, 2);
}
return H;
}
}
```
**Candidate extraction (simplified)**
```csharp
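using System.Collections.Generic;
using System.Text.RegularExpressions;

public record Candidate(string Value, string Kind, string Line);

// The members below belong inside a detector class.
static readonly Regex Base64Token = new(@"[A-Za-z0-9/_\-\+=]{20,}", RegexOptions.Compiled);
static readonly Regex HexToken = new(@"[A-Fa-f0-9]{32,}", RegexOptions.Compiled);

// A hex run also matches Base64Token, so one value can be yielded twice; dedupe downstream.
IEnumerable<Candidate> ExtractCandidates(string line)
{
foreach (Match m in Base64Token.Matches(line)) yield return new Candidate(m.Value, "b64", line);
foreach (Match m in HexToken.Matches(line)) yield return new Candidate(m.Value, "hex", line);
}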
```
**Scoring**
```csharp
double Score(DetectionSignals s)
{
double score = 0;
if (s.RegexHard) score += 0.9;
if (s.RegexSoft) score += 0.7;
if (s.EntropyHit) score += 0.4;
if (s.SuspiciousPath) score += 0.2;
if (s.StructuralOk) score += 0.2;
if (s.Suppressor) score -= 0.5;
return Math.Clamp(score, 0, 1);
}
```
**Docker (Docker.DotNet)**
* Images: `IImageOperations.GetImageHistoryAsync`, `Images.GetImageAsync` + tar unpack.
* Containers: `Containers.InspectContainerAsync`, `Exec.ExecCreateContainerAsync` + `ExecStart`, `GetArchiveFromContainerAsync`, `Logs.GetContainerLogsAsync`.
---
## 10) False-positive control & hygiene
* **Ignore lists**: file globs (`test/**`, `**/*.example.*`), value lists (`REDACTED`, `example`, `dummy`, `changeme`).
* **Public materials**: down-rank matches inside `BEGIN PUBLIC KEY`/`BEGIN CERTIFICATE`.
* **Thresholds**: tune entropy and minimum lengths to your codebase; keep per-detector knobs in config.
* **Masking**: never print full values; keep secure logs.
* **Rate limits**: cap per-file matches; cap per-container findings to avoid spam.
---
## 11) CI/CD and policy
* **Build step**: after `docker build`, run image scan; **fail** on High/Critical (configurable).
* **Pre-deploy**: scan the runtime environment (env vars, process args, mounts) in read-only mode.
* **Baselining**: allow a first pass to **baseline known leftovers**, then block any **new** secrets.
* **Rotation**: auto-emit per-type remediation (e.g., rotate PAT, revoke AWS AK/SK, move to secret manager).
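
The build-step gate can be reduced to an exit code. In this sketch `Severity`, `Gate`, and the `failOn` parameter are hypothetical names standing in for the CLI's `fail-on` option; the default of High mirrors the build-step rule above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum Severity { Low, Medium, High, Critical }

public static class Gate
{
    // Returns 1 (fail the build) when any finding is at or above the threshold, otherwise 0.
    public static int ExitCode(IEnumerable<Severity> findings, Severity failOn = Severity.High) =>
        findings.Any(s => s >= failOn) ? 1 : 0;
}
```

With baselining, the list passed in would contain only findings that are not already in the accepted baseline.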
---
## 12) Optional enhancements
* **SBOM-guided scanning**: use SBOM/file inventory to prioritize text/config assets; cache base layers.
* **JWT structural checks**: base64url-decode header/payload; verify JSON; flag if plausible.
* **Checksum checks**: Luhn for CCNs (if in scope); simple format checks for cloud tokens.
* **Interactive audit**: CLI `--audit` mode to triage and write an “allowlist/baseline”.
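
For the checksum idea, a Luhn check is small enough to show in full. `Luhn.IsValid` is a hypothetical helper; accepting spaces and dashes as separators and requiring at least 12 digits are added assumptions to cut noise from short numeric strings.

```csharp
using System;

public static class Luhn
{
    // Validates a digit string with the Luhn checksum.
    // Spaces and dashes are ignored; any other non-digit makes the input invalid.
    public static bool IsValid(string input)
    {
        int sum = 0, digitCount = 0;
        bool doubleIt = false; // every second digit, counted from the right, is doubled

        for (int i = input.Length - 1; i >= 0; i--)
        {
            char c = input[i];
            if (c == ' ' || c == '-') continue;
            if (c < '0' || c > '9') return false;

            int d = c - '0';
            if (doubleIt)
            {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
            digitCount++;
        }
        return digitCount >= 12 && sum % 10 == 0;
    }
}
```

A passing checksum is only a structural signal and would feed the same score bump as other structural checks.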
---
## 13) Minimal “first list” your dev can paste today
**Start with these detectors (high ROI):**
* PEM/OPENSSH private keys
* AWS AKID + secret (context-aided)
* GitHub PAT, GitLab PAT, NPM, PyPI
* Slack, Stripe, SendGrid, Twilio
* Docker config `auth` field
* DB connection strings (Postgres/MySQL/Mongo/SQLServer)
* JWT
* `.aws/credentials`, `.npmrc`, `.docker/config.json`, `appsettings*.json`, `.env*`, `*.tfstate`, `*kubeconfig*` (path heuristics)
* Entropy (base64/hex/alnum) with context boosts/suppressors
That set alone should catch a large share of real-world leaks.
---
### Final note
This blueprint keeps everything **offline** (no external calls), so it's safe in CI and reproducible. If you later want to add **credential validation** (e.g., confirm an AWS key via STS), make it opt-in and heavily rate-limited.