# Semantic Entrypoint Schema
> Part of Sprint 0411 - Semantic Entrypoint Engine (Task 23)

This document defines the schema for semantic entrypoint analysis, which enriches container scan results with application-level intent, capabilities, and threat modeling.

---
## Overview
The Semantic Entrypoint Engine analyzes container entrypoints to infer:
1. **Application Intent** - What kind of application is running (web server, worker, CLI, etc.)
2. **Capabilities** - What system resources the application accesses (network, filesystem, database, etc.)
3. **Attack Surface** - Potential security threat vectors based on capabilities
4. **Data Boundaries** - Data flow boundaries with sensitivity classification

This semantic layer enables more precise vulnerability prioritization by identifying which code paths are actually reachable from the entrypoint.

---
## Schema Definitions
### SemanticEntrypoint
The root type representing semantic analysis of an entrypoint.
```typescript
interface SemanticEntrypoint {
  id: string;                   // Unique identifier for this analysis
  specification: EntrypointSpecification;
  intent: ApplicationIntent;
  capabilities: CapabilityClass; // Bitmask of detected capabilities
  attackSurface: ThreatVector[];
  dataBoundaries: DataFlowBoundary[];
  confidence: SemanticConfidence;
  language?: string;            // Primary language (python, java, node, dotnet, go)
  framework?: string;           // Detected framework (django, spring-boot, express, etc.)
  frameworkVersion?: string;
  runtimeVersion?: string;
  analyzedAt: string;           // ISO-8601 timestamp
}
```
### ApplicationIntent
Enumeration of application types.
| Value | Description | Common Indicators |
|-------|-------------|-------------------|
| `Unknown` | Intent could not be determined | Fallback |
| `WebServer` | HTTP/HTTPS server | Flask, Django, Express, ASP.NET Core, Gin |
| `Worker` | Background job processor | Celery, Sidekiq, BackgroundService |
| `CliTool` | Command-line interface | Click, argparse, Cobra, Picocli |
| `Serverless` | FaaS function | Lambda handler, Cloud Functions |
| `StreamProcessor` | Event stream handler | Kafka Streams, Flink |
| `RpcServer` | RPC/gRPC server | gRPC, Thrift |
| `Daemon` | Long-running service | Custom main loops |
| `TestRunner` | Test execution | pytest, JUnit, xunit |
| `BatchJob` | Scheduled/periodic task | Cron-style entry |
| `Proxy` | Network proxy/gateway | Envoy, nginx config |
### CapabilityClass (Bitmask)
Flags indicating detected capabilities. Multiple flags can be combined.
| Flag | Value | Description |
|------|-------|-------------|
| `None` | 0x0 | No capabilities detected |
| `NetworkListen` | 0x1 | Binds to network ports |
| `NetworkOutbound` | 0x2 | Makes outbound network requests |
| `FileRead` | 0x4 | Reads from filesystem |
| `FileWrite` | 0x8 | Writes to filesystem |
| `ProcessSpawn` | 0x10 | Spawns child processes |
| `DatabaseSql` | 0x20 | SQL database access |
| `DatabaseNoSql` | 0x40 | NoSQL database access |
| `MessageQueue` | 0x80 | Message queue producer/consumer |
| `CacheAccess` | 0x100 | Cache system access (Redis, Memcached) |
| `CryptoSign` | 0x200 | Cryptographic signing operations |
| `CryptoEncrypt` | 0x400 | Encryption/decryption operations |
| `UserInput` | 0x800 | Processes user input |
| `SecretAccess` | 0x1000 | Reads secrets/credentials |
| `CloudSdk` | 0x2000 | Cloud provider SDK usage |
| `ContainerApi` | 0x4000 | Container/orchestration API access |
| `SystemCall` | 0x8000 | Direct syscall/FFI usage |
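
Because the flags are powers of two, they can be combined with bitwise OR and tested with bitwise AND. The following is a minimal sketch of that usage, assuming the flag values from the table above; the helper names `hasAll` and `capabilityNames` are hypothetical and not part of the scanner API.

```typescript
// Flag values copied from the CapabilityClass table.
enum CapabilityClass {
  None = 0x0,
  NetworkListen = 0x1,
  NetworkOutbound = 0x2,
  FileRead = 0x4,
  FileWrite = 0x8,
  ProcessSpawn = 0x10,
  DatabaseSql = 0x20,
  DatabaseNoSql = 0x40,
  MessageQueue = 0x80,
  CacheAccess = 0x100,
  CryptoSign = 0x200,
  CryptoEncrypt = 0x400,
  UserInput = 0x800,
  SecretAccess = 0x1000,
  CloudSdk = 0x2000,
  ContainerApi = 0x4000,
  SystemCall = 0x8000,
}

// Hypothetical helper: true when every bit in `required` is set in `mask`.
function hasAll(mask: CapabilityClass, required: CapabilityClass): boolean {
  return (mask & required) === required;
}

// Hypothetical helper: expand a mask into flag names, lowest bit first.
function capabilityNames(mask: CapabilityClass): string[] {
  const names: string[] = [];
  for (let bit = 1; bit <= CapabilityClass.SystemCall; bit <<= 1) {
    if ((mask & bit) !== 0) {
      names.push(CapabilityClass[bit]); // numeric enum reverse mapping
    }
  }
  return names;
}

// Combine flags with bitwise OR.
const detected: CapabilityClass =
  CapabilityClass.DatabaseSql | CapabilityClass.NetworkListen | CapabilityClass.UserInput;
```

Here `capabilityNames(detected)` lists the names in bit order regardless of the order in which the flags were OR-ed together.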
### ThreatVector
Represents a potential attack vector.
```typescript
interface ThreatVector {
  type: ThreatVectorType;
  confidence: number;           // 0.0 to 1.0
  contributingCapabilities: CapabilityClass;
  evidence: string[];
  cweId?: number;               // CWE identifier
  owaspCategory?: string;       // OWASP category
}
```
### ThreatVectorType
| Type | CWE | OWASP | Triggered By |
|------|-----|-------|--------------|
| `SqlInjection` | 89 | A03:Injection | DatabaseSql + UserInput |
| `CommandInjection` | 78 | A03:Injection | ProcessSpawn + UserInput |
| `PathTraversal` | 22 | A01:Broken Access Control | FileRead/FileWrite + UserInput |
| `Ssrf` | 918 | A10:SSRF | NetworkOutbound + UserInput |
| `Xss` | 79 | A03:Injection | NetworkListen + UserInput |
| `InsecureDeserialization` | 502 | A08:Software and Data Integrity | UserInput + dynamic types |
| `SensitiveDataExposure` | 200 | A02:Cryptographic Failures | SecretAccess + NetworkListen |
| `BrokenAuthentication` | 287 | A07:Identification and Auth | NetworkListen + SecretAccess |
| `InsufficientLogging` | 778 | A09:Logging Failures | NetworkListen without logging |
| `CryptoWeakness` | 327 | A02:Cryptographic Failures | CryptoSign/CryptoEncrypt |
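
The "Triggered By" column can be read as a rule table over the capability bitmask. The sketch below covers only the five rows that are expressible purely as capability flags, and it assumes that `FileRead/FileWrite` means "either flag". The `ThreatRule` shape and `matchThreats` function are hypothetical illustrations, not the engine's actual rule representation.

```typescript
// Subset of CapabilityClass flags needed here (values from the CapabilityClass table).
enum Cap {
  NetworkListen = 0x1,
  NetworkOutbound = 0x2,
  FileRead = 0x4,
  FileWrite = 0x8,
  ProcessSpawn = 0x10,
  DatabaseSql = 0x20,
  UserInput = 0x800,
}

// Hypothetical rule shape for this sketch.
interface ThreatRule {
  type: string;
  cweId: number;
  owaspCategory: string;
  allOf: number; // every bit must be present
  anyOf: number; // at least one bit must be present (0 = no extra requirement)
}

const RULES: ThreatRule[] = [
  { type: "SqlInjection", cweId: 89, owaspCategory: "A03:Injection",
    allOf: Cap.DatabaseSql | Cap.UserInput, anyOf: 0 },
  { type: "CommandInjection", cweId: 78, owaspCategory: "A03:Injection",
    allOf: Cap.ProcessSpawn | Cap.UserInput, anyOf: 0 },
  { type: "PathTraversal", cweId: 22, owaspCategory: "A01:Broken Access Control",
    allOf: Cap.UserInput, anyOf: Cap.FileRead | Cap.FileWrite },
  { type: "Ssrf", cweId: 918, owaspCategory: "A10:SSRF",
    allOf: Cap.NetworkOutbound | Cap.UserInput, anyOf: 0 },
  { type: "Xss", cweId: 79, owaspCategory: "A03:Injection",
    allOf: Cap.NetworkListen | Cap.UserInput, anyOf: 0 },
];

// Return every rule whose capability requirements are satisfied by `mask`.
function matchThreats(mask: number): ThreatRule[] {
  return RULES.filter(
    (rule) =>
      (mask & rule.allOf) === rule.allOf &&
      (rule.anyOf === 0 || (mask & rule.anyOf) !== 0)
  );
}

const webAppMask = Cap.NetworkListen | Cap.DatabaseSql | Cap.UserInput;
const matchedTypes = matchThreats(webAppMask).map((rule) => rule.type);
```

Rows such as `InsecureDeserialization` or `InsufficientLogging` depend on signals beyond the bitmask (dynamic types, absence of logging), so they are left out of this sketch.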
### DataFlowBoundary
Represents a data flow boundary crossing.
```typescript
interface DataFlowBoundary {
  type: DataFlowBoundaryType;
  direction: DataFlowDirection; // Inbound | Outbound | Bidirectional
  sensitivity: DataSensitivity; // Public | Internal | Confidential | Restricted
  confidence: number;
  port?: number;                // For network boundaries
  protocol?: string;            // http, grpc, amqp, etc.
  evidence: string[];
}
```
### DataFlowBoundaryType
| Type | Security Sensitive | Description |
|------|-------------------|-------------|
| `HttpRequest` | Yes | HTTP/HTTPS endpoint |
| `GrpcCall` | Yes | gRPC service |
| `WebSocket` | Yes | WebSocket connection |
| `DatabaseQuery` | Yes | Database queries |
| `MessageBroker` | No | Message queue pub/sub |
| `FileSystem` | No | File I/O boundary |
| `Cache` | No | Cache read/write |
| `ExternalApi` | Yes | Third-party API calls |
| `CloudService` | Yes | Cloud provider services |
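
The "Security Sensitive" column lends itself to a simple lookup, for example when counting sensitive boundaries. This is a minimal sketch that encodes the table as a map; the names `SECURITY_SENSITIVE` and `countSensitive` are hypothetical, and the string-literal type stands in for whatever enum the scanner actually uses.

```typescript
// Boundary types from the DataFlowBoundaryType table, modeled as string literals.
type DataFlowBoundaryType =
  | "HttpRequest" | "GrpcCall" | "WebSocket" | "DatabaseQuery" | "MessageBroker"
  | "FileSystem" | "Cache" | "ExternalApi" | "CloudService";

// "Security Sensitive" column of the table, one entry per boundary type.
const SECURITY_SENSITIVE: Record<DataFlowBoundaryType, boolean> = {
  HttpRequest: true,
  GrpcCall: true,
  WebSocket: true,
  DatabaseQuery: true,
  MessageBroker: false,
  FileSystem: false,
  Cache: false,
  ExternalApi: true,
  CloudService: true,
};

// Hypothetical helper: count boundaries whose type is marked security sensitive.
function countSensitive(boundaries: { type: DataFlowBoundaryType }[]): number {
  return boundaries.filter((b) => SECURITY_SENSITIVE[b.type]).length;
}

const sensitiveCount = countSensitive([
  { type: "HttpRequest" },
  { type: "Cache" },
  { type: "DatabaseQuery" },
]);
```

Using a `Record` keyed by the full union means the compiler flags any boundary type that is added without a sensitivity entry.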
### SemanticConfidence
Confidence scoring for semantic analysis.
```typescript
interface SemanticConfidence {
  score: number;                // 0.0 to 1.0
  tier: ConfidenceTier;
  reasons: string[];
}

enum ConfidenceTier {
  Unknown = 0,
  Low = 1,
  Medium = 2,
  High = 3,
  Definitive = 4
}
```
| Tier | Score Range | Description |
|------|-------------|-------------|
| `Unknown` | 0.0 | No analysis possible |
| `Low` | 0.0-0.4 | Heuristic guess only |
| `Medium` | 0.4-0.7 | Partial evidence |
| `High` | 0.7-0.9 | Strong indicators |
| `Definitive` | 0.9-1.0 | Explicit declaration found |
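
The score ranges in the table share their endpoints (0.4, 0.7, 0.9), and the table does not say which tier a boundary value belongs to. The sketch below shows one possible mapping, assuming each range includes its lower bound and excludes its upper bound, and that a score of exactly 0.0 (or any invalid value) maps to `Unknown`. The function name `tierForScore` is hypothetical.

```typescript
enum ConfidenceTier {
  Unknown = 0,
  Low = 1,
  Medium = 2,
  High = 3,
  Definitive = 4,
}

// Assumption: lower bounds are inclusive, so 0.4 is Medium, 0.7 is High, 0.9 is Definitive.
function tierForScore(score: number): ConfidenceTier {
  if (!(score > 0)) {
    return ConfidenceTier.Unknown; // covers 0.0, negative values, and NaN
  }
  if (score < 0.4) {
    return ConfidenceTier.Low;
  }
  if (score < 0.7) {
    return ConfidenceTier.Medium;
  }
  if (score < 0.9) {
    return ConfidenceTier.High;
  }
  return ConfidenceTier.Definitive;
}

// Tier name for a score of 0.85, via the numeric enum reverse mapping.
const exampleTier = ConfidenceTier[tierForScore(0.85)];
```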
---
## SBOM Property Extensions
When semantic data is included in CycloneDX or SPDX SBOMs, the following property namespace is used:
```
stellaops:semantic.*
```
### Property Names
| Property | Type | Description |
|----------|------|-------------|
| `stellaops:semantic.intent` | string | ApplicationIntent value |
| `stellaops:semantic.capabilities` | string | Comma-separated capability names |
| `stellaops:semantic.capability.count` | int | Number of detected capabilities |
| `stellaops:semantic.threats` | JSON | Array of threat vector summaries |
| `stellaops:semantic.threat.count` | int | Number of identified threats |
| `stellaops:semantic.risk.score` | float | Overall risk score (0.0-1.0) |
| `stellaops:semantic.confidence` | float | Confidence score (0.0-1.0) |
| `stellaops:semantic.confidence.tier` | string | Confidence tier name |
| `stellaops:semantic.language` | string | Primary language |
| `stellaops:semantic.framework` | string | Detected framework |
| `stellaops:semantic.framework.version` | string | Framework version |
| `stellaops:semantic.boundary.count` | int | Number of data boundaries |
| `stellaops:semantic.boundary.sensitive.count` | int | Security-sensitive boundaries |
| `stellaops:semantic.owasp.categories` | string | Comma-separated OWASP categories |
| `stellaops:semantic.cwe.ids` | string | Comma-separated CWE IDs |
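
CycloneDX properties are name/value pairs in which both sides are strings, so numeric values have to be stringified. The following is a minimal sketch of emitting a few of the properties above. The `SemanticSummary` shape and `toSbomProperties` function are hypothetical, and the comma separator without a trailing space is an assumption.

```typescript
// CycloneDX-style property: both name and value are strings.
interface SbomProperty {
  name: string;
  value: string;
}

// Hypothetical, reduced input shape for this sketch.
interface SemanticSummary {
  intent: string;
  capabilities: string[]; // capability names, already in bitmask order
  confidence: number;
  confidenceTier: string;
  framework?: string;
}

function toSbomProperties(summary: SemanticSummary): SbomProperty[] {
  const props: SbomProperty[] = [
    { name: "stellaops:semantic.intent", value: summary.intent },
    { name: "stellaops:semantic.capabilities", value: summary.capabilities.join(",") },
    { name: "stellaops:semantic.capability.count", value: String(summary.capabilities.length) },
    { name: "stellaops:semantic.confidence", value: String(summary.confidence) },
    { name: "stellaops:semantic.confidence.tier", value: summary.confidenceTier },
  ];
  // Optional fields are emitted only when present.
  if (summary.framework !== undefined) {
    props.push({ name: "stellaops:semantic.framework", value: summary.framework });
  }
  return props;
}

const exampleProps = toSbomProperties({
  intent: "WebServer",
  capabilities: ["NetworkListen", "DatabaseSql"],
  confidence: 0.85,
  confidenceTier: "High",
  framework: "flask",
});
```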
---
## RichGraph Integration
Semantic data is attached to `richgraph-v1` nodes via the Attributes dictionary:
| Attribute Key | Description |
|---------------|-------------|
| `semantic_intent` | ApplicationIntent value |
| `semantic_capabilities` | Comma-separated capability flags |
| `semantic_threats` | Comma-separated threat types |
| `semantic_risk_score` | Risk score (formatted to 3 decimal places) |
| `semantic_confidence` | Confidence score |
| `semantic_confidence_tier` | Confidence tier name |
| `semantic_framework` | Framework name |
| `semantic_framework_version` | Framework version |
| `is_entrypoint` | "true" if node is an entrypoint |
| `semantic_boundaries` | JSON array of boundary types |
| `owasp_category` | OWASP category if applicable |
| `cwe_id` | CWE identifier if applicable |
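
As an illustration of the string formatting involved, the sketch below builds a small attribute dictionary for an entrypoint node. Only the attribute keys and the three-decimal risk score format come from the table above; the `EntrypointSummary` shape and `toNodeAttributes` function are hypothetical.

```typescript
// Hypothetical, reduced input shape for this sketch.
interface EntrypointSummary {
  intent: string;
  capabilities: string[];
  riskScore: number;
  confidenceTier: string;
}

// Build string-valued attributes using keys from the attribute table.
function toNodeAttributes(summary: EntrypointSummary): Record<string, string> {
  return {
    semantic_intent: summary.intent,
    semantic_capabilities: summary.capabilities.join(","),
    semantic_risk_score: summary.riskScore.toFixed(3), // 3 decimal places
    semantic_confidence_tier: summary.confidenceTier,
    is_entrypoint: "true",
  };
}

const exampleAttributes = toNodeAttributes({
  intent: "WebServer",
  capabilities: ["NetworkListen", "DatabaseSql"],
  riskScore: 0.72,
  confidenceTier: "High",
});
```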
---
## Language Adapter Support
The following language-specific adapters are available:
| Language | Adapter | Supported Frameworks |
|----------|---------|---------------------|
| Python | `PythonSemanticAdapter` | Django, Flask, FastAPI, Celery, Click |
| Java | `JavaSemanticAdapter` | Spring Boot, Quarkus, Micronaut, Kafka Streams |
| Node.js | `NodeSemanticAdapter` | Express, NestJS, Fastify, Koa |
| .NET | `DotNetSemanticAdapter` | ASP.NET Core, Worker Service, Console |
| Go | `GoSemanticAdapter` | net/http, Gin, Echo, Cobra, gRPC |
---
## Configuration
Semantic analysis is configured via the `Scanner:EntryTrace:Semantic` configuration section:
```yaml
Scanner:
  EntryTrace:
    Semantic:
      Enabled: true
      ThreatConfidenceThreshold: 0.3
      MaxThreatVectors: 50
      IncludeLowConfidenceCapabilities: false
      EnabledLanguages: []  # Empty = all languages
```
| Option | Default | Description |
|--------|---------|-------------|
| `Enabled` | true | Enable semantic analysis |
| `ThreatConfidenceThreshold` | 0.3 | Minimum confidence for threat vectors |
| `MaxThreatVectors` | 50 | Maximum threats per entrypoint |
| `IncludeLowConfidenceCapabilities` | false | Include low-confidence capabilities |
| `EnabledLanguages` | [] | Languages to analyze (empty = all) |
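
To illustrate how `ThreatConfidenceThreshold` and `MaxThreatVectors` might interact, here is a minimal sketch. It assumes the threshold is inclusive and that, when the cap is exceeded, the highest-confidence threats are the ones kept; neither behavior is stated above. The `SemanticOptions` shape and `applyThreatLimits` function are hypothetical, and the final output ordering described under Determinism Guarantees is not handled here.

```typescript
// Hypothetical options shape mirroring part of the configuration table.
interface SemanticOptions {
  enabled: boolean;
  threatConfidenceThreshold: number;
  maxThreatVectors: number;
}

// Defaults taken from the configuration table.
const DEFAULT_OPTIONS: SemanticOptions = {
  enabled: true,
  threatConfidenceThreshold: 0.3,
  maxThreatVectors: 50,
};

interface Threat {
  type: string;
  confidence: number;
}

function applyThreatLimits(threats: Threat[], options: SemanticOptions): Threat[] {
  if (!options.enabled) {
    return [];
  }
  return threats
    // Assumption: a threat exactly at the threshold is kept.
    .filter((t) => t.confidence >= options.threatConfidenceThreshold)
    // filter() returned a new array, so sorting does not mutate the input.
    .sort((a, b) => {
      if (a.confidence !== b.confidence) {
        return b.confidence - a.confidence; // highest confidence first
      }
      return a.type < b.type ? -1 : a.type > b.type ? 1 : 0; // stable tie-break
    })
    .slice(0, options.maxThreatVectors);
}

const candidates: Threat[] = [
  { type: "Xss", confidence: 0.2 },
  { type: "Ssrf", confidence: 0.5 },
  { type: "SqlInjection", confidence: 0.8 },
];
const keptThreats = applyThreatLimits(candidates, DEFAULT_OPTIONS);
```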
---
## Determinism Guarantees
All semantic analysis outputs are deterministic:
1. **Capability ordering** - Flags are ordered by value (bitmask position)
2. **Threat vector ordering** - Ordered by ThreatVectorType enum value
3. **Data boundary ordering** - Ordered by (Type, Direction) tuple
4. **Evidence ordering** - Alphabetically sorted within each element
5. **JSON serialization** - Uses camelCase naming, consistent formatting

This enables reliable diffing of semantic analysis results across scan runs.
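
The ordering rules above can be sketched as a normalization pass applied before serialization. This is an illustration only: the numeric `type` and `direction` fields stand in for enum values the schema does not spell out, "alphabetically" is assumed to mean ordinal string comparison, and the function names are hypothetical.

```typescript
// Hypothetical, reduced shapes: enum members are represented by their numeric values.
interface Threat {
  type: number;
  evidence: string[];
}

interface Boundary {
  type: number;
  direction: number;
  evidence: string[];
}

// Ordinal comparison, independent of locale settings.
function compareOrdinal(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Rules 2 and 4: order threats by enum value, evidence alphabetically.
function normalizeThreats(threats: Threat[]): Threat[] {
  return threats
    .map((t) => ({ type: t.type, evidence: t.evidence.slice().sort(compareOrdinal) }))
    .sort((a, b) => a.type - b.type);
}

// Rules 3 and 4: order boundaries by (type, direction), evidence alphabetically.
function normalizeBoundaries(boundaries: Boundary[]): Boundary[] {
  return boundaries
    .map((b) => ({
      type: b.type,
      direction: b.direction,
      evidence: b.evidence.slice().sort(compareOrdinal),
    }))
    .sort((a, b) => a.type - b.type || a.direction - b.direction);
}

// Two runs that discovered the same facts in a different order serialize identically.
const runA = JSON.stringify(normalizeThreats([
  { type: 3, evidence: ["requests.get", "flask.request"] },
  { type: 0, evidence: ["cursor.execute"] },
]));
const runB = JSON.stringify(normalizeThreats([
  { type: 0, evidence: ["cursor.execute"] },
  { type: 3, evidence: ["flask.request", "requests.get"] },
]));
```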

---
## CLI Usage
Semantic analysis can be enabled via the CLI `--semantic` flag:
```bash
stella scan --semantic docker.io/library/python:3.12
```
When enabled, the output includes a semantic summary:
```
Semantic Analysis:
  Intent: WebServer
  Framework: flask (v3.0.0)
  Capabilities: NetworkListen, DatabaseSql, FileRead
  Threat Vectors: 2 (SqlInjection, Ssrf)
  Risk Score: 0.72
  Confidence: High (0.85)
```
---
## References
- [OWASP Top 10 2021](https://owasp.org/Top10/)
- [CWE/SANS Top 25](https://cwe.mitre.org/top25/)
- [CycloneDX Property Extensions](https://cyclonedx.org/docs/1.5/json/#properties)
- [SPDX 3.0 External Identifiers](https://spdx.github.io/spdx-spec/v3.0/annexes/external-identifier-types/)