
# Semantic Entrypoint Schema

> Part of Sprint 0411 - Semantic Entrypoint Engine (Task 23)

This document defines the schema for semantic entrypoint analysis, which enriches container scan results with application-level intent, capabilities, and threat modeling.

---

## Overview

The Semantic Entrypoint Engine analyzes container entrypoints to infer:

1. **Application Intent** - What kind of application is running (web server, worker, CLI, etc.)
2. **Capabilities** - What system resources the application accesses (network, filesystem, database, etc.)
3. **Attack Surface** - Potential security threat vectors based on capabilities
4. **Data Boundaries** - Data flow boundaries with sensitivity classification

This semantic layer enables more precise vulnerability prioritization by understanding which code paths are actually reachable from the entrypoint.

---

## Schema Definitions

### SemanticEntrypoint

The root type representing semantic analysis of an entrypoint.

```typescript
interface SemanticEntrypoint {
  id: string;                        // Unique identifier for this analysis
  specification: EntrypointSpecification;
  intent: ApplicationIntent;
  capabilities: CapabilityClass;     // Bitmask of detected capabilities
  attackSurface: ThreatVector[];
  dataBoundaries: DataFlowBoundary[];
  confidence: SemanticConfidence;
  language?: string;                 // Primary language (python, java, node, dotnet, go)
  framework?: string;                // Detected framework (django, spring-boot, express, etc.)
  frameworkVersion?: string;
  runtimeVersion?: string;
  analyzedAt: string;                // ISO-8601 timestamp
}
```

### ApplicationIntent

Enumeration of application types.

| Value | Description | Common Indicators |
|-------|-------------|-------------------|
| `Unknown` | Intent could not be determined | Fallback |
| `WebServer` | HTTP/HTTPS server | Flask, Django, Express, ASP.NET Core, Gin |
| `Worker` | Background job processor | Celery, Sidekiq, BackgroundService |
| `CliTool` | Command-line interface | Click, argparse, Cobra, Picocli |
| `Serverless` | FaaS function | Lambda handler, Cloud Functions |
| `StreamProcessor` | Event stream handler | Kafka Streams, Flink |
| `RpcServer` | RPC/gRPC server | gRPC, Thrift |
| `Daemon` | Long-running service | Custom main loops |
| `TestRunner` | Test execution | pytest, JUnit, xunit |
| `BatchJob` | Scheduled/periodic task | Cron-style entry |
| `Proxy` | Network proxy/gateway | Envoy, nginx config |

### CapabilityClass (Bitmask)

Flags indicating detected capabilities. Multiple flags can be combined.

| Flag | Value | Description |
|------|-------|-------------|
| `None` | 0x0 | No capabilities detected |
| `NetworkListen` | 0x1 | Binds to network ports |
| `NetworkOutbound` | 0x2 | Makes outbound network requests |
| `FileRead` | 0x4 | Reads from filesystem |
| `FileWrite` | 0x8 | Writes to filesystem |
| `ProcessSpawn` | 0x10 | Spawns child processes |
| `DatabaseSql` | 0x20 | SQL database access |
| `DatabaseNoSql` | 0x40 | NoSQL database access |
| `MessageQueue` | 0x80 | Message queue producer/consumer |
| `CacheAccess` | 0x100 | Cache system access (Redis, Memcached) |
| `CryptoSign` | 0x200 | Cryptographic signing operations |
| `CryptoEncrypt` | 0x400 | Encryption/decryption operations |
| `UserInput` | 0x800 | Processes user input |
| `SecretAccess` | 0x1000 | Reads secrets/credentials |
| `CloudSdk` | 0x2000 | Cloud provider SDK usage |
| `ContainerApi` | 0x4000 | Container/orchestration API access |
| `SystemCall` | 0x8000 | Direct syscall/FFI usage |
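
As a minimal sketch of how these flags combine, the table can be written as a TypeScript numeric enum. The enum values are taken from the table; the helpers `hasAll` and `capabilityNames` are hypothetical names used only for illustration and are not part of the documented schema.

```typescript
// Flag values copied from the CapabilityClass table.
enum CapabilityClass {
  None = 0x0,
  NetworkListen = 0x1,
  NetworkOutbound = 0x2,
  FileRead = 0x4,
  FileWrite = 0x8,
  ProcessSpawn = 0x10,
  DatabaseSql = 0x20,
  DatabaseNoSql = 0x40,
  MessageQueue = 0x80,
  CacheAccess = 0x100,
  CryptoSign = 0x200,
  CryptoEncrypt = 0x400,
  UserInput = 0x800,
  SecretAccess = 0x1000,
  CloudSdk = 0x2000,
  ContainerApi = 0x4000,
  SystemCall = 0x8000,
}

// Hypothetical helper: true when every bit in `required` is set in `mask`.
function hasAll(mask: number, required: number): boolean {
  return (mask & required) === required;
}

// Hypothetical helper: expand a mask into flag names, lowest bit first.
function capabilityNames(mask: number): string[] {
  const names: string[] = [];
  for (let bit = 0; bit < 16; bit++) {
    const value = 1 << bit;
    if ((mask & value) !== 0) {
      names.push(CapabilityClass[value]);
    }
  }
  return names;
}

// Combine flags with bitwise OR: 0x1 | 0x20 | 0x800 = 0x821.
const exampleMask =
  CapabilityClass.NetworkListen |
  CapabilityClass.DatabaseSql |
  CapabilityClass.UserInput;

// ["NetworkListen", "DatabaseSql", "UserInput"]
const exampleNames = capabilityNames(exampleMask);
```

Walking the bits from lowest to highest yields names in bitmask order, which keeps the textual form of a mask stable from run to run.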

### ThreatVector

Represents a potential attack vector.

```typescript
interface ThreatVector {
  type: ThreatVectorType;
  confidence: number;                // 0.0 to 1.0
  contributingCapabilities: CapabilityClass;
  evidence: string[];
  cweId?: number;                    // CWE identifier
  owaspCategory?: string;            // OWASP category
}
```

### ThreatVectorType

| Type | CWE | OWASP | Triggered By |
|------|-----|-------|--------------|
| `SqlInjection` | 89 | A03:Injection | DatabaseSql + UserInput |
| `CommandInjection` | 78 | A03:Injection | ProcessSpawn + UserInput |
| `PathTraversal` | 22 | A01:Broken Access Control | FileRead/FileWrite + UserInput |
| `Ssrf` | 918 | A10:SSRF | NetworkOutbound + UserInput |
| `Xss` | 79 | A03:Injection | NetworkListen + UserInput |
| `InsecureDeserialization` | 502 | A08:Software and Data Integrity | UserInput + dynamic types |
| `SensitiveDataExposure` | 200 | A02:Cryptographic Failures | SecretAccess + NetworkListen |
| `BrokenAuthentication` | 287 | A07:Identification and Auth | NetworkListen + SecretAccess |
| `InsufficientLogging` | 778 | A09:Logging Failures | NetworkListen without logging |
| `CryptoWeakness` | 327 | A02:Cryptographic Failures | CryptoSign/CryptoEncrypt |
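
The "Triggered By" column can be read as a set of rules over the capability bitmask. The sketch below covers only the five rows that are expressed purely in terms of capability flags; rows that depend on other signals (for example "dynamic types" or "without logging") are left out. The names `ThreatRule`, `threatRules`, and `candidateThreats` are hypothetical, and the engine's actual rule evaluation and confidence scoring may differ.

```typescript
// Capability bit values copied from the CapabilityClass table.
const Cap = {
  NetworkListen: 0x1,
  NetworkOutbound: 0x2,
  FileRead: 0x4,
  FileWrite: 0x8,
  ProcessSpawn: 0x10,
  DatabaseSql: 0x20,
  UserInput: 0x800,
} as const;

// Hypothetical rule shape, for illustration only.
interface ThreatRule {
  type: string;  // ThreatVectorType name
  cweId: number; // CWE from the table
  allOf: number; // every one of these bits must be present
  anyOf: number; // at least one of these bits must be present (0 = no constraint)
}

const threatRules: ThreatRule[] = [
  { type: "SqlInjection", cweId: 89, allOf: Cap.DatabaseSql | Cap.UserInput, anyOf: 0 },
  { type: "CommandInjection", cweId: 78, allOf: Cap.ProcessSpawn | Cap.UserInput, anyOf: 0 },
  { type: "PathTraversal", cweId: 22, allOf: Cap.UserInput, anyOf: Cap.FileRead | Cap.FileWrite },
  { type: "Ssrf", cweId: 918, allOf: Cap.NetworkOutbound | Cap.UserInput, anyOf: 0 },
  { type: "Xss", cweId: 79, allOf: Cap.NetworkListen | Cap.UserInput, anyOf: 0 },
];

// Return the threat types whose capability preconditions are met.
function candidateThreats(capabilities: number): string[] {
  return threatRules
    .filter(
      (r) =>
        (capabilities & r.allOf) === r.allOf &&
        (r.anyOf === 0 || (capabilities & r.anyOf) !== 0)
    )
    .map((r) => r.type);
}

// A web server with SQL access and user input: ["SqlInjection", "Xss"]
const webThreats = candidateThreats(Cap.NetworkListen | Cap.DatabaseSql | Cap.UserInput);
```

Matching a rule only marks a threat as a candidate; under this reading, the `confidence` and `evidence` fields of `ThreatVector` would be filled in by a later step.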

### DataFlowBoundary

Represents a data flow boundary crossing.

```typescript
interface DataFlowBoundary {
  type: DataFlowBoundaryType;
  direction: DataFlowDirection;      // Inbound | Outbound | Bidirectional
  sensitivity: DataSensitivity;      // Public | Internal | Confidential | Restricted
  confidence: number;
  port?: number;                     // For network boundaries
  protocol?: string;                 // http, grpc, amqp, etc.
  evidence: string[];
}
```

### DataFlowBoundaryType

| Type | Security Sensitive | Description |
|------|-------------------|-------------|
| `HttpRequest` | Yes | HTTP/HTTPS endpoint |
| `GrpcCall` | Yes | gRPC service |
| `WebSocket` | Yes | WebSocket connection |
| `DatabaseQuery` | Yes | Database queries |
| `MessageBroker` | No | Message queue pub/sub |
| `FileSystem` | No | File I/O boundary |
| `Cache` | No | Cache read/write |
| `ExternalApi` | Yes | Third-party API calls |
| `CloudService` | Yes | Cloud provider services |
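
The "Security Sensitive" column is what a count of security-sensitive boundaries would be derived from. A minimal sketch, assuming boundary types are represented as their string names; `SECURITY_SENSITIVE` and `sensitiveBoundaryCount` are hypothetical names:

```typescript
type DataFlowBoundaryType =
  | "HttpRequest"
  | "GrpcCall"
  | "WebSocket"
  | "DatabaseQuery"
  | "MessageBroker"
  | "FileSystem"
  | "Cache"
  | "ExternalApi"
  | "CloudService";

// "Security Sensitive" column from the table above.
const SECURITY_SENSITIVE: Record<DataFlowBoundaryType, boolean> = {
  HttpRequest: true,
  GrpcCall: true,
  WebSocket: true,
  DatabaseQuery: true,
  MessageBroker: false,
  FileSystem: false,
  Cache: false,
  ExternalApi: true,
  CloudService: true,
};

// Count the boundaries whose type is flagged as security sensitive.
function sensitiveBoundaryCount(boundaries: { type: DataFlowBoundaryType }[]): number {
  return boundaries.filter((b) => SECURITY_SENSITIVE[b.type]).length;
}
```

Typing the lookup as `Record<DataFlowBoundaryType, boolean>` makes the compiler reject the table if a boundary type is added without a sensitivity value.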

### SemanticConfidence

Confidence scoring for semantic analysis.

```typescript
interface SemanticConfidence {
  score: number;                     // 0.0 to 1.0
  tier: ConfidenceTier;
  reasons: string[];
}

enum ConfidenceTier {
  Unknown = 0,
  Low = 1,
  Medium = 2,
  High = 3,
  Definitive = 4
}
```

| Tier | Score Range | Description |
|------|-------------|-------------|
| `Unknown` | 0.0 | No analysis possible |
| `Low` | 0.0-0.4 | Heuristic guess only |
| `Medium` | 0.4-0.7 | Partial evidence |
| `High` | 0.7-0.9 | Strong indicators |
| `Definitive` | 0.9-1.0 | Explicit declaration found |
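
The ranges in the table share their endpoints (0.4, 0.7, 0.9), and the table does not say which tier owns a shared boundary. The sketch below assumes each lower bound is inclusive and that only a score of exactly 0.0 (or an invalid score) maps to `Unknown`; both are assumptions, and `tierForScore` is a hypothetical helper.

```typescript
enum ConfidenceTier {
  Unknown = 0,
  Low = 1,
  Medium = 2,
  High = 3,
  Definitive = 4,
}

// Map a score to a tier. ASSUMPTION: lower bounds are inclusive,
// so 0.4 is Medium, 0.7 is High, and 0.9 is Definitive.
function tierForScore(score: number): ConfidenceTier {
  if (!(score > 0)) return ConfidenceTier.Unknown; // 0.0, negative, or NaN
  if (score < 0.4) return ConfidenceTier.Low;
  if (score < 0.7) return ConfidenceTier.Medium;
  if (score < 0.9) return ConfidenceTier.High;
  return ConfidenceTier.Definitive;
}

const exampleTier = tierForScore(0.85); // ConfidenceTier.High
```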

---

## SBOM Property Extensions

When semantic data is included in CycloneDX or SPDX SBOMs, the following property namespace is used:

```
stellaops:semantic.*
```

### Property Names

| Property | Type | Description |
|----------|------|-------------|
| `stellaops:semantic.intent` | string | ApplicationIntent value |
| `stellaops:semantic.capabilities` | string | Comma-separated capability names |
| `stellaops:semantic.capability.count` | int | Number of detected capabilities |
| `stellaops:semantic.threats` | JSON | Array of threat vector summaries |
| `stellaops:semantic.threat.count` | int | Number of identified threats |
| `stellaops:semantic.risk.score` | float | Overall risk score (0.0-1.0) |
| `stellaops:semantic.confidence` | float | Confidence score (0.0-1.0) |
| `stellaops:semantic.confidence.tier` | string | Confidence tier name |
| `stellaops:semantic.language` | string | Primary language |
| `stellaops:semantic.framework` | string | Detected framework |
| `stellaops:semantic.framework.version` | string | Framework version |
| `stellaops:semantic.boundary.count` | int | Number of data boundaries |
| `stellaops:semantic.boundary.sensitive.count` | int | Security-sensitive boundaries |
| `stellaops:semantic.owasp.categories` | string | Comma-separated OWASP categories |
| `stellaops:semantic.cwe.ids` | string | Comma-separated CWE IDs |
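
In CycloneDX, a property is a `name`/`value` pair whose value is a string, so the `int`, `float`, and `JSON` types above would be carried in string form. The sketch below shows one way a subset of these properties could be produced. The `SemanticSummary` input shape, the `toSemanticProperties` function, the exact number formatting, and the use of a bare comma as separator are all assumptions made for illustration.

```typescript
// CycloneDX property shape: both fields are strings.
interface SbomProperty {
  name: string;
  value: string;
}

// Hypothetical input shape, not part of the documented schema.
interface SemanticSummary {
  intent: string;
  capabilities: string[]; // capability flag names
  confidenceScore: number;
  confidenceTier: string;
  framework?: string;
}

const SEMANTIC_PREFIX = "stellaops:semantic.";

function toSemanticProperties(summary: SemanticSummary): SbomProperty[] {
  const props: SbomProperty[] = [
    { name: SEMANTIC_PREFIX + "intent", value: summary.intent },
    { name: SEMANTIC_PREFIX + "capabilities", value: summary.capabilities.join(",") },
    { name: SEMANTIC_PREFIX + "capability.count", value: String(summary.capabilities.length) },
    { name: SEMANTIC_PREFIX + "confidence", value: String(summary.confidenceScore) },
    { name: SEMANTIC_PREFIX + "confidence.tier", value: summary.confidenceTier },
  ];
  // Optional fields are emitted only when present.
  if (summary.framework !== undefined) {
    props.push({ name: SEMANTIC_PREFIX + "framework", value: summary.framework });
  }
  return props;
}

const exampleProps = toSemanticProperties({
  intent: "WebServer",
  capabilities: ["NetworkListen", "DatabaseSql", "UserInput"],
  confidenceScore: 0.85,
  confidenceTier: "High",
  framework: "flask",
});
```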

---

## RichGraph Integration

Semantic data is attached to `richgraph-v1` nodes via the Attributes dictionary:

| Attribute Key | Description |
|---------------|-------------|
| `semantic_intent` | ApplicationIntent value |
| `semantic_capabilities` | Comma-separated capability flags |
| `semantic_threats` | Comma-separated threat types |
| `semantic_risk_score` | Risk score (formatted to 3 decimal places) |
| `semantic_confidence` | Confidence score |
| `semantic_confidence_tier` | Confidence tier name |
| `semantic_framework` | Framework name |
| `semantic_framework_version` | Framework version |
| `is_entrypoint` | "true" if node is an entrypoint |
| `semantic_boundaries` | JSON array of boundary types |
| `owasp_category` | OWASP category if applicable |
| `cwe_id` | CWE identifier if applicable |
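
Because the attributes dictionary holds string values, numeric fields need a fixed textual format; the table specifies three decimal places for the risk score. A minimal sketch of populating a few of these keys follows. The `NodeSemantics` input shape and the `toNodeAttributes` function are hypothetical, and omitting `is_entrypoint` for non-entrypoint nodes is an assumption.

```typescript
// Hypothetical input shape, for illustration only.
interface NodeSemantics {
  intent: string;
  capabilities: string[]; // capability flag names
  threats: string[];      // threat type names
  riskScore: number;
  confidenceTier: string;
  isEntrypoint: boolean;
}

function toNodeAttributes(node: NodeSemantics): Record<string, string> {
  const attrs: Record<string, string> = {
    semantic_intent: node.intent,
    semantic_capabilities: node.capabilities.join(","),
    semantic_threats: node.threats.join(","),
    // Three decimal places, as stated in the table: 0.72 becomes "0.720".
    semantic_risk_score: node.riskScore.toFixed(3),
    semantic_confidence_tier: node.confidenceTier,
  };
  // ASSUMPTION: the key is only written for entrypoint nodes.
  if (node.isEntrypoint) {
    attrs["is_entrypoint"] = "true";
  }
  return attrs;
}

const exampleAttrs = toNodeAttributes({
  intent: "WebServer",
  capabilities: ["NetworkListen", "DatabaseSql"],
  threats: ["SqlInjection"],
  riskScore: 0.72,
  confidenceTier: "High",
  isEntrypoint: true,
});
```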

---

## Language Adapter Support

The following language-specific adapters are available:

| Language | Adapter | Supported Frameworks |
|----------|---------|---------------------|
| Python | `PythonSemanticAdapter` | Django, Flask, FastAPI, Celery, Click |
| Java | `JavaSemanticAdapter` | Spring Boot, Quarkus, Micronaut, Kafka Streams |
| Node.js | `NodeSemanticAdapter` | Express, NestJS, Fastify, Koa |
| .NET | `DotNetSemanticAdapter` | ASP.NET Core, Worker Service, Console |
| Go | `GoSemanticAdapter` | net/http, Gin, Echo, Cobra, gRPC |

---

## Configuration

Semantic analysis is configured via the `Scanner:EntryTrace:Semantic` configuration section:

```yaml
Scanner:
  EntryTrace:
    Semantic:
      Enabled: true
      ThreatConfidenceThreshold: 0.3
      MaxThreatVectors: 50
      IncludeLowConfidenceCapabilities: false
      EnabledLanguages: []  # Empty = all languages
```

| Option | Default | Description |
|--------|---------|-------------|
| `Enabled` | true | Enable semantic analysis |
| `ThreatConfidenceThreshold` | 0.3 | Minimum confidence for threat vectors |
| `MaxThreatVectors` | 50 | Maximum threats per entrypoint |
| `IncludeLowConfidenceCapabilities` | false | Include low-confidence capabilities |
| `EnabledLanguages` | [] | Languages to analyze (empty = all) |
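
To illustrate how these options might interact, the sketch below models them as a TypeScript object with the documented defaults. Several details are assumptions rather than documented behavior: the threshold comparison is taken to be inclusive, the `MaxThreatVectors` cap is applied after filtering and keeps the first entries, and the names `SemanticOptions`, `isLanguageEnabled`, and `applyThreatLimits` are hypothetical.

```typescript
interface SemanticOptions {
  enabled: boolean;
  threatConfidenceThreshold: number;
  maxThreatVectors: number;
  includeLowConfidenceCapabilities: boolean;
  enabledLanguages: string[];
}

// Defaults from the options table.
const defaultSemanticOptions: SemanticOptions = {
  enabled: true,
  threatConfidenceThreshold: 0.3,
  maxThreatVectors: 50,
  includeLowConfidenceCapabilities: false,
  enabledLanguages: [],
};

// An empty list means every language is analyzed.
function isLanguageEnabled(opts: SemanticOptions, language: string): boolean {
  return opts.enabledLanguages.length === 0 || opts.enabledLanguages.indexOf(language) !== -1;
}

// ASSUMPTION: threats at or above the threshold are kept, then the list is capped.
function applyThreatLimits<T extends { confidence: number }>(
  threats: T[],
  opts: SemanticOptions
): T[] {
  if (!opts.enabled) return [];
  return threats
    .filter((t) => t.confidence >= opts.threatConfidenceThreshold)
    .slice(0, opts.maxThreatVectors);
}

const keptThreats = applyThreatLimits(
  [
    { type: "Xss", confidence: 0.2 },
    { type: "SqlInjection", confidence: 0.3 },
    { type: "Ssrf", confidence: 0.9 },
  ],
  defaultSemanticOptions
); // keeps SqlInjection and Ssrf
```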

---

## Determinism Guarantees

All semantic analysis outputs are deterministic:

1. **Capability ordering** - Flags are ordered by value (bitmask position)
2. **Threat vector ordering** - Ordered by ThreatVectorType enum value
3. **Data boundary ordering** - Ordered by (Type, Direction) tuple
4. **Evidence ordering** - Alphabetically sorted within each element
5. **JSON serialization** - Uses camelCase naming, consistent formatting

This enables reliable diffing of semantic analysis results across scan runs.
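
The ordering rules for data boundaries and evidence can be sketched as a normalization step applied before serialization. This is an illustration under assumptions: enum members are represented by their numeric values, "alphabetically" is interpreted as an ordinal (code unit) string comparison, and `compareOrdinal` and `normalizeBoundaries` are hypothetical names.

```typescript
// Minimal boundary shape for the example; enum fields are numeric values.
interface Boundary {
  type: number;
  direction: number;
  evidence: string[];
}

// Ordinal comparison, independent of locale settings.
function compareOrdinal(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Return a sorted copy: boundaries by (type, direction), evidence alphabetically.
function normalizeBoundaries(boundaries: Boundary[]): Boundary[] {
  return boundaries
    .map((b) => ({
      type: b.type,
      direction: b.direction,
      evidence: b.evidence.slice().sort(compareOrdinal),
    }))
    .sort((x, y) => x.type - y.type || x.direction - y.direction);
}

const normalized = normalizeBoundaries([
  { type: 2, direction: 0, evidence: ["b", "a"] },
  { type: 1, direction: 1, evidence: [] },
  { type: 1, direction: 0, evidence: ["z", "y"] },
]);
// Order after normalization: (1,0), (1,1), (2,0), with each evidence list sorted.
```

Sorting copies rather than the input arrays keeps the step free of side effects, so the same input produces the same serialized output no matter the order in which the analyzers reported their findings.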

---

## CLI Usage

Semantic analysis can be enabled via the CLI `--semantic` flag:

```bash
stella scan --semantic docker.io/library/python:3.12
```

When the flag is set, the output includes a semantic summary:

```
Semantic Analysis:
  Intent: WebServer
  Framework: flask (v3.0.0)
  Capabilities: NetworkListen, FileRead, DatabaseSql
  Threat Vectors: 2 (SqlInjection, Ssrf)
  Risk Score: 0.72
  Confidence: High (0.85)
```

---

## References

- [OWASP Top 10 2021](https://owasp.org/Top10/)
- [CWE/SANS Top 25](https://cwe.mitre.org/top25/)
- [CycloneDX Property Extensions](https://cyclonedx.org/docs/1.5/json/#properties)
- [SPDX 3.0 External Identifiers](https://spdx.github.io/spdx-spec/v3.0/annexes/external-identifier-types/)