Fix build and code structure improvements. New but essential UI functionality. CI improvements. Documentation improvements. AI module improvements.

StellaOps Bot
2025-12-26 21:54:17 +02:00
parent 335ff7da16
commit c2b9cd8d1f
3717 changed files with 264714 additions and 48202 deletions

@@ -0,0 +1,624 @@
# CI/CD Troubleshooting Guide
Common issues and solutions for StellaOps CI/CD infrastructure.
## Quick Diagnostics
### Check Workflow Status
```bash
# View recent workflow runs
gh run list --limit 10
# View specific run logs
gh run view <run-id> --log
# Re-run failed workflow
gh run rerun <run-id>
```
### Verify Local Environment
```bash
# Check .NET SDK
dotnet --list-sdks
# Check Docker
docker version
docker buildx version
# Check Node.js
node --version
npm --version
# Check required tools
which cosign syft helm
```
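Because `which` stops being informative once several tools are involved, a small helper can list exactly which ones are absent. The following is a minimal sketch; the function name `missing_tools` is hypothetical and not part of the StellaOps scripts.

```bash
# Print each required tool that is not on PATH, one per line.
# Prints nothing when every tool is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: the signing, SBOM, and deployment tools checked above.
missing=$(missing_tools cosign syft helm)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
fi
```

`command -v` is used instead of `which` because it is a shell builtin and behaves the same across runners.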
---
## Build Failures
### NuGet Restore Failures
**Symptom:** `error NU1301: Unable to load the service index`
**Causes:**
1. Network connectivity issues
2. NuGet source unavailable
3. Invalid credentials
**Solutions:**
```bash
# Clear NuGet cache
dotnet nuget locals all --clear
# Check NuGet sources
dotnet nuget list source
# Restore with verbose logging
dotnet restore src/StellaOps.sln -v detailed
```
**In CI:**
```yaml
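- name: Restore with retry
  run: |
    for i in {1..3}; do
      dotnet restore src/StellaOps.sln && exit 0
      echo "Restore attempt $i failed; retrying in 30s..."
      sleep 30
    done
    # Fail the step if every attempt failed
    echo "Restore failed after 3 attempts" >&2
    exit 1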
```
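The same retry pattern can be factored into a reusable function so that other flaky steps share it. This is a sketch under the assumption that a fixed delay is enough; the name `retry` and its argument order are illustrative, not an existing StellaOps helper.

```bash
# retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts run out.
# Returns 0 on success, or the exit status of the last failed attempt.
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_n=1
  while :; do
    "$@" && return 0
    retry_status=$?
    if [ "$retry_n" -ge "$retry_max" ]; then
      echo "Giving up after $retry_n attempts" >&2
      return "$retry_status"
    fi
    echo "Attempt $retry_n failed; retrying in ${retry_delay}s..." >&2
    retry_n=$((retry_n + 1))
    sleep "$retry_delay"
  done
}
```

For the restore case this would be called as `retry 3 30 dotnet restore src/StellaOps.sln`. Propagating the final exit status matters: a loop that ends on `sleep` reports success even when every attempt failed.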
---
### SDK Version Mismatch
**Symptom:** `error MSB4236: The SDK 'Microsoft.NET.Sdk' specified could not be found`
**Solutions:**
1. Check `global.json`:
```bash
cat global.json
```
2. Install correct SDK:
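```yaml
# CI environment
- uses: actions/setup-dotnet@v4
  with:
    # An exact version needs no prerelease flag. For a floating version
    # such as '10.0.x', use dotnet-quality: 'preview' instead.
    dotnet-version: '10.0.100'
```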
3. Override SDK version:
```bash
# Remove global.json override
rm global.json
```
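To confirm a mismatch before changing anything, the version pinned in `global.json` can be compared with the installed SDKs. The sketch below assumes the usual `global.json` layout and checks for an exact match only, so it ignores any `rollForward` policy; both function names are hypothetical.

```bash
# Print the SDK version pinned in a global.json file (first "version" key).
# Uses grep/sed rather than a JSON parser, so it assumes the usual layout.
pinned_sdk_version() {
  grep -o '"version"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" |
    head -n 1 |
    sed 's/.*"\([^"]*\)"$/\1/'
}

# Succeed if the version appears as the first field of any line on stdin,
# which is the shape of `dotnet --list-sdks` output
# (for example "10.0.100 [/usr/share/dotnet/sdk]").
sdk_is_installed() {
  awk -v wanted="$1" '$1 == wanted { found = 1 } END { exit !found }'
}
```

A typical check would be `dotnet --list-sdks | sdk_is_installed "$(pinned_sdk_version global.json)"`.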
---
### Docker Build Failures
**Symptom:** `failed to solve: rpc error: code = Unknown`
**Causes:**
1. Disk space exhausted
2. Layer cache corruption
3. Network timeout
**Solutions:**
```bash
# Clean Docker system
docker system prune -af
docker builder prune -af
# Build without cache
docker build --no-cache -t myimage .
# Create a builder that uses host networking (can help when builds time out on network access)
docker buildx create --driver-opt network=host --use
```
---
### Multi-arch Build Failures
**Symptom:** `exec format error` or QEMU issues
**Solutions:**
```bash
# Install QEMU for cross-platform builds
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
# Create new buildx builder
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap
# Build for specific platforms
docker buildx build --platform linux/amd64 -t myimage .
```
---
## Test Failures
### Testcontainers Issues
**Symptom:** `Could not find a running Docker daemon`
**Solutions:**
1. Ensure Docker is running:
```bash
docker info
```
2. Set Testcontainers host:
```bash
export TESTCONTAINERS_HOST_OVERRIDE=host.docker.internal
# or for Linux
export TESTCONTAINERS_HOST_OVERRIDE=$(hostname -I | awk '{print $1}')
```
3. Use Ryuk container for cleanup:
```bash
export TESTCONTAINERS_RYUK_DISABLED=false
```
4. CI configuration:
```yaml
services:
  dind:
    image: docker:dind
    privileged: true
```
---
### PostgreSQL Test Failures
**Symptom:** `FATAL: role "postgres" does not exist`
**Solutions:**
1. Check connection string:
```bash
export STELLAOPS_TEST_POSTGRES_CONNECTION="Host=localhost;Database=test;Username=postgres;Password=postgres"
```
2. Use Testcontainers PostgreSQL:
```csharp
var container = new PostgreSqlBuilder()
    .WithDatabase("test")
    .WithUsername("postgres")
    .WithPassword("postgres")
    .Build();
```
3. Wait for PostgreSQL readiness:
```bash
until pg_isready -h localhost -p 5432; do
sleep 1
done
```
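The readiness loop above waits forever if PostgreSQL never comes up, which turns a startup failure into a job timeout. A bounded variant might look like the following sketch; `wait_for` is a hypothetical helper, and the timeout is approximate because it counts one-second sleeps rather than wall-clock time.

```bash
# wait_for <timeout-seconds> <command...>
# Polls the command about once per second. Returns 0 as soon as it
# succeeds, or 1 once the timeout has elapsed without success.
wait_for() {
  wait_left=$1
  shift
  until "$@"; do
    if [ "$wait_left" -le 0 ]; then
      echo "Timed out waiting for: $*" >&2
      return 1
    fi
    sleep 1
    wait_left=$((wait_left - 1))
  done
  return 0
}
```

For PostgreSQL this would be `wait_for 60 pg_isready -h localhost -p 5432`.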
---
### Test Timeouts
**Symptom:** `Test exceeded timeout`
**Solutions:**
1. Increase timeout:
```bash
dotnet test --blame-hang-timeout 10m
```
2. Run tests in parallel with limited concurrency:
```bash
dotnet test -maxcpucount:2
```
3. Identify slow tests:
```bash
dotnet test --logger "console;verbosity=detailed" --logger "trx"
```
---
### Determinism Test Failures
**Symptom:** `Output mismatch: expected SHA256 differs`
**Solutions:**
1. Check for non-deterministic sources:
- Timestamps
- Random GUIDs
- Floating-point operations
- Dictionary ordering
2. Run determinism comparison:
```bash
.gitea/scripts/test/determinism-run.sh
diff out/scanner-determinism/run1.json out/scanner-determinism/run2.json
```
3. Update golden fixtures:
```bash
.gitea/scripts/test/run-fixtures-check.sh --update
```
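When `diff` output is too large to read, comparing digests first gives a quick yes/no answer. This is a minimal sketch that assumes `sha256sum` or `shasum` is installed; the function names are illustrative and not taken from the StellaOps scripts.

```bash
# Print the SHA-256 digest of a file as lowercase hex.
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
}

# Succeed when both files hash to the same non-empty digest;
# otherwise report both digests on stderr.
same_digest() {
  digest_a=$(file_sha256 "$1")
  digest_b=$(file_sha256 "$2")
  if [ -n "$digest_a" ] && [ "$digest_a" = "$digest_b" ]; then
    return 0
  fi
  echo "Mismatch: $1=$digest_a $2=$digest_b" >&2
  return 1
}
```

With the paths used above: `same_digest out/scanner-determinism/run1.json out/scanner-determinism/run2.json`.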
---
## Deployment Failures
### SSH Connection Issues
**Symptom:** `ssh: connect to host X.X.X.X port 22: Connection refused`
**Solutions:**
1. Verify SSH key:
```bash
ssh-keygen -lf ~/.ssh/id_rsa.pub
```
2. Test connection:
```bash
ssh -vvv user@host
```
3. Add host to known_hosts:
```yaml
- name: Setup SSH
  run: |
    mkdir -p ~/.ssh
    ssh-keyscan -H ${{ secrets.DEPLOY_HOST }} >> ~/.ssh/known_hosts
```
---
### Registry Push Failures
**Symptom:** `unauthorized: authentication required`
**Solutions:**
1. Login to registry:
```bash
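# Pass the password on stdin so it does not appear in the process list or shell history
echo "$REGISTRY_PASSWORD" | docker login git.stella-ops.org -u "$REGISTRY_USERNAME" --password-stdin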
```
2. Check token permissions:
- `write:packages` scope required
- Token not expired
3. Use the login action in CI:
```yaml
- name: Login to Registry
  uses: docker/login-action@v3
  with:
    registry: git.stella-ops.org
    username: ${{ secrets.REGISTRY_USERNAME }}
    password: ${{ secrets.REGISTRY_PASSWORD }}
```
---
### Helm Deployment Failures
**Symptom:** `Error: UPGRADE FAILED: cannot patch`
**Solutions:**
1. Check resource conflicts:
```bash
kubectl get events -n stellaops --sort-by='.lastTimestamp'
```
2. Force upgrade:
```bash
helm upgrade --install --force stellaops ./devops/helm/stellaops
```
3. Clean up stuck release:
```bash
helm history stellaops
helm rollback stellaops <revision>
# or
kubectl delete secret -l name=stellaops,owner=helm
```
---
## Workflow Issues
### Workflow Not Triggering
**Symptom:** Push/PR doesn't trigger workflow
**Causes:**
1. Path filter not matching
2. Branch protection rules
3. YAML syntax error
**Solutions:**
1. Check path filters:
```yaml
on:
  push:
    paths:
      - 'src/**' # Check if files match
```
2. Validate YAML:
```bash
.gitea/scripts/validate/validate-workflows.sh
```
3. Check branch rules:
- Verify workflow permissions
- Check protected branch settings
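To test the path-filter cause locally, the list of changed files can be checked against the filter's directory. The sketch below only approximates a `src/**` filter with a prefix match; other glob forms and negated patterns are not handled, and `any_path_under` is a hypothetical name.

```bash
# Read changed file paths on stdin and succeed if any starts with the prefix.
# A prefix match only approximates the 'src/**' filter; other glob forms
# (for example '**/*.md' or negated patterns) are not handled here.
any_path_under() {
  while IFS= read -r changed_path; do
    case "$changed_path" in
      "$1"*) return 0 ;;
    esac
  done
  return 1
}
```

For the most recent commit this could be run as `git diff --name-only HEAD~1 | any_path_under src/`. A non-zero exit means no changed file falls under the filtered directory, so the workflow would not be expected to trigger.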
---
### Concurrency Issues
**Symptom:** Duplicate runs or stuck workflows
**Solutions:**
1. Add concurrency control:
```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```
2. Cancel stale runs manually:
```bash
gh run cancel <run-id>
```
---
### Artifact Upload/Download Failures
**Symptom:** `Unable to find any artifacts`
**Solutions:**
1. Check artifact names match:
```yaml
# Upload
- uses: actions/upload-artifact@v4
  with:
    name: my-artifact # Must match
# Download
- uses: actions/download-artifact@v4
  with:
    name: my-artifact # Must match
```
2. Check retention period:
```yaml
- uses: actions/upload-artifact@v4
  with:
    retention-days: 90 # Default is 90
```
3. Verify job dependencies:
```yaml
download-job:
  needs: [upload-job] # Must complete first
```
---
## Runner Issues
### Disk Space Exhausted
**Symptom:** `No space left on device`
**Solutions:**
1. Run cleanup script:
```bash
.gitea/scripts/util/cleanup-runner-space.sh
```
2. Add cleanup step to workflow:
```yaml
- name: Free disk space
  run: |
    docker system prune -af
    rm -rf /tmp/*
    df -h
```
3. Use larger runner:
```yaml
runs-on: ubuntu-latest-4xlarge
```
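Failing early with a clear message is usually cheaper than hitting `No space left on device` midway through a build. The following pre-flight check is a sketch; the function names and the threshold are assumptions to adapt per runner.

```bash
# Print the free space, in KiB, on the filesystem that holds the given path.
free_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# require_free_gib <path> <minimum-GiB>: fail when less space is available.
require_free_gib() {
  have_kib=$(free_kib "$1")
  need_kib=$(($2 * 1024 * 1024))
  if [ "$have_kib" -lt "$need_kib" ]; then
    echo "Only $((have_kib / 1024 / 1024)) GiB free on $1; need $2 GiB" >&2
    return 1
  fi
}
```

A workflow step could call `require_free_gib / 10` before the build. `df -P` is used so that each filesystem is reported on a single line, which keeps the `awk` field positions stable.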
---
### Out of Memory
**Symptom:** `Killed` or `OOMKilled`
**Solutions:**
1. Limit parallel jobs:
```yaml
strategy:
  max-parallel: 2
```
2. Limit dotnet memory:
```bash
export DOTNET_GCHeapHardLimit=0x40000000 # 1GB
```
3. Use swap:
```yaml
- name: Create swap
  run: |
    sudo fallocate -l 4G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
```
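The value given to `DOTNET_GCHeapHardLimit` in step 2 is a byte count written in hexadecimal: 1 GiB is 1024 × 1024 × 1024 = 1073741824 bytes, which is `0x40000000`. A small helper (the name `heap_limit_hex` is hypothetical) avoids doing that conversion by hand:

```bash
# Convert a size in MiB to the hexadecimal byte count that
# DOTNET_GCHeapHardLimit expects.
heap_limit_hex() {
  printf '0x%X\n' "$(($1 * 1024 * 1024))"
}

heap_limit_hex 1024   # prints 0x40000000 (1 GiB)
heap_limit_hex 2048   # prints 0x80000000 (2 GiB)
```

It can be used inline, for example `export DOTNET_GCHeapHardLimit=$(heap_limit_hex 1024)`.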
---
### Runner Not Picking Up Jobs
**Symptom:** Jobs stuck in `queued` state
**Solutions:**
1. Check runner status:
```bash
# Self-hosted runner
./run.sh --check
```
2. Verify labels match:
```yaml
runs-on: [self-hosted, linux, x64] # All labels must match
```
3. Restart runner service:
```bash
# Quote the pattern so systemctl, not the shell, expands it
sudo systemctl restart 'actions.runner.*.service'
```
---
## Signing & Attestation Issues
### Cosign Signing Failures
**Symptom:** `error opening key: no such file`
**Solutions:**
1. Check key configuration:
```bash
# From base64 secret
echo "$COSIGN_PRIVATE_KEY_B64" | base64 -d > cosign.key
# Verify key
cosign public-key --key cosign.key
```
2. Set password:
```bash
export COSIGN_PASSWORD="${{ secrets.COSIGN_PASSWORD }}"
```
3. Use keyless signing:
```yaml
- name: Sign with keyless
  env:
    COSIGN_EXPERIMENTAL: 1 # Needed only for cosign 1.x; keyless is the default in 2.x
  run: cosign sign --yes "$IMAGE"
```
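When the key is materialized from a base64 secret as in step 1, it helps to validate the input and restrict file permissions. The sketch below assumes the `COSIGN_PRIVATE_KEY_B64` variable from step 1 and a `base64` that accepts `-d`; `write_cosign_key` is a hypothetical helper name.

```bash
# Decode $COSIGN_PRIVATE_KEY_B64 into the given file. A newly created file
# is readable only by the current user; an existing file keeps its mode.
# Fails if the variable is empty or the payload is not valid base64.
write_cosign_key() {
  if [ -z "${COSIGN_PRIVATE_KEY_B64:-}" ]; then
    echo "COSIGN_PRIVATE_KEY_B64 is not set" >&2
    return 1
  fi
  (
    # The subshell keeps the restrictive umask from leaking to the caller.
    umask 077
    printf '%s' "$COSIGN_PRIVATE_KEY_B64" | base64 -d > "$1"
  )
}
```

A typical use is `write_cosign_key cosign.key` followed by `trap 'rm -f cosign.key' EXIT`, so the key does not outlive the job.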
---
### SBOM Generation Failures
**Symptom:** `syft: command not found`
**Solutions:**
1. Install Syft:
```bash
curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
```
2. Use container:
```yaml
- name: Generate SBOM
  uses: anchore/sbom-action@v0
  with:
    image: ${{ env.IMAGE }}
```
---
## Debugging Tips
### Enable Debug Logging
```yaml
env:
  ACTIONS_STEP_DEBUG: true
  ACTIONS_RUNNER_DEBUG: true
```
### SSH into Runner
```yaml
- name: Debug SSH
  uses: mxschmitt/action-tmate@v3
  if: failure()
```
### Collect Diagnostic Info
```yaml
- name: Diagnostics
  if: failure()
  run: |
    echo "=== Environment ==="
    env | sort
    echo "=== Disk ==="
    df -h
    echo "=== Memory ==="
    free -m
    echo "=== Docker ==="
    docker info
    docker ps -a
```
### View Workflow Logs
```bash
# Stream logs
gh run watch <run-id>
# Download the artifact named "logs" (gh run download fetches artifacts, not raw logs)
gh run download <run-id> --name logs
```
---
## Getting Help
1. **Check existing issues:** Search repository issues
2. **Review workflow history:** Look for similar failures
3. **Consult documentation:** `docs/` and `.gitea/docs/`
4. **Contact DevOps:** Create issue with label `ci-cd`
### Information to Include
- Workflow name and run ID
- Error message and stack trace
- Steps to reproduce
- Environment details (OS, SDK versions)
- Recent changes to affected code