Fix build and code structure improvements. New but essential UI functionality. CI improvements. Documentation improvements. AI module improvements.
624
.gitea/docs/troubleshooting.md
Normal file
@@ -0,0 +1,624 @@
# CI/CD Troubleshooting Guide

Common issues and solutions for StellaOps CI/CD infrastructure.

## Quick Diagnostics

### Check Workflow Status

```bash
# View recent workflow runs
gh run list --limit 10

# View specific run logs
gh run view <run-id> --log

# Re-run failed workflow
gh run rerun <run-id>
```

### Verify Local Environment

```bash
# Check .NET SDK
dotnet --list-sdks

# Check Docker
docker version
docker buildx version

# Check Node.js
node --version
npm --version

# Check required tools
which cosign syft helm
```

---

## Build Failures

### NuGet Restore Failures

**Symptom:** `error NU1301: Unable to load the service index`

**Causes:**
1. Network connectivity issues
2. NuGet source unavailable
3. Invalid credentials

**Solutions:**

```bash
# Clear NuGet cache
dotnet nuget locals all --clear

# Check NuGet sources
dotnet nuget list source

# Restore with verbose logging
dotnet restore src/StellaOps.sln -v detailed
```

**In CI:**
```yaml
- name: Restore with retry
  run: |
    for i in {1..3}; do
      # Stop as soon as one attempt succeeds
      dotnet restore src/StellaOps.sln && exit 0
      echo "Attempt $i failed; retrying in 30s..."
      sleep 30
    done
    # Fail the step if every attempt failed
    exit 1
```
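
The retry pattern above can be factored into a reusable helper. The following is a minimal sketch, not part of the StellaOps scripts: the function name `retry` and its argument order are assumptions made for illustration. It doubles the delay after each failed attempt and returns a non-zero status when every attempt fails, so the calling step fails too.

```bash
# retry <max_attempts> <initial_delay_seconds> <command...>
# Hypothetical helper: runs the command until it succeeds or the
# attempts are used up. The delay doubles after each failed attempt.
retry() (
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Leave the subshell with status 0 on the first success
    "$@" && exit 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Command failed after $attempt attempts: $*" >&2
      exit 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s..." >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
)

# Example (not run here):
#   retry 3 15 dotnet restore src/StellaOps.sln
```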

---

### SDK Version Mismatch

**Symptom:** `error MSB4236: The SDK 'Microsoft.NET.Sdk' specified could not be found`

**Solutions:**

1. Check `global.json`:
   ```bash
   cat global.json
   ```

2. Install correct SDK:
   ```yaml
   # CI environment
   - uses: actions/setup-dotnet@v4
     with:
       # An exact version is pinned, so no prerelease option is needed
       dotnet-version: '10.0.100'
   ```

3. Override SDK version:
   ```bash
   # Remove the global.json pin so the newest installed SDK is used
   rm global.json
   ```
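
To compare the pinned SDK with what is installed, the two checks above can be combined. This is a sketch under stated assumptions: the function names are hypothetical, and the `sed` expression assumes `global.json` contains a single `"version"` key (the SDK pin), which is the usual layout but is not verified here.

```bash
# pinned_sdk_version <path-to-global.json>
# Hypothetical helper: prints the first "version" value found in the file.
pinned_sdk_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# sdk_is_installed <version>
# Succeeds if `dotnet --list-sdks` lists exactly that version
# (the version is the first field of each output line).
sdk_is_installed() {
  dotnet --list-sdks | awk -v v="$1" '$1 == v { found = 1 } END { exit !found }'
}

# Example (not run here):
#   want=$(pinned_sdk_version global.json)
#   sdk_is_installed "$want" || echo "SDK $want is not installed" >&2
```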

---

### Docker Build Failures

**Symptom:** `failed to solve: rpc error: code = Unknown`

**Causes:**
1. Disk space exhausted
2. Layer cache corruption
3. Network timeout

**Solutions:**

```bash
# Clean Docker system
docker system prune -af
docker builder prune -af

# Build without cache
docker build --no-cache -t myimage .

# Create a buildx builder that uses the host network (can help with network timeouts)
docker buildx create --driver-opt network=host --use
```

---

### Multi-arch Build Failures

**Symptom:** `exec format error` or QEMU issues

**Solutions:**

```bash
# Install QEMU for cross-platform builds
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# Create new buildx builder
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap

# Build for specific platforms
docker buildx build --platform linux/amd64 -t myimage .
```

---

## Test Failures

### Testcontainers Issues

**Symptom:** `Could not find a running Docker daemon`

**Solutions:**

1. Ensure Docker is running:
   ```bash
   docker info
   ```

2. Set Testcontainers host:
   ```bash
   export TESTCONTAINERS_HOST_OVERRIDE=host.docker.internal
   # or for Linux
   export TESTCONTAINERS_HOST_OVERRIDE=$(hostname -I | awk '{print $1}')
   ```

3. Use Ryuk container for cleanup:
   ```bash
   export TESTCONTAINERS_RYUK_DISABLED=false
   ```

4. CI configuration:
   ```yaml
   services:
     dind:
       image: docker:dind
       privileged: true
   ```

---

### PostgreSQL Test Failures

**Symptom:** `FATAL: role "postgres" does not exist`

**Solutions:**

1. Check connection string:
   ```bash
   export STELLAOPS_TEST_POSTGRES_CONNECTION="Host=localhost;Database=test;Username=postgres;Password=postgres"
   ```

2. Use Testcontainers PostgreSQL:
   ```csharp
   var container = new PostgreSqlBuilder()
       .WithDatabase("test")
       .WithUsername("postgres")
       .WithPassword("postgres")
       .Build();
   ```

3. Wait for PostgreSQL readiness:
   ```bash
   until pg_isready -h localhost -p 5432; do
     sleep 1
   done
   ```
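
The readiness loop in step 3 waits forever if PostgreSQL never comes up, which can hang a CI job until the runner's own timeout. A bounded variant might look like the sketch below; the helper name `wait_until` and its one-second polling interval are assumptions for illustration.

```bash
# wait_until <timeout_seconds> <command...>
# Hypothetical helper: polls the command once per second and gives up
# with a non-zero status once the timeout has elapsed.
wait_until() (
  timeout=$1
  shift
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out after ${timeout}s waiting for: $*" >&2
      exit 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
)

# Example (not run here):
#   wait_until 60 pg_isready -h localhost -p 5432
```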

---

### Test Timeouts

**Symptom:** `Test exceeded timeout`

**Solutions:**

1. Increase timeout:
   ```bash
   dotnet test --blame-hang-timeout 10m
   ```

2. Run tests in parallel with limited concurrency:
   ```bash
   dotnet test -maxcpucount:2
   ```

3. Identify slow tests:
   ```bash
   dotnet test --logger "console;verbosity=detailed" --logger "trx"
   ```

---

### Determinism Test Failures

**Symptom:** `Output mismatch: expected SHA256 differs`

**Solutions:**

1. Check for non-deterministic sources:
   - Timestamps
   - Random GUIDs
   - Floating-point operations
   - Dictionary ordering

2. Run determinism comparison:
   ```bash
   .gitea/scripts/test/determinism-run.sh
   diff out/scanner-determinism/run1.json out/scanner-determinism/run2.json
   ```

3. Update golden fixtures:
   ```bash
   .gitea/scripts/test/run-fixtures-check.sh --update
   ```
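
Because the symptom is reported as a SHA-256 mismatch, it can be useful to compare digests directly before reading a full `diff`. The sketch below assumes a Linux runner with `sha256sum` from coreutils; the function name `same_digest` is hypothetical and is not one of the repository's scripts.

```bash
# same_digest <file_a> <file_b>
# Hypothetical helper: succeeds when both files have the same SHA-256
# digest, returns 1 on a mismatch and 2 if a file cannot be read.
same_digest() {
  if [ ! -r "$1" ] || [ ! -r "$2" ]; then
    echo "Cannot read $1 or $2" >&2
    return 2
  fi
  digest_a=$(sha256sum "$1" | cut -d ' ' -f 1)
  digest_b=$(sha256sum "$2" | cut -d ' ' -f 1)
  if [ "$digest_a" = "$digest_b" ]; then
    echo "match: $digest_a"
  else
    echo "mismatch: $1=$digest_a $2=$digest_b" >&2
    return 1
  fi
}

# Example (not run here):
#   same_digest out/scanner-determinism/run1.json out/scanner-determinism/run2.json
```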

---

## Deployment Failures

### SSH Connection Issues

**Symptom:** `ssh: connect to host X.X.X.X port 22: Connection refused`

**Solutions:**

1. Verify SSH key:
   ```bash
   ssh-keygen -lf ~/.ssh/id_rsa.pub
   ```

2. Test connection:
   ```bash
   ssh -vvv user@host
   ```

3. Add host to known_hosts:
   ```yaml
   - name: Setup SSH
     run: |
       mkdir -p ~/.ssh
       ssh-keyscan -H ${{ secrets.DEPLOY_HOST }} >> ~/.ssh/known_hosts
   ```

---

### Registry Push Failures

**Symptom:** `unauthorized: authentication required`

**Solutions:**

1. Login to registry:
   ```bash
   # Pass the password on stdin so it does not appear in the process list or shell history
   echo "$REGISTRY_PASSWORD" | docker login git.stella-ops.org -u "$REGISTRY_USERNAME" --password-stdin
   ```

2. Check token permissions:
   - `write:packages` scope required
   - Token not expired

3. Use the login action in CI:
   ```yaml
   - name: Login to Registry
     uses: docker/login-action@v3
     with:
       registry: git.stella-ops.org
       username: ${{ secrets.REGISTRY_USERNAME }}
       password: ${{ secrets.REGISTRY_PASSWORD }}
   ```

---

### Helm Deployment Failures

**Symptom:** `Error: UPGRADE FAILED: cannot patch`

**Solutions:**

1. Check resource conflicts:
   ```bash
   kubectl get events -n stellaops --sort-by='.lastTimestamp'
   ```

2. Force upgrade:
   ```bash
   helm upgrade --install --force stellaops ./devops/helm/stellaops
   ```

3. Clean up stuck release:
   ```bash
   helm history stellaops
   helm rollback stellaops <revision>
   # or, as a last resort, delete the Helm release secrets (this discards the release history)
   kubectl delete secret -l name=stellaops,owner=helm
   ```

---

## Workflow Issues

### Workflow Not Triggering

**Symptom:** Push/PR doesn't trigger workflow

**Causes:**
1. Path filter not matching
2. Branch protection rules
3. YAML syntax error

**Solutions:**

1. Check path filters:
   ```yaml
   on:
     push:
       paths:
         - 'src/**'  # Check if files match
   ```

2. Validate YAML:
   ```bash
   .gitea/scripts/validate/validate-workflows.sh
   ```

3. Check branch rules:
   - Verify workflow permissions
   - Check protected branch settings
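
To check a path filter such as `src/**` locally, one option is to list the files changed by the push and test whether any of them sits under the filtered directory. The sketch below only approximates that one filter shape (everything under a directory); the function name `any_path_under` is hypothetical, and the CI system's own glob matching remains the authority.

```bash
# any_path_under <dir> <path...>
# Hypothetical helper: succeeds if at least one of the given paths
# lies inside <dir>, which approximates a '<dir>/**' path filter.
any_path_under() {
  dir=${1%/}
  shift
  for changed in "$@"; do
    case $changed in
      "$dir"/*) return 0 ;;
    esac
  done
  return 1
}

# Example (not run here; assumes file names without spaces):
#   any_path_under src $(git diff --name-only HEAD~1 HEAD) \
#     || echo "No changed file matches src/**"
```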

---

### Concurrency Issues

**Symptom:** Duplicate runs or stuck workflows

**Solutions:**

1. Add concurrency control:
   ```yaml
   concurrency:
     group: ${{ github.workflow }}-${{ github.ref }}
     cancel-in-progress: true
   ```

2. Cancel stale runs manually:
   ```bash
   gh run cancel <run-id>
   ```

---

### Artifact Upload/Download Failures

**Symptom:** `Unable to find any artifacts`

**Solutions:**

1. Check artifact names match:
   ```yaml
   # Upload
   - uses: actions/upload-artifact@v4
     with:
       name: my-artifact  # Must match

   # Download
   - uses: actions/download-artifact@v4
     with:
       name: my-artifact  # Must match
   ```

2. Check retention period:
   ```yaml
   - uses: actions/upload-artifact@v4
     with:
       retention-days: 90  # Default is 90
   ```

3. Verify job dependencies:
   ```yaml
   download-job:
     needs: [upload-job]  # Must complete first
   ```

---

## Runner Issues

### Disk Space Exhausted

**Symptom:** `No space left on device`

**Solutions:**

1. Run cleanup script:
   ```bash
   .gitea/scripts/util/cleanup-runner-space.sh
   ```

2. Add cleanup step to workflow:
   ```yaml
   - name: Free disk space
     run: |
       docker system prune -af
       rm -rf /tmp/*
       df -h
   ```

3. Use larger runner:
   ```yaml
   runs-on: ubuntu-latest-4xlarge
   ```
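
A job can also fail early with a clear message instead of running out of space halfway through a build. The following sketch parses `df -P` output; the function names and the idea of a usage threshold are assumptions for illustration, and the parsing assumes the filesystem name contains no spaces.

```bash
# disk_used_percent <path>
# Hypothetical helper: prints the used capacity of the filesystem
# holding <path> as a bare integer (for example 42).
disk_used_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# require_free_space <path> <max_used_percent>
# Fails with a message when usage exceeds the given percentage.
require_free_space() {
  used=$(disk_used_percent "$1")
  if [ "$used" -gt "$2" ]; then
    echo "Disk at ${used}% on $1 (limit $2%)" >&2
    return 1
  fi
}

# Example (not run here):
#   require_free_space / 85 || docker system prune -af
```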

---

### Out of Memory

**Symptom:** `Killed` or `OOMKilled`

**Solutions:**

1. Limit parallel jobs:
   ```yaml
   strategy:
     max-parallel: 2
   ```

2. Limit dotnet memory:
   ```bash
   export DOTNET_GCHeapHardLimit=0x40000000  # 1 GiB, given as a hexadecimal byte count
   ```

3. Use swap:
   ```yaml
   - name: Create swap
     run: |
       sudo fallocate -l 4G /swapfile
       sudo chmod 600 /swapfile
       sudo mkswap /swapfile
       sudo swapon /swapfile
   ```
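
The `DOTNET_GCHeapHardLimit` value in step 2 is a byte count written in hexadecimal: 1024 MiB × 1024 × 1024 = 1,073,741,824 bytes = `0x40000000`. A small helper can do the conversion for other sizes; the function name `heap_limit_hex` is an assumption for illustration.

```bash
# heap_limit_hex <mebibytes>
# Hypothetical helper: prints the byte count for the given number of
# MiB as a 0x-prefixed hexadecimal value.
heap_limit_hex() {
  printf '0x%X\n' "$(( $1 * 1024 * 1024 ))"
}

# Example (not run here):
#   export DOTNET_GCHeapHardLimit=$(heap_limit_hex 512)
```

With this sketch, `heap_limit_hex 1024` prints `0x40000000` and `heap_limit_hex 512` prints `0x20000000`.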

---

### Runner Not Picking Up Jobs

**Symptom:** Jobs stuck in `queued` state

**Solutions:**

1. Check runner status:
   ```bash
   # Self-hosted runner
   ./run.sh --check
   ```

2. Verify labels match:
   ```yaml
   runs-on: [self-hosted, linux, x64]  # All labels must match
   ```

3. Restart runner service:
   ```bash
   sudo systemctl restart actions.runner.*.service
   ```

---

## Signing & Attestation Issues

### Cosign Signing Failures

**Symptom:** `error opening key: no such file`

**Solutions:**

1. Check key configuration:
   ```bash
   # From base64 secret
   echo "$COSIGN_PRIVATE_KEY_B64" | base64 -d > cosign.key

   # Verify key
   cosign public-key --key cosign.key
   ```

2. Set password:
   ```bash
   export COSIGN_PASSWORD="${{ secrets.COSIGN_PASSWORD }}"
   ```

3. Use keyless signing:
   ```yaml
   - name: Sign with keyless
     env:
       COSIGN_EXPERIMENTAL: 1
     run: cosign sign --yes $IMAGE
   ```

---

### SBOM Generation Failures

**Symptom:** `syft: command not found`

**Solutions:**

1. Install Syft:
   ```bash
   curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
   ```

2. Use container:
   ```yaml
   - name: Generate SBOM
     uses: anchore/sbom-action@v0
     with:
       image: ${{ env.IMAGE }}
   ```

---

## Debugging Tips

### Enable Debug Logging

```yaml
env:
  ACTIONS_STEP_DEBUG: true
  ACTIONS_RUNNER_DEBUG: true
```

### SSH into Runner

```yaml
- name: Debug SSH
  uses: mxschmitt/action-tmate@v3
  if: failure()
```

### Collect Diagnostic Info

```yaml
- name: Diagnostics
  if: failure()
  run: |
    echo "=== Environment ==="
    env | sort
    echo "=== Disk ==="
    df -h
    echo "=== Memory ==="
    free -m
    echo "=== Docker ==="
    docker info
    docker ps -a
```

### View Workflow Logs

```bash
# Stream logs
gh run watch <run-id>

# Download the artifact named "logs" from the run
gh run download <run-id> --name logs
```

---

## Getting Help

1. **Check existing issues:** Search repository issues
2. **Review workflow history:** Look for similar failures
3. **Consult documentation:** `docs/` and `.gitea/docs/`
4. **Contact DevOps:** Create issue with label `ci-cd`

### Information to Include

- Workflow name and run ID
- Error message and stack trace
- Steps to reproduce
- Environment details (OS, SDK versions)
- Recent changes to affected code