Files
git.stella-ops.org/docs/ops/feedser-nkcki-operations.md
Vladimir Moushkov ea1106ce7c up
2025-10-15 10:03:56 +03:00

49 lines
3.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# NKCKI Connector Operations Guide
## Overview
The NKCKI connector ingests JSON bulletin archives from cert.gov.ru, expanding each `*.json.zip` attachment into per-vulnerability DTOs before canonical mapping. The fetch pipeline now supports cache-backed recovery, deterministic pagination, and telemetry suitable for production monitoring.
## Configuration
Key options exposed through `feedser:sources:ru-nkcki:http`:
- `maxBulletinsPerFetch` limits new bulletin downloads in a single run (default `5`).
- `maxListingPagesPerFetch` maximum listing pages visited during pagination (default `3`).
- `listingCacheDuration` minimum interval between listing fetches before falling back to cached artefacts (default `00:10:00`).
- `cacheDirectory` optional path for persisted bulletin archives used during offline or failure scenarios.
- `requestDelay` delay inserted between bulletin downloads to respect upstream politeness.
When operating in offline-first mode, set `cacheDirectory` to a writable path (e.g. `/var/lib/feedser/cache/ru-nkcki`) and pre-populate bulletin archives via the offline kit.
## Telemetry
`RuNkckiDiagnostics` emits the following metrics under meter `StellaOps.Feedser.Source.Ru.Nkcki`:
- `nkcki.listing.fetch.attempts` / `nkcki.listing.fetch.success` / `nkcki.listing.fetch.failures`
- `nkcki.listing.pages.visited` (histogram, `pages`)
- `nkcki.listing.attachments.discovered` / `nkcki.listing.attachments.new`
- `nkcki.bulletin.fetch.success` / `nkcki.bulletin.fetch.cached` / `nkcki.bulletin.fetch.failures`
- `nkcki.entries.processed` (histogram, `entries`)
Integrate these counters into standard Feedser observability dashboards to track crawl coverage and cache hit rates.
## Archive Backfill Strategy
Bitrix pagination surfaces archives via `?PAGEN_1=n`. The connector now walks up to `maxListingPagesPerFetch` pages, deduplicating bulletin IDs and maintaining a rolling `knownBulletins` window. Backfill strategy:
1. Enumerate pages from newest to oldest, respecting `maxListingPagesPerFetch` and `listingCacheDuration` to avoid refetch storms.
2. Persist every `*.json.zip` attachment to the configured cache directory. This enables replay when listing access is temporarily blocked.
3. During archive replay, `ProcessCachedBulletinsAsync` enqueues missing documents while respecting `maxVulnerabilitiesPerFetch`.
4. For historical HTML-only advisories, collect page URLs and metadata while offline (future work: HTML and PDF extraction pipeline documented in `docs/feedser-connector-research-20251011.md`).
For large migrations, seed caches with archived zip bundles, then run fetch/parse/map cycles in chronological order to maintain deterministic outputs.
## Failure Handling
- Listing failures mark the source state with exponential backoff while attempting cache replay.
- Bulletin fetches fall back to cached copies before surfacing an error.
- Mongo integration tests rely on bundled OpenSSL 1.1 libraries (`tools/openssl/linux-x64`) to keep `Mongo2Go` operational on modern distros.
Refer to `ru-nkcki` entries in `src/StellaOps.Feedser.Source.Ru.Nkcki/TASKS.md` for outstanding items.