Documentation

Quality metric glossary

Definitions for every quality-tab metric the catalog surfaces. Tier badges next to each metric in the catalog tell you whether the value was computed by the in-process sniffer, will be computed by an external job (planned), or requires a reference signal alongside the asset.

Provenance tier

computed — Computed by the catalog sniffer for this asset.

external — Mock today; will be computed by an external job (KNative / K8s job) when the heavier decode pipeline lands.

reference — Requires a reference signal (gold transcript, source recording, human label). Mocked or supplied alongside the asset.

Audio

Loudness (LUFS)

Integrated program loudness measured per EBU R128 / ITU-R BS.1770. A target of −16 LUFS is broadcast-typical for spoken audio; podcasts and music master at −14 to −24 LUFS depending on platform.

Method: Decode samples and run a BS.1770 K-weighted integrator (`bs1770` or `ebur128` crate). Requires a full decode pass — our pipeline ships this as an external job.

Signal-to-noise ratio (SNR)

Ratio of signal power to background-noise power, in decibels. Speech recordings target ≥ 30 dB; studio music ≥ 60 dB. Higher is cleaner.

Method: Without a clean reference we estimate the noise floor from the quietest 5–10% of windows and compare to peak signal level. Approximate, not reference-grade.

Clipping rate

Fraction of samples that hit the digital ceiling (|sample| ≥ 0.999 × full scale). Any non-zero value indicates the signal was squashed at peak — irreversible distortion.

Method: Trivially calculable from a sample stream: count clipped samples / total samples. Cheap once we decode.

Silence ratio

Fraction of the timeline below a silence threshold (e.g. RMS under −50 dBFS over 0.5-second windows). High values flag accidental dead air or trailing silence.

Method: Windowed RMS over the decoded sample stream. External job at scale.

Dynamic range (LRA)

Loudness range — the difference between typical-loud and typical-quiet sections, in LU. Talk-radio is ≈ 5 LU, classical music 15–20 LU. Low values suggest aggressive compression.

Method: Computed as part of a BS.1770 pass. External job.

Spectral balance

How energy is distributed across frequency bands — voice-band centric (300 Hz – 3.4 kHz) vs full-band. A speech file with significant energy above 8 kHz may indicate music or noise.

Method: FFT (`rustfft`), sum band energies, classify. External job.

Sample rate

Samples per second of the decoded waveform. Speech is fine at 16 kHz; music wants 44.1 / 48 kHz.

Method: Read from the codec parameter block (already in the sniffer header path).

Audio · Transcript

ASR confidence median

Median per-token confidence score reported by the ASR model itself (e.g. Whisper-v3). Lower than ≈ 0.7 suggests poor input audio or a domain mismatch.

Method: Read directly from the ASR output JSON if present. Reference-required in the sense that you need an ASR pass to produce it.

Contract · audio

Loudness target

Production target for integrated loudness, in LUFS, with a tolerance band (e.g. −16 LUFS ±1 LU). Out-of-spec audio is auto-normalized at distribution.

Method: Contract field. Compliance check is the LUFS metric on the quality tab.

Contract · audio · video

Codec floor

Minimum acceptable codec / bitrate / encoding spec. Files below the floor are rejected at ingest. Example: 'OGG Vorbis ≥ 96 kbps' or 'VP9 / WebM (preferred) or AV1'.

Method: Contract field; checked against sniffed codec parameters at ingest.

Contract · cross-cutting

Retention

How long the asset (and its derivatives) may be kept. Tied to compliance (SOX 7-year, GDPR right to erasure, CCPA 5-year) and to storage tier choice.

Method: Contract field; auto-renewed from policy bindings on the governance side.

Attribution

Whether (and how) downstream consumers must credit the source. CC-BY family licenses require attribution; PD-USGov-NASA does not. Affects auto-injected provenance tags on export.

Method: Contract field bound to the source-license selection.

Derivative works

Whether transformed / remixed copies may be produced and distributed. CC-BY allows them with attribution; CC-BY-ND forbids derivatives entirely.

Method: Contract field, license-driven.

Commercial use

Whether the asset may be used in commercial products or paid distribution. CC-BY-NC restricts to non-commercial; PD and CC-BY family permit it.

Method: Contract field, license-driven.

Modifications policy

Rules governing edits, including share-alike obligations. CC-BY-SA requires derivatives to inherit the same license; plain CC-BY does not.

Method: Contract field bound to the upstream license.

Contract · document

Alt-text

Whether figures / images embedded in the document must carry alternative-text descriptions for screen readers. Required for PDF/A-2u and WCAG conformance.

Method: Contract field; verified by walking the structure tree.

Contract · image

Color profile

Required color space + ICC profile for delivered images (sRGB / Display P3 / ProPhoto). Mismatched profiles cause visible color shifts on consumption.

Method: Contract field; verified against the sniffed ICC profile.

Contract · structured

Schema version

The contracted version of the dataset schema. Breaking changes bump the major; column additions bump the minor; bug fixes bump the patch.

Method: Contract field; verified against the parquet / Delta schema at ingest.

Freshness SLA

Maximum allowed lag between the upstream source-of-truth and the gold-tier mirror. Gold is typically T+4 h; silver T+24 h.

Method: Contract field; tracked by the dataset-pipeline observer.

Quality SLA

Contracted minimum row-level pass-rate across the quality suite (null-rate, foreign-key integrity, business-rule pass). 99.98% is broadcast-grade for finance gold tables.

Method: Contract field; verified by the quality test runner.

Contract · text

Source freshness SLA

Maximum acceptable age of the snapshot relative to the live upstream source. Wikipedia summaries target ≤ 24 h; news feeds typically ≤ 1 h.

Method: Contract field; checked against `fetched_at` on every read.

Contract · text · document · image

Source license

License of the upstream source the asset was derived from (CC-BY-SA 4.0 for Wikipedia, arXiv perpetual non-exclusive, PD-old-100, etc.). Drives obligations on derivatives.

Method: Contract field; auto-captured at fetch time when machine-readable.

Contract · transcript

Source ASR

Which automatic speech-recognition model produced the transcript (Whisper-v3, AWS Transcribe, Google STT, …). Influences WER and confidence interpretation.

Method: Contract field; recorded in the transcript's lineage chain.

Contract · transcript · text

PII redaction policy

Rules governing removal of personally-identifying tokens (names, accounts, card numbers, addresses) before publish or downstream consumption.

Method: Contract field. The PII flag rate metric on the quality tab reports residual hits after the policy runs.

Downstream use

Permitted downstream applications — internal analytics only, anonymized fine-tune, RAG retrieval, external sharing. Drives access-control rules on the search index.

Method: Contract field; enforced by Cedar policies on retrieval.

Contract · video

Frame rate

Acceptable frame rates for the deliverable (typically 24 / 25 / 30 / 60 fps). Variable-frame-rate inputs are normally re-encoded at ingest.

Method: Contract field; verified from the codec config block.

DRM

Digital rights management. 'DRM-free' permits unencumbered redistribution; 'DRM-required' triggers a packaging step (e.g. Widevine, FairPlay) at distribution time.

Method: Contract field; reflected in the export manifest.

Contract · video · audio

Captioning

Whether captions / subtitles are required for distribution. Drives accessibility compliance (Section 508, WCAG 2.x).

Method: Contract field. Verified against the container's caption-track presence.

Contract · video · image

Resolution floor

Minimum frame / image resolution. Below this, the asset is considered too low-fidelity for distribution.

Method: Contract field; checked against sniffed dimensions at ingest.

Document · PDF

Text extractable

Whether the PDF carries native text streams (no OCR needed) or is scanned imagery requiring optical character recognition.

Method: Try `lopdf` text extraction per page; native-text yield > 0 chars / page → extractable. Cheap header path.

Font embedding

Fraction of glyphs in the document whose font subsets are embedded (vs referenced by name only). Missing fonts cause reflow on viewers without the same fonts installed.

Method: Walk `/Font` dictionaries and compare font names with FontFile streams. Cheap header path.

PDF/A compliance

Conformance to ISO 19005 archival-PDF standards (PDF/A-1, A-2, A-3, A-4). PDF/A-2u is the typical accessibility-friendly target.

Method: Surface check: read XMP for a PDF/A claim. Full validation requires veraPDF — external job.

Tagged structure

Whether the PDF carries a `/StructTreeRoot` and `MarkInfo /Marked true` — required for screen-reader navigation and PDF/A-2u accessibility.

Method: Direct dictionary lookup. Cheap header path.

Link integrity

Fraction of `/URI` annotations in the document that resolve to a reachable URL.

Method: Extract URIs (lopdf), HEAD them. Heavy — run as an external batch.

Document · Text

Readability (Flesch)

Flesch reading-ease score. Higher = easier; ≥ 60 is plain-English, 30–60 is technical writing, < 30 is dense academic prose.

Method: Sentence + syllable count over the body text. Cheap, deterministic.

Language confidence

Probability that the detected language code is correct, as reported by the language-ID model.

Method: Pulled from `whatlang::detect` (already wired). Computed.

Citation density

Number of citations or hyperlinks per 1,000 words. A signal of scholarly grounding; not normally meaningful for short prose.

Method: Regex over the body text. Cheap.

Image

Sharpness (Laplacian σ²)

Variance of the Laplacian of the decoded image. Crisp photos have high variance; blurred or out-of-focus images are low.

Method: Decode + Laplacian filter (`image` + ~4 lines). External job once we wire image decode in the sniffer.

EXIF completeness

Fraction of an opinionated set of camera-context tags (Make / Model / DateTime / ISO / FocalLength / GPS) that are present in the file. Useful for detecting stripped-metadata uploads.

Method: Counted from kamadak-exif output (already wired). Computed.

Color space

The color model the file was encoded in (sRGB, Display P3, ProPhoto). Mismatched color spaces between source and consumer cause visible color shifts.

Method: Read from the embedded ICC profile. Cheap header path.

Compression quality

Estimated JPEG / WebP quality factor based on the quantization matrices in the file. Native q=85 is the typical photo-sharing ceiling; below ≈ 60 visible artifacts appear.

Method: Quantization-matrix inspection. Cheap header path.

Taxonomy · cross-cutting

Domain taxonomy

Hierarchical classification of every catalog asset into one of ~10 top-level buckets and ~50 second-level subdomains (Investment Management → Holdings & Positions, External Data → Market Data, …). Drives catalog filters and search facets.

Method: Stored on `catalog_assets.domain_path` as a Postgres `ltree`; labels + descriptions live in `domain_taxonomy`. Subtree filters use ltree's native `<@` ancestor operator.

Top-level domain

First segment of an asset's taxonomy path — its broadest bucket (e.g. Investment Management, External Data). One of ten in the asset-manager-shaped seed.

Method: Derived as `subpath(domain_path, 0, 1)` in the search export view; emitted as the `domain_l1` Qdrant facet field.

Subdomain

Second segment of an asset's taxonomy path — a refinement of the top-level bucket (e.g. Equity Research under Investment Research).

Method: Derived as `subpath(domain_path, 1, 1)` in the search export view; populates the cascading subfilter on the catalog list.

Transcript

Word Error Rate (WER)

Edit-distance error rate of the transcript against a gold human reference, normalized by reference length. Lower is better; human-typical is ≈ 4%, broadcast ASR ≈ 10–15%.

Method: Reference-required: needs a gold-standard transcript. Mocked here.

Diarization Error Rate (DER)

Combined miss / false-alarm / speaker-confusion error rate from the speaker-attribution pass. Driven by overlap, channel conditions, and speaker count.

Method: Reference-required: needs gold speaker labels. Mocked here.

Speakers

Number of distinct speakers present in the transcript.

Method: JSON parse; counted directly. Computed.

Utterances

Number of speaker turns / sentences in the transcript.

Method: JSON parse. Computed.

Disfluency rate

Fraction of tokens that are filled-pauses or repairs (um, uh, er, like, you-know, false starts). High values flag rough audio or overly raw transcription.

Method: Token-level regex against a disfluency lexicon. Cheap, computed in the sniffer.

Transcript · Text

PII flag rate

Number of detected personally-identifying tokens (names, SSNs, credit cards, emails) per 1,000 words. Drives downstream redaction policy.

Method: Regex / heuristic detection (`scrubadub`-style). Cheap.

Profanity flags

Count of tokens matching the configured profanity wordlist.

Method: Wordlist scan over the body. Cheap.

Video

VMAF

Netflix Video Multimethod Assessment Fusion — a perceptual quality score (0–100) trained against subjective ratings. 90+ is transparent quality; 60 is noticeably degraded.

Method: Reference-required: VMAF compares the encoded asset to an uncompressed source. Without the source, no value is calculable.

Bitrate stability

How consistent the encoded bitrate is over time. Wide swings can starve clients on poor connections or saturate CDNs.

Method: Per-second bitrate from container parsing, then σ/μ. Cheap from header + index.

Dropped-frame rate

Fraction of presentation timestamps that are missing or out of order in the delivered stream — a sign of capture or encode hiccups.

Method: Walk the container packet stream, look for non-monotonic PTS / DTS. Doable header-side; we run it as part of the external decode job.

A/V sync drift

Drift between the audio and video tracks at known sync events (claps, click-tracks). Tolerable threshold is ≈ 40 ms; > 100 ms is perceptible.

Method: Reference-required: needs a known sync event on both tracks. Mocked in the catalog today.

Color depth

Bits per channel of the encoded frame (8-bit, 10-bit, 12-bit). 8-bit is BT.709 SDR; 10-bit unlocks BT.2020 / HDR.

Method: Codec-config read. Cheap header path.

Captions present

Whether the container ships a caption / subtitle track (WebVTT, SRT, CEA-608). Required for accessibility distribution.

Method: Track-list inspection.