Definitions for every quality-tab metric the catalog surfaces. Tier badges next to each metric in the catalog tell you whether the value was computed by the in-process sniffer, will be computed by an external job (planned), or requires a reference signal alongside the asset.
Integrated program loudness measured per EBU R128 / ITU-R BS.1770. A target of −16 LUFS is broadcast-typical for spoken audio; podcasts and music master at −14 to −24 LUFS depending on platform.
Method: Decode samples and run a BS.1770 K-weighted integrator (`bs1770` or `ebur128` crate). Requires a full decode pass — our pipeline ships this as an external job.
Ratio of signal power to background-noise power, in decibels. Speech recordings target ≥ 30 dB; studio music ≥ 60 dB. Higher is cleaner.
Method: Without a clean reference we estimate the noise floor from the quietest 5–10% of windows and compare to peak signal level. Approximate, not reference-grade.
Fraction of samples that hit the digital ceiling (|sample| ≥ 0.999 × full scale). Any non-zero value indicates the signal was squashed at peak — irreversible distortion.
Method: Trivially calculable from a sample stream: count clipped samples / total samples. Cheap once we decode.
Fraction of the timeline below a silence threshold (e.g. RMS under −50 dBFS over 0.5-second windows). High values flag accidental dead air or trailing silence.
Method: Windowed RMS over the decoded sample stream. External job at scale.
Loudness range — the difference between typical-loud and typical-quiet sections, in LU. Talk-radio is ≈ 5 LU, classical music 15–20 LU. Low values suggest aggressive compression.
Method: Computed as part of a BS.1770 pass. External job.
How energy is distributed across frequency bands — voice-band centric (300 Hz – 3.4 kHz) vs full-band. A speech file with significant energy above 8 kHz may indicate music or noise.
Method: FFT (`rustfft`), sum band energies, classify. External job.
Samples per second of the decoded waveform. Speech is fine at 16 kHz; music wants 44.1 / 48 kHz.
Method: Read from the codec parameter block (already in the sniffer header path).
Median per-token confidence score reported by the ASR model itself (e.g. Whisper-v3). Lower than ≈ 0.7 suggests poor input audio or a domain mismatch.
Method: Read directly from the ASR output JSON if present. Reference-required in the sense that you need an ASR pass to produce it.
Production target for integrated loudness, in LUFS, with a tolerance band (e.g. −16 LUFS ±1 LU). Out-of-spec audio is auto-normalized at distribution.
Method: Contract field. Compliance check is the LUFS metric on the quality tab.
Minimum acceptable codec / bitrate / encoding spec. Files below the floor are rejected at ingest. Example: 'OGG Vorbis ≥ 96 kbps' or 'VP9 / WebM (preferred) or AV1'.
Method: Contract field; checked against sniffed codec parameters at ingest.
How long the asset (and its derivatives) may be kept. Tied to compliance (SOX 7-year, GDPR right to erasure, CCPA 5-year) and to storage tier choice.
Method: Contract field; auto-renewed from policy bindings on the governance side.
Whether (and how) downstream consumers must credit the source. CC-BY family licenses require attribution; PD-USGov-NASA does not. Affects auto-injected provenance tags on export.
Method: Contract field bound to the source-license selection.
Whether transformed / remixed copies may be produced and distributed. CC-BY allows them with attribution; CC-BY-ND forbids derivatives entirely.
Method: Contract field, license-driven.
Whether the asset may be used in commercial products or paid distribution. CC-BY-NC restricts to non-commercial; PD and CC-BY family permit it.
Method: Contract field, license-driven.
Rules governing edits, including share-alike obligations. CC-BY-SA requires derivatives to inherit the same license; plain CC-BY does not.
Method: Contract field bound to the upstream license.
Whether figures / images embedded in the document must carry alternative-text descriptions for screen readers. Required for PDF/A-2u and WCAG conformance.
Method: Contract field; verified by walking the structure tree.
Required color space + ICC profile for delivered images (sRGB / Display P3 / ProPhoto). Mismatched profiles cause visible color shifts on consumption.
Method: Contract field; verified against the sniffed ICC profile.
The contracted version of the dataset schema. Breaking changes bump the major; column additions bump the minor; bug fixes bump the patch.
Method: Contract field; verified against the parquet / Delta schema at ingest.
Maximum allowed lag between the upstream source-of-truth and the gold-tier mirror. Gold is typically T+4 h; silver T+24 h.
Method: Contract field; tracked by the dataset-pipeline observer.
Contracted minimum row-level pass-rate across the quality suite (null-rate, foreign-key integrity, business-rule pass). 99.98% is broadcast-grade for finance gold tables.
Method: Contract field; verified by the quality test runner.
Maximum acceptable age of the snapshot relative to the live upstream source. Wikipedia summaries target ≤ 24 h; news feeds typically ≤ 1 h.
Method: Contract field; checked against `fetched_at` on every read.
License of the upstream source the asset was derived from (CC-BY-SA 4.0 for Wikipedia, arXiv perpetual non-exclusive, PD-old-100, etc.). Drives obligations on derivatives.
Method: Contract field; auto-captured at fetch time when machine-readable.
Whether all speakers in the recording have consented to transcription, retention, and downstream use. Customer-call recordings inherit CCPA / GDPR consent terms.
Method: Contract field, evidenced by a consent log keyed to the recording's call-id.
Which automatic speech-recognition model produced the transcript (Whisper-v3, AWS Transcribe, Google STT, …). Influences WER and confidence interpretation.
Method: Contract field; recorded in the transcript's lineage chain.
Rules governing removal of personally-identifying tokens (names, accounts, card numbers, addresses) before publish or downstream consumption.
Method: Contract field. The PII flag rate metric on the quality tab reports residual hits after the policy runs.
Permitted downstream applications — internal analytics only, anonymized fine-tune, RAG retrieval, external sharing. Drives access-control rules on the search index.
Method: Contract field; enforced by Cedar policies on retrieval.
Acceptable frame rates for the deliverable (typically 24 / 25 / 30 / 60 fps). Variable-frame-rate inputs are normally re-encoded at ingest.
Method: Contract field; verified from the codec config block.
Digital rights management. 'DRM-free' permits unencumbered redistribution; 'DRM-required' triggers a packaging step (e.g. Widevine, FairPlay) at distribution time.
Method: Contract field; reflected in the export manifest.
Whether captions / subtitles are required for distribution. Drives accessibility compliance (Section 508, WCAG 2.x).
Method: Contract field. Verified against the container's caption-track presence.
Minimum frame / image resolution. Below this, the asset is considered too low-fidelity for distribution.
Method: Contract field; checked against sniffed dimensions at ingest.
Whether the PDF carries native text streams (no OCR needed) or is scanned imagery requiring optical character recognition.
Method: Try `lopdf` text extraction per page; native-text yield > 0 chars / page → extractable. Cheap header path.
Fraction of glyphs in the document whose font subsets are embedded (vs referenced by name only). Missing fonts cause reflow on viewers without the same fonts installed.
Method: Walk `/Font` dictionaries and compare font names with FontFile streams. Cheap header path.
Conformance to ISO 19005 archival-PDF standards (PDF/A-1, A-2, A-3, A-4). PDF/A-2u is the typical accessibility-friendly target.
Method: Surface check: read XMP for a PDF/A claim. Full validation requires veraPDF — external job.
Whether the PDF carries a `/StructTreeRoot` and `MarkInfo /Marked true` — required for screen-reader navigation and PDF/A-2u accessibility.
Method: Direct dictionary lookup. Cheap header path.
Fraction of `/URI` annotations in the document that resolve to a reachable URL.
Method: Extract URIs (lopdf), HEAD them. Heavy — run as an external batch.
Flesch reading-ease score. Higher = easier; ≥ 60 is plain-English, 30–60 is technical writing, < 30 is dense academic prose.
Method: Sentence + syllable count over the body text. Cheap, deterministic.
Probability that the detected language code is correct, as reported by the language-ID model.
Method: Pulled from `whatlang::detect` (already wired). Computed.
Number of citations or hyperlinks per 1,000 words. A signal of scholarly grounding; not normally meaningful for short prose.
Method: Regex over the body text. Cheap.
Fraction of an opinionated set of camera-context tags (Make / Model / DateTime / ISO / FocalLength / GPS) that are present in the file. Useful for detecting stripped-metadata uploads.
Method: Counted from kamadak-exif output (already wired). Computed.
The color model the file was encoded in (sRGB, Display P3, ProPhoto). Mismatched color spaces between source and consumer cause visible color shifts.
Method: Read from the embedded ICC profile. Cheap header path.
Estimated JPEG / WebP quality factor based on the quantization matrices in the file. Native q=85 is the typical photo-sharing ceiling; below ≈ 60 visible artifacts appear.
Method: Quantization-matrix inspection. Cheap header path.
Hierarchical classification of every catalog asset into one of ~10 top-level buckets and ~50 second-level subdomains (Investment Management → Holdings & Positions, External Data → Market Data, …). Drives catalog filters and search facets.
Method: Stored on `catalog_assets.domain_path` as a Postgres `ltree`; labels + descriptions live in `domain_taxonomy`. Subtree filters use ltree's native `<@` ancestor operator.
First segment of an asset's taxonomy path — its broadest bucket (e.g. Investment Management, External Data). One of ten in the asset-manager-shaped seed.
Method: Derived as `subpath(domain_path, 0, 1)` in the search export view; emitted as the `domain_l1` Qdrant facet field.
Second segment of an asset's taxonomy path — a refinement of the top-level bucket (e.g. Equity Research under Investment Research).
Method: Derived as `subpath(domain_path, 1, 1)` in the search export view; populates the cascading subfilter on the catalog list.
Edit-distance error rate of the transcript against a gold human reference, normalized by reference length. Lower is better; human-typical is ≈ 4%, broadcast ASR ≈ 10–15%.
Method: Reference-required: needs a gold-standard transcript. Mocked here.
Combined miss / false-alarm / speaker-confusion error rate from the speaker-attribution pass. Driven by overlap, channel conditions, and speaker count.
Method: Reference-required: needs gold speaker labels. Mocked here.
Number of distinct speakers present in the transcript.
Method: JSON parse; counted directly. Computed.
Number of speaker turns / sentences in the transcript.
Method: JSON parse. Computed.
Fraction of tokens that are filled-pauses or repairs (um, uh, er, like, you-know, false starts). High values flag rough audio or overly raw transcription.
Method: Token-level regex against a disfluency lexicon. Cheap, computed in the sniffer.
Number of detected personally-identifying tokens (names, SSNs, credit cards, emails) per 1,000 words. Drives downstream redaction policy.
Method: Regex / heuristic detection (`scrubadub`-style). Cheap.
Count of tokens matching the configured profanity wordlist.
Method: Wordlist scan over the body. Cheap.
Netflix Video Multimethod Assessment Fusion — a perceptual quality score (0–100) trained against subjective ratings. 90+ is transparent quality; 60 is noticeably degraded.
Method: Reference-required: VMAF compares the encoded asset to an uncompressed source. Without the source, no value is calculable.
How consistent the encoded bitrate is over time. Wide swings can starve clients on poor connections or saturate CDNs.
Method: Per-second bitrate from container parsing, then σ/μ. Cheap from header + index.
Fraction of presentation timestamps that are missing or out of order in the delivered stream — a sign of capture or encode hiccups.
Method: Walk the container packet stream, look for non-monotonic PTS / DTS. Doable header-side; we run it as part of the external decode job.
Drift between the audio and video tracks at known sync events (claps, click-tracks). Tolerable threshold is ≈ 40 ms; > 100 ms is perceptible.
Method: Reference-required: needs a known sync event on both tracks. Mocked in the catalog today.
Bits per channel of the encoded frame (8-bit, 10-bit, 12-bit). 8-bit is BT.709 SDR; 10-bit unlocks BT.2020 / HDR.
Method: Codec-config read. Cheap header path.
Whether the container ships a caption / subtitle track (WebVTT, SRT, CEA-608). Required for accessibility distribution.
Method: Track-list inspection.