Frontmatter YAML Audit — content/
Generated 2026-06-21. Companion file: yaml-frontmatter-audit.xlsx.
Purpose
A complete inventory of the YAML frontmatter used across every section and leaf index file in content/, produced as groundwork for the ontology project. The goal is to establish, empirically, what schema actually exists today — which keys are in use, how consistently, with what value types, and where the drift and data-quality issues sit — before any modelling decisions are made.
Method
Every index.md and _index.md under content/ was located, its frontmatter block (the YAML between the opening and closing --- fences) extracted and parsed with a real YAML parser rather than regex, so that nesting, comments, tilde-nulls and quoted strings are interpreted correctly. Each document was then flattened to dotted key paths (e.g. params.author.name, params.content.thumb.src) so that nested structures are directly comparable across files. For every key path the audit records its fill rate, the value type(s) observed, and — for low-cardinality fields — the full set of distinct values. Duplicate keys were detected authoritatively by re-parsing under strict rules, and line-ending and byte-order-mark conventions were captured from the raw bytes.
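The flattening and per-key profiling steps described above can be sketched as follows. This is a minimal illustration rather than the audit's actual script: the function names `flatten` and `profile` are hypothetical, the input is assumed to be a list of already-parsed frontmatter mappings, and a key counts as present even when its value is null.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterator, List, Tuple


def flatten(node: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) pairs for every leaf of a parsed document."""
    if isinstance(node, dict):
        if not node and prefix:
            # Keep empty nested mappings visible as leaves.
            yield prefix, node
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten(value, path)
    else:
        # Scalars, lists and None are treated as leaves.
        yield prefix, node


def profile(documents: List[dict]) -> Dict[str, dict]:
    """Compute fill rate and observed value types per flattened key path."""
    present: Counter = Counter()
    types: Dict[str, Counter] = defaultdict(Counter)
    for doc in documents:
        for path, value in flatten(doc):
            present[path] += 1  # assumption: an explicit null still counts as present
            types[path][type(value).__name__] += 1
    total = len(documents)
    return {
        path: {"fill_rate": count / total, "types": dict(types[path])}
        for path, count in present.items()
    }
```

A key path whose `types` entry holds more than one non-null type name is a candidate type conflict.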
Headline numbers
| Measure | Value |
|---|---|
| Index files scanned | 267 |
| of which leaf pages (index.md) | 178 |
| of which section pages (_index.md) | 89 |
| Parsed successfully | 267 (100%) |
| Distinct flattened key paths | 71 |
| Maximum nesting depth | 3 |
| Keys with hard type conflicts | 2 |
| Files with duplicate keys | 3 |
| Files with non-standard conventions | 4 |
All 267 files parse. Three only parse once duplicate keys are tolerated — they are genuine defects, listed below.
The schema, as it actually exists
The frontmatter falls into clear tiers of usage rather than a single uniform schema.
A near-universal core is present on almost everything: title and description (99.6%), keywords (98%), and draft (97%). These are the only fields you can rely on being everywhere.
A common publishing layer sits just below at ~93%: linkTitle, and the params.author block (params.author.name, params.author.email). The author block is notable for being machine-uniform — the same two sub-keys every time it appears.
A content-model layer covers roughly two-thirds of files (~65%): the params.content.* family (type, category, file, mime, url, thumb, etc.), plus params.tags, params.categories, params.links and the date fields. This is where most of the structural richness — and most of the inconsistency — lives. It is strongly skewed toward leaf pages: e.g. date appears on 168 of 178 leaf pages but only 23 of 89 sections.
A sparse / bespoke tail of one-off keys appears on a handful of pages or a single page: params.audio.* (12 podcast-style pages), slug (10), and a set of single-page presentation blocks — hero.* on about/_index.md, videoSpotlight.* and showVideo on the site root _index.md, and Layout on tags/_index.md. For an ontology these are special-case presentation fields, not part of the shared model.
Priority findings for the ontology
1. Date type inconsistency (hard conflict). date is stored as a timezone-aware datetime in 184 files but as a bare date in 4; lastMod is the reverse — a bare date in 160 and a datetime in 25. publishDate is a datetime in 178 and null in 2. Mixed temporal granularity is the single most important thing to normalise before modelling — pick one representation (ISO 8601 with timezone is the obvious choice) and apply it corpus-wide.
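One way to carry out the normalisation suggested in finding 1 is sketched below. The helper name `to_aware_datetime` is hypothetical, and the choice to place bare dates at midnight UTC is an assumption made for illustration; the site's real timezone may be a better default.

```python
from datetime import date, datetime, time, timezone


def to_aware_datetime(value, default_tz=timezone.utc):
    """Coerce a parsed date, datetime or None to a timezone-aware datetime."""
    if value is None:
        return None
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(value, datetime):
        if value.tzinfo is not None:
            return value  # already timezone-aware, leave untouched
        return value.replace(tzinfo=default_tz)
    if isinstance(value, date):
        # Assumption: a bare date means midnight in the default zone.
        return datetime.combine(value, time.min, tzinfo=default_tz)
    raise TypeError(f"unsupported temporal value: {value!r}")
```

Calling `.isoformat()` on the result gives an ISO 8601 string with an explicit offset, which can be written back to date, lastMod and publishDate alike.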
2. Key-casing collision. A key named Layout (capital L, value cloud) exists on tags/_index.md alongside the normal lowercase layout used by 189 files. This is almost certainly a typo that Hugo silently ignores, but in an ontology it would register as two distinct properties. Worth fixing at source.
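Collisions of the Layout / layout kind can be caught mechanically by grouping key paths that differ only in letter case. The following is a small sketch; `casing_collisions` is a hypothetical helper that takes any iterable of flattened key paths.

```python
from collections import defaultdict


def casing_collisions(key_paths):
    """Return groups of key paths that differ only by letter case."""
    groups = defaultdict(set)
    for path in key_paths:
        groups[path.casefold()].add(path)
    # Only groups with more than one spelling are collisions.
    return {
        folded: sorted(variants)
        for folded, variants in groups.items()
        if len(variants) > 1
    }
```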
3. Pervasive tilde-null placeholder convention. Around thirty keys are sometimes a real value and sometimes explicit null (~) — for example params.content.category, params.content.file, params.tags, layout, publishDate. This is a deliberate “scaffold every key, leave it null if unused” pattern. It is internally consistent but the ontology needs an explicit rule: does “not set” mean omit the property, or assert an explicit null? Decide once.
4. Always-null keys (dead scaffolding). Several keys never carry a value in any file: expiryDate (183 files, always null), url (85, always null), params.url (26, always null), params.resources (174, always null), and params.content.thumb.caption (96, always null). These are pure placeholders — candidates for removal, or for an explicit “reserved / not modelled” note so they don’t pollute the ontology with empty properties.
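The distinction between findings 3 and 4 (sometimes null versus always null) can be computed directly from flattened documents. The sketch below assumes each document has already been reduced to a flat mapping of dotted key path to value; `null_profile` and its three labels are hypothetical names.

```python
from collections import Counter


def null_profile(flat_docs):
    """Label each key path 'always_null', 'sometimes_null' or 'never_null'."""
    present, nulls = Counter(), Counter()
    for doc in flat_docs:
        for path, value in doc.items():
            present[path] += 1
            if value is None:
                nulls[path] += 1
    labels = {}
    for path, count in present.items():
        if nulls[path] == count:
            labels[path] = "always_null"      # dead scaffolding candidate
        elif nulls[path]:
            labels[path] = "sometimes_null"   # tilde-null placeholder pattern
        else:
            labels[path] = "never_null"
    return labels
```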
5. De-facto controlled vocabularies. Several fields are effectively enumerations and should be modelled as such:
- type: blog (139), page (26), section (12), briefings (11)
- params.content.type: post (110), page (21), link (13), section (12), audio (9), youtube (5), photo (2), tweet (2)
- params.content.category: Board, AI, Cloud, Data, Emerging (each 1–2 uses; the field is null on 168 files)
- params.content.thumb.type: photo (86), graphic (10)
- layout: nine values, one of which (briefings/single) embeds a path, a different convention from the rest.
The type vs params.content.type overlap (both carry page/section) is worth resolving — they appear to encode related-but-different notions and an ontology will need them disambiguated.
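If these fields are modelled as enumerations, a simple check can flag values that fall outside the observed sets. The sketch below uses the values listed above; the `VOCABULARIES` table and `vocabulary_violations` helper are hypothetical, layout is omitted because its nine values are not enumerated here, and treating null as acceptable is an assumption that depends on the tilde-null decision in finding 3.

```python
# Observed values per key path, taken from the audit's vocabulary list.
VOCABULARIES = {
    "type": ("blog", "page", "section", "briefings"),
    "params.content.type": (
        "post", "page", "link", "section", "audio", "youtube", "photo", "tweet",
    ),
    "params.content.category": ("Board", "AI", "Cloud", "Data", "Emerging"),
    "params.content.thumb.type": ("photo", "graphic"),
}


def vocabulary_violations(flat_doc):
    """Return {path: value} for values outside the observed vocabulary."""
    return {
        path: value
        for path, value in flat_doc.items()
        if path in VOCABULARIES
        and value is not None  # assumption: explicit null is allowed
        and value not in VOCABULARIES[path]
    }
```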
6. Data-quality defects to fix at source.
- Duplicate keys: blog/2012-ecommerce-125m/index.md (links_title), blog/enterprise-data-monetisation/index.md (seoTitle), briefings/_index.md (keywords). In each case the second value silently wins.
- about/career/index.md uses CRLF line endings while the rest of the corpus is LF; normalise for clean diffs and tooling.
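Line-ending and byte-order-mark conventions of the kind flagged above can be read straight from a file's raw bytes. The following is a minimal sketch: `byte_conventions` is a hypothetical helper, it only looks for a UTF-8 BOM, and it ignores lone CR characters.

```python
import codecs


def byte_conventions(raw: bytes) -> dict:
    """Report UTF-8 BOM presence and line-ending style from raw bytes."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    crlf = raw.count(b"\r\n")
    lf = raw.count(b"\n") - crlf  # LF characters not preceded by CR
    if crlf and lf:
        style = "mixed"
    elif crlf:
        style = "crlf"
    elif lf:
        style = "lf"
    else:
        style = "none"
    return {"bom": has_bom, "line_endings": style}
```

For a file on disk, the input would typically come from `pathlib.Path(path).read_bytes()`, so that no newline translation happens before the check.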
Recommended next steps
Before progressing the ontology, normalise the two date fields to a single type, fix the three duplicate-key files and the Layout casing typo, and make an explicit decision on the tilde-null convention and the always-null keys. With those resolved, the type / params.content.type / params.content.category vocabularies become the natural backbone for the content taxonomy, and the stable params.author and params.content.thumb blocks are clean candidates for first-class entities.
The full key-by-key data — coverage matrix, conflict list, vocabularies, and the raw machine-readable table — is in yaml-frontmatter-audit.xlsx.




