---
title: 
date: 0001-01-01
canonical: https://mariothomas.com/yaml-frontmatter-audit/
---

# Frontmatter YAML Audit — `content/`

*Generated 2026-06-21. Companion file: `yaml-frontmatter-audit.xlsx`.*

## Purpose

A complete inventory of the YAML frontmatter used across every section and leaf index file in `content/`, produced as groundwork for the ontology project. The goal is to establish, empirically, what schema actually exists today — which keys are in use, how consistently, with what value types, and where the drift and data-quality issues sit — before any modelling decisions are made.

## Method

Every `index.md` and `_index.md` under `content/` was located, and its frontmatter block (the YAML between the opening and closing `---` fences) was extracted and parsed with a real YAML parser rather than regex, so that nesting, comments, tilde-nulls and quoted strings are interpreted correctly. Each document was then flattened to dotted key paths (e.g. `params.author.name`, `params.content.thumb.src`) so that nested structures are directly comparable across files. For every key path the audit records its fill rate, the value type(s) observed, and, for low-cardinality fields, the full set of distinct values. Duplicate keys were detected authoritatively by re-parsing under strict rules, and line-ending and byte-order-mark conventions were captured from the raw bytes.
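
The flattening and per-key profiling steps can be sketched roughly as follows. This is an illustration only, not the audit's actual script: the function names `flatten` and `profile` and the sample documents are hypothetical, and here a key counts towards the fill rate whenever it is present, even if its value is null, which may differ from the definition used in the spreadsheet.

```python
from collections import defaultdict


def flatten(node, prefix=""):
    """Flatten nested mappings into {dotted.key.path: leaf_value}."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            # Scalars, lists, nulls and empty mappings are treated as leaves.
            flat[path] = value
    return flat


def profile(documents):
    """Return {key_path: {"fill_rate": float, "types": set of type names}}."""
    seen = defaultdict(lambda: {"count": 0, "types": set()})
    for doc in documents:
        for path, value in flatten(doc).items():
            seen[path]["count"] += 1
            seen[path]["types"].add(type(value).__name__)
    total = len(documents)
    return {
        path: {"fill_rate": info["count"] / total, "types": info["types"]}
        for path, info in seen.items()
    }


# Two hypothetical, already-parsed frontmatter documents.
docs = [
    {"title": "A", "params": {"author": {"name": "X"}}},
    {"title": "B", "draft": False},
]
report = profile(docs)
```

A key path that reports more than one type name in `types` is a candidate type conflict of the kind listed in the findings.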

## Headline numbers

| Measure | Value |
|---|---|
| Index files scanned | **267** |
| — leaf pages (`index.md`) | 178 |
| — section pages (`_index.md`) | 89 |
| Parsed successfully | 267 (100%) |
| Distinct flattened key paths | **71** |
| Maximum nesting depth | 3 |
| Keys with hard type conflicts | 2 |
| Files with duplicate keys | 3 |
| Files with non-standard conventions | 4 |

All 267 files parse. Three of them parse only when duplicate keys are tolerated; these are genuine defects, listed under finding 6.

## The schema, as it actually exists

The frontmatter falls into clear tiers of usage rather than a single uniform schema.

A **near-universal core** is present on almost everything: `title` and `description` (99.6%), `keywords` (98%), and `draft` (97%). These are the only fields you can rely on being everywhere.

A **common publishing layer** sits just below at ~93%: `linkTitle` and the `params.author` block (`params.author.name`, `params.author.email`). The author block is notable for being machine-uniform: it carries the same two sub-keys every time it appears.

A **content-model layer** covers roughly two-thirds of files (~65%): the `params.content.*` family (`type`, `category`, `file`, `mime`, `url`, `thumb`, etc.), plus `params.tags`, `params.categories`, `params.links` and the date fields. This is where most of the structural richness — and most of the inconsistency — lives. It is strongly skewed toward leaf pages: e.g. `date` appears on 168 of 178 leaf pages but only 23 of 89 sections.

A **sparse, bespoke tail** of rarely used keys appears on a handful of pages or a single page: `params.audio.*` (12 podcast-style pages), `slug` (10), and a set of single-page presentation blocks (`hero.*` on `about/_index.md`, `videoSpotlight.*` and `showVideo` on the site root `_index.md`, and `Layout` on `tags/_index.md`). For an ontology these are special-case presentation fields, not part of the shared model.

## Priority findings for the ontology

**1. Date type inconsistency (hard conflict).** `date` is stored as a timezone-aware datetime in 184 files but as a bare date in 4; `lastMod` is the reverse — a bare date in 160 and a datetime in 25. `publishDate` is a datetime in 178 and null in 2. Mixed temporal granularity is the single most important thing to normalise before modelling — pick one representation (ISO 8601 with timezone is the obvious choice) and apply it corpus-wide.
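
One way the normalisation might look, as a minimal sketch: the helper name `to_aware_datetime` is hypothetical, and treating a bare date as midnight UTC is an assumption made for illustration, not a decision recorded in the audit.

```python
from datetime import date, datetime, time, timezone


def to_aware_datetime(value, default_tz=timezone.utc):
    """Coerce a bare date or naive datetime to a timezone-aware datetime."""
    if value is None:
        return None
    if isinstance(value, datetime):
        # datetime is a subclass of date, so it must be checked first.
        return value if value.tzinfo else value.replace(tzinfo=default_tz)
    if isinstance(value, date):
        # Assumption: a bare date means midnight in default_tz.
        return datetime.combine(value, time.min, tzinfo=default_tz)
    raise TypeError(f"unsupported temporal value: {value!r}")


# A bare date becomes a full ISO 8601 timestamp with an explicit offset.
normalised = to_aware_datetime(date(2024, 3, 5)).isoformat()
```

Values that are already timezone-aware pass through unchanged, so the helper could be run over `date` and `lastMod` alike without disturbing files that are already in the target form.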

**2. Key-casing collision.** A key named `Layout` (capital L, value `cloud`) exists on `tags/_index.md` alongside the normal lowercase `layout` used by 189 files. This is almost certainly a typo that Hugo silently ignores, but in an ontology it would register as two distinct properties. Worth fixing at source.
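
Collisions of this kind can be found mechanically by grouping key paths on their lowercased form. The sketch below assumes the key paths have already been collected into a list; `casing_collisions` is a hypothetical helper name.

```python
from collections import defaultdict


def casing_collisions(key_paths):
    """Return key paths that differ only by letter case, grouped together."""
    groups = defaultdict(set)
    for path in key_paths:
        groups[path.lower()].add(path)
    return {
        folded: sorted(variants)
        for folded, variants in groups.items()
        if len(variants) > 1
    }


collisions = casing_collisions(["title", "layout", "Layout", "params.tags"])
```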

**3. Pervasive tilde-null placeholder convention.** Around thirty keys are *sometimes* a real value and *sometimes* explicit null (`~`) — for example `params.content.category`, `params.content.file`, `params.tags`, `layout`, `publishDate`. This is a deliberate "scaffold every key, leave it null if unused" pattern. It is internally consistent but the ontology needs an explicit rule: does "not set" mean omit the property, or assert an explicit null? Decide once.

**4. Always-null keys (dead scaffolding).** Several keys *never* carry a value in any file: `expiryDate` (183 files, always null), `url` (85, always null), `params.url` (26, always null), `params.resources` (174, always null), and `params.content.thumb.caption` (96, always null). These are pure placeholders — candidates for removal, or for an explicit "reserved / not modelled" note so they don't pollute the ontology with empty properties.
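
The distinction between findings 3 and 4 comes down to counting, per key path, how often the key is present and how often its value is null. A minimal sketch, assuming each document has already been flattened to a dict of dotted key paths (`null_profile` and the sample data are hypothetical):

```python
from collections import Counter


def null_profile(flat_docs):
    """Classify each key path as always-null, sometimes-null or never-null."""
    present = Counter()
    nulls = Counter()
    for flat in flat_docs:
        for path, value in flat.items():
            present[path] += 1
            if value is None:
                nulls[path] += 1
    result = {}
    for path, count in present.items():
        if nulls[path] == count:
            result[path] = "always-null"
        elif nulls[path] > 0:
            result[path] = "sometimes-null"
        else:
            result[path] = "never-null"
    return result


# Hypothetical flattened documents; a YAML `~` parses to None.
sample = [
    {"title": "A", "expiryDate": None, "layout": "briefings/single"},
    {"title": "B", "expiryDate": None, "layout": None},
]
classes = null_profile(sample)
```

Keys classified as `always-null` correspond to the dead scaffolding above; `sometimes-null` keys are the ones that need an explicit rule on omission versus asserted null.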

**5. De-facto controlled vocabularies.** Several fields are effectively enumerations and should be modelled as such:
- `type`: `blog` (139), `page` (26), `section` (12), `briefings` (11)
- `params.content.type`: `post` (110), `page` (21), `link` (13), `section` (12), `audio` (9), `youtube` (5), `photo` (2), `tweet` (2)
- `params.content.category`: `Board`, `AI`, `Cloud`, `Data`, `Emerging` (each 1–2 uses; the field is null on 168 files)
- `params.content.thumb.type`: `photo` (86), `graphic` (10)
- `layout`: nine values, one of which (`briefings/single`) embeds a path — a different convention from the rest.

The `type` vs `params.content.type` overlap (both carry `page`/`section`) is worth resolving — they appear to encode related-but-different notions and an ontology will need them disambiguated.
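
Enumeration candidates like these can be surfaced by counting distinct scalar values per key path and keeping the low-cardinality ones. The sketch below is illustrative only: `vocabularies` is a hypothetical helper, and the `max_distinct` cut-off is an assumption rather than the threshold used for the audit.

```python
from collections import Counter, defaultdict


def vocabularies(flat_docs, max_distinct=10):
    """Return {key_path: {value: count}} for low-cardinality scalar fields."""
    values = defaultdict(Counter)
    for flat in flat_docs:
        for path, value in flat.items():
            # Only scalar values are counted; nulls and lists are skipped.
            if isinstance(value, (str, bool, int)):
                values[path][value] += 1
    return {
        path: dict(counts)
        for path, counts in values.items()
        if len(counts) <= max_distinct
    }


# Hypothetical flattened documents.
sample = [
    {"title": "A", "type": "blog"},
    {"title": "B", "type": "blog"},
    {"title": "C", "type": "page"},
]
vocab = vocabularies(sample, max_distinct=2)
```

With the tiny sample above, `type` qualifies as a vocabulary while `title` does not, because every title is distinct.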

**6. Data-quality defects to fix at source.**
- Duplicate keys: `blog/2012-ecommerce-125m/index.md` (`links_title`), `blog/enterprise-data-monetisation/index.md` (`seoTitle`), `briefings/_index.md` (`keywords`). In each case the second value silently wins.
- `about/career/index.md` uses CRLF line endings while the rest of the corpus is LF — normalise for clean diffs and tooling.
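
Line-ending and byte-order-mark conventions have to be read from the raw bytes, since text-mode file handling can normalise them away. A minimal sketch of such a check (`byte_conventions` is a hypothetical name, and only the UTF-8 BOM is considered):

```python
import codecs


def byte_conventions(raw: bytes):
    """Report the BOM and line-ending convention of a file's raw bytes."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    crlf = raw.count(b"\r\n")
    # Every CRLF also contains an LF, so subtract to get bare LFs.
    lf = raw.count(b"\n") - crlf
    if crlf and lf:
        ending = "mixed"
    elif crlf:
        ending = "CRLF"
    elif lf:
        ending = "LF"
    else:
        ending = "none"
    return {"bom": has_bom, "line_ending": ending}


windows_style = byte_conventions(b"---\r\ntitle: A\r\n---\r\n")
```

In practice the bytes would come from something like `Path(...).read_bytes()`, so that no newline translation happens before the check.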

## Recommended next steps

Before progressing the ontology, normalise the two conflicting date fields (`date` and `lastMod`) to a single type, fix the three duplicate-key files and the `Layout` casing typo, and make an explicit decision on the tilde-null convention and the always-null keys. With those resolved, the `type` / `params.content.type` / `params.content.category` vocabularies become the natural backbone for the content taxonomy, and the stable `params.author` and `params.content.thumb` blocks are clean candidates for first-class entities.

The full key-by-key data — coverage matrix, conflict list, vocabularies, and the raw machine-readable table — is in `yaml-frontmatter-audit.xlsx`.
