PDF/UA tagging

htmlbag emits PDF logical-structure elements when the underlying document is set to document.FormatPDFUA (PDF/UA-1, ISO 14289-1, on PDF 1.7) or document.FormatPDFUA2 (PDF/UA-2, ISO 14289-2, on PDF 2.0). Tagging is automatic — the tree of structure elements is built as content is laid out, and each VList carries a back-reference to the structure node it should attach to.

Activation

Set the format on the frontend.Document before constructing the CSSBuilder:

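// PDF/UA-1 (ISO 14289-1)
fd.Doc.Format = document.FormatPDFUA

// or PDF/UA-2 (ISO 14289-2, PDF 2.0)
fd.Doc.Format = document.FormatPDFUA2

// The format must already be set here; New() reads it to decide on tagging.
cb, err := htmlbag.New(fd, css)
if err != nil {
	return err
}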

After New() returns, cb.enableTagging is true, cb.structureRoot points to the root structure element (created on fd.Doc if not already present), and cb.structureCurrent tracks the current parent element during the tree walk.

For UA-2, New() also declares the PDF 2.0 Standard Structure Namespace and an HTML5 namespace with a complete RoleMapNS on fd.Doc, so every emitted structure element is properly namespace-qualified (required by ISO 14289-2 §8.2.4).

Element-to-role mapping

Each HTML tag maps to a canonical PDF Standard Structure role; the final emission depends on the active format.

HTML                    | PDF SSN (UA-1)    | HTML5 NS (UA-2)
------------------------|-------------------|------------------
<h1>–<h6>               | H1–H6             | h1–h6
<p>, <pre>              | P                 | p
<div>                   | Div               | div
<span>                  | Span              | span
<a>                     | Link              | a
<img>                   | Figure            | img
<figure>                | Figure            | figure
<table>                 | Table             | table
<thead>/<tbody>/<tfoot> | THead/TBody/TFoot | thead/tbody/tfoot
<tr>, <th>, <td>        | TR, TH, TD        | tr, th, td
<ul>/<ol>               | L                 | ul/ol
<li>                    | LI                | li
<blockquote>            | BlockQuote        | blockquote
<code>                  | Code              | code
<section>               | Sect              | section
<article>               | Art               | article

Roles that have no HTML5 equivalent (Document, Note, LBody, Lbl, L, Part) stay in the PDF 2.0 Standard Structure Namespace under UA-2.

Internally, canonicalRoleForTag resolves the HTML tag to the canonical PDF SSN name, then newSE(canonical, format) builds the StructureElement with the right Role and NS. The two helpers live in htmlbag/tagging.go.
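The two-step resolution can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code in htmlbag/tagging.go: roleForTag and buildElem are hypothetical stand-ins for canonicalRoleForTag and newSE, structElem stands in for StructureElement, the namespace labels are placeholders, and only a few unambiguous rows of the table are included.

```go
package main

import "fmt"

// format is a hypothetical stand-in for the document format setting.
type format int

const (
	formatUA1 format = iota
	formatUA2
)

// Namespace labels used only in this sketch (not real namespace URIs).
const (
	nsNone  = ""
	nsPDF2  = "pdf2"
	nsHTML5 = "html5"
)

// structElem is a hypothetical, reduced model of a structure element.
type structElem struct {
	Role string // role name written to the structure tree
	NS   string // namespace label; empty when no namespace is used
}

// tagToCanonical holds a few rows of the HTML -> PDF SSN column.
var tagToCanonical = map[string]string{
	"p": "P", "pre": "P", "a": "Link", "blockquote": "BlockQuote",
	"section": "Sect", "article": "Art",
}

// canonicalToHTML5Sketch holds the matching PDF SSN -> HTML5 rows.
var canonicalToHTML5Sketch = map[string]string{
	"P": "p", "Link": "a", "BlockQuote": "blockquote",
	"Sect": "section", "Art": "article",
}

// roleForTag resolves an HTML tag to its canonical PDF SSN role.
func roleForTag(tag string) (string, bool) {
	role, ok := tagToCanonical[tag]
	return role, ok
}

// buildElem picks role and namespace for the active format.
// Roles without an HTML5 counterpart keep their canonical name.
func buildElem(canonical string, f format) structElem {
	if f != formatUA2 {
		return structElem{Role: canonical, NS: nsNone}
	}
	if h, ok := canonicalToHTML5Sketch[canonical]; ok {
		return structElem{Role: h, NS: nsHTML5}
	}
	return structElem{Role: canonical, NS: nsPDF2}
}

func main() {
	for _, tag := range []string{"section", "blockquote"} {
		role, ok := roleForTag(tag)
		if !ok {
			continue
		}
		fmt.Println(tag, buildElem(role, formatUA1), buildElem(role, formatUA2))
	}
	// A role with no HTML5 equivalent stays in the PDF 2.0 namespace.
	fmt.Println("footnote", buildElem("Note", formatUA2))
}
```

Keeping the canonical name as the intermediate value means the format-specific decision is made in one place, which is the property the paragraph above describes.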

Footnotes attach as Note structure elements.

Repeated table headers

When a table breaks across pages, the repeated header rows on continuation pages are tagged as artifacts (not as TH again) to avoid duplication in the structure tree.
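At the content-stream level, the difference is which marked-content sequence wraps the row. The sketch below shows the two forms using the standard BDC/BMC/EMC operators; markHeaderRow is a hypothetical helper, the per-row granularity is a simplification, and nothing here is taken from htmlbag's actual emission code.

```go
package main

import "fmt"

// markHeaderRow wraps the drawing operators of a header row in a
// marked-content sequence. On the first page the row is real content
// tied to the structure tree through an MCID; on continuation pages
// it is marked as an artifact and gets no structure element.
func markHeaderRow(content string, firstPage bool, mcid int) string {
	if firstPage {
		return fmt.Sprintf("/TH <</MCID %d>> BDC\n%s\nEMC", mcid, content)
	}
	return "/Artifact BMC\n" + content + "\nEMC"
}

func main() {
	row := "(Name) Tj"
	fmt.Println(markHeaderRow(row, true, 3))
	fmt.Println(markHeaderRow(row, false, 3))
}
```

Because the artifact form carries no MCID, assistive technology reads the header cells once, at the position where the table starts.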

Form XObjects (imported PDFs)

An <img src="…pdf"> (an imported PDF page) is wrapped as a Form XObject with a /StructParent entry, and an OBJR entry referencing it is added to the parent Figure’s /K array (PDF 1.7 §14.7.4.4 and PDF/UA-1 §7.1 Note 1). This stops Acrobat’s tag inspector from exploding the imported content stream’s path operators into a “Pfad / Path” list under the Figure tag.
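The objects involved look roughly as follows. This sketch only formats the PDF dictionaries as strings to show how they reference each other; the object numbers are made up, the helper names are hypothetical, and the Form XObject dictionary is abbreviated (no /BBox, /Resources, or stream data).

```go
package main

import "fmt"

// ref is an indirect object reference with generation number 0.
type ref int

func (r ref) String() string { return fmt.Sprintf("%d 0 R", int(r)) }

// objr renders an object reference dictionary that points from the
// structure tree to a whole PDF object (here: the Form XObject).
func objr(xobj, page ref) string {
	return fmt.Sprintf("<< /Type /OBJR /Obj %s /Pg %s >>", xobj, page)
}

// figureElem renders a Figure structure element whose only kid is
// the object reference to the imported page.
func figureElem(parent, xobj, page ref) string {
	return fmt.Sprintf("<< /Type /StructElem /S /Figure /P %s /K [ %s ] >>",
		parent, objr(xobj, page))
}

// formXObjectDict renders the (abbreviated) XObject dictionary.
// structParent is the key under which the XObject is found in the
// structure tree's parent tree.
func formXObjectDict(structParent int) string {
	return fmt.Sprintf("<< /Type /XObject /Subtype /Form /StructParent %d >>",
		structParent)
}

func main() {
	const (
		structParentSE ref = 8  // hypothetical parent structure element
		xobj           ref = 12 // hypothetical Form XObject
		page           ref = 5  // hypothetical page object
	)
	fmt.Println(formXObjectDict(4))
	fmt.Println(figureElem(structParentSE, xobj, page))
}
```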

Outline destinations under UA-2

PDF/UA-2 §8.8 requires intra-document destinations to be structure destinations, not page destinations. When the document is FormatPDFUA2, glu/markdown’s outline generator emits /Dest [ <SE-objref> /Fit ] for each heading instead of [ <page-objref> /Fit ]. The link from each HeadingEntry to its backing StructureElement is set in vlistbuilder.go at the time the heading SE is constructed.
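The switch between the two destination forms can be sketched as a single function. The heading type and destArray are hypothetical and are not the HeadingEntry type or the outline generator from the library; falling back to the page reference when no structure element is linked is an assumption of this sketch.

```go
package main

import "fmt"

// heading is a hypothetical, reduced outline entry.
type heading struct {
	PageObj int // object number of the page the heading is on
	SEObj   int // object number of the backing structure element, 0 if none
}

// destArray renders the destination array for an outline entry.
// Under UA-2 the first element refers to the structure element,
// otherwise to the page.
func destArray(h heading, ua2 bool) string {
	target := h.PageObj
	if ua2 && h.SEObj != 0 {
		target = h.SEObj
	}
	return fmt.Sprintf("[ %d 0 R /Fit ]", target)
}

func main() {
	h := heading{PageObj: 5, SEObj: 21}
	fmt.Println("UA-1:", destArray(h, false))
	fmt.Println("UA-2:", destArray(h, true))
}
```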

Custom roles

The mapping in canonicalToHTML5 (htmlbag/tagging.go) is the single source of truth for the PDF SSN ↔ HTML5 correspondence. To add a custom mapping, extend that map; html5RoleMap() builds the UA-2 RoleMapNS from it automatically.
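Deriving the role map from a single table can be sketched like this. The function below is a hypothetical model, not html5RoleMap() itself: it inverts a canonical-to-HTML5 map and renders a RoleMapNS dictionary in which each HTML5 name points to its PDF 2.0 role and that role's namespace object. The namespace reference "7 0 R" and the added Aside entry are illustrative assumptions, and the sketch assumes the mapping is one-to-one.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// roleMapNS renders a RoleMapNS dictionary for the HTML5 namespace.
// Each entry has the form /html5name [ /CanonicalRole <namespace ref> ].
// Keys are sorted so the output is deterministic.
func roleMapNS(canonicalToHTML5 map[string]string, pdf2NSRef string) string {
	inverse := make(map[string]string, len(canonicalToHTML5))
	names := make([]string, 0, len(canonicalToHTML5))
	for canonical, html := range canonicalToHTML5 {
		inverse[html] = canonical
		names = append(names, html)
	}
	sort.Strings(names)

	var b strings.Builder
	b.WriteString("<<")
	for _, html := range names {
		fmt.Fprintf(&b, " /%s [ /%s %s ]", html, inverse[html], pdf2NSRef)
	}
	b.WriteString(" >>")
	return b.String()
}

func main() {
	mapping := map[string]string{"Sect": "section", "Art": "article"}
	// Extending the table is the only step needed for a custom role.
	mapping["Aside"] = "aside"
	fmt.Println(roleMapNS(mapping, "7 0 R"))
}
```

With this shape, a new row in the table shows up in the role map without any second place to update.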

Verification

Both profiles are validated against veraPDF as part of the showcase examples:

verapdf --flavour ua1 boxesandglue-examples/glu/xslfo/10-pdfua/result.pdf
verapdf --flavour ua2 boxesandglue-examples/glu/xslfo/11-pdfua2/result.pdf

Both report isCompliant=true.