PDF/UA tagging
htmlbag emits PDF logical-structure elements when the underlying
document is set to document.FormatPDFUA (PDF/UA-1, ISO 14289-1, on
PDF 1.7) or document.FormatPDFUA2 (PDF/UA-2, ISO 14289-2, on PDF
2.0). Tagging is automatic — the tree of structure elements is
built as content is laid out, and each VList carries a
back-reference to the structure node it should attach to.
Activation
Set the format on the frontend.Document before constructing the
CSSBuilder:
// PDF/UA-1 (ISO 14289-1)
fd.Doc.Format = document.FormatPDFUA
// or PDF/UA-2 (ISO 14289-2, PDF 2.0)
fd.Doc.Format = document.FormatPDFUA2
cb, _ := htmlbag.New(fd, css)
After New(), cb.enableTagging is true, cb.structureRoot
points to a root structure element (created on fd.Doc if not
already present), and cb.structureCurrent tracks the current
parent during the walk.
For UA-2, New() also declares the PDF 2.0 Standard Structure
Namespace and an HTML5 namespace with a complete RoleMapNS on
fd.Doc, so every emitted structure element is properly
namespace-qualified (required by ISO 14289-2 §8.2.4).
Element-to-role mapping
Each HTML tag maps to a canonical PDF Standard Structure role; the final emission depends on the active format.
| HTML | PDF SSN (UA-1) | HTML5 NS (UA-2) |
|---|---|---|
| <h1> … <h6> | H1 … H6 | h1 … h6 |
| <p>, <pre> | P | p |
| <div> | Div | div |
| <span> | Span | span |
| <a> | Link | a |
| <img> | Figure | img |
| <figure> | Figure | figure |
| <table> | Table | table |
| <thead>/<tbody>/<tfoot> | THead/TBody/TFoot | thead/tbody/tfoot |
| <tr>, <th>, <td> | TR, TH, TD | tr, th, td |
| <ul>/<ol> | L | ul/ol |
| <li> | LI | li |
| <blockquote> | BlockQuote | blockquote |
| <code> | Code | code |
| <section> | Sect | section |
| <article> | Art | article |
Roles that have no HTML5 equivalent (Document, Note, LBody,
Lbl, L, Part) stay in the PDF 2.0 Standard Structure
Namespace under UA-2.
Internally, canonicalRoleForTag resolves the HTML tag to the
canonical PDF SSN name, then newSE(canonical, format) builds the
StructureElement with the right Role and NS. The two helpers
live in htmlbag/tagging.go.
Footnotes attach as Note structure elements.
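The two-step resolution described above can be sketched as follows. This is a simplified model, not the code in htmlbag/tagging.go: the Format type, the StructureElement fields, the namespace labels "pdf2" and "html5", the function signatures, and the reduced tag subset are all assumptions made for illustration.

```go
package main

import "fmt"

// Format is a hypothetical stand-in for the document format constants.
type Format int

const (
	FormatPDFUA  Format = iota // PDF/UA-1
	FormatPDFUA2               // PDF/UA-2
)

// StructureElement is a simplified, assumed model of a structure node.
type StructureElement struct {
	Role string // structure type name
	NS   string // namespace label: "" (UA-1), "pdf2" or "html5" (UA-2)
}

// tagToCanonical is an illustrative subset of the mapping table.
var tagToCanonical = map[string]string{
	"h1": "H1", "p": "P", "pre": "P", "a": "Link", "table": "Table",
}

// canonicalToHTML5 lists canonical roles that have an HTML5 counterpart.
// Note is deliberately absent, so it stays in the PDF 2.0 namespace.
var canonicalToHTML5 = map[string]string{
	"H1": "h1", "P": "p", "Link": "a", "Table": "table",
}

// canonicalRoleForTag resolves an HTML tag to its canonical PDF role.
func canonicalRoleForTag(tag string) (string, bool) {
	role, ok := tagToCanonical[tag]
	return role, ok
}

// newSE picks the emitted role and namespace for the active format.
func newSE(canonical string, f Format) StructureElement {
	if f != FormatPDFUA2 {
		return StructureElement{Role: canonical}
	}
	if html, ok := canonicalToHTML5[canonical]; ok {
		return StructureElement{Role: html, NS: "html5"}
	}
	return StructureElement{Role: canonical, NS: "pdf2"}
}

func main() {
	role, _ := canonicalRoleForTag("pre")
	fmt.Printf("%+v\n", newSE(role, FormatPDFUA))
	fmt.Printf("%+v\n", newSE(role, FormatPDFUA2))
	fmt.Printf("%+v\n", newSE("Note", FormatPDFUA2))
}
```

Keeping the tag lookup separate from the format decision means the HTML walker only ever deals with canonical names, and the UA-1/UA-2 difference is confined to one function.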
Repeated table headers
When a table breaks across pages, the repeated header rows on
continuation pages are tagged as artifacts (not as TH again) to
avoid duplication in the structure tree.
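To illustrate the difference at the content-stream level, here is a minimal sketch that wraps a header cell either in a tagged marked-content sequence or in an artifact sequence. The function name markHeaderCell and its parameters are hypothetical, and the exact operators htmlbag writes may differ (for example, it might attach a property list to the artifact); only the BDC/BMC/EMC operators themselves come from the PDF specification.

```go
package main

import "fmt"

// markHeaderCell wraps the drawing operators of one header cell.
// On the first page the cell is tagged as TH with a marked-content ID;
// on continuation pages the repeated cell is marked as an artifact,
// so it never enters the structure tree.
func markHeaderCell(content string, mcid int, repeated bool) string {
	if repeated {
		return "/Artifact BMC\n" + content + "\nEMC"
	}
	return fmt.Sprintf("/TH <</MCID %d>> BDC\n%s\nEMC", mcid, content)
}

func main() {
	cell := "BT (Name) Tj ET"
	fmt.Println(markHeaderCell(cell, 3, false)) // first occurrence
	fmt.Println(markHeaderCell(cell, 0, true))  // repeated on a later page
}
```

The MCID argument is ignored for repeated cells, since artifacts are not referenced from any structure element.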
Form XObjects (imported PDFs)
An <img src="…pdf"> (an imported PDF page) is wrapped as a Form
XObject with /StructParent and an OBJR entry in the parent
Figure’s /K array (PDF 1.7 §14.7.4.4 + PDF/UA-1 §7.1 Note 1).
This stops Acrobat’s tag inspector from exploding the imported
content stream’s path operators into a “Pfad / Path” list under
the Figure tag.
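The relationship between the three objects involved can be sketched as below. The helper names and object numbers are made up for illustration, and the dictionaries are abbreviated (a real Form XObject also needs entries such as /BBox); the sketch only shows how the OBJR dictionary in the Figure's /K array points at the XObject, while /StructParent on the XObject provides the way back.

```go
package main

import (
	"fmt"
	"strings"
)

// objr renders an object reference dictionary that points at a whole
// PDF object (here: the Form XObject) on a given page.
func objr(xobjNum, pageNum int) string {
	return fmt.Sprintf("<< /Type /OBJR /Obj %d 0 R /Pg %d 0 R >>", xobjNum, pageNum)
}

// figureSE renders an abbreviated Figure structure element whose /K
// array holds the given kids.
func figureSE(parentNum int, kids ...string) string {
	return fmt.Sprintf("<< /Type /StructElem /S /Figure /P %d 0 R /K [ %s ] >>",
		parentNum, strings.Join(kids, " "))
}

// formXObjectDict renders an abbreviated Form XObject dictionary.
// structParent is the key into the structure tree's parent tree.
func formXObjectDict(structParent int) string {
	return fmt.Sprintf("<< /Type /XObject /Subtype /Form /StructParent %d >>", structParent)
}

func main() {
	// Hypothetical object numbers: XObject 20, page 5, parent element 10.
	fmt.Println(formXObjectDict(4))
	fmt.Println(figureSE(10, objr(20, 5)))
}
```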
Outline destinations under UA-2
PDF/UA-2 §8.8 requires intra-document destinations to be structure
destinations, not page destinations. When the document is
FormatPDFUA2, glu/markdown’s outline generator emits
/Dest [ <SE-objref> /Fit ] for each heading instead of
[ <page-objref> /Fit ]. The link from each HeadingEntry to its
backing StructureElement is set in vlistbuilder.go at the time
the heading SE is constructed.
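As a rough illustration of the switch described above, the following sketch picks the destination target by format. The Format type, the HeadingEntry fields, and outlineDest are hypothetical stand-ins; the actual outline generator in glu/markdown may be organized differently.

```go
package main

import "fmt"

// Format is a hypothetical stand-in for the document format constants.
type Format int

const (
	FormatPDFUA  Format = iota // PDF/UA-1
	FormatPDFUA2               // PDF/UA-2
)

// HeadingEntry is an assumed, simplified outline entry. SEObj is the
// object number of the heading's structure element, recorded when the
// element is created; PageObj is the object number of the page.
type HeadingEntry struct {
	Title   string
	PageObj int
	SEObj   int
}

// outlineDest returns the /Dest entry for one outline item: a structure
// destination under UA-2, a page destination otherwise.
func outlineDest(h HeadingEntry, f Format) string {
	target := h.PageObj
	if f == FormatPDFUA2 {
		target = h.SEObj
	}
	return fmt.Sprintf("/Dest [ %d 0 R /Fit ]", target)
}

func main() {
	h := HeadingEntry{Title: "Introduction", PageObj: 7, SEObj: 42}
	fmt.Println(outlineDest(h, FormatPDFUA))
	fmt.Println(outlineDest(h, FormatPDFUA2))
}
```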
Custom roles
The mapping in canonicalToHTML5 (htmlbag/tagging.go) is the
single source of truth for the PDF SSN ↔ HTML5 correspondence. To
add a custom mapping, extend that map; html5RoleMap() builds the
UA-2 RoleMapNS from it automatically.
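How a role map can be derived from such a table is sketched below. The shape of the map (canonical name to HTML5 name), the function buildRoleMapNS, and the object number of the PDF 2.0 namespace dictionary are assumptions for illustration; the real canonicalToHTML5 and html5RoleMap() may look different. The sketch inverts the map, so each HTML5 name points back to its canonical role in the PDF 2.0 namespace.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildRoleMapNS renders a RoleMapNS dictionary for the HTML5 namespace.
// Each HTML5 name maps to [ canonical-role namespace-ref ], where ssnObj
// is the (hypothetical) object number of the PDF 2.0 namespace dictionary.
func buildRoleMapNS(canonicalToHTML5 map[string]string, ssnObj int) string {
	inverse := make(map[string]string, len(canonicalToHTML5))
	names := make([]string, 0, len(canonicalToHTML5))
	for canonical, html := range canonicalToHTML5 {
		inverse[html] = canonical
		names = append(names, html)
	}
	sort.Strings(names) // stable output regardless of map iteration order

	var b strings.Builder
	b.WriteString("<<")
	for _, name := range names {
		fmt.Fprintf(&b, " /%s [ /%s %d 0 R ]", name, inverse[name], ssnObj)
	}
	b.WriteString(" >>")
	return b.String()
}

func main() {
	mapping := map[string]string{"H1": "h1", "P": "p"}
	fmt.Println(buildRoleMapNS(mapping, 7))

	// Adding a custom pair only requires a new map entry.
	mapping["Aside"] = "aside"
	fmt.Println(buildRoleMapNS(mapping, 7))
}
```

Because the role map is generated from the table, a new entry cannot be added to one direction of the correspondence and forgotten in the other.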
Verification
Both profiles are validated against veraPDF as part of the showcase examples:
verapdf --flavour ua1 boxesandglue-examples/glu/xslfo/10-pdfua/result.pdf
verapdf --flavour ua2 boxesandglue-examples/glu/xslfo/11-pdfua2/result.pdf
Both report isCompliant=true.