Building an HTML Decode Tool: Standards, Privacy, and Edge Cases
Building an HTML Decode Tool: Standards, Privacy, and Edge Cases
HTML entity decoding sounds trivial — convert & to &, done. But correctly handling all 2231+ named entities from the HTML Living Standard, three numeric reference formats, non-BMP Unicode characters, malformed sequences, and multi-layer encoding requires careful design. This post covers how we built a privacy-first HTML decoder that processes everything locally in the browser, with no server involvement.
Why HTML Entity Decoding Is Harder Than It Looks
There are three distinct entity formats a compliant decoder must handle:
Named entities — the most familiar format:
& → &
< → <
→ (non-breaking space)
© → ©
Decimal numeric character references:
© → ©
€ → €
Hexadecimal numeric character references:
© → ©
😀 → 😀
The HTML Living Standard defines 2231 named character references in total. No native browser API decodes HTML entities without DOM involvement — the classic innerHTML trick (el.innerHTML = text; return el.textContent) and the DOMParser API both work by structurally parsing the input as markup, which can corrupt inputs that aren't valid HTML. We needed a pure string-processing solution that treats input as opaque text.
Choosing the Decoding Engine
Given the 2231-entity requirement and the DOM restriction, we evaluated three approaches:
Option 1: Custom entity lookup table. Technically feasible — but it means maintaining a copy of all 2231 entries plus their Unicode values, keeping pace with HTML Living Standard updates, and carrying the same test burden as a mature library. All cost, no benefit.
Option 2: innerHTML / DOMParser trick. Fast and zero-dependency, but explicitly prohibited. DOM parsing mutates structure: <script> tags get reinterpreted, attribute values are normalised, whitespace is collapsed by the parser. A user pasting encoded API payload JSON would get garbled output.
Option 3: The he library. A 32 KB minified, zero-dependency pure JavaScript library implementing the complete WHATWG named character reference list. It handles named entities, decimal references, hexadecimal references, and non-BMP characters via surrogate pair decomposition — and its decode() function operates as pure string processing with no DOM interaction.
import he from 'he'
// Non-strict mode (default): preserves unrecognised entities verbatim
const decoded = he.decode('<h1>Hello & world</h1>')
// → '<h1>Hello & world</h1>'
The he library was chosen. With 170M+ weekly downloads and a direct implementation of the WHATWG spec, it is the de facto standard reference for HTML entity handling in JavaScript.
Handling Malformed Entities
When he.decode() encounters a sequence it cannot resolve — &invalid;, &#xZZ;, or a stray & — it preserves the sequence verbatim in the output. This is the correct behaviour per the spec: silently discarding unknown sequences would cause data loss; throwing errors would break the user experience.
The challenge is counting these anomalies for the warning indicator. he doesn't expose a count — it just silently passes through what it can't resolve. Our solution: scan the decoded output for entity-like sequences that survived decoding unchanged.
function detectAnomalies(decodedText: string): number {
return (decodedText.match(/&[^\s&]+;/g) ?? []).length
}
Why post-decode scanning is correct: Any sequence matching /&[^\s&]+;/g in the decoded output is one that he.decode() could not resolve and left verbatim. Valid entities like & are decoded to & — they no longer look like entity syntax and won't match. Invalid sequences like &invalid; or &#xZZ; survive unchanged and will match.
Consider this input: Price & tax &invalid; &#xZZ;
After he.decode(): Price & tax &invalid; &#xZZ;
Applying the regex to the decoded output: 2 matches (&invalid; and &#xZZ;). The & was decoded to & and correctly does not count — it was a valid entity, not an anomaly.
Why not scan the raw input? A pre-decode regex cannot distinguish &invalid; (syntactically valid name, semantically unknown) from & (syntactically valid, known entity) without duplicating he's entire resolution table. Post-decode scanning leverages he's own logic as the ground truth.
Multi-Layer Decode: Convergence Loop
Some content arrives double- or triple-encoded — for example, &lt; which decodes to < which decodes to <. A single pass only gets you to <.
The multi-layer toggle applies repeated decode passes until the output stabilises:
function htmlDecode(input: string, multiLayer: boolean): HtmlDecodeResult {
if (!multiLayer) {
const decoded = he.decode(input)
return { decoded, anomalyCount: detectAnomalies(decoded) }
}
let current = input
let passes = 0
while (passes < 10) {
const next = he.decode(current)
if (next === current) break // convergence: nothing changed
current = next
passes++
}
return { decoded: current, anomalyCount: detectAnomalies(current) }
}
Termination condition: If he.decode(current) === current, no entities were resolved in this pass. Decoding has converged. This is the mathematically correct stopping condition.
The 10-pass cap: With he's behaviour, decoded characters cannot create new entity syntax — so the loop is guaranteed to converge. The cap is a defensive programming measure. If something unexpected ever created a pathological input, the loop would still terminate within 10 iterations rather than running indefinitely.
Why not a fixed 2-pass decode? Double-encoded content (&lt;) needs 2 passes. Triple-encoded content (&amp;lt;) needs 3. A fixed limit would miss legitimate multi-layer cases. Convergence detection handles all depths correctly.
Performance: 150ms Debounce and the 500 KB Threshold
he.decode() is synchronous — it runs on the main thread. Benchmarking on a modern browser:
- 100 KB of densely encoded HTML: ~15–30ms (imperceptible)
- 500 KB of densely encoded HTML: ~75–150ms (perceptible but not blocking)
- Beyond 500 KB: decode may cause a visible frame drop
Two strategies keep the UI responsive:
1. 150ms debounce on input:
let debounceTimer: ReturnType<typeof setTimeout> | null = null
const debouncedInput = ref('')
watch(encodedInput, () => {
if (debounceTimer) {
clearTimeout(debounceTimer)
}
debounceTimer = setTimeout(() => {
debouncedInput.value = encodedInput.value
}, 150)
}, { immediate: true })
During fast typing, the decode only runs after 150ms of inactivity. This reduces unnecessary decode calls by ~90% during continuous input without any perceptible delay to the user.
2. Non-blocking warning above 500 KB:
Rather than imposing a hard cap, inputs above 500 KB display a UAlert warning and continue decoding. This matches the spec requirement: users working with large encoded payloads (full HTML documents, serialised XML) must not be artificially blocked.
Why no Web Worker? Offloading he.decode() to a Web Worker would require message passing, error propagation, and bridging Vue reactivity across thread boundaries — significant complexity for a 500 KB target that synchronous decode handles acceptably. The Simplicity First principle applies: minimum necessary complexity for the current requirement.
Privacy-First Architecture
Like the URL Encode and URL Decode tools before it, the HTML Decode tool operates with zero server transmission:
- No API calls:
he.decode()runs entirely in the browser process - No telemetry: no analytics scripts capture or log input content
- Clipboard API only: copy-to-clipboard uses
navigator.clipboard.writeText()— a browser-native operation that never touches a server - Offline capable: once the page is loaded, decoding works without internet access
This privacy posture matters because HTML-encoded content often carries sensitive data — API responses with embedded tokens, CMS exports containing customer data, log files with authentication headers. Users can paste this content confidently, knowing it never leaves their device.
Technical Stack
- Vue 3 Composition API:
ref(),computed(),watch()— reactive state without a global store - Nuxt 4: SPA routing with
useSeoMeta()and file-based pages - TypeScript:
HtmlDecodeResultinterface enforces the decode contract helibrary: WHATWG-compliant HTML entity codec, zero dependencies, 32 KB minified- TailwindCSS + Nuxt UI:
UCard,UTextarea,UAlertfor consistent, accessible UI - Clipboard API: native browser API for zero-server copy-to-clipboard
Edge Cases Handled
✓ Named entities (&, , ©, all 2231+ from HTML Living Standard)
✓ Decimal numeric references (© → ©)
✓ Hexadecimal numeric references (© → ©, 😀 → 😀)
✓ Non-BMP characters (codepoints above U+FFFF via surrogate pair decomposition)
✓ Malformed entities preserved verbatim (&invalid;, &#xZZ; left unchanged)
✓ Stray ampersands treated as literal text
✓ Multi-layer encoded content (&amp;lt; → < with 2 passes)
✓ Mixed encoded and literal characters
✓ Empty input — no warnings or errors
✓ Whitespace-only input handled gracefully
✓ Inputs above 500 KB — non-blocking warning, decoding continues
Key Takeaways
- The DOM is not always the answer. The
innerHTMLtrick works for simple cases but breaks for non-HTML inputs. When the spec says "no DOM parsing", there's always a pure string alternative — find it. - Post-decode scanning beats pre-decode prediction. Detecting anomalies after decoding uses the library's own resolution logic as the ground truth. Trying to predict what the library will or won't resolve requires duplicating its entire knowledge base.
- Convergence is a better loop condition than a counter. Multi-layer decode terminates when output equals input — a mathematically correct condition. Fixed pass counts either under-serve deeply encoded content or waste cycles on already-stable output.
- Debouncing is essential for real-time string processing. 150ms of inactivity before decoding eliminates the vast majority of intermediate states during typing, with no perceptible delay to the user.
- Privacy by design eliminates entire classes of risk. Client-side processing doesn't just protect the user — it eliminates the server infrastructure, the logging concerns, the data retention policies, and the breach surface entirely.
Try It Yourself
Visit the HTML Decode tool to decode your HTML-encoded strings. Paste encoded API responses, CMS exports, or log file fragments — everything processes locally in your browser, with nothing transmitted to any server.
Decoding URLs Safely: Building an RFC 3986 Compliant Decoder
Deep dive into URL decoding implementation using native browser APIs. Learn how to handle malformed sequences, UTF-8 edge cases, and maintain sub-second performance while preserving user privacy.
Programming is like Cultivating
The similarities between programming and cultivation.