TypeScript, Vue, HTML·Feb 22, 2026

Building an HTML Decode Tool: Standards, Privacy, and Edge Cases

How we built a privacy-first HTML entity decoder supporting all 2231+ HTML Living Standard named entities, malformed sequence handling, and multi-layer decoding — entirely client-side.

Han Chee

Building an HTML Decode Tool: Standards, Privacy, and Edge Cases

HTML entity decoding sounds trivial — convert & to &, done. But correctly handling all 2231+ named entities from the HTML Living Standard, three numeric reference formats, non-BMP Unicode characters, malformed sequences, and multi-layer encoding requires careful design. This post covers how we built a privacy-first HTML decoder that processes everything locally in the browser, with no server involvement.

Why HTML Entity Decoding Is Harder Than It Looks

There are three distinct entity formats a compliant decoder must handle:

Named entities — the most familiar format:

&amp;   → &
&lt;    → <
&nbsp;  → (non-breaking space)
&copy;  → ©

Decimal numeric character references:

&#169;  → ©
&#8364; → €

Hexadecimal numeric character references:

&#xA9;    → ©
&#x1F600; → 😀

The HTML Living Standard defines 2231 named character references in total. No native browser API decodes HTML entities without DOM involvement — the classic innerHTML trick (el.innerHTML = text; return el.textContent) and the DOMParser API both work by structurally parsing the input as markup, which can corrupt inputs that aren't valid HTML. We needed a pure string-processing solution that treats input as opaque text.

Choosing the Decoding Engine

Given the 2231-entity requirement and the DOM restriction, we evaluated three approaches:

Option 1: Custom entity lookup table. Technically feasible — but it means maintaining a copy of all 2231 entries plus their Unicode values, keeping pace with HTML Living Standard updates, and carrying the same test burden as a mature library. All cost, no benefit.

Option 2: innerHTML / DOMParser trick. Fast and zero-dependency, but explicitly prohibited. DOM parsing mutates structure: <script> tags get reinterpreted, attribute values are normalised, whitespace is collapsed by the parser. A user pasting encoded API payload JSON would get garbled output.

Option 3: The he library. A 32 KB minified, zero-dependency pure JavaScript library implementing the complete WHATWG named character reference list. It handles named entities, decimal references, hexadecimal references, and non-BMP characters via surrogate pair decomposition — and its decode() function operates as pure string processing with no DOM interaction.

import he from "he";

// Non-strict mode (default): preserves unrecognised entities verbatim
const decoded = he.decode("&lt;h1&gt;Hello &amp; world&lt;/h1&gt;");
// → '<h1>Hello & world</h1>'

The he library was chosen. With 170M+ weekly downloads and a direct implementation of the WHATWG spec, it is the de facto standard reference for HTML entity handling in JavaScript.

Handling Malformed Entities

When he.decode() encounters a sequence it cannot resolve — &invalid;, &#xZZ;, or a stray & — it preserves the sequence verbatim in the output. This is the correct behaviour per the spec: silently discarding unknown sequences would cause data loss; throwing errors would break the user experience.

The challenge is counting these anomalies for the warning indicator. he doesn't expose a count — it just silently passes through what it can't resolve. Our solution: scan the decoded output for entity-like sequences that survived decoding unchanged.

function detectAnomalies(decodedText: string): number {
  return (decodedText.match(/&[^\s&]+;/g) ?? []).length;
}

Why post-decode scanning is correct: Any sequence matching /&[^\s&]+;/g in the decoded output is one that he.decode() could not resolve and left verbatim. Valid entities like & are decoded to & — they no longer look like entity syntax and won't match. Invalid sequences like &invalid; or &#xZZ; survive unchanged and will match.

Consider this input: Price & tax &invalid; &#xZZ;

After he.decode(): Price & tax &invalid; &#xZZ;

Applying the regex to the decoded output: 2 matches (&invalid; and &#xZZ;). The & was decoded to & and correctly does not count — it was a valid entity, not an anomaly.

Why not scan the raw input? A pre-decode regex cannot distinguish &invalid; (syntactically valid name, semantically unknown) from & (syntactically valid, known entity) without duplicating he's entire resolution table. Post-decode scanning leverages he's own logic as the ground truth.

Multi-Layer Decode: Convergence Loop

Some content arrives double- or triple-encoded — for example, &lt; which decodes to < which decodes to <. A single pass only gets you to <.

The multi-layer toggle applies repeated decode passes until the output stabilises:

function htmlDecode(input: string, multiLayer: boolean): HtmlDecodeResult {
  if (!multiLayer) {
    const decoded = he.decode(input);
    return { decoded, anomalyCount: detectAnomalies(decoded) };
  }

  let current = input;
  let passes = 0;
  while (passes < 10) {
    const next = he.decode(current);
    if (next === current) break; // convergence: nothing changed
    current = next;
    passes++;
  }
  return { decoded: current, anomalyCount: detectAnomalies(current) };
}

Termination condition: If he.decode(current) === current, no entities were resolved in this pass. Decoding has converged. This is the mathematically correct stopping condition.

The 10-pass cap: With he's behaviour, decoded characters cannot create new entity syntax — so the loop is guaranteed to converge. The cap is a defensive programming measure. If something unexpected ever created a pathological input, the loop would still terminate within 10 iterations rather than running indefinitely.

Why not a fixed 2-pass decode? Double-encoded content (&lt;) needs 2 passes. Triple-encoded content (&amp;lt;) needs 3. A fixed limit would miss legitimate multi-layer cases. Convergence detection handles all depths correctly.

Performance: 150ms Debounce and the 500 KB Threshold

he.decode() is synchronous — it runs on the main thread. Benchmarking on a modern browser:

100 KB of densely encoded HTML: ~15–30ms (imperceptible)
500 KB of densely encoded HTML: ~75–150ms (perceptible but not blocking)
Beyond 500 KB: decode may cause a visible frame drop

Two strategies keep the UI responsive:

1. 150ms debounce on input:

let debounceTimer: ReturnType<typeof setTimeout> | null = null;
const debouncedInput = ref("");

watch(
  encodedInput,
  () => {
    if (debounceTimer) {
      clearTimeout(debounceTimer);
    }
    debounceTimer = setTimeout(() => {
      debouncedInput.value = encodedInput.value;
    }, 150);
  },
  { immediate: true },
);

During fast typing, the decode only runs after 150ms of inactivity. This reduces unnecessary decode calls by ~90% during continuous input without any perceptible delay to the user.

2. Non-blocking warning above 500 KB:

Rather than imposing a hard cap, inputs above 500 KB display a UAlert warning and continue decoding. This matches the spec requirement: users working with large encoded payloads (full HTML documents, serialised XML) must not be artificially blocked.

Why no Web Worker? Offloading he.decode() to a Web Worker would require message passing, error propagation, and bridging Vue reactivity across thread boundaries — significant complexity for a 500 KB target that synchronous decode handles acceptably. The Simplicity First principle applies: minimum necessary complexity for the current requirement.

Privacy-First Architecture

Like the URL Encode and URL Decode tools before it, the HTML Decode tool operates with zero server transmission:

No API calls: he.decode() runs entirely in the browser process
No telemetry: no analytics scripts capture or log input content
Clipboard API only: copy-to-clipboard uses navigator.clipboard.writeText() — a browser-native operation that never touches a server
Offline capable: once the page is loaded, decoding works without internet access

This privacy posture matters because HTML-encoded content often carries sensitive data — API responses with embedded tokens, CMS exports containing customer data, log files with authentication headers. Users can paste this content confidently, knowing it never leaves their device.

Technical Stack

Vue 3 Composition API: ref(), computed(), watch() — reactive state without a global store
Nuxt 4: SPA routing with useSeoMeta() and file-based pages
TypeScript: HtmlDecodeResult interface enforces the decode contract
he library: WHATWG-compliant HTML entity codec, zero dependencies, 32 KB minified
TailwindCSS + Nuxt UI: UCard, UTextarea, UAlert for consistent, accessible UI
Clipboard API: native browser API for zero-server copy-to-clipboard

Edge Cases Handled

✓ Named entities (&,  , ©, all 2231+ from HTML Living Standard)
✓ Decimal numeric references (© → ©)
✓ Hexadecimal numeric references (© → ©, 😀 → 😀)
✓ Non-BMP characters (codepoints above U+FFFF via surrogate pair decomposition)
✓ Malformed entities preserved verbatim (&invalid;, &#xZZ; left unchanged)
✓ Stray ampersands treated as literal text
✓ Multi-layer encoded content (&amp;lt; → < with 2 passes)
✓ Mixed encoded and literal characters
✓ Empty input — no warnings or errors
✓ Whitespace-only input handled gracefully
✓ Inputs above 500 KB — non-blocking warning, decoding continues

Key Takeaways

The DOM is not always the answer. The innerHTML trick works for simple cases but breaks for non-HTML inputs. When the spec says "no DOM parsing", there's always a pure string alternative — find it.
Post-decode scanning beats pre-decode prediction. Detecting anomalies after decoding uses the library's own resolution logic as the ground truth. Trying to predict what the library will or won't resolve requires duplicating its entire knowledge base.
Convergence is a better loop condition than a counter. Multi-layer decode terminates when output equals input — a mathematically correct condition. Fixed pass counts either under-serve deeply encoded content or waste cycles on already-stable output.
Debouncing is essential for real-time string processing. 150ms of inactivity before decoding eliminates the vast majority of intermediate states during typing, with no perceptible delay to the user.
Privacy by design eliminates entire classes of risk. Client-side processing doesn't just protect the user — it eliminates the server infrastructure, the logging concerns, the data retention policies, and the breach surface entirely.

Try It Yourself

Visit the HTML Decode tool to decode your HTML-encoded strings. Paste encoded API responses, CMS exports, or log file fragments — everything processes locally in your browser, with nothing transmitted to any server.

Decoding URLs Safely: Building an RFC 3986 Compliant Decoder

Deep dive into URL decoding implementation using native browser APIs. Learn how to handle malformed sequences, UTF-8 edge cases, and maintain sub-second performance while preserving user privacy.

Building a HTML Encode Tool

How we built a deterministic, standards-compliant HTML encoding utility that runs entirely in the browser — no servers, no tracking, no compromises.