Web Development, Vue, Unicode, Algorithms

Building a Change Case Tool with a Unicode-Safe Tokenizer

How we built a browser-native text case converter supporting 14 transformation modes — from snake_case to Title Case to Alternating Case — using a custom tokenizer that handles camelCase bumps, acronym boundaries, digit transitions, emoji, and non-Latin scripts.

Why One More Case Converter?

The landscape of online case converters splits cleanly into two categories: tools that handle simple whitespace-delimited text, and tools that handle programmer identifiers. Almost none handle both — and almost none explain why identifier conversion is harder than it looks.

The task seems trivial. Convert APIResponseCode to snake_case. The expected output is api_response_code. But if your tool splits on uppercase letters naively, you get a_p_i_response_code — treating each letter of the acronym as a separate word. A tokenizer that understands acronyms is the core differentiator.

This post documents the design of the shared tokenizer at the heart of the Change Case tool, and the decisions behind each of the 14 transformation modes.

The Tokenizer Problem

Converting between identifier formats requires splitting input into semantic word tokens before reassembling them in the target format. The challenge is that the same input string can come from any naming convention, so the tokenizer must handle all of them simultaneously.

Consider these inputs, all of which should tokenize to the same three-token sequence ["version", "2", "number"]:

  • version2Number (camelCase with digit)
  • version-2-number (kebab-case)
  • version_2_number (snake_case)
  • VERSION_2_NUMBER (CONSTANT_CASE)

The tokenizer rules, applied in order, are:

  1. Explicit delimiters — whitespace, -, _, ., / are word boundaries. Consecutive delimiters are collapsed into one boundary; leading and trailing are stripped.
  2. camelCase bump — a lowercase→uppercase transition (lU) inserts a boundary before the uppercase letter.
  3. Acronym end — an uppercase run followed by a lowercase letter inserts a boundary before the last uppercase in the run. HTMLParser → ["HTML", "Parser"] (not ["HTMLP", "arser"]).
  4. Letter-to-digit — version2 → ["version", "2"].
  5. Digit-to-letter — 2Number → ["2", "Number"].
  6. Emoji / non-Latin — treated as opaque tokens; a boundary is inserted on both sides, and the character is emitted unchanged.

The acronym rule (3) is the most subtle. The boundary is inserted before the last uppercase letter in a run, not after the first. For HTMLParser, the characters are H-T-M-L-P-a-r-s-e-r. The run is H-T-M-L-P, but P is the last uppercase before the lowercase arser. So the boundary goes before P, yielding ["HTML", "Parser"] — the semantically correct split.
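
The two case-transition rules can be approximated with a pair of regexes. This is a hypothetical sketch for illustration only — it covers rules 2 and 3, not delimiters or digits, and uses "\u0000" as a temporary boundary marker:

```typescript
// Sketch of rules 2 (camelCase bump) and 3 (acronym end) via regexes.
function splitCaseBumps(word: string): string[] {
  return word
    // acronym end: boundary before the last uppercase in a run
    // (HTMLParser → HTML␀Parser)
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1\u0000$2")
    // camelCase bump: boundary at each lowercase→uppercase transition
    // (parseHtml → parse␀Html)
    .replace(/([a-z])([A-Z])/g, "$1\u0000$2")
    .split("\u0000");
}
```

splitCaseBumps("HTMLParser") yields ["HTML", "Parser"], and splitCaseBumps("parseHTML") yields ["parse", "HTML"] — the acronym pattern fires first, so a trailing acronym is never split.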

Implementing the Tokenizer

The tokenizer iterates over the input one Unicode code point at a time using the spread iterator ([...input]), which is safe for emoji and supplementary characters that occupy two UTF-16 code units.

function tokenize(input: string, stripPunctuation = false): string[] {
  const tokens: string[] = [];
  let current = "";
  const chars = [...input]; // code-point-safe

  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    const prev = i > 0 ? chars[i - 1] : null;
    const next = i < chars.length - 1 ? chars[i + 1] : null;

    if (isExplicitDelimiter(ch)) {
      if (current) {
        tokens.push(current);
        current = "";
      }
    } else if (isEmojiChar(ch) || isNonLatinLetter(ch)) {
      if (current) {
        tokens.push(current);
        current = "";
      }
      tokens.push(ch); // emit as standalone opaque token
    } else if (isDigit(ch)) {
      if (current && prev && !isDigit(prev)) {
        tokens.push(current);
        current = ""; // letter→digit boundary
      }
      current += ch;
    } else if (isLatinLetter(ch)) {
      if (current && prev && isDigit(prev)) {
        tokens.push(current);
        current = ""; // digit→letter boundary
      } else if (isUpperLatin(ch) && prev && isLowerLatin(prev)) {
        tokens.push(current);
        current = ""; // camelCase bump
      } else if (
        isUpperLatin(ch) &&
        prev &&
        isUpperLatin(prev) &&
        next &&
        isLowerLatin(next)
      ) {
        tokens.push(current);
        current = ""; // acronym end
      }
      current += ch;
    } else {
      // punctuation — keep or strip
      if (stripPunctuation) {
        if (current) {
          tokens.push(current);
          current = "";
        }
      } else {
        current += ch;
      }
    }
  }

  if (current) tokens.push(current);
  return tokens.filter((t) => t !== "");
}

Each character is classified independently, relying only on its immediate neighbours — prev and next. This keeps the logic O(n) with a single pass.
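
The classifier helpers are not shown above. A plausible set of definitions using Unicode property escapes follows — the names match the calls in the loop, but the real implementations may differ, and note that this sketch treats accented Latin letters such as é as non-Latin, since isLatinLetter checks ASCII only:

```typescript
// Hypothetical character classifiers for the tokenizer loop.
const isExplicitDelimiter = (ch: string): boolean => /^[\s\-_./]$/.test(ch);
const isDigit = (ch: string): boolean => /^[0-9]$/.test(ch);
const isLatinLetter = (ch: string): boolean => /^[A-Za-z]$/.test(ch);
const isUpperLatin = (ch: string): boolean => /^[A-Z]$/.test(ch);
const isLowerLatin = (ch: string): boolean => /^[a-z]$/.test(ch);
// Any Unicode letter that is not ASCII Latin (CJK, Arabic, Devanagari, …)
const isNonLatinLetter = (ch: string): boolean =>
  /^\p{L}$/u.test(ch) && !isLatinLetter(ch);
// Single code points with default emoji presentation (🚀, 🎉, …)
const isEmojiChar = (ch: string): boolean => /^\p{Emoji_Presentation}$/u.test(ch);
```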

The 14 Transformation Modes

The modes divide into two categories: Letter Case modes operate directly on the input string without tokenization, and Identifier Format modes tokenize first and then reassemble.

Letter Case Modes

Lowercase and Uppercase delegate to toLocaleLowerCase(locale) and toLocaleUpperCase(locale). The locale parameter matters for Turkish, where lowercase i uppercases to İ (dotted capital I) and uppercase I lowercases to ı (dotless small i) — different from the English mapping.
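
A quick illustration, assuming a JavaScript runtime with full ICU data (Node's default build ships it):

```typescript
// Turkish casing differs from the English mapping for the letter i.
const turkishUpper = "i".toLocaleUpperCase("tr"); // "İ" (U+0130, dotted capital I)
const turkishLower = "I".toLocaleLowerCase("tr"); // "ı" (U+0131, dotless small i)
const englishUpper = "i".toLocaleUpperCase("en"); // "I"
```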

Title Case (Editorial) uses whitespace-only tokenization and applies a skip list of articles, coordinating conjunctions, and short prepositions (a, an, the, and, but, or, in, of, on, etc.). The first and last words are always capitalised regardless of the skip list. Crucially, this mode does not apply case-transition splitting — APIResponseCode is a single word and is title-cased as Apiresponsecode, not split into Api Response Code.
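
A minimal sketch of this mode, with a truncated skip list (the real list is longer):

```typescript
// Editorial title case: whitespace-only splitting, skip list for short
// words, first and last words always capitalised.
const SKIP = new Set(["a", "an", "the", "and", "but", "or", "in", "of", "on"]);

function titleCase(text: string): string {
  const words = text.split(/\s+/).filter(Boolean);
  return words
    .map((w, i) => {
      const lower = w.toLowerCase();
      const isEdge = i === 0 || i === words.length - 1;
      if (!isEdge && SKIP.has(lower)) return lower;
      return lower.charAt(0).toUpperCase() + lower.slice(1);
    })
    .join(" ");
}
```

Because the split is whitespace-only, titleCase("APIResponseCode") returns "Apiresponsecode" — one word, no identifier splitting.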

Sentence Case is deliberately naive — it lowercases everything, then capitalises the first non-whitespace character and any character after ., !, or ? followed by whitespace. No abbreviation or proper-noun detection: mr. smith goes to washington yields Mr. Smith goes to washington — the period after the abbreviation triggers a capital S (coincidentally correct here), while the proper noun washington stays lowercase. The documented behaviour, not a bug.
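
A sketch of the naive algorithm:

```typescript
// Naive sentence case: lowercase everything, then capitalise the first
// non-whitespace character and any letter after '.', '!' or '?' plus
// whitespace. Deliberately no abbreviation or proper-noun detection.
function sentenceCase(text: string): string {
  return text
    .toLowerCase()
    .replace(/(^\s*|[.!?]\s+)(\p{L})/gu, (_m, pre, ch) => pre + ch.toUpperCase());
}
```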

Invert Case is a pure character-level operation — uppercase becomes lowercase and vice versa. No tokenization, no locale. Non-alphabetic characters pass through unchanged.

Alternating Case is also character-level, with three configurable reset scopes: global (single counter for the entire string), per-word (counter resets at whitespace), and per-line (counter resets at newlines). Non-alphabetic characters pass through unchanged and do not advance the counter. With per-word scope, hello world becomes hElLo wOrLd — the counter starts fresh at w, so each word begins with the same case.
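
A sketch of the counter-and-scope logic, assuming the first letter of each scope is lowercased:

```typescript
type Scope = "global" | "word" | "line";

// Alternating case: even counter positions lowercase, odd uppercase.
// Non-alphabetic characters pass through and do not advance the counter;
// the scope decides where the counter resets.
function alternatingCase(text: string, scope: Scope = "global"): string {
  let n = 0;
  let out = "";
  for (const ch of text) {
    if ((scope === "word" && /\s/.test(ch)) || (scope === "line" && ch === "\n")) {
      n = 0; // reset at the scope boundary
      out += ch;
    } else if (/\p{L}/u.test(ch)) {
      out += n % 2 === 0 ? ch.toLowerCase() : ch.toUpperCase();
      n++;
    } else {
      out += ch; // punctuation/digits pass through, counter unchanged
    }
  }
  return out;
}
```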

Identifier Format Modes

All seven identifier modes share the tokenizer. After splitting, they differ only in how tokens are cased and what separator is used:

Mode              Token casing                      Separator
camelCase         first: lower; rest: Capitalised   (none)
PascalCase        all: Capitalised                  (none)
snake_case        all: lower                        _
CONSTANT_CASE     all: UPPER                        _
kebab-case        all: lower                        -
dot.case          all: lower                        .
Path/Slash Case   all: lower                        /
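
Reassembly is then a case-map plus a join. A sketch with illustrative mode names (hypothetical, not the tool's actual identifiers):

```typescript
// Capitalise one token: "response" → "Response".
const cap = (t: string): string =>
  t.charAt(0).toUpperCase() + t.slice(1).toLowerCase();

// Reassemble tokenizer output in the target identifier format.
function assemble(tokens: string[], mode: string): string {
  if (tokens.length === 0) return "";
  const lower = tokens.map((t) => t.toLowerCase());
  switch (mode) {
    case "camel":    return lower[0] + lower.slice(1).map(cap).join("");
    case "pascal":   return lower.map(cap).join("");
    case "snake":    return lower.join("_");
    case "constant": return tokens.map((t) => t.toUpperCase()).join("_");
    case "kebab":    return lower.join("-");
    case "dot":      return lower.join(".");
    default:         return lower.join("/"); // path/slash case
  }
}
```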

The Preserve Known Acronyms advanced option affects camelCase and PascalCase. When enabled, tokens matching the built-in acronym list (HTML, CSS, API, URL, JWT, UUID, and 20 others) are emitted in canonical form rather than title-cased. parseHTML stays parseHTML (not parseHtml); parseHTTP stays parseHTTP (not parseHttp).
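
A sketch of how the option might hook into token capitalisation — the helper name is hypothetical and the acronym list is truncated here:

```typescript
// Canonical forms keyed by lowercase token; the built-in list is longer.
const ACRONYMS = new Map(
  ["HTML", "CSS", "API", "URL", "JWT", "UUID"].map(
    (a) => [a.toLowerCase(), a] as [string, string],
  ),
);

// With the option on, a known acronym keeps its canonical form;
// otherwise the token is title-cased like any other.
const capitaliseToken = (t: string, preserveAcronyms: boolean): string =>
  preserveAcronyms && ACRONYMS.has(t.toLowerCase())
    ? ACRONYMS.get(t.toLowerCase())!
    : t.charAt(0).toUpperCase() + t.slice(1).toLowerCase();
```

With the option on, the tokens of parseHTML reassemble as parseHTML; with it off, the HTML token is title-cased to Html, giving parseHtml.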

Batch Mode

Batch mode splits the input on newlines and applies the transformation independently to each line before rejoining with \n. Blank lines are preserved positionally — an empty string transformed is still an empty string. This allows converting a list of identifiers (one per line) without manual iteration.

if (batchMode) {
  return input
    .split("\n")
    .map((line) => applyTransform(line, mode, cfg))
    .join("\n");
}

The simplicity here is the point. Batch mode is a one-liner wrapper, not a separate code path.

Handling the Edge Cases

Consecutive delimiters (__value--name): the tokenizer emits a boundary at each delimiter but only pushes the current accumulator when it is non-empty. The result is ["value", "name"] — no empty tokens.

Emoji in identifier modes (🚀 launch): the emoji is emitted as a standalone opaque token. For snake_case, tokens are joined with _, so 🚀 launch becomes 🚀_launch. The emoji is preserved exactly.

Non-Latin scripts (Arabic, CJK, Devanagari): classified as non-Latin letters and emitted as standalone tokens. For casing modes (lowercase, uppercase), they pass through unchanged — there is no case concept in these scripts.

Whitespace-only input: for identifier modes, all characters are delimiters and the token array is empty, so the output is an empty string. For casing modes, the input is returned lowercased/uppercased/etc. as-is.

Privacy Architecture

The privacy guarantee is non-negotiable for a tool where developers might paste internal identifier names, database schema names, or API field names from proprietary systems. All transformations are synchronous, in-browser operations using only JavaScript string methods. There is no fetch, no axios, no WebSocket, and no analytics telemetry.

// Pure in-browser — no network calls of any kind
function applyTransform(
  text: string,
  mode: ModeId,
  cfg: TransformConfig,
): string {
  // ... string operations only
}

The useDebounceFn composable from @vueuse/core handles the 150 ms keystroke debounce — it is a timer, not a network call. The Clipboard API writes to the system clipboard with no external transmission.

Key Takeaways

Building a correct case converter requires more care than stringing together .toLowerCase() calls:

  1. Tokenise first, format second — identifier conversion requires splitting into semantic tokens regardless of the input format. The tokenizer is the core of the tool.
  2. Acronym boundaries need look-ahead — the HTMLParser → ["HTML", "Parser"] split requires checking whether the next character is lowercase, not just whether the current one is uppercase.
  3. Spread for Unicode safety — [...input] respects Unicode code points; iterating by code unit would break emoji and supplementary characters.
  4. Letter Case modes skip the tokenizer — whitespace-only splitting (Title Case, Sentence Case) produces different results from identifier splitting, and that's intentional.
  5. Locale matters for casing — Turkish dotted and dotless i require toLocaleLowerCase(locale) and toLocaleUpperCase(locale) rather than toLowerCase(); German ß, by contrast, uppercases to SS under every locale.

The full implementation is in spa/app/components/ChangeCaseConverter.vue, following the same client-side architecture pattern as the URL Encoder, HTML Encoder, and Base64 tools in this platform.

Code Cultivation • © 2026