Building a Change Case Tool with a Unicode-Safe Tokenizer
Why One More Case Converter?
The landscape of online case converters splits cleanly into two categories: tools that handle simple whitespace-delimited text, and tools that handle programmer identifiers. Almost none handle both — and almost none explain why identifier conversion is harder than it looks.
The task seems trivial. Convert APIResponseCode to snake_case. The expected output is api_response_code. But if your tool splits on uppercase letters naively, you get a_p_i_response_code — treating each letter of the acronym as a separate word. A tokenizer that understands acronyms is the core differentiator.
This post documents the design of the shared tokenizer at the heart of the Change Case tool, and the decisions behind each of the 14 transformation modes.
The Tokenizer Problem
Converting between identifier formats requires splitting input into semantic word tokens before reassembling them in the target format. The challenge is that the same input string can come from any naming convention, so the tokenizer must handle all of them simultaneously.
Consider these inputs, all of which should tokenize to the same three-token sequence ["version", "2", "number"]:
- version2Number (camelCase with digit)
- version-2-number (kebab-case)
- version_2_number (snake_case)
- VERSION_2_NUMBER (CONSTANT_CASE)
The tokenizer rules, applied in order, are:
1. Explicit delimiters — whitespace, -, _, ., and / are word boundaries. Consecutive delimiters are collapsed into one boundary; leading and trailing delimiters are stripped.
2. camelCase bump — a lowercase→uppercase transition (lU) inserts a boundary before the uppercase letter.
3. Acronym end — an uppercase run followed by a lowercase letter inserts a boundary before the last uppercase in the run. HTMLParser → ["HTML", "Parser"] (not ["HTMLP", "arser"]).
4. Letter-to-digit — version2 → ["version", "2"].
5. Digit-to-letter — 2Number → ["2", "Number"].
6. Emoji / non-Latin — treated as opaque tokens; a boundary is inserted on both sides, and the character is emitted unchanged.
The acronym rule (3) is the most subtle. The boundary is inserted before the last uppercase letter in a run, not after the first. For HTMLParser, the characters are H-T-M-L-P-a-r-s-e-r. The run is H-T-M-L-P, but P is the last uppercase before the lowercase arser. So the boundary goes before P, yielding ["HTML", "Parser"] — the semantically correct split.
Implementing the Tokenizer
The tokenizer iterates over the input one Unicode code point at a time using the spread iterator ([...input]), which is safe for emoji and supplementary characters that occupy two UTF-16 code units.
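A quick illustration of the difference (a standalone demo, not part of the tool's code):

```typescript
const rocket = "🚀"; // U+1F680, stored as two UTF-16 code units

console.log(rocket.length);      // 2 — counts code units
console.log([...rocket].length); // 1 — counts code points
console.log(rocket.split(""));   // ["\ud83d", "\ude80"] — broken lone surrogates
console.log([...rocket]);        // ["🚀"] — intact
```

Iterating with `split("")` would hand the classifier half an emoji at a time; the spread iterator never does.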
function tokenize(input: string, stripPunctuation = false): string[] {
const tokens: string[] = [];
let current = "";
const chars = [...input]; // code-point-safe
for (let i = 0; i < chars.length; i++) {
const ch = chars[i];
const prev = i > 0 ? chars[i - 1] : null;
const next = i < chars.length - 1 ? chars[i + 1] : null;
if (isExplicitDelimiter(ch)) {
if (current) {
tokens.push(current);
current = "";
}
} else if (isEmojiChar(ch) || isNonLatinLetter(ch)) {
if (current) {
tokens.push(current);
current = "";
}
tokens.push(ch); // emit as standalone opaque token
} else if (isDigit(ch)) {
if (current && prev && !isDigit(prev)) {
tokens.push(current);
current = ""; // letter→digit boundary
}
current += ch;
} else if (isLatinLetter(ch)) {
if (current && prev && isDigit(prev)) {
tokens.push(current);
current = ""; // digit→letter boundary
} else if (isUpperLatin(ch) && prev && isLowerLatin(prev)) {
tokens.push(current);
current = ""; // camelCase bump
} else if (
isUpperLatin(ch) &&
prev &&
isUpperLatin(prev) &&
next &&
isLowerLatin(next)
) {
tokens.push(current);
current = ""; // acronym end
}
current += ch;
} else {
// punctuation — keep or strip
if (stripPunctuation) {
if (current) {
tokens.push(current);
current = "";
}
} else {
current += ch;
}
}
}
if (current) tokens.push(current);
return tokens.filter((t) => t !== "");
}
Each character is classified independently, relying only on its immediate neighbours — prev and next. This keeps the logic O(n) with a single pass.
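The listing relies on classification helpers (isDigit, isUpperLatin, and so on) that are not shown. One possible sketch, using Unicode property escapes — the bodies here are illustrative assumptions, not the tool's actual implementations:

```typescript
// Hypothetical predicate implementations for the tokenizer listing above.
const isExplicitDelimiter = (ch: string): boolean => /^[\s\-_./]$/.test(ch);

const isDigit = (ch: string): boolean => /^[0-9]$/.test(ch);

const isLatinLetter = (ch: string): boolean => /^[A-Za-z]$/.test(ch);
const isUpperLatin = (ch: string): boolean => /^[A-Z]$/.test(ch);
const isLowerLatin = (ch: string): boolean => /^[a-z]$/.test(ch);

// Any letter outside the basic Latin range (CJK, Arabic, Devanagari, …).
const isNonLatinLetter = (ch: string): boolean =>
  /^\p{L}$/u.test(ch) && !isLatinLetter(ch);

// Extended_Pictographic covers most emoji base characters; multi-code-point
// sequences (flags, keycaps, ZWJ families) would need extra handling.
const isEmojiChar = (ch: string): boolean =>
  /^\p{Extended_Pictographic}$/u.test(ch);
```

The `u` flag is required for the `\p{…}` property escapes; all seven predicates operate on a single code point at a time, matching how the loop feeds them.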
The 14 Transformation Modes
The modes divide into two categories: Letter Case modes operate directly on the input string without tokenization, and Identifier Format modes tokenize first and then reassemble.
Letter Case Modes
Lowercase and Uppercase delegate to toLocaleLowerCase(locale) and toLocaleUpperCase(locale). The locale parameter matters for Turkish, where lowercase i uppercases to İ (dotted capital I) and uppercase I lowercases to ı (dotless small i) — different from the English mapping.
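This is standard ECMAScript behaviour (assuming a runtime with full ICU data, as in modern Node.js and browsers):

```typescript
// Turkish casing differs from the root-locale mapping:
console.log("i".toLocaleUpperCase("tr")); // "İ" (U+0130, dotted capital I)
console.log("I".toLocaleLowerCase("tr")); // "ı" (U+0131, dotless small i)

// The English/root mapping, for comparison:
console.log("i".toUpperCase()); // "I"
console.log("I".toLowerCase()); // "i"
```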
Title Case (Editorial) uses whitespace-only tokenization and applies a skip list of articles, coordinating conjunctions, and short prepositions (a, an, the, and, but, or, in, of, on, etc.). The first and last words are always capitalised regardless of the skip list. Crucially, this mode does not apply case-transition splitting — APIResponseCode is a single word and is title-cased as Apiresponsecode, not split into Api Response Code.
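A minimal sketch of that behaviour — the skip list here is an illustrative subset, not the tool's full list:

```typescript
// Illustrative subset of the editorial skip list (articles, conjunctions,
// short prepositions).
const SKIP = new Set(["a", "an", "the", "and", "but", "or", "nor",
                      "for", "in", "of", "on", "at", "to", "by"]);

// Editorial title case: whitespace-only tokenization; skip-list words stay
// lowercase unless they are the first or last word.
function titleCase(input: string): string {
  const words = input.trim().split(/\s+/);
  return words
    .map((w, i) => {
      const lower = w.toLowerCase();
      const isEdge = i === 0 || i === words.length - 1;
      if (!isEdge && SKIP.has(lower)) return lower;
      return lower.charAt(0).toUpperCase() + lower.slice(1);
    })
    .join(" ");
}

console.log(titleCase("the war of the worlds")); // "The War of the Worlds"
console.log(titleCase("APIResponseCode"));       // "Apiresponsecode" — no case-transition split
```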
Sentence Case is deliberately naive — it lowercases everything, then capitalises the first non-whitespace character and any character that follows ., !, or ? plus whitespace. There is no abbreviation detection: Mr. Smith goes to Washington yields Mr. Smith goes to washington. The period after Mr. happens to capitalise the S in Smith (correct by coincidence), but Washington is lowercased and never restored, because no sentence terminator precedes it. This is the documented behaviour, not a bug.
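The naive rule fits in a single regex replace — a sketch of the described behaviour, not the tool's exact code:

```typescript
// Naive sentence case: lowercase everything, then capitalise the first
// non-whitespace character and any letter following ".", "!", or "?"
// plus whitespace. Deliberately no abbreviation detection.
function sentenceCase(input: string): string {
  return input
    .toLowerCase()
    .replace(/(^\s*|[.!?]\s+)(\p{L})/gu, (_m, boundary: string, letter: string) =>
      boundary + letter.toUpperCase());
}

console.log(sentenceCase("hello world. HOW are you?"));
// "Hello world. How are you?"
console.log(sentenceCase("mr. smith goes to washington"));
// "Mr. Smith goes to washington" — proper nouns stay lowercase
```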
Invert Case is a pure character-level operation — uppercase becomes lowercase and vice versa. No tokenization, no locale. Non-alphabetic characters pass through unchanged.
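Which makes it essentially a one-liner (sketch):

```typescript
// Pure character-level case inversion: no tokenization, no locale.
// Non-alphabetic characters are their own lowercase, so they pass through.
const invertCase = (s: string): string =>
  [...s]
    .map((ch) => (ch === ch.toLowerCase() ? ch.toUpperCase() : ch.toLowerCase()))
    .join("");

console.log(invertCase("Hello, World!")); // "hELLO, wORLD!"
```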
Alternating Case is also character-level, with three configurable reset scopes: global (single counter for the entire string), per-word (counter resets at whitespace), and per-line (counter resets at newlines). Non-alphabetic characters pass through unchanged and do not advance the counter. This means hElLo wOrLd with per-word scope starts fresh at w — each word begins with the same case.
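A sketch of the three scopes — the names here are illustrative, not the tool's actual API:

```typescript
type AltScope = "global" | "word" | "line";

// Alternating case with a configurable counter-reset scope.
function alternatingCase(input: string, scope: AltScope = "global"): string {
  let count = 0;
  let out = "";
  for (const ch of input) {
    if (scope === "word" && /\s/.test(ch)) count = 0; // reset at whitespace
    if (scope === "line" && ch === "\n") count = 0;   // reset at newlines
    if (/[a-zA-Z]/.test(ch)) {
      out += count % 2 === 0 ? ch.toLowerCase() : ch.toUpperCase();
      count++; // only letters advance the counter
    } else {
      out += ch; // non-alphabetic characters pass through unchanged
    }
  }
  return out;
}

console.log(alternatingCase("hello world", "word"));   // "hElLo wOrLd"
console.log(alternatingCase("ab ab", "global"));       // "aB aB"
```

With per-word scope, every word starts fresh at the lowercase phase, matching the behaviour described above.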
Identifier Format Modes
All seven identifier modes share the tokenizer. After splitting, they differ only in how tokens are cased and what separator is used:
| Mode | Token casing | Separator |
|---|---|---|
| camelCase | first: lower; rest: Capitalised | (none) |
| PascalCase | all: Capitalised | (none) |
| snake_case | all: lower | _ |
| CONSTANT_CASE | all: UPPER | _ |
| kebab-case | all: lower | - |
| dot.case | all: lower | . |
| Path/Slash Case | all: lower | / |
The Preserve Known Acronyms advanced option affects camelCase and PascalCase. When enabled, tokens matching the built-in acronym list (HTML, CSS, API, URL, JWT, UUID, and 20 others) are emitted in canonical form rather than title-cased. parseHTML stays parseHTML (not parseHtml); parseHTTP stays parseHTTP (not parseHttp).
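The table translates almost directly into code. A sketch of the reassembly step — ACRONYMS here is an illustrative subset of the built-in list, and the function names are assumptions:

```typescript
type IdentifierMode =
  | "camel" | "pascal" | "snake" | "constant" | "kebab" | "dot" | "path";

// Capitalise one token: first letter upper, rest lower.
const cap = (t: string): string =>
  t.charAt(0).toUpperCase() + t.slice(1).toLowerCase();

// Illustrative subset of the built-in acronym list.
const ACRONYMS = new Set(["HTML", "CSS", "API", "URL", "JWT", "UUID"]);

function joinTokens(
  tokens: string[],
  mode: IdentifierMode,
  preserveAcronyms = false,
): string {
  // Title-case a token, honouring the Preserve Known Acronyms option.
  const title = (t: string): string =>
    preserveAcronyms && ACRONYMS.has(t.toUpperCase()) ? t.toUpperCase() : cap(t);

  if (mode === "camel" || mode === "pascal") {
    return tokens
      .map((t, i) => (mode === "camel" && i === 0 ? t.toLowerCase() : title(t)))
      .join("");
  }
  const sep = { snake: "_", constant: "_", kebab: "-", dot: ".", path: "/" }[mode];
  const casing = mode === "constant"
    ? (t: string) => t.toUpperCase()
    : (t: string) => t.toLowerCase();
  return tokens.map(casing).join(sep);
}

console.log(joinTokens(["parse", "HTML"], "camel", true));       // "parseHTML"
console.log(joinTokens(["parse", "HTML"], "camel"));             // "parseHtml"
console.log(joinTokens(["api", "response", "code"], "snake"));   // "api_response_code"
```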
Batch Mode
Batch mode splits the input on newlines and applies the transformation independently to each line before rejoining with \n. Blank lines are preserved positionally — an empty string transformed is still an empty string. This allows converting a list of identifiers (one per line) without manual iteration.
if (batchMode) {
return input
.split("\n")
.map((line) => applyTransform(line, mode, cfg))
.join("\n");
}
The simplicity here is the point. Batch mode is a one-liner wrapper, not a separate code path.
Handling the Edge Cases
Consecutive delimiters (__value--name): the tokenizer emits a boundary at each delimiter but only pushes the current accumulator when it is non-empty. The result is ["value", "name"] — no empty tokens.
Emoji in identifier modes (🚀 launch): the emoji is emitted as a standalone opaque token. For snake_case, tokens are joined with _, so 🚀 launch becomes 🚀_launch. The emoji is preserved exactly.
Non-Latin scripts (Arabic, CJK, Devanagari): classified as non-Latin letters and emitted as standalone tokens. For casing modes (lowercase, uppercase), they pass through unchanged — there is no case concept in these scripts.
Whitespace-only input: for identifier modes, all characters are delimiters and the token array is empty, so the output is an empty string. For casing modes, the input is returned lowercased/uppercased/etc. as-is.
Privacy Architecture
The privacy guarantee is non-negotiable for a tool where developers might paste internal identifier names, database schema names, or API field names from proprietary systems. All transformations are synchronous, in-browser operations using only JavaScript string methods. There is no fetch, no axios, no WebSocket, and no analytics telemetry.
// Pure in-browser — no network calls of any kind
function applyTransform(
text: string,
mode: ModeId,
cfg: TransformConfig,
): string {
// ... string operations only
}
The useDebounceFn helper from @vueuse/core handles the 150ms debounce on keystrokes — it is a timer, not a network call. The Clipboard API writes to the system clipboard with no external transmission.
Key Takeaways
Building a correct case converter requires more care than stringing together .toLowerCase() calls:
- Tokenise first, format second — identifier conversion requires splitting into semantic tokens regardless of the input format. The tokenizer is the core of the tool.
- Acronym boundaries need look-ahead — the HTMLParser → ["HTML", "Parser"] split requires checking whether the next character is lowercase, not just whether the current one is uppercase.
- Spread for Unicode safety — [...input] respects Unicode code points; iterating by code unit would break emoji and supplementary characters.
- Letter Case modes skip the tokenizer — whitespace-only splitting (Title Case, Sentence Case) produces different results from identifier splitting, and that's intentional.
- Locale matters for casing — Turkish I/ı requires toLocaleLowerCase(locale) rather than toLowerCase(), and German ß uppercases to SS, changing the string length.
The full implementation is in spa/app/components/ChangeCaseConverter.vue, following the same client-side architecture pattern as the URL Encoder, HTML Encoder, and Base64 tools in this platform.