Tools, CSV, Developer Tools

Building a String to CSV Converter: Auto-Detection, RFC 4180, and Edge Cases

How we built a privacy-first, RFC 4180-compliant CSV converter with automatic delimiter detection, four row-handling modes, fixed-width parsing, and regex mode — all in the browser.

Why Another CSV Tool?

CSV converters are everywhere. Paste text, get CSV — how hard can it be? The answer depends entirely on what you mean by "CSV." Most online converters split on commas, wrap things in quotes if they feel like it, and call it done. That approach fails the moment your input uses tabs, or contains fields with embedded commas, or has rows with different column counts. RFC 4180 — the closest thing to a formal standard for CSV — specifies rules for quoting, line endings, and embedded delimiters that most tools simply ignore. We wanted to build one that doesn't.

The privacy argument matters too. CSV data is frequently sensitive: database exports, financial spreadsheets, user lists, internal identifiers. Every tool that processes that data server-side introduces a data transmission step that's easy to overlook. By keeping all parsing and conversion in the browser, we eliminate the category entirely. There is no server receiving your data. You can verify this in the Network tab of your browser's DevTools while the tool runs — the only request is the initial page load.

The Auto-Detection Algorithm

The most useful feature in the converter is auto-detection: paste any delimited text and it figures out the separator without you having to specify it. The implementation scores each candidate delimiter — comma, tab, pipe, semicolon — by measuring the consistency of its occurrence count across rows.

The intuition is simple. If a file is truly tab-delimited, every row should contain roughly the same number of tabs. A file with three columns has two tabs per row, every row. The score for a candidate is computed as one minus the coefficient of variation of its per-row counts: a low coefficient of variation (low standard deviation relative to the mean) yields a high score, indicating consistent occurrence. A candidate that appears zero times in most rows scores zero regardless of any sparse occurrences. Ties are broken in a fixed priority order — comma, then tab, then pipe, then semicolon — which reflects the real-world frequency of those formats.

Running the algorithm over only the first twenty rows keeps detection fast even for large files. The detected delimiter is shown in the selector label as "Auto (Tab)" or "Auto (Comma)" so you always know what was chosen, and you can override it manually at any time.
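The scoring heuristic can be sketched in a few lines. This is an illustrative reconstruction, not the tool's actual source; the function and variable names are invented here.

```javascript
// Candidate delimiters in tie-break priority order: comma, tab, pipe, semicolon.
const CANDIDATES = [",", "\t", "|", ";"];

function detectDelimiter(text, sampleSize = 20) {
  // Score only the first `sampleSize` non-empty rows to keep detection fast.
  const rows = text.split(/\r?\n/).filter((r) => r.length > 0).slice(0, sampleSize);
  let best = ",";
  let bestScore = -Infinity;
  for (const delim of CANDIDATES) {
    const counts = rows.map((r) => r.split(delim).length - 1);
    const mean = counts.reduce((a, b) => a + b, 0) / counts.length;
    if (mean === 0) continue; // never appears: leave this candidate unscored
    const variance = counts.reduce((a, c) => a + (c - mean) ** 2, 0) / counts.length;
    const cv = Math.sqrt(variance) / mean; // coefficient of variation
    const score = 1 - cv; // consistent counts => low cv => high score
    if (score > bestScore) { // strict '>' preserves the priority order on ties
      bestScore = score;
      best = delim;
    }
  }
  return best;
}
```

A candidate that appears only sparsely gets a high coefficient of variation and therefore a low score, which is exactly the behaviour described above.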

RFC 4180 Rules Most Tools Ignore

RFC 4180 is only a few pages long, but most implementations skip the hard parts. The rules that matter most in practice are quoting, embedded newlines, the BOM, and line ending consistency.

Quoting in RFC 4180 is triggered by four conditions: the field contains the output delimiter, the field contains a double-quote character, the field contains a newline, or the field contains a carriage return. When quoting is triggered, the entire field is wrapped in double-quotes, and any double-quote characters within the field are escaped by doubling them. The tool applies these rules to every field on output, which means a field containing a literal comma in a comma-separated output is correctly quoted regardless of whether the input was comma-delimited. On the parsing side, the RFC 4180 parser handles quoted fields that span multiple lines — a single quoted field can contain embedded newlines, and the parser accumulates characters across line boundaries until it finds the closing quote.
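The output-side quoting rule is compact enough to show in full. A minimal sketch of the four trigger conditions (the function name is illustrative):

```javascript
// Quote a field per RFC 4180: wrap in double-quotes if it contains the
// delimiter, a double-quote, a newline, or a carriage return, and escape
// embedded quotes by doubling them.
function escapeField(field, delimiter) {
  const needsQuoting =
    field.includes(delimiter) ||
    field.includes('"') ||
    field.includes("\n") ||
    field.includes("\r");
  if (!needsQuoting) return field;
  return '"' + field.replace(/"/g, '""') + '"';
}
```

Note that the check uses the output delimiter, which is why a literal comma is quoted in comma-separated output even when the input was tab-delimited.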

Line ending mirroring is a subtle but important detail. If the input uses CRLF line endings, the output should too. Changing line endings silently converts a Windows CSV to a Unix CSV, which can confuse downstream tools. The converter detects the line ending style from the first occurrence in the input and mirrors it in the output.
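Detecting the style from the first occurrence is a one-liner in spirit. A sketch (name illustrative):

```javascript
// Find the first newline; if it is preceded by a carriage return, the input
// uses CRLF and the output should mirror that. Single-line input falls back
// to LF.
function detectLineEnding(text) {
  const i = text.indexOf("\n");
  if (i <= 0) return "\n";
  return text[i - 1] === "\r" ? "\r\n" : "\n";
}
```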

The BOM (byte order mark, U+FEFF) is a special case. Excel on Windows expects a UTF-8 BOM at the start of a CSV file to correctly detect the encoding. Most Unix tools strip it or emit a spurious character in the first field. The tool strips the BOM from input transparently and offers an explicit toggle to add it to the output; because adding it by default would break Unix pipelines, the toggle defaults to off.
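Both sides of the BOM handling fit in a few lines. A sketch with illustrative names:

```javascript
const BOM = "\uFEFF";

// Strip an input BOM transparently so it never leaks into the first field.
function stripBom(text) {
  return text.startsWith(BOM) ? text.slice(1) : text;
}

// Add the BOM on output only when the user explicitly opts in.
function maybeAddBom(csv, addBom = false) {
  return addBom ? BOM + csv : csv;
}
```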

Row Length Handling: Four Modes and Their Trade-offs

Real-world delimited text is messy. Log files add extra fields when something goes wrong. Exports from certain systems omit trailing empty fields. The row length mode controls how the converter responds to rows that are shorter or longer than the first row (used as the reference width).

Strict mode is the default: rows are emitted as-is, with no padding or truncation. This is appropriate when you know your input is clean and you want to preserve it exactly as structured. Normalize mode pads short rows with empty fields to bring them up to the reference width. This is useful when downstream tools (spreadsheets, database importers) require a fixed schema. Truncate mode removes fields from the end of long rows to match the reference width — handy when a log format appends a variable-length extra column that you want to drop. Informational mode emits all rows unchanged but collects every length mismatch into a warning list you can expand in the warning panel below the output. It is the right choice when you are auditing data quality rather than transforming it.
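The four modes reduce to a small dispatch over each row against the reference width. A sketch, assuming rows arrive as arrays of fields (names illustrative):

```javascript
// Apply a row-length mode. The first row sets the reference width; every
// mismatched row is recorded as a warning regardless of mode.
function applyRowMode(rows, mode) {
  const width = rows.length > 0 ? rows[0].length : 0;
  const warnings = [];
  const out = rows.map((row, i) => {
    if (row.length !== width) {
      warnings.push({
        row: i + 1,
        direction: row.length < width ? "too few" : "too many",
        expected: width,
        actual: row.length,
      });
    }
    if (mode === "normalize" && row.length < width) {
      return row.concat(Array(width - row.length).fill("")); // pad short rows
    }
    if (mode === "truncate" && row.length > width) {
      return row.slice(0, width); // drop trailing extra fields
    }
    return row; // strict and informational emit rows unchanged
  });
  return { rows: out, warnings };
}
```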

The warning panel is collapsible and shows the row number, direction of mismatch (too few or too many fields), expected count, and actual count. Warnings are not limited to informational mode: short and long rows also generate them in strict and normalize modes, so you are aware of the mismatch even when the tool has handled it silently.

Fixed-Width Parsing

Some input formats are not delimiter-based at all. Command-line tools, legacy reports, and certain database dump formats use fixed-width columns aligned with spaces. The converter detects column boundaries by looking for transitions from content to runs of two or more consecutive spaces — those runs are the column separators in fixed-width text. Boundaries are inferred from the first two lines rather than just the first, because the header row and the first data row together provide a more reliable picture of where columns break.

Once boundaries are known, each row is sliced at the same character positions. Trimming whitespace (on by default) then cleans up the padding that fixed-width formats use to fill shorter values. The result is a cleanly structured CSV regardless of how the original report was formatted. Fixed-width mode disables the delimiter selector since there is no delimiter to detect or specify.
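The boundary inference and slicing steps can be sketched as follows. This is an illustrative reconstruction: it treats a position as a separator only when both sample lines have a space there, which is one reasonable way to combine the first two lines.

```javascript
// Infer column start positions from runs of two or more spaces shared by
// the first two lines.
function inferBoundaries(lines) {
  const sample = lines.slice(0, 2);
  const width = Math.max(...sample.map((l) => l.length));
  const isGap = Array.from({ length: width }, (_, i) =>
    sample.every((l) => i >= l.length || l[i] === " ")
  );
  const starts = [0];
  let run = 0;
  for (let i = 0; i < width; i++) {
    if (isGap[i]) {
      run++;
    } else {
      if (run >= 2 && i > 0) starts.push(i); // a column begins after a 2+ space run
      run = 0;
    }
  }
  return starts;
}

// Slice one row at the inferred positions, trimming the fixed-width padding.
function sliceRow(line, starts, trim = true) {
  return starts.map((s, idx) => {
    const end = idx + 1 < starts.length ? starts[idx + 1] : line.length;
    const cell = line.slice(s, end);
    return trim ? cell.trim() : cell;
  });
}
```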

Regex Mode

Some formats do not fit cleanly into delimiter-based or fixed-width models. Log lines with multiple possible separators, text that uses repeated characters as dividers, or formats with optional whitespace around delimiters — these all require something more expressive. Regex mode lets you specify a JavaScript regular expression as the split pattern for each line. The pattern is applied with the built-in string split, so any valid regex works: \s*,\s* to split on commas with optional surrounding whitespace, \t+ to split on one or more tabs, or \s{2,} to split on two or more spaces (which approximates fixed-width without needing boundary inference). An inline error message appears immediately if the pattern is not a valid regex, so you get feedback before processing the full file. Regex mode disables the delimiter selector since the pattern takes its place.
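Validation and splitting are both thin wrappers over built-ins. A sketch (the function name and return shape are illustrative):

```javascript
// Compile the user's pattern once; an invalid pattern surfaces as an error
// message instead of being applied. Valid patterns split each line via the
// built-in String.prototype.split.
function splitWithPattern(lines, pattern) {
  let re;
  try {
    re = new RegExp(pattern);
  } catch (e) {
    return { error: e.message, rows: null };
  }
  return { error: null, rows: lines.map((line) => line.split(re)) };
}
```

Compiling with `new RegExp` up front is what makes immediate inline feedback possible: the `SyntaxError` fires before any line is processed.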

Performance Without Web Workers

A natural concern with large file processing in the browser is blocking the main thread. The converter handles this with a 500ms debounce on the input textarea: parsing begins only after you stop typing for half a second. This prevents redundant work during rapid keystrokes and keeps typing responsive even when a single parse takes tens of milliseconds. For very large inputs (over one megabyte), a warning is shown to set expectations.

The 500ms value is a deliberate trade-off. A shorter delay — say 150ms, which we use in the URL encoder — would feel more responsive for small inputs, but at large sizes it would schedule many parse operations in quick succession. At 500ms, even fast typists typically pause between pastes. Config changes (delimiter selection, row mode, toggles) trigger processing immediately without debounce, since those changes come from deliberate user actions rather than keystroke streams. Web Workers would eliminate the blocking concern entirely, but they add significant complexity — inter-thread message serialisation, transferable buffers, worker lifecycle management — and for the dataset sizes this tool is designed for (hundreds of thousands of fields), the synchronous path with a 500ms guard is fast enough that users are unlikely to notice.
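The two trigger paths can be captured in a small debounce helper with an immediate bypass. A sketch; the `now` escape hatch for config changes is an illustrative name, not the tool's API:

```javascript
// Debounce keystroke-driven work; config changes bypass the delay entirely.
function debounce(fn, delay = 500) {
  let timer = null;
  const debounced = (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delay);
  };
  // Immediate path for deliberate actions (delimiter change, mode toggle):
  // cancel any pending keystroke-scheduled run and execute right away.
  debounced.now = (...args) => {
    clearTimeout(timer);
    timer = null;
    fn(...args);
  };
  return debounced;
}
```

Wiring the textarea's input event to `debounced` and every config control to `debounced.now` gives exactly the behaviour described above: keystrokes coalesce, deliberate actions do not wait.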

What v2 Might Include

The current tool handles the parsing and formatting problem well. A few directions for a future iteration stand out. An inline table preview — showing the parsed rows in a spreadsheet-like grid before producing the CSV — would make it much easier to verify that auto-detection found the right delimiter and that column boundaries look correct. Per-column transforms (trim only column 3, uppercase only the header row, deduplicate by column 1) would cover a class of data-cleaning tasks that currently require a full spreadsheet application. Batch mode for multiple files — drag and drop a folder of CSVs, apply a consistent transform to all of them, download a zip — would be valuable for data engineers who work with partitioned exports. All of those features would stay client-side, preserving the privacy guarantee that motivated the tool in the first place.

Code Cultivation • © 2026