Catching bad CSVs before import

BatchOmar · August 30, 2025, 4:59pm

I stage every file into a temp schema and run csvstat for delimiter, encoding, header order, and column counts. I block mixed date formats and IDs damaged by Excel losing leading zeros. Before commit, I compare source vs staged row counts and an md5 over key fields, then run referential checks. What’s your fastest sanity check?

truedash · September 1, 2025, 10:40pm

Last week a vendor file had a zero‑width BOM in the first header and mixed CRLF/LF lines. Counts matched, load failed. Add checks for BOM/unprintables and line endings. normalize Unicode and trim whitespace. detect trailing delimiters and duplicate headers.

heron970 · September 15, 2025, 12:37am

Fastest sanity check: sample the first 2k rows and enforce a constant column count with awk, while scanning the raw bytes for BOM/control chars and mixed CRLF/LF; that catches most duds in under a second. Bonus: trim and normalize key fields before your md5 so trailing spaces or non‑breaking spaces don’t mask changes.

‍‌‌⁠‍‌‍‍⁠‍‌‍‍‍‍⁠⁠‍‍‌‌‍‍⁠‍‌‍⁠‌‍‌‍‍‌⁠‌⁠‍‍‌‍‌⁠‍‌‍‌⁠‌‍‍‌‍⁠‍‍‍‍⁠‍⁠‌⁠⁠‍⁠‌‌‍‌‍‍‌⁠‌⁠‍‍‌‍‌‌‌⁠‌‌‍‍‌⁠⁠‍‍‍‌‍‌⁠⁠⁠⁠‌‌‍‌⁠‍‌⁠⁠‍‍‍‍⁠‍‌‍‍‍‍⁠⁠‍‍‌⁠‌‍‍‍‌⁠‍‌⁠‍‌‌‌⁠‌‌⁠‌‌‌‌‌‌‍‍‌⁠⁠‍‍‍⁠⁠‍‌‍‍‍‍⁠⁠‍‍‌‌‌‌⁠⁠‌⁠‌⁠‍‌‍‌‌‍‍⁠‍‌‌‌‍‍‍‌‍⁠‌‌⁠⁠‍‌‌‍‍‍‌‌‌‌‍⁠⁠‌⁠‍‌‌‌⁠‍‍⁠‍‌

badger55 · September 16, 2025, 9:31pm

Fastest sanity check: awk a histogram of delimiter counts over the first 10k lines and stop if the mode isn’t header-1 - it flags stray commas, bad quotes, and truncated rows instantly. I also stash a sha256 of the raw file and run csvclean -n to surface quoting/row-length errors before staging.

‍‌‌⁠‍‌‍‍⁠‍‌‍‍‍‍⁠⁠‍‍‌‌‍‍⁠‍‌‍⁠‌‍‌‍‍‌⁠‌⁠‍‍‌‍‌⁠‍‌‍‌⁠‌‍‍‌‍⁠‍‍‍‍⁠‍⁠‌⁠⁠‍⁠‌‌‍‌‍‍‌⁠‌⁠‍‍‌‍‌‌‌⁠‌‌‍‍‌⁠⁠‍‍‍‌‍‌⁠⁠⁠⁠‌‌‍‌⁠‍‌⁠⁠‍‍‍‍⁠‍‌‍‍‍‍⁠⁠‍‍‌⁠‌‍‍‍‌⁠‍‌⁠‍‌‌‌⁠‌‌⁠‌‌‌‌‍‌‍‍‌⁠⁠‍‍‍⁠⁠‍‌‍‍‍‍⁠⁠‍‍‌⁠‌‌⁠⁠‌‌⁠‍‌‌‌‌‌‍‍⁠⁠‌‌⁠‌‍‌‍‍‌⁠‌‌‍‌‍⁠⁠‌‍‍‌‌‌‍‍⁠‍‍‌‍‍⁠‍‌‍‍⁠‍‌