Ran 1,000 invoice pages through the latest Textract and Form Recognizer this morning and saw 98.4% vs 97.7% field-level accuracy. Tiny, but that 0.7-point gap works out to about 7 fewer corrections per 1,000 fields for us. Anyone have benchmarks on tables/line-items, especially mis-splits at 12+ columns?
On 14-column invoices I cut mis-splits by about 22% with post-processing: k-means on cell x-centers to define 12–16 column bands, then snap each cell to its nearest band before reconciliation. Textract Tables and FR Layout both gained about 1.5–2 F1 points. Caveat: it can over-merge skinny columns. On Form Recognizer, routing 12+ column tables to Layout plus this snapper beat the prebuilt model for me. Want a tiny script?
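For anyone who wants to try the k-means snapping idea above, here is a minimal sketch, not the poster's actual script. It assumes you have already pulled cell x-centers out of the Textract or FR response and that you pick `k` yourself (say, from the header count). The function names `kmeans_1d` and `snap_to_band` are made up for illustration, and it uses plain Lloyd's iterations in pure Python rather than any particular library.

```python
from typing import List, Sequence


def kmeans_1d(xs: Sequence[float], k: int, iters: int = 50) -> List[float]:
    """Plain Lloyd's k-means in one dimension; returns sorted band centers."""
    if not 1 <= k <= len(xs):
        raise ValueError("k must be between 1 and len(xs)")
    pts = sorted(xs)
    # Deterministic init: spread the starting centers across the sorted values.
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups: List[List[float]] = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Keep the previous center if a cluster ends up empty.
        updated = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def snap_to_band(x_center: float, centers: Sequence[float]) -> int:
    """Index of the column band whose center is closest to x_center."""
    return min(range(len(centers)), key=lambda j: abs(x_center - centers[j]))


# Toy example: three columns with a little horizontal jitter.
x_centers = [50, 52, 49, 150, 148, 151, 250, 252, 249]
centers = kmeans_1d(x_centers, k=3)
bands = [snap_to_band(x, centers) for x in x_centers]
```

Nearest-center snapping is also where the over-merge caveat comes from: two skinny columns that sit closer together than the jitter will land in one band unless `k` is raised.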
Building on @OpsTheo, try header-anchored banding: detect header tokens (“Qty”, “Unit Price”, etc.), RANSAC-fit 12–16 vertical bands to them, then snap cells to those bands like bowling bumpers. This cut our mis-splits by about 28% on 13–15 column POs. Caveat: it needs decent header recall, so we fall back to x-projection peaks when headers are missing. Do you see many headerless line-item pages?
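A rough sketch of the header-anchored idea with the projection-peak fallback, under some loud assumptions: it skips the RANSAC fit entirely and just uses matched header x-centers as band centers, the `HEADER_TOKENS` vocabulary is hypothetical, and `bin_width`, `min_count`, and `min_headers` are illustrative defaults rather than tuned values.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Hypothetical header vocabulary; extend it for your own documents.
HEADER_TOKENS = {"qty", "unit price", "description", "amount"}


def header_band_centers(words: Sequence[Tuple[str, float]]) -> List[float]:
    """x-centers of (text, x_center) words whose text matches a header token."""
    return sorted(x for text, x in words
                  if text.strip().lower() in HEADER_TOKENS)


def projection_peak_centers(x_centers: Sequence[float],
                            bin_width: float = 10.0,
                            min_count: int = 2) -> List[float]:
    """Fallback: histogram the cell x-centers and keep local-maximum bins."""
    counts = Counter(int(x // bin_width) for x in x_centers)
    peaks = []
    for b, c in sorted(counts.items()):
        # ">=" on the left and ">" on the right yields one peak per plateau.
        if (c >= min_count and c >= counts.get(b - 1, 0)
                and c > counts.get(b + 1, 0)):
            peaks.append((b + 0.5) * bin_width)
    return peaks


def column_bands(words: Sequence[Tuple[str, float]],
                 cell_x_centers: Sequence[float],
                 min_headers: int = 3) -> List[float]:
    """Prefer header anchors; fall back to projection peaks on weak recall."""
    centers = header_band_centers(words)
    if len(centers) < min_headers:
        centers = projection_peak_centers(cell_x_centers)
    return centers


def snap_to_band(x_center: float, centers: Sequence[float]) -> int:
    """Index of the band whose center is closest to x_center."""
    return min(range(len(centers)), key=lambda j: abs(x_center - centers[j]))
```

The fallback returns bin midpoints, so its band centers are only accurate to about half a `bin_width`; a smaller bin is worth trying on narrow columns.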
That 98.4% vs 97.7% bump is real. The drift on wide tables drives me nuts, and we clawed back a bit by strip-wise deskew/shear before column mapping on 12–16-column layouts. Slice the page into 8–12 vertical stripes, fit local baselines from character boxes (OpenCV LSD/Hough), then shear each stripe so x stays stable across the row. That dropped cross-column bleed by about 19% on 14-column sets. @OpsTheo, are you seeing lateral drift from curl/warp? If so, this pass usually saves a couple more fixes per 1k fields. Ref: https://docs.opencv.org/4.x/d5/db5/tutorial_lsd.html
And on ‘12+ columns’, a small win for us was stripping vertical ruling lines before OCR: detect thin verticals with a Hough pass, erase/inpaint them, then feed the cleaned image to Textract/Form Recognizer. It cut mis-splits on wide invoices noticeably and shaved about 4–6 corrections per 1k fields, but it only pays off when the source has heavy gridlines; on lightly ruled PDFs the gain is negligible.
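To make the rule-stripping step concrete, here is a toy sketch on a binarized image stored as a list of rows. It is a deliberate simplification, not the pipeline described above: a vertical run-length check stands in for the Hough pass, and columns are blanked rather than inpainted. `find_vertical_rules`, `erase_columns`, and the `min_run_frac` threshold are all hypothetical names and values.

```python
from typing import List

Image = List[List[int]]  # 1 = ink, 0 = background


def find_vertical_rules(img: Image, min_run_frac: float = 0.8) -> List[int]:
    """Columns whose longest vertical ink run spans >= min_run_frac of the height."""
    height, width = len(img), len(img[0])
    rule_cols = []
    for x in range(width):
        longest = run = 0
        for y in range(height):
            run = run + 1 if img[y][x] else 0
            longest = max(longest, run)
        if longest >= min_run_frac * height:
            rule_cols.append(x)
    return rule_cols


def erase_columns(img: Image, cols: List[int]) -> Image:
    """Return a copy of img with the given columns set to background."""
    drop = set(cols)
    return [[0 if x in drop else v for x, v in enumerate(row)] for row in img]


# Toy page: a full-height rule at x=2 and one "text" pixel at (row 1, x=4).
page = [[0, 0, 1, 0, 0, 0] for _ in range(5)]
page[1][4] = 1
cleaned = erase_columns(page, find_vertical_rules(page))
```

Blanking whole columns also cuts through any horizontal rule or glyph that crosses the line, which is why a proper inpaint is the better choice on real scans.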