MessyDocs

Why OCR scrambles your invoice columns: linearization and reading order, explained

You photograph a clean bill. Four tidy columns: description, quantity, rate, amount. The OCR runs. What comes back is one long line where the item name, a stray number, the rate and the amount are all jumbled together, and you spend the next ten minutes pulling them apart by hand.

The recognition wasn't wrong. The model probably read every character correctly. What broke is a separate, less-discussed step called linearization, and once you understand it you'll know exactly what to look for in a tool.

What linearization actually is

OCR has two distinct jobs, and people only think about the first one.

The first job is recognition: looking at a blob of pixels and deciding it's the character "7" or the letter "क". That's the part everyone means when they say "OCR."

The second job is ordering: taking all those recognized characters and words, scattered across a two-dimensional page, and deciding what sequence to put them in to produce a single string of text. A page is 2D. Text output is 1D, a line. Flattening the 2D layout into that 1D sequence is linearization, and the rule a tool uses to do it is its reading order.

The naive rule, the one most general-purpose OCR tools ship with, is: read everything left to right, top to bottom, like a page of prose. For a paragraph that's correct. For a table it's a disaster. The naive rule sweeps straight across all four columns on the first physical row of the page, so "Sugar 5kg", the quantity from the next column, the rate from the column after that and the amount all land in the same output line, in the order they happen to sit on the page, with nothing marking where one cell ends and the next begins. Your columns are gone. The characters are perfect; the structure is destroyed.
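To make the naive rule concrete, here is a minimal Python sketch. It assumes the recognizer hands back each word with the x and y position of its top-left corner; the `Word` type, the coordinates and the `row_tolerance` value are all made up for illustration and don't come from any particular OCR library.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word box (hypothetical units)
    y: float  # top edge of the word box


def naive_linearize(words, row_tolerance=5.0):
    """Left-to-right, top-to-bottom: bucket words by similar y, then sort by x."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # A word joins the current line if it sits at roughly the same height.
        if lines and abs(word.y - lines[-1][0].y) <= row_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Two line items from an imaginary bill, with a little vertical jitter.
WORDS = [
    Word("Sugar", 10, 100), Word("5kg", 60, 101), Word("2", 200, 100),
    Word("45.00", 300, 99), Word("90.00", 400, 100),
    Word("Rice", 10, 124), Word("10kg", 55, 124), Word("1", 200, 125),
    Word("60.00", 300, 124), Word("60.00", 400, 124),
]

for line in naive_linearize(WORDS):
    print(line)
# Sugar 5kg 2 45.00 90.00
# Rice 10kg 1 60.00 60.00
```

Every word is right, but the output can't tell you whether "5kg" belongs to the description or the quantity. The cell boundaries never made it into the string.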

Why a bill is the worst case for the naive rule

A tax invoice is built almost entirely out of the layouts that break left-to-right reading:

  • The line-items grid, where each value only makes sense inside its own cell, and a flat sweep across the row throws the cell boundaries away.
  • The totals block, usually pushed to the right, where "Taxable Value", "CGST", "SGST" and their amounts are vertically stacked and the naive sweep interleaves them with whatever sits to their left.
  • The header, where the vendor's name, address and GSTIN are arranged as a visual block, not a sentence.

A tool that gets reading order right does something different. It detects the table structure first, finds the column and row boundaries, and reads within each cell, so the output keeps "description, quantity, rate, amount" together as a row. The recognition is the same. The ordering is what separates a clean Excel row from a paragraph you have to dismantle.
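A rough Python sketch of the "read within each cell" idea follows. It cheats on the hard part: the column boundaries in `COLUMN_STARTS` are supplied by hand, whereas a real tool has to detect them from the image. The `Word` type, the coordinates and the function name are hypothetical.

```python
from bisect import bisect_right
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word box (hypothetical units)
    y: float  # top edge of the word box


def read_table(words, column_starts, row_tolerance=5.0):
    """Group words into rows by y, then into cells by which column their x falls in."""
    rows = []  # each entry: (anchor_y, one list of words per column)
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if not rows or abs(word.y - rows[-1][0]) > row_tolerance:
            rows.append((word.y, [[] for _ in column_starts]))
        # Index of the last column whose left boundary is at or before word.x.
        col = max(bisect_right(column_starts, word.x) - 1, 0)
        rows[-1][1][col].append(word)
    return [
        [" ".join(w.text for w in sorted(cell, key=lambda w: w.x)) for cell in cells]
        for _, cells in rows
    ]


# Assumed left edges of: description, quantity, rate, amount.
COLUMN_STARTS = [0, 180, 280, 380]

WORDS = [
    Word("Sugar", 10, 100), Word("5kg", 60, 101), Word("2", 200, 100),
    Word("45.00", 300, 99), Word("90.00", 400, 100),
    Word("Rice", 10, 124), Word("10kg", 55, 124), Word("1", 200, 125),
    Word("60.00", 300, 124), Word("60.00", 400, 124),
]

for row in read_table(WORDS, COLUMN_STARTS):
    print(row)
# ['Sugar 5kg', '2', '45.00', '90.00']
# ['Rice 10kg', '1', '60.00', '60.00']
```

The recognized words are the same ones a flat sweep would see. The only thing that changed is that each word was assigned to a cell before being written out.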

What makes the Indian-language case harder

Reading-order detection is hard enough in English. Indian-language bills pile on three extra difficulties.

Mixed scripts in one line. The description is in Devanagari or Tamil, the GSTIN is in Latin characters, the amount is in digits. A reading-order step has to hold the line together across script changes, and a naive tokenizer can fragment at every switch.
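Here is a small Python sketch of where those script switches fall in a single line. It labels characters by the prefix of their Unicode name, which is a coarse heuristic and not a full script-itemization algorithm, and the GSTIN-shaped string in the sample is invented. Digits, spaces and punctuation carry no script of their own, so this sketch lets them join whatever run came before.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Coarse script label taken from the Unicode character name (heuristic only)."""
    name = unicodedata.name(ch, "")
    for script in ("DEVANAGARI", "TAMIL", "LATIN"):
        if name.startswith(script):
            return script.title()
    return "Common"  # digits, spaces, punctuation, currency signs


def script_runs(line):
    """Split a line into (script, text) runs; Common characters join the preceding run."""
    labels = []
    current = "Common"
    for ch in line:
        label = script_of(ch)
        if label != "Common":
            current = label
        labels.append(current)
    return [
        (script, "".join(ch for ch, _ in group))
        for script, group in groupby(zip(line, labels), key=lambda pair: pair[1])
    ]


LINE = "चीनी 5kg 27ABCDE1234F1Z5"  # description, quantity, made-up GSTIN-like string
RUNS = script_runs(LINE)
for script, text in RUNS:
    print(script, repr(text))

# Whatever the runs look like, joining them must give back the original line.
assert "".join(text for _, text in RUNS) == LINE
```

On this sample the "5" ends up in the Devanagari run and "kg" in the Latin one. That is exactly the kind of split that turns into two separate fragments if a tool treats every script boundary as a line break.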

Right-leaning numerals and currency. "₹7,000/-" written in a hurry, Indian-grouping commas (the 1,00,000 style, not 100,000), and sometimes Devanagari digits (०-९) mixed with Western ones (0-9). The recognizer can get every glyph right and the ordering step can still place the number in the wrong cell.
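Once the glyphs are recognized, it helps to normalize every amount to a plain number before judging whether it landed in the right cell. The Python sketch below does that under simple assumptions: the only decorations are a leading "₹", a trailing "/-" and grouping commas. The function name is hypothetical, and it makes no attempt to validate that the commas are in legal positions.

```python
import re
from decimal import Decimal

# Map Devanagari digits to their Western equivalents.
DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")


def parse_amount(raw):
    """Turn a recognized amount such as '₹7,000/-' into a Decimal."""
    text = raw.translate(DEVANAGARI_DIGITS)
    text = text.replace("₹", "").strip()
    text = re.sub(r"/-$", "", text)  # drop the trailing "/-" flourish
    text = text.replace(",", "")     # grouping style (1,00,000 or 100,000) no longer matters
    return Decimal(text)


print(parse_amount("₹7,000/-"))      # 7000
print(parse_amount("1,00,000"))      # 100000
print(parse_amount("₹१,२५,०००.५०"))  # 125000.50
```

`Decimal` is used instead of `float` so that amounts like 125000.50 compare exactly.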

Vertical and below-the-line marks. Devanagari attaches matras (vowel signs) above and below the base letters. A reading-order model tuned on flat Latin baselines can misjudge which row a mark belongs to, especially on a skewed photo.

None of these are recognition failures. They're ordering failures, and they're why a tool that reads English tables fine can still hand you scrambled output on a Hindi or Tamil bill.

How to tell if a tool gets it right

You don't need to read a research paper. Run one test. Take a real bill with a clear multi-column line-items table and run it through the tool. Then look at the output, not the accuracy claim:

  • Do the line items come back as rows, with the right quantity next to the right item, or as one run-on line you have to re-sort?
  • In the totals block, did "CGST" stay attached to its own amount, or did it get cross-wired with the value beside it?
  • If the bill mixes a regional script with a Latin GSTIN, did the line survive the switch or fragment?

If the rows hold together, the tool is doing reading-order detection. If you get a wall of correct-but-scrambled text, it's running the naive left-to-right rule and you'll be re-sorting by hand no matter how good the character recognition is.
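If the tool gives you rows as cells, part of this check can be automated. The Python sketch below leans on an assumption that won't hold for every bill: each row has exactly four cells (description, quantity, rate, amount), the numbers are already plain digits, and amount equals quantity times rate with no per-line discount or tax. The function name and sample rows are invented for illustration.

```python
from decimal import Decimal, InvalidOperation


def suspicious_rows(rows):
    """Return the indexes of rows where quantity x rate does not equal amount."""
    flagged = []
    for index, (description, quantity, rate, amount) in enumerate(rows):
        try:
            consistent = Decimal(quantity) * Decimal(rate) == Decimal(amount)
        except InvalidOperation:
            # A non-numeric cell where a number should be is itself a red flag.
            consistent = False
        if not consistent:
            flagged.append(index)
    return flagged


GOOD = [
    ["Sugar 5kg", "2", "45.00", "90.00"],
    ["Rice 10kg", "1", "60.00", "60.00"],
]
# Same words, but the second row's amount was cross-wired with the first row's.
SCRAMBLED = [
    ["Sugar 5kg", "2", "45.00", "90.00"],
    ["Rice 10kg", "1", "60.00", "90.00"],
]

print(suspicious_rows(GOOD))       # []
print(suspicious_rows(SCRAMBLED))  # [1]
```

A clean pass here doesn't prove the ordering is right, but a flagged row tells you exactly where to look first.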

We test exactly this across tools, with real Indian GST bills, in the accuracy benchmark. And the practical photo-and-check workflow built on top of all this is in the Indian-language extraction guide.

The short version: when OCR "scrambles your columns," it usually read every character right and ordered them wrong. Knowing that turns a vague "the tool is bad" into a specific test you can run in two minutes.