MessyDocs

How to get data out of a Hindi invoice without typing it all by hand

Every March, a friend of mine doing his articleship spends his weekends the same way. A plastic bag of crumpled thermal receipts and handwritten chits on one side, a laptop on the other, and Excel open. He types in the vendor name, the GSTIN, the taxable value, the SGST, the CGST. One bill, then the next. By Sunday night his eyes are shot and the bag is still half full.

He's not bad at his job. He just got handed the part of accounting that nobody has bothered to fix.

So he tried the obvious things. Google's photo tools. The scanner app on his phone. Pasting bill photos straight into ChatGPT. On a clean, printed, English tax invoice, all of them do fine. You'll get usable text back. Hand any of them a faded Hindi kirana bill with three columns and a tea stain across the middle, and they break in two specific ways.

Where they break

The first is column merging. Most general OCR reads the page the way you'd read a paragraph: left to right, all the way across, then down. A bill isn't a paragraph. It's a table. So "Item, Qty, Rate, Amount" laid out in four columns comes back as one long line of scrambled words and numbers, and now you're re-sorting it by hand anyway. You saved nothing.

The second is the script itself. Devanagari numerals, Hindi and English mixed in the same line, a shopkeeper's handwriting that even a human reads twice. Tools trained mostly on printed English have a thin idea of what a handwritten ७ looks like, or how "₹7,000/-" gets written in a hurry.
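If your OCR does hand back Devanagari digits, the clean-up step is small. Here is a minimal Python sketch that maps them to ASCII digits and pulls the first number out of a hurried amount string. The function name `parse_rupee_amount` is mine, not from any tool, and it assumes the first number in the string is the amount.

```python
import re
from decimal import Decimal

# Map Devanagari digits (U+0966..U+096F) to ASCII 0-9.
DEVANAGARI_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")

# A number with optional comma grouping (Western or Indian) and optional decimals.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_rupee_amount(raw: str) -> Decimal:
    """Return the first number in an OCR'd amount such as '₹७,०००/-'.

    Hypothetical helper: it ignores currency marks and the trailing '/-'
    simply by picking out the first run of digits it finds.
    """
    text = raw.translate(DEVANAGARI_DIGITS)
    match = NUMBER.search(text)
    if match is None:
        raise ValueError(f"no number found in {raw!r}")
    return Decimal(match.group().replace(",", ""))


print(parse_rupee_amount("₹७,०००/-"))         # 7000
print(parse_rupee_amount("Rs. 1,25,000.50"))  # 125000.50
```

This does nothing for a digit the OCR misread in the first place. It only makes sure a correctly read ७ ends up as a 7 in your sheet.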

There's a quieter problem too, if you go the ChatGPT route: cost. Hindi and other Indian-language text eats far more tokens than the same words in English, sometimes around four times as many, depending on the model's tokenizer. Run a few hundred long bills through a general model and the bill for reading the bills starts to sting.

And it isn't only the cost. By default, the consumer versions of ChatGPT and Gemini can use what you paste to improve their models unless you switch that off in settings, so a client's Hindi bill you drop in may not stay private. For someone holding a client's books, that's a real consideration, not a footnote.

What actually helps

You can't fix the shopkeeper's handwriting. You can fix almost everything else.

Shoot the photo properly. Flat, not at an angle. Fill the frame with the bill, not the table it's lying on. Kill the shadow your own hand throws across the page. Half the "OCR is useless" complaints I've seen are really "the photo was useless": a creased bill shot in the dark.

Use something that reads layout before it reads text. This is the column problem. A tool that figures out where the column boundaries are first, then reads inside each column, keeps your item, quantity and amount lined up. That single thing is the difference between a clean row in Excel and a paragraph you have to dismantle.
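To make "layout before text" concrete, here is a rough Python sketch. It assumes your OCR step already gives you each word with the x and y position of its centre (most engines expose something like this, but the `Word` class, the function names and the pixel thresholds below are all made up for illustration). It finds column starts from the wide horizontal gaps first, and only then assembles rows.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # horizontal centre of the word box, in pixels
    y: float  # vertical centre of the word box, in pixels


def find_column_starts(words, min_gap):
    """A gap wider than min_gap between neighbouring x positions starts a new column."""
    xs = sorted(w.x for w in words)
    starts = [xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        if cur - prev > min_gap:
            starts.append(cur)
    return starts


def words_to_table(words, min_gap=50.0, row_tol=8.0):
    """Place each word in a (row, column) cell instead of reading straight across."""
    if not words:
        return []
    starts = find_column_starts(words, min_gap)
    rows, last_y = [], None
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        # A vertical jump bigger than row_tol means a new printed line.
        if last_y is None or w.y - last_y > row_tol:
            rows.append([[] for _ in starts])
        rows[-1][bisect_right(starts, w.x) - 1].append(w)
        last_y = w.y
    return [
        [" ".join(w.text for w in sorted(cell, key=lambda w: w.x)) for cell in row]
        for row in rows
    ]


sample = [
    Word("Item", 20, 10), Word("Qty", 200, 10), Word("Rate", 300, 10), Word("Amount", 400, 10),
    Word("Basmati", 15, 30), Word("chawal", 60, 30), Word("2", 200, 30),
    Word("350", 300, 30), Word("700", 402, 30),
]
for row in words_to_table(sample):
    print(row)
```

On the sample, "Basmati chawal" stays together in the first cell and 2, 350 and 700 each land under their own header. A skewed photo or a column with very ragged text will defeat fixed thresholds like these, which is one more reason to shoot the bill flat.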

Check the three numbers that bite. Here's where I'd argue with the "fully automated" crowd. OCR will nail the vendor name and the date, then quietly turn a 3 into an 8 on a taxable value, and you won't notice until the GST doesn't tie out. So whatever tool you use, look at the GSTIN (it's 15 characters, and a wrong one is a rejected return), the taxable value, and the tax split. Every time. The machine reads the 90% you'd hate to type. You check the 10% that costs money if it's wrong.
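Some of that checking can be nudged along by a few lines of code, as a first pass before your own eyes. The Python sketch below is built on assumptions that are mine, not the post's: the commonly documented GSTIN layout with a base-36 check character at the end, an intra-state bill where CGST and SGST are each half of the tax, and a one-rupee rounding tolerance. The function names are hypothetical, and the GSTIN used is a widely circulated sample value.

```python
import re
from decimal import Decimal

BASE36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Assumed layout: state code, 10-character PAN, entity code, 'Z', check character.
GSTIN_SHAPE = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]")


def gstin_check_char(first14: str) -> str:
    """Check character from the first 14 characters (weights 1, 2, 1, 2, ... in base 36)."""
    total = 0
    for i, ch in enumerate(first14):
        product = BASE36.index(ch) * (2 if i % 2 else 1)
        total += product // 36 + product % 36
    return BASE36[(36 - total % 36) % 36]


def gstin_looks_valid(gstin: str) -> bool:
    """True if the string has the expected shape and its last character checks out."""
    gstin = gstin.strip().upper()
    if not GSTIN_SHAPE.fullmatch(gstin):
        return False
    return gstin_check_char(gstin[:14]) == gstin[14]


def tax_ties_out(taxable, cgst, sgst, rate_percent, tolerance=Decimal("1")):
    """True if CGST and SGST are each half of taxable * rate, within the tolerance."""
    half = taxable * rate_percent / Decimal(200)
    return abs(cgst - half) <= tolerance and abs(sgst - half) <= tolerance


print(gstin_looks_valid("27AAPFU0939F1ZV"))  # True
print(gstin_looks_valid("27AAPFU0989F1ZV"))  # False: a 3 read as an 8
print(tax_ties_out(Decimal("7000"), Decimal("630"), Decimal("630"), Decimal("18")))  # True
print(tax_ties_out(Decimal("7800"), Decimal("630"), Decimal("630"), Decimal("18")))  # False
```

A pass here means the numbers are consistent with each other, not that they match the paper. A wrong rate or a misread that happens to tie out still gets through, so the look-at-it-yourself step stays.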

Get it out in the shape you need. Text in a box is not the goal. A row per bill in Excel, or a file you can import into Tally, is. If a tool gives you back a wall of text and calls it done, it's handed you a different chore, not finished the one you had.
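For the Excel half of that, "a row per bill" can be as plain as a CSV file. A minimal Python sketch, assuming each bill has already been reduced to a dictionary of checked fields; the column names and the helper names are placeholders of mine, and a Tally import needs its own format that this does not attempt.

```python
import csv
import io

# Hypothetical column set: one row per bill, in a fixed order.
COLUMNS = ["vendor", "gstin", "invoice_date", "taxable_value", "cgst", "sgst", "total"]


def bills_to_csv(bills):
    """Return CSV text with a header line and one line per bill dictionary."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for bill in bills:
        writer.writerow(bill)  # a missing field is left blank; an unknown one raises ValueError
    return buffer.getvalue()


def save_for_excel(bills, path):
    """Write the CSV with a UTF-8 byte-order mark, which helps Excel show Devanagari correctly."""
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        handle.write(bills_to_csv(bills))


bills = [
    {
        "vendor": "शर्मा किराना स्टोर",
        "gstin": "27AAPFU0939F1ZV",  # sample value, not a real vendor
        "invoice_date": "2024-03-15",
        "taxable_value": "7000.00",
        "cgst": "630.00",
        "sgst": "630.00",
        "total": "8260.00",
    }
]
print(bills_to_csv(bills))
```

Keeping the column order fixed is the design choice that matters: every batch you export lines up with the last one, so the sheet can grow through March without anyone re-sorting it.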

The honest part about handwriting

Handwriting recognition has gotten genuinely good. Good enough that I've watched it pull a name and a rupee figure off a creased, blue-ink donation receipt that I'd have squinted at. It is still not perfect, and on a bad scrawl it won't be. Treat the total as something to confirm, not something to trust. A number you didn't see with your own eyes is a number waiting to embarrass you in front of a client.

The point of all this isn't zero typing. That's a sales line, not a real workflow. The point is to stop retyping the part a machine already reads cleanly, so your Sunday evening goes to the ten numbers that actually matter, not to the plastic bag.