Pulling invoice data out of Indian-language bills, language by language

By MessyDocs team · Updated 23 May 2026

If you do books for clients across two or three states, you already know the problem isn't "OCR." It's that a bill from Coimbatore, a bill from Surat and a bill from Nagpur fail in three different ways, and a tool that handles one cleanly will trip on the next.

This page is the map: what's the same across every Indian-language bill, what's specific to the script in front of you, and where the time actually goes. I've kept it honest about where the machine still loses.

The two problems that show up in every language

Before the script even matters, two things break almost every general OCR tool on an Indian bill.

One is layout. A tax invoice is a grid: description, HSN, quantity, rate, taxable value, then the CGST and SGST split at the bottom. Most OCR reads it like prose, left to right across the whole page, so your four neat columns come back as one scrambled line. We wrote up why this happens, and how reading-order detection fixes it, in the linearization explainer. If you only fix one thing, fix this. It's the difference between a row you paste into Excel and a paragraph you take apart by hand.
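
To make the row idea concrete, here is a minimal sketch of one ingredient of reading-order detection: cluster word boxes by vertical position first, and only then read left to right. It is not a description of how any particular tool does it. The Word fields, the group_into_rows name and the 0.6 tolerance are assumptions for illustration, and the sample values are made up.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # left edge of the word box
    y: float  # vertical centre of the word box
    h: float  # height of the word box

def group_into_rows(words, tolerance=0.6):
    """Cluster word boxes into rows by vertical centre, then sort each row by x."""
    rows = []
    for w in sorted(words, key=lambda w: w.y):
        if rows and abs(w.y - rows[-1]["y"]) <= tolerance * w.h:
            row = rows[-1]
            row["words"].append(w)
            # A running mean keeps the row centre stable on slightly tilted photos.
            row["y"] = sum(m.y for m in row["words"]) / len(row["words"])
        else:
            rows.append({"y": w.y, "words": [w]})
    return [[m.text for m in sorted(r["words"], key=lambda m: m.x)] for r in rows]

# Hypothetical word boxes in the scrambled order a page-wide reader might emit.
sample = [
    Word("450.00", 260, 101, 12), Word("Dal", 10, 130, 12),
    Word("Rice", 10, 100, 12), Word("0713", 120, 131, 12),
    Word("2", 200, 99, 12), Word("120.00", 260, 130, 12),
    Word("1006", 120, 102, 12), Word("1", 200, 129, 12),
]
table = group_into_rows(sample)
```

Each inner list in table is one line item with its cells in column order, which is the shape you can paste into a spreadsheet. A real page also needs column boundaries and skew correction, which this sketch leaves out.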

Two is the mixed line. Indian bills are rarely one language. The vendor name is in Devanagari, the GSTIN is in Latin characters, the amount is in native-script digits (१२३) or Western ones (123) depending on who printed it, and "Rs." sits next to "₹". A tool has to switch scripts inside a single line without losing the thread. Plenty don't.
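
Once the text is read, the digit and currency part of a mixed line is easy to normalise. A small sketch, assuming you want ASCII digits and a single currency marker before parsing amounts; the function name and the choice of "INR" as the marker are ours, not something a given tool requires.

```python
import re
import unicodedata

def normalise_amount_text(line):
    """Map native-script decimal digits to ASCII and unify rupee markers."""
    out = []
    for ch in line:
        # unicodedata.decimal returns the default (None) for non-digit characters.
        d = unicodedata.decimal(ch, None)
        out.append(str(d) if d is not None else ch)
    text = "".join(out)
    # Treat "Rs", "Rs." and the rupee sign as the same marker.
    return re.sub(r"(?:\bRs\.?|₹)\s*", "INR ", text)

total_line = normalise_amount_text("कुल राशि Rs. १,२५०.५०")
short_line = normalise_amount_text("₹४५०")
```

The Devanagari words pass through untouched; only the digits and the currency marker change. The same lookup covers Gujarati, Kannada and Telugu digits, because it relies on Unicode's decimal property and not on a per-script table.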

Those two are universal. Now the language-specific part.

Hindi and Marathi (Devanagari)

Same script, so they share most quirks. The hard parts are the conjunct characters (two consonants stacked into one glyph) and the matras that hang above and below the main line. When a thermal print fades, the matra is the first thing to go, and "की" can collapse toward "क". On printed GST invoices this is fine in our testing. On handwritten kirana chits it's where you slow down and check.

Marathi adds its own wrinkle: the same Devanagari script, but vocabulary and abbreviations a Hindi-tuned model may not expect on the description line. The numbers and GSTIN read the same, so for accounting it rarely bites. We cover the full Hindi workflow, photo to Excel, in the Hindi invoice guide.

Gujarati, Kannada, Telugu

We group these not because they're alike (they're three different scripts) but because the accounting reality is the same: the vendor header and any descriptive text are in the regional script, while the numbers you actually book are in digits and the GSTIN is in Latin characters. So your accuracy on the fields that matter for a return tends to be high even when the header reading is imperfect. The header is nice to have. The taxable value and tax split are the job.

The fields you check, regardless of language

Here's the part I'd argue with anyone who sells "fully automated." The script changes; the verification doesn't. On every bill, in every language, look at three things before you trust the row:

The GSTIN. Fifteen characters, structured (two-digit state code, ten-character PAN, then three more: an entity number, a "Z" by default, and a check character). A single wrong character is a rejected entry against GSTR-2B. This is the field most worth a two-second glance.
The taxable value. OCR is confident and wrong in exactly the same way a tired human is: it turns a 3 into an 8, a 1 into a 7. You catch it because the GST won't tie out, but better to catch it here.
The CGST/SGST split (or IGST). If the split looks off against the rate, the rate or the value was misread upstream.
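
The GSTIN glance can be backed by a quick structural test. This is a sketch under assumptions: the shape check follows the state code plus PAN layout described above, and the check character uses the commonly described mod-36 scheme over the first fourteen characters, which we have not taken from an official specification. The function names are ours, and 27AAPFU0939F1ZV is a widely circulated sample number, not a client's.

```python
import re

# Two-digit state code, PAN (5 letters, 4 digits, 1 letter), then three more characters.
GSTIN_SHAPE = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][0-9A-Z]{3}")
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def gstin_check_char(first14):
    """Check character under the assumed mod-36 scheme (weights 2,1,2,... from the right)."""
    total = 0
    for i, ch in enumerate(reversed(first14)):
        product = ALPHABET.index(ch) * (2 if i % 2 == 0 else 1)
        total += product // 36 + product % 36
    return ALPHABET[(36 - total % 36) % 36]

def gstin_looks_valid(gstin):
    """True if the string has the GSTIN shape and its last character matches the check."""
    gstin = gstin.strip().upper()
    if not GSTIN_SHAPE.fullmatch(gstin):
        return False
    return gstin_check_char(gstin[:14]) == gstin[14]

ok = gstin_looks_valid("27AAPFU0939F1ZV")
misread = gstin_looks_valid("27AAPFU0989F1ZV")  # a 3 read as an 8
```

A pass here only means the number is well formed. It does not confirm the GSTIN is registered or belongs to that vendor, so it narrows what you look at but does not replace the check.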

Vendor name and date almost never need checking. The numbers do. The machine reads the bulk you'd hate to type; you spend your attention on the few fields that cost money when they're wrong.
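
The tie-out between taxable value and tax is plain arithmetic: on an intra-state bill CGST and SGST are each half the rate, and on an inter-state bill IGST is the full rate. A minimal sketch, with the function name and the one-rupee rounding tolerance as assumptions you should adjust to your own practice.

```python
from decimal import Decimal

def tax_ties_out(taxable, rate_percent, cgst=None, sgst=None, igst=None, tolerance="1.00"):
    """Check that the tax amounts read off the bill agree with taxable value x rate."""
    expected_total = Decimal(taxable) * Decimal(rate_percent) / 100
    tol = Decimal(tolerance)
    if igst is not None:
        # Inter-state supply: the whole rate is charged as IGST.
        return abs(Decimal(igst) - expected_total) <= tol
    if cgst is None or sgst is None:
        return False
    # Intra-state supply: the rate is split evenly between CGST and SGST.
    half = expected_total / 2
    return abs(Decimal(cgst) - half) <= tol and abs(Decimal(sgst) - half) <= tol

clean_row = tax_ties_out("1000.00", "18", cgst="90.00", sgst="90.00")
misread_row = tax_ties_out("1800.00", "18", cgst="117.00", sgst="117.00")  # 1300 read as 1800
interstate_row = tax_ties_out("1000.00", "18", igst="180.00")
```

A failed tie-out does not tell you which field was misread, only that the taxable value, the rate or the tax amount needs a look. Amounts are passed as strings so Decimal keeps them exact.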

What this actually buys you

Not zero typing. Anyone promising that hasn't done a real reconciliation. What it buys is this: the plastic bag of bills from four states stops being four separate manual jobs. The layout problem and the mixed-script problem get solved once, the per-language quirks mostly hit the header text and not the numbers, and your time collapses onto the handful of fields worth a human eye.

If you're starting with one language, start with whichever state you have the most bills from, get the photo discipline right (flat, filled frame, no shadow), and check the three numbers. The rest scales from there.