An Exploration of Low-Resource, Non-Latin Optical Character Recognition using CNN + Transformer Architecture
Khmer characters attach above, below, and beside a base consonant. OCR that reads a single left-to-right strip misses the stacked marks entirely.
Khmer writes no spaces between words, so a whole sentence arrives as one huge image. Resizing it to fit the model crushes the tiny stacked characters.
Very few labeled Khmer text examples exist. Modern AI models need millions of samples; Khmer has far fewer.
Scans images both horizontally and vertically, capturing the stacked vowels and subscript consonants that one-dimensional tools miss.
Cuts long sentences into overlapping pieces, processes each at full quality, then merges the results.
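The chunking step can be sketched as a sliding window over the line image. This is an illustrative sketch: the window width and overlap below are assumed values, not the system's actual settings.

```python
import numpy as np

def sliding_windows(line_img: np.ndarray, win_w: int = 320, overlap: int = 64):
    """Cut a text-line image (H x W) into overlapping horizontal crops.

    win_w and overlap are illustrative defaults, not values from the
    real system.
    """
    h, w = line_img.shape
    stride = win_w - overlap
    crops = []
    x = 0
    while x < w:
        crop = line_img[:, x:x + win_w]
        # Pad the final crop so every window has the same width.
        if crop.shape[1] < win_w:
            crop = np.pad(crop, ((0, 0), (0, win_w - crop.shape[1])))
        crops.append(crop)
        if x + win_w >= w:
            break
        x += stride
    return crops
```

Each crop keeps the full original height, so stacked characters are never resized, and neighbouring crops share a strip of pixels so no character is cut in half at a boundary.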
2.8 million computer-generated Khmer training images from real fonts — solving the data scarcity problem.
After Khmer, the model was rapidly adapted to Thai, Lao, Burmese, Vietnamese, and Hindi.
Photo or scan of a Khmer text line
Vision backbone scans the line in two dimensions, capturing shapes across both height and width
Image split into grid patches, each tagged with its 2D position
Encoder learns context, decoder outputs characters
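The patch-and-tag steps above can be sketched in a few lines; the patch size of 16 is an assumed value, not the system's actual setting.

```python
import numpy as np

def patchify(img: np.ndarray, p: int = 16):
    """Split an H x W image into p x p patches, tagging each with its
    (row, col) grid position, mirroring the 2D position tags the
    encoder consumes. Illustrative sketch, not the system's exact code."""
    h, w = img.shape
    gh, gw = h // p, w // p
    img = img[:gh * p, :gw * p]  # drop ragged edges for simplicity
    patches = img.reshape(gh, p, gw, p).swapaxes(1, 2).reshape(gh * gw, p * p)
    positions = [(r, c) for r in range(gh) for c in range(gw)]
    return patches, positions
```

The (row, col) tag is what lets the encoder reason about characters stacked above and below each other, not just left and right.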
Resizing one giant image crushes the tiny stacked characters until the model cannot read them, and processing it is very slow.
Each piece is processed at full resolution, so small characters stay sharp and readable. The results are merged at the end, and the pipeline runs up to 2× faster.
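One simple way to merge the per-piece transcriptions is to trim the duplicated text where neighbouring windows overlap. This is a hedged sketch of that idea; the real system may merge at a lower level, such as over model scores rather than final strings.

```python
def merge_predictions(chunks: list[str], max_overlap: int = 8) -> str:
    """Stitch per-window transcriptions into one line by removing the
    longest suffix/prefix match between neighbouring outputs.

    max_overlap is an assumed bound on how many characters two
    neighbouring windows can share."""
    out = chunks[0]
    for nxt in chunks[1:]:
        k = 0
        for n in range(min(max_overlap, len(out), len(nxt)), 0, -1):
            if out.endswith(nxt[:n]):
                k = n
                break
        out += nxt[k:]
    return out
```

For example, windows that read "hello wor" and "world!" share the fragment "wor", so the merge emits "hello world!" once rather than duplicating the overlap.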
Reduction in character mistakes vs. standard Tesseract OCR on real Khmer ID cards.
Khmer model adapted to Thai, Lao, Burmese, Vietnamese, and Hindi with minimal extra training.
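Adapting to a new script mostly means swapping the model's output layer to the new character set while reusing the trained encoder. A conceptual sketch follows, with a plain weight matrix standing in for the real decoder head and hypothetical dimensions throughout.

```python
import numpy as np

rng = np.random.default_rng(0)

def new_output_head(encoder_dim: int, charset: str):
    """Build a fresh output projection for a new script's characters,
    to sit on top of the frozen, already-trained encoder.

    Conceptual sketch: encoder_dim and the init scale are assumptions,
    and the real head belongs to a full transformer decoder."""
    vocab = sorted(set(charset)) + ["<blank>"]
    # One output row per character in the new script's vocabulary.
    head = rng.normal(0.0, 0.02, size=(len(vocab), encoder_dim))
    return head, vocab

# Hypothetical adaptation to Thai: five sample characters plus a blank.
thai_head, thai_vocab = new_output_head(256, "กขคงจ")
```

Because only this small projection is trained from scratch, the adaptation needs far less labeled data than training a model from nothing, which is what "minimal extra training" refers to.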
Reading in both height and width outperforms left-to-right-only models in 4 of 5 test categories.