Royal University of Phnom Penh

Toward a Baseline for Khmer OCR

An Exploration of Low-Resource, Non-Latin Optical Character Recognition using CNN + Transformer Architecture

Subject: IDSE
Lecturer: Dr. Chamroeun Khim
Dr. Rina · Dr. Iwamura · Dr. Sovila · Dr. Kise · Osaka Metropolitan University
01 / 07
The Challenge

Why can't existing tools read Khmer?

Complex Script

Characters stack above, below, and beside one another. OCR that scans strictly left to right often misses the stacked marks.

Long Text Lines

Khmer puts no spaces between words, so a whole sentence becomes one very wide image. Resizing it to fit a fixed width crushes the tiny stacked characters.

Scarce Data

Very few labeled Khmer text examples exist. Models of this kind are typically trained on millions of samples; Khmer has far fewer.

Khmer is classified as "non-Latin-complete" — solving it means unlocking OCR for many other complex scripts worldwide.
02 / 07
The Solution

A purpose-built AI that reads Khmer in two dimensions

2D Vision

Scans images both left-right and up-down — capturing stacked vowels and subscripts that 1D tools miss.

Image Slicing

Cuts long sentences into overlapping pieces, processes each at full quality, then merges the results.

Synthetic Data

2.8 million computer-generated Khmer training images from real fonts — solving the data scarcity problem.

Transfer to 5 Scripts

After Khmer, the model was rapidly adapted to Thai, Lao, Burmese, Vietnamese, and Hindi.

03 / 07
Figure 12 — The Reading Pipeline

How the system reads Khmer

1

Input Image

Photo or scan of a Khmer text line

2

CNN Features

Vision scanner captures shapes in height and width

3

Patch Encoder

Image split into grid patches, each tagged with its position

4

Transformer Encoder & Decoder

Encoder learns context, decoder outputs characters

The pipeline processes raw input images without any destructive resizing — preserving the fine detail of stacked Khmer characters at every stage.
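
To make the Patch Encoder step concrete, here is a minimal sketch in plain Python. It assumes the tag attached to each patch is its (row, column) grid position and that patches do not overlap; the slides do not specify either detail, and the function name and patch sizes are hypothetical.

```python
def split_into_patches(image, patch_h, patch_w):
    """Split a 2D image (a list of pixel rows) into non-overlapping grid patches.

    Returns a list of ((row_idx, col_idx), patch) pairs. The tag is the
    patch's grid position (an assumption for illustration). The image height
    and width must be exact multiples of the patch size.
    """
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for r in range(0, height, patch_h):
        for c in range(0, width, patch_w):
            patch = [row[c:c + patch_w] for row in image[r:r + patch_h]]
            patches.append(((r // patch_h, c // patch_w), patch))
    return patches


# Toy 4x6 "image" whose pixel values encode their own coordinates (10*y + x).
image = [[10 * y + x for x in range(6)] for y in range(4)]
patches = split_into_patches(image, patch_h=2, patch_w=3)
for tag, patch in patches:
    print(tag, patch)
```

Keeping the row index in the tag is what lets a later stage tell a subscript patch from a superscript patch at the same horizontal position.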
04 / 07
The Clever Trick

Slice. Process. Merge.

A smarter approach to long text lines
⚠ Old Way

Resize the entire image to a fixed width.

Tiny stacked characters get crushed, so the AI cannot read them. Processing one giant image is also very slow.

❌ Characters crushed · Slow
✓ New Way

Slice into overlapping sections — P1, P2, P3, P4.

Each piece is processed at full quality, so small characters stay sharp and readable. The results are merged at the end, and the whole process runs up to 2× faster than the old way.

✓ Full quality · Up to 2× faster
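
The slice-and-merge idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the slice width, the overlap, and the merge rule (join on the longest shared suffix/prefix of the recognized text) are all hypothetical choices, since the slides do not say how results are merged.

```python
def slice_positions(width, slice_w, overlap):
    """Return (start, end) column ranges of overlapping slices covering `width`."""
    if not 0 <= overlap < slice_w:
        raise ValueError("overlap must be in [0, slice_w)")
    if width <= slice_w:
        return [(0, width)]
    step = slice_w - overlap
    starts = list(range(0, width - slice_w, step))
    starts.append(width - slice_w)  # final slice sits flush with the right edge
    return [(s, s + slice_w) for s in starts]


def merge_texts(pieces, max_overlap=10):
    """Join per-slice outputs, dropping text repeated in the overlap region.

    Assumed rule: remove the longest prefix of the next piece (up to
    `max_overlap` characters) that matches the end of the merged text.
    """
    merged = pieces[0] if pieces else ""
    for piece in pieces[1:]:
        limit = min(max_overlap, len(merged), len(piece))
        k = next((n for n in range(limit, 0, -1) if merged.endswith(piece[:n])), 0)
        merged += piece[k:]
    return merged


# A 10-column line cut into 4-column slices that share 1 column.
print(slice_positions(width=10, slice_w=4, overlap=1))
# Stand-in strings for the text recognized in three overlapping slices.
print(merge_texts(["abcde", "defgh", "ghij"]))
```

A text-level merge like this can be fooled when a character legitimately repeats at a slice boundary, so a real system would likely need position information as well.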
05 / 07
Results

Did it work?

Yes — significantly.
50%+
Fewer Errors

Reduction in character mistakes vs. standard Tesseract OCR on real Khmer ID cards.

×6
Scripts

Khmer plus five more: the Khmer model was adapted to Thai, Lao, Burmese, Vietnamese, and Hindi with minimal extra training.

2D
Wins

Reading in both height and width outperforms left-to-right-only models in 4 of 5 test categories.

Validated on smartphone-captured Khmer ID cards and PDF scans — not just clean synthetic data.
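
For readers unfamiliar with how "fewer errors" is usually measured, the sketch below computes a character error rate (CER) from edit distance and the relative reduction between two systems. It assumes the slide's figure is a relative CER reduction, which the slides do not state. The strings are toy placeholders, not data from the study.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# Toy placeholder strings (NOT results from the paper).
reference = "abcdefghij"
baseline_output = "xbcxefxhix"   # 4 wrong characters
improved_output = "abcxefghix"   # 2 wrong characters
base_cer = cer(reference, baseline_output)
new_cer = cer(reference, improved_output)
print(base_cer, new_cer, relative_reduction(base_cer, new_cer))
```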
06 / 07
End of Presentation

Thank You

Dr. Rina · Dr. Iwamura · Dr. Sovila · Dr. Kise · Osaka Metropolitan University

Scan for Case Study → Research Paper

ieeexplore.ieee.org/document/10316307
07 / 07
Pipeline — Detail View
CNN + Transformer — Architecture Overview
Input Image
Variable height; greyscale / RGB

CNN Backbone
Conv layers; BN + ReLU; max-pool (H only)

Feature Map
C × H′ × W′; width = time steps

Positional Encoding
Sinusoidal + learned embedding

Transformer Encoder
Multi-head self-attention; FFN × N layers; layer norm

Transformer Decoder
Masked self-attention; cross-attention (encoder); FFN × N layers

Linear Projection
d_model → |vocab|; softmax

Character Output
Khmer Unicode sequence
CNN Stage · Encoding · Transformer · Output
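
Two pieces of the detail view lend themselves to a small worked example: how height-only max-pooling shapes the feature map, and what the sinusoidal part of the positional encoding looks like. The sketch assumes the convolutions preserve spatial size and each pool halves the height, so W′ equals the input width. Channel count, pool count, and d_model are hypothetical values, and the learned embedding mentioned on the slide is not modelled.

```python
import math


def cnn_output_shape(height, width, channels_out, n_height_pools):
    """Feature-map shape (C, H', W') when each max-pool halves height only.

    Assumes size-preserving convolutions, so width (the time axis) is unchanged.
    """
    h = height
    for _ in range(n_height_pools):
        h //= 2
    return channels_out, h, width


def sinusoidal_encoding(n_positions, d_model):
    """Standard sinusoidal table: sin on even dimensions, cos on odd ones."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Hypothetical sizes: a 64x512 text line, 256 channels, three height-only pools.
shape = cnn_output_shape(height=64, width=512, channels_out=256, n_height_pools=3)
print(shape)  # height shrinks, width (number of time steps) is preserved

pe = sinusoidal_encoding(n_positions=4, d_model=8)
print([round(v, 3) for v in pe[1]])
```

Pooling along height only is what keeps a long text line long: horizontal resolution survives, which matches the "Width = time steps" note on the feature map.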