Royal University Of Phnom Penh

Toward a Baseline for Khmer OCR

An Exploration of Low-Resource, Non-Latin Optical Character Recognition using CNN + Transformer Architecture

Subject: IDSE Lecturer: Dr. Chamroeun Khim

Dr.Rina Dr.Iwamura Dr.Sovila Dr.Kise Osaka Metropolitan University

01 / 07

The Challenge

Why can't existing tools read Khmer?

Complex Script

Characters stack vertically — above, below, and beside each other. Left-to-right OCR misses them entirely.

Long Text Lines

No word spaces means sentences are one huge image. Resizing to fit crushes the tiny stacked characters.

Scarce Data

Very few labeled Khmer text examples exist. AI needs millions of samples — Khmer has far less.

Khmer is classified as "non-Latin-complete" — solving it means unlocking OCR for many other complex scripts worldwide.

02 / 07

The Solution

A purpose-built AI that reads Khmer in two dimensions

2D Vision

Scans images both left-right and up-down — capturing stacked vowels and subscripts that 1D tools miss.

Image Slicing

Cuts long sentences into overlapping pieces, processes each at full quality, then merges the results.

Synthetic Data

2.8 million computer-generated Khmer training images from real fonts — solving the data scarcity problem.

Transfer to 5 Scripts

After Khmer, the model was rapidly adapted to Thai, Lao, Burmese, Vietnamese, and Hindi.

03 / 07

Figure 12 — The Reading Pipeline

How the system reads Khmer

Input Image

Photo or scan of a Khmer text line

CNN Features

Vision scanner captures shapes in height and width

Patch Encoder

Image split into grid patches, each tagged

Transformer Enc & Dec

Encoder learns context, decoder outputs characters

The pipeline processes raw input images without any destructive resizing — preserving the fine detail of stacked Khmer characters at every stage.

04 / 07

The Clever Trick

Slice. Process. Merge.

A smarter approach to long text lines

⚠ Old Way

Resize the entire image to a fixed width.

Tiny stacked characters get crushed — the AI cannot read them. Also very slow to process one giant image.

❌ Characters crushed · Slow

✓ New Way

Slice into overlapping sections — P1, P2, P3, P4.

Each piece is processed at full quality. Small characters stay sharp and readable. Results merged at the end — up to 2× faster.

✓ Full quality · Up to 2× faster

05 / 07

Results

Did it work?

Yes — significantly.

50%+

Fewer Errors

Reduction in character mistakes vs. standard Tesseract OCR on real Khmer ID cards.

×6

Scripts

Khmer model adapted to Thai, Lao, Burmese, Vietnamese, and Hindi with minimal extra training.

Wins

Reading in both height and width consistently outperforms left-to-right-only models in 4 of 5 test categories.

Validated on smartphone-captured Khmer ID cards and PDF scans — not just clean synthetic data.

06 / 07

End of Presentation

Thank You

Dr. Rina · Dr.Iwamura · Dr.Sovila · Dr.Kise · Osaka Metropolitan University

Scan for Case Study → Research Paper

ieeexplore.ieee.org/document/10316307

07 / 07

🖼

Input Image

Variable height
Greyscale / RGB

→

🧱

CNN Backbone

Conv layers
BN + ReLU
Max-pool (H only)

→

≡

Feature Map

C × H′ × W′
Width = time steps

📐

Positional Enc.

Sinusoidal
+ learned embed

→

⚙

Transformer Encoder

Multi-head Self-Attn
FFN × N layers
Layer norm

→

⚙

Transformer Decoder

Masked Self-Attn
Cross-Attn (enc)
FFN × N layers

⊞

Linear Projection

d_model → |vocab|
Softmax

→

ក

Character Output

Khmer Unicode
sequence

CNN Stage Encoding Transformer Output