The Challenge of Khmer Tokenization
Why standard BPE tokenizers fail on non-segmented scripts and how we can fix it using hybrid approaches.
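To make the problem concrete, here is a minimal sketch of why whitespace-based pre-tokenization, which many BPE pipelines rely on, gives no word boundaries for Khmer. It also shows one possible "hybrid" first step: segmenting with a dictionary before any subword merges are learned. The four-word lexicon, the sample sentence, and the function names are illustrative assumptions, not the actual method or data behind this post.

```python
# Toy illustration only: WORDS is a hypothetical four-word sample,
# not a real Khmer dictionary.
WORDS = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]  # "I", "love", "language", "Khmer"
LEXICON = set(WORDS)


def whitespace_pretokenize(text: str) -> list[str]:
    """Split on whitespace, the usual pre-tokenization step for segmented scripts."""
    return text.split()


def longest_match_segment(text: str, lexicon: set[str]) -> list[str]:
    """Greedy longest-match segmentation.

    Characters not covered by the lexicon fall back to single code points.
    """
    max_len = max(len(word) for word in lexicon)
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                pieces.append(candidate)
                i += size
                break
        else:
            pieces.append(text[i])
            i += 1
    return pieces


english = "I love the Khmer language"
khmer = "".join(WORDS)  # Khmer is written without spaces between words

print(len(whitespace_pretokenize(english)))   # one chunk per English word
print(len(whitespace_pretokenize(khmer)))     # the whole sentence is one chunk
print(longest_match_segment(khmer, LEXICON))  # word-level pieces recovered
```

In a real pipeline, the segmented pieces would then be passed to a subword learner such as BPE. A greedy dictionary matcher is only the simplest stand-in for that segmentation step.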
"One small step for a man, one giant leap for mankind"
My name is Sansethireach Men (Tnaot). I am a Full Stack Developer and AI Researcher based in Cambodia. My journey didn't start with code; it started with numbers.
From a young age, I was fascinated by the purity of mathematics and physics. This passion led me to compete internationally, representing my school in Cambodia and across Southeast Asia. The discipline required to solve complex math problems became the bedrock of my engineering mindset.
Around 2022, during the pandemic, I discovered the world of Web3. I dove into the ecosystems of Ethereum, Ronin and Polygon. It wasn't just about the hype; it was about understanding decentralized systems and digital ownership.
By 2025, my focus shifted to the next frontier: Artificial Intelligence. I realized that while Web3 builds the infrastructure of value, AI builds the infrastructure of intelligence. Today, I am dedicated to solving low-resource language problems—specifically for Khmer—using Transformer models and Large Language Models (LLMs).
How Reamke.com uses AI not to replace tradition, but to visualize it for a new generation.