
awesome-khmer-language
A large collection of Khmer language resources. Khmer is the official language of Cambodia.
Stars: 79

Awesome Khmer Language is a comprehensive collection of resources for the Khmer language, including tools, datasets, research papers, projects/models, blogs/slides, and miscellaneous items. It covers a wide range of topics related to Khmer language processing, such as character normalization, word segmentation, part-of-speech tagging, optical character recognition, text-to-speech, and more. The repository aims to support the development of natural language processing applications for the Khmer language by providing a diverse set of resources and tools for researchers and developers.
README:
Pull requests are very welcome!
- Khmer Characters - The Unicode Standard 15.0
- Khmer Encoding Structure - Unicode
- sillsdev/khmer-character-specification
- Khmer Layout Requirements
- wiki/Khmer_language
- wiki/Khmer_script
- wiki/Romanization_of_Khmer
- http://www.eki.ee/wgrs/rom1_km.pdf
- sillsdev/khmer-normalizer Normalize Khmer strings according to https://www.unicode.org/L2/L2022/22290-khmer-encoding.pdf
- automatic-phonemic-and-phonetic-transcription
- Khmer Word Segmentation - Rina Buoy
- Khmer natural language processing toolkit
- Khmer Limon to Unicode
- seanghay/split-khmer Split Khmer sentence into an array of words.
- seanghay/khmertokenizer
- seanghay/khmerword
- seanghay/khmernumber
- seanghay/khmernormalizer
- khmer-ocr-benchmark-dataset A standardized benchmark dataset for Khmer Optical Character Recognition (OCR) engines.
- Khmer utility functions
- Trey314159/KhmerSyllableReordering
- khmer-dictionary-tools
- nota/split-graphemes
- NextSpell - Spell checking, Khmer OCR, word segmentation
- khmercut A (fast) Khmer word segmentation toolkit.
- Socret360/akara-python AKARA: Open-Source Khmer Spell Checker
- khmer-latin-name-transformer
- native-khmer-g2p
- khmerphonemizer
- kfa A fast Khmer Forced Aligner powered by Wav2Vec2CTC and Phonetisaurus
- sosap(សូរសព្ទ) Python binding for Phonetisaurus
- khmer-unicode-converter Khmer Unicode Converter
- khmerpunctuate Punctuation Restoration for Khmer language
- khmerocr_tools Khmer OCR Synthetic Data Generator
- Socret360/jaws Just Another Word Segmenter (JAWS): A Graph Neural Network Model for Khmer Word Segmentation
- seanghay/khmersegment A Khmer word segmentation tool built for NIPTICT (now CADT) Khmer Word Segmentation CRF model.
- seanghay/khmer-acoustic-model-mfa Train an Acoustic Model for Khmer language with Montreal Forced Aligner
- seanghay/tha Tha (ថា) - A Khmer Text Normalization and Verbalization Toolkit
- seanghay/khmerpronounce Khmer Pronunciation Toolkit
- seanghay/khmer2number A Khmer word to number converter.
- khPOS (Khmer Part-of-Speech) Corpus for Khmer NLP Research and Developments
- ParaCrawl Corpus
- Asian Language Treebank (ALT) Project
- phylypo/segmentation-crf-khmer
- google/language-resources Lexicon, Text normalization and Verbalizer
- Illustrations and recordings for language learning Audio recordings and illustrations
- seanghay/khmer-dictionary-44k
- seanghay/km-speech-corpus
- seanghay/bookmebus-reviews
- seanghay/khmer_mpwt_speech
- seanghay/khmer_kheng_info_speech
- seanghay/khmer_grkpp_speech
- High quality TTS data for Khmer
- Google FLEURS Audio Dataset
- mc4 A multilingual colossal, cleaned version of Common Crawl's web crawl corpus
- Khmer LineBreaking Dictionary
- Khmer tesseract-ocr
- Khmerlang Mobile Keyboard data
- Khmer Bible Recordings
- SleukRith Set
- Khmer annotation Annotated Khmer Dataset for Word spotting
- An End-to-End Khmer Optical Character Recognition using Sequence-to-Sequence with Attention
- Khmer Word Search: Challenges, Solutions, and Semantic-Aware Search
- Khmer Text Classification Using Word Embedding and Neural Networks
- Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning
- Building WFST based Grapheme to Phoneme Conversion for Khmer
- Query Expansion for Khmer Information Retrieval
- Building a Syllable Database to Solve the Problem of Khmer Word Segmentation
- Khmer Word Segmentation based on Bi-Directional Maximal Matching for Plaintext and Microsoft Word Document
- Khmer printed character recognition using attention-based Seq2Seq network
- Khmer Word Segmentation Using Conditional Random Fields
- A Large-scale Study of Statistical Machine Translation Methods for Khmer Language
- A Rule-based Approach for Khmer Word Extraction
- Khmer Word Segmentation and Out-of-Vocabulary Words Detection Using Collocation Measurement of Repeated Characters Subsequences
- The Standard Khmer vowel system: An acoustic study
- Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion
- Towards deep learning on speech recognition for Khmer language
- A review of Khmer word segmentation and part-of-speech tagging and an experimental study using bidirectional long short-term memory
- Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence
- Detection and Correction of Homophonous Error Word for Khmer Language
- No Language Left Behind (NLLB)
- Phonological Principles And Automatic Phonemic And Phonetic Transcription Of Khmer Words
- Multi-lingual Transformer Training for Khmer Automatic Speech Recognition
- TriECCC: Trilingual Corpus of the Extraordinary Chambers in the Courts of Cambodia for Speech Recognition and Translation Studies
- Domain and Language Adaptation Using Heterogeneous Datasets for Wav2vec2.0-Based Speech Recognition of Low-Resource Language
- Khmer pronouncing dictionary: standard Khmer and Phnom Penh dialect
- ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition
- Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition
- Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition
- facebookresearch/fairseq/mms Text to Speech and Speech to Text
- Khmer Language Model using ULMFiT
- KHMER WORD SEARCH BASE ON SEMANTIC RELATION
- Khmer Audio Dictionary
- Khmer to IPA Converter
- Khmer Phonemizer
- Khmer Text-to-Speech MMS
- Khmer Part of Speech Tagging with XLM RoBERTa
- Whisper Small Khmer Fine-tuned
- Joint Word Segmentation and POS Tagging in Keras
- Socret360/akara-android
- vitouphy/wav2vec2-xls-r-300m-khmer
- vitouphy/wav2vec2-xls-r-1b-khmer
- Khmer Text Classification
- khmerlang/khmer-text-summarizer
- khmerlang/KhmerWordPrediction
- khmerlang/elasticsearch-analysis-khmerlang
- Khmer Fingerspelling
- isi-nlp/uroman Universal Romanizer
- pisethx/khmer-word-segmentation
- khmer-forced-aligner
- Fast Khmer Dictionary
- SEANLP: Southeast Asia Natural Language Processing
- Khmerlang-Keyboard
- ericvida/khtransliterator
- Khmer Unicode Converter
- chantysothy/KhmerUnicodeConverter
- Pretrained-BERT-model-for-Khmer-language
- Khmer Language Model for Handwritten Text Recognition on Historical Documents
- Khmer Single Word TTS
- SeaLLMs Large Language Models for Southeast Asia
- XLM-RoBERTa-Khmer Training from scratch using Masked Language Modeling task on 5M Khmer sentences or 162M words or 578K unique words for 1M steps. While being smaller than XLM-RoBERTa-Base
- Issues in Khmer syllable validation
- Khmer Machine Learning (ML) Experiment
- Using AI to Generate Khmer Baby Names
- How domnung.com Ranks Khmer News
- Text Classification with scikit-learn on Khmer Documents
- Multi-Class Text Classification on Khmer News Articles
- Word Segmentation of Khmer Text Using Conditional Random Fields
- Khmer Language Model Using ULMFiT (Feb 2020)
- Creating a Khmer Language Model using BERT
- Building a Khmer Spelling Checker
- khmerlang.com
- Khmer word spell correction using BK-Tree data structure and Levenshtein distance
- Introduction to kNN algorithm by experiment on Khmer Handwriting classification using Java 8
- Speech Synthesis and Low Resource Languages
- Encoding the Khmer script in Unicode (documents from 1996)
- harfbuzz A text shaping engine that supports Khmer language.
- xlm-roberta-base A better BERT with multilingual support.
- mt5-base Google T5 with multilingual support.
- byt5-base Google T5 without tokenizer.
- sentencepiece A tool to create a tokenizer
- huggingface/transformers
- tiktoken
- montreal-forced-aligner Acoustic Model & Alignment
- pair_ngram Building Grapheme to Phoneme
- fastText
- Phonetisaurus Building Grapheme to Phoneme
- Compact Language Detector v3 Language Detection tool
Khmer is not a low-resource language.
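Several of the papers listed above segment Khmer text by bi-directional maximal matching against a dictionary. The following is a minimal sketch of that general idea, not the method of any listed paper or tool. The function names, the toy lexicon, and the tie-breaking rule (fewer words, then fewer single-character tokens, then the backward pass) are assumptions made for illustration. It also falls back to single code points for unknown text, which a real Khmer segmenter would replace with whole character clusters.

```python
def forward_max_match(text, lexicon, max_len):
    """Greedy longest-match scan from left to right."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single code point is always accepted so the scan cannot stall.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len):
    """Greedy longest-match scan from right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()  # pieces were collected from the end of the text
    return words


def bidirectional_max_match(text, lexicon):
    """Run both scans and keep the segmentation with the lower cost."""
    max_len = max(map(len, lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)

    def cost(words):
        # Assumed heuristic: fewer words first, then fewer 1-character tokens.
        return (len(words), sum(1 for w in words if len(w) == 1))

    # On a tie the backward result is kept (an assumption, not a fixed rule).
    return fwd if cost(fwd) < cost(bwd) else bwd


# Hypothetical toy lexicon: "I", "love", "language", "Khmer", "Khmer language".
I, LOVE, LANGUAGE, KHMER = "ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"
TOY_LEXICON = {I, LOVE, LANGUAGE, KHMER, LANGUAGE + KHMER}

sentence = I + LOVE + LANGUAGE + KHMER  # Khmer is written without spaces
print(bidirectional_max_match(sentence, TOY_LEXICON))
```

Because the lexicon contains the compound for "Khmer language", both passes prefer it over its two shorter parts. The two passes disagree only on ambiguous input, which is where the cost comparison matters.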