awesome-japanese-nlp-resources
A curated list of resources dedicated to Python libraries, LLMs, dictionaries, and corpora of NLP for Japanese
Stars: 914
README:
A curated list of resources dedicated to Python libraries, LLMs, dictionaries, and corpora of NLP for Japanese
- Listed information on 823 GitHub repositories
- Listed information on 237 Hugging Face repositories (models and datasets)
- Released a tool for searching through information on a large number of repositories
English | 日本語 (Japanese) | 繁體中文 (Chinese) | 简体中文 (Chinese)
Python
- edinet-mcp - EDINET XBRL parsing library and MCP server for Japanese financial data.
Corpus
- jfinqa - Japanese financial question answering benchmark with 1,000 questions from 68 listed companies.
- kotowaza - Structured JSON dataset of Japanese proverbs (kotowaza) with meanings in Indonesian & English, examples, JLPT levels, and tags.
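- selective-rag-kasensabo - An MVP of a practical Agentic RAG system that automatically classifies the expertise granularity (fine/coarse) of questions about construction technical standards with 96% accuracy and selects the most suitable RAG system (ColBERT/Naive). Two RAG systems were built on the river, erosion-control, and dam technical standards published in November 2025, and their accuracy and speed were compared on 200 questions of varying expertise granularity.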
Updated on Feb 17, 2026
- Hugging Face
- Python library
- C++
- Rust crate
- JavaScript
- Go
- Java
- Pretrained model
- ChatGPT
- Dictionary and IME
- Corpus
- Tutorial
- Research summary
- Reference
- Contributors
Libraries that split Japanese text into words or morphemes and assign part-of-speech and base forms
- sudachi.rs - SudachiPy 0.6* and above are developed as Sudachi.rs.
- Janome - Japanese morphological analysis engine written in pure Python
- mecab-python3 - mecab-python. You can find the original version here: http://taku910.github.io/mecab/
- mecab - This repository is for building Windows 64-bit MeCab binary and improving MeCab Python binding.
- fugashi - A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
- nagisa - A Japanese tokenizer based on recurrent neural networks
- pyknp - A Python Module for JUMAN++/KNP
- Mykytea-python - Python wrapper for KyTea
- konoha - Konoha: Simple wrapper of Japanese Tokenizers
- natto-py - natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
- rakutenma-python - Rakuten MA (Python version)
- python-vaporetto - Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
- dango - An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
- rhoknp - Yet another Python binding for Juman++/KNP
- python-vibrato - Viterbi-based accelerated tokenizer (Python wrapper)
- jagger-python - Python binding for Jagger(C++ implementation of Pattern-based Japanese Morphological Analyzer)
- Mecari - Mecari (Japanese Morphological Analysis with Graph Neural Networks)
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| SudachiPy | 📥 387k | 📦 60M | ⭐ 427 | 🔴 october 2022 |
| Janome | 📥 35k | 📦 11M | ⭐ 902 | 🟡 october 2025 |
| mecab-python3 | 📥 299k | 📦 35M | ⭐ 580 | 🟡 november 2025 |
| mecab | 📥 3k | 📦 646k | ⭐ 271 | 🔴 october 2024 |
| fugashi | 📥 129k | 📦 13M | ⭐ 507 | 🟡 october 2025 |
| nagisa | 📥 67k | 📦 7M | ⭐ 412 | 🟢 last thursday |
| pyknp | 📥 1k | 📦 3M | ⭐ 92 | 🟢 january |
| Mykytea-python | 📥 1k | 📦 553k | ⭐ 36 | 🔴 january 2024 |
| konoha | 📥 37k | 📦 5M | ⭐ 261 | 🟡 april 2025 |
| natto-py | 📥 127k | 📦 32M | ⭐ 95 | 🔴 november 2023 |
| rakutenma-python | 📥 113 | 📦 27k | ⭐ 23 | 🔴 may 2017 |
| python-vaporetto | 📥 292 | 📦 173k | ⭐ 21 | 🟡 june 2025 |
| dango | 📥 90 | 📦 25k | ⭐ 25 | 🔴 november 2021 |
| rhoknp | 📥 10k | 📦 977k | ⭐ 37 | 🟢 january |
| python-vibrato | 📥 160 | 📦 115k | ⭐ 43 | 🔴 september 2024 |
| jagger-python | 📥 2k | 📦 295k | ⭐ 12 | 🔴 march 2024 |
| Mecari | - | - | ⭐ 38 | 🟡 september 2025 |
Libraries that analyze syntactic and dependency structures of Japanese sentences
- ginza - A Japanese NLP Library using spaCy as framework based on Universal Dependencies
- cabocha - Yet Another Japanese Dependency Structure Analyzer
- UniDic2UD - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese
- camphr - Camphr - NLP library for creating pipeline components
- SuPar-UniDic - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese with BERT models
- depccg - A* CCG Parser with a Supertag and Dependency Factored Model
- bertknp - A Japanese dependency parser based on BERT
- esupar - Tokenizer POS-Tagger and Dependency-parser with BERT/RoBERTa/DeBERTa models for Japanese and other languages
- yomikata - Heteronym disambiguation library using a fine-tuned BERT model.
- jdepp-python - Python binding for J.DepP(C++ implementation of Japanese Dependency Parsers)
- lightblue - A CCG parser for Japanese with DTS-representations
- natsume-simple - natsume-simple is a dependency-relation search system for Japanese
- jdeppy - Python wrapper for J.DepP, fast Japanese Dependency Parser
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| ginza | 📥 10k | 📦 2M | ⭐ 832 | 🔴 march 2024 |
| cabocha | 📥 110 | 📦 54k | ⭐ 7 | 🔴 august 2022 |
| UniDic2UD | 📥 423 | 📦 326k | ⭐ 38 | 🟢 december 2025 |
| camphr | 📥 614 | 📦 268k | ⭐ 338 | 🔴 august 2021 |
| SuPar-UniDic | 📥 471 | 📦 118k | ⭐ 20 | 🔴 repo not found |
| depccg | 📥 117 | 📦 46k | ⭐ 98 | 🔴 august 2023 |
| bertknp | - | - | ⭐ 23 | 🔴 october 2021 |
| esupar | 📥 604 | 📦 167k | ⭐ 54 | 🟢 january |
| yomikata | 📥 35 | 📦 49k | ⭐ 32 | 🔴 october 2023 |
| jdepp-python | 📥 2k | 📦 279k | ⭐ 4 | 🔴 february 2024 |
| lightblue | - | - | ⭐ 26 | 🟢 last thursday |
| natsume-simple | - | - | ⭐ 5 | 🔴 february 2025 |
| jdeppy | 📥 41 | 📦 11k | ⭐ 3 | 🔴 february 2022 |
Libraries that convert between character types such as kana, romaji, and full-width/half-width forms
- pykakasi - Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
- cutlet - Japanese to romaji converter in Python
- alphabet2kana - Convert English alphabet to Katakana
- Convert-Numbers-to-Japanese - Converts Arabic numerals, or 'western' style numbers, to a Japanese context.
- mozcpy - Mozc for Python: Kana-Kanji converter
- jamorasep - Japanese text parser to separate Hiragana/Katakana string into morae (syllables).
- text2phoneme - A script that converts Japanese sentences into phoneme sequences
- jntajis-python - A fast character conversion and transliteration library based on the scheme defined for Japan National Tax Agency (国税庁) 's
- wiredify - Convert japanese kana from ba-bi-bu-be-bo into va-vi-vu-ve-vo
- mecab-text-cleaner - Simple Python package (CLI/Python API) for getting japanese readings (yomigana) and accents using MeCab.
- pynormalizenumexp - A Python implementation of NormalizeNumexp, which extracts and normalizes numerical and temporal expressions
- Jusho - Easy wrapper for the postal code data of Japan
- yurenizer - Japanese text normalizer that resolves spelling inconsistencies.
- e2k - A tool for automatic English to Katakana conversion
- alkana.py - A tool to get the katakana reading of an alphabetical string.
- englishtokanaconverter - A program that converts English strings to katakana
- kanjiconv - Kanji Converter to Hiragana, Katakana, Roman alphabet.
- kanjize - Kanjize (カンジャイズ): Easy converter between Kanji-Number and Integer
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| pykakasi | 📥 230k | 📦 28M | ⭐ 443 | 🔴 july 2022 |
| cutlet | 📥 21k | 📦 1M | ⭐ 372 | 🟡 june 2025 |
| alphabet2kana | 📥 336 | 📦 56k | ⭐ 14 | 🟢 yesterday |
| Convert-Numbers-to-Japanese | - | - | ⭐ 50 | 🔴 november 2020 |
| mozcpy | 📥 176 | 📦 12k | ⭐ 46 | 🔴 february 2025 |
| jamorasep | 📥 322 | 📦 9k | ⭐ 11 | 🟢 today |
| text2phoneme | - | - | ⭐ 13 | 🔴 may 2023 |
| jntajis-python | 📥 348 | 📦 105k | ⭐ 21 | 🔴 june 2023 |
| wiredify | 📥 34 | 📦 6k | ⭐ 3 | 🟢 december 2025 |
| mecab-text-cleaner | 📥 50 | 📦 4k | ⭐ 7 | 🔴 november 2024 |
| pynormalizenumexp | 📥 72 | 📦 14k | ⭐ 8 | 🔴 april 2024 |
| Jusho | 📥 704 | 📦 53k | ⭐ 11 | 🔴 june 2024 |
| yurenizer | 📥 74 | 📦 17k | ⭐ 4 | 🟡 march 2025 |
| e2k | 📥 821 | 📦 23k | ⭐ 15 | 🟡 november 2025 |
| alkana.py | - | - | ⭐ 33 | 🔴 october 2021 |
| englishtokanaconverter | - | - | ⭐ 4 | 🟢 yesterday |
| kanjiconv | 📥 111 | 📦 12k | ⭐ 16 | 🟡 october 2025 |
| kanjize | 📥 8k | 📦 1M | ⭐ 68 | 🟡 june 2025 |
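To give a feel for what the simplest kind of character-type conversion in this category involves, here is a minimal standard-library sketch of hiragana/katakana conversion. It relies only on the fact that the two Unicode blocks are laid out in parallel, 0x60 code points apart. The function names `hira_to_kata` and `kata_to_hira` are hypothetical and are not the API of any library listed above, which handle many more cases (romaji, kanji readings, half-width forms).

```python
# Hypothetical helpers for illustration; not taken from any listed library.
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # ぁ .. ゖ
KANA_OFFSET = 0x60  # distance from a hiragana code point to its katakana counterpart


def hira_to_kata(text: str) -> str:
    """Replace hiragana with katakana; every other character is left untouched."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if HIRAGANA_START <= ord(ch) <= HIRAGANA_END else ch
        for ch in text
    )


def kata_to_hira(text: str) -> str:
    """Replace katakana with hiragana; the long-vowel mark ー is outside the range and kept."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if HIRAGANA_START + KANA_OFFSET <= ord(ch) <= HIRAGANA_END + KANA_OFFSET
        else ch
        for ch in text
    )
```

With these definitions, `hira_to_kata("ひらがな")` yields `"ヒラガナ"`, and kanji or Latin characters pass through unchanged.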
Libraries that normalize and clean text before analysis
- neologdn - Japanese text normalizer for mecab-neologd
- jaconv - Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
- mojimoji - A fast converter between Japanese hankaku and zenkaku characters
- text-cleaning - A powerful text cleaner for Japanese web texts
- HojiChar - A text preprocessing tool for composing and managing multiple preprocessing steps
- utsuho - Utsuho is a Python module that facilitates bidirectional conversion between half-width katakana and full-width katakana in Japanese.
- python-habachen - Yet Another Fast Japanese String Converter
- kairyou - Quickly preprocesses Japanese text using NLP/NER from SpaCy for Japanese translation or other NLP tasks.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| neologdn | 📥 6k | 📦 1M | ⭐ 286 | 🟢 december 2025 |
| jaconv | 📥 560k | 📦 60M | ⭐ 340 | 🟢 february |
| mojimoji | 📥 60k | 📦 11M | ⭐ 153 | 🔴 january 2024 |
| text-cleaning | - | - | ⭐ 12 | 🔴 november 2022 |
| HojiChar | 📥 17k | 📦 725k | ⭐ 125 | 🟡 november 2025 |
| utsuho | 📥 80 | 📦 19k | ⭐ 4 | 🟡 october 2025 |
| python-habachen | 📥 5k | 📦 2M | ⭐ 6 | 🟡 october 2025 |
| kairyou | 📥 126 | 📦 30k | ⭐ 6 | 🟡 june 2025 |
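As a rough baseline for the normalization this category covers, the sketch below uses only Python's standard `unicodedata` module. NFKC normalization folds full-width ASCII to half-width and half-width katakana to full-width. The function name `normalize_text` is an assumption made for illustration; the libraries above apply their own, more targeted rule sets, and this is not how any of them is implemented.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Baseline cleanup: NFKC width folding, then collapse runs of whitespace.

    Hypothetical helper for illustration only.
    """
    # NFKC maps full-width Latin letters/digits to ASCII and
    # half-width katakana to regular full-width katakana.
    folded = unicodedata.normalize("NFKC", text)
    # The ideographic space (U+3000) is now a plain space; collapse repeats.
    return " ".join(folded.split())
```

For example, `normalize_text("Ｐｙｔｈｏｎ　３")` gives `"Python 3"`. NFKC also rewrites characters a corpus may want to keep distinct (for instance, circled digits), which is one reason to reach for a dedicated normalizer.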
Libraries that automatically detect sentence boundaries and split text
- Bunkai - Sentence boundary disambiguation tool for Japanese texts
- japanese-sentence-breaker - Japanese Sentence Breaker
- sengiri - Yet another sentence-level tokenizer for the Japanese text
- budoux - Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning powered line break organizer tool.
- ja_sentence_segmenter - japanese sentence segmentation library for python
- hasami - A tool to perform sentence segmentation on Japanese text
- kuzukiri - Japanese Text Segmenter for Python written in Rust
- ja-senter-benchmark - Comparison of Japanese Sentence Segmentation Tools
- fast-bunkai - Japanese sentence splitting, 40–250× faster via a Rust-accelerated Python library with near-perfect API compatibility with megagonlabs/bunkai.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| bunkai | 📥 511 | 📦 105k | ⭐ 199 | 🔴 august 2023 |
| japanese-sentence-breaker | 📥 38 | 📦 5k | ⭐ 14 | 🔴 february 2021 |
| sengiri | 📥 80 | 📦 136k | ⭐ 24 | 🟡 november 2025 |
| budoux | 📥 7k | 📦 386k | ⭐ 1.6k | 🟢 today |
| ja_sentence_segmenter | 📥 1k | 📦 179k | ⭐ 73 | 🔴 april 2023 |
| hasami | 📥 316 | 📦 37k | ⭐ 6 | 🔴 february 2021 |
| kuzukiri | 📥 101 | 📦 26k | ⭐ 6 | 🟡 june 2025 |
| ja-senter-benchmark | - | - | ⭐ 9 | 🔴 february 2023 |
| fast-bunkai | 📥 108 | 📦 3k | ⭐ 65 | 🟡 october 2025 |
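For comparison with the tools above, a naive punctuation-based splitter can be sketched in a few lines of standard-library Python. This is only a baseline under the assumption that every 。, ！ or ？ ends a sentence; the name `split_sentences` is hypothetical. It will mis-split text where those marks appear inside quotations such as 「…。」, which is the kind of case dedicated sentence boundary tools are built to handle.

```python
import re

# A sentence = a run of non-terminator characters followed by any terminators.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text: str) -> list[str]:
    """Naive splitter: cut after runs of 。！？!? (hypothetical baseline, not a library API)."""
    return [m.strip() for m in _SENTENCE.findall(text) if m.strip()]
```

Consecutive terminators stay attached to the sentence they close, so `split_sentences("本当！？そうです")` returns `["本当！？", "そうです"]`.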
Libraries that detect emotions or polarity in text
- oseti - Dictionary based Sentiment Analysis for Japanese
- negapoji - Japanese negative/positive classification of documents.
- pymlask - Emotion analyzer for Japanese text
- asari - Japanese sentiment analyzer implemented in Python.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| oseti | 📥 314 | 📦 166k | ⭐ 97 | 🟡 august 2025 |
| negapoji | - | - | ⭐ 151 | 🔴 august 2017 |
| pymlask | 📥 81 | 📦 66k | ⭐ 116 | 🔴 july 2024 |
| asari | 📥 180 | 📦 79k | ⭐ 152 | 🔴 october 2022 |
Libraries that automatically translate text between languages
- jparacrawl-finetune - An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
- JASS - JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation (LREC2020) & Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation (ACM TALLIP)
- PheMT - A phenomenon-wise evaluation dataset for Japanese-English machine translation robustness. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena; Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. COLING 2020.
- VISA - An ambiguous subtitles dataset for visual scene-aware machine translation
- plamo-translate-cli - A command-line interface for translation using the plamo-2-translate model with local execution.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| jparacrawl-finetune | - | - | ⭐ 105 | 🔴 april 2021 |
| JASS | - | - | ⭐ 16 | 🔴 january 2022 |
| PheMT | - | - | ⭐ 18 | 🔴 february 2021 |
| VISA | - | - | ⭐ 14 | 🔴 october 2022 |
| plamo-translate-cli | - | - | ⭐ 325 | 🟡 october 2025 |
Libraries that extract names of people, places, and organizations from text
- namaco - Character Based Named Entity Recognition.
- entitypedia - Entitypedia is an Extended Named Entity Dictionary from Wikipedia.
- noyaki - Converts character span label information to tokenized text-based label information.
- bert-japanese-ner-finetuning - Code to perform finetuning of the BERT model. A sample that builds and uses a named entity recognition model by fine-tuning BERT.
- joint-information-extraction-hs - Code for inferring the extraction accuracy of named entities and relations from a case-report corpus built on detailed annotation guidelines
- pygeonlp - pygeonlp, A python module for geotagging Japanese texts.
- bert-ner-japanese - A program for fine-tuning BERT for Japanese named entity recognition
- huggingface-finetune-japanese - Examples to finetune encoder-only and encoder-decoder transformers for Japanese language (Hugging Face) Resources
- novelanalysisbyner - Named entity recognition by fine-tuning BERT
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| namaco | - | - | ⭐ 40 | 🔴 february 2018 |
| entitypedia | - | - | ⭐ 13 | 🔴 december 2018 |
| noyaki | 📥 100 | 📦 20k | ⭐ 5 | 🔴 august 2022 |
| bert-japanese-ner-finetuning | - | - | ⭐ 11 | 🔴 june 2022 |
| joint-information-extraction-hs | - | - | ⭐ 1 | 🔴 november 2021 |
| pygeonlp | 📥 237 | 📦 21k | ⭐ 22 | 🟡 october 2025 |
| bert-ner-japanese | - | - | ⭐ 5 | 🔴 september 2022 |
| huggingface-finetune-japanese | - | - | ⭐ 16 | 🔴 october 2023 |
| novelanalysisbyner | - | - | ⭐ 2 | 🔴 june 2024 |
Libraries that recognize and extract text from images
- Manga OCR - Optical character recognition for Japanese text, with the main focus being Japanese manga
- mokuro - Read Japanese manga inside browser with selectable text.
- handwritten-japanese-ocr - Handwritten Japanese OCR demo using touch panel to draw the input text using Intel OpenVINO toolkit
- OCR_Japanease - Japanese OCR
- ndlocr_cli - NDLOCR application
- donut - Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
- JMTrans - manga translator - get japanese manga from url to translate manga image
- Kindai-OCR - OCR system for recognizing modern Japanese magazines
- text_recognition - Text recognition module for NDLOCR
- Poricom - Optical character recognition in manga images. Manga OCR desktop application
- owocr - Optical character recognition for Japanese text
- yomitoku - Yomitoku is an AI-powered document image analysis package designed specifically for the Japanese language.
- findtextcenternet - Japanese OCR with CenterNet
- simple-ocr-for-manga - A simple OCR for manga (Japanese traditional and Japanese vertical)
- jp-ocr-evaluation - Evaluates OCR performance on images of Japanese text
- paddleocr-vl-sft-for-japanese-manga-on-rtx-3060 - Fine-tune PaddleOCR-VL on the Manga109s dataset for Japanese manga text recognition. The base model struggles with vertical Japanese text reading order in manga. After fine-tuning, the model correctly handles manga-specific text layouts.
- MangaOCR - A lightweight OCR model for Japanese text, especially in Manga
- meikiocr - high-speed, high-accuracy, local ocr for japanese video games
- meikipop - universal japanese ocr popup dictionary for windows, linux and macos
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| manga-ocr | 📥 4k | 📦 240k | ⭐ 2.5k | 🟡 june 2025 |
| mokuro | 📥 759 | 📦 89k | ⭐ 1.5k | 🟡 june 2025 |
| handwritten-japanese-ocr | - | - | ⭐ 38 | 🔴 april 2022 |
| OCR_Japanease | - | - | ⭐ 244 | 🔴 april 2021 |
| ndlocr_cli | - | - | ⭐ 567 | 🟡 september 2025 |
| donut | 📥 342 | 📦 196k | ⭐ 6.8k | 🔴 july 2023 |
| JMTrans | - | - | ⭐ 88 | 🔴 january 2021 |
| Kindai-OCR | - | - | ⭐ 153 | 🔴 july 2023 |
| text_recognition | - | - | ⭐ 8 | 🔴 july 2023 |
| Poricom | - | - | ⭐ 415 | 🔴 june 2023 |
| owocr | - | - | ⭐ 194 | 🟢 last friday |
| yomitoku | 📥 1k | 📦 75k | ⭐ 1.3k | 🟢 january |
| findtextcenternet | - | - | ⭐ 55 | 🟡 august 2025 |
| simple-ocr-for-manga | - | - | ⭐ 7 | 🔴 repo not found |
| jp-ocr-evaluation | - | - | ⭐ 1 | 🔴 march 2024 |
| paddleocr-vl-sft-for-japanese-manga-on-rtx-3060 | - | - | ⭐ 7 | 🟢 december 2025 |
| MangaOCR | - | - | ⭐ 34 | 🔴 may 2024 |
| meikiocr | 📥 707 | 📦 15k | ⭐ 55 | 🟢 january |
| meikipop | - | - | ⭐ 206 | 🟢 last saturday |
Libraries that utilize pretrained models to improve accuracy and efficiency
- JGLUE - JGLUE: Japanese General Language Understanding Evaluation
- ginza-transformers - Use custom tokenizers in spacy-transformers
- t5_japanese_dialogue_generation - Dialogue generation with T5
- japanese_text_classification - To investigate various DNN text classifiers including MLP, CNN, RNN, BERT approaches.
- Japanese-BERT-Sentiment-Analyzer - Deploying sentiment analysis server with FastAPI and BERT
- jmlm_scoring - Masked Language Model-based Scoring for Japanese and Vietnamese
- allennlp-shiba-model - AllenNLP integration for Shiba: Japanese CANINE model
- evaluate_japanese_w2v - script to evaluate pre-trained Japanese word2vec model on Japanese similarity dataset
- gector-ja - BERT-based GEC tagging for Japanese
- Japanese-BPEEncoder - Japanese-BPEEncoder
- Japanese-BPEEncoder_V2 - Japanese-BPEEncoder Version 2
- transformer-copy - Japanese grammatical error correction tool
- japanese-stable-diffusion - Japanese Stable Diffusion is a Japanese specific latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
- nagisa_bert - A BERT model for nagisa
- prefix-tuning-gpt - Example code for prefix-tuning GPT/GPT-NeoX models and for inference with trained prefixes
- JGLUE-benchmark - Training and evaluation scripts for JGLUE, a Japanese language understanding benchmark
- jptranstokenizer - Japanese Tokenizer for transformers library
- jp-stable - JP Language Model Evaluation Harness
- compare-ja-tokenizer - How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese-ACL SRW 2023
- lm-evaluation-harness-jp-stable - A framework for few-shot evaluation of autoregressive language models.
- llm-lora-classification - llm-lora-classification
- rinna_gpt-neox_ggml-lora - The repository contains scripts and merge scripts that have been modified to adapt an Alpaca-Lora adapter for LoRA tuning when assuming the use of the "rinna/japanese-gpt-neox..." [gpt-neox] model converted to ggml.
- japanese-llm-roleplay-benchmark - This repository was created to evaluate the character role-play performance of Japanese LLMs.
- japanese-llm-ranking - This repository supports YuzuAI's Rakuda leaderboard of Japanese LLMs, which is a Japanese-focused analogue of LMSYS' Vicuna eval.
- llm-jp-eval - A tool for automatically evaluating Japanese large language models across multiple datasets.
- llm-jp-sft - This repository contains the code for supervised fine-tuning of LLM-jp models.
- llm-jp-tokenizer - A repository collecting tokenizer-related resources for the LLMs developed by the LLM study group (LLM-jp).
- japanese-lm-fin-harness - Japanese Language Model Financial Evaluation Harness
- ja-vicuna-qa-benchmark - Japanese Vicuna QA Benchmark
- swallow-evaluation - Swallow project: evaluation scripts for large language models
- swallow-evaluation-instruct - Swallow project: evaluation framework for post-trained large language models
- pretrained_doc2vec_ja - pretrained doc2vec models on Japanese Wikipedia
- pl-bert-ja - A repository of Japanese Phoneme-Level BERT
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| JGLUE | - | - | ⭐ 333 | 🟡 march 2025 |
| ginza-transformers | 📥 790 | 📦 173k | ⭐ 16 | 🔴 august 2022 |
| t5_japanese_dialogue_generation | - | - | ⭐ 3 | 🔴 november 2021 |
| japanese_text_classification | - | - | ⭐ 9 | 🔴 january 2020 |
| Japanese-BERT-Sentiment-Analyzer | - | - | ⭐ 2 | 🔴 april 2021 |
| jmlm_scoring | - | - | ⭐ 5 | 🔴 february 2022 |
| allennlp-shiba-model | 📥 137 | 📦 20k | ⭐ 12 | 🔴 june 2021 |
| evaluate_japanese_w2v | - | - | ⭐ 12 | 🔴 november 2024 |
| gector-ja | - | - | ⭐ 19 | 🔴 june 2021 |
| Japanese-BPEEncoder | - | - | ⭐ 41 | 🔴 september 2021 |
| Japanese-BPEEncoder_V2 | - | - | ⭐ 41 | 🔴 january 2023 |
| transformer-copy | - | - | ⭐ 29 | 🔴 september 2020 |
| japanese-stable-diffusion | - | - | ⭐ repo not found | 🔴 repo not found |
| nagisa_bert | 📥 923 | 📦 56k | ⭐ 5 | 🔴 december 2023 |
| prefix-tuning-gpt | - | - | ⭐ repo not found | 🔴 repo not found |
| JGLUE-benchmark | - | - | ⭐ 18 | 🟢 february |
| jptranstokenizer | 📥 115 | 📦 27k | ⭐ 5 | 🔴 february 2024 |
| jp-stable | - | - | ⭐ 154 | 🔴 november 2023 |
| compare-ja-tokenizer | - | - | ⭐ 6 | 🔴 june 2023 |
| lm-evaluation-harness-jp-stable | - | - | ⭐ 1 | 🔴 june 2023 |
| llm-lora-classification | - | - | ⭐ 98 | 🔴 july 2023 |
| rinna_gpt-neox_ggml-lora | - | - | ⭐ 18 | 🔴 may 2023 |
| japanese-llm-roleplay-benchmark | - | - | ⭐ 40 | 🔴 november 2023 |
| japanese-llm-ranking | - | - | ⭐ 50 | 🔴 march 2024 |
| llm-jp-eval | - | - | ⭐ 147 | 🟢 december 2025 |
| llm-jp-sft | - | - | ⭐ 62 | 🔴 june 2024 |
| llm-jp-tokenizer | - | - | ⭐ 45 | 🔴 february 2025 |
| japanese-lm-fin-harness | - | - | ⭐ 77 | 🟢 january |
| ja-vicuna-qa-benchmark | - | - | ⭐ 33 | 🔴 june 2024 |
| swallow-evaluation | - | - | ⭐ 23 | 🟡 september 2025 |
| swallow-evaluation-instruct | - | - | ⭐ 24 | 🟡 october 2025 |
| pretrained_doc2vec_ja | - | - | ⭐ 25 | 🔴 january 2019 |
| pl-bert-ja | - | - | ⭐ 22 | 🔴 december 2023 |
General-purpose tools supporting Japanese language processing
- namedivider-python - A tool for dividing the Japanese full name into a family name and a given name.
- asa-python - A curated list of resources dedicated to Python libraries of NLP for Japanese
- python_asa - Python version of the Japanese semantic role labeling system (ASA)
- toiro - A comparison tool of Japanese tokenizers
- ja-timex - A rule-based analyzer that extracts and normalizes temporal expressions written in natural language
- JapaneseTokenizers - A set of metrics for feature selection from text data
- daaja - This repository has implementations of data augmentation for NLP for Japanese.
- accel-brain-code - The purpose of this repository is to make prototypes as case study in the context of proof of concept(PoC) and research and development(R&D) that I have written in my website. The main research topics are Auto-Encoders in relation to the representation learning, the statistical machine learning for energy-based models, adversarial generation net…
- kyoto-reader - A processor for KyotoCorpus, KWDLC, and AnnotatedFKCCorpus
- nlplot - Visualization Module for Natural Language Processing
- rake-ja - Rapid Automatic Keyword Extraction algorithm for Japanese
- jel - Japanese Entity Linker.
- MedNER-J - Latest version of MedEX/J (Japanese disease name extractor)
- zunda-python - Zunda: Japanese Enhanced Modality Analyzer client for Python.
- AIO2_DPR_baseline - https://www.nlp.ecei.tohoku.ac.jp/projects/aio/
- showcase - A PyTorch implementation of the Japanese Predicate-Argument Structure (PAS) analyser presented in the paper of Matsubayashi & Inui (2018) with some improvements.
- darts-clone-python - Darts-clone python binding
- jrte-corpus_example - Example codes for Japanese Realistic Textual Entailment Corpus
- desuwa - Feature annotator to morphemes and phrases based on KNP rule files (pure-Python)
- HotPepperGourmetDialogue - Restaurant Search System through Dialogue in Japanese.
- nlp-recipes-ja - Samples codes for natural language processing in Japanese
- Japanese_nlp_scripts - Small example scripts for working with Japanese texts in Python
- DNorm-J - Japanese version of DNorm
- pyknp-eventgraph - EventGraph is a development platform for high-level NLP applications in Japanese.
- ishi - Ishi: A volition classifier for Japanese
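- python-npylm - Unsupervised morphological analysis with a Bayesian hierarchical language model
- python-npycrf - Semi-supervised morphological analysis integrating conditional random fields with a Bayesian hierarchical language model
- unsupervised-pos-tagging - Unsupervised part-of-speech tagging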
- negima - Negima is a Python package to extract phrases in Japanese text by using the part-of-speeches based rules you defined.
- YouyakuMan - Extractive summarizer using BertSum as summarization model
- japanese-numbers-python - A parser for Japanese number (Kanji, arabic) in the natural language.
- kantan - Lookup japanese words by radical patterns
- make-meidai-dialogue - Get Japanese dialogue corpus
- japanese_summarizer - A summarizer for Japanese articles.
- chirptext - ChirpText is a collection of text processing tools for Python.
- yubin - Japanese Address Munger
- jawiki-cleaner - Japanese Wikipedia Cleaner
- japanese2phoneme - A python library to convert Japanese to phoneme.
- anlp_nlp2021_d3-1 - This repository contains codes related to the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification"
- aozora_classification - This project aims to classify Japanese sentence to how well similar to some Japanese classical writers, such as Soseki Natsume, Ogai Mori, Ryunosuke Akutagawa and so on.
- aozora-corpus-generator - Generates plain or tokenized text files from the Aozora Bunko
- JLM - A fast LSTM Language Model for large vocabulary language like Japanese and Chinese
- NTM - Testing of Neural Topic Modeling for Japanese articles
- EN-JP-ML-Lexicon - This is a English-Japanese lexicon for Machine Learning and Deep Learning terminology.
- text-generation - Easy-to-use scripts to fine-tune GPT-2-JA with your own texts, to generate sentences, and to tweet them automatically.
- chainer_nic - Neural Image Caption (NIC) on chainer, its pretrained models on English and Japanese image caption datasets.
- unihan-lm - The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020
- mbart-finetuning - Code to perform finetuning of the mBART model.
- xvector_jtubespeech - xvector model on jtubespeech
- TinySegmenterMaker - A tool for training your own models for TinySegmenter.
- Grongish - A script for converting between Japanese and the Grongi language
- WordCloud-Japanese - A script that renders Japanese text in WordCloud with morphological-analysis-like segmentation, without using MeCab (a morphological analysis engine)
- snark - A DB access library using the Japanese WordNet
- toEmoji - Something that converts Japanese sentences into emoji-only sentences
- termextract - A practice implementation of a technical term extraction algorithm
- JDT-with-KenLM-scoring - Scores response candidates from Japanese-Dialog-Transformer with a KenLM N-gram language model, then filters or reranks them.
- mixture-of-unigram-model - Mixture of Unigram Model and Infinite Mixture of Unigram Model in Python.
- hidden-markov-model - Hidden Markov Model (HMM) and Infinite Hidden Markov Model (iHMM) in Python.
- Ngram-language-model - Ngram language model in Python.
- ASRDeepSpeech - Automatic Speech Recognition with deepspeech2 model in pytorch with support from Zakuro AI.
- neural_ime - Neural IME: Neural Input Method Engine
- neural_japanese_transliterator - Can neural networks transliterate Romaji into Japanese correctly?
- tinysegmenter - tokenizer specified for Japanese
- AugLy-jp - Data Augmentation for Japanese Text on AugLy
- furigana4epub - A Python script for adding furigana to Japanese epub books using Mecab and Unidic.
- PyKatsuyou - Japanese verb/adjective inflections tool
- jageocoder - Pure Python Japanese address geocoder
- pygeonlp - pygeonlp, A python module for geotagging Japanese texts.
- nksnd - New kana-kanji conversion engine
- JaMIE - A Japanese Medical Information Extraction Toolkit
- fasttext-vs-word2vec-on-twitter-data - A comparison of fastText and word2vec, with execution and training scripts
- minimal-search-engine - A minimal search engine / PageRank / tf-idf
- 5ch-analysis - Scrapes 5ch archives to track words that were once popular (e.g., 香具師, orz)
- tweet_extructor - Tweet downloader for the Twitter Japanese reputation analysis dataset
- japanese-word-aggregation - Aggregating Japanese words based on Juman++ and ConceptNet5.5
- jinf - A Japanese inflection converter
- kwja - A unified language analyzer for Japanese
- mlm-scoring-transformers - Reproduced package based on Masked Language Model Scoring (ACL2020).
- ClipCap-for-Japanese - [PyTorch] ClipCap for Japanese
- SAT-for-Japanese - [PyTorch] Show, Attend and Tell for Japanese
- cihai - Python library for CJK (Chinese, Japanese, and Korean) language dictionary
- marine - MARINE : Multi-task leaRnIng-based JapaNese accent Estimation
- whisper-asr-finetune - Finetuning Whisper ASR model
- japanese_chatbot - A PyTorch Implementation of japanese chatbot using BERT and Transformer's decoder
- radicalchar - A library for normalizing radical characters
- akaza - Yet another Japanese IME for IBus/Linux
- posuto - Japanese postal code data.
- tacotron2-japanese - Tacotron2 implementation of Japanese
- ibus-hiragana - Hiragana IME for IBus
- furiganapad - Furigana pad
- chikkarpy - Japanese synonym library
- ja-tokenizer-docker-py - Mecab + NEologd + Docker + Python3
- JapaneseEmbeddingEval - JapaneseEmbeddingEval
- gptuber-by-langchain - GPT plays a YouTuber
- shuwa - Extend GNOME On-Screen Keyboard for Input Methods
- japanese-nli-model - This repository provides the code for Japanese NLI model, a fine-tuned masked language model.
- tra-fugu - A tool for Japanese-English translation and English-Japanese translation by using FuguMT
- fugumt - A translation environment that uses the machine translation engine published on ぷるーふおぶこんせぷと ("Proof of Concept"). It can translate text entered into a form as well as PDFs.
- JaSPICE - JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models
- Retrieval-based-Voice-Conversion-WebUI-JP-localization - jp-localization
- pyopenjtalk - Python wrapper for OpenJTalk
- yomigana-ebook - Make learning Japanese easier by adding readings for every kanji in the eBook
- N46Whisper - Whisper based Japanese subtitle generator
- japanese_llm_simple_webui - A simple web interface for Japanese-capable LLMs (large language models) such as Rinna-3.6B and OpenCALM
- pdf-translator - pdf-translator translates English PDF files into Japanese, preserving the original layout.
- japanese_qa_demo_with_haystack_and_es - A sample Japanese question-answering system using Haystack + Elasticsearch + wikipedia(ja)
- mozc-devices - Automatically exported from code.google.com/p/mozc-morse
- natsume - A Japanese text frontend processing toolkit
- vits-japros-webui - Gradio WebUI for training and speech synthesis with Japanese TTS (VITS)
- ja-law-parser - A Japanese law parser
- dictation-kit - Japanese dictation kit using Julius
- julius4seg - A segmentation support tool using Julius
- voicevox_engine - The speech synthesis engine of VOICEVOX, a free, medium-quality text-to-speech software
- LLaVA-JP - LLaVA-JP is a Japanese VLM trained by LLaVA method
- RAG-Japanese - Open source RAG with Llama Index for Japanese LLM in low resource setting
- bertjsc - Japanese Spelling Error Corrector using BERT (Masked-Language Model); BERT-based Japanese proofreading
- llm-leaderboard - Project of llm evaluation to Japanese tasks
- jglue-evaluation-scripts - Training and evaluation scripts for JGLUE, a Japanese language understanding benchmark
- BLIP2-Japanese - Modifying LAVIS' BLIP2 Q-former with models pretrained on Japanese datasets.
- wikipedia-passages-jawiki-embeddings-utils - Scripts and utilities for converting Japanese Wikipedia text into various Japanese embeddings and faiss indexes
- simple-simcse-ja - Exploring Japanese SimCSE
- wikipedia-japanese-open-rag - A Gradio-based RAG sample that answers user questions based on Japanese Wikipedia articles
- gpt4-autoeval - Scripts for automatically evaluating language model responses using GPT-4
- t5-japanese - Japanese T5 model
- japanese_llm_eval - A repo for evaluating Japanese LLMs
- jmteb - The evaluation scripts of JMTEB (Japanese Massive Text Embedding Benchmark)
- pydomino - A tool for aligning phoneme labels to Japanese speech
- easynovelassistant - A simple novel-generation assistant powered by LightChatAssistant-TypeB, a lightweight local Japanese LLM with no restrictions or censorship. Its "Generate forever" mode, a perk of running locally, keeps generating so that good results pile up. Also supports reading text aloud.
- clip-japanese - Sample code for QLoRA instruction tuning on Japanese datasets
- rime-jaroomaji - Japanese rōmaji input schema for Rime IME
- deep-question-generation - Automatic quiz generation using deep learning (Japanese T5 model)
- magpie-nemotron - Code for building a synthetic dialogue dataset using the Magpie method and Nemotron-4-340B-Instruct
- qlora_ja - Sample code for QLoRA instruction tuning on Japanese datasets
- mozcdic-ut-jawiki - Mozc UT Jawiki Dictionary is a dictionary generated from the Japanese Wikipedia for Mozc.
- shisa-v2 - Japanese / English Bilingual LLM
- llm-translator - Mixtral-based Ja-En (En-Ja) Translation model
- llm-jp-asr - Code for training a speech recognition model in which Whisper's decoder is replaced with llm-jp-1.3b-v1.0
- rag-japanese - Open source RAG with Llama Index for Japanese LLM in low resource setting
- monaka - A Japanese Parser (including historical Japanese)
- jp-translate.cloud - A state-of-the-art open-source Japanese <--> English machine translation system based on the latest NMT research.
- substring-word-finder - Determines whether contiguous substrings are words
- heron-vlm-leaderboard - This project is a benchmarking tool for evaluating and comparing the performance of various Vision Language Models (VLMs). It uses two datasets: LLaVA-Bench-In-the-Wild and Japanese HERON Bench to measure model performance.
- text2dataset - Easily turn large English text datasets into Japanese text datasets using open LLMs.
- mecab-web-api - A Japanese morphological analysis web API using MeCab
- mecab_controller - Mecab wrapper to generate furigana readings.
- vits - Text-to-speech reader and voice changer based on VITS
- akari_chatgpt_bot - A chatbot app that converses using speech recognition, text generation, and speech synthesis
- kudasai - Streamlining Japanese-English Translation with Advanced Preprocessing and Integrated Translation Technologies
- mecab-visualizer - A tool for visualizing MeCab morphological analysis results
- add-dictionary - An app for adding entries to the OpenJTalk user dictionary through a GUI
- j-moshi - J-Moshi: A Japanese Full-duplex Spoken Dialogue System
- jatts - JATTS: Japanese TTS (for research)
- tsukasa-speech - a Frontier Japanese Speech Generation net
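- symptom-expression-search - An experiment in semantic-structure search that absorbs variation in patient expressions, using Elasticsearch, GiNZA, and a patient expression dictionary
- llm-jp-judge - A Python tool for automatic evaluation of generated text
- asagi-vlm-colaboratory-sample - A sample for trying out Asagi (a large-scale Japanese VLM that leverages synthetic datasets) on Colaboratory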
- llm-jp-eval-mm - This tool automatically evaluates Japanese multi-modal large language models across multiple datasets.
- manga109api - Simple python API to read annotation data of Manga109
- fastrtc-jp - An add-on kit providing Japanese TTS and STT for fastrtc
- whisper-transcription - A speech transcription tool using the Whisper model in Python
- pocket-researcher - An autonomous research agent powered by LLMs, for easy information gathering and quick overviews
- jtransbench - A tool to easily benchmark Japanese translation skills
- easyllasa - EasyLlasa is a TSTS (TextSpeechToSpeech) tool that generates Japanese speech from 5 to 15 seconds of Japanese audio and Japanese text
- kanjikana-model - A model for matching the kanji and kana forms of personal names
- deep-openreview-research-ja - A Japanese-capable AI agent that automatically discovers and analyzes OpenReview papers
- pitchbench - Experimental Japanese pitch accent based LLM Benchmark
- mini-transformer-from-scratch - English to Japanese Transformer from scratch
- vv_core_inference - Inference code for the deep learning models used inside the VOICEVOX core
- pyopenjtalk-plus - pyopenjtalk-plus: A Python wrapper for OpenJTalk with additional improvements
- japanese_spelling_correction - Japanese Spelling Correction
- py-kaomoji - python kaomoji
- llm-jp-vila - This repository contains the code for training llm-jp/llm-jp-3-vila-14b, modified from VILA repository.
- kanjivg-radical - kanjivg-radical
- japanese-wordnet-visualization - This project visualizes the Japanese Wordnet (日本語ワードネット) with a web application built with Django
- piper-plus - Enhanced Piper TTS with Japanese support, WebAssembly, multi-GPU training, and quality improvements.
- Japanera - Easy Tools for Japanese Era System
- bert-abstractive-text-summarization - Japanese Sentence Summarization with BERT
- kyujipy - A Python library to convert Japanese texts from Shinjitai (新字体) to Kyujitai (舊字體) and vice versa
- jitenbot - Web crawler for creating personal copies of Japanese dictionaries
- ja-icd10 - A Python package for handling Japanese-language information on ICD-10 (International Classification of Diseases)
- pl-bert-vits2 - VITS2 using Phoneme-Level Japanese BERT
- ndc_predictor - Machine learning model for NDCPredictor (a pretrained fastText model that predicts the Nippon Decimal Classification from bibliographic information)
- pfmt-bench-fin-ja - pfmt-bench-fin-ja: Preferred Multi-turn Benchmark for Finance in Japanese
- marine-plus - MARINE : Multi-task leaRnIng-based JapaNese accent Estimation (also supports Windows)
- ja-tokenizer-benchmark - Compare the speed of various Japanese tokenizers in Python.
- yat - yat: Yet Another Tokenizer for Japanese NLP
- igakuqa119 - Evaluating LLMs on the 119th Japanese Medical Licensing Examination
- japanese-luw-tokenizer - Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers
- ibus-jig - ibus-jig: Japanese-language Input-method using GPT-4
- jp-stopword-filter - A lightweight Python library designed to filter stopwords from Japanese text based on customizable rules.
- yasumail - Synthetic Japanese business email generator for ML training data
- himotoki - A Python-based Japanese Tokenizer, Dictionary, Morphological Analyzer and Romanization Tool. Based on JMDict for Language Learning.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| namedivider-python | 📥 475 | 📦 78k | ⭐ 251 | 🟡 november 2025 |
| asa-python | 📥 106 | 📦 30k | ⭐ 11 | 🔴 february 2019 |
| python_asa | - | - | ⭐ 22 | 🔴 january 2020 |
| toiro | 📥 69 | 📦 26k | ⭐ 121 | 🟡 november 2025 |
| ja-timex | 📥 440 | 📦 89k | ⭐ 140 | 🔴 november 2023 |
| JapaneseTokenizers | - | - | ⭐ 137 | 🔴 march 2019 |
| daaja | 📥 113 | 📦 25k | ⭐ 64 | 🔴 february 2023 |
| accel-brain-code | 📥 346 | 📦 148k | ⭐ 320 | 🔴 december 2023 |
| JGLUE | - | - | ⭐ 333 | 🟡 march 2025 |
| kyoto-reader | 📥 518 | 📦 49k | ⭐ 10 | 🔴 june 2024 |
| nlplot | 📥 163 | 📦 108k | ⭐ 240 | 🔴 september 2022 |
| rake-ja | - | - | ⭐ 21 | 🔴 october 2018 |
| jel | 📥 45 | 📦 8k | ⭐ 11 | 🔴 july 2021 |
| MedNER-J | - | - | ⭐ 18 | 🔴 may 2022 |
| zunda-python | 📥 23 | 📦 6k | ⭐ 10 | 🔴 november 2019 |
| AIO2_DPR_baseline | - | - | ⭐ 16 | 🔴 january 2022 |
| showcase | 📥 32 | 📦 7k | ⭐ 6 | 🔴 june 2018 |
| darts-clone-python | 📥 3k | 📦 9M | ⭐ 20 | 🔴 april 2022 |
| jrte-corpus_example | - | - | ⭐ 3 | 🔴 november 2021 |
| desuwa | 📥 59 | 📦 10k | ⭐ 6 | 🔴 may 2022 |
| HotPepperGourmetDialogue | - | - | ⭐ 278 | 🔴 may 2016 |
| nlp-recipes-ja | - | - | ⭐ 65 | 🔴 april 2021 |
| Japanese_nlp_scripts | - | - | ⭐ 26 | 🔴 june 2019 |
| DNorm-J | - | - | ⭐ 9 | 🔴 june 2022 |
| pyknp-eventgraph | 📥 130 | 📦 65k | ⭐ 9 | 🔴 september 2022 |
| ishi | 📥 38 | 📦 6k | ⭐ 2 | 🔴 may 2020 |
| python-npylm | - | - | ⭐ 34 | 🔴 january 2019 |
| python-npycrf | - | - | ⭐ 11 | 🔴 march 2018 |
| unsupervised-pos-tagging | - | - | ⭐ 16 | 🔴 october 2017 |
| negima | 📥 39 | 📦 16k | ⭐ 14 | 🔴 august 2018 |
| YouyakuMan | - | - | ⭐ 52 | 🔴 september 2020 |
| japanese-numbers-python | 📥 351 | 📦 2M | ⭐ 21 | 🔴 april 2020 |
| kantan | - | - | ⭐ 8 | 🔴 october 2024 |
| make-meidai-dialogue | - | - | ⭐ 40 | 🔴 september 2017 |
| japanese_summarizer | - | - | ⭐ 10 | 🔴 august 2022 |
| chirptext | 📥 1k | 📦 192k | ⭐ 7 | 🔴 october 2022 |
| yubin | 📥 23 | 📦 3k | ⭐ 3 | 🔴 october 2019 |
| jawiki-cleaner | 📥 265 | 📦 24k | ⭐ 6 | 🔴 february 2021 |
| japanese2phoneme | 📥 41 | 📦 4k | ⭐ 1 | 🔴 february 2022 |
| anlp_nlp2021_d3-1 | - | - | ⭐ 1 | 🔴 march 2022 |
| aozora_classification | - | - | ⭐ 11 | 🔴 september 2017 |
| aozora-corpus-generator | - | - | ⭐ 8 | 🟡 june 2025 |
| JLM | - | - | ⭐ 111 | 🔴 june 2019 |
| NTM | - | - | ⭐ 13 | 🔴 july 2019 |
| EN-JP-ML-Lexicon | - | - | ⭐ 40 | 🔴 march 2021 |
| text-generation | - | - | ⭐ 19 | 🟡 august 2025 |
| chainer_nic | - | - | ⭐ 17 | 🔴 december 2018 |
| unihan-lm | - | - | ⭐ 2 | 🔴 november 2020 |
| mbart-finetuning | - | - | ⭐ 3 | 🔴 october 2021 |
| xvector_jtubespeech | - | - | ⭐ 47 | 🔴 november 2023 |
| TinySegmenterMaker | - | - | ⭐ 72 | 🔴 september 2022 |
| Grongish | - | - | ⭐ 25 | 🟢 december 2025 |
| WordCloud-Japanese | - | - | ⭐ 9 | 🔴 january 2020 |
| snark | - | - | ⭐ 11 | 🔴 march 2020 |
| toEmoji | - | - | ⭐ 4 | 🔴 april 2018 |
| termextract | - | - | ⭐ 18 | 🔴 september 2018 |
| JDT-with-KenLM-scoring | - | - | ⭐ 1 | 🔴 july 2022 |
| mixture-of-unigram-model | - | - | ⭐ 6 | 🔴 june 2017 |
| hidden-markov-model | - | - | ⭐ 5 | 🔴 june 2017 |
| Ngram-language-model | - | - | ⭐ 5 | 🔴 december 2017 |
| ASRDeepSpeech | - | - | ⭐ 69 | 🔴 september 2022 |
| neural_ime | - | - | ⭐ 67 | 🔴 december 2016 |
| neural_japanese_transliterator | - | - | ⭐ 178 | 🔴 september 2017 |
| tinysegmenter | 📥 131k | 📦 171k | ⭐ repo not found | 🔴 november 2015 |
| AugLy-jp | 📥 173 | 📦 30k | ⭐ 7 | 🔴 september 2021 |
| furigana4epub | 📥 60 | 📦 12k | ⭐ 29 | 🔴 september 2021 |
| PyKatsuyou | 📥 76 | 📦 19k | ⭐ 12 | 🟡 march 2025 |
| jageocoder | 📥 7k | 📦 320k | ⭐ 93 | 🟡 september 2025 |
| pygeonlp | 📥 237 | 📦 21k | ⭐ 22 | 🟡 october 2025 |
| nksnd | - | - | ⭐ 26 | 🔴 may 2018 |
| JaMIE | - | - | ⭐ 9 | 🔴 may 2023 |
| fasttext-vs-word2vec-on-twitter-data | - | - | ⭐ 48 | 🔴 august 2017 |
| minimal-search-engine | - | - | ⭐ 19 | 🔴 july 2019 |
| 5ch-analysis | - | - | ⭐ 75 | 🔴 november 2018 |
| tweet_extructor | - | - | ⭐ 3 | 🔴 august 2022 |
| japanese-word-aggregation | - | - | ⭐ 2 | 🔴 august 2018 |
| jinf | 📥 107 | 📦 54k | ⭐ 4 | 🔴 december 2022 |
| kwja | 📥 597 | 📦 55k | ⭐ 138 | 🟡 august 2025 |
| mlm-scoring-transformers | - | - | ⭐ 6 | 🔴 december 2022 |
| ClipCap-for-Japanese | - | - | ⭐ 12 | 🔴 october 2022 |
| SAT-for-Japanese | - | - | ⭐ 2 | 🔴 october 2022 |
| cihai | 📥 832 | 📦 209k | ⭐ 93 | 🟢 last saturday |
| marine | 📥 109 | 📦 15k | ⭐ 36 | 🔴 september 2022 |
| whisper-asr-finetune | - | - | ⭐ 32 | 🔴 december 2022 |
| japanese_chatbot | - | - | ⭐ repo not found | 🔴 repo not found |
| radicalchar | - | - | ⭐ 8 | 🔴 december 2022 |
| akaza | - | - | ⭐ 233 | 🟢 today |
| posuto | 📥 6k | 📦 653k | ⭐ 224 | 🟢 february |
| tacotron2-japanese | - | - | ⭐ 269 | 🔴 september 2022 |
| ibus-hiragana | - | - | ⭐ 78 | 🟡 november 2025 |
| furiganapad | - | - | ⭐ 19 | 🟡 april 2025 |
| chikkarpy | 📥 312 | 📦 58k | ⭐ 55 | 🔴 february 2022 |
| ja-tokenizer-docker-py | - | - | ⭐ 36 | 🔴 may 2022 |
| JapaneseEmbeddingEval | - | - | ⭐ 183 | 🔴 october 2024 |
| gptuber-by-langchain | - | - | ⭐ 63 | 🔴 january 2023 |
| shuwa | - | - | ⭐ 145 | 🔴 december 2022 |
| japanese-nli-model | - | - | ⭐ 5 | 🔴 october 2022 |
| tra-fugu | - | - | ⭐ 6 | 🔴 march 2023 |
| fugumt | - | - | ⭐ 65 | 🔴 february 2021 |
| JaSPICE | 📥 35 | 📦 2k | ⭐ 9 | 🔴 november 2023 |
| Retrieval-based-Voice-Conversion-WebUI-JP-localization | - | - | ⭐ 48 | 🔴 april 2023 |
| pyopenjtalk | 📥 15k | 📦 1M | ⭐ 243 | 🟡 april 2025 |
| yomigana-ebook | 📥 39 | 📦 7k | ⭐ 25 | 🔴 february 2024 |
| N46Whisper | - | - | ⭐ 1.7k | 🔴 february 2025 |
| japanese_llm_simple_webui | - | - | ⭐ 17 | 🔴 may 2024 |
| pdf-translator | - | - | ⭐ 337 | 🔴 may 2024 |
| japanese_qa_demo_with_haystack_and_es | - | - | ⭐ 1 | 🔴 december 2022 |
| mozc-devices | - | - | ⭐ 2.7k | 🟡 november 2025 |
| natsume | 📥 0 | 📦 3k | ⭐ repo not found | 🔴 repo not found |
| vits-japros-webui | - | - | ⭐ 42 | 🔴 january 2024 |
| ja-law-parser | - | - | ⭐ 25 | 🔴 january 2024 |
| dictation-kit | - | - | ⭐ 164 | 🔴 april 2019 |
| julius4seg | - | - | ⭐ 7 | 🔴 august 2021 |
| voicevox_engine | - | - | ⭐ 1.6k | 🟢 today |
| LLaVA-JP | - | - | ⭐ 64 | 🔴 june 2024 |
| RAG-Japanese | - | - | ⭐ 10 | 🟡 may 2025 |
| bertjsc | - | - | ⭐ 14 | 🔴 august 2024 |
| llm-leaderboard | - | - | ⭐ 91 | 🟡 september 2025 |
| jglue-evaluation-scripts | - | - | ⭐ 18 | 🟢 february |
| BLIP2-Japanese | - | - | ⭐ 13 | 🟡 september 2025 |
| wikipedia-passages-jawiki-embeddings-utils | - | - | ⭐ 11 | 🔴 march 2024 |
| simple-simcse-ja | - | - | ⭐ 69 | 🔴 october 2023 |
| wikipedia-japanese-open-rag | - | - | ⭐ repo not found | 🔴 repo not found |
| gpt4-autoeval | - | - | ⭐ 16 | 🔴 june 2024 |
| t5-japanese | - | - | ⭐ 116 | 🟡 september 2025 |
| japanese_llm_eval | - | - | ⭐ 5 | 🔴 april 2024 |
| jmteb | - | - | ⭐ 84 | 🟡 november 2025 |
| pydomino | - | - | ⭐ 36 | 🟡 august 2025 |
| easynovelassistant | - | - | ⭐ 216 | 🔴 july 2024 |
| clip-japanese | - | - | ⭐ 13 | 🟡 september 2025 |
| rime-jaroomaji | - | - | ⭐ 47 | 🟢 last thursday |
| deep-question-generation | - | - | ⭐ 12 | 🔴 march 2023 |
| magpie-nemotron | - | - | ⭐ 9 | 🔴 july 2024 |
| qlora_ja | - | - | ⭐ 1 | 🔴 july 2024 |
| mozcdic-ut-jawiki | - | - | ⭐ 26 | 🟢 february |
| shisa-v2 | - | - | ⭐ 28 | 🟢 december 2025 |
| llm-translator | - | - | ⭐ 20 | 🔴 january 2025 |
| llm-jp-asr | - | - | ⭐ 9 | 🔴 september 2024 |
| rag-japanese | - | - | ⭐ 10 | 🟡 may 2025 |
| monaka | - | - | ⭐ 4 | 🔴 january 2025 |
| jp-translate.cloud | - | - | ⭐ 3 | 🔴 september 2024 |
| substring-word-finder | - | - | ⭐ 4 | 🟡 november 2025 |
| heron-vlm-leaderboard | - | - | ⭐ 6 | 🔴 december 2024 |
| text2dataset | - | - | ⭐ 26 | 🔴 january 2025 |
| mecab-web-api | - | - | ⭐ 40 | 🔴 july 2022 |
| mecab_controller | - | - | ⭐ 18 | 🟢 january |
| vits | - | - | ⭐ 91 | 🔴 february 2023 |
| akari_chatgpt_bot | - | - | ⭐ 48 | 🟡 october 2025 |
| kudasai | - | - | ⭐ 26 | 🟡 june 2025 |
| mecab-visualizer | - | - | ⭐ 2 | 🔴 september 2023 |
| add-dictionary | - | - | ⭐ 3 | 🟡 october 2025 |
| j-moshi | - | - | ⭐ 298 | 🟡 june 2025 |
| jatts | - | - | ⭐ 44 | 🟡 may 2025 |
| tsukasa-speech | - | - | ⭐ 60 | 🟡 may 2025 |
| symptom-expression-search | - | - | ⭐ 2 | 🔴 february 2021 |
| llm-jp-judge | - | - | ⭐ 38 | 🟢 december 2025 |
| asagi-vlm-colaboratory-sample | - | - | ⭐ 1 | 🟡 march 2025 |
| llm-jp-eval-mm | - | - | ⭐ 41 | 🟢 january |
| manga109api | 📥 194 | 📦 45k | ⭐ 128 | 🔴 march 2022 |
| fastrtc-jp | - | - | ⭐ 5 | 🟡 may 2025 |
| whisper-transcription | - | - | ⭐ 16 | 🟢 january |
| pocket-researcher | - | - | ⭐ 11 | 🟡 april 2025 |
| jtransbench | - | - | ⭐ 13 | 🟡 october 2025 |
| easyllasa | - | - | ⭐ 25 | 🟡 september 2025 |
| kanjikana-model | - | - | ⭐ 111 | 🟢 december 2025 |
| deep-openreview-research-ja | - | - | ⭐ 13 | 🟡 november 2025 |
| pitchbench | - | - | ⭐ 1 | 🟢 december 2025 |
| mini-transformer-from-scratch | - | - | ⭐ 2 | 🟡 november 2025 |
| vv_core_inference | - | - | ⭐ 31 | 🟢 december 2025 |
| pyopenjtalk-plus | 📥 6k | 📦 298k | ⭐ 55 | 🟡 november 2025 |
| japanese_spelling_correction | - | - | ⭐ 14 | 🔴 september 2023 |
| py-kaomoji | 📥 81 | 📦 37k | ⭐ 6 | 🔴 december 2018 |
| llm-jp-vila | - | - | ⭐ 10 | 🟡 august 2025 |
| kanjivg-radical | - | - | ⭐ 105 | 🔴 august 2018 |
| japanese-wordnet-visualization | - | - | ⭐ 3 | 🔴 november 2022 |
| piper-plus | - | - | ⭐ 29 | 🟢 last wednesday |
| Japanera | 📥 2k | 📦 342k | ⭐ 35 | 🟡 june 2025 |
| bert-abstractive-text-summarization | - | - | ⭐ 49 | 🔴 december 2019 |
| kyujipy | 📥 77 | 📦 22k | ⭐ 22 | 🟢 january |
| jitenbot | - | - | ⭐ 4 | 🔴 december 2024 |
| ja-icd10 | - | - | ⭐ 5 | 🔴 july 2021 |
| pl-bert-vits2 | - | - | ⭐ 14 | 🔴 december 2023 |
| ndc_predictor | - | - | ⭐ 11 | 🔴 august 2021 |
| pfmt-bench-fin-ja | - | - | ⭐ 9 | 🟡 march 2025 |
| marine-plus | 📥 120 | 📦 11k | ⭐ 8 | 🟡 october 2025 |
| ja-tokenizer-benchmark | - | - | ⭐ 7 | 🔴 february 2022 |
| yat | - | - | ⭐ 7 | 🔴 june 2018 |
| igakuqa119 | - | - | ⭐ 7 | 🟢 january |
| japanese-luw-tokenizer | - | - | ⭐ 6 | 🔴 december 2021 |
| ibus-jig | - | - | ⭐ 4 | 🔴 december 2023 |
| jp-stopword-filter | 📥 42 | 📦 5k | ⭐ 3 | 🔴 november 2024 |
| yasumail | - | - | ⭐ 2 | 🟢 january |
| himotoki | 📥 437 | 📦 3k | ⭐ 2 | 🟢 last friday |
High-performance libraries for Japanese morphological analysis
- mecab - Yet another Japanese morphological analyzer
- jumanpp - Juman++ (a Morphological Analyzer Toolkit)
- kytea - The Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation, etc.
- juman - Japanese Morphological Analysis System JUMAN
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| mecab | - | - | ⭐ 1.1k | 🔴 february 2025 |
| jumanpp | - | - | ⭐ 407 | 🔴 march 2023 |
| kytea | - | - | ⭐ 212 | 🔴 april 2020 |
| juman | - | - | ⭐ 11 | 🔴 december 2021 |
Libraries for dependency and syntactic parsing of Japanese sentences
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| cabocha | - | - | ⭐ 121 | 🔴 february 2025 |
| knp | - | - | ⭐ 33 | 🔴 november 2023 |
Other Japanese NLP and text processing libraries
- jsc - Joint source channel model for Japanese Kana Kanji conversion, Chinese pinyin input and CJE mixed input.
- aquaskk - An input method without morphological analysis.
- mozc - Mozc - a Japanese Input Method Editor designed for multi-platform
- trimatch - Trimatch: An (Exact|Prefix|Approximate) String Matching Library
- resembla - Resembla: Word-based Japanese similar sentence search library
- corvusskk - ▽▼ SKK-like Japanese Input Method Editor for Windows
- mozuku - An LSP server for analyzing and proofreading Japanese text
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| jsc | - | - | ⭐ 15 | 🔴 december 2012 |
| aquaskk | - | - | ⭐ 367 | 🔴 july 2023 |
| mozc | - | - | ⭐ 2.8k | 🟢 last thursday |
| trimatch | - | - | ⭐ 2 | 🟢 february |
| resembla | - | - | ⭐ 73 | 🟡 august 2025 |
| corvusskk | - | - | ⭐ 352 | 🟢 january |
| mozuku | - | - | ⭐ 402 | 🟢 december 2025 |
Fast Japanese morphological analysis crates written in Rust
- lindera - A morphological analysis library.
- vaporetto - Vaporetto: Very Accelerated POintwise pREdicTion based TOkenizer
- goya - Japanese Morphological Analysis written in Rust
- vibrato - vibrato: Viterbi-based accelerated tokenizer
- yoin - A Japanese Morphological Analyzer written in pure Rust
- mecab-rs - Safe Rust bindings for mecab, a part-of-speech and morphological analyzer library
- awabi - A morphological analyzer using mecab dictionary
- kanpyo - Japanese Morphological Analyzer written in Rust
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| lindera | - | 📦 863k | ⭐ 601 | 🟢 last tuesday |
| vaporetto | - | 📦 172k | ⭐ 251 | 🟢 february |
| goya | - | 📦 11k | ⭐ 81 | 🔴 december 2021 |
| vibrato | - | 📦 55k | ⭐ 399 | 🟢 february |
| yoin | - | 📦 3k | ⭐ 25 | 🔴 october 2017 |
| mecab-rs | - | 📦 39k | ⭐ 66 | 🔴 september 2023 |
| awabi | - | 📦 24k | ⭐ 10 | 🟡 november 2025 |
| kanpyo | - | 📦 2.5k | ⭐ 107 | 🟢 january |
Crates for script and character conversion in Japanese text
- wana_kana_rust - Utility library for checking and converting between Japanese characters - Hiragana, Katakana - and Romaji
- unicode-jp-rs - A Rust library to convert Japanese Half-width-kana[半角ｶﾅ] and Wide-alphanumeric[全角英数] into normal ones
- kana - [Mirror] CLI program for transliterating romaji text to either hiragana or katakana
- kanaria - This library provides functions such as conversion between, and detection of, hiragana and katakana and half-width and full-width characters.
- japanese-address-parser - A library that splits Japanese addresses into prefecture / municipality / town name / remainder
- yosina - Yosina is a transliteration library that deals with the letters and symbols used in Japanese writing.
- mojimoji-rs - Rust implementation of a fast converter between Japanese hankaku and zenkaku characters, mojimoji.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| wana_kana_rust | - | 📦 298k | ⭐ 88 | 🟡 march 2025 |
| unicode-jp-rs | - | 📦 61k | ⭐ 19 | 🔴 april 2020 |
| kana | - | - | ⭐ 12 | 🔴 january 2023 |
| kanaria | - | - | ⭐ 21 | 🟢 last saturday |
| japanese-address-parser | - | - | ⭐ 10 | 🟡 november 2025 |
| yosina | - | - | ⭐ 20 | 🟡 september 2025 |
| mojimoji-rs | - | - | ⭐ 4 | 🔴 november 2022 |
Libraries for Japanese full-text search and indexing
- lindera-tantivy - Lindera tokenizer for Tantivy.
- tantivy-vibrato - A Tantivy tokenizer using Vibrato.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| lindera-tantivy | - | 📦 160k | ⭐ 67 | 🟢 january |
| tantivy-vibrato | - | 📦 1.5k | ⭐ 3 | 🔴 january 2023 |
Supplementary crates for Japanese text and IME processing
- daachorse - A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.
- find-simdoc - Finding all pairs of similar documents time- and memory-efficiently
- crawdad - Rust library of natural language dictionaries using character-wise double-array tries.
- tokenizer-speed-bench - Comparison code of various tokenizers
- stringmatch-bench - Benchmark tools for comparing the performance of data structures for string matching.
- vime - Using Vim as an input method for X11 apps
- voicevox_core - The core of VOICEVOX, a free, medium-quality text-to-speech software
- akaza - Yet another Japanese IME for IBus/Linux
- Jotoba - A free online, self-hostable, multilang Japanese dictionary.
- dvorakjp-romantable - DvorakJP Roman Table for Google Japanese Input
- niinii - Japanese glossator for assisted reading of text using Ichiran
- cskk - SKK (Simple Kana Kanji henkan) library
- japanki - Learn Japanese vocabs 🇯🇵 by doing quizzes on CLI!
- jpreprocess - Japanese text preprocessor for Text-to-Speech applications (OpenJTalk rewrite in rust language)
- listup_precedent - Software that generates a list of court precedent data by scraping the Courts in Japan website (https://www.courts.go.jp/index.html)
- jisho - Jisho is a CLI tool & Rust library that provides a Japanese-English dictionary.
- kanalizer - A library that infers readings from English words.
- koharu - Automated manga translation tool with LLM, written in Rust.
- yomine - A Japanese vocabulary mining tool designed to help language learners mine new words and expressions.
- matsuba - lightweight japanese ime written in rust
- hujiang_dictionary - Japanese dictionary written in Rust, with support for Telegram bot, AWS Lambda, and Cloudflare Workers. Supports LLM and RAG search.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| daachorse | - | 📦 651k | ⭐ 242 | 🟢 january |
| find-simdoc | - | 📦 29k | ⭐ 62 | 🟡 march 2025 |
| crawdad | - | 📦 59k | ⭐ 36 | 🔴 january 2025 |
| tokenizer-speed-bench | - | - | ⭐ 4 | 🔴 march 2023 |
| stringmatch-bench | - | - | ⭐ 3 | 🔴 september 2022 |
| vime | - | - | ⭐ 230 | 🔴 november 2022 |
| voicevox_core | - | - | ⭐ 1k | 🟢 today |
| akaza | - | - | ⭐ 233 | 🟢 today |
| Jotoba | - | - | ⭐ 197 | 🔴 january 2024 |
| dvorakjp-romantable | - | - | ⭐ 55 | 🟢 february |
| niinii | - | - | ⭐ 15 | 🟢 february |
| cskk | - | - | ⭐ 79 | 🟢 january |
| japanki | - | - | ⭐ 3 | 🔴 october 2023 |
| jpreprocess | - | - | ⭐ 51 | 🟢 february |
| listup_precedent | - | - | ⭐ 5 | 🔴 february 2025 |
| jisho | - | - | ⭐ 17 | 🟢 january |
| kanalizer | - | - | ⭐ 26 | 🟡 november 2025 |
| koharu | - | - | ⭐ 635 | 🟢 last saturday |
| yomine | - | - | ⭐ 47 | 🟢 last thursday |
| matsuba | - | - | ⭐ 18 | 🔴 march 2023 |
| hujiang_dictionary | - | - | ⭐ 69 | 🟢 last tuesday |
Japanese morphological analysis libraries for browser and Node.js
- kuromoji.js - JavaScript implementation of Japanese morphological analyzer
- rakutenma - Rakuten MA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript.
- node-mecab-ya - Yet another mecab wrapper for nodejs
- juman-bin - A user-extensible morphological analyzer for Japanese (Japanese morphological analysis system)
- node-mecab-async - Asynchronous japanese morphological analyser using MeCab.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| kuromoji.js | 📥 149k/week | 📦 7.9M | ⭐ 966 | 🔴 november 2018 |
| rakutenma | 📥 11/week | 📦 879 | ⭐ 473 | 🔴 january 2015 |
| node-mecab-ya | 📥 102/week | 📦 7.2k | ⭐ 110 | 🔴 repo not found |
| juman-bin | 📥 6/week | 📦 297 | ⭐ 3 | 🔴 may 2017 |
| node-mecab-async | 📥 4.8k/week | 📦 340k | ⭐ 103 | 🔴 october 2017 |
Libraries for converting Japanese scripts and readings
- kuroshiro - Japanese language library for converting Japanese sentence to Hiragana, Katakana or Romaji with furigana and okurigana modes supported.
- kuroshiro-analyzer-kuromoji - Kuromoji morphological analyzer for kuroshiro.
- hepburn - Node.js module for converting Japanese Hiragana and Katakana script to, and from, Romaji using Hepburn romanisation
- japanese-numerals-to-number - Converts Japanese Numerals into number
- jslingua - Javascript libraries to process text: Arabic, Japanese, etc.
- WanaKana - Javascript library for detecting and transliterating Hiragana <--> Katakana <--> Romaji
- node-romaji-name - Normalize and fix common issues with Romaji-based Japanese names.
- kyujitai.js - Utility collections for making Japanese text old-fashioned
- normalize-japanese-addresses - An open-source address normalization library.
- jaconv - Japanese character conversion library (JavaScript)
- romaji-conv - Convert romaji into hiragana
- japanese-addresses-v2 - Nationwide Japanese address data API
- jptext-to-emoji - Converts words in text to emoji
- japanese.js - Util collection for Japanese text processing. Hiraganize, Katakanize, and Romanize.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| kuroshiro | 📥 11k/week | 📦 366k | ⭐ 948 | 🔴 june 2021 |
| kuroshiro-analyzer-kuromoji | 📥 11k/week | 📦 343k | ⭐ 67 | 🔴 august 2018 |
| hepburn | 📥 126k/week | 📦 3.1M | ⭐ 137 | 🟡 september 2025 |
| japanese-numerals-to-number | 📥 38k/week | 📦 2.2M | ⭐ 59 | 🔴 february 2023 |
| jslingua | 📥 192/week | 📦 7.4k | ⭐ 52 | 🔴 october 2023 |
| WanaKana | 📥 33k/week | 📦 2.1M | ⭐ 901 | 🟡 september 2025 |
| node-romaji-name | 📥 68/week | 📦 12k | ⭐ 41 | 🔴 december 2023 |
| kyujitai.js | 📥 12/week | 📦 1.1k | ⭐ 23 | 🔴 august 2020 |
| normalize-japanese-addresses | - | - | ⭐ 945 | 🟡 july 2025 |
| jaconv | - | - | ⭐ 86 | 🟡 june 2025 |
| romaji-conv | - | - | ⭐ 26 | 🟢 december 2025 |
| japanese-addresses-v2 | - | - | ⭐ 69 | 🔴 january 2025 |
| jptext-to-emoji | - | - | ⭐ 2 | 🟢 february |
| japanese.js | - | - | ⭐ 167 | 🔴 august 2020 |
Other libraries for Japanese NLP in JavaScript
- bangumi-data - Raw data for Japanese Anime
- yomichan - Japanese pop-up dictionary extension for Chrome and Firefox.
- proofreading-tool - A GUI document proofreading tool (GUI tool for textlinting).
- kanjigrid - A web-app displaying the 2200 kanji characters taught in James Heisig's "Remembering the Kanji", 6th edition.
- japanese-toolkit - Monorepo for Kanji, Furigana, Japanese DB, and others
- analyze-desumasu-dearu - A JavaScript library that analyzes whether sentences use the polite style (desu/masu) or the plain style (de aru)
- hatsuon - Japanese pitch accent utils
- sentiment_ja_js - Sentiment Analysis in Japanese. sentiment_ja with JavaScript
- mecab-ipadic-seed - mecab-ipadic seed dictionary reader
- Japanese-Word-Of-The-Day - Well, a different Japanese word everyday.
- oskim - Extend GNOME On-Screen Keyboard for Input Methods
- tweetMapping - A digital archive of geotagged tweets posted within 24 hours of the Great East Japan Earthquake.
- pitch-accent - Predict pitch accent in Japanese
- kana2ipa - A command that converts hiragana or katakana into the phonetic symbols (IPA) for their Japanese pronunciation
- voicevox - The editor of VOICEVOX, a free, medium-quality text-to-speech software
- kamiya-codec - Towards a Japanese verb conjugator and deconjugator based on Taeko Kamiya's The Handbook of Japanese Verbs and The Handbook of Japanese Adjectives and Adverbs opuses.
- closewords - A library, with support for Japanese (including kanji), that finds the most similar word within a set of words
- japanese-analyzer - Japanese Sentence Analyzer
- japanese-furigana-normalize - Normalize Japanese Furigana
- yama - acquire Japanese vocabulary on any website
- kaitai - An application for analyzing Japanese sentence structure using AI. This tool visualizes how words and phrases relate to each other, showing grammatical relationships with interactive diagrams.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| bangumi-data | 📥 1.1k/week | 📦 60k | ⭐ 592 | 🟢 february |
| yomichan | - | - | ⭐ 1.1k | 🔴 february 2023 |
| proofreading-tool | - | - | ⭐ 87 | 🟡 october 2025 |
| kanjigrid | - | - | ⭐ 44 | 🔴 november 2018 |
| japanese-toolkit | - | - | ⭐ 62 | 🔴 january 2023 |
| analyze-desumasu-dearu | 📥 83k/week | 📦 4.8M | ⭐ 18 | 🔴 january 2025 |
| hatsuon | 📥 21/week | 📦 948 | ⭐ 37 | 🔴 march 2022 |
| sentiment_ja_js | - | - | ⭐ 10 | 🔴 december 2021 |
| mecab-ipadic-seed | 📥 76/week | 📦 5.6k | ⭐ 8 | 🔴 july 2016 |
| Japanese-Word-Of-The-Day | 📥 5/week | 📦 292 | ⭐ repo not found | 🔴 repo not found |
| oskim | - | - | ⭐ 2 | 🔴 february 2023 |
| tweetMapping | - | - | ⭐ 25 | 🟢 today |
| pitch-accent | 📥 1/week | 📦 84 | ⭐ 2 | 🔴 september 2023 |
| kana2ipa | - | - | ⭐ 17 | 🔴 october 2020 |
| voicevox | - | - | ⭐ 3k | 🟢 today |
| kamiya-codec | - | - | ⭐ 20 | 🟡 may 2025 |
| closewords | - | - | ⭐ 3 | 🟡 august 2025 |
| japanese-analyzer | - | - | ⭐ 683 | 🟢 december 2025 |
| japanese-furigana-normalize | - | - | ⭐ 6 | 🔴 july 2024 |
| yama | - | - | ⭐ 7 | 🟢 last wednesday |
| kaitai | - | - | ⭐ 1 | 🟢 last saturday |
Lightweight Japanese morphological analysis libraries in Go
- kagome - Self-contained Japanese Morphological Analyzer written in pure Go
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| kagome | - | - | ⭐ 943 | 🟢 january |
Additional Go-based Japanese text processing libraries
- ojosama - Converts text into the refined "ojou-sama" speaking style of Hyakumantenbara Salome
- nihongo - Japanese Dictionary
- yomichan-import - External dictionary importer for Yomichan.
- imas-ime-dic - THE IDOLM@STER words dictionary for Japanese IME (by imas-db.jp)
- go-kakasi - Kanji transliteration to hiragana/katakana/romaji, in Go
- go-moji - A Go library for Zenkaku/Hankaku conversion
- ojichat - Generates messages like those an ojisan (middle-aged man) might send via LINE or email
- name - Name Searcher in Japanese
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| ojosama | - | - | ⭐ 388 | 🟢 last tuesday |
| nihongo | - | - | ⭐ 83 | 🔴 february 2024 |
| yomichan-import | - | - | ⭐ 85 | 🔴 february 2023 |
| imas-ime-dic | - | - | ⭐ 31 | 🟢 january |
| go-kakasi | - | - | ⭐ 6 | 🟢 december 2025 |
| go-moji | - | - | ⭐ 20 | 🔴 april 2019 |
| ojichat | - | - | ⭐ 1.3k | 🔴 october 2024 |
| name | - | - | ⭐ 11 | 🔴 january 2025 |
Japanese morphological analysis and dictionary management libraries
- kuromoji - Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
- Sudachi - A Japanese Tokenizer for Business
- SudachiDict - A lexicon for Sudachi
- meval - MevAL, a performance evaluation system for morphological analyzers
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| kuromoji | - | - | ⭐ 1k | 🔴 september 2019 |
| Sudachi | - | - | ⭐ 932 | 🔴 november 2024 |
| SudachiDict | - | - | ⭐ 277 | 🟢 january |
| meval | - | - | ⭐ 7 | 🔴 august 2019 |
Java libraries for Japanese NLP and OCR
- kanjitomo-ocr - Java library for identifying Japanese characters from images
- jakaroma - Java library and command-line tool to transliterate Japanese kanji to romaji (Latin alphabet)
- kakasi-java - Kanji transliteration to hiragana/katakana/romaji, in Java
- Kamite - A desktop language immersion companion for learners of Japanese
- react-native-japanese-tokenizer - Async Japanese Tokenizer Native Plugin for React Native for iOS and Android
- elasticsearch-analysis-japanese - Japanese analyzer that uses the Kuromoji Japanese tokenizer for Elasticsearch
- moji4j - A Java library that converts between Japanese Hiragana, Katakana, and Romaji scripts.
- neologdn-java - Japanese text normalizer for mecab-neologd
- elasticsearch-sudachi - The Japanese analysis plugin for elasticsearch
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| kanjitomo-ocr | - | - | ⭐ 203 | 🔴 may 2021 |
| jakaroma | - | - | ⭐ 67 | 🟡 june 2025 |
| kakasi-java | - | - | ⭐ 55 | 🔴 april 2016 |
| Kamite | - | - | ⭐ 129 | 🟡 march 2025 |
| react-native-japanese-tokenizer | - | - | ⭐ 38 | 🔴 june 2023 |
| elasticsearch-analysis-japanese | - | - | ⭐ 29 | 🔴 march 2012 |
| moji4j | - | - | ⭐ 33 | 🔴 june 2022 |
| neologdn-java | - | - | ⭐ 5 | 🟢 february |
| elasticsearch-sudachi | - | - | ⭐ 218 | 🟢 january |
Models that convert words into numeric vectors to capture semantic similarity
- japanese-words-to-vectors - Word2vec (word to vectors) approach for Japanese language using Gensim and Mecab.
- chiVe - Japanese word embedding with Sudachi and NWJC
- elmo-japanese - elmo-japanese
- embedrank - Python Implementation of EmbedRank
- aovec - Easy aozorabunko Word2Vec Builder: a Word2Vec builder for all Aozora Bunko books, plus prebuilt models
- dependency-based-japanese-word-embeddings - This is a repository for the AI LAB article "係り受けに基づく日本語単語埋込 (Dependency-based Japanese Word Embeddings)" (article URL: https://ai-lab.lapras.com/nlp/japanese-word-embedding/)
- jawikivec - Yet Another Japanese-Wikipedia Entity Vectors
- jawiki_word_vector_updater - Scripts that run morphological analysis with MeCab on the latest Japanese Wikipedia dump, using both the IPA dictionary and the latest NEologd dictionary, and train word2vec, fastText, and GloVe word embeddings on the results
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| japanese-words-to-vectors | - | - | ⭐ 87 | 🔴 august 2020 |
| chiVe | - | - | ⭐ 169 | 🔴 march 2024 |
| elmo-japanese | - | - | ⭐ 5 | 🔴 october 2019 |
| embedrank | - | - | ⭐ 48 | 🔴 march 2019 |
| aovec | 📥 425 | 📦 81k | ⭐ 3 | 🔴 january 2023 |
| dependency-based-japanese-word-embeddings | - | - | ⭐ 8 | 🔴 august 2019 |
| jawikivec | - | - | ⭐ 2 | 🔴 november 2018 |
| jawiki_word_vector_updater | - | - | ⭐ 11 | 🔴 may 2020 |
Models that use self-attention to understand context and perform advanced language tasks
- bert-japanese - BERT models for Japanese text.
- japanese-pretrained-models - Code for producing Japanese pretrained models provided by rinna Co., Ltd.
- bert-japanese - BERT with SentencePiece for Japanese text.
- SudachiTra - Japanese tokenizer for Transformers
- japanese-dialog-transformers - Code for evaluating Japanese pretrained models provided by NTT Ltd.
- shiba - PyTorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
- Dialog - A PyTorch implementation of a Japanese chatbot using BERT and a Transformer decoder
- language-pretraining - BERT and ELECTRA models of PyTorch implementations for Japanese text.
- medbertjp - Trials of pre-trained BERT models for the medical domain in Japanese.
- ILYS-aoba-chatbot - ILYS-aoba-chatbot
- t5-japanese - Code to pre-train Japanese T5 models
- pytorch_bert_japanese - Use a pretrained Japanese BERT model with PyTorch
- Laboro-BERT-Japanese - Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus
- RoBERTa-japanese - Japanese BERT Pretrained Model
- aMLP-japanese - aMLP Transformer Model for Japanese
- bert-japanese-aozora - Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy
- sbert-ja - Code to train Sentence BERT Japanese model for Hugging Face Model Hub
- BERT-Japan-vaccination - Official fine-tuning code for "Emotion Analysis of Japanese Tweets and Comparison to Vaccinations in Japan"
- gpt2-japanese - Japanese GPT2 Generation Model
- text2text-japanese - gpt-2 based text2text conversion model
- gpt-ja - GPT-2 Japanese model for HuggingFace's transformers
- friendly_JA-Model - MT model trained using the friendly_JA Corpus attempting to make Japanese easier/more accessible to occidental people by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
- albert-japanese - BERT with SentencePiece for Japanese text.
- ja_text_bert - Repository for building a pre-trained BERT model from the Japanese Wikipedia corpus
- DistilBERT-base-jp - A Japanese DistilBERT pretrained model, which was trained on Wikipedia.
- bert - This repository provides snippets to use RoBERTa pre-trained on a Japanese corpus. Our dataset consists of Japanese Wikipedia and web-crawled articles, 25GB in total. The released model is built based on that from HuggingFace.
- Laboro-DistilBERT-Japanese - Laboro DistilBERT Japanese
- luke - LUKE -- Language Understanding with Knowledge-based Embeddings
- GPTSAN - General-purpose Switch Transformer based Japanese language model
- japanese-clip - Japanese CLIP by rinna Co., Ltd.
- AcademicBART - We pretrained a BART-based Japanese masked language model on paper abstracts from the academic database CiNii Articles
- AcademicRoBERTa - We pretrained a RoBERTa-based Japanese masked language model on paper abstracts from the academic database CiNii Articles.
- LINE-DistilBERT-Japanese - DistilBERT model pre-trained on 131 GB of Japanese web text. The teacher model is a BERT-base model built in-house at LINE.
- Japanese-Alpaca-LoRA - Links to Low-Rank Adapters created by fine-tuning LLaMA on the Stanford Alpaca dataset translated into Japanese, plus sample generation code
- albert-japanese-tinysegmenter - Pretrained models, code, and guidance for pretraining the official ALBERT (https://github.com/google-research/albert) on Japanese Wikipedia resources
- japanese-llama-experiment - Japanese LLaMa experiment
- easylightchatassistant - EasyLightChatAssistant is an environment for easily trying out LightChatAssistant, a lightweight local Japanese model with no censorship or restrictions, using KoboldCpp.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| bert-japanese | - | - | ⭐ 543 | 🔴 march 2024 |
| japanese-pretrained-models | - | - | ⭐ repo not found | 🔴 repo not found |
| bert-japanese | - | - | ⭐ 498 | 🔴 february 2021 |
| SudachiTra | 📥 331 | 📦 159k | ⭐ 79 | 🔴 december 2023 |
| japanese-dialog-transformers | - | - | ⭐ 245 | 🔴 june 2023 |
| shiba | 📥 39 | 📦 7k | ⭐ 89 | 🔴 november 2023 |
| Dialog | - | - | ⭐ 73 | 🔴 october 2020 |
| language-pretraining | - | - | ⭐ 50 | 🔴 may 2023 |
| medbertjp | - | - | ⭐ 12 | 🔴 november 2020 |
| ILYS-aoba-chatbot | - | - | ⭐ 23 | 🔴 october 2021 |
| t5-japanese | - | - | ⭐ 40 | 🔴 september 2021 |
| pytorch_bert_japanese | - | - | ⭐ 35 | 🔴 june 2019 |
| Laboro-BERT-Japanese | - | - | ⭐ 73 | 🔴 may 2022 |
| RoBERTa-japanese | - | - | ⭐ 23 | 🔴 november 2021 |
| aMLP-japanese | - | - | ⭐ 16 | 🔴 may 2022 |
| bert-japanese-aozora | - | - | ⭐ 40 | 🔴 august 2020 |
| sbert-ja | - | - | ⭐ 11 | 🔴 august 2021 |
| BERT-Japan-vaccination | - | - | ⭐ 7 | 🔴 may 2022 |
| gpt2-japanese | - | - | ⭐ 325 | 🔴 september 2023 |
| text2text-japanese | - | - | ⭐ 33 | 🔴 july 2021 |
| gpt-ja | - | - | ⭐ 3 | 🔴 september 2021 |
| friendly_JA-Model | - | - | ⭐ 1 | 🔴 may 2022 |
| albert-japanese | - | - | ⭐ 33 | 🔴 october 2021 |
| ja_text_bert | - | - | ⭐ 115 | 🔴 november 2018 |
| DistilBERT-base-jp | - | - | ⭐ 161 | 🔴 april 2020 |
| bert | - | - | ⭐ 28 | 🔴 april 2022 |
| Laboro-DistilBERT-Japanese | - | - | ⭐ 16 | 🔴 december 2020 |
| luke | - | - | ⭐ 726 | 🔴 june 2023 |
| GPTSAN | - | - | ⭐ 118 | 🔴 september 2023 |
| japanese-clip | - | - | ⭐ repo not found | 🔴 repo not found |
| AcademicBART | - | - | ⭐ 2 | 🔴 july 2024 |
| AcademicRoBERTa | - | - | ⭐ 9 | 🔴 september 2024 |
| LINE-DistilBERT-Japanese | - | - | ⭐ 46 | 🔴 march 2023 |
| Japanese-Alpaca-LoRA | - | - | ⭐ 141 | 🔴 april 2023 |
| albert-japanese-tinysegmenter | - | - | ⭐ 13 | 🔴 september 2023 |
| japanese-llama-experiment | - | - | ⭐ 54 | 🟢 december 2025 |
| easylightchatassistant | - | - | ⭐ 39 | 🔴 april 2024 |
Resources for using ChatGPT and APIs for Japanese dialogue and text generation
- VRChatGPT - A program that lets you chat on VRChat using ChatGPT.
- AITuberDegikkoMirii - Developing the foundational components of an AITuber
- wanna - Shell command launcher with natural language
- ChatdollKit - ChatdollKit enables you to make your 3D model into a chatbot
- ChuanhuChatGPTJapanese - GUI for ChatGPT API For Japanese
- AISisterAIChan - "AI Sister Ai-chan", an Ukagaka ghost powered by ChatGPT 3.5. A separate ChatGPT API key is required to use it.
- vrchatbot - Repository for building an AI bot for VRChat
- gptuber-by-langchain - GPT acts as a YouTuber
- openai-chatfriend - A chatbox application built using Nuxt 3 powered by Open AI Text completion endpoint. You can select different personality of your AI friend. The default will respond in Japanese. You can use this app to practice your Nihongo skills!
- chrome-ext-translate-to-hiragana-with-chatgpt - This Chrome extension can translate selected Japanese text to Hiragana by using ChatGPT.
- azure-search-openai-demo - This sample demonstrates several approaches to building ChatGPT-like experiences over your own data using the Retrieval Augmented Generation pattern.
- chatvrm - ChatVRM is a demo application that lets you easily converse with a 3D character in the browser.
- sftly-replace - A Chrome extension to replace the selected text softly
- summarize_arxv - Summarize arXiv paper with figures
- aiavatarkit - Building AI-based conversational avatars lightning fast
- pva-aoai-integration-solution - This repository packages and publishes, as a solution, the flows and other assets created for the trial use of ChatGPT at Kobe City Hall.
- jp-azureopenai-samples - Provides free sample apps (reference architectures, sample code, and deployment steps) intended as a reference for implementing applications with Azure OpenAI.
- character_chat - A chat script for conversing in Japanese with a configured character via the OpenAI API.
- chatgpt-slackbot - Slackbot script for using OpenAI's ChatGPT API on Slack (assumes use in Japanese)
- chatgpt-prompt-sample-japanese - Sample prompts for ChatGPT.
- kanji-flashcard-app-gpt4 - A Japanese Kanji Flashcard App built using Python and Langchain, enhanced with the intelligence of GPT-4.
- IgakuQA - Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations
- japagen - A study of generating pseudo training data with LLMs for Japanese tasks
- generativeai-prompt-sample-japanese - Sample Japanese prompts for generative AI tools such as ChatGPT and Copilot
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| VRChatGPT | - | - | ⭐ 15 | 🔴 march 2023 |
| AITuberDegikkoMirii | - | - | ⭐ 5 | 🔴 march 2023 |
| wanna | 📥 35 | 📦 20k | ⭐ 141 | 🔴 april 2023 |
| ChatdollKit | - | - | ⭐ 1.1k | 🟢 last saturday |
| ChuanhuChatGPTJapanese | - | - | ⭐ 1 | 🔴 march 2023 |
| AISisterAIChan | - | - | ⭐ 27 | 🔴 may 2023 |
| vrchatbot | - | - | ⭐ 28 | 🔴 december 2022 |
| gptuber-by-langchain | - | - | ⭐ 63 | 🔴 january 2023 |
| openai-chatfriend | - | - | ⭐ 16 | 🔴 april 2023 |
| chrome-ext-translate-to-hiragana-with-chatgpt | - | - | ⭐ 1 | 🔴 april 2023 |
| azure-search-openai-demo | - | - | ⭐ 46 | 🔴 december 2023 |
| chatvrm | - | - | ⭐ 818 | 🟡 may 2025 |
| sftly-replace | - | - | ⭐ 4 | 🔴 may 2023 |
| summarize_arxv | - | - | ⭐ 173 | 🔴 may 2023 |
| aiavatarkit | - | - | ⭐ 532 | 🟢 last wednesday |
| pva-aoai-integration-solution | - | - | ⭐ repo not found | 🔴 repo not found |
| jp-azureopenai-samples | - | - | ⭐ 280 | 🟡 september 2025 |
| character_chat | - | - | ⭐ 16 | 🔴 june 2023 |
| chatgpt-slackbot | - | - | ⭐ 64 | 🔴 july 2024 |
| chatgpt-prompt-sample-japanese | - | - | ⭐ 420 | 🟢 today |
| kanji-flashcard-app-gpt4 | - | - | ⭐ 6 | 🔴 october 2023 |
| IgakuQA | - | - | ⭐ 48 | 🔴 march 2023 |
| japagen | - | - | ⭐ 1 | 🔴 october 2024 |
| generativeai-prompt-sample-japanese | - | - | ⭐ 420 | 🟢 today |
Resources for Japanese dictionaries and input method editors (IME)
- mecab-ipadic-neologd - Neologism dictionary based on the language resources on the Web for mecab-ipadic
- tdmelodic - A Japanese accent dictionary generator
- jamdict - Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings
- unidic-py - Unidic packaged for installation via pip.
- Japanese-Company-Lexicon - Japanese Company Lexicon (JCLdic)
- manbyo-sudachi - Manbyo dictionary for Sudachi
- jawiki-kana-kanji-dict - Generate SKK/MeCab dictionary from Wikipedia(Japanese edition)
- JIWC-Dictionary - dictionary to find emotion related to text
- JumanDIC - This repository contains source dictionary files to build dictionaries for JUMAN and Juman++.
- ipadic-py - IPAdic packaged for easy use from Python.
- unidic-lite - A small version of UniDic for easy pip installs.
- emoji-ime-dictionary - Add-on IME dictionary for typing emoji in Japanese: an IME extension dictionary that enables Japanese-to-emoji conversion in Google Japanese Input and similar IMEs
- google-ime-dictionary - Add-on IME dictionary for Japanese-to-English conversion and English abbreviation expansion, usable in Google Japanese Input, ATOK, and similar IMEs
- dic-nico-intersection-pixiv - IME dictionary built from the intersection of Niconico Daihyakka and Pixiv Encyclopedia entries
- google-ime-user-dictionary-ja-en - Project archive of a Google IME user dictionary that maps katakana words (Japanese loanwords) to English.
- emoticon - Kaomoji (emoticon) dictionary for Google Japanese Input
- mecab-mozcdic - The open source Mozc dictionary converted to the MeCab dictionary format.
- denonbu-ime-dic - Denon IME: a dictionary of terms related to Denonbu, intended for use with Microsoft IME and similar IMEs
- nijisanji-ime-dic - A dictionary of Nijisanji-related terms, intended for use with Microsoft IME and similar IMEs.
- pokemon-ime-dic - A dictionary covering the names of all currently known Pokémon, intended for use with Microsoft IME and similar IMEs.
- EJDict - English-Japanese Dictionary data (Public Domain) EJDict-hand
- Ayashiy-Nipongo-Dic - With this dictionary, you too can use "correct" Japanese (the original description is written in deliberately broken Japanese).
- genshin-dict - Genshin Impact word dictionary for Windows/macOS
- jmdict-simplified - JMdict and JMnedict in JSON format
- mozcdict-ext - Convert external words into Mozc system dictionary
- mh-dict-jp - An attempt to build a user dictionary for Monster Hunter
- jitenbot - Convert data from Japanese dictionary websites and applications into portable file formats
- mecab-unidic-neologd - Neologism dictionary based on the language resources on the Web for mecab-unidic
- hololive-dictionary - Dictionary files related to hololive (hololive production). You can add words to your IME using the text files in the ./dictionary folder. See README.md for details.
- jmdict-yomitan - JMdict, JMnedict, KANJIDIC for Yomitan/Yomichan.
- yomichan-jlpt-vocab - JLPT level tags for words in Yomichan
- Jitendex - A free and openly licensed Japanese-to-English dictionary compatible with multiple dictionary clients
- jiten - Japanese Android/CLI/web dictionary based on JMdict/KANJIDIC (Japanese, Japanese-English, kanji-English, Japanese-German, and Japanese-Dutch dictionaries)
- pixiv-yomitan - Pixiv Encyclopedia Dictionary for Yomitan
- uchinaaguchi_dict - Uchinaaguchi dictionary (Okinawan language dictionary)
- yomitan-dictionaries - Japanese and Chinese dictionaries for Yomitan.
- mouse_over_dictionary - General-purpose dictionary tool that automatically reads the word under the mouse cursor
- jisyo - A new dictionary format for the SKK kana-kanji conversion engine
- skk-jisyo.emoji-ja - SKK dictionary for converting Japanese readings to emoji
- anthy - Anthy is a kana-kanji conversion engine for Japanese. It converts roma-ji to kana, and the kana text to a mixed kana and kanji.
- aws_dic_for_google_ime - Dictionary of AWS service names for Google Japanese Input
- cl-skkserv - SKK dictionary server written in Common Lisp, plus extensions
- anthy - Anthy maintenance
- anthy-unicode - Anthy Unicode - Another Anthy
- azooKey - azooKey: A Japanese Keyboard iOS Application Fully Developed in Swift
- azookey-desktop - Japanese Input Method "azooKey" for Desktop, supporting macOS
- fcitx5-hazkey - Japanese input method for fcitx5, powered by azooKey engine
- mozcdic-ut-place-names - Mozc UT Place Name Dictionary is a dictionary converted from the Japan Post's ZIP code data for Mozc.
- azookeykanakanjiconverter - Kana-Kanji Conversion Module written in Swift
- libkkc - Japanese Kana Kanji conversion input method library
- libskk - Japanese SKK input method library
- kanayomi-dict - User dictionary in Open JTalk format
- cjkvi-dict - Dictionary-related data for the kanji database
- wlsp-classical - Word List by Semantic Principles (Bunrui Goihyo) data for classical Japanese
- kanji-dict - A kanji dictionary for looking up stroke order, readings, stroke count, radicals, usage examples, and etymology. Covers all 98,682 kanji in Unicode 15.1.
- Kaomoji_proj - A project (formerly Kaomoji_proj) that builds a kaomoji (emoticon) dictionary for Microsoft IME, Microsoft's input software.
- kotlin-kana-kanji-converter - Kana-kanji conversion program in Kotlin
- alfred-japanese-dictionary - Japanese-English Dictionary using jisho.org with audio, csv export of entries, and preview of dictionary sites.
- ichiran - Linguistic tools for texts in Japanese language
- mikan - A Japanese input method.
- colloquial-kansai-dictionary - A quick reference for the material taught in Colloquial Kansai Japanese.
- jisho-open - Web frontend for the JMdict Japanese-English dictionary project, with study list support!
- macskk - Yet Another macOS SKK Input Method
- nandoku - A dictionary of hard-to-read kanji organized by school grade.
- japanese_android_ime - A FOSS Japanese IME for Android
- anthywl - Japanese input method for Sway using libanthy
- sekka - Yet another Japanese Input Method inspired by SKK.
- sumibi - Japanese input method powered by ChatGPT API
- jinmei-dict - Extracts only personal names from dictionary data and formats them as JSON keyed by reading (katakana), with each key holding a list of candidate written forms.
- japanesekeyboard - Sumire: a fully offline Japanese keyboard app
- japanesearabic - Japanese-Arabic dictionary for Yomitan
- o-dic - Okinawa dictionary
- skk-emoji-jisyo - SKK emoji dictionary
- mozcdic-ut-personal-names - A personal name dictionary for Mozc.
- mozcdic-ut-sudachidict - A dictionary converted from SudachiDict for Mozc.
- nihongo - japanese language data and dictionary
- kagome-dict - Dictionary Library for Kagome v2
- canna - Canna Japanese input system
- kansai-accent-dictionary - Keihan-type accent (Kansai dialect) dictionary: a Japanese dialect accent dictionary with 4,615 entries
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| mecab-ipadic-neologd | - | - | ⭐ 2.8k | 🔴 september 2020 |
| tdmelodic | - | - | ⭐ 123 | 🔴 march 2024 |
| jamdict | 📥 469 | 📦 51k | ⭐ 164 | 🔴 june 2021 |
| unidic-py | 📥 86k | 📦 9M | ⭐ 108 | 🔴 february 2025 |
| Japanese-Company-Lexicon | - | - | ⭐ 100 | 🔴 january 2023 |
| manbyo-sudachi | - | - | ⭐ 7 | 🔴 april 2021 |
| jawiki-kana-kanji-dict | - | - | ⭐ 59 | 🟢 february |
| JIWC-Dictionary | - | - | ⭐ 40 | 🔴 january 2021 |
| JumanDIC | - | - | ⭐ 4 | 🔴 august 2022 |
| ipadic-py | 📥 89k | 📦 6M | ⭐ 24 | 🔴 october 2021 |
| unidic-lite | 📥 86k | 📦 9M | ⭐ 49 | 🔴 september 2020 |
| emoji-ime-dictionary | - | - | ⭐ 365 | 🔴 january 2023 |
| google-ime-dictionary | - | - | ⭐ 101 | 🔴 january 2023 |
| dic-nico-intersection-pixiv | - | - | ⭐ 83 | 🔴 september 2024 |
| google-ime-user-dictionary-ja-en | - | - | ⭐ 58 | 🔴 december 2016 |
| emoticon | - | - | ⭐ 43 | 🔴 may 2020 |
| mecab-mozcdic | - | - | ⭐ 10 | 🔴 january 2018 |
| denonbu-ime-dic | - | - | ⭐ 2 | 🔴 november 2022 |
| nijisanji-ime-dic | - | - | ⭐ 35 | 🟢 february |
| pokemon-ime-dic | - | - | ⭐ 0 | 🔴 january 2020 |
| EJDict | - | - | ⭐ 241 | 🟡 november 2025 |
| Ayashiy-Nipongo-Dic | - | - | ⭐ 26 | 🔴 may 2024 |
| genshin-dict | - | - | ⭐ 126 | 🟢 february |
| jmdict-simplified | - | - | ⭐ 329 | 🟢 today |
| mozcdict-ext | - | - | ⭐ 68 | 🟡 september 2025 |
| mh-dict-jp | - | - | ⭐ 5 | 🟡 april 2025 |
| jitenbot | - | - | ⭐ repo not found | 🔴 repo not found |
| mecab-unidic-neologd | - | - | ⭐ 87 | 🔴 september 2020 |
| hololive-dictionary | - | - | ⭐ 24 | 🔴 december 2024 |
| jmdict-yomitan | - | - | ⭐ 237 | 🟢 today |
| yomichan-jlpt-vocab | - | - | ⭐ 119 | 🟡 august 2025 |
| Jitendex | - | - | ⭐ 446 | 🟢 today |
| jiten | - | - | ⭐ 125 | 🔴 december 2023 |
| pixiv-yomitan | - | - | ⭐ 49 | 🟢 january |
| uchinaaguchi_dict | - | - | ⭐ 4 | 🟢 february |
| yomitan-dictionaries | - | - | ⭐ 712 | 🟡 august 2025 |
| mouse_over_dictionary | - | - | ⭐ 72 | 🔴 january 2020 |
| jisyo | - | - | ⭐ 28 | 🔴 september 2023 |
| skk-jisyo.emoji-ja | - | - | ⭐ 30 | 🔴 march 2018 |
| aws_dic_for_google_ime | - | - | ⭐ 7 | 🔴 november 2019 |
| cl-skkserv | - | - | ⭐ 31 | 🔴 october 2024 |
| anthy | - | - | ⭐ 3 | 🔴 july 2013 |
| anthy-unicode | - | - | ⭐ 41 | 🟢 last friday |
| azooKey | - | - | ⭐ 633 | 🟢 yesterday |
| azookey-desktop | - | - | ⭐ 790 | 🟢 yesterday |
| fcitx5-hazkey | - | - | ⭐ 160 | 🟢 last friday |
| mozcdic-ut-place-names | - | - | ⭐ 20 | 🟢 february |
| azookeykanakanjiconverter | - | - | ⭐ 135 | 🟢 last saturday |
| libkkc | - | - | ⭐ 112 | 🔴 august 2024 |
| libskk | - | - | ⭐ 98 | 🟢 last friday |
| kanayomi-dict | - | - | ⭐ repo not found | 🔴 repo not found |
| cjkvi-dict | - | - | ⭐ 108 | 🔴 september 2017 |
| wlsp-classical | - | - | ⭐ 2 | 🟡 november 2025 |
| kanji-dict | - | - | ⭐ 6 | 🟢 february |
| Kaomoji_proj | - | - | ⭐ 11 | 🟡 october 2025 |
| kotlin-kana-kanji-converter | - | - | ⭐ 5 | 🟡 november 2025 |
| alfred-japanese-dictionary | - | - | ⭐ 6 | 🟢 february |
| ichiran | - | - | ⭐ 386 | 🟢 january |
| mikan | - | - | ⭐ 24 | 🟡 june 2025 |
| colloquial-kansai-dictionary | - | - | ⭐ 9 | 🟢 last saturday |
| jisho-open | - | - | ⭐ 57 | 🟡 june 2025 |
| macskk | - | - | ⭐ 258 | 🟢 february |
| nandoku | - | - | ⭐ 1 | 🟢 february |
| japanese_android_ime | - | - | ⭐ 2 | 🟡 september 2025 |
| anthywl | - | - | ⭐ 34 | 🟡 april 2025 |
| sekka | - | - | ⭐ 24 | 🟡 july 2025 |
| sumibi | - | - | ⭐ 36 | 🟢 last saturday |
| jinmei-dict | - | - | ⭐ 6 | 🔴 april 2020 |
| japanesekeyboard | - | - | ⭐ 202 | 🟢 last wednesday |
| japanesearabic | - | - | ⭐ 17 | 🟡 may 2025 |
| o-dic | - | - | ⭐ 5 | 🟡 march 2025 |
| skk-emoji-jisyo | - | - | ⭐ 140 | 🔴 january 2025 |
| mozcdic-ut-personal-names | - | - | ⭐ 24 | 🟢 february |
| mozcdic-ut-sudachidict | - | - | ⭐ 20 | 🟢 february |
| nihongo | - | - | ⭐ 19 | 🔴 january 2025 |
| kagome-dict | - | - | ⭐ 15 | 🟢 last tuesday |
| canna | - | - | ⭐ 3 | 🟡 august 2025 |
| kansai-accent-dictionary | - | - | ⭐ 1 | 🟢 december 2025 |
Corpora annotated with part-of-speech tags and named entities
- ner-wikipedia-dataset - Japanese named entity recognition dataset built from Wikipedia
- IOB2Corpus - Japanese IOB2 tagged corpus for Named Entity Recognition.
- TwitterCorpus - Tokyo Metropolitan University Japanese Twitter corpus
- UD_Japanese-PUD - Parallel Universal Dependencies.
- UD_Japanese-GSD - Japanese data from the Google UDT 2.0.
- KWDLC - Kyoto University Web Document Leads Corpus
- AnnotatedFKCCorpus - Annotated Fuman Kaitori Center Corpus
- UD_Japanese-GSDLUW - Long-unit-word version of UD_Japanese-GSD
- ud_japanese-bccwj - This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation.
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| ner-wikipedia-dataset | - | - | ⭐ 142 | 🔴 september 2023 |
| IOB2Corpus | - | - | ⭐ 61 | 🔴 february 2020 |
| TwitterCorpus | - | - | ⭐ 21 | 🔴 march 2016 |
| UD_Japanese-PUD | - | - | ⭐ 0 | 🔴 may 2020 |
| UD_Japanese-GSD | - | - | ⭐ 28 | 🔴 may 2022 |
| KWDLC | - | - | ⭐ 83 | 🔴 december 2023 |
| AnnotatedFKCCorpus | - | - | ⭐ 18 | 🔴 december 2023 |
| anthy | - | - | ⭐ 15 | 🔴 february 2023 |
| UD_Japanese-GSDLUW | - | - | ⭐ 3 | 🟡 november 2025 |
| ud_japanese-bccwj | - | - | ⭐ 26 | 🟡 november 2025 |
Bilingual corpora containing aligned sentences for translation tasks
- small_parallel_enja - 50k English-Japanese Parallel Corpus for Machine Translation Benchmark.
- Web-Crawled-Corpus-for-Japanese-Chinese-NMT - A Web Crawled Corpus for Japanese-Chinese NMT
- CourseraParallelCorpusMining - Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
- JESC - A large parallel corpus of English and Japanese
- AMI-Meeting-Parallel-Corpus - AMI Meeting Parallel Corpus
- giant_ja-en_parallel_corpus - This directory includes a giant Japanese-English subtitle corpus. The raw data comes from Stanford's JESC project.
- jesc_small - Small Japanese-English Subtitle Corpus
- graded-enja-corpus - A Japanese-English parallel corpus that accounts for prohibited terms and word levels.
- cjk-compsci-terms - CJK computer science terms comparison
- Laboro-ParaCorpus - Scripts for creating a Japanese-English parallel corpus and training NMT models
- google-vs-deepl-je - google-vs-deepl-je
- matcha - A dataset for Japanese text simplification built from articles of MATCHA, a media site for visitors to Japan.
- en-ja-el - EnJaEL: En-Ja Parallel Entity Linking Dataset (Version 1.0)
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| small_parallel_enja | - | - | ⭐ 98 | 🔴 september 2019 |
| Web-Crawled-Corpus-for-Japanese-Chinese-NMT | - | - | ⭐ 15 | 🔴 september 2023 |
| CourseraParallelCorpusMining | - | - | ⭐ 15 | 🔴 august 2024 |
| JESC | - | - | ⭐ 87 | 🔴 november 2017 |
| AMI-Meeting-Parallel-Corpus | - | - | ⭐ 11 | 🔴 december 2020 |
| giant_ja-en_parallel_corpus | - | - | ⭐ 5 | 🔴 august 2019 |
| jesc_small | - | - | ⭐ 3 | 🔴 july 2019 |
| graded-enja-corpus | - | - | ⭐ 6 | 🟡 august 2025 |
| cjk-compsci-terms | - | - | ⭐ 145 | 🟢 january |
| Laboro-ParaCorpus | - | - | ⭐ 18 | 🔴 november 2021 |
| google-vs-deepl-je | - | - | ⭐ 4 | 🔴 march 2020 |
| matcha | - | - | ⭐ 6 | 🔴 january 2025 |
| en-ja-el | - | - | ⭐ 2 | 🔴 january 2025 |
Collections of conversation data for training dialogue systems
- JMRD - Japanese Movie Recommendation Dialogue dataset
- open2ch-dialogue-corpus - Dialogue corpus created by crawling Open 2channel
- BSD - The Business Scene Dialogue corpus
- asdc - Accommodation Search Dialog Corpus (宿泊施設探索対話コーパス)
- japanese-corpus - Japanese dialogue data for seq2seq etc.
- BPersona-chat - This repository contains the Japanese-English bilingual chat corpus BPersona-chat published in the paper Chat Translation Error Detection for Assisting Cross-lingual Communications at AACL-IJCNLP 2022's Workshop Eval4NLP 2022.
- japanese-daily-dialogue - Japanese Daily Dialogue, or 日本語日常対話コーパス in Japanese, is a high-quality multi-turn dialogue dataset containing daily conversations on five topics: dailylife, school, travel, health, and entertainment.
- llm-japanese-dataset - Japanese chat dataset for building LLMs
- kokorochat - Japanese counseling dialogue dataset collected through role-play
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| JMRD | - | - | ⭐ 28 | 🔴 july 2022 |
| open2ch-dialogue-corpus | - | - | ⭐ 98 | 🔴 june 2021 |
| BSD | - | - | ⭐ 73 | 🔴 november 2021 |
| asdc | - | - | ⭐ 25 | 🔴 august 2023 |
| japanese-corpus | - | - | ⭐ 3 | 🔴 october 2018 |
| BPersona-chat | - | - | ⭐ 5 | 🔴 january 2023 |
| japanese-daily-dialogue | - | - | ⭐ 54 | 🔴 march 2023 |
| llm-japanese-dataset | - | - | ⭐ 87 | 🔴 january 2024 |
| kokorochat | - | - | ⭐ 16 | 🟡 august 2025 |
Corpora for tasks such as question answering or entailment recognition
- jrte-corpus - Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
- kanji-data - A JSON kanji dataset with updated JLPT levels and WaniKani information
- JapaneseWordSimilarityDataset - Japanese Word Similarity Dataset
- simple-jppdb - A paraphrase database for Japanese text simplification
- chABSA-dataset - chakki's Aspect-Based Sentiment Analysis dataset
- JaQuAD - JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension (2022, Skelter Labs)
- JaNLI - Japanese Adversarial Natural Language Inference Dataset
- ebe-dataset - Evidence-based Explanation Dataset (AACL-IJCNLP 2020)
- emoji-ja - Dictionary of Japanese readings, keywords, and categories for Unicode emoji
- nayose-wikipedia-ja - Japanese entity resolution (nayose) dataset created from Wikipedia
- ja.text8 - Japanese text8 corpus for word embedding.
- ThreeLineSummaryDataset - Three-line summary dataset
- japanese - This repo contains a list of the 44,998 most common Japanese words in order of frequency, as determined by the University of Leeds Corpus.
- kanji-frequency - Kanji usage frequency data collected from various sources
- TEDxJP-10K - TEDxJP-10K ASR Evaluation Dataset
- CoARiJ - Corpus of Annual Reports in Japan
- technological-book-corpus-ja - Raw corpus and tools collecting technical books written in Japanese
- ita-corpus-chuwa - Chunked word annotation for ITA corpus
- wikipedia-utils - Utility scripts for preprocessing Wikipedia texts for NLP
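- inappropriate-words-ja - Collects inappropriate expressions in Japanese. Intended to be useful for tasks such as data cleaning in natural language processing.
- house-of-councillors - Data on parliamentary groups, members, bills, and written questions, compiled from the official website of the House of Councillors.
- house-of-representatives - Diet bill database: House of Representatives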
- STAIR-captions - STAIR captions: large-scale Japanese image caption dataset
- Winograd-Schema-Challenge-Ja - Japanese Translation of Winograd Schema Challenge
- speechBSD - An extension of the BSD corpus with audio and speaker attribute information
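- ita-corpus - Sentence list of the ITA corpus
- rohan4600 - Mora-balanced Japanese corpus
- anlp-jp-history - Complete list of presentations at the annual meetings of the Association for Natural Language Processing, including a machine-readable version
- keigo_transfer_task - Evaluation dataset for the keigo (honorific) conversion task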
- loanwords_gairaigo - English loanwords in Japanese
- jawikicorpus - Japanese-Wikipedia Wikification Corpus
- GeneralPolicySpeechOfPrimeMinisterOfJapan - A corpus of Japanese text from the general policy speeches of the Prime Minister of Japan
- wrime - WRIME: dataset for subjective and objective emotion analysis
- jtubespeech - JTubeSpeech: Corpus of Japanese speech collected from YouTube
- WikipediaWordFrequencyList - List of frequent words used in Japanese Wikipedia
- kokkosho_data - Dataset of vehicle defect reports
- pdmocrdataset-part1 - OCR training dataset created in the project to convert digitized materials to OCR text
- huriganacorpus-ndlbib - Furigana dataset built from Japanese national bibliography data
- jvs_hiho - Self-made labels for the JVS (Japanese versatile speech) corpus
- hirakanadic - Allows Sudachi to normalize from hiragana to katakana from any compound word list
- animedb - Database listing anime works spanning roughly 100 years
- security_words - Japanese-English correspondences for public organizations related to cybersecurity
- Data-on-Japanese-Diet-Members - Data on members of the Japanese Diet
- honkoku-data - Transcription texts created on Minna de Honkoku (https://honkoku.org), a crowdsourced transcription platform for historical Japanese documents.
- wikihow_japanese - wikiHow dataset (Japanese version)
- engineer-vocabulary-list - Engineer Vocabulary List in Japanese/English
- JSICK - Japanese Sentences Involving Compositional Knowledge (JSICK) Dataset/JSICK-stress Test Set
- phishurl-list - Phishing URL dataset from JPCERT/CC
- jcms - A Japanese Corpus of Many Specialized Domains (JCMS)
- aozorabunko_text - text-only archives of www.aozora.gr.jp
- friendly_JA-Corpus - friendly_JA is a parallel Japanese-to-Japanese corpus aimed at making Japanese easier by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
- topokanji - Topologically ordered lists of kanji for effective learning
- isbn4groups - Data on Japanese-language publications under ISBN-13 (978-4-XXXXXXXXX)
- NMeCab - NMeCab: a Japanese morphological analyzer for .NET
- ndlngramdata - Dataset of n-gram frequency statistics for OCR text data created from digitized materials
- ndlngramviewer_v2 - Source code and related files for NDL Ngram Viewer, renewed in January 2023
- data_set - Datasets related to law and court precedents
- huggingface-datasets_wrime - WRIME for huggingface datasets
- ndl-minhon-ocrdataset - NDL OCR training dataset for classical Japanese books (processed Minna de Honkoku data)
- PAX_SAPIENTICA - GIS & Archaeological Simulator. 2023 in development.
- j-liwc2015 - Japanese version of LIWC2015
- huggingface-datasets_livedoor-news-corpus - Japanese Livedoor news corpus for huggingface datasets
- huggingface-datasets_JGLUE - JGLUE: Japanese General Language Understanding Evaluation for huggingface datasets
- commonsense-moral-ja - JCommonsenseMorality is a dataset created through crowdsourcing that reflects the commonsense morality of Japanese annotators.
- comet-atomic-ja - COMET-ATOMIC ja
- dcsg-ja - Dialogue Commonsense Graph in Japanese
- japanese-toxic-dataset - "Proposal and Evaluation of Japanese Toxicity Schema" provides a schema and dataset for toxicity in the Japanese language.
- camera - CAMERA (CyberAgent Multimodal Evaluation for Ad Text GeneRAtion) is the Japanese ad text generation dataset.
- Japanese-Fakenews-Dataset - Japanese fake news dataset
- jpn_explainable_qa_dataset - jpn_explainable_qa_dataset
- copa-japanese - COPA Dataset in Japanese
- WLSP-familiarity - Word Familiarity Rate for 'Word List by Semantic Principles (WLSP)'
- ProSub - A cross-linguistic study of pronoun substitutes and address terms
- ramendb - Scraping tool and collected data for the Nantoka Database ( https://supleks.jp/ )
- huggingface-datasets_CAMERA - CAMERA (CyberAgent Multimodal Evaluation for Ad Text GeneRAtion) for huggingface datasets
- FactCheckSentenceNLI-FCSNLI- - FactCheckSentenceNLI dataset
- databricks-dolly-15k-ja - Japanese translation of databricks-dolly-15k.jsonl, the training data used for databricks/dolly-v2-12b.
- EaST-MELD - EaST-MELD is an English-Japanese dataset for emotion-aware speech translation based on MELD.
- meconaudio - Mecon Audio (Medical Conference Audio) is a dataset of read-aloud minutes from the Advanced Medical Care Council hosted by the Ministry of Health, Labour and Welfare.
- japanese-addresses - Open data of nationwide addresses at the town/chome level (277,191 entries)
- aozorasearch - Full-text search system for Aozora Bunko built on Groonga; a search library and web app.
- llm-jp-corpus - This repository contains scripts to reproduce the LLM-jp corpus.
- alpaca_ja - Japanese version of the Alpaca dataset
- instruction_ja - Japanese instruction data
- japanese-family-names - Top 5000 Japanese family names, with readings, ordered by frequency.
- kanji-data-media - Japanese language data on kanji, radicals, media files, fonts and related resources from Kanji alive
- reazonspeech - Construct large-scale Japanese audio corpus at home
- huriganacorpus-aozora - Furigana dataset built from Aozora Bunko and Sapie braille data
- koniwa - An open collection of annotated voices in Japanese language
- JMMLU - Japanese Massive Multitask Language Understanding Benchmark
- hurigana-speech-corpus-aozora - Speech corpus dataset of Aozora Bunko with furigana annotations
- jqara - JQaRA: Japanese Question Answering with Retrieval Augmentation, a Japanese Q&A dataset for evaluating retrieval-augmented generation (RAG)
- jemhopqa - JEMHopQA (Japanese Explainable Multi-hop Question Answering) is a Japanese multi-hop QA dataset that can evaluate internal reasoning.
- jacred - Repository for the Japanese Document-level Relation Extraction Dataset (planned for release in March).
- jades - JADES is a dataset for text simplification in Japanese, described in "JADES: New Text Simplification Dataset in Japanese Targeted at Non-Native Speakers" (the paper will be available soon).
- do-not-answer-ja - Japanese version of Do-Not-Answer, the safety evaluation dataset released by the University of Melbourne in August 2023, machine-translated into Japanese and then revised to reflect Japanese culture so that it can be used to evaluate Japanese LLMs.
- oasst1-89k-ja - Japanese translation of OASST1, OpenAssistant's open-source dataset.
- jacwir - JaCWIR: Japanese Casual Web IR, a small, casual dataset of web titles and summaries for evaluating Japanese information retrieval
- japanese-technical-dict - Table pairing katakana terms commonly used in science and technology with their source words, for learners of Japanese
- j-unimorph - Dataset of UniMorph in Japanese
- GazeVQA - Dataset for the LREC-COLING 2024 paper "A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions"
- J-CRe3 - Code for J-CRe3 experiments (Ueda et al., LREC-COLING, 2024)
- jmed-llm - JMED-LLM: Japanese Medical Evaluation Dataset for Large Language Models
- lawtext - Plain text format for Japanese law
- pdmocrdataset-part2 - OCR training dataset created in the OCR processing program research and development project
- japanesetopicwsd - Evaluation set for topic-based word sense disambiguation
- temporalNLI_dataset - Jamp: Controlled Japanese Temporal Inference Dataset for Evaluating Generalization Capacity of Language Models
- JSeM - Japanese semantic test suite (FraCaS counterpart and extensions)
- niilc-qa - NIILC QA data
- chain-of-thought-ja-dataset - Dataset of paper "Verification of Chain-of-Thought Prompting in Japanese"
- WikipediaAnnotatedCorpus - This is a Japanese text corpus that consists of Wikipedia articles with various linguistic annotations.
- elaws-history - Periodically downloads and archives the "all laws data" distributed by e-Gov law search
- Japanese-RP-Bench - Japanese-RP-Bench is a benchmark for measuring the Japanese role-play ability of LLMs.
- hdic - HDIC : Integrated Database of Hanzi Dictionaries in Early Japan
- awesome-japan-opendata - Awesome Japan Open Data: a list and summary of open data information in Japan
- kanji-data - Data on kanji, including the Joyo kanji table
- openchj-genji - Morphological information data for The Tale of Genji
- AdParaphrase - This repository contains data for our paper "AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts".
- Jamp_sp - Jamp_sp: Controlled Japanese Temporal Inference Dataset Considering Aspect
- jnli-neg - Public repository for JNLI-Neg, a Japanese natural language inference dataset for evaluating negation understanding.
- swallow-corpus - This repository provides Python implementation for building Swallow Corpus Version 1, a large Japanese web corpus (Okazaki et al., 2024), from Common Crawl archives.
- jalecon - A Dataset of Japanese Lexical Complexity for Non-Native Readers
- multils-japanese - MultiLS-Japanese Lexical Complexity Prediction and Lexical Simplification Dataset for Japanese: annotator profiles, unaggregated annotation, and annotation guidelines.
- nwjc - NINJAL Web Japanese Corpus
- open-mantra-dataset - Dataset introduced in the paper "Towards Fully Automated Manga Translation" presented in AAAI21
- public-annotations - Various annotations of Manga109 dataset
- gimei - random Japanese name and address generator
- safety-boundary-test - Test set for evaluating the safety behavior of Japanese language models
- j-ono-data - A simple, open-source collection of Japanese onomatopoeic and mimetic sound words in JSON format. With manga samples.
- kanji - List of Japanese kanji radicals to learn
- jethics - Overview page for JETHICS, a dataset for evaluating moral understanding in Japanese (to be updated)
- waon - WAON: Large-Scale and High-Quality Japanese Image-Text Dataset for Vision-Language Models
- kuci - Kyoto University Commonsense Inference dataset (KUCI)
- japanese-address-testdata - Test dataset of Japanese addresses that are hard to parse
- jlpt-word-list - Japanese word list from JLPT vocabulary
- hiragana_mojigazo - Character image dataset (73-character hiragana edition)
- lawqa_jp - Multiple-choice QA dataset on Japanese laws and regulations
- yjcaptions - YJ Captions 26k Dataset
- ja-vg-vqa - Japanese Visual Genome VQA dataset
- lawhub - Repository to track Japanese Law in text format
- japanese-subtitles-word-kanji-frequency-lists - A word frequency list derived from subtitles from Japanese drama, anime and films.
- jconj - A table-based Japanese word conjugator
- extract_jawp_names - Extracts personal names from Japanese Wikipedia.
- cejc_yomichan_freq_dict - Frequency dictionary for yomichan based on the Corpus of Everyday Japanese Conversation dataset
- wikidict-ja - Wikipedia Bilingual Reference Data (Japanese)
- ajimee-bench - AJIMEE-Bench (Advanced Japanese IME Evaluation Benchmark)
- j-spaw - J-SpAW: Japanese speech corpus for speaker verification and anti-spoofing
- camera3 - CAMERA3: An Evaluation Dataset for Controllable Ad Text Generation in Japanese
- jgpqa - Japanese translation of the GPQA dataset
- tanaka-corpus-plus - Tanaka Corpus with noise removed.
- emotioncorpusjapanesetokushimaa2lab - Japanese emotion corpus Tokushima Univ. A-2 Lab.
- osworld-jp - Japanese computer-use benchmark for evaluation that also accounts for language
- quasi_japanese_reviews - Quasi Japanese Reviews (pseudo-review data)
- psychiatry-clinical-notes - Dataset from a survey on writing psychiatric first-visit clinical notes
- merged-town-names - Table mapping old place names that disappeared through municipal mergers and similar changes to their new names
- japanesetextemoticondata - Japanese text-emoticon data.
- mishearing-corpus - Mishearing corpus: a Japanese dataset of about 10,000 entries managed with CSV and Table Schema, validated automatically with VS Code, pre-commit, Frictionless, and GitHub Actions
- kotowaza - Structured JSON dataset of Japanese proverbs (kotowaza) with meanings in Indonesian & English, examples, JLPT levels, and tags.
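- selective-rag-kasensabo - MVP of a practical agentic RAG system that automatically classifies the specialization granularity (fine/coarse) of questions about construction technical standards with 96% accuracy and selects the best RAG system (ColBERT/Naive). Two RAG systems were built on the river, erosion-control (sabo), and dam technical standards published in November 2025, and their accuracy and speed were compared on 200 questions of differing granularity.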
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| jrte-corpus | - | - | ⭐ 77 | 🔴 june 2023 |
| kanji-data | - | - | ⭐ 201 | 🔴 december 2019 |
| JapaneseWordSimilarityDataset | - | - | ⭐ 102 | 🔴 december 2021 |
| simple-jppdb | - | - | ⭐ 32 | 🔴 march 2017 |
| chABSA-dataset | - | - | ⭐ 141 | 🔴 september 2018 |
| JaQuAD | - | - | ⭐ 108 | 🔴 january 2022 |
| JaNLI | - | - | ⭐ 17 | 🔴 may 2023 |
| ebe-dataset | - | - | ⭐ 18 | 🔴 december 2020 |
| emoji-ja | - | - | ⭐ 82 | 🟡 march 2025 |
| nayose-wikipedia-ja | - | - | ⭐ 35 | 🔴 march 2020 |
| ja.text8 | - | - | ⭐ 111 | 🔴 october 2017 |
| ThreeLineSummaryDataset | - | - | ⭐ 31 | 🔴 april 2018 |
| japanese | - | - | ⭐ 83 | 🔴 september 2018 |
| kanji-frequency | - | - | ⭐ 155 | 🟢 january |
| TEDxJP-10K | - | - | ⭐ 24 | 🔴 january 2021 |
| CoARiJ | - | - | ⭐ 94 | 🔴 december 2020 |
| technological-book-corpus-ja | - | - | ⭐ 26 | 🔴 july 2023 |
| ita-corpus-chuwa | - | - | ⭐ 5 | 🔴 august 2021 |
| wikipedia-utils | - | - | ⭐ 78 | 🔴 april 2024 |
| inappropriate-words-ja | - | - | ⭐ 201 | 🔴 december 2021 |
| house-of-councillors | - | - | ⭐ 105 | 🟢 yesterday |
| house-of-representatives | - | - | ⭐ 176 | 🟢 yesterday |
| STAIR-captions | - | - | ⭐ 90 | 🔴 july 2018 |
| Winograd-Schema-Challenge-Ja | - | - | ⭐ 6 | 🔴 january 2019 |
| speechBSD | - | - | ⭐ 3 | 🔴 february 2024 |
| ita-corpus | - | - | ⭐ 222 | 🔴 december 2024 |
| rohan4600 | - | - | ⭐ 66 | 🔴 february 2023 |
| anlp-jp-history | - | - | ⭐ 3 | 🔴 april 2024 |
| keigo_transfer_task | - | - | ⭐ 21 | 🔴 november 2022 |
| loanwords_gairaigo | - | - | ⭐ 19 | 🔴 january 2021 |
| jawikicorpus | - | - | ⭐ 4 | 🔴 november 2018 |
| GeneralPolicySpeechOfPrimeMinisterOfJapan | - | - | ⭐ 6 | 🔴 january 2020 |
| wrime | - | - | ⭐ 174 | 🟡 september 2025 |
| jtubespeech | - | - | ⭐ 229 | 🔴 march 2023 |
| WikipediaWordFrequencyList | - | - | ⭐ 2 | 🔴 april 2022 |
| kokkosho_data | - | - | ⭐ 1 | 🔴 july 2019 |
| pdmocrdataset-part1 | - | - | ⭐ 75 | 🔴 june 2024 |
| huriganacorpus-ndlbib | - | - | ⭐ 28 | 🔴 september 2021 |
| jvs_hiho | - | - | ⭐ 31 | 🔴 february 2021 |
| hirakanadic | 📥 61 | 📦 14k | ⭐ 7 | 🔴 july 2023 |
| animedb | - | - | ⭐ 328 | 🔴 january 2023 |
| security_words | - | - | ⭐ 27 | 🔴 august 2023 |
| Data-on-Japanese-Diet-Members | - | - | ⭐ 3 | 🔴 september 2022 |
| honkoku-data | - | - | ⭐ 17 | 🟢 yesterday |
| wikihow_japanese | - | - | ⭐ 35 | 🔴 december 2020 |
| engineer-vocabulary-list | - | - | ⭐ 1.9k | 🔴 november 2020 |
| JSICK | - | - | ⭐ 45 | 🔴 may 2023 |
| phishurl-list | - | - | ⭐ 199 | 🟡 november 2025 |
| jcms | - | - | ⭐ 9 | 🔴 november 2022 |
| aozorabunko_text | - | - | ⭐ 85 | 🔴 march 2023 |
| friendly_JA-Corpus | - | - | ⭐ repo not found | 🔴 repo not found |
| topokanji | - | - | ⭐ 199 | 🔴 january 2016 |
| isbn4groups | - | - | ⭐ 1 | 🔴 june 2024 |
| NMeCab | - | - | ⭐ 97 | 🔴 march 2024 |
| ndlngramdata | - | - | ⭐ 14 | 🔴 january 2023 |
| ndlngramviewer_v2 | - | - | ⭐ 3 | 🔴 july 2023 |
| data_set | - | - | ⭐ 49 | 🔴 january 2025 |
| huggingface-datasets_wrime | - | - | ⭐ 4 | 🔴 january 2023 |
| ndl-minhon-ocrdataset | - | - | ⭐ 18 | 🔴 february 2025 |
| PAX_SAPIENTICA | - | - | ⭐ 180 | 🟢 december 2025 |
| j-liwc2015 | - | - | ⭐ 13 | 🔴 november 2024 |
| huggingface-datasets_livedoor-news-corpus | - | - | ⭐ 2 | 🔴 october 2023 |
| huggingface-datasets_JGLUE | - | - | ⭐ 12 | 🟡 march 2025 |
| commonsense-moral-ja | - | - | ⭐ 15 | 🟡 november 2025 |
| comet-atomic-ja | - | - | ⭐ 31 | 🔴 march 2024 |
| dcsg-ja | - | - | ⭐ 6 | 🔴 march 2023 |
| japanese-toxic-dataset | - | - | ⭐ 21 | 🔴 january 2023 |
| camera | - | - | ⭐ 26 | 🔴 august 2024 |
| Japanese-Fakenews-Dataset | - | - | ⭐ 20 | 🔴 may 2021 |
| jpn_explainable_qa_dataset | - | - | ⭐ repo not found | 🔴 repo not found |
| copa-japanese | - | - | ⭐ 1 | 🔴 february 2023 |
| WLSP-familiarity | - | - | ⭐ 12 | 🔴 january 2025 |
| ProSub | - | - | ⭐ 5 | 🟡 april 2025 |
| ramendb | - | - | ⭐ 7 | 🟢 today |
| huggingface-datasets_CAMERA | - | - | ⭐ 3 | 🔴 march 2023 |
| FactCheckSentenceNLI-FCSNLI- | - | - | ⭐ 0 | 🔴 march 2021 |
| databricks-dolly-15k-ja | - | - | ⭐ 89 | 🔴 july 2023 |
| EaST-MELD | - | - | ⭐ 0 | 🔴 june 2023 |
| meconaudio | - | - | ⭐ 9 | 🔴 october 2023 |
| japanese-addresses | - | - | ⭐ 758 | 🟢 december 2025 |
| aozorasearch | - | - | ⭐ 21 | 🔴 september 2020 |
| llm-jp-corpus | - | - | ⭐ 43 | 🔴 october 2023 |
| alpaca_ja | - | - | ⭐ 86 | 🔴 may 2023 |
| instruction_ja | - | - | ⭐ 24 | 🔴 july 2023 |
| japanese-family-names | - | - | ⭐ 18 | 🔴 june 2017 |
| kanji-data-media | - | - | ⭐ 401 | 🔴 november 2023 |
| reazonspeech | - | - | ⭐ 360 | 🟢 january |
| huriganacorpus-aozora | - | - | ⭐ 17 | 🔴 january 2024 |
| koniwa | - | - | ⭐ 56 | 🟡 april 2025 |
| JMMLU | - | - | ⭐ 38 | 🟡 october 2025 |
| hurigana-speech-corpus-aozora | - | - | ⭐ 42 | 🟡 march 2025 |
| jqara | - | - | ⭐ 42 | 🟡 september 2025 |
| jemhopqa | - | - | ⭐ 29 | 🟡 april 2025 |
| jacred | - | - | ⭐ 7 | 🔴 march 2024 |
| jades | - | - | ⭐ 0 | 🔴 december 2022 |
| do-not-answer-ja | - | - | ⭐ 24 | 🔴 december 2023 |
| oasst1-89k-ja | - | - | ⭐ 16 | 🔴 november 2023 |
| jacwir | - | - | ⭐ 8 | 🟡 september 2025 |
| japanese-technical-dict | - | - | ⭐ 3 | 🔴 november 2024 |
| j-unimorph | - | - | ⭐ 9 | 🟢 january |
| GazeVQA | - | - | ⭐ 0 | 🔴 september 2024 |
| J-CRe3 | - | - | ⭐ 9 | 🔴 january 2025 |
| jmed-llm | - | - | ⭐ 56 | 🔴 september 2024 |
| lawtext | - | - | ⭐ 91 | 🟢 january |
| pdmocrdataset-part2 | - | - | ⭐ 14 | 🔴 june 2024 |
| japanesetopicwsd | - | - | ⭐ 2 | 🔴 september 2018 |
| temporalNLI_dataset | - | - | ⭐ 1 | 🔴 july 2023 |
| JSeM | - | - | ⭐ 13 | 🔴 november 2024 |
| niilc-qa | - | - | ⭐ 18 | 🔴 november 2015 |
| chain-of-thought-ja-dataset | - | - | ⭐ 5 | 🔴 september 2023 |
| WikipediaAnnotatedCorpus | - | - | ⭐ 12 | 🟢 last thursday |
| elaws-history | - | - | ⭐ 4 | 🟢 yesterday |
| Japanese-RP-Bench | - | - | ⭐ 18 | 🔴 september 2024 |
| hdic | - | - | ⭐ 40 | 🟢 today |
| awesome-japan-opendata | - | - | ⭐ 156 | 🟢 february |
| kanji-data | - | - | ⭐ 11 | 🟢 yesterday |
| openchj-genji | - | - | ⭐ 2 | 🟡 march 2025 |
| AdParaphrase | - | - | ⭐ 1 | 🟡 may 2025 |
| Jamp_sp | - | - | ⭐ 0 | 🔴 june 2024 |
| jnli-neg | - | - | ⭐ 0 | 🟢 december 2025 |
| swallow-corpus | - | - | ⭐ 5 | 🔴 november 2024 |
| jalecon | - | - | ⭐ 5 | 🔴 july 2023 |
| multils-japanese | - | - | ⭐ 0 | 🟢 january |
| nwjc | - | - | ⭐ 10 | 🔴 april 2022 |
| open-mantra-dataset | - | - | ⭐ 198 | 🔴 march 2023 |
| gimei | - | - | ⭐ 424 | 🟢 january |
| safety-boundary-test | - | - | ⭐ 9 | 🟡 july 2025 |
| j-ono-data | - | - | ⭐ 6 | 🟢 february |
| kanji | - | - | ⭐ 28 | 🟢 yesterday |
| jethics | - | - | ⭐ 2 | 🟡 june 2025 |
| waon | - | - | ⭐ 6 | 🟡 november 2025 |
| kuci | - | - | ⭐ 5 | 🔴 february 2024 |
| japanese-address-testdata | - | - | ⭐ 14 | 🔴 september 2023 |
| jlpt-word-list | - | - | ⭐ 58 | 🔴 february 2022 |
| hiragana_mojigazo | - | - | ⭐ 18 | 🔴 april 2020 |
| lawqa_jp | - | - | ⭐ 264 | 🟢 last friday |
| yjcaptions | - | - | ⭐ 60 | 🔴 november 2016 |
| ja-vg-vqa | - | - | ⭐ 30 | 🔴 november 2018 |
| lawhub | - | - | ⭐ 152 | 🔴 november 2020 |
| japanese-subtitles-word-kanji-frequency-lists | - | - | ⭐ 39 | 🔴 december 2023 |
| jconj | - | - | ⭐ 35 | 🔴 may 2020 |
| extract_jawp_names | - | - | ⭐ 21 | 🔴 december 2022 |
| cejc_yomichan_freq_dict | - | - | ⭐ 9 | 🔴 june 2023 |
| wikidict-ja | - | - | ⭐ 5 | 🔴 june 2016 |
| ajimee-bench | - | - | ⭐ 18 | 🔴 january 2025 |
| j-spaw | - | - | ⭐ 5 | 🟡 august 2025 |
| camera3 | - | - | ⭐ 4 | 🔴 may 2024 |
| jgpqa | - | - | ⭐ 2 | 🟡 september 2025 |
| tanaka-corpus-plus | - | - | ⭐ 2 | 🔴 june 2021 |
| emotioncorpusjapanesetokushimaa2lab | - | - | ⭐ 2 | 🔴 september 2024 |
| osworld-jp | - | - | ⭐ 2 | 🟡 november 2025 |
| quasi_japanese_reviews | - | - | ⭐ 1 | 🔴 july 2023 |
| psychiatry-clinical-notes | - | - | ⭐ 1 | 🟡 october 2025 |
| merged-town-names | - | - | ⭐ 1 | 🔴 may 2022 |
| japanesetextemoticondata | - | - | ⭐ 1 | 🔴 march 2021 |
| mishearing-corpus | - | - | ⭐ 1 | 🟢 january |
| kotowaza | - | - | ⭐ 1 | 🟢 yesterday |
| selective-rag-kasensabo | - | - | ⭐ 1 | 🟡 november 2025 |
Guides and tutorials for learning Japanese NLP tools and techniques
- spacy_tutorial - spaCy tutorial in English and Japanese. spacy-transformers, BERT, GiNZA.
- fastTextJapaneseTutorial - Tutorial to train fastText with Japanese corpus
- allennlp-NER-ja - AllenNLP-NER-ja: named entity recognition for Japanese with AllenNLP
- chariot-PyTorch-Japanese-text-classification - Experiment for Japanese Text classification using chariot and PyTorch
- ginza-examples - An introduction to GiNZA, a Japanese NLP library
- DocumentClassificationUsingBERT-Japanese - DocumentClassificationUsingBERT-Japanese
- BERT_Japanese_Google_Colaboratory - How to run Japanese BERT on Google Colaboratory.
- bert-book - Support page for the book "Introduction to Natural Language Processing with BERT: Practical Programming with Transformers"
- janome-tutorial - Introductory text mining tutorial using Janome.
- handson-language-models - Hands-on materials for Japanese language models
- JapaneseNLI - Trying Japanese textual inference on Google Colab
- deep-learning-with-pytorch-ja - Japanese edition repository of deep-learning-with-pytorch.
- bert-classification-tutorial - Text classification with BERT (2023 edition)
- python-nlp-book - Support page for the book "Natural Language Processing with Deep Learning" (Kyoritsu Shuppan)
- llm-book - GitHub repository for the book "Introduction to Large Language Models" (Gijutsu-Hyoron, 2023)
- nlp2024-tutorial-3 - NLP2024 Tutorial 3: learning Japanese large language models by building them; environment setup steps and source code
- japanese-ir-tutorial - Japanese information retrieval tutorial
- nlpbook - Support site for the book "Textbook of Natural Language Processing"
- kantan-regex-book - Learning regular expression engines by building one
- bert-classification-tutorial-2024 - Text classification with BERT (2024 edition)
- Gemma2_2b_Japanese_finetuning_colab.ipynb - Fine-Tuning Google Gemma for Japanese Instructions
- nlp100v2020 - Solving "NLP 100 Exercises 2020" in Python
- textmining-ja - Practice in natural language processing and text analysis with R
- nlp2025-tutorial-2 - Materials and source code for the NLP2025 tutorial "Geographic Information and Language Processing: A Practical Introduction"
- nlp100v2025 - Solving "NLP 100 Exercises 2025" in Python
- topic-models-ao - Notes on the book "Topic Models" (Machine Learning Professional Series)
- slp2025 - Materials for the Otogaku Symposium 2025 tutorial "Introduction to Multimodal Large Language Models"
- book_impress_it-basic-education-ai - Impress book "IT Basic Literacy: Natural Language Processing & Image Analysis"
- genai-agent-advanced-book - Source code used in the book "A Practical Introduction to Generative AI Agents for Real-World Use" (Kodansha Scientific)
- course2024-nlp - Academic year 2024, Rikkyo University Graduate School of Artificial Intelligence and Science: Special Lecture on Natural Language Processing
- support-genai-book - Support page for the book "Generative AI Unraveled from the Original Papers" (Gijutsu-Hyoron)
- ir100 - Information Retrieval 100 Exercises
- kaggle_llm_book - Support site for the book "Introduction to Large Language Models with Kaggle: Practical Natural Language Processing Programming"
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| spacy_tutorial | - | - | ⭐ 63 | 🔴 january 2020 |
| fastTextJapaneseTutorial | - | - | ⭐ 205 | 🔴 september 2016 |
| allennlp-NER-ja | - | - | ⭐ 5 | 🔴 may 2022 |
| chariot-PyTorch-Japanese-text-classification | - | - | ⭐ 5 | 🔴 march 2019 |
| ginza-examples | - | - | ⭐ 16 | 🔴 january 2021 |
| DocumentClassificationUsingBERT-Japanese | - | - | ⭐ 0 | 🟡 august 2025 |
| BERT_Japanese_Google_Colaboratory | - | - | ⭐ 29 | 🔴 january 2022 |
| bert-book | - | - | ⭐ 264 | 🔴 february 2024 |
| janome-tutorial | - | - | ⭐ 31 | 🔴 march 2019 |
| handson-language-models | - | - | ⭐ 3 | 🔴 march 2021 |
| JapaneseNLI | - | - | ⭐ 6 | 🔴 june 2021 |
| deep-learning-with-pytorch-ja | - | - | ⭐ 142 | 🔴 may 2021 |
| bert-classification-tutorial | - | - | ⭐ 235 | 🔴 may 2024 |
| python-nlp-book | - | - | ⭐ 10 | 🔴 may 2023 |
| llm-book | - | - | ⭐ 461 | 🟢 december 2025 |
| nlp2024-tutorial-3 | - | - | ⭐ 112 | 🔴 april 2024 |
| japanese-ir-tutorial | - | - | ⭐ 3 | 🔴 june 2024 |
| nlpbook | - | - | ⭐ 14 | 🟡 april 2025 |
| kantan-regex-book | - | - | ⭐ 22 | 🔴 march 2024 |
| bert-classification-tutorial-2024 | - | - | ⭐ 30 | 🔴 july 2024 |
| Gemma2_2b_Japanese_finetuning_colab.ipynb | - | - | ⭐ repo not found | 🔴 august 2024 |
| nlp100v2020 | - | - | ⭐ 91 | 🟡 april 2025 |
| textmining-ja | - | - | ⭐ 3 | 🟡 october 2025 |
| nlp2025-tutorial-2 | - | - | ⭐ 17 | 🟢 january |
| nlp100v2025 | - | - | ⭐ 91 | 🟡 april 2025 |
| public-annotations | - | - | ⭐ 13 | 🟡 april 2025 |
| topic-models-ao | - | - | ⭐ 4 | 🟡 may 2025 |
| slp2025 | - | - | ⭐ 62 | 🟢 february |
| book_impress_it-basic-education-ai | - | - | ⭐ 4 | 🟡 june 2025 |
| genai-agent-advanced-book | - | - | ⭐ 185 | 🟡 september 2025 |
| course2024-nlp | - | - | ⭐ repo not found | 🔴 repo not found |
| support-genai-book | - | - | ⭐ 89 | 🟢 january |
| ir100 | - | - | ⭐ 93 | 🟢 december 2025 |
| kaggle_llm_book | - | - | ⭐ 19 | 🟢 last saturday |
Summaries of studies and papers in Japanese NLP research
- awesome-bert-japanese - A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
- GEC-Info-ja - Repository for collecting and classifying Japanese-language literature on grammatical error correction
- dataset-list - lists of text corpus and more (mainly Japanese)
- tuning_playbook_ja - A playbook for systematically maximizing the performance of deep learning models
- japanese-pitch-accent-resources - An attempt to consolidate Japanese phonetic resources, pitch accent resources in particular, into one list
- awesome-japanese-llm - Overview of open-source Japanese LLMs
| Name | downloads/week | total downloads | stars | last commit |
|---|---|---|---|---|
| awesome-bert-japanese | - | - | ⭐ 131 | 🔴 march 2023 |
| GEC-Info-ja | - | - | ⭐ 12 | 🟡 april 2025 |
| dataset-list | - | - | ⭐ 117 | 🔴 july 2024 |
| tuning_playbook_ja | - | - | ⭐ 190 | 🔴 january 2023 |
| japanese-pitch-accent-resources | - | - | ⭐ 124 | 🔴 february 2024 |
| awesome-japanese-llm | - | - | ⭐ 1.3k | 🟢 yesterday |
- ่ช็ถ่จ่ชๅฆ็ใฎ้ค ๅฑ
- yasuokaใฎๆฅ่จ๏ผ ๆฅๆฌ่ชไฟใๅใ่งฃๆๅจใ2020ๅนดใฎ็ทใใใใ
- yasuokaใฎๆฅ่จ๏ผ ๆฅๆฌ่ชไฟใๅใ่งฃๆๅจใ2021ๅนดใฎ็ทใใใใ
- https://github.com/topics/japanese?l=python
- https://github.com/topics/japanese-language?l=python
- https://github.com/search?o=desc&q=corpus+japanese&s=&type=Repositories
- https://paperswithcode.com/datasets?lang=japanese
- https://github.com/himkt/awesome-bert-japanese
- Awesome-Rust-MachineLearning - A collection of Rust crates, articles, and related resources for Japanese
- Introduction to Large Language Models II: Implementation and Evaluation of Generative LLMs
