awesome-japanese-nlp-resources

A curated list of resources dedicated to Python libraries, LLMs, dictionaries, and corpora of NLP for Japanese

Stars: 914

License: CC0-1.0

English | 日本語 (Japanese) | 繁體中文 (Chinese) | 简体中文 (Chinese)

🎉 The latest additions

Python

  • edinet-mcp - EDINET XBRL parsing library and MCP server for Japanese financial data.

Corpus

  • jfinqa - Japanese financial question answering benchmark with 1,000 questions from 68 listed companies.
  • kotowaza - Structured JSON dataset of Japanese proverbs (kotowaza) with meanings in Indonesian & English, examples, JLPT levels, and tags.
  • selective-rag-kasensabo - An MVP of a practical agentic RAG system that automatically classifies the expertise granularity (fine/coarse) of questions about construction technical standards with 96% accuracy and selects the best-suited RAG system (ColBERT/Naive). Four RAG systems were built on the river, sabo (erosion control), and dam technical standards published in November 2025, and their accuracy and speed were compared on 200 questions of varying expertise granularity.

Updated on Feb 17, 2026

Contents

Python library

Morphology analysis

Libraries that split Japanese text into words or morphemes and assign part-of-speech and base forms

  • sudachi.rs - SudachiPy 0.6* and above are developed as Sudachi.rs.
  • Janome - Japanese morphological analysis engine written in pure Python
  • mecab-python3 - mecab-python. You can find the original version here: http://taku910.github.io/mecab/
  • mecab - This repository is for building Windows 64-bit MeCab binary and improving MeCab Python binding.
  • fugashi - A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
  • nagisa - A Japanese tokenizer based on recurrent neural networks
  • pyknp - A Python Module for JUMAN++/KNP
  • Mykytea-python - Python wrapper for KyTea
  • konoha - Konoha: Simple wrapper of Japanese Tokenizers
  • natto-py - natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
  • rakutenma-python - Rakuten MA (Python version)
  • python-vaporetto - Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
  • dango - An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
  • rhoknp - Yet another Python binding for Juman++/KNP
  • python-vibrato - Viterbi-based accelerated tokenizer (Python wrapper)
  • jagger-python - Python binding for Jagger(C++ implementation of Pattern-based Japanese Morphological Analyzer)
  • Mecari - Mecari (Japanese Morphological Analysis with Graph Neural Networks)
Name | downloads/week | total downloads | stars | last commit
🔗 SudachiPy | 📥 387k | 📦 60M | ⭐ 427 | 🔴 October 2022
🔗 Janome | 📥 35k | 📦 11M | ⭐ 902 | 🟡 October 2025
🔗 mecab-python3 | 📥 299k | 📦 35M | ⭐ 580 | 🟡 November 2025
🔗 mecab | 📥 3k | 📦 646k | ⭐ 271 | 🔴 October 2024
🔗 fugashi | 📥 129k | 📦 13M | ⭐ 507 | 🟡 October 2025
🔗 nagisa | 📥 67k | 📦 7M | ⭐ 412 | 🟢 last Thursday
🔗 pyknp | 📥 1k | 📦 3M | ⭐ 92 | 🟢 January
🔗 Mykytea-python | 📥 1k | 📦 553k | ⭐ 36 | 🔴 January 2024
🔗 konoha | 📥 37k | 📦 5M | ⭐ 261 | 🟡 April 2025
🔗 natto-py | 📥 127k | 📦 32M | ⭐ 95 | 🔴 November 2023
🔗 rakutenma-python | 📥 113 | 📦 27k | ⭐ 23 | 🔴 May 2017
🔗 python-vaporetto | 📥 292 | 📦 173k | ⭐ 21 | 🟡 June 2025
🔗 dango | 📥 90 | 📦 25k | ⭐ 25 | 🔴 November 2021
🔗 rhoknp | 📥 10k | 📦 977k | ⭐ 37 | 🟢 January
🔗 python-vibrato | 📥 160 | 📦 115k | ⭐ 43 | 🔴 September 2024
🔗 jagger-python | 📥 2k | 📦 295k | ⭐ 12 | 🔴 March 2024
🔗 Mecari | - | - | ⭐ 38 | 🟡 September 2025
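
To make the idea of splitting text into morphemes with part-of-speech and base forms concrete, here is a minimal sketch of greedy longest-match segmentation over a tiny dictionary. It does not reflect how any of the libraries above are implemented; the `LEXICON` entries, the tag names, and the `tokenize` function are hypothetical and exist only for illustration.

```python
# Toy lexicon (hypothetical): surface form -> (part of speech, base form).
LEXICON = {
    "東京": ("noun", "東京"),
    "東京都": ("noun", "東京都"),
    "京都": ("noun", "京都"),
    "に": ("particle", "に"),
    "住ん": ("verb", "住む"),
    "で": ("particle", "で"),
    "いる": ("verb", "いる"),
}
MAX_LEN = max(len(surface) for surface in LEXICON)


def tokenize(text):
    """Greedy longest-match segmentation; unknown characters become 1-char tokens."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            surface = text[i:i + size]
            if surface in LEXICON:
                pos, base = LEXICON[surface]
                break
        else:
            # No dictionary entry starts here: emit a single unknown character.
            surface, pos, base = text[i], "unknown", text[i]
        tokens.append((surface, pos, base))
        i += len(surface)
    return tokens


tokens = tokenize("東京都に住んでいる")
for surface, pos, base in tokens:
    print(surface, pos, base)
```

Greedy matching prefers 東京都 over 東京 here, but it commits to the first long match and can mis-segment ambiguous input. Practical analyzers typically score alternative segmentations instead of committing greedily.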

Parsing

Libraries that analyze syntactic and dependency structures of Japanese sentences

  • ginza - A Japanese NLP Library using spaCy as framework based on Universal Dependencies
  • cabocha - Yet Another Japanese Dependency Structure Analyzer
  • UniDic2UD - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese
  • camphr - Camphr - NLP library for creating pipeline components
  • SuPar-UniDic - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese with BERT models
  • depccg - A* CCG Parser with a Supertag and Dependency Factored Model
  • bertknp - A Japanese dependency parser based on BERT
  • esupar - Tokenizer POS-Tagger and Dependency-parser with BERT/RoBERTa/DeBERTa models for Japanese and other languages
  • yomikata - Heteronym disambiguation library using a fine-tuned BERT model.
  • jdepp-python - Python binding for J.DepP(C++ implementation of Japanese Dependency Parsers)
  • lightblue - A CCG parser for Japanese with DTS-representations
  • natsume-simple - natsume-simple is a dependency relation search system for Japanese
  • jdeppy - Python wrapper for J.DepP, fast Japanese Dependency Parser
Name | downloads/week | total downloads | stars | last commit
🔗 ginza | 📥 10k | 📦 2M | ⭐ 832 | 🔴 March 2024
🔗 cabocha | 📥 110 | 📦 54k | ⭐ 7 | 🔴 August 2022
🔗 UniDic2UD | 📥 423 | 📦 326k | ⭐ 38 | 🟢 December 2025
🔗 camphr | 📥 614 | 📦 268k | ⭐ 338 | 🔴 August 2021
🔗 SuPar-UniDic | 📥 471 | 📦 118k | ⭐ 20 | 🔴 repo not found
🔗 depccg | 📥 117 | 📦 46k | ⭐ 98 | 🔴 August 2023
🔗 bertknp | - | - | ⭐ 23 | 🔴 October 2021
🔗 esupar | 📥 604 | 📦 167k | ⭐ 54 | 🟢 January
🔗 yomikata | 📥 35 | 📦 49k | ⭐ 32 | 🔴 October 2023
🔗 jdepp-python | 📥 2k | 📦 279k | ⭐ 4 | 🔴 February 2024
🔗 lightblue | - | - | ⭐ 26 | 🟢 last Thursday
🔗 natsume-simple | - | - | ⭐ 5 | 🔴 February 2025
🔗 jdeppy | 📥 41 | 📦 11k | ⭐ 3 | 🔴 February 2022

Converter

Libraries that convert between character types such as kana, romaji, and full-width/half-width forms

  • pykakasi - Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
  • cutlet - Japanese to romaji converter in Python
  • alphabet2kana - Convert English alphabet to Katakana
  • Convert-Numbers-to-Japanese - Converts Arabic numerals, or 'western' style numbers, to a Japanese context.
  • mozcpy - Mozc for Python: Kana-Kanji converter
  • jamorasep - Japanese text parser to separate Hiragana/Katakana string into morae (syllables).
  • text2phoneme - A script that converts Japanese sentences into phoneme sequences
  • jntajis-python - A fast character conversion and transliteration library based on the scheme defined for Japan National Tax Agency (国税庁) 's
  • wiredify - Convert Japanese kana from ba-bi-bu-be-bo into va-vi-vu-ve-vo
  • mecab-text-cleaner - Simple Python package (CLI/Python API) for getting Japanese readings (yomigana) and accents using MeCab.
  • pynormalizenumexp - A Python implementation of NormalizeNumexp, which extracts and normalizes quantity and time expressions
  • Jusho - Easy wrapper for the postal code data of Japan
  • yurenizer - Japanese text normalizer that resolves spelling inconsistencies. (日本語表記揺れ解消ツール)
  • e2k - A tool for automatic English to Katakana conversion
  • alkana.py - A tool to get the katakana reading of an alphabetical string.
  • englishtokanaconverter - A program that converts English strings into katakana
  • kanjiconv - Kanji Converter to Hiragana, Katakana, Roman alphabet.
  • kanjize - Kanjize (カンジャイズ): Easy converter between Kanji-Number and Integer
Name | downloads/week | total downloads | stars | last commit
🔗 pykakasi | 📥 230k | 📦 28M | ⭐ 443 | 🔴 July 2022
🔗 cutlet | 📥 21k | 📦 1M | ⭐ 372 | 🟡 June 2025
🔗 alphabet2kana | 📥 336 | 📦 56k | ⭐ 14 | 🟢 yesterday
🔗 Convert-Numbers-to-Japanese | - | - | ⭐ 50 | 🔴 November 2020
🔗 mozcpy | 📥 176 | 📦 12k | ⭐ 46 | 🔴 February 2025
🔗 jamorasep | 📥 322 | 📦 9k | ⭐ 11 | 🟢 today
🔗 text2phoneme | - | - | ⭐ 13 | 🔴 May 2023
🔗 jntajis-python | 📥 348 | 📦 105k | ⭐ 21 | 🔴 June 2023
🔗 wiredify | 📥 34 | 📦 6k | ⭐ 3 | 🟢 December 2025
🔗 mecab-text-cleaner | 📥 50 | 📦 4k | ⭐ 7 | 🔴 November 2024
🔗 pynormalizenumexp | 📥 72 | 📦 14k | ⭐ 8 | 🔴 April 2024
🔗 Jusho | 📥 704 | 📦 53k | ⭐ 11 | 🔴 June 2024
🔗 yurenizer | 📥 74 | 📦 17k | ⭐ 4 | 🟡 March 2025
🔗 e2k | 📥 821 | 📦 23k | ⭐ 15 | 🟡 November 2025
🔗 alkana.py | - | - | ⭐ 33 | 🔴 October 2021
🔗 englishtokanaconverter | - | - | ⭐ 4 | 🟢 yesterday
🔗 kanjiconv | 📥 111 | 📦 12k | ⭐ 16 | 🟡 October 2025
🔗 kanjize | 📥 8k | 📦 1M | ⭐ 68 | 🟡 June 2025
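
As a rough illustration of kana conversion, the sketch below shifts code points between the Hiragana and Katakana Unicode blocks, which are laid out in parallel 0x60 apart. It uses only the standard library; the function names are made up for this example and are not the API of any library listed above.

```python
# Hiragana U+3041..U+3096 and Katakana U+30A1..U+30F6 are parallel blocks.
KANA_OFFSET = 0x60


def hiragana_to_katakana(text):
    """Shift every hiragana character into the katakana block; leave the rest."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in text
    )


def katakana_to_hiragana(text):
    """Shift every katakana character into the hiragana block; leave the rest."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )


print(hiragana_to_katakana("ひらがな"))
print(katakana_to_hiragana("カタカナ"))
```

Characters outside the two ranges, such as kanji, digits, and the long vowel mark ー, pass through unchanged. Romaji or kanji conversion needs readings and a dictionary, which is where the dedicated converters come in.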

Preprocessor

Libraries that normalize and clean text before analysis

  • neologdn - Japanese text normalizer for mecab-neologd
  • jaconv - Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
  • mojimoji - A fast converter between Japanese hankaku and zenkaku characters
  • text-cleaning - A powerful text cleaner for Japanese web texts
  • HojiChar - A text preprocessing tool that composes and manages multiple preprocessing steps
  • utsuho - Utsuho is a Python module that facilitates bidirectional conversion between half-width katakana and full-width katakana in Japanese.
  • python-habachen - Yet Another Fast Japanese String Converter
  • kairyou - Quickly preprocesses Japanese text using NLP/NER from SpaCy for Japanese translation or other NLP tasks.
Name | downloads/week | total downloads | stars | last commit
🔗 neologdn | 📥 6k | 📦 1M | ⭐ 286 | 🟢 December 2025
🔗 jaconv | 📥 560k | 📦 60M | ⭐ 340 | 🟢 February
🔗 mojimoji | 📥 60k | 📦 11M | ⭐ 153 | 🔴 January 2024
🔗 text-cleaning | - | - | ⭐ 12 | 🔴 November 2022
🔗 HojiChar | 📥 17k | 📦 725k | ⭐ 125 | 🟡 November 2025
🔗 utsuho | 📥 80 | 📦 19k | ⭐ 4 | 🟡 October 2025
🔗 python-habachen | 📥 5k | 📦 2M | ⭐ 6 | 🟡 October 2025
🔗 kairyou | 📥 126 | 📦 30k | ⭐ 6 | 🟡 June 2025
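
For a sense of what normalization before analysis can look like, here is a small standard-library sketch that applies Unicode NFKC normalization and collapses whitespace. It is only a baseline under simple assumptions: the `normalize_text` helper is hypothetical, and the libraries above apply their own, often more specialized, rules.

```python
import re
import unicodedata


def normalize_text(text):
    """Fold width variants with NFKC, then collapse runs of whitespace."""
    # NFKC maps full-width ASCII to half-width and half-width katakana
    # (including separate voiced sound marks) to full-width katakana.
    text = unicodedata.normalize("NFKC", text)
    # The ideographic space U+3000 becomes a regular space under NFKC,
    # so a single whitespace pattern covers it.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("ＡＢＣ　　１２３"))
print(normalize_text("ﾃﾞｰﾀ"))
```

NFKC is a blunt tool: it also rewrites characters such as circled digits and ligatures, so whether it is appropriate depends on the downstream task.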

Sentence splitter

Libraries that automatically detect sentence boundaries and split text

  • Bunkai - Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
  • japanese-sentence-breaker - Japanese Sentence Breaker
  • sengiri - Yet another sentence-level tokenizer for the Japanese text
  • budoux - Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning powered line break organizer tool.
  • ja_sentence_segmenter - Japanese sentence segmentation library for Python
  • hasami - A tool to perform sentence segmentation on Japanese text
  • kuzukiri - Japanese Text Segmenter for Python written in Rust
  • ja-senter-benchmark - Comparison of Japanese Sentence Segmentation Tools
  • fast-bunkai - Japanese sentence splitting (日本語文境界判定器), 40–250× faster via a Rust-accelerated Python library with near-perfect API compatibility with megagonlabs/bunkai.
Name | downloads/week | total downloads | stars | last commit
🔗 bunkai | 📥 511 | 📦 105k | ⭐ 199 | 🔴 August 2023
🔗 japanese-sentence-breaker | 📥 38 | 📦 5k | ⭐ 14 | 🔴 February 2021
🔗 sengiri | 📥 80 | 📦 136k | ⭐ 24 | 🟡 November 2025
🔗 budoux | 📥 7k | 📦 386k | ⭐ 1.6k | 🟢 today
🔗 ja_sentence_segmenter | 📥 1k | 📦 179k | ⭐ 73 | 🔴 April 2023
🔗 hasami | 📥 316 | 📦 37k | ⭐ 6 | 🔴 February 2021
🔗 kuzukiri | 📥 101 | 📦 26k | ⭐ 6 | 🟡 June 2025
🔗 ja-senter-benchmark | - | - | ⭐ 9 | 🔴 February 2023
🔗 fast-bunkai | 📥 108 | 📦 3k | ⭐ 65 | 🟡 October 2025
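
A naive baseline helps show why dedicated sentence splitters exist. The sketch below cuts text after runs of sentence-final punctuation using a regular expression. The `split_sentences` helper is hypothetical and deliberately simplistic, and none of the tools above are claimed to work this way.

```python
import re

# One sentence = a run of non-terminator characters followed by any
# number of terminators (so "！？" stays attached to its sentence).
_SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*")


def split_sentences(text):
    """Split on Japanese and ASCII sentence-final punctuation and newlines."""
    return [m.group().strip() for m in _SENTENCE.finditer(text) if m.group().strip()]


for sentence in split_sentences("今日は晴れです。明日は雨でしょうか？傘を持っていきます"):
    print(sentence)
```

This rule breaks on quoted speech: 彼は「行く。」と言った。 is split inside the quotation marks, after 行く。. Handling brackets, emoticons, and line breaks inside sentences is the kind of boundary disambiguation the listed tools target.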

Sentiment analysis

Libraries that detect emotions or polarity in text

  • oseti - Dictionary based Sentiment Analysis for Japanese
  • negapoji - Japanese negative/positive classification. Judges the polarity of Japanese documents.
  • pymlask - Emotion analyzer for Japanese text
  • asari - Japanese sentiment analyzer implemented in Python.
Name | downloads/week | total downloads | stars | last commit
🔗 oseti | 📥 314 | 📦 166k | ⭐ 97 | 🟡 August 2025
🔗 negapoji | - | - | ⭐ 151 | 🔴 August 2017
🔗 pymlask | 📥 81 | 📦 66k | ⭐ 116 | 🔴 July 2024
🔗 asari | 📥 180 | 📦 79k | ⭐ 152 | 🔴 October 2022
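
Dictionary-based polarity scoring, the general approach named in the oseti description, can be sketched in a few lines. Everything here is an assumption for illustration: the six-word `POLARITY` lexicon is invented, matching is plain substring counting with no tokenization, and `polarity_score` is not the API of any listed library.

```python
# Hypothetical polarity lexicon: +1 for positive words, -1 for negative ones.
POLARITY = {
    "好き": 1,
    "楽しい": 1,
    "最高": 1,
    "嫌い": -1,
    "退屈": -1,
    "最悪": -1,
}


def polarity_score(text):
    """Average polarity of lexicon words found in text, in [-1.0, 1.0]."""
    hits = []
    for word, polarity in POLARITY.items():
        # Count every occurrence of the word as one vote.
        hits.extend([polarity] * text.count(word))
    return sum(hits) / len(hits) if hits else 0.0


print(polarity_score("この映画は最高で楽しい"))
print(polarity_score("好きだけど退屈"))
```

The sketch ignores negation, so 好きではない still counts as positive. Real lexicon-based analyzers usually tokenize first and handle negation and much larger dictionaries.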

Machine translation

Libraries that automatically translate text between languages

  • jparacrawl-finetune - An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
  • JASS - JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation (LREC2020) & Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation (ACM TALLIP)
  • PheMT - A phenomenon-wise evaluation dataset for Japanese-English machine translation robustness. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena: Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. COLING 2020.
  • VISA - An ambiguous subtitles dataset for visual scene-aware machine translation
  • plamo-translate-cli - A command-line interface for translation using the plamo-2-translate model with local execution.
Name | downloads/week | total downloads | stars | last commit
🔗 jparacrawl-finetune | - | - | ⭐ 105 | 🔴 April 2021
🔗 JASS | - | - | ⭐ 16 | 🔴 January 2022
🔗 PheMT | - | - | ⭐ 18 | 🔴 February 2021
🔗 VISA | - | - | ⭐ 14 | 🔴 October 2022
🔗 plamo-translate-cli | - | - | ⭐ 325 | 🟡 October 2025

Named entity recognition

Libraries that extract names of people, places, and organizations from text

  • namaco - Character Based Named Entity Recognition.
  • entitypedia - Entitypedia is an Extended Named Entity Dictionary from Wikipedia.
  • noyaki - Converts character span label information to tokenized text-based label information.
  • bert-japanese-ner-finetuning - Code to perform finetuning of the BERT model. A sample that creates and uses a model for named entity recognition by fine-tuning BERT.
  • joint-information-extraction-hs - Code that runs inference to measure named entity and relation extraction accuracy on a case report corpus built with detailed annotation criteria
  • pygeonlp - pygeonlp, A python module for geotagging Japanese texts.
  • bert-ner-japanese - A fine-tuning program for Japanese named entity recognition with BERT
  • huggingface-finetune-japanese - Examples to finetune encoder-only and encoder-decoder transformers for Japanese language (Hugging Face)
  • novelanalysisbyner - Named entity recognition by fine-tuning BERT
Name | downloads/week | total downloads | stars | last commit
🔗 namaco | - | - | ⭐ 40 | 🔴 February 2018
🔗 entitypedia | - | - | ⭐ 13 | 🔴 December 2018
🔗 noyaki | 📥 100 | 📦 20k | ⭐ 5 | 🔴 August 2022
🔗 bert-japanese-ner-finetuning | - | - | ⭐ 11 | 🔴 June 2022
🔗 joint-information-extraction-hs | - | - | ⭐ 1 | 🔴 November 2021
🔗 pygeonlp | 📥 237 | 📦 21k | ⭐ 22 | 🟡 October 2025
🔗 bert-ner-japanese | - | - | ⭐ 5 | 🔴 September 2022
🔗 huggingface-finetune-japanese | - | - | ⭐ 16 | 🔴 October 2023
🔗 novelanalysisbyner | - | - | ⭐ 2 | 🔴 June 2024
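
Several entries above deal with turning character-span annotations into token-level labels for training. The sketch below shows one common way to do that with BIO tags. It is an illustration only, not the noyaki API: `spans_to_bio`, the tag names, and the sample tokens are assumptions, and it presumes the tokens concatenate back to the original text with no whitespace.

```python
def spans_to_bio(tokens, spans):
    """Map (start, end, tag) character spans onto tokens as BIO labels.

    A token is labeled only if it lies entirely inside a span; `end` is exclusive.
    """
    labels = ["O"] * len(tokens)
    offset = 0
    for i, token in enumerate(tokens):
        start, end = offset, offset + len(token)
        for span_start, span_end, tag in spans:
            if span_start <= start and end <= span_end:
                # The first token of a span gets B-, later ones get I-.
                prefix = "B-" if start == span_start else "I-"
                labels[i] = prefix + tag
                break
        offset = end
    return labels


tokens = ["田中", "さん", "は", "東京", "都", "に", "住む"]
spans = [(0, 2, "PERSON"), (5, 8, "LOCATION")]
labels = spans_to_bio(tokens, spans)
for token, label in zip(tokens, labels):
    print(token, label)
```

Tokens that only partly overlap a span are left as `O` in this sketch; how to treat such mismatches between tokenizer output and annotation boundaries is a design decision each tool makes differently.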

OCR

Libraries that recognize and extract text from images

  • Manga OCR - Optical character recognition for Japanese text, with the main focus being Japanese manga
  • mokuro - Read Japanese manga inside browser with selectable text.
  • handwritten-japanese-ocr - Handwritten Japanese OCR demo using touch panel to draw the input text using Intel OpenVINO toolkit
  • OCR_Japanease - Japanese OCR
  • ndlocr_cli - NDLOCR application
  • donut - Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
  • JMTrans - manga translator - get japanese manga from url to translate manga image
  • Kindai-OCR - OCR system for recognizing modern Japanese magazines
  • text_recognition - Text recognition module for NDLOCR
  • Poricom - Optical character recognition in manga images. Manga OCR desktop application
  • owocr - Optical character recognition for Japanese text
  • yomitoku - Yomitoku is an AI-powered document image analysis package designed specifically for the Japanese language.
  • findtextcenternet - Japanese OCR with CenterNet
  • simple-ocr-for-manga - A simple OCR for manga (Japanese traditional and Japanese vertical)
  • jp-ocr-evaluation - Evaluates OCR performance on images of Japanese text
  • paddleocr-vl-sft-for-japanese-manga-on-rtx-3060 - Fine-tune PaddleOCR-VL on the Manga109s dataset for Japanese manga text recognition. The base model struggles with vertical Japanese text reading order in manga. After fine-tuning, the model correctly handles manga-specific text layouts.
  • MangaOCR - A lightweight OCR model for Japanese text, especially in Manga
  • meikiocr - High-speed, high-accuracy, local OCR for Japanese video games
  • meikipop - Universal Japanese OCR popup dictionary for Windows, Linux and macOS
Name | downloads/week | total downloads | stars | last commit
🔗 manga-ocr | 📥 4k | 📦 240k | ⭐ 2.5k | 🟡 June 2025
🔗 mokuro | 📥 759 | 📦 89k | ⭐ 1.5k | 🟡 June 2025
🔗 handwritten-japanese-ocr | - | - | ⭐ 38 | 🔴 April 2022
🔗 OCR_Japanease | - | - | ⭐ 244 | 🔴 April 2021
🔗 ndlocr_cli | - | - | ⭐ 567 | 🟡 September 2025
🔗 donut | 📥 342 | 📦 196k | ⭐ 6.8k | 🔴 July 2023
🔗 JMTrans | - | - | ⭐ 88 | 🔴 January 2021
🔗 Kindai-OCR | - | - | ⭐ 153 | 🔴 July 2023
🔗 text_recognition | - | - | ⭐ 8 | 🔴 July 2023
🔗 Poricom | - | - | ⭐ 415 | 🔴 June 2023
🔗 owocr | - | - | ⭐ 194 | 🟢 last Friday
🔗 yomitoku | 📥 1k | 📦 75k | ⭐ 1.3k | 🟢 January
🔗 findtextcenternet | - | - | ⭐ 55 | 🟡 August 2025
🔗 simple-ocr-for-manga | - | - | ⭐ 7 | 🔴 repo not found
🔗 jp-ocr-evaluation | - | - | ⭐ 1 | 🔴 March 2024
🔗 paddleocr-vl-sft-for-japanese-manga-on-rtx-3060 | - | - | ⭐ 7 | 🟢 December 2025
🔗 MangaOCR | - | - | ⭐ 34 | 🔴 May 2024
🔗 meikiocr | 📥 707 | 📦 15k | ⭐ 55 | 🟢 January
🔗 meikipop | - | - | ⭐ 206 | 🟢 last Saturday

Tool for pretrained models

Libraries that utilize pretrained models to improve accuracy and efficiency

  • JGLUE - JGLUE: Japanese General Language Understanding Evaluation
  • ginza-transformers - Use custom tokenizers in spacy-transformers
  • t5_japanese_dialogue_generation - Dialogue generation with T5
  • japanese_text_classification - To investigate various DNN text classifiers including MLP, CNN, RNN, BERT approaches.
  • Japanese-BERT-Sentiment-Analyzer - Deploying sentiment analysis server with FastAPI and BERT
  • jmlm_scoring - Masked Language Model-based Scoring for Japanese and Vietnamese
  • allennlp-shiba-model - AllenNLP integration for Shiba: Japanese CANINE model
  • evaluate_japanese_w2v - script to evaluate pre-trained Japanese word2vec model on Japanese similarity dataset
  • gector-ja - BERT-based GEC tagging for Japanese
  • Japanese-BPEEncoder - Japanese-BPEEncoder
  • Japanese-BPEEncoder_V2 - Japanese-BPEEncoder Version 2
  • transformer-copy - Japanese grammatical error correction tool
  • japanese-stable-diffusion - Japanese Stable Diffusion is a Japanese specific latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
  • nagisa_bert - A BERT model for nagisa
  • prefix-tuning-gpt - Example code for prefix-tuning GPT/GPT-NeoX models and for inference with trained prefixes
  • JGLUE-benchmark - Training and evaluation scripts for JGLUE, a Japanese language understanding benchmark
  • jptranstokenizer - Japanese Tokenizer for transformers library
  • jp-stable - JP Language Model Evaluation Harness
  • compare-ja-tokenizer - How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese-ACL SRW 2023
  • lm-evaluation-harness-jp-stable - A framework for few-shot evaluation of autoregressive language models.
  • llm-lora-classification - llm-lora-classification
  • rinna_gpt-neox_ggml-lora - The repository contains scripts and merge scripts that have been modified to adapt an Alpaca-Lora adapter for LoRA tuning when assuming the use of the "rinna/japanese-gpt-neox..." [gpt-neox] model converted to ggml.
  • japanese-llm-roleplay-benchmark - This repository was created to evaluate the character role-play performance of Japanese LLMs.
  • japanese-llm-ranking - This repository supports YuzuAI's Rakuda leaderboard of Japanese LLMs, which is a Japanese-focused analogue of LMSYS' Vicuna eval.
  • llm-jp-eval - This tool automatically evaluates Japanese large language models across multiple datasets.
  • llm-jp-sft - This repository contains the code for supervised fine-tuning of LLM-jp models.
  • llm-jp-tokenizer - A repository collecting tokenizer-related resources for LLMs developed by the LLM study group (LLM-jp).
  • japanese-lm-fin-harness - Japanese Language Model Financial Evaluation Harness
  • ja-vicuna-qa-benchmark - Japanese Vicuna QA Benchmark
  • swallow-evaluation - Swallow project: evaluation scripts for large language models
  • swallow-evaluation-instruct - Swallow project: evaluation framework for post-trained large language models
  • pretrained_doc2vec_ja - pretrained doc2vec models on Japanese Wikipedia
  • pl-bert-ja - A repository of Japanese Phoneme-Level BERT
Name | downloads/week | total downloads | stars | last commit
🔗 JGLUE | - | - | ⭐ 333 | 🟡 March 2025
🔗 ginza-transformers | 📥 790 | 📦 173k | ⭐ 16 | 🔴 August 2022
🔗 t5_japanese_dialogue_generation | - | - | ⭐ 3 | 🔴 November 2021
🔗 japanese_text_classification | - | - | ⭐ 9 | 🔴 January 2020
🔗 Japanese-BERT-Sentiment-Analyzer | - | - | ⭐ 2 | 🔴 April 2021
🔗 jmlm_scoring | - | - | ⭐ 5 | 🔴 February 2022
🔗 allennlp-shiba-model | 📥 137 | 📦 20k | ⭐ 12 | 🔴 June 2021
🔗 evaluate_japanese_w2v | - | - | ⭐ 12 | 🔴 November 2024
🔗 gector-ja | - | - | ⭐ 19 | 🔴 June 2021
🔗 Japanese-BPEEncoder | - | - | ⭐ 41 | 🔴 September 2021
🔗 Japanese-BPEEncoder_V2 | - | - | ⭐ 41 | 🔴 January 2023
🔗 transformer-copy | - | - | ⭐ 29 | 🔴 September 2020
🔗 japanese-stable-diffusion | - | - | ⭐ repo not found | 🔴 repo not found
🔗 nagisa_bert | 📥 923 | 📦 56k | ⭐ 5 | 🔴 December 2023
🔗 prefix-tuning-gpt | - | - | ⭐ repo not found | 🔴 repo not found
🔗 JGLUE-benchmark | - | - | ⭐ 18 | 🟢 February
🔗 jptranstokenizer | 📥 115 | 📦 27k | ⭐ 5 | 🔴 February 2024
🔗 jp-stable | - | - | ⭐ 154 | 🔴 November 2023
🔗 compare-ja-tokenizer | - | - | ⭐ 6 | 🔴 June 2023
🔗 lm-evaluation-harness-jp-stable | - | - | ⭐ 1 | 🔴 June 2023
🔗 llm-lora-classification | - | - | ⭐ 98 | 🔴 July 2023
🔗 rinna_gpt-neox_ggml-lora | - | - | ⭐ 18 | 🔴 May 2023
🔗 japanese-llm-roleplay-benchmark | - | - | ⭐ 40 | 🔴 November 2023
🔗 japanese-llm-ranking | - | - | ⭐ 50 | 🔴 March 2024
🔗 llm-jp-eval | - | - | ⭐ 147 | 🟢 December 2025
🔗 llm-jp-sft | - | - | ⭐ 62 | 🔴 June 2024
🔗 llm-jp-tokenizer | - | - | ⭐ 45 | 🔴 February 2025
🔗 japanese-lm-fin-harness | - | - | ⭐ 77 | 🟢 January
🔗 ja-vicuna-qa-benchmark | - | - | ⭐ 33 | 🔴 June 2024
🔗 swallow-evaluation | - | - | ⭐ 23 | 🟡 September 2025
🔗 swallow-evaluation-instruct | - | - | ⭐ 24 | 🟡 October 2025
🔗 pretrained_doc2vec_ja | - | - | ⭐ 25 | 🔴 January 2019
🔗 pl-bert-ja | - | - | ⭐ 22 | 🔴 December 2023

Others

General-purpose tools supporting Japanese language processing

  • namedivider-python - A tool for dividing the Japanese full name into a family name and a given name.
  • asa-python - A curated list of resources dedicated to Python libraries of NLP for Japanese
  • python_asa - Python version of the Japanese semantic role labeling system (ASA)
  • toiro - A comparison tool of Japanese tokenizers
  • ja-timex - A rule-based analyzer that extracts and normalizes temporal expressions written in natural language
  • JapaneseTokenizers - A set of metrics for feature selection from text data
  • daaja - This repository has implementations of data augmentation for NLP for Japanese.
  • accel-brain-code - The purpose of this repository is to make prototypes as case study in the context of proof of concept (PoC) and research and development (R&D) that I have written in my website. The main research topics are Auto-Encoders in relation to the representation learning, the statistical machine learning for energy-based models, adversarial generation net…
  • kyoto-reader - A processor for KyotoCorpus, KWDLC, and AnnotatedFKCCorpus
  • nlplot - Visualization Module for Natural Language Processing
  • rake-ja - Rapid Automatic Keyword Extraction algorithm for Japanese
  • jel - Japanese Entity Linker.
  • MedNER-J - Latest version of MedEX/J (Japanese disease name extractor)
  • zunda-python - Zunda: Japanese Enhanced Modality Analyzer client for Python.
  • AIO2_DPR_baseline - https://www.nlp.ecei.tohoku.ac.jp/projects/aio/
  • showcase - A PyTorch implementation of the Japanese Predicate-Argument Structure (PAS) analyser presented in the paper of Matsubayashi & Inui (2018) with some improvements.
  • darts-clone-python - Darts-clone python binding
  • jrte-corpus_example - Example codes for Japanese Realistic Textual Entailment Corpus
  • desuwa - Feature annotator to morphemes and phrases based on KNP rule files (pure-Python)
  • HotPepperGourmetDialogue - Restaurant Search System through Dialogue in Japanese.
  • nlp-recipes-ja - Sample code for natural language processing in Japanese
  • Japanese_nlp_scripts - Small example scripts for working with Japanese texts in Python
  • DNorm-J - Japanese version of DNorm
  • pyknp-eventgraph - EventGraph is a development platform for high-level NLP applications in Japanese.
  • ishi - Ishi: A volition classifier for Japanese
  • python-npylm - Unsupervised morphological analysis with a Bayesian hierarchical language model
  • python-npycrf - Semi-supervised morphological analysis integrating conditional random fields with a Bayesian hierarchical language model
  • unsupervised-pos-tagging - Unsupervised part-of-speech tagging
  • negima - Negima is a Python package to extract phrases in Japanese text by using the part-of-speeches based rules you defined.
  • YouyakuMan - Extractive summarizer using BertSum as summarization model
  • japanese-numbers-python - A parser for Japanese number (Kanji, arabic) in the natural language.
  • kantan - Look up Japanese words by radical patterns
  • make-meidai-dialogue - Get Japanese dialogue corpus
  • japanese_summarizer - A summarizer for Japanese articles.
  • chirptext - ChirpText is a collection of text processing tools for Python.
  • yubin - Japanese Address Munger
  • jawiki-cleaner - Japanese Wikipedia Cleaner
  • japanese2phoneme - A python library to convert Japanese to phoneme.
  • anlp_nlp2021_d3-1 - This repository contains codes related to the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification"
  • aozora_classification - This project aims to classify Japanese sentences by how similar they are to the writing of classic Japanese writers such as Soseki Natsume, Ogai Mori, and Ryunosuke Akutagawa.
  • aozora-corpus-generator - Generates plain or tokenized text files from the Aozora Bunko
  • JLM - A fast LSTM Language Model for large vocabulary language like Japanese and Chinese
  • NTM - Testing of Neural Topic Modeling for Japanese articles
  • EN-JP-ML-Lexicon - This is a English-Japanese lexicon for Machine Learning and Deep Learning terminology.
  • text-generation - Easy-to-use scripts to fine-tune GPT-2-JA with your own texts, to generate sentences, and to tweet them automatically.
  • chainer_nic - Neural Image Caption (NIC) on chainer, its pretrained models on English and Japanese image caption datasets.
  • unihan-lm - The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020
  • mbart-finetuning - Code to perform finetuning of the mBART model.
  • xvector_jtubespeech - xvector model on jtubespeech
  • TinySegmenterMaker - A tool for building your own trained models for TinySegmenter.
  • Grongish - A script for converting between Japanese and the Grongi language
  • WordCloud-Japanese - A script that renders Japanese text in WordCloud with morphological-analysis-like output without using MeCab (a morphological analysis engine)
  • snark - A DB access library using the Japanese WordNet
  • toEmoji - Something that converts Japanese sentences into emoji-only sentences
  • termextract - Practice implementation of a technical term extraction algorithm
  • JDT-with-KenLM-scoring - Scores response candidates from Japanese-Dialog-Transformer with a KenLM N-gram language model, then filters or reranks them.
  • mixture-of-unigram-model - Mixture of Unigram Model and Infinite Mixture of Unigram Model in Python. (混合ユニグラムモデルと無限混合ユニグラムモデル)
  • hidden-markov-model - Hidden Markov Model (HMM) and Infinite Hidden Markov Model (iHMM) in Python. (隠れマルコフモデルと無限隠れマルコフモデル)
  • Ngram-language-model - Ngram language model in Python. (Nグラム言語モデル)
  • ASRDeepSpeech - Automatic Speech Recognition with deepspeech2 model in pytorch with support from Zakuro AI.
  • neural_ime - Neural IME: Neural Input Method Engine
  • neural_japanese_transliterator - Can neural networks transliterate Romaji into Japanese correctly?
  • tinysegmenter - tokenizer specified for Japanese
  • AugLy-jp - Data Augmentation for Japanese Text on AugLy
  • furigana4epub - A Python script for adding furigana to Japanese epub books using Mecab and Unidic.
  • PyKatsuyou - Japanese verb/adjective inflections tool
  • jageocoder - Pure Python Japanese address geocoder
  • pygeonlp - pygeonlp, A python module for geotagging Japanese texts.
  • nksnd - New kana-kanji conversion engine
  • JaMIE - A Japanese Medical Information Extraction Toolkit
  • fasttext-vs-word2vec-on-twitter-data - A comparison of fastText and word2vec, with execution and training scripts
  • minimal-search-engine - A minimal search engine/PageRank/tf-idf
  • 5ch-analysis - Scrapes 5ch archive logs to trace words that were once popular (e.g., 香具師, orz)
  • tweet_extructor - A tweet downloader for the Twitter Japanese reputation analysis dataset
  • japanese-word-aggregation - Aggregating Japanese words based on Juman++ and ConceptNet5.5
  • jinf - A Japanese inflection converter
  • kwja - A unified language analyzer for Japanese
  • mlm-scoring-transformers - Reproduced package based on Masked Language Model Scoring (ACL2020).
  • ClipCap-for-Japanese - [PyTorch] ClipCap for Japanese
  • SAT-for-Japanese - [PyTorch] Show, Attend and Tell for Japanese
  • cihai - Python library for CJK (Chinese, Japanese, and Korean) language dictionary
  • marine - MARINE : Multi-task leaRnIng-based JapaNese accent Estimation
  • whisper-asr-finetune - Finetuning Whisper ASR model
  • japanese_chatbot - A PyTorch implementation of a Japanese chatbot using BERT and Transformer's decoder
  • radicalchar - A library for normalizing radical characters
  • akaza - Yet another Japanese IME for IBus/Linux
  • posuto - Japanese postal code data.
  • tacotron2-japanese - Tacotron2 implementation of Japanese
  • ibus-hiragana - Hiragana IME for IBus
  • furiganapad - Furigana Pad
  • chikkarpy - Japanese synonym library
  • ja-tokenizer-docker-py - Mecab + NEologd + Docker + Python3
  • JapaneseEmbeddingEval - JapaneseEmbeddingEval
  • gptuber-by-langchain - GPT acts as a YouTuber
  • shuwa - Extend GNOME On-Screen Keyboard for Input Methods
  • japanese-nli-model - This repository provides the code for Japanese NLI model, a fine-tuned masked language model.
  • tra-fugu - A tool for Japanese-English translation and English-Japanese translation by using FuguMT
  • fugumt - A translation environment that uses the machine translation engine published on ぷるーふおぶこんせぷと (Proof of Concept). It can translate strings entered in a form and translate PDFs.
  • JaSPICE - JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models
  • Retrieval-based-Voice-Conversion-WebUI-JP-localization - jp-localization
  • pyopenjtalk - Python wrapper for OpenJTalk
  • yomigana-ebook - Make learning Japanese easier by adding readings for every kanji in the eBook
  • N46Whisper - Whisper based Japanese subtitle generator
  • japanese_llm_simple_webui - A simple web interface for Japanese-capable LLMs (large language models) such as Rinna-3.6B and OpenCALM
  • pdf-translator - pdf-translator translates English PDF files into Japanese, preserving the original layout.
  • japanese_qa_demo_with_haystack_and_es - Sample Japanese question answering system using Haystack + Elasticsearch + wikipedia(ja)
  • mozc-devices - Automatically exported from code.google.com/p/mozc-morse
  • natsume - A Japanese text frontend processing toolkit
  • vits-japros-webui - Gradio WebUI for training Japanese TTS (VITS) and synthesizing speech
  • ja-law-parser - A Japanese law parser
  • dictation-kit - Japanese dictation kit using Julius
  • julius4seg - Segmentation support tool using Julius
  • voicevox_engine - Speech synthesis engine of VOICEVOX, a free, medium-quality text-to-speech software
  • LLaVA-JP - LLaVA-JP is a Japanese VLM trained by LLaVA method
  • RAG-Japanese - Open source RAG with Llama Index for Japanese LLM in low resource setting
  • bertjsc - Japanese Spelling Error Corrector using BERT (Masked-Language Model). Japanese proofreading based on BERT.
  • llm-leaderboard - Project of llm evaluation to Japanese tasks
  • jglue-evaluation-scripts - Training and evaluation scripts for JGLUE, a Japanese language understanding benchmark
  • BLIP2-Japanese - Modifying LAVIS' BLIP2 Q-former with models pretrained on Japanese datasets.
  • wikipedia-passages-jawiki-embeddings-utils - Scripts and utilities for converting Japanese Wikipedia passages into various Japanese embeddings and faiss indexes.
  • simple-simcse-ja - Exploring Japanese SimCSE
  • wikipedia-japanese-open-rag - Sample Gradio-based RAG that answers user questions based on Japanese Wikipedia articles
  • gpt4-autoeval - Script that uses GPT-4 to automatically evaluate language model responses
  • t5-japanese - Japanese T5 model
  • japanese_llm_eval - A repo for evaluating Japanese LLMs
  • jmteb - The evaluation scripts of JMTEB (Japanese Massive Text Embedding Benchmark)
  • pydomino - A tool for aligning phoneme labels to Japanese speech
  • easynovelassistant - A simple novel-generation assistant powered by "LightChatAssistant-TypeB", a lightweight local Japanese LLM with no restrictions or censorship. Its "Generate forever" mode, a perk of running locally, keeps generating so you can accumulate good outputs. Also supports reading text aloud.
  • clip-japanese - Sample code for QLoRA instruction tuning on Japanese datasets
  • rime-jaroomaji - Japanese rōmaji input schema for Rime IME
  • deep-question-generation - Automatic quiz generation using deep learning (Japanese T5 model)
  • magpie-nemotron - Code for building a synthetic dialogue dataset using the Magpie method and Nemotron-4-340B-Instruct
  • qlora_ja - Sample code for QLoRA instruction tuning on Japanese datasets
  • mozcdic-ut-jawiki - Mozc UT Jawiki Dictionary is a dictionary generated from the Japanese Wikipedia for Mozc.
  • shisa-v2 - Japanese / English Bilingual LLM
  • llm-translator - Mixtral-based Ja-En (En-Ja) Translation model
  • llm-jp-asr - Code for training a speech recognition model in which Whisper's decoder is replaced with llm-jp-1.3b-v1.0
  • rag-japanese - Open source RAG with Llama Index for Japanese LLM in low resource setting
  • monaka - A Japanese Parser (including historical Japanese)
  • jp-translate.cloud - A state-of-the-art open-source Japanese <--> English machine translation system based on the latest NMT research.
  • substring-word-finder - Determines whether contiguous substrings are words
  • heron-vlm-leaderboard - This project is a benchmarking tool for evaluating and comparing the performance of various Vision Language Models (VLMs). It uses two datasets: LLaVA-Bench-In-the-Wild and Japanese HERON Bench to measure model performance.
  • text2dataset - Easily turn large English text datasets into Japanese text datasets using open LLMs.
  • mecab-web-api - Web API for Japanese morphological analysis using MeCab
  • mecab_controller - Mecab wrapper to generate furigana readings.
  • vits - Text-to-speech reader and voice changer based on VITS
  • akari_chatgpt_bot - Chatbot app that converses using speech recognition, text generation, and speech synthesis
  • kudasai - Streamlining Japanese-English Translation with Advanced Preprocessing and Integrated Translation Technologies
  • mecab-visualizer - Tool for visualizing MeCab morphological analysis results
  • add-dictionary - App for adding entries to the OpenJTalk user dictionary via a GUI
  • j-moshi - J-Moshi: A Japanese Full-duplex Spoken Dialogue System
  • jatts - JATTS: Japanese TTS (for research)
  • tsukasa-speech - a Frontier Japanese Speech Generation net
  • symptom-expression-search - An experiment in semantic-structure search that absorbs variation in patient expressions, using Elasticsearch, GiNZA, and a patient expression dictionary
  • llm-jp-judge - Python tool for automatic evaluation of generated text
  • asagi-vlm-colaboratory-sample - Sample for trying out Asagi (a large-scale Japanese VLM built with synthetic datasets) on Colaboratory
  • llm-jp-eval-mm - This tool automatically evaluates Japanese multi-modal large language models across multiple datasets.
  • llm-jp-judge - Python tool for automatic evaluation of generated text
  • manga109api - Simple python API to read annotation data of Manga109
  • fastrtc-jp - Add-on kit providing Japanese TTS and STT for fastrtc
  • whisper-transcription - Audio transcription tool using the Whisper model, written in Python
  • pocket-researcher - Autonomous research agent powered by LLMs, for easy information gathering and quick overviews.
  • jtransbench - A tool to easily benchmark Japanese translation skills
  • easyllasa - EasyLlasa is a TSTS (TextSpeechToSpeech) tool that generates Japanese speech from 5 to 15 seconds of Japanese audio and Japanese text.
  • kanjikana-model - Model for matching the kanji and kana forms of personal names
  • deep-openreview-research-ja - Japanese-capable AI agent that automatically discovers and analyzes OpenReview papers
  • pitchbench - Experimental Japanese pitch accent based LLM Benchmark
  • mini-transformer-from-scratch - English to Japanese Transformer from scratch
  • vv_core_inference - Inference code for the deep learning models used in the VOICEVOX core
  • pyopenjtalk-plus - pyopenjtalk-plus: A Python wrapper for OpenJTalk with additional improvements
  • japanese_spelling_correction - Japanese Spelling Correction
  • py-kaomoji - python kaomoji
  • llm-jp-vila - This repository contains the code for training llm-jp/llm-jp-3-vila-14b, modified from VILA repository.
  • kanjivg-radical - kanjivg-radical
  • japanese-wordnet-visualization - This project visualizes the Japanese Wordnet (日本語ワードネット) with a web application built with Django
  • piper-plus - Enhanced Piper TTS with Japanese support, WebAssembly, multi-GPU training, and quality improvements.
  • Japanera - Easy Tools for Japanese Era System
  • bert-abstractive-text-summarization - Japanese Sentence Summarization with BERT
  • kyujipy - A Python library to convert Japanese texts from Shinjitai (新字体) to Kyujitai (舊字體) and vice versa
  • jitenbot - Web crawler for creating personal copies of Japanese dictionaries
  • ja-icd10 - Python package for handling Japanese-language information from ICD-10, the International Classification of Diseases
  • pl-bert-vits2 - VITS2 using Phoneme-Level Japanese BERT
  • ndc_predictor - Machine learning model for NDCPredictor (a pretrained fastText model that predicts the Nippon Decimal Classification from bibliographic information)
  • pfmt-bench-fin-ja - pfmt-bench-fin-ja: Preferred Multi-turn Benchmark for Finance in Japanese
  • marine-plus - MARINE : Multi-task leaRnIng-based JapaNese accent Estimation (Also supported Windows)
  • ja-tokenizer-benchmark - Compare the speed of various Japanese tokenizers in Python.
  • yat - yat: Yet Another Tokenizer for Japanese NLP
  • igakuqa119 - Evaluating LLMs on the 119th Japanese Medical Licensing Examination
  • japanese-luw-tokenizer - Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers
  • ibus-jig - ibus-jig: Japanese-language Input-method using GPT-4
  • jp-stopword-filter - A lightweight Python library designed to filter stopwords from Japanese text based on customizable rules.
  • yasumail - Synthetic Japanese business email generator for ML training data
  • himotoki - A Python-based Japanese Tokenizer, Dictionary, Morphological Analyzer and Romanization Tool. Based on JMDict for Language Learning.
Name downloads/week total downloads stars last commit
🔗 namedivider-python 📥 475 📦 78k ⭐ 251 🟡 november 2025
🔗 asa-python 📥 106 📦 30k ⭐ 11 🔴 february 2019
🔗 python_asa - - ⭐ 22 🔴 january 2020
🔗 toiro 📥 69 📦 26k ⭐ 121 🟡 november 2025
🔗 ja-timex 📥 440 📦 89k ⭐ 140 🔴 november 2023
🔗 JapaneseTokenizers - - ⭐ 137 🔴 march 2019
🔗 daaja 📥 113 📦 25k ⭐ 64 🔴 february 2023
🔗 accel-brain-code 📥 346 📦 148k ⭐ 320 🔴 december 2023
🔗 JGLUE - - ⭐ 333 🟡 march 2025
🔗 kyoto-reader 📥 518 📦 49k ⭐ 10 🔴 june 2024
🔗 nlplot 📥 163 📦 108k ⭐ 240 🔴 september 2022
🔗 rake-ja - - ⭐ 21 🔴 october 2018
🔗 jel 📥 45 📦 8k ⭐ 11 🔴 july 2021
🔗 MedNER-J - - ⭐ 18 🔴 may 2022
🔗 zunda-python 📥 23 📦 6k ⭐ 10 🔴 november 2019
🔗 AIO2_DPR_baseline - - ⭐ 16 🔴 january 2022
🔗 showcase 📥 32 📦 7k ⭐ 6 🔴 june 2018
🔗 darts-clone-python 📥 3k 📦 9M ⭐ 20 🔴 april 2022
🔗 jrte-corpus_example - - ⭐ 3 🔴 november 2021
🔗 desuwa 📥 59 📦 10k ⭐ 6 🔴 may 2022
🔗 HotPepperGourmetDialogue - - ⭐ 278 🔴 may 2016
🔗 nlp-recipes-ja - - ⭐ 65 🔴 april 2021
🔗 Japanese_nlp_scripts - - ⭐ 26 🔴 june 2019
🔗 DNorm-J - - ⭐ 9 🔴 june 2022
🔗 pyknp-eventgraph 📥 130 📦 65k ⭐ 9 🔴 september 2022
🔗 ishi 📥 38 📦 6k ⭐ 2 🔴 may 2020
🔗 python-npylm - - ⭐ 34 🔴 january 2019
🔗 python-npycrf - - ⭐ 11 🔴 march 2018
🔗 unsupervised-pos-tagging - - ⭐ 16 🔴 october 2017
🔗 negima 📥 39 📦 16k ⭐ 14 🔴 august 2018
🔗 YouyakuMan - - ⭐ 52 🔴 september 2020
🔗 japanese-numbers-python 📥 351 📦 2M ⭐ 21 🔴 april 2020
🔗 kantan - - ⭐ 8 🔴 october 2024
🔗 make-meidai-dialogue - - ⭐ 40 🔴 september 2017
🔗 japanese_summarizer - - ⭐ 10 🔴 august 2022
🔗 chirptext 📥 1k 📦 192k ⭐ 7 🔴 october 2022
🔗 yubin 📥 23 📦 3k ⭐ 3 🔴 october 2019
🔗 jawiki-cleaner 📥 265 📦 24k ⭐ 6 🔴 february 2021
🔗 japanese2phoneme 📥 41 📦 4k ⭐ 1 🔴 february 2022
🔗 anlp_nlp2021_d3-1 - - ⭐ 1 🔴 march 2022
🔗 aozora_classification - - ⭐ 11 🔴 september 2017
🔗 aozora-corpus-generator - - ⭐ 8 🟡 june 2025
🔗 JLM - - ⭐ 111 🔴 june 2019
🔗 NTM - - ⭐ 13 🔴 july 2019
🔗 EN-JP-ML-Lexicon - - ⭐ 40 🔴 march 2021
🔗 text-generation - - ⭐ 19 🟡 august 2025
🔗 chainer_nic - - ⭐ 17 🔴 december 2018
🔗 unihan-lm - - ⭐ 2 🔴 november 2020
🔗 mbart-finetuning - - ⭐ 3 🔴 october 2021
🔗 xvector_jtubespeech - - ⭐ 47 🔴 november 2023
🔗 TinySegmenterMaker - - ⭐ 72 🔴 september 2022
🔗 Grongish - - ⭐ 25 🟢 december 2025
🔗 WordCloud-Japanese - - ⭐ 9 🔴 january 2020
🔗 snark - - ⭐ 11 🔴 march 2020
🔗 toEmoji - - ⭐ 4 🔴 april 2018
🔗 termextract - - ⭐ 18 🔴 september 2018
🔗 JDT-with-KenLM-scoring - - ⭐ 1 🔴 july 2022
🔗 mixture-of-unigram-model - - ⭐ 6 🔴 june 2017
🔗 hidden-markov-model - - ⭐ 5 🔴 june 2017
🔗 Ngram-language-model - - ⭐ 5 🔴 december 2017
🔗 ASRDeepSpeech - - ⭐ 69 🔴 september 2022
🔗 neural_ime - - ⭐ 67 🔴 december 2016
🔗 neural_japanese_transliterator - - ⭐ 178 🔴 september 2017
🔗 tinysegmenter 📥 131k 📦 171k ⭐ repo not found 🔴 november 2015
🔗 AugLy-jp 📥 173 📦 30k ⭐ 7 🔴 september 2021
🔗 furigana4epub 📥 60 📦 12k ⭐ 29 🔴 september 2021
🔗 PyKatsuyou 📥 76 📦 19k ⭐ 12 🟡 march 2025
🔗 jageocoder 📥 7k 📦 320k ⭐ 93 🟡 september 2025
🔗 pygeonlp 📥 237 📦 21k ⭐ 22 🟡 october 2025
🔗 nksnd - - ⭐ 26 🔴 may 2018
🔗 JaMIE - - ⭐ 9 🔴 may 2023
🔗 fasttext-vs-word2vec-on-twitter-data - - ⭐ 48 🔴 august 2017
🔗 minimal-search-engine - - ⭐ 19 🔴 july 2019
🔗 5ch-analysis - - ⭐ 75 🔴 november 2018
🔗 tweet_extructor - - ⭐ 3 🔴 august 2022
🔗 japanese-word-aggregation - - ⭐ 2 🔴 august 2018
🔗 jinf 📥 107 📦 54k ⭐ 4 🔴 december 2022
🔗 kwja 📥 597 📦 55k ⭐ 138 🟡 august 2025
🔗 mlm-scoring-transformers - - ⭐ 6 🔴 december 2022
🔗 ClipCap-for-Japanese - - ⭐ 12 🔴 october 2022
🔗 SAT-for-Japanese - - ⭐ 2 🔴 october 2022
🔗 cihai 📥 832 📦 209k ⭐ 93 🟢 last saturday
🔗 marine 📥 109 📦 15k ⭐ 36 🔴 september 2022
🔗 whisper-asr-finetune - - ⭐ 32 🔴 december 2022
🔗 japanese_chatbot - - ⭐ repo not found 🔴 repo not found
🔗 radicalchar - - ⭐ 8 🔴 december 2022
🔗 akaza - - ⭐ 233 🟢 today
🔗 posuto 📥 6k 📦 653k ⭐ 224 🟢 february
🔗 tacotron2-japanese - - ⭐ 269 🔴 september 2022
🔗 ibus-hiragana - - ⭐ 78 🟡 november 2025
🔗 furiganapad - - ⭐ 19 🟡 april 2025
🔗 chikkarpy 📥 312 📦 58k ⭐ 55 🔴 february 2022
🔗 ja-tokenizer-docker-py - - ⭐ 36 🔴 may 2022
🔗 JapaneseEmbeddingEval - - ⭐ 183 🔴 october 2024
🔗 gptuber-by-langchain - - ⭐ 63 🔴 january 2023
🔗 shuwa - - ⭐ 145 🔴 december 2022
🔗 japanese-nli-model - - ⭐ 5 🔴 october 2022
🔗 tra-fugu - - ⭐ 6 🔴 march 2023
🔗 fugumt - - ⭐ 65 🔴 february 2021
🔗 JaSPICE 📥 35 📦 2k ⭐ 9 🔴 november 2023
🔗 Retrieval-based-Voice-Conversion-WebUI-JP-localization - - ⭐ 48 🔴 april 2023
🔗 pyopenjtalk 📥 15k 📦 1M ⭐ 243 🟡 april 2025
🔗 yomigana-ebook 📥 39 📦 7k ⭐ 25 🔴 february 2024
🔗 N46Whisper - - ⭐ 1.7k 🔴 february 2025
🔗 japanese_llm_simple_webui - - ⭐ 17 🔴 may 2024
🔗 pdf-translator - - ⭐ 337 🔴 may 2024
🔗 japanese_qa_demo_with_haystack_and_es - - ⭐ 1 🔴 december 2022
🔗 mozc-devices - - ⭐ 2.7k 🟡 november 2025
🔗 natsume 📥 0 📦 3k ⭐ repo not found 🔴 repo not found
🔗 vits-japros-webui - - ⭐ 42 🔴 january 2024
🔗 ja-law-parser - - ⭐ 25 🔴 january 2024
🔗 dictation-kit - - ⭐ 164 🔴 april 2019
🔗 julius4seg - - ⭐ 7 🔴 august 2021
🔗 voicevox_engine - - ⭐ 1.6k 🟢 today
🔗 LLaVA-JP - - ⭐ 64 🔴 june 2024
🔗 RAG-Japanese - - ⭐ 10 🟡 may 2025
🔗 bertjsc - - ⭐ 14 🔴 august 2024
🔗 llm-leaderboard - - ⭐ 91 🟡 september 2025
🔗 jglue-evaluation-scripts - - ⭐ 18 🟢 february
🔗 BLIP2-Japanese - - ⭐ 13 🟡 september 2025
🔗 wikipedia-passages-jawiki-embeddings-utils - - ⭐ 11 🔴 march 2024
🔗 simple-simcse-ja - - ⭐ 69 🔴 october 2023
🔗 wikipedia-japanese-open-rag - - ⭐ repo not found 🔴 repo not found
🔗 gpt4-autoeval - - ⭐ 16 🔴 june 2024
🔗 t5-japanese - - ⭐ 116 🟡 september 2025
🔗 japanese_llm_eval - - ⭐ 5 🔴 april 2024
🔗 jmteb - - ⭐ 84 🟡 november 2025
🔗 pydomino - - ⭐ 36 🟡 august 2025
🔗 easynovelassistant - - ⭐ 216 🔴 july 2024
🔗 clip-japanese - - ⭐ 13 🟡 september 2025
🔗 rime-jaroomaji - - ⭐ 47 🟢 last thursday
🔗 deep-question-generation - - ⭐ 12 🔴 march 2023
🔗 magpie-nemotron - - ⭐ 9 🔴 july 2024
🔗 qlora_ja - - ⭐ 1 🔴 july 2024
🔗 mozcdic-ut-jawiki - - ⭐ 26 🟢 february
🔗 shisa-v2 - - ⭐ 28 🟢 december 2025
🔗 llm-translator - - ⭐ 20 🔴 january 2025
🔗 llm-jp-asr - - ⭐ 9 🔴 september 2024
🔗 rag-japanese - - ⭐ 10 🟡 may 2025
🔗 monaka - - ⭐ 4 🔴 january 2025
🔗 jp-translate.cloud - - ⭐ 3 🔴 september 2024
🔗 substring-word-finder - - ⭐ 4 🟡 november 2025
🔗 heron-vlm-leaderboard - - ⭐ 6 🔴 december 2024
🔗 text2dataset - - ⭐ 26 🔴 january 2025
🔗 mecab-web-api - - ⭐ 40 🔴 july 2022
🔗 mecab_controller - - ⭐ 18 🟢 january
🔗 vits - - ⭐ 91 🔴 february 2023
🔗 akari_chatgpt_bot - - ⭐ 48 🟡 october 2025
🔗 kudasai - - ⭐ 26 🟡 june 2025
🔗 mecab-visualizer - - ⭐ 2 🔴 september 2023
🔗 add-dictionary - - ⭐ 3 🟡 october 2025
🔗 j-moshi - - ⭐ 298 🟡 june 2025
🔗 jatts - - ⭐ 44 🟡 may 2025
🔗 tsukasa-speech - - ⭐ 60 🟡 may 2025
🔗 symptom-expression-search - - ⭐ 2 🔴 february 2021
🔗 llm-jp-judge - - ⭐ 38 🟢 december 2025
🔗 asagi-vlm-colaboratory-sample - - ⭐ 1 🟡 march 2025
🔗 llm-jp-eval-mm - - ⭐ 41 🟢 january
🔗 llm-jp-judge - - ⭐ 38 🟢 december 2025
🔗 manga109api 📥 194 📦 45k ⭐ 128 🔴 march 2022
🔗 fastrtc-jp - - ⭐ 5 🟡 may 2025
🔗 whisper-transcription - - ⭐ 16 🟢 january
🔗 pocket-researcher - - ⭐ 11 🟡 april 2025
🔗 jtransbench - - ⭐ 13 🟡 october 2025
🔗 easyllasa - - ⭐ 25 🟡 september 2025
🔗 kanjikana-model - - ⭐ 111 🟢 december 2025
🔗 deep-openreview-research-ja - - ⭐ 13 🟡 november 2025
🔗 pitchbench - - ⭐ 1 🟢 december 2025
🔗 mini-transformer-from-scratch - - ⭐ 2 🟡 november 2025
🔗 vv_core_inference - - ⭐ 31 🟢 december 2025
🔗 pyopenjtalk-plus 📥 6k 📦 298k ⭐ 55 🟡 november 2025
🔗 japanese_spelling_correction - - ⭐ 14 🔴 september 2023
🔗 py-kaomoji 📥 81 📦 37k ⭐ 6 🔴 december 2018
🔗 llm-jp-vila - - ⭐ 10 🟡 august 2025
🔗 kanjivg-radical - - ⭐ 105 🔴 august 2018
🔗 japanese-wordnet-visualization - - ⭐ 3 🔴 november 2022
🔗 piper-plus - - ⭐ 29 🟢 last wednesday
🔗 Japanera 📥 2k 📦 342k ⭐ 35 🟡 june 2025
🔗 bert-abstractive-text-summarization - - ⭐ 49 🔴 december 2019
🔗 kyujipy 📥 77 📦 22k ⭐ 22 🟢 january
🔗 jitenbot - - ⭐ 4 🔴 december 2024
🔗 ja-icd10 - - ⭐ 5 🔴 july 2021
🔗 pl-bert-vits2 - - ⭐ 14 🔴 december 2023
🔗 ndc_predictor - - ⭐ 11 🔴 august 2021
🔗 pfmt-bench-fin-ja - - ⭐ 9 🟡 march 2025
🔗 marine-plus 📥 120 📦 11k ⭐ 8 🟡 october 2025
🔗 ja-tokenizer-benchmark - - ⭐ 7 🔴 february 2022
🔗 yat - - ⭐ 7 🔴 june 2018
🔗 igakuqa119 - - ⭐ 7 🟢 january
🔗 japanese-luw-tokenizer - - ⭐ 6 🔴 december 2021
🔗 ibus-jig - - ⭐ 4 🔴 december 2023
🔗 jp-stopword-filter 📥 42 📦 5k ⭐ 3 🔴 november 2024
🔗 yasumail - - ⭐ 2 🟢 january
🔗 himotoki 📥 437 📦 3k ⭐ 2 🟢 last friday

C++

Morphology analysis

High-performance libraries for Japanese morphological analysis

  • mecab - Yet another Japanese morphological analyzer
  • jumanpp - Juman++ (a Morphological Analyzer Toolkit)
  • kytea - The Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation, etc.
  • juman - Japanese Morphological Analysis System JUMAN
Name downloads/week total downloads stars last commit
🔗 mecab - - ⭐ 1.1k 🔴 february 2025
🔗 jumanpp - - ⭐ 407 🔴 march 2023
🔗 kytea - - ⭐ 212 🔴 april 2020
🔗 juman - - ⭐ 11 🔴 december 2021

Parsing

Libraries for dependency and syntactic parsing of Japanese sentences

  • cabocha - Yet Another Japanese Dependency Structure Analyzer
  • knp - A Japanese Parser
Name downloads/week total downloads stars last commit
🔗 cabocha - - ⭐ 121 🔴 february 2025
🔗 knp - - ⭐ 33 🔴 november 2023

Others

Other Japanese NLP and text processing libraries

  • jsc - Joint source channel model for Japanese Kana Kanji conversion, Chinese pinyin input and CJE mixed input.
  • aquaskk - An input method without morphological analysis.
  • mozc - Mozc - a Japanese Input Method Editor designed for multi-platform
  • trimatch - Trimatch: An (Exact|Prefix|Approximate) String Matching Library
  • resembla - Resembla: Word-based Japanese similar sentence search library
  • corvusskk - ▽▼ SKK-like Japanese Input Method Editor for Windows
  • mozuku - LSP server for analyzing and proofreading Japanese text.
Name downloads/week total downloads stars last commit
🔗 jsc - - ⭐ 15 🔴 december 2012
🔗 aquaskk - - ⭐ 367 🔴 july 2023
🔗 mozc - - ⭐ 2.8k 🟢 last thursday
🔗 trimatch - - ⭐ 2 🟢 february
🔗 resembla - - ⭐ 73 🟡 august 2025
🔗 corvusskk - - ⭐ 352 🟢 january
🔗 mozuku - - ⭐ 402 🟢 december 2025

Rust crate

Morphology analysis

Fast Japanese morphological analysis crates written in Rust

  • lindera - A morphological analysis library.
  • vaporetto - Vaporetto: Very Accelerated POintwise pREdicTion based TOkenizer
  • goya - Japanese Morphological Analysis written in Rust
  • vibrato - vibrato: Viterbi-based accelerated tokenizer
  • yoin - A Japanese Morphological Analyzer written in pure Rust
  • mecab-rs - Safe Rust bindings for MeCab, a part-of-speech and morphological analyzer library
  • awabi - A morphological analyzer using mecab dictionary
  • kanpyo - Japanese Morphological Analyzer written in Rust
Name downloads/week total downloads stars last commit
🔗 lindera - 📦 863k ⭐ 601 🟢 last tuesday
🔗 vaporetto - 📦 172k ⭐ 251 🟢 february
🔗 goya - 📦 11k ⭐ 81 🔴 december 2021
🔗 vibrato - 📦 55k ⭐ 399 🟢 february
🔗 yoin - 📦 3k ⭐ 25 🔴 october 2017
🔗 mecab-rs - 📦 39k ⭐ 66 🔴 september 2023
🔗 awabi - 📦 24k ⭐ 10 🟡 november 2025
🔗 kanpyo - 📦 2.5k ⭐ 107 🟢 january

Converter

Crates for script and character conversion in Japanese text

  • wana_kana_rust - Utility library for checking and converting between Japanese characters - Hiragana, Katakana - and Romaji
  • unicode-jp-rs - A Rust library to convert Japanese Half-width-kana[半角ｶﾅ] and Wide-alphanumeric[全角英数] into normal ones
  • kana - [Mirror] CLI program for transliterating romaji text to either hiragana or katakana
  • kanaria - This library provides features such as conversion and detection between hiragana and katakana and between half-width and full-width characters.
  • japanese-address-parser - A library that splits Japanese addresses into prefecture / municipality / town name / remainder
  • yosina - Yosina is a transliteration library that deals with the letters and symbols used in Japanese writing.
  • mojimoji-rs - Rust implementation of a fast converter between Japanese hankaku and zenkaku characters, mojimoji.
Name downloads/week total downloads stars last commit
🔗 wana_kana_rust - 📦 298k ⭐ 88 🟡 march 2025
🔗 unicode-jp-rs - 📦 61k ⭐ 19 🔴 april 2020
🔗 kana - - ⭐ 12 🔴 january 2023
🔗 kanaria - - ⭐ 21 🟢 last saturday
🔗 japanese-address-parser - - ⭐ 10 🟡 november 2025
🔗 yosina - - ⭐ 20 🟡 september 2025
🔗 mojimoji-rs - - ⭐ 4 🔴 november 2022

Search engine library

Libraries for Japanese full-text search and indexing

Name downloads/week total downloads stars last commit
🔗 lindera-tantivy - 📦 160k ⭐ 67 🟢 january
🔗 tantivy-vibrato - 📦 1.5k ⭐ 3 🔴 january 2023

Others

Supplementary crates for Japanese text and IME processing

  • daachorse - A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.
  • find-simdoc - Finding all pairs of similar documents time- and memory-efficiently
  • crawdad - Rust library of natural language dictionaries using character-wise double-array tries.
  • tokenizer-speed-bench - Comparison code of various tokenizers
  • stringmatch-bench - Here provides benchmark tools to compare the performance of data structures for string matching.
  • vime - Using Vim as an input method for X11 apps
  • voicevox_core - Core of VOICEVOX, a free, medium-quality text-to-speech software
  • akaza - Yet another Japanese IME for IBus/Linux
  • Jotoba - A free online, self-hostable, multilang Japanese dictionary.
  • dvorakjp-romantable - DvorakJP Roman Table for Google Japanese Input
  • niinii - Japanese glossator for assisted reading of text using Ichiran
  • cskk - SKK (Simple Kana Kanji henkan) library
  • japanki - Learn Japanese vocabs 🇯🇵 by doing quizzes on CLI!
  • jpreprocess - Japanese text preprocessor for Text-to-Speech applications (OpenJTalk rewrite in rust language)
  • listup_precedent - Software that generates a list of court precedent data by scraping the courts' website (https://www.courts.go.jp/index.html)
  • jisho - Jisho is a CLI tool & Rust library that provides a Japanese-English dictionary.
  • kanalizer - Library that infers readings from English words.
  • koharu - Automated manga translation tool with LLM, written in Rust.
  • yomine - A Japanese vocabulary mining tool designed to help language learners mine new words and expressions.
  • matsuba - Lightweight Japanese IME written in Rust
  • hujiang_dictionary - Japanese dictionary written in Rust; supports Telegram bot, AWS Lambda, and Cloudflare Workers, as well as LLM and search RAG.
Name downloads/week total downloads stars last commit
🔗 daachorse - 📦 651k ⭐ 242 🟢 january
🔗 find-simdoc - 📦 29k ⭐ 62 🟡 march 2025
🔗 crawdad - 📦 59k ⭐ 36 🔴 january 2025
🔗 tokenizer-speed-bench - - ⭐ 4 🔴 march 2023
🔗 stringmatch-bench - - ⭐ 3 🔴 september 2022
🔗 vime - - ⭐ 230 🔴 november 2022
🔗 voicevox_core - - ⭐ 1k 🟢 today
🔗 akaza - - ⭐ 233 🟢 today
🔗 Jotoba - - ⭐ 197 🔴 january 2024
🔗 dvorakjp-romantable - - ⭐ 55 🟢 february
🔗 niinii - - ⭐ 15 🟢 february
🔗 cskk - - ⭐ 79 🟢 january
🔗 japanki - - ⭐ 3 🔴 october 2023
🔗 jpreprocess - - ⭐ 51 🟢 february
🔗 listup_precedent - - ⭐ 5 🔴 february 2025
🔗 jisho - - ⭐ 17 🟢 january
🔗 kanalizer - - ⭐ 26 🟡 november 2025
🔗 koharu - - ⭐ 635 🟢 last saturday
🔗 yomine - - ⭐ 47 🟢 last thursday
🔗 matsuba - - ⭐ 18 🔴 march 2023
🔗 hujiang_dictionary - - ⭐ 69 🟢 last tuesday

JavaScript

Morphology analysis

Japanese morphological analysis libraries for browser and Node.js

  • kuromoji.js - JavaScript implementation of Japanese morphological analyzer
  • rakutenma - Rakuten MA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript.
  • node-mecab-ya - Yet another mecab wrapper for nodejs
  • juman-bin - A user-extensible morphological analyzer for Japanese.
  • node-mecab-async - Asynchronous japanese morphological analyser using MeCab.
Name downloads/week total downloads stars last commit
🔗 kuromoji.js 📥 149k/week 📦 7.9M ⭐ 966 🔴 november 2018
🔗 rakutenma 📥 11/week 📦 879 ⭐ 473 🔴 january 2015
🔗 node-mecab-ya 📥 102/week 📦 7.2k ⭐ 110 🔴 repo not found
🔗 juman-bin 📥 6/week 📦 297 ⭐ 3 🔴 may 2017
🔗 node-mecab-async 📥 4.8k/week 📦 340k ⭐ 103 🔴 october 2017

Converter

Libraries for converting Japanese scripts and readings

  • kuroshiro - Japanese language library for converting Japanese sentence to Hiragana, Katakana or Romaji with furigana and okurigana modes supported.
  • kuroshiro-analyzer-kuromoji - Kuromoji morphological analyzer for kuroshiro.
  • hepburn - Node.js module for converting Japanese Hiragana and Katakana script to, and from, Romaji using Hepburn romanisation
  • japanese-numerals-to-number - Converts Japanese Numerals into number
  • jslingua - Javascript libraries to process text: Arabic, Japanese, etc.
  • WanaKana - Javascript library for detecting and transliterating Hiragana <--> Katakana <--> Romaji
  • node-romaji-name - Normalize and fix common issues with Romaji-based Japanese names.
  • kyujitai.js - Utility collections for making Japanese text old-fashioned
  • normalize-japanese-addresses - Open-source address normalization library.
  • jaconv - Japanese character conversion library (JavaScript)
  • romaji-conv - Convert romaji into hiragana
  • japanese-addresses-v2 - Nationwide address data API
  • jptext-to-emoji - Converts words in text to emoji
  • japanese.js - Util collection for Japanese text processing. Hiraganize, Katakanize, and Romanize.
Name downloads/week total downloads stars last commit
🔗 kuroshiro 📥 11k/week 📦 366k ⭐ 948 🔴 june 2021
🔗 kuroshiro-analyzer-kuromoji 📥 11k/week 📦 343k ⭐ 67 🔴 august 2018
🔗 hepburn 📥 126k/week 📦 3.1M ⭐ 137 🟡 september 2025
🔗 japanese-numerals-to-number 📥 38k/week 📦 2.2M ⭐ 59 🔴 february 2023
🔗 jslingua 📥 192/week 📦 7.4k ⭐ 52 🔴 october 2023
🔗 WanaKana 📥 33k/week 📦 2.1M ⭐ 901 🟡 september 2025
🔗 node-romaji-name 📥 68/week 📦 12k ⭐ 41 🔴 december 2023
🔗 kyujitai.js 📥 12/week 📦 1.1k ⭐ 23 🔴 august 2020
🔗 normalize-japanese-addresses - - ⭐ 945 🟡 july 2025
🔗 jaconv - - ⭐ 86 🟡 june 2025
🔗 romaji-conv - - ⭐ 26 🟢 december 2025
🔗 japanese-addresses-v2 - - ⭐ 69 🔴 january 2025
🔗 jptext-to-emoji - - ⭐ 2 🟢 february
🔗 japanese.js - - ⭐ 167 🔴 august 2020

Others

Other libraries for Japanese NLP in JavaScript

  • bangumi-data - Raw data for Japanese Anime
  • yomichan - Japanese pop-up dictionary extension for Chrome and Firefox.
  • proofreading-tool - GUI tool for textlinting (document proofreading).
  • kanjigrid - A web-app displaying the 2200 kanji characters taught in James Heisig's "Remembering the Kanji", 6th edition.
  • japanese-toolkit - Monorepo for Kanji, Furigana, Japanese DB, and others
  • analyze-desumasu-dearu - JavaScript library that analyzes whether sentences use the polite style (desu/masu) or the plain style (de aru)
  • hatsuon - Japanese pitch accent utils
  • sentiment_ja_js - Sentiment Analysis in Japanese. sentiment_ja with JavaScript
  • mecab-ipadic-seed - mecab-ipadic seed dictionary reader
  • Japanese-Word-Of-The-Day - Well, a different Japanese word everyday.
  • oskim - Extend GNOME On-Screen Keyboard for Input Methods
  • tweetMapping - Digital archive of geotagged tweets posted within 24 hours of the Great East Japan Earthquake.
  • pitch-accent - Predict pitch accent in Japanese
  • kana2ipa - Command that converts hiragana or katakana into the phonetic symbols (IPA) for their Japanese pronunciation
  • voicevox - Editor of VOICEVOX, a free, medium-quality text-to-speech software
  • kamiya-codec - Towards a Japanese verb conjugator and deconjugator based on Taeko Kamiya's The Handbook of Japanese Verbs and The Handbook of Japanese Adjectives and Adverbs opuses.
  • closewords - Library that searches a set of words for the most similar word, with support for Japanese (including kanji)
  • japanese-analyzer - Japanese Sentence Analyzer (日本語文章解析器)
  • japanese-furigana-normalize - Normalize Japanese Furigana
  • yama - acquire Japanese vocabulary on any website
  • kaitai - An application for analyzing Japanese sentence structure using AI. This tool visualizes how words and phrases relate to each other, showing grammatical relationships with interactive diagrams.
Name downloads/week total downloads stars last commit
🔗 bangumi-data 📥 1.1k/week 📦 60k ⭐ 592 🟢 february
🔗 yomichan - - ⭐ 1.1k 🔴 february 2023
🔗 proofreading-tool - - ⭐ 87 🟡 october 2025
🔗 kanjigrid - - ⭐ 44 🔴 november 2018
🔗 japanese-toolkit - - ⭐ 62 🔴 january 2023
🔗 analyze-desumasu-dearu 📥 83k/week 📦 4.8M ⭐ 18 🔴 january 2025
🔗 hatsuon 📥 21/week 📦 948 ⭐ 37 🔴 march 2022
🔗 sentiment_ja_js - - ⭐ 10 🔴 december 2021
🔗 mecab-ipadic-seed 📥 76/week 📦 5.6k ⭐ 8 🔴 july 2016
🔗 Japanese-Word-Of-The-Day 📥 5/week 📦 292 ⭐ repo not found 🔴 repo not found
🔗 oskim - - ⭐ 2 🔴 february 2023
🔗 tweetMapping - - ⭐ 25 🟢 today
🔗 pitch-accent 📥 1/week 📦 84 ⭐ 2 🔴 september 2023
🔗 kana2ipa - - ⭐ 17 🔴 october 2020
🔗 voicevox - - ⭐ 3k 🟢 today
🔗 kamiya-codec - - ⭐ 20 🟡 may 2025
🔗 closewords - - ⭐ 3 🟡 august 2025
🔗 japanese-analyzer - - ⭐ 683 🟢 december 2025
🔗 japanese-furigana-normalize - - ⭐ 6 🔴 july 2024
🔗 yama - - ⭐ 7 🟢 last wednesday
🔗 kaitai - - ⭐ 1 🟢 last saturday

Go

Morphology analysis

Lightweight Japanese morphological analysis libraries in Go

  • kagome - Self-contained Japanese Morphological Analyzer written in pure Go
Name downloads/week total downloads stars last commit
🔗 kagome - - ⭐ 943 🟢 january

Others

Additional Go-based Japanese text processing libraries

  • ojosama - Converts text into the ojou-sama (refined young lady) speaking style of Hyakumantenbara Salome
  • nihongo - Japanese Dictionary
  • yomichan-import - External dictionary importer for Yomichan.
  • imas-ime-dic - THE IDOLM@STER words dictionary for Japanese IME (by imas-db.jp)
  • go-kakasi - Kanji transliteration to hiragana/katakana/romaji, in Go
  • go-moji - A Go library for Zenkaku/Hankaku conversion
  • ojichat - Generates messages that a middle-aged man (ojisan) might send over LINE or email
  • name - Name Searcher in Japanese
Name downloads/week total downloads stars last commit
🔗 ojosama - - ⭐ 388 🟢 last tuesday
🔗 nihongo - - ⭐ 83 🔴 february 2024
🔗 yomichan-import - - ⭐ 85 🔴 february 2023
🔗 imas-ime-dic - - ⭐ 31 🟢 january
🔗 go-kakasi - - ⭐ 6 🟢 december 2025
🔗 go-moji - - ⭐ 20 🔴 april 2019
🔗 ojichat - - ⭐ 1.3k 🔴 october 2024
🔗 name - - ⭐ 11 🔴 january 2025

Java

Morphology analysis

Japanese morphological analysis and dictionary management libraries

  • kuromoji - Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
  • Sudachi - A Japanese Tokenizer for Business
  • SudachiDict - A lexicon for Sudachi
  • meval - MevAL, a performance evaluation system for morphological analyzers
Name downloads/week total downloads stars last commit
🔗 kuromoji - - ⭐ 1k 🔴 september 2019
🔗 Sudachi - - ⭐ 932 🔴 november 2024
🔗 SudachiDict - - ⭐ 277 🟢 january
🔗 meval - - ⭐ 7 🔴 august 2019

Others

Java libraries for Japanese NLP and OCR

  • kanjitomo-ocr - Java library for identifying Japanese characters from images
  • jakaroma - Java library and command-line tool to transliterate Japanese kanji to romaji (Latin alphabet)
  • kakasi-java - Kanji transliteration to hiragana/katakana/romaji, in Java
  • Kamite - A desktop language immersion companion for learners of Japanese
  • react-native-japanese-tokenizer - Async Japanese Tokenizer Native Plugin for React Native for iOS and Android
  • elasticsearch-analysis-japanese - Japanese analyzer for Elasticsearch that uses the kuromoji Japanese tokenizer
  • moji4j - A Java library to convert between Japanese Hiragana, Katakana, and Romaji scripts.
  • neologdn-java - Japanese text normalizer for mecab-neologd
  • elasticsearch-sudachi - The Japanese analysis plugin for elasticsearch
Name downloads/week total downloads stars last commit
🔗 kanjitomo-ocr - - ⭐ 203 🔴 may 2021
🔗 jakaroma - - ⭐ 67 🟡 june 2025
🔗 kakasi-java - - ⭐ 55 🔴 april 2016
🔗 Kamite - - ⭐ 129 🟡 march 2025
🔗 react-native-japanese-tokenizer - - ⭐ 38 🔴 june 2023
🔗 elasticsearch-analysis-japanese - - ⭐ 29 🔴 march 2012
🔗 moji4j - - ⭐ 33 🔴 june 2022
🔗 neologdn-java - - ⭐ 5 🟢 february
🔗 elasticsearch-sudachi - - ⭐ 218 🟢 january

Pretrained model

Word2Vec

Models that convert words into numeric vectors to capture semantic similarity

  • japanese-words-to-vectors - Word2vec (word to vectors) approach for Japanese language using Gensim and Mecab.
  • chiVe - Japanese word embedding with Sudachi and NWJC
  • elmo-japanese - elmo-japanese
  • embedrank - Python Implementation of EmbedRank
  • aovec - Easy Aozora Bunko Word2Vec builder: a Word2Vec builder for all Aozora Bunko books, plus prebuilt models
  • dependency-based-japanese-word-embeddings - Repository for the AI LAB article "Dependency-based Japanese Word Embeddings" (係り受けに基づく日本語単語埋込); article URL: https://ai-lab.lapras.com/nlp/japanese-word-embedding/
  • jawikivec - Yet Another Japanese-Wikipedia Entity Vectors
  • jawiki_word_vector_updater - Scripts that run MeCab morphological analysis on the latest Japanese Wikipedia dump with both the IPA dictionary and the latest NEologd dictionary, then train word2vec, fastText, and GloVe word embeddings on the results
Name downloads/week total downloads stars last commit
🔗 japanese-words-to-vectors - - ⭐ 87 🔴 august 2020
🔗 chiVe - - ⭐ 169 🔴 march 2024
🔗 elmo-japanese - - ⭐ 5 🔴 october 2019
🔗 embedrank - - ⭐ 48 🔴 march 2019
🔗 aovec 📥 425 📦 81k ⭐ 3 🔴 january 2023
🔗 dependency-based-japanese-word-embeddings - - ⭐ 8 🔴 august 2019
🔗 jawikivec - - ⭐ 2 🔴 november 2018
🔗 jawiki_word_vector_updater - - ⭐ 11 🔴 may 2020

Transformer based models

Models that use self-attention to understand context and perform advanced language tasks

  • bert-japanese - BERT models for Japanese text.
  • japanese-pretrained-models - Code for producing Japanese pretrained models provided by rinna Co., Ltd.
  • bert-japanese - BERT with SentencePiece for Japanese text.
  • SudachiTra - Japanese tokenizer for Transformers
  • japanese-dialog-transformers - Code for evaluating Japanese pretrained models provided by NTT Ltd.
  • shiba - PyTorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
  • Dialog - A PyTorch implementation of a Japanese chatbot using BERT and the Transformer decoder
  • language-pretraining - BERT and ELECTRA models of PyTorch implementations for Japanese text.
  • medbertjp - Trials of pre-trained BERT models for the medical domain in Japanese.
  • ILYS-aoba-chatbot - ILYS-aoba-chatbot
  • t5-japanese - Codes to pre-train Japanese T5 models
  • pytorch_bert_japanese - Use Japanese pretrained BERT models with PyTorch
  • Laboro-BERT-Japanese - Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus
  • RoBERTa-japanese - Japanese BERT Pretrained Model
  • aMLP-japanese - aMLP Transformer Model for Japanese
  • bert-japanese-aozora - Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy
  • sbert-ja - Code to train Sentence BERT Japanese model for Hugging Face Model Hub
  • BERT-Japan-vaccination - Official fine-tuning code for "Emotion Analysis of Japanese Tweets and Comparison to Vaccinations in Japan"
  • gpt2-japanese - Japanese GPT2 Generation Model
  • text2text-japanese - gpt-2 based text2text conversion model
  • gpt-ja - GPT-2 Japanese model for HuggingFace's transformers
  • friendly_JA-Model - MT model trained using the friendly_JA Corpus attempting to make Japanese easier/more accessible to occidental people by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
  • albert-japanese - BERT with SentencePiece for Japanese text.
  • ja_text_bert - Repository for building a pre-trained BERT model from the Japanese Wikipedia corpus
  • DistilBERT-base-jp - A Japanese DistilBERT pretrained model, which was trained on Wikipedia.
  • bert - This repository provides snippets to use RoBERTa pre-trained on a Japanese corpus. Our dataset consists of Japanese Wikipedia and web-crawled articles, 25GB in total. The released model is built based on that from HuggingFace.
  • Laboro-DistilBERT-Japanese - Laboro DistilBERT Japanese
  • luke - LUKE -- Language Understanding with Knowledge-based Embeddings
  • GPTSAN - General-purpose Switch Transformer-based Japanese language model
  • japanese-clip - Japanese CLIP by rinna Co., Ltd.
  • AcademicBART - We pretrained a BART-based Japanese masked language model on paper abstracts from the academic database CiNii Articles
  • AcademicRoBERTa - We pretrained a RoBERTa-based Japanese masked language model on paper abstracts from the academic database CiNii Articles.
  • LINE-DistilBERT-Japanese - DistilBERT model pre-trained on 131 GB of Japanese web text. The teacher model is a BERT-base model built in-house at LINE.
  • Japanese-Alpaca-LoRA - Links to Low-Rank Adapters created by fine-tuning LLaMA on the Stanford Alpaca dataset translated into Japanese, plus sample generation code
  • albert-japanese-tinysegmenter - Pretrained models, code, and guidance for pretraining the official ALBERT (https://github.com/google-research/albert) on Japanese Wikipedia resources
  • japanese-llama-experiment - Japanese LLaMa experiment
  • easylightchatassistant - EasyLightChatAssistant is an environment for easily trying out LightChatAssistant, a lightweight, uncensored and unrestricted local Japanese model, with KoboldCpp.
Name downloads/week total downloads stars last commit
🔗 bert-japanese - - ⭐ 543 🔴 march 2024
🔗 japanese-pretrained-models - - ⭐ repo not found 🔴 repo not found
🔗 bert-japanese - - ⭐ 498 🔴 february 2021
🔗 SudachiTra 📥 331 📦 159k ⭐ 79 🔴 december 2023
🔗 japanese-dialog-transformers - - ⭐ 245 🔴 june 2023
🔗 shiba 📥 39 📦 7k ⭐ 89 🔴 november 2023
🔗 Dialog - - ⭐ 73 🔴 october 2020
🔗 language-pretraining - - ⭐ 50 🔴 may 2023
🔗 medbertjp - - ⭐ 12 🔴 november 2020
🔗 ILYS-aoba-chatbot - - ⭐ 23 🔴 october 2021
🔗 t5-japanese - - ⭐ 40 🔴 september 2021
🔗 pytorch_bert_japanese - - ⭐ 35 🔴 june 2019
🔗 Laboro-BERT-Japanese - - ⭐ 73 🔴 may 2022
🔗 RoBERTa-japanese - - ⭐ 23 🔴 november 2021
🔗 aMLP-japanese - - ⭐ 16 🔴 may 2022
🔗 bert-japanese-aozora - - ⭐ 40 🔴 august 2020
🔗 sbert-ja - - ⭐ 11 🔴 august 2021
🔗 BERT-Japan-vaccination - - ⭐ 7 🔴 may 2022
🔗 gpt2-japanese - - ⭐ 325 🔴 september 2023
🔗 text2text-japanese - - ⭐ 33 🔴 july 2021
🔗 gpt-ja - - ⭐ 3 🔴 september 2021
🔗 friendly_JA-Model - - ⭐ 1 🔴 may 2022
🔗 albert-japanese - - ⭐ 33 🔴 october 2021
🔗 ja_text_bert - - ⭐ 115 🔴 november 2018
🔗 DistilBERT-base-jp - - ⭐ 161 🔴 april 2020
🔗 bert - - ⭐ 28 🔴 april 2022
🔗 Laboro-DistilBERT-Japanese - - ⭐ 16 🔴 december 2020
🔗 luke - - ⭐ 726 🔴 june 2023
🔗 GPTSAN - - ⭐ 118 🔴 september 2023
🔗 japanese-clip - - ⭐ repo not found 🔴 repo not found
🔗 AcademicBART - - ⭐ 2 🔴 july 2024
🔗 AcademicRoBERTa - - ⭐ 9 🔴 september 2024
🔗 LINE-DistilBERT-Japanese - - ⭐ 46 🔴 march 2023
🔗 Japanese-Alpaca-LoRA - - ⭐ 141 🔴 april 2023
🔗 albert-japanese-tinysegmenter - - ⭐ 13 🔴 september 2023
🔗 japanese-llama-experiment - - ⭐ 54 🟢 december 2025
🔗 easylightchatassistant - - ⭐ 39 🔴 april 2024

ChatGPT

Resources for using ChatGPT and APIs for Japanese dialogue and text generation

  • VRChatGPT - Program that lets you chat in VRChat using ChatGPT.
  • AITuberDegikkoMirii - Developing the foundational components of an AITuber
  • wanna - Shell command launcher with natural language
  • ChatdollKit - ChatdollKit enables you to make your 3D model into a chatbot
  • ChuanhuChatGPTJapanese - GUI for ChatGPT API For Japanese
  • AISisterAIChan - The Ukagaka ghost "AI Sister Ai-chan", powered by ChatGPT 3.5. A separate ChatGPT API key is required to use it.
  • vrchatbot - Repository for building an AI bot in VRChat
  • gptuber-by-langchain - GPT acts as a YouTuber
  • openai-chatfriend - A chatbox application built with Nuxt 3 and powered by the OpenAI text completion endpoint. You can select different personalities for your AI friend; by default it responds in Japanese. You can use this app to practice your Nihongo skills!
  • chrome-ext-translate-to-hiragana-with-chatgpt - This Chrome extension can translate selected Japanese text to Hiragana by using ChatGPT.
  • azure-search-openai-demo - This sample demonstrates several approaches for creating ChatGPT-like experiences over your own data using the Retrieval Augmented Generation pattern.
  • chatvrm - ChatVRM is a demo application that lets you easily talk with a 3D character in the browser.
  • sftly-replace - A Chrome extension to replace the selected text softly
  • summarize_arxv - Summarize arXiv paper with figures
  • aiavatarkit - Building AI-based conversational avatars lightning fast
  • pva-aoai-integration-solution - This repository packages and publishes, as a solution, the flows and other assets created for the trial use of ChatGPT at Kobe City Hall.
  • jp-azureopenai-samples - Provides free sample apps (reference architectures, sample code, and deployment steps) intended as references for implementing applications with Azure OpenAI.
  • character_chat - Chat script that uses the OpenAI API to converse in Japanese with a character you configure.
  • chatgpt-slackbot - Slackbot script for using OpenAI's ChatGPT API in Slack (assumes use in Japanese)
  • chatgpt-prompt-sample-japanese - Sample prompts for ChatGPT.
  • kanji-flashcard-app-gpt4 - A Japanese Kanji Flashcard App built using Python and Langchain, enhanced with the intelligence of GPT-4.
  • IgakuQA - Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations
  • japagen - A study of pseudo training data generation with LLMs for Japanese tasks
  • generativeai-prompt-sample-japanese - Sample Japanese prompts for generative AI tools such as ChatGPT and Copilot
Name downloads/week total downloads stars last commit
🔗 VRChatGPT - - ⭐ 15 🔴 march 2023
🔗 AITuberDegikkoMirii - - ⭐ 5 🔴 march 2023
🔗 wanna 📥 35 📦 20k ⭐ 141 🔴 april 2023
🔗 ChatdollKit - - ⭐ 1.1k 🟢 last saturday
🔗 ChuanhuChatGPTJapanese - - ⭐ 1 🔴 march 2023
🔗 AISisterAIChan - - ⭐ 27 🔴 may 2023
🔗 vrchatbot - - ⭐ 28 🔴 december 2022
🔗 gptuber-by-langchain - - ⭐ 63 🔴 january 2023
🔗 openai-chatfriend - - ⭐ 16 🔴 april 2023
🔗 chrome-ext-translate-to-hiragana-with-chatgpt - - ⭐ 1 🔴 april 2023
🔗 azure-search-openai-demo - - ⭐ 46 🔴 december 2023
🔗 chatvrm - - ⭐ 818 🟡 may 2025
🔗 sftly-replace - - ⭐ 4 🔴 may 2023
🔗 summarize_arxv - - ⭐ 173 🔴 may 2023
🔗 aiavatarkit - - ⭐ 532 🟢 last wednesday
🔗 pva-aoai-integration-solution - - ⭐ repo not found 🔴 repo not found
🔗 jp-azureopenai-samples - - ⭐ 280 🟡 september 2025
🔗 character_chat - - ⭐ 16 🔴 june 2023
🔗 chatgpt-slackbot - - ⭐ 64 🔴 july 2024
🔗 chatgpt-prompt-sample-japanese - - ⭐ 420 🟢 today
🔗 kanji-flashcard-app-gpt4 - - ⭐ 6 🔴 october 2023
🔗 IgakuQA - - ⭐ 48 🔴 march 2023
🔗 japagen - - ⭐ 1 🔴 october 2024
🔗 generativeai-prompt-sample-japanese - - ⭐ 420 🟢 today

Dictionary and IME

Resources for Japanese dictionaries and input method editors (IME)

  • mecab-ipadic-neologd - Neologism dictionary based on the language resources on the Web for mecab-ipadic
  • tdmelodic - A Japanese accent dictionary generator
  • jamdict - Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings
  • unidic-py - Unidic packaged for installation via pip.
  • Japanese-Company-Lexicon - Japanese Company Lexicon (JCLdic)
  • manbyo-sudachi - Manbyo Dictionary (a disease-name dictionary) for Sudachi
  • jawiki-kana-kanji-dict - Generate SKK/MeCab dictionary from Wikipedia(Japanese edition)
  • JIWC-Dictionary - dictionary to find emotion related to text
  • JumanDIC - This repository contains source dictionary files to build dictionaries for JUMAN and Juman++.
  • ipadic-py - IPAdic packaged for easy use from Python.
  • unidic-lite - A small version of UniDic for easy pip installs.
  • emoji-ime-dictionary - Add-on IME dictionary for typing emoji in Japanese; enables Japanese-to-emoji conversion in Google Japanese Input and other IMEs
  • google-ime-dictionary - Add-on IME dictionary for Japanese-to-English conversion and English abbreviation expansion in Google Japanese Input, ATOK, and other IMEs
  • dic-nico-intersection-pixiv - IME dictionary built from the entries common to Niconico Daihyakka (Nicopedia) and Pixiv Encyclopedia
  • google-ime-user-dictionary-ja-en - Project archive of a Google IME user dictionary that maps katakana words (Japanese loanwords) to English.
  • emoticon - Emoticon (kaomoji) dictionary for Google Japanese Input ∩(,,Ò‿Ó,,)∩
  • mecab-mozcdic - The open-source Mozc dictionary converted to the MeCab dictionary format.
  • denonbu-ime-dic - Denon IME: a dictionary of terms related to Denonbu, intended for use with Microsoft IME and similar IMEs
  • nijisanji-ime-dic - Dictionary of terms related to Nijisanji, intended for use with Microsoft IME and similar IMEs.
  • pokemon-ime-dic - Dictionary covering the names of all currently known Pokémon, intended for use with Microsoft IME and similar IMEs.
  • EJDict - English-Japanese Dictionary data (Public Domain) EJDict-hand
  • Ayashiy-Nipongo-Dic - "With this dictionary you can use correct Japanese." (The original description is written in deliberately broken Japanese.)
  • genshin-dict - Genshin Impact word dictionary usable on Windows and macOS
  • jmdict-simplified - JMdict and JMnedict in JSON format
  • mozcdict-ext - Convert external words into Mozc system dictionary
  • mh-dict-jp - An attempt to build a user dictionary for Monster Hunter
  • jitenbot - Convert data from Japanese dictionary websites and applications into portable file formats
  • mecab-unidic-neologd - Neologism dictionary based on the language resources on the Web for mecab-unidic
  • hololive-dictionary - Dictionary files for hololive (hololive production). Use the text files in the ./dictionary folder to add words to your IME. See README.md for details.
  • jmdict-yomitan - JMdict, JMnedict, KANJIDIC for Yomitan/Yomichan.
  • yomichan-jlpt-vocab - JLPT level tags for words in Yomichan
  • Jitendex - A free and openly licensed Japanese-to-English dictionary compatible with multiple dictionary clients
  • jiten - Japanese Android/CLI/web dictionary based on JMdict/KANJIDIC (Japanese dictionary, Japanese-English dictionary, kanji-English dictionary, Japanese-German dictionary, Japanese-Dutch dictionary)
  • pixiv-yomitan - Pixiv Encyclopedia Dictionary for Yomitan
  • uchinaaguchi_dict - Uchinaaguchi dictionary (Okinawan language dictionary)
  • yomitan-dictionaries - Japanese and Chinese dictionaries for Yomitan.
  • mouse_over_dictionary - General-purpose dictionary tool that automatically reads the word under the mouse cursor
  • jisyo - A new dictionary format for the SKK kana-kanji conversion engine
  • skk-jisyo.emoji-ja - SKK dictionary for converting Japanese readings to emoji 😂
  • anthy - Anthy is a kana-kanji conversion engine for Japanese. It converts roma-ji to kana, and the kana text to a mixed kana and kanji.
  • aws_dic_for_google_ime - Dictionary of AWS service names for Google Japanese Input
  • cl-skkserv - SKK dictionary server written in Common Lisp, with extensions
  • anthy - Anthy maintenance
  • anthy-unicode - Anthy Unicode - Another Anthy
  • azooKey - azooKey: A Japanese Keyboard iOS Application Fully Developed in Swift
  • azookey-desktop - Japanese Input Method "azooKey" for Desktop, supporting macOS
  • fcitx5-hazkey - Japanese input method for fcitx5, powered by azooKey engine
  • mozcdic-ut-place-names - Mozc UT Place Name Dictionary is a dictionary converted from the Japan Post's ZIP code data for Mozc.
  • azookeykanakanjiconverter - Kana-Kanji Conversion Module written in Swift
  • libkkc - Japanese Kana Kanji conversion input method library
  • libskk - Japanese SKK input method library
  • kanayomi-dict - User dictionary in OpenJTalk format
  • cjkvi-dict - Dictionary-related data for the kanji database
  • wlsp-classical - Bunrui Goihyo (Word List by Semantic Principles) data for classical Japanese
  • kanji-dict - Kanji dictionary for looking up stroke order, readings, stroke counts, radicals, usage examples, and origins. Covers all 98,682 kanji in Unicode 15.1.
  • Kaomoji_proj - (๑ ᴖ ᴑ ᴖ ๑) Myon Kaomoji (formerly Kaomoji_proj) is a project to create an emoticon dictionary for Microsoft IME, Microsoft's input software.
  • kotlin-kana-kanji-converter - Kana-kanji conversion program in Kotlin
  • alfred-japanese-dictionary - Japanese-English Dictionary using jisho.org with audio, csv export of entries, and preview of dictionary sites.
  • ichiran - Linguistic tools for texts in Japanese language
  • mikan - A Japanese input method.
  • colloquial-kansai-dictionary - A quick reference for the material taught in Colloquial Kansai Japanese.
  • jisho-open - Web frontend for the JMdict Japanese-English dictionary project, with study list support!
  • macskk - Yet Another macOS SKK Input Method
  • nandoku - Dictionary of hard-to-read kanji organized by school grade.
  • japanese_android_ime - A FOSS Japanese IME for Android
  • anthywl - Japanese input method for Sway using libanthy
  • sekka - Yet another Japanese Input Method inspired by SKK.
  • sumibi - Japanese input method powered by ChatGPT API
  • jinmei-dict - Extracts only personal names from dictionary data and formats them as JSON keyed by reading (katakana), with each key holding a list of candidate written forms.
  • japanesekeyboard - Sumire: a fully offline Japanese keyboard app
  • japanesearabic - Japanese-Arabic dictionary (Yomitan)
  • o-dic - Okinawa dictionary
  • skk-emoji-jisyo - SKK emoji dictionary
  • mozcdic-ut-personal-names - A personal name dictionary for Mozc.
  • mozcdic-ut-sudachidict - A dictionary converted from SudachiDict for Mozc.
  • nihongo - japanese language data and dictionary
  • kagome-dict - Dictionary Library for Kagome v2
  • canna - Canna Japanese input system
  • kansai-accent-dictionary - Keihan-type (Kansai dialect) accent dictionary: a Japanese dialect accent dictionary containing 4,615 words
Name downloads/week total downloads stars last commit
🔗 mecab-ipadic-neologd - - ⭐ 2.8k 🔴 september 2020
🔗 tdmelodic - - ⭐ 123 🔴 march 2024
🔗 jamdict 📥 469 📦 51k ⭐ 164 🔴 june 2021
🔗 unidic-py 📥 86k 📦 9M ⭐ 108 🔴 february 2025
🔗 Japanese-Company-Lexicon - - ⭐ 100 🔴 january 2023
🔗 manbyo-sudachi - - ⭐ 7 🔴 april 2021
🔗 jawiki-kana-kanji-dict - - ⭐ 59 🟢 february
🔗 JIWC-Dictionary - - ⭐ 40 🔴 january 2021
🔗 JumanDIC - - ⭐ 4 🔴 august 2022
🔗 ipadic-py 📥 89k 📦 6M ⭐ 24 🔴 october 2021
🔗 unidic-lite 📥 86k 📦 9M ⭐ 49 🔴 september 2020
🔗 emoji-ime-dictionary - - ⭐ 365 🔴 january 2023
🔗 google-ime-dictionary - - ⭐ 101 🔴 january 2023
🔗 dic-nico-intersection-pixiv - - ⭐ 83 🔴 september 2024
🔗 google-ime-user-dictionary-ja-en - - ⭐ 58 🔴 december 2016
🔗 emoticon - - ⭐ 43 🔴 may 2020
🔗 mecab-mozcdic - - ⭐ 10 🔴 january 2018
🔗 denonbu-ime-dic - - ⭐ 2 🔴 november 2022
🔗 nijisanji-ime-dic - - ⭐ 35 🟢 february
🔗 pokemon-ime-dic - - ⭐ 0 🔴 january 2020
🔗 EJDict - - ⭐ 241 🟡 november 2025
🔗 Ayashiy-Nipongo-Dic - - ⭐ 26 🔴 may 2024
🔗 genshin-dict - - ⭐ 126 🟢 february
🔗 jmdict-simplified - - ⭐ 329 🟢 today
🔗 mozcdict-ext - - ⭐ 68 🟡 september 2025
🔗 mh-dict-jp - - ⭐ 5 🟡 april 2025
🔗 jitenbot - - ⭐ repo not found 🔴 repo not found
🔗 mecab-unidic-neologd - - ⭐ 87 🔴 september 2020
🔗 hololive-dictionary - - ⭐ 24 🔴 december 2024
🔗 jmdict-yomitan - - ⭐ 237 🟢 today
🔗 yomichan-jlpt-vocab - - ⭐ 119 🟡 august 2025
🔗 Jitendex - - ⭐ 446 🟢 today
🔗 jiten - - ⭐ 125 🔴 december 2023
🔗 pixiv-yomitan - - ⭐ 49 🟢 january
🔗 uchinaaguchi_dict - - ⭐ 4 🟢 february
🔗 yomitan-dictionaries - - ⭐ 712 🟡 august 2025
🔗 mouse_over_dictionary - - ⭐ 72 🔴 january 2020
🔗 jisyo - - ⭐ 28 🔴 september 2023
🔗 skk-jisyo.emoji-ja - - ⭐ 30 🔴 march 2018
🔗 aws_dic_for_google_ime - - ⭐ 7 🔴 november 2019
🔗 cl-skkserv - - ⭐ 31 🔴 october 2024
🔗 anthy - - ⭐ 3 🔴 july 2013
🔗 anthy-unicode - - ⭐ 41 🟢 last friday
🔗 azooKey - - ⭐ 633 🟢 yesterday
🔗 azookey-desktop - - ⭐ 790 🟢 yesterday
🔗 fcitx5-hazkey - - ⭐ 160 🟢 last friday
🔗 mozcdic-ut-place-names - - ⭐ 20 🟢 february
🔗 azookeykanakanjiconverter - - ⭐ 135 🟢 last saturday
🔗 libkkc - - ⭐ 112 🔴 august 2024
🔗 libskk - - ⭐ 98 🟢 last friday
🔗 kanayomi-dict - - ⭐ repo not found 🔴 repo not found
🔗 cjkvi-dict - - ⭐ 108 🔴 september 2017
🔗 wlsp-classical - - ⭐ 2 🟡 november 2025
🔗 kanji-dict - - ⭐ 6 🟢 february
🔗 Kaomoji_proj - - ⭐ 11 🟡 october 2025
🔗 kotlin-kana-kanji-converter - - ⭐ 5 🟡 november 2025
🔗 alfred-japanese-dictionary - - ⭐ 6 🟢 february
🔗 ichiran - - ⭐ 386 🟢 january
🔗 mikan - - ⭐ 24 🟡 june 2025
🔗 colloquial-kansai-dictionary - - ⭐ 9 🟢 last saturday
🔗 jisho-open - - ⭐ 57 🟡 june 2025
🔗 macskk - - ⭐ 258 🟢 february
🔗 nandoku - - ⭐ 1 🟢 february
🔗 japanese_android_ime - - ⭐ 2 🟡 september 2025
🔗 anthywl - - ⭐ 34 🟡 april 2025
🔗 sekka - - ⭐ 24 🟡 july 2025
🔗 sumibi - - ⭐ 36 🟢 last saturday
🔗 jinmei-dict - - ⭐ 6 🔴 april 2020
🔗 japanesekeyboard - - ⭐ 202 🟢 last wednesday
🔗 japanesearabic - - ⭐ 17 🟡 may 2025
🔗 o-dic - - ⭐ 5 🟡 march 2025
🔗 skk-emoji-jisyo - - ⭐ 140 🔴 january 2025
🔗 mozcdic-ut-personal-names - - ⭐ 24 🟢 february
🔗 mozcdic-ut-sudachidict - - ⭐ 20 🟢 february
🔗 nihongo - - ⭐ 19 🔴 january 2025
🔗 kagome-dict - - ⭐ 15 🟢 last tuesday
🔗 canna - - ⭐ 3 🟡 august 2025
🔗 kansai-accent-dictionary - - ⭐ 1 🟢 december 2025

Corpus

Part-of-speech tagging / Named entity recognition

Corpora annotated with part-of-speech tags and named entities

  • ner-wikipedia-dataset - Japanese named entity recognition dataset built from Wikipedia
  • IOB2Corpus - Japanese IOB2 tagged corpus for Named Entity Recognition.
  • TwitterCorpus - Tokyo Metropolitan University (TMU) Japanese Twitter corpus
  • UD_Japanese-PUD - Parallel Universal Dependencies.
  • UD_Japanese-GSD - Japanese data from the Google UDT 2.0.
  • KWDLC - Kyoto University Web Document Leads Corpus
  • AnnotatedFKCCorpus - Annotated Fuman Kaitori Center Corpus
  • UD_Japanese-GSDLUW - Long-unit-word version of UD_Japanese-GSD
  • ud_japanese-bccwj - This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation.
Name downloads/week total downloads stars last commit
🔗 ner-wikipedia-dataset - - ⭐ 142 🔴 september 2023
🔗 IOB2Corpus - - ⭐ 61 🔴 february 2020
🔗 TwitterCorpus - - ⭐ 21 🔴 march 2016
🔗 UD_Japanese-PUD - - ⭐ 0 🔴 may 2020
🔗 UD_Japanese-GSD - - ⭐ 28 🔴 may 2022
🔗 KWDLC - - ⭐ 83 🔴 december 2023
🔗 AnnotatedFKCCorpus - - ⭐ 18 🔴 december 2023
🔗 anthy - - ⭐ 15 🔴 february 2023
🔗 UD_Japanese-GSDLUW - - ⭐ 3 🟡 november 2025
🔗 ud_japanese-bccwj - - ⭐ 26 🟡 november 2025

Parallel corpus

Bilingual corpora containing aligned sentences for translation tasks

  • small_parallel_enja - 50k English-Japanese Parallel Corpus for Machine Translation Benchmark.
  • Web-Crawled-Corpus-for-Japanese-Chinese-NMT - A Web Crawled Corpus for Japanese-Chinese NMT
  • CourseraParallelCorpusMining - Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
  • JESC - A large parallel corpus of English and Japanese
  • AMI-Meeting-Parallel-Corpus - AMI Meeting Parallel Corpus
  • giant_ja-en_parallel_corpus - This directory includes a giant Japanese-English subtitle corpus. The raw data comes from Stanford's JESC project.
  • jesc_small - Small Japanese-English Subtitle Corpus
  • graded-enja-corpus - Japanese-English parallel corpus that takes prohibited terms and word difficulty levels into account.
  • cjk-compsci-terms - CJK computer science terms comparison
  • Laboro-ParaCorpus - Scripts for creating a Japanese-English parallel corpus and training NMT models
  • google-vs-deepl-je - google-vs-deepl-je
  • matcha - Dataset for Japanese text simplification built from articles on MATCHA, a media outlet for tourists visiting Japan.
  • en-ja-el - EnJaEL: En-Ja Parallel Entity Linking Dataset (Version 1.0)
Name downloads/week total downloads stars last commit
🔗 small_parallel_enja - - ⭐ 98 🔴 september 2019
🔗 Web-Crawled-Corpus-for-Japanese-Chinese-NMT - - ⭐ 15 🔴 september 2023
🔗 CourseraParallelCorpusMining - - ⭐ 15 🔴 august 2024
🔗 JESC - - ⭐ 87 🔴 november 2017
🔗 AMI-Meeting-Parallel-Corpus - - ⭐ 11 🔴 december 2020
🔗 giant_ja-en_parallel_corpus - - ⭐ 5 🔴 august 2019
🔗 jesc_small - - ⭐ 3 🔴 july 2019
🔗 graded-enja-corpus - - ⭐ 6 🟡 august 2025
🔗 cjk-compsci-terms - - ⭐ 145 🟢 january
🔗 Laboro-ParaCorpus - - ⭐ 18 🔴 november 2021
🔗 google-vs-deepl-je - - ⭐ 4 🔴 march 2020
🔗 matcha - - ⭐ 6 🔴 january 2025
🔗 en-ja-el - - ⭐ 2 🔴 january 2025

Dialog corpus

Collections of conversation data for training dialogue systems

  • JMRD - Japanese Movie Recommendation Dialogue dataset
  • open2ch-dialogue-corpus - Dialogue corpus created by crawling Open2ch
  • BSD - The Business Scene Dialogue corpus
  • asdc - Accommodation Search Dialog Corpus (宿泊施設探索対話コーパス)
  • japanese-corpus - Japanese dialogue data for seq2seq etc.
  • BPersona-chat - This repository contains the Japanese-English bilingual chat corpus BPersona-chat published in the paper "Chat Translation Error Detection for Assisting Cross-lingual Communications" at AACL-IJCNLP 2022's Workshop Eval4NLP 2022.
  • japanese-daily-dialogue - Japanese Daily Dialogue, or 日本語日常対話コーパス in Japanese, is a high-quality multi-turn dialogue dataset containing daily conversations on five topics: daily life, school, travel, health, and entertainment.
  • llm-japanese-dataset - Japanese chat dataset for building LLMs
  • kokorochat - Japanese counseling dialogue dataset collected through role-play
Name | downloads/week | total downloads | stars | last commit
🔗 JMRD | - | - | ⭐ 28 | 🔴 july 2022
🔗 open2ch-dialogue-corpus | - | - | ⭐ 98 | 🔴 june 2021
🔗 BSD | - | - | ⭐ 73 | 🔴 november 2021
🔗 asdc | - | - | ⭐ 25 | 🔴 august 2023
🔗 japanese-corpus | - | - | ⭐ 3 | 🔴 october 2018
🔗 BPersona-chat | - | - | ⭐ 5 | 🔴 january 2023
🔗 japanese-daily-dialogue | - | - | ⭐ 54 | 🔴 march 2023
🔗 llm-japanese-dataset | - | - | ⭐ 87 | 🔴 january 2024
🔗 kokorochat | - | - | ⭐ 16 | 🟡 august 2025

Others

Corpora for tasks such as question answering or entailment recognition

  • jrte-corpus - Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
  • kanji-data - A JSON kanji dataset with updated JLPT levels and WaniKani information
  • JapaneseWordSimilarityDataset - Japanese Word Similarity Dataset
  • simple-jppdb - A paraphrase database for Japanese text simplification
  • chABSA-dataset - chakki's Aspect-Based Sentiment Analysis dataset
  • JaQuAD - JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension (2022, Skelter Labs)
  • JaNLI - Japanese Adversarial Natural Language Inference Dataset
  • ebe-dataset - Evidence-based Explanation Dataset (AACL-IJCNLP 2020)
  • emoji-ja - Dictionary of Japanese readings, keywords, and categories for Unicode emoji
  • nayose-wikipedia-ja - A Japanese entity resolution (nayose) dataset built from Wikipedia
  • ja.text8 - Japanese text8 corpus for word embedding.
  • ThreeLineSummaryDataset - Three-line summarization dataset
  • japanese - This repo contains a list of the 44,998 most common Japanese words in order of frequency, as determined by the University of Leeds Corpus.
  • kanji-frequency - Kanji usage frequency data collected from various sources
  • TEDxJP-10K - TEDxJP-10K ASR Evaluation Dataset
  • CoARiJ - Corpus of Annual Reports in Japan
  • technological-book-corpus-ja - Raw corpus and tools built by collecting technical books written in Japanese
  • ita-corpus-chuwa - Chunked word annotation for ITA corpus
  • wikipedia-utils - Utility scripts for preprocessing Wikipedia texts for NLP
  • inappropriate-words-ja - A collection of inappropriate expressions in Japanese, intended for uses such as data cleaning in NLP.
  • house-of-councillors - Data on parliamentary groups, members, bills, and written questions, compiled from the official website of the House of Councillors.
  • house-of-representatives - Diet bill database: House of Representatives
  • STAIR-captions - STAIR captions: large-scale Japanese image caption dataset
  • Winograd-Schema-Challenge-Ja - Japanese Translation of Winograd Schema Challenge
  • speechBSD - An extension of the BSD corpus with audio and speaker attribute information
  • ita-corpus - Sentence list of the ITA corpus
  • rohan4600 - Mora-balanced Japanese corpus
  • anlp-jp-history - Complete list of presentations at the annual meetings of the Association for Natural Language Processing (言語処理学会), including a machine-readable version
  • keigo_transfer_task - Evaluation dataset for the honorific (keigo) conversion task
  • loanwords_gairaigo - English loanwords in Japanese
  • jawikicorpus - Japanese-Wikipedia Wikification Corpus
  • GeneralPolicySpeechOfPrimeMinisterOfJapan - A Japanese text corpus of the general policy speeches of the Prime Minister of Japan
  • wrime - WRIME: a dataset for subjective and objective emotion analysis
  • jtubespeech - JTubeSpeech: Corpus of Japanese speech collected from YouTube
  • WikipediaWordFrequencyList - List of frequent words used in Japanese Wikipedia
  • kokkosho_data - Dataset of vehicle defect information
  • pdmocrdataset-part1 - OCR training dataset created in the project to convert digitized materials to text via OCR
  • huriganacorpus-ndlbib - Furigana dataset built from Japanese National Bibliography data
  • jvs_hiho - Self-made labels for the JVS (Japanese versatile speech) corpus
  • hirakanadic - Allows Sudachi to normalize from hiragana to katakana from any compound word list
  • animedb - Database listing anime works spanning about 100 years
  • security_words - Japanese-English correspondence of names of public organizations related to cybersecurity
  • Data-on-Japanese-Diet-Members - Data on members of the Japanese Diet
  • honkoku-data - Transcription texts created on Minna de Honkoku (https://honkoku.org), a crowdsourced transcription platform for historical Japanese documents.
  • wikihow_japanese - wikiHow dataset (Japanese version)
  • engineer-vocabulary-list - Engineer Vocabulary List in Japanese/English
  • JSICK - Japanese Sentences Involving Compositional Knowledge (JSICK) Dataset/JSICK-stress Test Set
  • phishurl-list - Phishing URL dataset from JPCERT/CC
  • jcms - A Japanese Corpus of Many Specialized Domains (JCMS)
  • aozorabunko_text - text-only archives of www.aozora.gr.jp
  • friendly_JA-Corpus - friendly_JA is a parallel Japanese-to-Japanese corpus aimed at making Japanese easier by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
  • topokanji - Topologically ordered lists of kanji for effective learning
  • isbn4groups - Data on Japanese-language publications in ISBN-13 (978-4-XXXXXXXXX), etc.
  • NMeCab - NMeCab: About Japanese morphological analyzer on .NET
  • ndlngramdata - Dataset of n-gram frequency statistics for OCR text data created from digitized materials
  • ndlngramviewer_v2 - Source code and related files for NDL Ngram Viewer, relaunched in January 2023
  • data_set - Datasets related to law and court precedents
  • huggingface-datasets_wrime - WRIME for huggingface datasets
  • ndl-minhon-ocrdataset - NDL OCR training dataset for classical Japanese books (processed Minna de Honkoku data)
  • PAX_SAPIENTICA - GIS & Archaeological Simulator. 2023 in development.
  • j-liwc2015 - Japanese version of LIWC2015
  • huggingface-datasets_livedoor-news-corpus - Japanese Livedoor news corpus for huggingface datasets
  • huggingface-datasets_JGLUE - JGLUE: Japanese General Language Understanding Evaluation for huggingface datasets
  • commonsense-moral-ja - JCommonsenseMorality is a dataset created through crowdsourcing that reflects the commonsense morality of Japanese annotators.
  • comet-atomic-ja - COMET-ATOMIC ja
  • dcsg-ja - Dialogue Commonsense Graph in Japanese
  • japanese-toxic-dataset - "Proposal and Evaluation of Japanese Toxicity Schema" provides a schema and dataset for toxicity in the Japanese language.
  • camera - CAMERA (CyberAgent Multimodal Evaluation for Ad Text GeneRAtion) is the Japanese ad text generation dataset.
  • Japanese-Fakenews-Dataset - Japanese fake news dataset
  • jpn_explainable_qa_dataset - jpn_explainable_qa_dataset
  • copa-japanese - COPA Dataset in Japanese
  • WLSP-familiarity - Word Familiarity Rate for 'Word List by Semantic Principles (WLSP)'
  • ProSub - A cross-linguistic study of pronoun substitutes and address terms
  • ramendb - Scraping tools for, and data collected from, the Nantoka Database sites ( https://supleks.jp/ )
  • huggingface-datasets_CAMERA - CAMERA (CyberAgent Multimodal Evaluation for Ad Text GeneRAtion) for huggingface datasets
  • FactCheckSentenceNLI-FCSNLI- - FactCheckSentenceNLI dataset
  • databricks-dolly-15k-ja - A Japanese translation of databricks-dolly-15k.jsonl, the data used to train databricks/dolly-v2-12b.
  • EaST-MELD - EaST-MELD is an English-Japanese dataset for emotion-aware speech translation based on MELD.
  • meconaudio - Mecon Audio (Medical Conference Audio) is a dataset of read-aloud minutes from the Advanced Medical Care Council meetings hosted by the Ministry of Health, Labour and Welfare.
  • japanese-addresses - Open data of Japanese addresses at the town/chome level nationwide (277,191 entries)
  • aozorasearch - A full-text search library and web app for Aozora Bunko, built on Groonga.
  • llm-jp-corpus - This repository contains scripts to reproduce the LLM-jp corpus.
  • alpaca_ja - A Japanese version of the Alpaca dataset
  • instruction_ja - Japanese instruction data
  • japanese-family-names - Top 5000 Japanese family names, with readings, ordered by frequency.
  • kanji-data-media - Japanese language data on kanji, radicals, media files, fonts and related resources from Kanji alive
  • reazonspeech - Construct large-scale Japanese audio corpus at home
  • huriganacorpus-aozora - Furigana dataset built from Aozora Bunko and Sapie braille data
  • koniwa - An open collection of annotated voices in Japanese language
  • JMMLU - Japanese Massive Multitask Language Understanding Benchmark
  • hurigana-speech-corpus-aozora - Dataset of the Aozora Bunko speech corpus annotated with furigana
  • jqara - JQaRA: Japanese Question Answering with Retrieval Augmentation - a Japanese Q&A dataset for evaluating retrieval-augmented generation (RAG)
  • jemhopqa - JEMHopQA (Japanese Explainable Multi-hop Question Answering) is a Japanese multi-hop QA dataset that can evaluate internal reasoning.
  • jacred - Repository for Japanese Document-level Relation Extraction Dataset (plan to be released in March).
  • jades - JADES is a dataset for text simplification in Japanese, described in "JADES: New Text Simplification Dataset in Japanese Targeted at Non-Native Speakers" (the paper will be available soon).
  • do-not-answer-ja - A Japanese version of Do-Not-Answer, the safety evaluation dataset released by the University of Melbourne in August 2023. It was machine-translated into Japanese and then revised to reflect Japanese culture so that it can also be used to evaluate Japanese LLMs.
  • oasst1-89k-ja - A Japanese translation of OASST1, the open-source dataset from OpenAssistant.
  • jacwir - JaCWIR: Japanese Casual Web IR - a small, casual dataset of web titles and summaries for evaluating Japanese information retrieval
  • japanese-technical-dict - A table mapping katakana words commonly used in the science and technology industry to their original words, for learners of Japanese
  • j-unimorph - Dataset of UniMorph in Japanese
  • GazeVQA - Dataset for the LREC-COLING 2024 paper "A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions"
  • J-CRe3 - Code for J-CRe3 experiments (Ueda et al., LREC-COLING, 2024)
  • jmed-llm - JMED-LLM: Japanese Medical Evaluation Dataset for Large Language Models
  • lawtext - Plain text format for Japanese law
  • pdmocrdataset-part2 - OCR training dataset created in the OCR processing program research and development project
  • japanesetopicwsd - Evaluation set for topic-based word sense disambiguation
  • temporalNLI_dataset - Jamp: Controlled Japanese Temporal Inference Dataset for Evaluating Generalization Capacity of Language Models
  • JSeM - Japanese semantic test suite (FraCaS counterpart and extensions)
  • niilc-qa - NIILC QA data
  • chain-of-thought-ja-dataset - Dataset of paper "Verification of Chain-of-Thought Prompting in Japanese"
  • WikipediaAnnotatedCorpus - This is a Japanese text corpus that consists of Wikipedia articles with various linguistic annotations.
  • elaws-history - Periodically downloads and archives the "all laws and regulations" data distributed by e-Gov Law Search
  • Japanese-RP-Bench - Japanese-RP-Bench is a benchmark for measuring the Japanese role-play ability of LLMs.
  • hdic - HDIC : Integrated Database of Hanzi Dictionaries in Early Japan
  • awesome-japan-opendata - Awesome Japan Open Data - a list and summary of open data resources in Japan
  • kanji-data - Data on kanji, including the Joyo kanji list
  • openchj-genji - Morphological information data for The Tale of Genji (源氏物語)
  • AdParaphrase - This repository contains data for our paper "AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts".
  • Jamp_sp - Jamp_sp: Controlled Japanese Temporal Inference Dataset Considering Aspect
  • jnli-neg - Public repository for JNLI-Neg, a Japanese natural language inference dataset for evaluating negation understanding.
  • swallow-corpus - This repository provides Python implementation for building Swallow Corpus Version 1, a large Japanese web corpus (Okazaki et al., 2024), from Common Crawl archives.
  • jalecon - A Dataset of Japanese Lexical Complexity for Non-Native Readers
  • multils-japanese - MultiLS-Japanese Lexical Complexity Prediction and Lexical Simplification Dataset for Japanese: annotator profiles, unaggregated annotation, and annotation guidelines.
  • nwjc - NINJAL Web Japanese Corpus
  • open-mantra-dataset - Dataset introduced in the paper "Towards Fully Automated Manga Translation" presented in AAAI21
  • public-annotations - Various annotations of Manga109 dataset
  • gimei - random Japanese name and address generator
  • safety-boundary-test - A test set for evaluating the safety behavior of Japanese language models
  • j-ono-data - A simple, open-source collection of Japanese onomatopoeic and mimetic sound words in JSON format. With manga samples.
  • kanji - List of Japanese kanji radicals to learn
  • jethics - Overview page for JETHICS, a dataset for evaluating understanding of Japanese ethics (to be updated)
  • waon - WAON: Large-Scale and High-Quality Japanese Image-Text Dataset for Vision-Language Models
  • kuci - Kyoto University Commonsense Inference dataset (KUCI)
  • japanese-address-testdata - Test dataset of Japanese addresses that are hard to parse
  • jlpt-word-list - Japanese word list from JLPT vocabulary
  • hiragana_mojigazo - Character image dataset (73-character hiragana version)
  • lawqa_jp - Multiple-choice QA dataset on Japanese laws and regulations
  • yjcaptions - YJ Captions 26k Dataset
  • ja-vg-vqa - Japanese Visual Genome VQA dataset
  • lawhub - Repository to track Japanese Law in text format
  • japanese-subtitles-word-kanji-frequency-lists - A word frequency list derived from subtitles from Japanese drama, anime and films.
  • jconj - A table-based Japanese word conjugator
  • extract_jawp_names - Extracts personal names in Wikipedia Japanese.
  • cejc_yomichan_freq_dict - Frequency dictionary for yomichan based on the Corpus of Everyday Japanese Conversation dataset
  • wikidict-ja - Wikipedia Bilingual Reference Data (Japanese)
  • ajimee-bench - AJIMEE-Bench (Advanced Japanese IME Evaluation Benchmark)
  • j-spaw - J-SpAW: Japanese speech corpus for speaker verification and anti-spoofing
  • camera3 - CAMERA3: An Evaluation Dataset for Controllable Ad Text Generation in Japanese
  • jgpqa - Japanese translation of the GPQA dataset
  • tanaka-corpus-plus - The Tanaka Corpus with noise removed.
  • emotioncorpusjapanesetokushimaa2lab - Japanese emotion corpus Tokushima Univ. A-2 Lab.
  • osworld-jp - A Japanese computer-use benchmark for language-aware evaluation
  • quasi_japanese_reviews - Quasi Japanese Reviews (pseudo review data)
  • psychiatry-clinical-notes - Survey dataset on writing psychiatric initial-visit clinical notes
  • merged-town-names - Correspondence table between old place names that disappeared through municipal mergers and similar changes, and the new place names
  • japanesetextemoticondata - Japanese text-emoticon data.
  • mishearing-corpus - Mishearing corpus: a Japanese dataset of about 10,000 entries managed with CSV and Table Schema, automatically validated with VS Code, pre-commit, Frictionless, and GitHub Actions
  • kotowaza - Structured JSON dataset of Japanese proverbs (kotowaza) with meanings in Indonesian & English, examples, JLPT levels, and tags.
  • selective-rag-kasensabo - An MVP of a practical agentic RAG system that automatically classifies the specialization granularity (fine/coarse) of questions about construction technical standards with 96% accuracy and selects the best-suited RAG system (ColBERT or naive). Four RAG systems were built on the technical standards for rivers, erosion control (sabo), and dams published in November 2025, and their accuracy and speed were compared on 200 questions of differing specialization granularity.
Name | downloads/week | total downloads | stars | last commit
🔗 jrte-corpus | - | - | ⭐ 77 | 🔴 june 2023
🔗 kanji-data | - | - | ⭐ 201 | 🔴 december 2019
🔗 JapaneseWordSimilarityDataset | - | - | ⭐ 102 | 🔴 december 2021
🔗 simple-jppdb | - | - | ⭐ 32 | 🔴 march 2017
🔗 chABSA-dataset | - | - | ⭐ 141 | 🔴 september 2018
🔗 JaQuAD | - | - | ⭐ 108 | 🔴 january 2022
🔗 JaNLI | - | - | ⭐ 17 | 🔴 may 2023
🔗 ebe-dataset | - | - | ⭐ 18 | 🔴 december 2020
🔗 emoji-ja | - | - | ⭐ 82 | 🟡 march 2025
🔗 nayose-wikipedia-ja | - | - | ⭐ 35 | 🔴 march 2020
🔗 ja.text8 | - | - | ⭐ 111 | 🔴 october 2017
🔗 ThreeLineSummaryDataset | - | - | ⭐ 31 | 🔴 april 2018
🔗 japanese | - | - | ⭐ 83 | 🔴 september 2018
🔗 kanji-frequency | - | - | ⭐ 155 | 🟢 january
🔗 TEDxJP-10K | - | - | ⭐ 24 | 🔴 january 2021
🔗 CoARiJ | - | - | ⭐ 94 | 🔴 december 2020
🔗 technological-book-corpus-ja | - | - | ⭐ 26 | 🔴 july 2023
🔗 ita-corpus-chuwa | - | - | ⭐ 5 | 🔴 august 2021
🔗 wikipedia-utils | - | - | ⭐ 78 | 🔴 april 2024
🔗 inappropriate-words-ja | - | - | ⭐ 201 | 🔴 december 2021
🔗 house-of-councillors | - | - | ⭐ 105 | 🟢 yesterday
🔗 house-of-representatives | - | - | ⭐ 176 | 🟢 yesterday
🔗 STAIR-captions | - | - | ⭐ 90 | 🔴 july 2018
🔗 Winograd-Schema-Challenge-Ja | - | - | ⭐ 6 | 🔴 january 2019
🔗 speechBSD | - | - | ⭐ 3 | 🔴 february 2024
🔗 ita-corpus | - | - | ⭐ 222 | 🔴 december 2024
🔗 rohan4600 | - | - | ⭐ 66 | 🔴 february 2023
🔗 anlp-jp-history | - | - | ⭐ 3 | 🔴 april 2024
🔗 keigo_transfer_task | - | - | ⭐ 21 | 🔴 november 2022
🔗 loanwords_gairaigo | - | - | ⭐ 19 | 🔴 january 2021
🔗 jawikicorpus | - | - | ⭐ 4 | 🔴 november 2018
🔗 GeneralPolicySpeechOfPrimeMinisterOfJapan | - | - | ⭐ 6 | 🔴 january 2020
🔗 wrime | - | - | ⭐ 174 | 🟡 september 2025
🔗 jtubespeech | - | - | ⭐ 229 | 🔴 march 2023
🔗 WikipediaWordFrequencyList | - | - | ⭐ 2 | 🔴 april 2022
🔗 kokkosho_data | - | - | ⭐ 1 | 🔴 july 2019
🔗 pdmocrdataset-part1 | - | - | ⭐ 75 | 🔴 june 2024
🔗 huriganacorpus-ndlbib | - | - | ⭐ 28 | 🔴 september 2021
🔗 jvs_hiho | - | - | ⭐ 31 | 🔴 february 2021
🔗 hirakanadic | 📥 61 | 📦 14k | ⭐ 7 | 🔴 july 2023
🔗 animedb | - | - | ⭐ 328 | 🔴 january 2023
🔗 security_words | - | - | ⭐ 27 | 🔴 august 2023
🔗 Data-on-Japanese-Diet-Members | - | - | ⭐ 3 | 🔴 september 2022
🔗 honkoku-data | - | - | ⭐ 17 | 🟢 yesterday
🔗 wikihow_japanese | - | - | ⭐ 35 | 🔴 december 2020
🔗 engineer-vocabulary-list | - | - | ⭐ 1.9k | 🔴 november 2020
🔗 JSICK | - | - | ⭐ 45 | 🔴 may 2023
🔗 phishurl-list | - | - | ⭐ 199 | 🟡 november 2025
🔗 jcms | - | - | ⭐ 9 | 🔴 november 2022
🔗 aozorabunko_text | - | - | ⭐ 85 | 🔴 march 2023
🔗 friendly_JA-Corpus | - | - | ⭐ repo not found | 🔴 repo not found
🔗 topokanji | - | - | ⭐ 199 | 🔴 january 2016
🔗 isbn4groups | - | - | ⭐ 1 | 🔴 june 2024
🔗 NMeCab | - | - | ⭐ 97 | 🔴 march 2024
🔗 ndlngramdata | - | - | ⭐ 14 | 🔴 january 2023
🔗 ndlngramviewer_v2 | - | - | ⭐ 3 | 🔴 july 2023
🔗 data_set | - | - | ⭐ 49 | 🔴 january 2025
🔗 huggingface-datasets_wrime | - | - | ⭐ 4 | 🔴 january 2023
🔗 ndl-minhon-ocrdataset | - | - | ⭐ 18 | 🔴 february 2025
🔗 PAX_SAPIENTICA | - | - | ⭐ 180 | 🟢 december 2025
🔗 j-liwc2015 | - | - | ⭐ 13 | 🔴 november 2024
🔗 huggingface-datasets_livedoor-news-corpus | - | - | ⭐ 2 | 🔴 october 2023
🔗 huggingface-datasets_JGLUE | - | - | ⭐ 12 | 🟡 march 2025
🔗 commonsense-moral-ja | - | - | ⭐ 15 | 🟡 november 2025
🔗 comet-atomic-ja | - | - | ⭐ 31 | 🔴 march 2024
🔗 dcsg-ja | - | - | ⭐ 6 | 🔴 march 2023
🔗 japanese-toxic-dataset | - | - | ⭐ 21 | 🔴 january 2023
🔗 camera | - | - | ⭐ 26 | 🔴 august 2024
🔗 Japanese-Fakenews-Dataset | - | - | ⭐ 20 | 🔴 may 2021
🔗 jpn_explainable_qa_dataset | - | - | ⭐ repo not found | 🔴 repo not found
🔗 copa-japanese | - | - | ⭐ 1 | 🔴 february 2023
🔗 WLSP-familiarity | - | - | ⭐ 12 | 🔴 january 2025
🔗 ProSub | - | - | ⭐ 5 | 🟡 april 2025
🔗 ramendb | - | - | ⭐ 7 | 🟢 today
🔗 huggingface-datasets_CAMERA | - | - | ⭐ 3 | 🔴 march 2023
🔗 FactCheckSentenceNLI-FCSNLI- | - | - | ⭐ 0 | 🔴 march 2021
🔗 databricks-dolly-15k-ja | - | - | ⭐ 89 | 🔴 july 2023
🔗 EaST-MELD | - | - | ⭐ 0 | 🔴 june 2023
🔗 meconaudio | - | - | ⭐ 9 | 🔴 october 2023
🔗 japanese-addresses | - | - | ⭐ 758 | 🟢 december 2025
🔗 aozorasearch | - | - | ⭐ 21 | 🔴 september 2020
🔗 llm-jp-corpus | - | - | ⭐ 43 | 🔴 october 2023
🔗 alpaca_ja | - | - | ⭐ 86 | 🔴 may 2023
🔗 instruction_ja | - | - | ⭐ 24 | 🔴 july 2023
🔗 japanese-family-names | - | - | ⭐ 18 | 🔴 june 2017
🔗 kanji-data-media | - | - | ⭐ 401 | 🔴 november 2023
🔗 reazonspeech | - | - | ⭐ 360 | 🟢 january
🔗 huriganacorpus-aozora | - | - | ⭐ 17 | 🔴 january 2024
🔗 koniwa | - | - | ⭐ 56 | 🟡 april 2025
🔗 JMMLU | - | - | ⭐ 38 | 🟡 october 2025
🔗 hurigana-speech-corpus-aozora | - | - | ⭐ 42 | 🟡 march 2025
🔗 jqara | - | - | ⭐ 42 | 🟡 september 2025
🔗 jemhopqa | - | - | ⭐ 29 | 🟡 april 2025
🔗 jacred | - | - | ⭐ 7 | 🔴 march 2024
🔗 jades | - | - | ⭐ 0 | 🔴 december 2022
🔗 do-not-answer-ja | - | - | ⭐ 24 | 🔴 december 2023
🔗 oasst1-89k-ja | - | - | ⭐ 16 | 🔴 november 2023
🔗 jacwir | - | - | ⭐ 8 | 🟡 september 2025
🔗 japanese-technical-dict | - | - | ⭐ 3 | 🔴 november 2024
🔗 j-unimorph | - | - | ⭐ 9 | 🟢 january
🔗 GazeVQA | - | - | ⭐ 0 | 🔴 september 2024
🔗 J-CRe3 | - | - | ⭐ 9 | 🔴 january 2025
🔗 jmed-llm | - | - | ⭐ 56 | 🔴 september 2024
🔗 lawtext | - | - | ⭐ 91 | 🟢 january
🔗 pdmocrdataset-part2 | - | - | ⭐ 14 | 🔴 june 2024
🔗 japanesetopicwsd | - | - | ⭐ 2 | 🔴 september 2018
🔗 temporalNLI_dataset | - | - | ⭐ 1 | 🔴 july 2023
🔗 JSeM | - | - | ⭐ 13 | 🔴 november 2024
🔗 niilc-qa | - | - | ⭐ 18 | 🔴 november 2015
🔗 chain-of-thought-ja-dataset | - | - | ⭐ 5 | 🔴 september 2023
🔗 WikipediaAnnotatedCorpus | - | - | ⭐ 12 | 🟢 last thursday
🔗 elaws-history | - | - | ⭐ 4 | 🟢 yesterday
🔗 Japanese-RP-Bench | - | - | ⭐ 18 | 🔴 september 2024
🔗 hdic | - | - | ⭐ 40 | 🟢 today
🔗 awesome-japan-opendata | - | - | ⭐ 156 | 🟢 february
🔗 kanji-data | - | - | ⭐ 11 | 🟢 yesterday
🔗 openchj-genji | - | - | ⭐ 2 | 🟡 march 2025
🔗 AdParaphrase | - | - | ⭐ 1 | 🟡 may 2025
🔗 Jamp_sp | - | - | ⭐ 0 | 🔴 june 2024
🔗 jnli-neg | - | - | ⭐ 0 | 🟢 december 2025
🔗 swallow-corpus | - | - | ⭐ 5 | 🔴 november 2024
🔗 jalecon | - | - | ⭐ 5 | 🔴 july 2023
🔗 multils-japanese | - | - | ⭐ 0 | 🟢 january
🔗 nwjc | - | - | ⭐ 10 | 🔴 april 2022
🔗 open-mantra-dataset | - | - | ⭐ 198 | 🔴 march 2023
🔗 gimei | - | - | ⭐ 424 | 🟢 january
🔗 safety-boundary-test | - | - | ⭐ 9 | 🟡 july 2025
🔗 j-ono-data | - | - | ⭐ 6 | 🟢 february
🔗 kanji | - | - | ⭐ 28 | 🟢 yesterday
🔗 jethics | - | - | ⭐ 2 | 🟡 june 2025
🔗 waon | - | - | ⭐ 6 | 🟡 november 2025
🔗 kuci | - | - | ⭐ 5 | 🔴 february 2024
🔗 japanese-address-testdata | - | - | ⭐ 14 | 🔴 september 2023
🔗 jlpt-word-list | - | - | ⭐ 58 | 🔴 february 2022
🔗 hiragana_mojigazo | - | - | ⭐ 18 | 🔴 april 2020
🔗 lawqa_jp | - | - | ⭐ 264 | 🟢 last friday
🔗 yjcaptions | - | - | ⭐ 60 | 🔴 november 2016
🔗 ja-vg-vqa | - | - | ⭐ 30 | 🔴 november 2018
🔗 lawhub | - | - | ⭐ 152 | 🔴 november 2020
🔗 japanese-subtitles-word-kanji-frequency-lists | - | - | ⭐ 39 | 🔴 december 2023
🔗 jconj | - | - | ⭐ 35 | 🔴 may 2020
🔗 extract_jawp_names | - | - | ⭐ 21 | 🔴 december 2022
🔗 cejc_yomichan_freq_dict | - | - | ⭐ 9 | 🔴 june 2023
🔗 wikidict-ja | - | - | ⭐ 5 | 🔴 june 2016
🔗 ajimee-bench | - | - | ⭐ 18 | 🔴 january 2025
🔗 j-spaw | - | - | ⭐ 5 | 🟡 august 2025
🔗 camera3 | - | - | ⭐ 4 | 🔴 may 2024
🔗 jgpqa | - | - | ⭐ 2 | 🟡 september 2025
🔗 tanaka-corpus-plus | - | - | ⭐ 2 | 🔴 june 2021
🔗 emotioncorpusjapanesetokushimaa2lab | - | - | ⭐ 2 | 🔴 september 2024
🔗 osworld-jp | - | - | ⭐ 2 | 🟡 november 2025
🔗 quasi_japanese_reviews | - | - | ⭐ 1 | 🔴 july 2023
🔗 psychiatry-clinical-notes | - | - | ⭐ 1 | 🟡 october 2025
🔗 merged-town-names | - | - | ⭐ 1 | 🔴 may 2022
🔗 japanesetextemoticondata | - | - | ⭐ 1 | 🔴 march 2021
🔗 mishearing-corpus | - | - | ⭐ 1 | 🟢 january
🔗 kotowaza | - | - | ⭐ 1 | 🟢 yesterday
🔗 selective-rag-kasensabo | - | - | ⭐ 1 | 🟡 november 2025

Tutorial

Guides and tutorials for learning Japanese NLP tools and techniques

  • spacy_tutorial - spaCy tutorial in English and Japanese. spacy-transformers, BERT, GiNZA.
  • fastTextJapaneseTutorial - Tutorial to train fastText with Japanese corpus
  • allennlp-NER-ja - AllenNLP-NER-ja: named entity recognition for Japanese with AllenNLP
  • chariot-PyTorch-Japanese-text-classification - Experiment for Japanese Text classification using chariot and PyTorch
  • ginza-examples - An introduction to GiNZA, a Japanese NLP library
  • DocumentClassificationUsingBERT-Japanese - DocumentClassificationUsingBERT-Japanese
  • BERT_Japanese_Google_Colaboratory - Google Colaboratoryใงๆ—ฅๆœฌ่ชžใฎBERTใ‚’ๅ‹•ใ‹ใ™ๆ–นๆณ•ใงใ™ใ€‚
  • bert-book - ใ€ŒBERTใซใ‚ˆใ‚‹่‡ช็„ถ่จ€่ชžๅ‡ฆ็†ๅ…ฅ้–€: Transformersใ‚’ไฝฟใฃใŸๅฎŸ่ทตใƒ—ใƒญใ‚ฐใƒฉใƒŸใƒณใ‚ฐใ€ใ‚ตใƒใƒผใƒˆใƒšใƒผใ‚ธ
  • janome-tutorial - Janome ใ‚’ไฝฟใฃใŸใƒ†ใ‚ญใ‚นใƒˆใƒžใ‚คใƒ‹ใƒณใ‚ฐๅ…ฅ้–€ใƒใƒฅใƒผใƒˆใƒชใ‚ขใƒซใงใ™ใ€‚
  • handson-language-models - ๆ—ฅๆœฌ่ชžใฎ่จ€่ชžใƒขใƒ‡ใƒซใฎใƒใƒณใ‚บใ‚ชใƒณ่ณ‡ๆ–™ใงใ™
  • JapaneseNLI - Google Colabใงๆ—ฅๆœฌ่ชžใƒ†ใ‚ญใ‚นใƒˆๆŽจ่ซ–ใ‚’่ฉฆใ™
  • deep-learning-with-pytorch-ja - deep-learning-with-pytorchใฎๆ—ฅๆœฌ่ชž็‰ˆrepositoryใงใ™ใ€‚
  • bert-classification-tutorial -ใ€2023ๅนด็‰ˆใ€‘BERTใซใ‚ˆใ‚‹ใƒ†ใ‚ญใ‚นใƒˆๅˆ†้กž
  • python-nlp-book - ใƒ‡ใ‚ฃใƒผใƒ—ใƒฉใƒผใƒ‹ใƒณใ‚ฐใซใ‚ˆใ‚‹่‡ช็„ถ่จ€่ชžๅ‡ฆ็†๏ผˆๅ…ฑ็ซ‹ๅ‡บ็‰ˆ๏ผ‰ใฎใ‚ตใƒใƒผใƒˆใƒšใƒผใ‚ธใงใ™
  • llm-book - ใ€Œๅคง่ฆๆจก่จ€่ชžใƒขใƒ‡ใƒซๅ…ฅ้–€ใ€๏ผˆๆŠ€่ก“่ฉ•่ซ–็คพ, 2023๏ผ‰ใฎGitHubใƒชใƒใ‚ธใƒˆใƒช
  • nlp2024-tutorial-3 - NLP2024 ใƒใƒฅใƒผใƒˆใƒชใ‚ขใƒซ๏ผ“ ไฝœใฃใฆๅญฆใถๆ—ฅๆœฌ่ชžๅคง่ฆๆจก่จ€่ชžใƒขใƒ‡ใƒซ - ็’ฐๅขƒๆง‹็ฏ‰ๆ‰‹้ †ใจใ‚ฝใƒผใ‚นใ‚ณใƒผใƒ‰
  • japanese-ir-tutorial - ๆ—ฅๆœฌ่ชžๆƒ…ๅ ฑๆคœ็ดขใƒใƒฅใƒผใƒˆใƒชใ‚ขใƒซ
  • nlpbook - ใ€Œ่‡ช็„ถ่จ€่ชžๅ‡ฆ็†ใฎๆ•™็ง‘ๆ›ธใ€ใ‚ตใƒใƒผใƒˆใ‚ตใ‚คใƒˆ
  • kantan-regex-book - ไฝœใฃใฆๅญฆใถๆญฃ่ฆ่กจ็พใ‚จใƒณใ‚ธใƒณ
  • bert-classification-tutorial-2024 - ใ€2024ๅนด็‰ˆใ€‘BERTใซใ‚ˆใ‚‹ใƒ†ใ‚ญใ‚นใƒˆๅˆ†้กž
  • Gemma2_2b_Japanese_finetuning_colab.ipynb - Fine-Tuning Google Gemma for Japanese Instructions
  • nlp100v2020 - ใ€Œ่จ€่ชžๅ‡ฆ็†100ๆœฌใƒŽใƒƒใ‚ฏ 2020ใ€ใ‚’Pythonใง่งฃใ
  • textmining-ja - Rใซใ‚ˆใ‚‹่‡ช็„ถ่จ€่ชžๅ‡ฆ็†ใƒปใƒ†ใ‚ญใ‚นใƒˆๅˆ†ๆžใฎ็ทด็ฟ’
  • nlp2025-tutorial-2 - NLP2025 ใฎใƒใƒฅใƒผใƒˆใƒชใ‚ขใƒซใ€Œๅœฐ็†ๆƒ…ๅ ฑใจ่จ€่ชžๅ‡ฆ็† ๅฎŸ่ทตๅ…ฅ้–€ใ€ใฎ่ณ‡ๆ–™ใจใ‚ฝใƒผใ‚นใ‚ณใƒผใƒ‰
  • nlp100v2025 - ใ€Œ่จ€่ชžๅ‡ฆ็†100ๆœฌใƒŽใƒƒใ‚ฏ 2025ใ€ใ‚’Pythonใง่งฃใ
  • topic-models-ao - ใ€Žใƒˆใƒ”ใƒƒใ‚ฏใƒขใƒ‡ใƒซใ€(ๆฉŸๆขฐๅญฆ็ฟ’ใƒ—ใƒญใƒ•ใ‚งใƒƒใ‚ทใƒงใƒŠใƒซใ‚ทใƒชใƒผใ‚บ)ใฎใƒŽใƒผใƒˆ
  • slp2025 -้Ÿณๅญฆใ‚ทใƒณใƒใ‚ธใ‚ฆใƒ 2025ใƒใƒฅใƒผใƒˆใƒชใ‚ขใƒซใ€Œใƒžใƒซใƒใƒขใƒผใƒ€ใƒซๅคง่ฆๆจก่จ€่ชžใƒขใƒ‡ใƒซๅ…ฅ้–€ใ€่ณ‡ๆ–™
  • book_impress_it-basic-education-ai - ใ‚คใƒณใƒ—ใƒฌใ‚นๅ‡บ็‰ˆใ€ŒITๅŸบ็คŽๆ•™้คŠ ่‡ช็„ถ่จ€่ชžๅ‡ฆ็†๏ผ†็”ปๅƒ่งฃๆžใ€
  • genai-agent-advanced-book - ๆ›ธ็ฑใ€Œ็พๅ ดใงๆดป็”จใ™ใ‚‹ใŸใ‚ใฎ็”ŸๆˆAIใ‚จใƒผใ‚ธใ‚งใƒณใƒˆๅฎŸ่ทตๅ…ฅ้–€ใ€๏ผˆ่ฌ›่ซ‡็คพใ‚ตใ‚คใ‚จใƒณใƒ†ใ‚ฃใƒ•ใ‚ฃใƒƒใ‚ฏ็คพ๏ผ‰ใงๅˆฉ็”จใ•ใ‚Œใ‚‹ใ‚ฝใƒผใ‚นใ‚ณใƒผใƒ‰
  • course2024-nlp - 2024ๅนดๅบฆ ็ซ‹ๆ•™ๅคงๅญฆๅคงๅญฆ้™ข ไบบๅทฅ็Ÿฅ่ƒฝ็ง‘ๅญฆ็ ”็ฉถ็ง‘ ่‡ช็„ถ่จ€่ชžๅ‡ฆ็†็‰น่ซ–
  • support-genai-book - ๅŽŸ่ซ–ๆ–‡ใ‹ใ‚‰่งฃใๆ˜Žใ‹ใ™็”ŸๆˆAI๏ผˆๆŠ€่ก“่ฉ•่ซ–็คพ๏ผ‰ใฎใ‚ตใƒใƒผใƒˆใƒšใƒผใ‚ธใงใ™
  • ir100 - ๆƒ…ๅ ฑๆคœ็ดข100ๆœฌใƒŽใƒƒใ‚ฏ
  • kaggle_llm_book - ใ€ŽKaggle ใงใฏใ˜ใ‚ใ‚‹ๅคง่ฆๆจก่จ€่ชžใƒขใƒ‡ใƒซๅ…ฅ้–€ใ€€๏ฝž่‡ช็„ถ่จ€่ชžๅ‡ฆ็†ใ€ˆๅฎŸ่ทตใ€‰ใƒ—ใƒญใ‚ฐใƒฉใƒŸใƒณใ‚ฐ๏ฝžใ€ใฎใ‚ตใƒใƒผใƒˆใ‚ตใ‚คใƒˆ
Name | downloads/week | total downloads | stars | last commit
🔗 spacy_tutorial | - | - | ⭐ 63 | 🔴 january 2020
🔗 fastTextJapaneseTutorial | - | - | ⭐ 205 | 🔴 september 2016
🔗 allennlp-NER-ja | - | - | ⭐ 5 | 🔴 may 2022
🔗 chariot-PyTorch-Japanese-text-classification | - | - | ⭐ 5 | 🔴 march 2019
🔗 ginza-examples | - | - | ⭐ 16 | 🔴 january 2021
🔗 DocumentClassificationUsingBERT-Japanese | - | - | ⭐ 0 | 🟡 august 2025
🔗 BERT_Japanese_Google_Colaboratory | - | - | ⭐ 29 | 🔴 january 2022
🔗 bert-book | - | - | ⭐ 264 | 🔴 february 2024
🔗 janome-tutorial | - | - | ⭐ 31 | 🔴 march 2019
🔗 handson-language-models | - | - | ⭐ 3 | 🔴 march 2021
🔗 JapaneseNLI | - | - | ⭐ 6 | 🔴 june 2021
🔗 deep-learning-with-pytorch-ja | - | - | ⭐ 142 | 🔴 may 2021
🔗 bert-classification-tutorial | - | - | ⭐ 235 | 🔴 may 2024
🔗 python-nlp-book | - | - | ⭐ 10 | 🔴 may 2023
🔗 llm-book | - | - | ⭐ 461 | 🟢 december 2025
🔗 nlp2024-tutorial-3 | - | - | ⭐ 112 | 🔴 april 2024
🔗 japanese-ir-tutorial | - | - | ⭐ 3 | 🔴 june 2024
🔗 nlpbook | - | - | ⭐ 14 | 🟡 april 2025
🔗 kantan-regex-book | - | - | ⭐ 22 | 🔴 march 2024
🔗 bert-classification-tutorial-2024 | - | - | ⭐ 30 | 🔴 july 2024
🔗 Gemma2_2b_Japanese_finetuning_colab.ipynb | - | - | ⭐ repo not found | 🔴 august 2024
🔗 nlp100v2020 | - | - | ⭐ 91 | 🟡 april 2025
🔗 textmining-ja | - | - | ⭐ 3 | 🟡 october 2025
🔗 nlp2025-tutorial-2 | - | - | ⭐ 17 | 🟢 january
🔗 nlp100v2025 | - | - | ⭐ 91 | 🟡 april 2025
🔗 public-annotations | - | - | ⭐ 13 | 🟡 april 2025
🔗 topic-models-ao | - | - | ⭐ 4 | 🟡 may 2025
🔗 slp2025 | - | - | ⭐ 62 | 🟢 february
🔗 book_impress_it-basic-education-ai | - | - | ⭐ 4 | 🟡 june 2025
🔗 genai-agent-advanced-book | - | - | ⭐ 185 | 🟡 september 2025
🔗 course2024-nlp | - | - | ⭐ repo not found | 🔴 repo not found
🔗 support-genai-book | - | - | ⭐ 89 | 🟢 january
🔗 ir100 | - | - | ⭐ 93 | 🟢 december 2025
🔗 kaggle_llm_book | - | - | ⭐ 19 | 🟢 last saturday

Research summary

Summaries of studies and papers in Japanese NLP research

  • awesome-bert-japanese - A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
  • GEC-Info-ja - Repository for collecting and classifying Japanese-language literature on grammatical error correction
  • dataset-list - lists of text corpus and more (mainly Japanese)
  • tuning_playbook_ja - A playbook for systematically maximizing the performance of deep learning models
  • japanese-pitch-accent-resources - An attempt to consolidate Japanese phonetic resources, in particular pitch accent resources, into one list
  • awesome-japanese-llm - Overview of open-source Japanese LLMs
Name | downloads/week | total downloads | stars | last commit
🔗 awesome-bert-japanese | - | - | ⭐ 131 | 🔴 march 2023
🔗 GEC-Info-ja | - | - | ⭐ 12 | 🟡 april 2025
🔗 dataset-list | - | - | ⭐ 117 | 🔴 july 2024
🔗 tuning_playbook_ja | - | - | ⭐ 190 | 🔴 january 2023
🔗 japanese-pitch-accent-resources | - | - | ⭐ 124 | 🔴 february 2024
🔗 awesome-japanese-llm | - | - | ⭐ 1.3k | 🟢 yesterday

Reference

Contributors
