awesome-japanese-nlp-resources

A curated list of resources dedicated to Python libraries, LLMs, dictionaries, and corpora of NLP for Japanese

Stars: 914

README:

Listed information on 823 GitHub repositories
Listed information on 237 Hugging Face repositories (models and datasets)
Released a search tool 🔎 for browsing the large collection of repository information

🎉 The latest additions

Python

edinet-mcp - EDINET XBRL parsing library and MCP server for Japanese financial data.

Corpus

jfinqa - Japanese financial question answering benchmark with 1,000 questions from 68 listed companies.
kotowaza - Structured JSON dataset of Japanese proverbs (kotowaza) with meanings in Indonesian & English, examples, JLPT levels, and tags.
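selective-rag-kasensabo - An MVP of a practical agentic RAG system that automatically classifies the granularity of expertise (fine/coarse) of questions about construction technical standards with 96% accuracy and selects the optimal RAG system (ColBERT/Naive). Four RAG systems were built on the technical standards for rivers, erosion control (sabo), and dams published in November 2025, and their accuracy and speed were compared on 200 questions with differing granularity of expertise.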

Updated on Feb 17, 2026

Hugging Face
- Models
- Datasets
Python library
C++
Rust crate
JavaScript
Go
- Morphology analysis
- Others
Java
- Morphology analysis
- Others
Pretrained model
- Word2Vec
- Transformer based models
ChatGPT
Dictionary and IME
Corpus
Tutorial
Research summary
Reference
Contributors

Python library

Morphology analysis

Libraries that split Japanese text into words or morphemes and assign part-of-speech and base forms

sudachi.rs - SudachiPy 0.6* and above are developed as Sudachi.rs.
Janome - Japanese morphological analysis engine written in pure Python
mecab-python3 - mecab-python. You can find the original version here: http://taku910.github.io/mecab/
mecab - This repository is for building Windows 64-bit MeCab binary and improving MeCab Python binding.
fugashi - A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
nagisa - A Japanese tokenizer based on recurrent neural networks
pyknp - A Python Module for JUMAN++/KNP
Mykytea-python - Python wrapper for KyTea
konoha - Konoha: Simple wrapper of Japanese Tokenizers
natto-py - natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
rakutenma-python - Rakuten MA (Python version)
python-vaporetto - Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
dango - An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
rhoknp - Yet another Python binding for Juman++/KNP
python-vibrato - Viterbi-based accelerated tokenizer (Python wrapper)
jagger-python - Python binding for Jagger (C++ implementation of Pattern-based Japanese Morphological Analyzer)
Mecari - Mecari (Japanese Morphological Analysis with Graph Neural Networks)

Name	downloads/week	total downloads	stars	last commit
🔗 SudachiPy	📥 387k	📦 60M	⭐ 427	🔴 october 2022
🔗 Janome	📥 35k	📦 11M	⭐ 902	🟡 october 2025
🔗 mecab-python3	📥 299k	📦 35M	⭐ 580	🟡 november 2025
🔗 mecab	📥 3k	📦 646k	⭐ 271	🔴 october 2024
🔗 fugashi	📥 129k	📦 13M	⭐ 507	🟡 october 2025
🔗 nagisa	📥 67k	📦 7M	⭐ 412	🟢 last thursday
🔗 pyknp	📥 1k	📦 3M	⭐ 92	🟢 january
🔗 Mykytea-python	📥 1k	📦 553k	⭐ 36	🔴 january 2024
🔗 konoha	📥 37k	📦 5M	⭐ 261	🟡 april 2025
🔗 natto-py	📥 127k	📦 32M	⭐ 95	🔴 november 2023
🔗 rakutenma-python	📥 113	📦 27k	⭐ 23	🔴 may 2017
🔗 python-vaporetto	📥 292	📦 173k	⭐ 21	🟡 june 2025
🔗 dango	📥 90	📦 25k	⭐ 25	🔴 november 2021
🔗 rhoknp	📥 10k	📦 977k	⭐ 37	🟢 january
🔗 python-vibrato	📥 160	📦 115k	⭐ 43	🔴 september 2024
🔗 jagger-python	📥 2k	📦 295k	⭐ 12	🔴 march 2024
🔗 Mecari	-	-	⭐ 38	🟡 september 2025

Parsing

Libraries that analyze syntactic and dependency structures of Japanese sentences

ginza - A Japanese NLP Library using spaCy as framework based on Universal Dependencies
cabocha - Yet Another Japanese Dependency Structure Analyzer
UniDic2UD - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese
camphr - Camphr - NLP library for creating pipeline components
SuPar-UniDic - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese with BERT models
depccg - A* CCG Parser with a Supertag and Dependency Factored Model
bertknp - A Japanese dependency parser based on BERT
esupar - Tokenizer POS-Tagger and Dependency-parser with BERT/RoBERTa/DeBERTa models for Japanese and other languages
yomikata - Heteronym disambiguation library using a fine-tuned BERT model.
jdepp-python - Python binding for J.DepP (C++ implementation of Japanese Dependency Parsers)
lightblue - A CCG parser for Japanese with DTS-representations
natsume-simple - natsume-simple is a search system for Japanese dependency relations
jdeppy - Python wrapper for J.DepP, fast Japanese Dependency Parser

Name	downloads/week	total downloads	stars	last commit
🔗 ginza	📥 10k	📦 2M	⭐ 832	🔴 march 2024
🔗 cabocha	📥 110	📦 54k	⭐ 7	🔴 august 2022
🔗 UniDic2UD	📥 423	📦 326k	⭐ 38	🟢 december 2025
🔗 camphr	📥 614	📦 268k	⭐ 338	🔴 august 2021
🔗 SuPar-UniDic	📥 471	📦 118k	⭐ 20	🔴 repo not found
🔗 depccg	📥 117	📦 46k	⭐ 98	🔴 august 2023
🔗 bertknp	-	-	⭐ 23	🔴 october 2021
🔗 esupar	📥 604	📦 167k	⭐ 54	🟢 january
🔗 yomikata	📥 35	📦 49k	⭐ 32	🔴 october 2023
🔗 jdepp-python	📥 2k	📦 279k	⭐ 4	🔴 february 2024
🔗 lightblue	-	-	⭐ 26	🟢 last thursday
🔗 natsume-simple	-	-	⭐ 5	🔴 february 2025
🔗 jdeppy	📥 41	📦 11k	⭐ 3	🔴 february 2022

Converter

Libraries that convert between character types such as kana, romaji, and full-width/half-width forms

pykakasi - Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
cutlet - Japanese to romaji converter in Python
alphabet2kana - Convert English alphabet to Katakana
Convert-Numbers-to-Japanese - Converts Arabic numerals, or 'western' style numbers, to a Japanese context.
mozcpy - Mozc for Python: Kana-Kanji converter
jamorasep - Japanese text parser to separate Hiragana/Katakana string into morae (syllables).
text2phoneme - A script that converts Japanese sentences into phoneme sequences
jntajis-python - A fast character conversion and transliteration library based on the scheme defined for Japan National Tax Agency (国税庁) 's
wiredify - Convert Japanese kana from ba-bi-bu-be-bo into va-vi-vu-ve-vo
mecab-text-cleaner - Simple Python package (CLI/Python API) for getting Japanese readings (yomigana) and accents using MeCab.
pynormalizenumexp - A Python implementation of NormalizeNumexp, which extracts and normalizes numerical and temporal expressions
Jusho - Easy wrapper for the postal code data of Japan
yurenizer - Japanese text normalizer that resolves spelling inconsistencies. （日本語表記揺れ解消ツール）
e2k - A tool for automatic English to Katakana conversion
alkana.py - A tool to get the katakana reading of an alphabetical string.
englishtokanaconverter - A program that converts English strings to katakana
kanjiconv - Kanji Converter to Hiragana, Katakana, Roman alphabet.
kanjize - Kanjize(カンジャイズ): Easy converter between Kanji-Number and Integer

Name	downloads/week	total downloads	stars	last commit
🔗 pykakasi	📥 230k	📦 28M	⭐ 443	🔴 july 2022
🔗 cutlet	📥 21k	📦 1M	⭐ 372	🟡 june 2025
🔗 alphabet2kana	📥 336	📦 56k	⭐ 14	🟢 yesterday
🔗 Convert-Numbers-to-Japanese	-	-	⭐ 50	🔴 november 2020
🔗 mozcpy	📥 176	📦 12k	⭐ 46	🔴 february 2025
🔗 jamorasep	📥 322	📦 9k	⭐ 11	🟢 today
🔗 text2phoneme	-	-	⭐ 13	🔴 may 2023
🔗 jntajis-python	📥 348	📦 105k	⭐ 21	🔴 june 2023
🔗 wiredify	📥 34	📦 6k	⭐ 3	🟢 december 2025
🔗 mecab-text-cleaner	📥 50	📦 4k	⭐ 7	🔴 november 2024
🔗 pynormalizenumexp	📥 72	📦 14k	⭐ 8	🔴 april 2024
🔗 Jusho	📥 704	📦 53k	⭐ 11	🔴 june 2024
🔗 yurenizer	📥 74	📦 17k	⭐ 4	🟡 march 2025
🔗 e2k	📥 821	📦 23k	⭐ 15	🟡 november 2025
🔗 alkana.py	-	-	⭐ 33	🔴 october 2021
🔗 englishtokanaconverter	-	-	⭐ 4	🟢 yesterday
🔗 kanjiconv	📥 111	📦 12k	⭐ 16	🟡 october 2025
🔗 kanjize	📥 8k	📦 1M	⭐ 68	🟡 june 2025
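
As a rough illustration of the kind of character-type conversion the libraries in this section perform, the following minimal sketch converts between hiragana and katakana using only the Python standard library. It relies on the fact that the two kana blocks sit 0x60 code points apart in Unicode. The function names `hira_to_kata` and `kata_to_hira` are hypothetical and do not come from any listed library; real converters also handle romaji, kanji readings, and many edge cases this sketch ignores.

```python
# Hiragana (U+3041..U+3096) and katakana (U+30A1..U+30F6) are laid out in
# parallel in Unicode, exactly 0x60 code points apart.
KANA_OFFSET = 0x60


def hira_to_kata(text: str) -> str:
    """Shift every hiragana character to its katakana counterpart."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def kata_to_hira(text: str) -> str:
    """Shift every katakana character to its hiragana counterpart."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


if __name__ == "__main__":
    print(hira_to_kata("ひらがな"))
    print(kata_to_hira("カタカナ"))
```

Characters outside the two ranges, such as kanji, ASCII, and the long vowel mark ー, pass through unchanged.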

Preprocessor

Libraries that normalize and clean text before analysis

neologdn - Japanese text normalizer for mecab-neologd
jaconv - Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku, and Zenkaku
mojimoji - A fast converter between Japanese hankaku and zenkaku characters
text-cleaning - A powerful text cleaner for Japanese web texts
HojiChar - A text preprocessing tool for composing and managing multiple preprocessing steps
utsuho - Utsuho is a Python module that facilitates bidirectional conversion between half-width katakana and full-width katakana in Japanese.
python-habachen - Yet Another Fast Japanese String Converter
kairyou - Quickly preprocesses Japanese text using NLP/NER from SpaCy for Japanese translation or other NLP tasks.

Name	downloads/week	total downloads	stars	last commit
🔗 neologdn	📥 6k	📦 1M	⭐ 286	🟢 december 2025
🔗 jaconv	📥 560k	📦 60M	⭐ 340	🟢 february
🔗 mojimoji	📥 60k	📦 11M	⭐ 153	🔴 january 2024
🔗 text-cleaning	-	-	⭐ 12	🔴 november 2022
🔗 HojiChar	📥 17k	📦 725k	⭐ 125	🟡 november 2025
🔗 utsuho	📥 80	📦 19k	⭐ 4	🟡 october 2025
🔗 python-habachen	📥 5k	📦 2M	⭐ 6	🟡 october 2025
🔗 kairyou	📥 126	📦 30k	⭐ 6	🟡 june 2025
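
To give a sense of what normalization before analysis can look like, here is a minimal sketch that uses only the Python standard library. It applies Unicode NFKC normalization, which folds full-width alphanumerics to their ASCII forms and half-width katakana to full-width katakana, and then collapses runs of whitespace. The function name `normalize_text` is hypothetical, and the rules are deliberately simpler than those of the libraries listed above, which apply their own, more careful normalization schemes.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, then collapse whitespace runs to one space."""
    # NFKC maps compatibility characters to their standard equivalents,
    # e.g. full-width "Ａ" -> "A" and half-width "ｶ" -> "カ".
    text = unicodedata.normalize("NFKC", text)
    # After NFKC the ideographic space (U+3000) is an ordinary space,
    # so a single whitespace pattern covers both.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(normalize_text("Ｐｙｔｈｏｎ３　　ｶﾀｶﾅ"))
```

NFKC is a blunt instrument: it also rewrites characters such as circled numbers and ligatures, so a project that needs to preserve them would want narrower rules.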

Sentence splitter

Libraries that automatically detect sentence boundaries and split text

Bunkai - Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
japanese-sentence-breaker - Japanese Sentence Breaker
sengiri - Yet another sentence-level tokenizer for the Japanese text
budoux - Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning powered line break organizer tool.
ja_sentence_segmenter - Japanese sentence segmentation library for Python
hasami - A tool to perform sentence segmentation on Japanese text
kuzukiri - Japanese Text Segmenter for Python written in Rust
ja-senter-benchmark - Comparison of Japanese Sentence Segmentation Tools
fast-bunkai - Japanese sentence splitting(日本語文境界判定器), 40–250× faster via a Rust-accelerated Python library with near-perfect API compatibility with megagonlabs/bunkai.

Name	downloads/week	total downloads	stars	last commit
🔗 bunkai	📥 511	📦 105k	⭐ 199	🔴 august 2023
🔗 japanese-sentence-breaker	📥 38	📦 5k	⭐ 14	🔴 february 2021
🔗 sengiri	📥 80	📦 136k	⭐ 24	🟡 november 2025
🔗 budoux	📥 7k	📦 386k	⭐ 1.6k	🟢 today
🔗 ja_sentence_segmenter	📥 1k	📦 179k	⭐ 73	🔴 april 2023
🔗 hasami	📥 316	📦 37k	⭐ 6	🔴 february 2021
🔗 kuzukiri	📥 101	📦 26k	⭐ 6	🟡 june 2025
🔗 ja-senter-benchmark	-	-	⭐ 9	🔴 february 2023
🔗 fast-bunkai	📥 108	📦 3k	⭐ 65	🟡 october 2025
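
For comparison with the tools in this section, a naive rule-based baseline can be sketched with the standard library alone: treat each run of sentence-final punctuation (。！？ and their ASCII counterparts) as a boundary. The function name `split_sentences` is hypothetical. This baseline mishandles quoted speech such as 「…。」 and text without punctuation, which is the sort of ambiguity the dedicated libraries above are built to resolve.

```python
import re

# One sentence = a run of non-terminator characters followed by any
# consecutive terminators, so "！？" stays attached to its sentence.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text: str) -> list[str]:
    """Split text after runs of sentence-final punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("今日は晴れです。明日は雨ですか？はい！"):
        print(sentence)
```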

Sentiment analysis

Libraries that detect emotions or polarity in text

oseti - Dictionary based Sentiment Analysis for Japanese
negapoji - Japanese negative/positive classification. Determines whether a Japanese document is negative or positive.
pymlask - Emotion analyzer for Japanese text
asari - Japanese sentiment analyzer implemented in Python.

Name	downloads/week	total downloads	stars	last commit
🔗 oseti	📥 314	📦 166k	⭐ 97	🟡 august 2025
🔗 negapoji	-	-	⭐ 151	🔴 august 2017
🔗 pymlask	📥 81	📦 66k	⭐ 116	🔴 july 2024
🔗 asari	📥 180	📦 79k	⭐ 152	🔴 october 2022

Machine translation

Libraries that automatically translate text between languages

jparacrawl-finetune - An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
JASS - JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation (LREC2020) & Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation (ACM TALLIP)
PheMT - A phenomenon-wise evaluation dataset for Japanese-English machine translation robustness. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena; Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. COLING 2020.
VISA - An ambiguous subtitles dataset for visual scene-aware machine translation
plamo-translate-cli - A command-line interface for translation using the plamo-2-translate model with local execution.

Name	downloads/week	total downloads	stars	last commit
🔗 jparacrawl-finetune	-	-	⭐ 105	🔴 april 2021
🔗 JASS	-	-	⭐ 16	🔴 january 2022
🔗 PheMT	-	-	⭐ 18	🔴 february 2021
🔗 VISA	-	-	⭐ 14	🔴 october 2022
🔗 plamo-translate-cli	-	-	⭐ 325	🟡 october 2025

Named entity recognition

Libraries that extract names of people, places, and organizations from text

namaco - Character Based Named Entity Recognition.
entitypedia - Entitypedia is an Extended Named Entity Dictionary from Wikipedia.
noyaki - Converts character span label information to tokenized text-based label information.
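bert-japanese-ner-finetuning - Code to perform finetuning of the BERT model. A sample that creates and uses a model for the named entity recognition task by fine-tuning BERT.
joint-information-extraction-hs - Code for inferring the extraction accuracy of named entities and relations from a case report corpus based on detailed annotation criteria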
pygeonlp - pygeonlp, A python module for geotagging Japanese texts.
bert-ner-japanese - A program for fine-tuning BERT for Japanese named entity recognition
huggingface-finetune-japanese - Examples to finetune encoder-only and encoder-decoder transformers for Japanese language (Hugging Face) Resources
novelanalysisbyner - Named entity recognition by fine-tuning BERT

Name	downloads/week	total downloads	stars	last commit
🔗 namaco	-	-	⭐ 40	🔴 february 2018
🔗 entitypedia	-	-	⭐ 13	🔴 december 2018
🔗 noyaki	📥 100	📦 20k	⭐ 5	🔴 august 2022
🔗 bert-japanese-ner-finetuning	-	-	⭐ 11	🔴 june 2022
🔗 joint-information-extraction-hs	-	-	⭐ 1	🔴 november 2021
🔗 pygeonlp	📥 237	📦 21k	⭐ 22	🟡 october 2025
🔗 bert-ner-japanese	-	-	⭐ 5	🔴 september 2022
🔗 huggingface-finetune-japanese	-	-	⭐ 16	🔴 october 2023
🔗 novelanalysisbyner	-	-	⭐ 2	🔴 june 2024

OCR

Libraries that recognize and extract text from images

Manga OCR - Optical character recognition for Japanese text, with the main focus being Japanese manga
mokuro - Read Japanese manga inside browser with selectable text.
handwritten-japanese-ocr - Handwritten Japanese OCR demo using touch panel to draw the input text using Intel OpenVINO toolkit
OCR_Japanease - Japanese OCR
ndlocr_cli - NDLOCR application
donut - Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
JMTrans - manga translator - get japanese manga from url to translate manga image
Kindai-OCR - OCR system for recognizing modern Japanese magazines
text_recognition - Text recognition module for NDLOCR
Poricom - Optical character recognition in manga images. Manga OCR desktop application
owocr - Optical character recognition for Japanese text
yomitoku - Yomitoku is an AI-powered document image analysis package designed specifically for the Japanese language.
findtextcenternet - Japanese OCR with CenterNet
simple-ocr-for-manga - A simple OCR for manga (Japanese traditional and Japanese vertical)
jp-ocr-evaluation - Evaluates OCR performance on images of Japanese text
paddleocr-vl-sft-for-japanese-manga-on-rtx-3060 - Fine-tune PaddleOCR-VL on the Manga109s dataset for Japanese manga text recognition. The base model struggles with vertical Japanese text reading order in manga. After fine-tuning, the model correctly handles manga-specific text layouts.
MangaOCR - A lightweight OCR model for Japanese text, especially in Manga
meikiocr - high-speed, high-accuracy, local ocr for japanese video games
meikipop - universal japanese ocr popup dictionary for windows, linux and macos

Name	downloads/week	total downloads	stars	last commit
🔗 manga-ocr	📥 4k	📦 240k	⭐ 2.5k	🟡 june 2025
🔗 mokuro	📥 759	📦 89k	⭐ 1.5k	🟡 june 2025
🔗 handwritten-japanese-ocr	-	-	⭐ 38	🔴 april 2022
🔗 OCR_Japanease	-	-	⭐ 244	🔴 april 2021
🔗 ndlocr_cli	-	-	⭐ 567	🟡 september 2025
🔗 donut	📥 342	📦 196k	⭐ 6.8k	🔴 july 2023
🔗 JMTrans	-	-	⭐ 88	🔴 january 2021
🔗 Kindai-OCR	-	-	⭐ 153	🔴 july 2023
🔗 text_recognition	-	-	⭐ 8	🔴 july 2023
🔗 Poricom	-	-	⭐ 415	🔴 june 2023
🔗 owocr	-	-	⭐ 194	🟢 last friday
🔗 yomitoku	📥 1k	📦 75k	⭐ 1.3k	🟢 january
🔗 findtextcenternet	-	-	⭐ 55	🟡 august 2025
🔗 simple-ocr-for-manga	-	-	⭐ 7	🔴 repo not found
🔗 jp-ocr-evaluation	-	-	⭐ 1	🔴 march 2024
🔗 paddleocr-vl-sft-for-japanese-manga-on-rtx-3060	-	-	⭐ 7	🟢 december 2025
🔗 MangaOCR	-	-	⭐ 34	🔴 may 2024
🔗 meikiocr	📥 707	📦 15k	⭐ 55	🟢 january
🔗 meikipop	-	-	⭐ 206	🟢 last saturday

Tool for pretrained models

Libraries that utilize pretrained models to improve accuracy and efficiency

JGLUE - JGLUE: Japanese General Language Understanding Evaluation
ginza-transformers - Use custom tokenizers in spacy-transformers
t5_japanese_dialogue_generation - Dialogue generation with T5
japanese_text_classification - To investigate various DNN text classifiers including MLP, CNN, RNN, BERT approaches.
Japanese-BERT-Sentiment-Analyzer - Deploying sentiment analysis server with FastAPI and BERT
jmlm_scoring - Masked Language Model-based Scoring for Japanese and Vietnamese
allennlp-shiba-model - AllenNLP integration for Shiba: Japanese CANINE model
evaluate_japanese_w2v - script to evaluate pre-trained Japanese word2vec model on Japanese similarity dataset
gector-ja - BERT-based GEC tagging for Japanese
Japanese-BPEEncoder - Japanese-BPEEncoder
Japanese-BPEEncoder_V2 - Japanese-BPEEncoder Version 2
transformer-copy - Japanese grammatical error correction tool
japanese-stable-diffusion - Japanese Stable Diffusion is a Japanese specific latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
nagisa_bert - A BERT model for nagisa
prefix-tuning-gpt - Example code for prefix-tuning GPT/GPT-NeoX models and for inference with trained prefixes
JGLUE-benchmark - Training and evaluation scripts for JGLUE, a Japanese language understanding benchmark
jptranstokenizer - Japanese Tokenizer for transformers library
jp-stable - JP Language Model Evaluation Harness
compare-ja-tokenizer - How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese-ACL SRW 2023
lm-evaluation-harness-jp-stable - A framework for few-shot evaluation of autoregressive language models.
llm-lora-classification - llm-lora-classification
rinna_gpt-neox_ggml-lora - The repository contains scripts and merge scripts that have been modified to adapt an Alpaca-Lora adapter for LoRA tuning when assuming the use of the "rinna/japanese-gpt-neox..." [gpt-neox] model converted to ggml.
japanese-llm-roleplay-benchmark - This repository was created to evaluate the character role-play performance of Japanese LLMs.
japanese-llm-ranking - This repository supports YuzuAI's Rakuda leaderboard of Japanese LLMs, which is a Japanese-focused analogue of LMSYS' Vicuna eval.
llm-jp-eval - This tool automatically evaluates Japanese large language models across multiple datasets.
llm-jp-sft - This repository contains the code for supervised fine-tuning of LLM-jp models.
llm-jp-tokenizer - A repository collecting the tokenizer-related work for LLMs developed by the LLM study group (LLM-jp).
japanese-lm-fin-harness - Japanese Language Model Financial Evaluation Harness
ja-vicuna-qa-benchmark - Japanese Vicuna QA Benchmark
swallow-evaluation - Evaluation scripts for large language models from the Swallow project
swallow-evaluation-instruct - Evaluation framework for post-trained large language models from the Swallow project
pretrained_doc2vec_ja - pretrained doc2vec models on Japanese Wikipedia
pl-bert-ja - A repository of Japanese Phoneme-Level BERT

Name	downloads/week	total downloads	stars	last commit
🔗 JGLUE	-	-	⭐ 333	🟡 march 2025
🔗 ginza-transformers	📥 790	📦 173k	⭐ 16	🔴 august 2022
🔗 t5_japanese_dialogue_generation	-	-	⭐ 3	🔴 november 2021
🔗 japanese_text_classification	-	-	⭐ 9	🔴 january 2020
🔗 Japanese-BERT-Sentiment-Analyzer	-	-	⭐ 2	🔴 april 2021
🔗 jmlm_scoring	-	-	⭐ 5	🔴 february 2022
🔗 allennlp-shiba-model	📥 137	📦 20k	⭐ 12	🔴 june 2021
🔗 evaluate_japanese_w2v	-	-	⭐ 12	🔴 november 2024
🔗 gector-ja	-	-	⭐ 19	🔴 june 2021
🔗 Japanese-BPEEncoder	-	-	⭐ 41	🔴 september 2021
🔗 Japanese-BPEEncoder_V2	-	-	⭐ 41	🔴 january 2023
🔗 transformer-copy	-	-	⭐ 29	🔴 september 2020
🔗 japanese-stable-diffusion	-	-	⭐ repo not found	🔴 repo not found
🔗 nagisa_bert	📥 923	📦 56k	⭐ 5	🔴 december 2023
🔗 prefix-tuning-gpt	-	-	⭐ repo not found	🔴 repo not found
🔗 JGLUE-benchmark	-	-	⭐ 18	🟢 february
🔗 jptranstokenizer	📥 115	📦 27k	⭐ 5	🔴 february 2024
🔗 jp-stable	-	-	⭐ 154	🔴 november 2023
🔗 compare-ja-tokenizer	-	-	⭐ 6	🔴 june 2023
🔗 lm-evaluation-harness-jp-stable	-	-	⭐ 1	🔴 june 2023
🔗 llm-lora-classification	-	-	⭐ 98	🔴 july 2023
🔗 rinna_gpt-neox_ggml-lora	-	-	⭐ 18	🔴 may 2023
🔗 japanese-llm-roleplay-benchmark	-	-	⭐ 40	🔴 november 2023
🔗 japanese-llm-ranking	-	-	⭐ 50	🔴 march 2024
🔗 llm-jp-eval	-	-	⭐ 147	🟢 december 2025
🔗 llm-jp-sft	-	-	⭐ 62	🔴 june 2024
🔗 llm-jp-tokenizer	-	-	⭐ 45	🔴 february 2025
🔗 japanese-lm-fin-harness	-	-	⭐ 77	🟢 january
🔗 ja-vicuna-qa-benchmark	-	-	⭐ 33	🔴 june 2024
🔗 swallow-evaluation	-	-	⭐ 23	🟡 september 2025
🔗 swallow-evaluation-instruct	-	-	⭐ 24	🟡 october 2025
🔗 pretrained_doc2vec_ja	-	-	⭐ 25	🔴 january 2019
🔗 pl-bert-ja	-	-	⭐ 22	🔴 december 2023

Others

General-purpose tools supporting Japanese language processing

namedivider-python - A tool for dividing the Japanese full name into a family name and a given name.
asa-python - A curated list of resources dedicated to Python libraries of NLP for Japanese
python_asa - Python version of the Japanese semantic role labeling system (ASA)
toiro - A comparison tool of Japanese tokenizers
ja-timex - A rule-based analyzer that extracts and normalizes temporal expressions written in natural language
JapaneseTokenizers - A set of metrics for feature selection from text data
daaja - This repository has implementations of data augmentation for NLP for Japanese.
accel-brain-code - The purpose of this repository is to make prototypes as case study in the context of proof of concept(PoC) and research and development(R&D) that I have written in my website. The main research topics are Auto-Encoders in relation to the representation learning, the statistical machine learning for energy-based models, adversarial generation net…
kyoto-reader - A processor for KyotoCorpus, KWDLC, and AnnotatedFKCCorpus
nlplot - Visualization Module for Natural Language Processing
rake-ja - Rapid Automatic Keyword Extraction algorithm for Japanese
jel - Japanese Entity Linker.
MedNER-J - Latest version of MedEX/J (Japanese disease name extractor)
zunda-python - Zunda: Japanese Enhanced Modality Analyzer client for Python.
AIO2_DPR_baseline - https://www.nlp.ecei.tohoku.ac.jp/projects/aio/
showcase - A PyTorch implementation of the Japanese Predicate-Argument Structure (PAS) analyser presented in the paper of Matsubayashi & Inui (2018) with some improvements.
darts-clone-python - Darts-clone python binding
jrte-corpus_example - Example codes for Japanese Realistic Textual Entailment Corpus
desuwa - Feature annotator to morphemes and phrases based on KNP rule files (pure-Python)
HotPepperGourmetDialogue - Restaurant Search System through Dialogue in Japanese.
nlp-recipes-ja - Sample code for natural language processing in Japanese
Japanese_nlp_scripts - Small example scripts for working with Japanese texts in Python
DNorm-J - Japanese version of DNorm
pyknp-eventgraph - EventGraph is a development platform for high-level NLP applications in Japanese.
ishi - Ishi: A volition classifier for Japanese
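python-npylm - Unsupervised morphological analysis with a Bayesian hierarchical language model
python-npycrf - Semi-supervised morphological analysis integrating conditional random fields and a Bayesian hierarchical language model
unsupervised-pos-tagging - Unsupervised part-of-speech tag estimation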
negima - Negima is a Python package to extract phrases in Japanese text by using the part-of-speeches based rules you defined.
YouyakuMan - Extractive summarizer using BertSum as summarization model
japanese-numbers-python - A parser for Japanese number (Kanji, arabic) in the natural language.
kantan - Lookup japanese words by radical patterns
make-meidai-dialogue - Get Japanese dialogue corpus
japanese_summarizer - A summarizer for Japanese articles.
chirptext - ChirpText is a collection of text processing tools for Python.
yubin - Japanese Address Munger
jawiki-cleaner - Japanese Wikipedia Cleaner
japanese2phoneme - A python library to convert Japanese to phoneme.
anlp_nlp2021_d3-1 - This repository contains codes related to the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification"
aozora_classification - This project aims to classify Japanese sentences by how similar they are to the writing of classical Japanese authors such as Soseki Natsume, Ogai Mori, and Ryunosuke Akutagawa.
aozora-corpus-generator - Generates plain or tokenized text files from the Aozora Bunko
JLM - A fast LSTM Language Model for large vocabulary language like Japanese and Chinese
NTM - Testing of Neural Topic Modeling for Japanese articles
EN-JP-ML-Lexicon - This is an English-Japanese lexicon for Machine Learning and Deep Learning terminology.
text-generation - Easy-to-use scripts to fine-tune GPT-2-JA with your own texts, to generate sentences, and to tweet them automatically.
chainer_nic - Neural Image Caption (NIC) on chainer, its pretrained models on English and Japanese image caption datasets.
unihan-lm - The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020
mbart-finetuning - Code to perform finetuning of the mBART model.
xvector_jtubespeech - xvector model on jtubespeech
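TinySegmenterMaker - A tool for building your own trained models for TinySegmenter.
Grongish - A script for converting between Japanese and the Grongi language
WordCloud-Japanese - A script that renders Japanese text in WordCloud with a morphological-analysis-like appearance without using MeCab (a morphological analysis engine)
snark - A DB access library that uses the Japanese WordNet
toEmoji - Something that converts Japanese sentences into emoji-only sentences
termextract - A practice implementation of a technical term extraction algorithm
JDT-with-KenLM-scoring - Scores the response candidates of Japanese-Dialog-Transformer with a KenLM N-gram language model and filters or reranks them.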
mixture-of-unigram-model - Mixture of Unigram Model and Infinite Mixture of Unigram Model in Python. (混合ユニグラムモデルと無限混合ユニグラムモデル)
hidden-markov-model - Hidden Markov Model (HMM) and Infinite Hidden Markov Model (iHMM) in Python. (隠れマルコフモデルと無限隠れマルコフモデル)
Ngram-language-model - Ngram language model in Python. (Nグラム言語モデル)
ASRDeepSpeech - Automatic Speech Recognition with deepspeech2 model in pytorch with support from Zakuro AI.
neural_ime - Neural IME: Neural Input Method Engine
neural_japanese_transliterator - Can neural networks transliterate Romaji into Japanese correctly?
tinysegmenter - tokenizer specified for Japanese
AugLy-jp - Data Augmentation for Japanese Text on AugLy
furigana4epub - A Python script for adding furigana to Japanese epub books using Mecab and Unidic.
PyKatsuyou - Japanese verb/adjective inflections tool
jageocoder - Pure Python Japanese address geocoder
pygeonlp - pygeonlp, A python module for geotagging Japanese texts.
nksnd - New kana-kanji conversion engine
JaMIE - A Japanese Medical Information Extraction Toolkit
fasttext-vs-word2vec-on-twitter-data - A comparison of fastText and word2vec, with execution and training scripts
minimal-search-engine - A minimal search engine/PageRank/tf-idf
5ch-analysis - Scrapes past logs of 5ch and tracks words that were once popular (e.g., 香具師, orz)
tweet_extructor - A tweet downloader for the Twitter Japanese reputation analysis dataset
japanese-word-aggregation - Aggregating Japanese words based on Juman++ and ConceptNet5.5
jinf - A Japanese inflection converter
kwja - A unified language analyzer for Japanese
mlm-scoring-transformers - Reproduced package based on Masked Language Model Scoring (ACL2020).
ClipCap-for-Japanese - [PyTorch] ClipCap for Japanese
SAT-for-Japanese - [PyTorch] Show, Attend and Tell for Japanese
cihai - Python library for CJK (Chinese, Japanese, and Korean) language dictionary
marine - MARINE : Multi-task leaRnIng-based JapaNese accent Estimation
whisper-asr-finetune - Finetuning Whisper ASR model
japanese_chatbot - A PyTorch Implementation of japanese chatbot using BERT and Transformer's decoder
radicalchar - A library for normalizing radical characters
akaza - Yet another Japanese IME for IBus/Linux
posuto - Japanese postal code data.
tacotron2-japanese - Tacotron2 implementation of Japanese
ibus-hiragana - Hiragana IME for IBus
furiganapad - Furigana Pad
chikkarpy - Japanese synonym library
ja-tokenizer-docker-py - Mecab + NEologd + Docker + Python3
JapaneseEmbeddingEval - JapaneseEmbeddingEval
gptuber-by-langchain - GPT acts as a YouTuber
shuwa - Extend GNOME On-Screen Keyboard for Input Methods
japanese-nli-model - This repository provides the code for Japanese NLI model, a fine-tuned masked language model.
tra-fugu - A tool for Japanese-English translation and English-Japanese translation by using FuguMT
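fugumt - A translation environment that uses the machine translation engine published at ぷるーふおぶこんせぷと (Proof of Concept). It can translate strings entered into a form and translate PDFs.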
JaSPICE - JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models
Retrieval-based-Voice-Conversion-WebUI-JP-localization - jp-localization
pyopenjtalk - Python wrapper for OpenJTalk
yomigana-ebook - Make learning Japanese easier by adding readings for every kanji in the eBook
N46Whisper - Whisper based Japanese subtitle generator
japanese_llm_simple_webui - A simple web interface for Japanese-capable LLMs (large language models) such as Rinna-3.6B and OpenCALM
pdf-translator - pdf-translator translates English PDF files into Japanese, preserving the original layout.
japanese_qa_demo_with_haystack_and_es - A sample Japanese question answering system using Haystack + Elasticsearch + Wikipedia (ja)
mozc-devices - Automatically exported from code.google.com/p/mozc-morse
natsume - A Japanese text frontend processing toolkit
vits-japros-webui - A Gradio WebUI for training and speech synthesis with Japanese TTS (VITS)
ja-law-parser - A Japanese law parser
dictation-kit - Japanese dictation kit using Julius
julius4seg - A segmentation support tool using Julius
voicevox_engine - The speech synthesis engine of VOICEVOX, a free, medium-quality text-to-speech software
LLaVA-JP - LLaVA-JP is a Japanese VLM trained by LLaVA method
RAG-Japanese - Open source RAG with Llama Index for Japanese LLM in low resource setting
bertjsc - Japanese Spelling Error Corrector using BERT (Masked-Language Model). Japanese proofreading based on BERT.
llm-leaderboard - Project of llm evaluation to Japanese tasks
jglue-evaluation-scripts - Training and evaluation scripts for JGLUE, a Japanese language understanding benchmark
BLIP2-Japanese - Modifying LAVIS' BLIP2 Q-former with models pretrained on Japanese datasets.
wikipedia-passages-jawiki-embeddings-utils - Scripts and utilities for converting Japanese Wikipedia sentences into various Japanese embeddings and faiss indexes.
simple-simcse-ja - Exploring Japanese SimCSE
wikipedia-japanese-open-rag - A sample Gradio-based RAG that answers user questions based on Japanese Wikipedia articles
gpt4-autoeval - A script that automatically evaluates language model responses using GPT-4
t5-japanese - Japanese T5 model
japanese_llm_eval - A repo for evaluating Japanese LLMs
jmteb - The evaluation scripts of JMTEB (Japanese Massive Text Embedding Benchmark)
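pydomino - A tool for aligning phoneme labels to Japanese speech
easynovelassistant - A simple novel-generation assistant powered by LightChatAssistant-TypeB, a lightweight, unrestricted, and uncensored Japanese local LLM. Its "Generate forever" mode, a perk of running locally, keeps generating so that you can accumulate lucky results. It also supports text-to-speech reading.
clip-japanese - Sample code for QLoRA instruction tuning on Japanese datasets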
rime-jaroomaji - Japanese rōmaji input schema for Rime IME
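deep-question-generation - Automatic quiz generation using deep learning (Japanese T5 model)
magpie-nemotron - Code for creating a synthetic dialogue dataset using the Magpie method and Nemotron-4-340B-Instruct
qlora_ja - Sample code for QLoRA instruction tuning on Japanese datasets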
mozcdic-ut-jawiki - Mozc UT Jawiki Dictionary is a dictionary generated from the Japanese Wikipedia for Mozc.
shisa-v2 - Japanese / English Bilingual LLM
llm-translator - Mixtral-based Ja-En (En-Ja) Translation model
llm-jp-asr - Code for training a speech recognition model in which Whisper's decoder is replaced with llm-jp-1.3b-v1.0
rag-japanese - Open source RAG with Llama Index for Japanese LLM in low resource setting
monaka - A Japanese Parser (including historical Japanese)
jp-translate.cloud - A state-of-the-art open-source Japanese <--> English machine translation system based on the latest NMT research.
substring-word-finder - Determines whether contiguous substrings are words
heron-vlm-leaderboard - This project is a benchmarking tool for evaluating and comparing the performance of various Vision Language Models (VLMs). It uses two datasets: LLaVA-Bench-In-the-Wild and Japanese HERON Bench to measure model performance.
text2dataset - Easily turn large English text datasets into Japanese text datasets using open LLMs.
mecab-web-api - A Japanese morphological analysis web API using MeCab
mecab_controller - Mecab wrapper to generate furigana readings.
vits - A text-to-speech reader and voice changer using VITS
akari_chatgpt_bot - A chatbot app that converses using speech recognition, text generation, and speech synthesis
kudasai - Streamlining Japanese-English Translation with Advanced Preprocessing and Integrated Translation Technologies
mecab-visualizer - A tool for visualizing MeCab morphological analysis results
add-dictionary - An app for adding OpenJTalk user dictionary entries through a GUI
j-moshi - J-Moshi: A Japanese Full-duplex Spoken Dialogue System
jatts - JATTS: Japanese TTS (for research)
tsukasa-speech - a Frontier Japanese Speech Generation net
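symptom-expression-search - An experiment with semantic-structure search that absorbs variation in patient expressions, using Elasticsearch, GiNZA, and a patient expression dictionary
llm-jp-judge - A Python tool for automatic evaluation of generated text
asagi-vlm-colaboratory-sample - A sample for trying Asagi (a large-scale Japanese VLM that leverages synthetic datasets) on Colaboratory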
llm-jp-eval-mm - This tool automatically evaluates Japanese multi-modal large language models across multiple datasets.
manga109api - Simple python API to read annotation data of Manga109
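fastrtc-jp - Add-on kit providing Japanese TTS and STT for fastrtc
whisper-transcription - Speech transcription tool written in Python using Whisper models
pocket-researcher - Autonomous research agent powered by LLMs, for quick information gathering and overviews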
jtransbench - A tool to easily benchmark Japanese translation skills
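easyllasa - EasyLlasa is a TSTS (TextSpeechToSpeech) tool that generates Japanese speech from 5 to 15 seconds of Japanese audio plus Japanese text.
kanjikana-model - Model for matching the kanji and kana forms of personal names
deep-openreview-research-ja - AI agent with Japanese support that automatically discovers and analyzes OpenReview papers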
pitchbench - Experimental Japanese pitch accent based LLM Benchmark
mini-transformer-from-scratch - English to Japanese Transformer from scratch
vv_core_inference - Inference code for the deep learning models used inside the VOICEVOX core
pyopenjtalk-plus - pyopenjtalk-plus: A Python wrapper for OpenJTalk with additional improvements
japanese_spelling_correction - Japanese Spelling Correction
py-kaomoji - python kaomoji
llm-jp-vila - This repository contains the code for training llm-jp/llm-jp-3-vila-14b, modified from VILA repository.
kanjivg-radical - kanjivg-radical
japanese-wordnet-visualization - This project visualizes the Japanese Wordnet (日本語ワードネット) with web application built by Django
piper-plus - Enhanced Piper TTS with Japanese support, WebAssembly, multi-GPU training, and quality improvements.
Japanera - Easy Tools for Japanese Era System
bert-abstractive-text-summarization - Japanese Sentence Summarization with BERT
kyujipy - A Python library to convert Japanese texts from Shinjitai (新字体) to Kyujitai (舊字體) and vice versa
jitenbot - Web crawler for creating personal copies of Japanese dictionaries
ja-icd10 - Python package for handling Japanese-language information on the ICD-10 International Classification of Diseases
pl-bert-vits2 - VITS2 using Phoneme-Level Japanese BERT
ndc_predictor - Machine learning model for NDCPredictor (a pretrained fastText model that predicts the Nippon Decimal Classification from bibliographic information)
pfmt-bench-fin-ja - pfmt-bench-fin-ja: Preferred Multi-turn Benchmark for Finance in Japanese
marine-plus - MARINE: Multi-task leaRnIng-based JapaNese accent Estimation (also supports Windows)
ja-tokenizer-benchmark - Compare the speed of various Japanese tokenizers in Python.
yat - yat: Yet Another Tokenizer for Japanese NLP
igakuqa119 - Evaluating LLMs on the 119th Japanese Medical Licensing Examination
japanese-luw-tokenizer - Japanese Long-Unit-Word Tokenizer with RemBertTokenizerFast of Transformers
ibus-jig - ibus-jig: Japanese-language Input-method using GPT-4
jp-stopword-filter - A lightweight Python library designed to filter stopwords from Japanese text based on customizable rules.
yasumail - Synthetic Japanese business email generator for ML training data
himotoki - A Python-based Japanese Tokenizer, Dictionary, Morphological Analyzer and Romanization Tool. Based on JMDict for Language Learning.

Name	downloads/week	total downloads	stars	last commit
🔗 namedivider-python	📥 475	📦 78k	⭐ 251	🟡 november 2025
🔗 asa-python	📥 106	📦 30k	⭐ 11	🔴 february 2019
🔗 python_asa	-	-	⭐ 22	🔴 january 2020
🔗 toiro	📥 69	📦 26k	⭐ 121	🟡 november 2025
🔗 ja-timex	📥 440	📦 89k	⭐ 140	🔴 november 2023
🔗 JapaneseTokenizers	-	-	⭐ 137	🔴 march 2019
🔗 daaja	📥 113	📦 25k	⭐ 64	🔴 february 2023
🔗 accel-brain-code	📥 346	📦 148k	⭐ 320	🔴 december 2023
🔗 JGLUE	-	-	⭐ 333	🟡 march 2025
🔗 kyoto-reader	📥 518	📦 49k	⭐ 10	🔴 june 2024
🔗 nlplot	📥 163	📦 108k	⭐ 240	🔴 september 2022
🔗 rake-ja	-	-	⭐ 21	🔴 october 2018
🔗 jel	📥 45	📦 8k	⭐ 11	🔴 july 2021
🔗 MedNER-J	-	-	⭐ 18	🔴 may 2022
🔗 zunda-python	📥 23	📦 6k	⭐ 10	🔴 november 2019
🔗 AIO2_DPR_baseline	-	-	⭐ 16	🔴 january 2022
🔗 showcase	📥 32	📦 7k	⭐ 6	🔴 june 2018
🔗 darts-clone-python	📥 3k	📦 9M	⭐ 20	🔴 april 2022
🔗 jrte-corpus_example	-	-	⭐ 3	🔴 november 2021
🔗 desuwa	📥 59	📦 10k	⭐ 6	🔴 may 2022
🔗 HotPepperGourmetDialogue	-	-	⭐ 278	🔴 may 2016
🔗 nlp-recipes-ja	-	-	⭐ 65	🔴 april 2021
🔗 Japanese_nlp_scripts	-	-	⭐ 26	🔴 june 2019
🔗 DNorm-J	-	-	⭐ 9	🔴 june 2022
🔗 pyknp-eventgraph	📥 130	📦 65k	⭐ 9	🔴 september 2022
🔗 ishi	📥 38	📦 6k	⭐ 2	🔴 may 2020
🔗 python-npylm	-	-	⭐ 34	🔴 january 2019
🔗 python-npycrf	-	-	⭐ 11	🔴 march 2018
🔗 unsupervised-pos-tagging	-	-	⭐ 16	🔴 october 2017
🔗 negima	📥 39	📦 16k	⭐ 14	🔴 august 2018
🔗 YouyakuMan	-	-	⭐ 52	🔴 september 2020
🔗 japanese-numbers-python	📥 351	📦 2M	⭐ 21	🔴 april 2020
🔗 kantan	-	-	⭐ 8	🔴 october 2024
🔗 make-meidai-dialogue	-	-	⭐ 40	🔴 september 2017
🔗 japanese_summarizer	-	-	⭐ 10	🔴 august 2022
🔗 chirptext	📥 1k	📦 192k	⭐ 7	🔴 october 2022
🔗 yubin	📥 23	📦 3k	⭐ 3	🔴 october 2019
🔗 jawiki-cleaner	📥 265	📦 24k	⭐ 6	🔴 february 2021
🔗 japanese2phoneme	📥 41	📦 4k	⭐ 1	🔴 february 2022
🔗 anlp_nlp2021_d3-1	-	-	⭐ 1	🔴 march 2022
🔗 aozora_classification	-	-	⭐ 11	🔴 september 2017
🔗 aozora-corpus-generator	-	-	⭐ 8	🟡 june 2025
🔗 JLM	-	-	⭐ 111	🔴 june 2019
🔗 NTM	-	-	⭐ 13	🔴 july 2019
🔗 EN-JP-ML-Lexicon	-	-	⭐ 40	🔴 march 2021
🔗 text-generation	-	-	⭐ 19	🟡 august 2025
🔗 chainer_nic	-	-	⭐ 17	🔴 december 2018
🔗 unihan-lm	-	-	⭐ 2	🔴 november 2020
🔗 mbart-finetuning	-	-	⭐ 3	🔴 october 2021
🔗 xvector_jtubespeech	-	-	⭐ 47	🔴 november 2023
🔗 TinySegmenterMaker	-	-	⭐ 72	🔴 september 2022
🔗 Grongish	-	-	⭐ 25	🟢 december 2025
🔗 WordCloud-Japanese	-	-	⭐ 9	🔴 january 2020
🔗 snark	-	-	⭐ 11	🔴 march 2020
🔗 toEmoji	-	-	⭐ 4	🔴 april 2018
🔗 termextract	-	-	⭐ 18	🔴 september 2018
🔗 JDT-with-KenLM-scoring	-	-	⭐ 1	🔴 july 2022
🔗 mixture-of-unigram-model	-	-	⭐ 6	🔴 june 2017
🔗 hidden-markov-model	-	-	⭐ 5	🔴 june 2017
🔗 Ngram-language-model	-	-	⭐ 5	🔴 december 2017
🔗 ASRDeepSpeech	-	-	⭐ 69	🔴 september 2022
🔗 neural_ime	-	-	⭐ 67	🔴 december 2016
🔗 neural_japanese_transliterator	-	-	⭐ 178	🔴 september 2017
🔗 tinysegmenter	📥 131k	📦 171k	⭐ repo not found	🔴 november 2015
🔗 AugLy-jp	📥 173	📦 30k	⭐ 7	🔴 september 2021
🔗 furigana4epub	📥 60	📦 12k	⭐ 29	🔴 september 2021
🔗 PyKatsuyou	📥 76	📦 19k	⭐ 12	🟡 march 2025
🔗 jageocoder	📥 7k	📦 320k	⭐ 93	🟡 september 2025
🔗 pygeonlp	📥 237	📦 21k	⭐ 22	🟡 october 2025
🔗 nksnd	-	-	⭐ 26	🔴 may 2018
🔗 JaMIE	-	-	⭐ 9	🔴 may 2023
🔗 fasttext-vs-word2vec-on-twitter-data	-	-	⭐ 48	🔴 august 2017
🔗 minimal-search-engine	-	-	⭐ 19	🔴 july 2019
🔗 5ch-analysis	-	-	⭐ 75	🔴 november 2018
🔗 tweet_extructor	-	-	⭐ 3	🔴 august 2022
🔗 japanese-word-aggregation	-	-	⭐ 2	🔴 august 2018
🔗 jinf	📥 107	📦 54k	⭐ 4	🔴 december 2022
🔗 kwja	📥 597	📦 55k	⭐ 138	🟡 august 2025
🔗 mlm-scoring-transformers	-	-	⭐ 6	🔴 december 2022
🔗 ClipCap-for-Japanese	-	-	⭐ 12	🔴 october 2022
🔗 SAT-for-Japanese	-	-	⭐ 2	🔴 october 2022
🔗 cihai	📥 832	📦 209k	⭐ 93	🟢 last saturday
🔗 marine	📥 109	📦 15k	⭐ 36	🔴 september 2022
🔗 whisper-asr-finetune	-	-	⭐ 32	🔴 december 2022
🔗 japanese_chatbot	-	-	⭐ repo not found	🔴 repo not found
🔗 radicalchar	-	-	⭐ 8	🔴 december 2022
🔗 akaza	-	-	⭐ 233	🟢 today
🔗 posuto	📥 6k	📦 653k	⭐ 224	🟢 february
🔗 tacotron2-japanese	-	-	⭐ 269	🔴 september 2022
🔗 ibus-hiragana	-	-	⭐ 78	🟡 november 2025
🔗 furiganapad	-	-	⭐ 19	🟡 april 2025
🔗 chikkarpy	📥 312	📦 58k	⭐ 55	🔴 february 2022
🔗 ja-tokenizer-docker-py	-	-	⭐ 36	🔴 may 2022
🔗 JapaneseEmbeddingEval	-	-	⭐ 183	🔴 october 2024
🔗 gptuber-by-langchain	-	-	⭐ 63	🔴 january 2023
🔗 shuwa	-	-	⭐ 145	🔴 december 2022
🔗 japanese-nli-model	-	-	⭐ 5	🔴 october 2022
🔗 tra-fugu	-	-	⭐ 6	🔴 march 2023
🔗 fugumt	-	-	⭐ 65	🔴 february 2021
🔗 JaSPICE	📥 35	📦 2k	⭐ 9	🔴 november 2023
🔗 Retrieval-based-Voice-Conversion-WebUI-JP-localization	-	-	⭐ 48	🔴 april 2023
🔗 pyopenjtalk	📥 15k	📦 1M	⭐ 243	🟡 april 2025
🔗 yomigana-ebook	📥 39	📦 7k	⭐ 25	🔴 february 2024
🔗 N46Whisper	-	-	⭐ 1.7k	🔴 february 2025
🔗 japanese_llm_simple_webui	-	-	⭐ 17	🔴 may 2024
🔗 pdf-translator	-	-	⭐ 337	🔴 may 2024
🔗 japanese_qa_demo_with_haystack_and_es	-	-	⭐ 1	🔴 december 2022
🔗 mozc-devices	-	-	⭐ 2.7k	🟡 november 2025
🔗 natsume	📥 0	📦 3k	⭐ repo not found	🔴 repo not found
🔗 vits-japros-webui	-	-	⭐ 42	🔴 january 2024
🔗 ja-law-parser	-	-	⭐ 25	🔴 january 2024
🔗 dictation-kit	-	-	⭐ 164	🔴 april 2019
🔗 julius4seg	-	-	⭐ 7	🔴 august 2021
🔗 voicevox_engine	-	-	⭐ 1.6k	🟢 today
🔗 LLaVA-JP	-	-	⭐ 64	🔴 june 2024
🔗 RAG-Japanese	-	-	⭐ 10	🟡 may 2025
🔗 bertjsc	-	-	⭐ 14	🔴 august 2024
🔗 llm-leaderboard	-	-	⭐ 91	🟡 september 2025
🔗 jglue-evaluation-scripts	-	-	⭐ 18	🟢 february
🔗 BLIP2-Japanese	-	-	⭐ 13	🟡 september 2025
🔗 wikipedia-passages-jawiki-embeddings-utils	-	-	⭐ 11	🔴 march 2024
🔗 simple-simcse-ja	-	-	⭐ 69	🔴 october 2023
🔗 wikipedia-japanese-open-rag	-	-	⭐ repo not found	🔴 repo not found
🔗 gpt4-autoeval	-	-	⭐ 16	🔴 june 2024
🔗 t5-japanese	-	-	⭐ 116	🟡 september 2025
🔗 japanese_llm_eval	-	-	⭐ 5	🔴 april 2024
🔗 jmteb	-	-	⭐ 84	🟡 november 2025
🔗 pydomino	-	-	⭐ 36	🟡 august 2025
🔗 easynovelassistant	-	-	⭐ 216	🔴 july 2024
🔗 clip-japanese	-	-	⭐ 13	🟡 september 2025
🔗 rime-jaroomaji	-	-	⭐ 47	🟢 last thursday
🔗 deep-question-generation	-	-	⭐ 12	🔴 march 2023
🔗 magpie-nemotron	-	-	⭐ 9	🔴 july 2024
🔗 qlora_ja	-	-	⭐ 1	🔴 july 2024
🔗 mozcdic-ut-jawiki	-	-	⭐ 26	🟢 february
🔗 shisa-v2	-	-	⭐ 28	🟢 december 2025
🔗 llm-translator	-	-	⭐ 20	🔴 january 2025
🔗 llm-jp-asr	-	-	⭐ 9	🔴 september 2024
🔗 rag-japanese	-	-	⭐ 10	🟡 may 2025
🔗 monaka	-	-	⭐ 4	🔴 january 2025
🔗 jp-translate.cloud	-	-	⭐ 3	🔴 september 2024
🔗 substring-word-finder	-	-	⭐ 4	🟡 november 2025
🔗 heron-vlm-leaderboard	-	-	⭐ 6	🔴 december 2024
🔗 text2dataset	-	-	⭐ 26	🔴 january 2025
🔗 mecab-web-api	-	-	⭐ 40	🔴 july 2022
🔗 mecab_controller	-	-	⭐ 18	🟢 january
🔗 vits	-	-	⭐ 91	🔴 february 2023
🔗 akari_chatgpt_bot	-	-	⭐ 48	🟡 october 2025
🔗 kudasai	-	-	⭐ 26	🟡 june 2025
🔗 mecab-visualizer	-	-	⭐ 2	🔴 september 2023
🔗 add-dictionary	-	-	⭐ 3	🟡 october 2025
🔗 j-moshi	-	-	⭐ 298	🟡 june 2025
🔗 jatts	-	-	⭐ 44	🟡 may 2025
🔗 tsukasa-speech	-	-	⭐ 60	🟡 may 2025
🔗 symptom-expression-search	-	-	⭐ 2	🔴 february 2021
🔗 llm-jp-judge	-	-	⭐ 38	🟢 december 2025
🔗 asagi-vlm-colaboratory-sample	-	-	⭐ 1	🟡 march 2025
🔗 llm-jp-eval-mm	-	-	⭐ 41	🟢 january
🔗 manga109api	📥 194	📦 45k	⭐ 128	🔴 march 2022
🔗 fastrtc-jp	-	-	⭐ 5	🟡 may 2025
🔗 whisper-transcription	-	-	⭐ 16	🟢 january
🔗 pocket-researcher	-	-	⭐ 11	🟡 april 2025
🔗 jtransbench	-	-	⭐ 13	🟡 october 2025
🔗 easyllasa	-	-	⭐ 25	🟡 september 2025
🔗 kanjikana-model	-	-	⭐ 111	🟢 december 2025
🔗 deep-openreview-research-ja	-	-	⭐ 13	🟡 november 2025
🔗 pitchbench	-	-	⭐ 1	🟢 december 2025
🔗 mini-transformer-from-scratch	-	-	⭐ 2	🟡 november 2025
🔗 vv_core_inference	-	-	⭐ 31	🟢 december 2025
🔗 pyopenjtalk-plus	📥 6k	📦 298k	⭐ 55	🟡 november 2025
🔗 japanese_spelling_correction	-	-	⭐ 14	🔴 september 2023
🔗 py-kaomoji	📥 81	📦 37k	⭐ 6	🔴 december 2018
🔗 llm-jp-vila	-	-	⭐ 10	🟡 august 2025
🔗 kanjivg-radical	-	-	⭐ 105	🔴 august 2018
🔗 japanese-wordnet-visualization	-	-	⭐ 3	🔴 november 2022
🔗 piper-plus	-	-	⭐ 29	🟢 last wednesday
🔗 Japanera	📥 2k	📦 342k	⭐ 35	🟡 june 2025
🔗 bert-abstractive-text-summarization	-	-	⭐ 49	🔴 december 2019
🔗 kyujipy	📥 77	📦 22k	⭐ 22	🟢 january
🔗 jitenbot	-	-	⭐ 4	🔴 december 2024
🔗 ja-icd10	-	-	⭐ 5	🔴 july 2021
🔗 pl-bert-vits2	-	-	⭐ 14	🔴 december 2023
🔗 ndc_predictor	-	-	⭐ 11	🔴 august 2021
🔗 pfmt-bench-fin-ja	-	-	⭐ 9	🟡 march 2025
🔗 marine-plus	📥 120	📦 11k	⭐ 8	🟡 october 2025
🔗 ja-tokenizer-benchmark	-	-	⭐ 7	🔴 february 2022
🔗 yat	-	-	⭐ 7	🔴 june 2018
🔗 igakuqa119	-	-	⭐ 7	🟢 january
🔗 japanese-luw-tokenizer	-	-	⭐ 6	🔴 december 2021
🔗 ibus-jig	-	-	⭐ 4	🔴 december 2023
🔗 jp-stopword-filter	📥 42	📦 5k	⭐ 3	🔴 november 2024
🔗 yasumail	-	-	⭐ 2	🟢 january
🔗 himotoki	📥 437	📦 3k	⭐ 2	🟢 last friday

C++

Morphology analysis

High-performance libraries for Japanese morphological analysis

mecab - Yet another Japanese morphological analyzer
jumanpp - Juman++ (a Morphological Analyzer Toolkit)
kytea - The Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation, etc.
juman - Japanese Morphological Analysis System JUMAN

Name	downloads/week	total downloads	stars	last commit
🔗 mecab	-	-	⭐ 1.1k	🔴 february 2025
🔗 jumanpp	-	-	⭐ 407	🔴 march 2023
🔗 kytea	-	-	⭐ 212	🔴 april 2020
🔗 juman	-	-	⭐ 11	🔴 december 2021

Parsing

Libraries for dependency and syntactic parsing of Japanese sentences

cabocha - Yet Another Japanese Dependency Structure Analyzer
knp - A Japanese Parser

Name	downloads/week	total downloads	stars	last commit
🔗 cabocha	-	-	⭐ 121	🔴 february 2025
🔗 knp	-	-	⭐ 33	🔴 november 2023

Others

Other Japanese NLP and text processing libraries

jsc - Joint source channel model for Japanese Kana Kanji conversion, Chinese pinyin input and CJE mixed input.
aquaskk - An input method without morphological analysis.
mozc - Mozc - a Japanese Input Method Editor designed for multi-platform
trimatch - Trimatch: An (Exact|Prefix|Approximate) String Matching Library
resembla - Resembla: Word-based Japanese similar sentence search library
corvusskk - ▽▼ SKK-like Japanese Input Method Editor for Windows
mozuku - An LSP server that analyzes and proofreads Japanese text.

Name	downloads/week	total downloads	stars	last commit
🔗 jsc	-	-	⭐ 15	🔴 december 2012
🔗 aquaskk	-	-	⭐ 367	🔴 july 2023
🔗 mozc	-	-	⭐ 2.8k	🟢 last thursday
🔗 trimatch	-	-	⭐ 2	🟢 february
🔗 resembla	-	-	⭐ 73	🟡 august 2025
🔗 corvusskk	-	-	⭐ 352	🟢 january
🔗 mozuku	-	-	⭐ 402	🟢 december 2025

Rust crate

Morphology analysis

Fast Japanese morphological analysis crates written in Rust

lindera - A morphological analysis library.
vaporetto - Vaporetto: Very Accelerated POintwise pREdicTion based TOkenizer
goya - Japanese Morphological Analysis written in Rust
vibrato - vibrato: Viterbi-based accelerated tokenizer
yoin - A Japanese Morphological Analyzer written in pure Rust
mecab-rs - Safe Rust bindings for MeCab, a part-of-speech and morphological analyzer library
awabi - A morphological analyzer using mecab dictionary
kanpyo - Japanese Morphological Analyzer written in Rust

Name	downloads/week	total downloads	stars	last commit
🔗 lindera	-	📦 863k	⭐ 601	🟢 last tuesday
🔗 vaporetto	-	📦 172k	⭐ 251	🟢 february
🔗 goya	-	📦 11k	⭐ 81	🔴 december 2021
🔗 vibrato	-	📦 55k	⭐ 399	🟢 february
🔗 yoin	-	📦 3k	⭐ 25	🔴 october 2017
🔗 mecab-rs	-	📦 39k	⭐ 66	🔴 september 2023
🔗 awabi	-	📦 24k	⭐ 10	🟡 november 2025
🔗 kanpyo	-	📦 2.5k	⭐ 107	🟢 january

Converter

Crates for script and character conversion in Japanese text

wana_kana_rust - Utility library for checking and converting between Japanese characters - Hiragana, Katakana - and Romaji
unicode-jp-rs - A Rust library to convert Japanese Half-width-kana[半角ｶﾅ] and Wide-alphanumeric[全角英数] into normal ones
kana - [Mirror] CLI program for transliterating romaji text to either hiragana or katakana
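kanaria - This library provides features such as conversion between, and detection of, hiragana and katakana and half-width and full-width characters.
japanese-address-parser - A library that splits Japanese addresses into prefecture, municipality, town name, and remainder
yosina - Yosina is a transliteration library that deals with the letters and symbols used in Japanese writing.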
mojimoji-rs - Rust implementation of a fast converter between Japanese hankaku and zenkaku characters, mojimoji.

Name	downloads/week	total downloads	stars	last commit
🔗 wana_kana_rust	-	📦 298k	⭐ 88	🟡 march 2025
🔗 unicode-jp-rs	-	📦 61k	⭐ 19	🔴 april 2020
🔗 kana	-	-	⭐ 12	🔴 january 2023
🔗 kanaria	-	-	⭐ 21	🟢 last saturday
🔗 japanese-address-parser	-	-	⭐ 10	🟡 november 2025
🔗 yosina	-	-	⭐ 20	🟡 september 2025
🔗 mojimoji-rs	-	-	⭐ 4	🔴 november 2022
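
The half-width/full-width conversion that several of the crates above provide can be approximated with Unicode NFKC normalization. The following is a minimal Python sketch of that idea, not the implementation used by any listed crate; the function name `normalize_width` is hypothetical.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold half-width kana and full-width alphanumerics with Unicode NFKC.

    Assumption: NFKC is used here as a stand-in for a dedicated converter.
    It also rewrites other compatibility characters, so it is broader than
    a pure hankaku/zenkaku conversion.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Half-width katakana with a separate voiced-sound mark, then
    # full-width Latin letters and digits.
    print(normalize_width("ｶﾞｲﾄﾞ"))
    print(normalize_width("ＡＢＣ１２３"))
```

Dedicated converters usually let the caller choose the direction and which character classes to touch, which NFKC does not.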

Search engine library

Libraries for Japanese full-text search and indexing

lindera-tantivy - Lindera tokenizer for Tantivy.
tantivy-vibrato - A Tantivy tokenizer using Vibrato.

Name	downloads/week	total downloads	stars	last commit
🔗 lindera-tantivy	-	📦 160k	⭐ 67	🟢 january
🔗 tantivy-vibrato	-	📦 1.5k	⭐ 3	🔴 january 2023

Others

Supplementary crates for Japanese text and IME processing

daachorse - A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.
find-simdoc - Finding all pairs of similar documents time- and memory-efficiently
crawdad - Rust library of natural language dictionaries using character-wise double-array tries.
tokenizer-speed-bench - Comparison code of various tokenizers
stringmatch-bench - Benchmark tools for comparing the performance of data structures for string matching.
vime - Using Vim as an input method for X11 apps
voicevox_core - Core of VOICEVOX, a free, medium-quality text-to-speech software
akaza - Yet another Japanese IME for IBus/Linux
Jotoba - A free online, self-hostable, multilang Japanese dictionary.
dvorakjp-romantable - DvorakJP Roman Table for Google Japanese Input
niinii - Japanese glossator for assisted reading of text using Ichiran
cskk - SKK (Simple Kana Kanji henkan) library
japanki - Learn Japanese vocabs 🇯🇵 by doing quizzes on CLI!
jpreprocess - Japanese text preprocessor for Text-to-Speech applications (OpenJTalk rewrite in Rust)
listup_precedent - Software that generates a list of court precedent data by scraping the Courts in Japan website (https://www.courts.go.jp/index.html)
jisho - Jisho is a CLI tool & Rust library that provides a Japanese-English dictionary.
kanalizer - A library that infers the Japanese reading of English words.
koharu - Automated manga translation tool with LLM, written in Rust.
yomine - A Japanese vocabulary mining tool designed to help language learners mine new words and expressions.
matsuba - Lightweight Japanese IME written in Rust
hujiang_dictionary - Japanese dictionary written in Rust; supports Telegram bot, AWS Lambda, and Cloudflare Workers, as well as LLM and search RAG.

Name	downloads/week	total downloads	stars	last commit
🔗 daachorse	-	📦 651k	⭐ 242	🟢 january
🔗 find-simdoc	-	📦 29k	⭐ 62	🟡 march 2025
🔗 crawdad	-	📦 59k	⭐ 36	🔴 january 2025
🔗 tokenizer-speed-bench	-	-	⭐ 4	🔴 march 2023
🔗 stringmatch-bench	-	-	⭐ 3	🔴 september 2022
🔗 vime	-	-	⭐ 230	🔴 november 2022
🔗 voicevox_core	-	-	⭐ 1k	🟢 today
🔗 akaza	-	-	⭐ 233	🟢 today
🔗 Jotoba	-	-	⭐ 197	🔴 january 2024
🔗 dvorakjp-romantable	-	-	⭐ 55	🟢 february
🔗 niinii	-	-	⭐ 15	🟢 february
🔗 cskk	-	-	⭐ 79	🟢 january
🔗 japanki	-	-	⭐ 3	🔴 october 2023
🔗 jpreprocess	-	-	⭐ 51	🟢 february
🔗 listup_precedent	-	-	⭐ 5	🔴 february 2025
🔗 jisho	-	-	⭐ 17	🟢 january
🔗 kanalizer	-	-	⭐ 26	🟡 november 2025
🔗 koharu	-	-	⭐ 635	🟢 last saturday
🔗 yomine	-	-	⭐ 47	🟢 last thursday
🔗 matsuba	-	-	⭐ 18	🔴 march 2023
🔗 hujiang_dictionary	-	-	⭐ 69	🟢 last tuesday

JavaScript

Morphology analysis

Japanese morphological analysis libraries for browser and Node.js

kuromoji.js - JavaScript implementation of Japanese morphological analyzer
rakutenma - Rakuten MA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript.
node-mecab-ya - Yet another mecab wrapper for nodejs
juman-bin - A User-Extensible Morphological Analyzer for Japanese.
node-mecab-async - Asynchronous Japanese morphological analyser using MeCab.

Name	downloads/week	total downloads	stars	last commit
🔗 kuromoji.js	📥 149k/week	📦 7.9M	⭐ 966	🔴 november 2018
🔗 rakutenma	📥 11/week	📦 879	⭐ 473	🔴 january 2015
🔗 node-mecab-ya	📥 102/week	📦 7.2k	⭐ 110	🔴 repo not found
🔗 juman-bin	📥 6/week	📦 297	⭐ 3	🔴 may 2017
🔗 node-mecab-async	📥 4.8k/week	📦 340k	⭐ 103	🔴 october 2017

Converter

Libraries for converting Japanese scripts and readings

kuroshiro - Japanese language library for converting Japanese sentence to Hiragana, Katakana or Romaji with furigana and okurigana modes supported.
kuroshiro-analyzer-kuromoji - Kuromoji morphological analyzer for kuroshiro.
hepburn - Node.js module for converting Japanese Hiragana and Katakana script to, and from, Romaji using Hepburn romanisation
japanese-numerals-to-number - Converts Japanese Numerals into number
jslingua - Javascript libraries to process text: Arabic, Japanese, etc.
WanaKana - Javascript library for detecting and transliterating Hiragana <--> Katakana <--> Romaji
node-romaji-name - Normalize and fix common issues with Romaji-based Japanese names.
kyujitai.js - Utility collections for making Japanese text old-fashioned
normalize-japanese-addresses - Open-source address normalization library.
jaconv - Japanese character conversion library (JavaScript)
romaji-conv - Convert romaji into hiragana
japanese-addresses-v2 - Nationwide Japanese address data API
jptext-to-emoji - Converts words in text into emoji
japanese.js - Util collection for Japanese text processing. Hiraganize, Katakanize, and Romanize.

Name	downloads/week	total downloads	stars	last commit
🔗 kuroshiro	📥 11k/week	📦 366k	⭐ 948	🔴 june 2021
🔗 kuroshiro-analyzer-kuromoji	📥 11k/week	📦 343k	⭐ 67	🔴 august 2018
🔗 hepburn	📥 126k/week	📦 3.1M	⭐ 137	🟡 september 2025
🔗 japanese-numerals-to-number	📥 38k/week	📦 2.2M	⭐ 59	🔴 february 2023
🔗 jslingua	📥 192/week	📦 7.4k	⭐ 52	🔴 october 2023
🔗 WanaKana	📥 33k/week	📦 2.1M	⭐ 901	🟡 september 2025
🔗 node-romaji-name	📥 68/week	📦 12k	⭐ 41	🔴 december 2023
🔗 kyujitai.js	📥 12/week	📦 1.1k	⭐ 23	🔴 august 2020
🔗 normalize-japanese-addresses	-	-	⭐ 945	🟡 july 2025
🔗 jaconv	-	-	⭐ 86	🟡 june 2025
🔗 romaji-conv	-	-	⭐ 26	🟢 december 2025
🔗 japanese-addresses-v2	-	-	⭐ 69	🔴 january 2025
🔗 jptext-to-emoji	-	-	⭐ 2	🟢 february
🔗 japanese.js	-	-	⭐ 167	🔴 august 2020
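
Hiragana and katakana conversion, which several of the converters above offer, relies on the fact that the two Unicode blocks are laid out in parallel, 0x60 code points apart. The sketch below shows that idea in Python for illustration only; the function names are hypothetical and it does not reproduce the API of any listed library.

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # ぁ .. ゖ
KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6  # ァ .. ヶ
KANA_OFFSET = 0x60  # distance between the hiragana and katakana blocks


def hiragana_to_katakana(text: str) -> str:
    """Shift every hiragana character into the katakana block."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


def katakana_to_hiragana(text: str) -> str:
    """Shift every katakana character into the hiragana block."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )


if __name__ == "__main__":
    print(hiragana_to_katakana("ひらがな"))
    print(katakana_to_hiragana("カタカナ"))
```

Characters outside the two ranges, such as kanji or the long vowel mark ー, pass through unchanged. Romaji conversion needs a lookup table and is not covered by this sketch.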

Others

Other libraries for Japanese NLP in JavaScript

bangumi-data - Raw data for Japanese Anime
yomichan - Japanese pop-up dictionary extension for Chrome and Firefox.
proofreading-tool - GUI tool for document proofreading (textlinting).
kanjigrid - A web-app displaying the 2200 kanji characters taught in James Heisig's "Remembering the Kanji", 6th edition.
japanese-toolkit - Monorepo for Kanji, Furigana, Japanese DB, and others
analyze-desumasu-dearu - JavaScript library that analyzes whether sentences use the polite style (desu/masu) or the plain style (de aru)
hatsuon - Japanese pitch accent utils
sentiment_ja_js - Sentiment Analysis in Japanese. sentiment_ja with JavaScript
mecab-ipadic-seed - mecab-ipadic seed dictionary reader
Japanese-Word-Of-The-Day - Well, a different Japanese word everyday.
oskim - Extend GNOME On-Screen Keyboard for Input Methods
tweetMapping - A digital archive of geotagged tweets posted within 24 hours of the Great East Japan Earthquake.
pitch-accent - Predict pitch accent in Japanese
kana2ipa - Command that converts hiragana or katakana into the phonetic symbols (IPA) of their Japanese pronunciation
voicevox - Editor for VOICEVOX, a free, medium-quality text-to-speech software
kamiya-codec - Towards a Japanese verb conjugator and deconjugator based on Taeko Kamiya's The Handbook of Japanese Verbs and The Handbook of Japanese Adjectives and Adverbs opuses.
closewords - A library that finds the most similar word in a set of words, with support for Japanese (including kanji)
japanese-analyzer - Japanese Sentence Analyzer
japanese-furigana-normalize - Normalize Japanese Furigana
yama - acquire Japanese vocabulary on any website
kaitai - An application for analyzing Japanese sentence structure using AI. This tool visualizes how words and phrases relate to each other, showing grammatical relationships with interactive diagrams.

Name	downloads/week	total downloads	stars	last commit
🔗 bangumi-data	📥 1.1k/week	📦 60k	⭐ 592	🟢 february
🔗 yomichan	-	-	⭐ 1.1k	🔴 february 2023
🔗 proofreading-tool	-	-	⭐ 87	🟡 october 2025
🔗 kanjigrid	-	-	⭐ 44	🔴 november 2018
🔗 japanese-toolkit	-	-	⭐ 62	🔴 january 2023
🔗 analyze-desumasu-dearu	📥 83k/week	📦 4.8M	⭐ 18	🔴 january 2025
🔗 hatsuon	📥 21/week	📦 948	⭐ 37	🔴 march 2022
🔗 sentiment_ja_js	-	-	⭐ 10	🔴 december 2021
🔗 mecab-ipadic-seed	📥 76/week	📦 5.6k	⭐ 8	🔴 july 2016
🔗 Japanese-Word-Of-The-Day	📥 5/week	📦 292	⭐ repo not found	🔴 repo not found
🔗 oskim	-	-	⭐ 2	🔴 february 2023
🔗 tweetMapping	-	-	⭐ 25	🟢 today
🔗 pitch-accent	📥 1/week	📦 84	⭐ 2	🔴 september 2023
🔗 kana2ipa	-	-	⭐ 17	🔴 october 2020
🔗 voicevox	-	-	⭐ 3k	🟢 today
🔗 kamiya-codec	-	-	⭐ 20	🟡 may 2025
🔗 closewords	-	-	⭐ 3	🟡 august 2025
🔗 japanese-analyzer	-	-	⭐ 683	🟢 december 2025
🔗 japanese-furigana-normalize	-	-	⭐ 6	🔴 july 2024
🔗 yama	-	-	⭐ 7	🟢 last wednesday
🔗 kaitai	-	-	⭐ 1	🟢 last saturday

Go

Morphology analysis

Lightweight Japanese morphological analysis libraries in Go

kagome - Self-contained Japanese Morphological Analyzer written in pure Go

Name	downloads/week	total downloads	stars	last commit
🔗 kagome	-	-	⭐ 943	🟢 january

Others

Additional Go-based Japanese text processing libraries

ojosama - Converts text into the ojou-sama speaking style of Hyakumantenbara Salome (壱百満天原サロメ)
nihongo - Japanese Dictionary
yomichan-import - External dictionary importer for Yomichan.
imas-ime-dic - THE IDOLM@STER words dictionary for Japanese IME (by imas-db.jp)
go-kakasi - Kanji transliteration to hiragana/katakana/romaji, in Go
go-moji - A Go library for Zenkaku/Hankaku conversion
ojichat - Generates messages of the kind a middle-aged man (ojisan) might send over LINE or email
name - Name Searcher in Japanese

Name	downloads/week	total downloads	stars	last commit
🔗 ojosama	-	-	⭐ 388	🟢 last tuesday
🔗 nihongo	-	-	⭐ 83	🔴 february 2024
🔗 yomichan-import	-	-	⭐ 85	🔴 february 2023
🔗 imas-ime-dic	-	-	⭐ 31	🟢 january
🔗 go-kakasi	-	-	⭐ 6	🟢 december 2025
🔗 go-moji	-	-	⭐ 20	🔴 april 2019
🔗 ojichat	-	-	⭐ 1.3k	🔴 october 2024
🔗 name	-	-	⭐ 11	🔴 january 2025

Java

Morphology analysis

Japanese morphological analysis and dictionary management libraries

kuromoji - Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Sudachi - A Japanese Tokenizer for Business
SudachiDict - A lexicon for Sudachi
meval - MevAL, a system for evaluating the performance of morphological analyzers

Name	downloads/week	total downloads	stars	last commit
🔗 kuromoji	-	-	⭐ 1k	🔴 september 2019
🔗 Sudachi	-	-	⭐ 932	🔴 november 2024
🔗 SudachiDict	-	-	⭐ 277	🟢 january
🔗 meval	-	-	⭐ 7	🔴 august 2019

Others

Java libraries for Japanese NLP and OCR

kanjitomo-ocr - Java library for identifying Japanese characters from images
jakaroma - Java library and command-line tool to transliterate Japanese kanji to romaji (Latin alphabet)
kakasi-java - Kanji transliteration to hiragana/katakana/romaji, in Java
Kamite - A desktop language immersion companion for learners of Japanese
react-native-japanese-tokenizer - Async Japanese Tokenizer Native Plugin for React Native for iOS and Android
elasticsearch-analysis-japanese - Japanese analyzer for Elasticsearch that uses the kuromoji Japanese tokenizer
moji4j - A Java library that converts between Japanese Hiragana, Katakana, and Romaji scripts.
neologdn-java - Japanese text normalizer for mecab-neologd
elasticsearch-sudachi - The Japanese analysis plugin for elasticsearch

Name	downloads/week	total downloads	stars	last commit
🔗 kanjitomo-ocr	-	-	⭐ 203	🔴 may 2021
🔗 jakaroma	-	-	⭐ 67	🟡 june 2025
🔗 kakasi-java	-	-	⭐ 55	🔴 april 2016
🔗 Kamite	-	-	⭐ 129	🟡 march 2025
🔗 react-native-japanese-tokenizer	-	-	⭐ 38	🔴 june 2023
🔗 elasticsearch-analysis-japanese	-	-	⭐ 29	🔴 march 2012
🔗 moji4j	-	-	⭐ 33	🔴 june 2022
🔗 neologdn-java	-	-	⭐ 5	🟢 february
🔗 elasticsearch-sudachi	-	-	⭐ 218	🟢 january

Pretrained model

Word2Vec

Models that convert words into numeric vectors to capture semantic similarity

japanese-words-to-vectors - Word2vec (word to vectors) approach for Japanese language using Gensim and Mecab.
chiVe - Japanese word embedding with Sudachi and NWJC
elmo-japanese - elmo-japanese
embedrank - Python Implementation of EmbedRank
aovec - Easy Aozora Bunko Word2Vec builder, plus prebuilt models covering all Aozora Bunko books
dependency-based-japanese-word-embeddings - This is a repository for the AI LAB article "Dependency-based Japanese Word Embeddings" (係り受けに基づく日本語単語埋込); article URL: https://ai-lab.lapras.com/nlp/japanese-word-embedding/
jawikivec - Yet Another Japanese-Wikipedia Entity Vectors
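jawiki_word_vector_updater - Scripts that run MeCab morphological analysis on the latest Japanese Wikipedia dump with both the IPA dictionary and the latest NEologd dictionary, then train word2vec, fastText, and GloVe word embeddings on the results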

Name	downloads/week	total downloads	stars	last commit
🔗 japanese-words-to-vectors	-	-	⭐ 87	🔴 august 2020
🔗 chiVe	-	-	⭐ 169	🔴 march 2024
🔗 elmo-japanese	-	-	⭐ 5	🔴 october 2019
🔗 embedrank	-	-	⭐ 48	🔴 march 2019
🔗 aovec	📥 425	📦 81k	⭐ 3	🔴 january 2023
🔗 dependency-based-japanese-word-embeddings	-	-	⭐ 8	🔴 august 2019
🔗 jawikivec	-	-	⭐ 2	🔴 november 2018
🔗 jawiki_word_vector_updater	-	-	⭐ 11	🔴 may 2020
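
Semantic similarity between word vectors is commonly measured with cosine similarity. The sketch below illustrates the calculation in plain Python; the three-dimensional vectors are invented toy values, not embeddings from any model listed above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real models use hundreds of dimensions.
toy_vectors = {
    "猫": [0.9, 0.1, 0.3],   # cat
    "犬": [0.8, 0.2, 0.35],  # dog
    "車": [0.1, 0.9, 0.7],   # car
}

if __name__ == "__main__":
    print(cosine_similarity(toy_vectors["猫"], toy_vectors["犬"]))
    print(cosine_similarity(toy_vectors["猫"], toy_vectors["車"]))
```

With these toy values the cat and dog vectors score higher than the cat and car vectors, which is the behavior a trained embedding is expected to show for related words.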

Transformer based models

Models that use self-attention to understand context and perform advanced language tasks

bert-japanese - BERT models for Japanese text.
japanese-pretrained-models - Code for producing Japanese pretrained models provided by rinna Co., Ltd.
bert-japanese - BERT with SentencePiece for Japanese text.
SudachiTra - Japanese tokenizer for Transformers
japanese-dialog-transformers - Code for evaluating Japanese pretrained models provided by NTT Ltd.
shiba - Pytorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
Dialog - A PyTorch Implementation of japanese chatbot using BERT and Transformer's decoder
language-pretraining - BERT and ELECTRA models of PyTorch implementations for Japanese text.
medbertjp - Trials of pre-trained BERT models for the medical domain in Japanese.
ILYS-aoba-chatbot - ILYS-aoba-chatbot
t5-japanese - Code to pre-train Japanese T5 models
pytorch_bert_japanese - Use pretrained Japanese BERT models with PyTorch
Laboro-BERT-Japanese - Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus
RoBERTa-japanese - Japanese BERT Pretrained Model
aMLP-japanese - aMLP Transformer Model for Japanese
bert-japanese-aozora - Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy
sbert-ja - Code to train Sentence BERT Japanese model for Hugging Face Model Hub
BERT-Japan-vaccination - Official fine-tuning code for "Emotion Analysis of Japanese Tweets and Comparison to Vaccinations in Japan"
gpt2-japanese - Japanese GPT2 Generation Model
text2text-japanese - gpt-2 based text2text conversion model
gpt-ja - GPT-2 Japanese model for HuggingFace's transformers
friendly_JA-Model - MT model trained using the friendly_JA Corpus attempting to make Japanese easier/more accessible to occidental people by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
albert-japanese - BERT with SentencePiece for Japanese text.
ja_text_bert - Repository for generating a pre-trained BERT model from a Japanese Wikipedia corpus
DistilBERT-base-jp - A Japanese DistilBERT pretrained model, which was trained on Wikipedia.
bert - This repository provides snippets for using RoBERTa pre-trained on a Japanese corpus. The dataset consists of Japanese Wikipedia and web-crawled articles, 25 GB in total. The released model is based on the one from HuggingFace.
Laboro-DistilBERT-Japanese - Laboro DistilBERT Japanese
luke - LUKE -- Language Understanding with Knowledge-based Embeddings
GPTSAN - General-purpose Switch Transformer based Japanese language model
japanese-clip - Japanese CLIP by rinna Co., Ltd.
AcademicBART - We pretrained a BART-based Japanese masked language model on paper abstracts from the academic database CiNii Articles
AcademicRoBERTa - We pretrained a RoBERTa-based Japanese masked language model on paper abstracts from the academic database CiNii Articles.
LINE-DistilBERT-Japanese - DistilBERT model pre-trained on 131 GB of Japanese web text. The teacher model is a BERT-base model built in-house at LINE.
Japanese-Alpaca-LoRA - Links to Low-Rank Adapters created by fine-tuning LLaMA on the Stanford Alpaca dataset translated into Japanese, plus sample generation code
albert-japanese-tinysegmenter - Pretrained models, code, and guidance for pretraining the official ALBERT (https://github.com/google-research/albert) on Japanese Wikipedia resources
japanese-llama-experiment - Japanese LLaMa experiment
easylightchatassistant - EasyLightChatAssistant is an environment for easily trying out LightChatAssistant, a lightweight, uncensored and unrestricted local Japanese model, with KoboldCpp.

Name	downloads/week	total downloads	stars	last commit
🔗 bert-japanese	-	-	⭐ 543	🔴 march 2024
🔗 japanese-pretrained-models	-	-	⭐ repo not found	🔴 repo not found
🔗 bert-japanese	-	-	⭐ 498	🔴 february 2021
🔗 SudachiTra	📥 331	📦 159k	⭐ 79	🔴 december 2023
🔗 japanese-dialog-transformers	-	-	⭐ 245	🔴 june 2023
🔗 shiba	📥 39	📦 7k	⭐ 89	🔴 november 2023
🔗 Dialog	-	-	⭐ 73	🔴 october 2020
🔗 language-pretraining	-	-	⭐ 50	🔴 may 2023
🔗 medbertjp	-	-	⭐ 12	🔴 november 2020
🔗 ILYS-aoba-chatbot	-	-	⭐ 23	🔴 october 2021
🔗 t5-japanese	-	-	⭐ 40	🔴 september 2021
🔗 pytorch_bert_japanese	-	-	⭐ 35	🔴 june 2019
🔗 Laboro-BERT-Japanese	-	-	⭐ 73	🔴 may 2022
🔗 RoBERTa-japanese	-	-	⭐ 23	🔴 november 2021
🔗 aMLP-japanese	-	-	⭐ 16	🔴 may 2022
🔗 bert-japanese-aozora	-	-	⭐ 40	🔴 august 2020
🔗 sbert-ja	-	-	⭐ 11	🔴 august 2021
🔗 BERT-Japan-vaccination	-	-	⭐ 7	🔴 may 2022
🔗 gpt2-japanese	-	-	⭐ 325	🔴 september 2023
🔗 text2text-japanese	-	-	⭐ 33	🔴 july 2021
🔗 gpt-ja	-	-	⭐ 3	🔴 september 2021
🔗 friendly_JA-Model	-	-	⭐ 1	🔴 may 2022
🔗 albert-japanese	-	-	⭐ 33	🔴 october 2021
🔗 ja_text_bert	-	-	⭐ 115	🔴 november 2018
🔗 DistilBERT-base-jp	-	-	⭐ 161	🔴 april 2020
🔗 bert	-	-	⭐ 28	🔴 april 2022
🔗 Laboro-DistilBERT-Japanese	-	-	⭐ 16	🔴 december 2020
🔗 luke	-	-	⭐ 726	🔴 june 2023
🔗 GPTSAN	-	-	⭐ 118	🔴 september 2023
🔗 japanese-clip	-	-	⭐ repo not found	🔴 repo not found
🔗 AcademicBART	-	-	⭐ 2	🔴 july 2024
🔗 AcademicRoBERTa	-	-	⭐ 9	🔴 september 2024
🔗 LINE-DistilBERT-Japanese	-	-	⭐ 46	🔴 march 2023
🔗 Japanese-Alpaca-LoRA	-	-	⭐ 141	🔴 april 2023
🔗 albert-japanese-tinysegmenter	-	-	⭐ 13	🔴 september 2023
🔗 japanese-llama-experiment	-	-	⭐ 54	🟢 december 2025
🔗 easylightchatassistant	-	-	⭐ 39	🔴 april 2024
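
The self-attention operation these models share can be illustrated with scaled dot-product attention. The following is a simplified Python sketch under a strong assumption: queries, keys, and values are all the raw input vectors (Q = K = V = x), whereas real Transformer layers apply learned projections and use multiple heads. It is not code from any repository listed above.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: list[list[float]]):
    """Scaled dot-product self-attention with Q = K = V = x.

    Returns (outputs, weights); weights[i][j] is how strongly token i
    attends to token j.
    """
    dim = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in x
        ]
        row = softmax(scores)
        # Each output is a weighted average of all token vectors.
        mixed = [sum(w * value[i] for w, value in zip(row, x)) for i in range(dim)]
        weights.append(row)
        outputs.append(mixed)
    return outputs, weights


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy token vectors
    outputs, weights = self_attention(tokens)
    for row in weights:
        print([round(w, 3) for w in row])
```

Each row of weights sums to 1, so every output vector is a context-dependent blend of the whole input sequence.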

ChatGPT

Resources for using ChatGPT and APIs for Japanese dialogue and text generation

VRChatGPT - A program that lets you chat in VRChat using ChatGPT.
AITuberDegikkoMirii - Developing the foundational components of an AITuber
wanna - Shell command launcher with natural language
ChatdollKit - ChatdollKit enables you to make your 3D model into a chatbot
ChuanhuChatGPTJapanese - GUI for ChatGPT API For Japanese
AISisterAIChan - An Ukagaka ghost, "AI妹アイちゃん" (AI Sister Ai-chan), powered by ChatGPT 3.5. A ChatGPT API key is required separately.
vrchatbot - A repository for building an AI bot in VRChat
gptuber-by-langchain - GPT acts as a YouTuber
openai-chatfriend - A chatbox application built with Nuxt 3 and powered by the OpenAI text completion endpoint. You can select different personalities for your AI friend; by default it responds in Japanese. You can use this app to practice your Japanese (Nihongo) skills!
chrome-ext-translate-to-hiragana-with-chatgpt - This Chrome extension can translate selected Japanese text to Hiragana by using ChatGPT.
azure-search-openai-demo - This sample demonstrates several approaches for creating a ChatGPT-like experience over your own data using the Retrieval Augmented Generation pattern.
chatvrm - ChatVRM is a demo application that lets you easily converse with a 3D character in the browser.
sftly-replace - A Chrome extension to replace the selected text softly
summarize_arxv - Summarize arXiv papers with figures
aiavatarkit - Building AI-based conversational avatars lightning fast
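pva-aoai-integration-solution - This repository packages and publishes, as a solution, the flows and related assets created for the trial use of ChatGPT at Kobe City Hall.
jp-azureopenai-samples - Free sample apps (reference architectures, sample code, and deployment steps) intended as a reference for implementing applications with Azure OpenAI.
character_chat - A chat script for conversing in Japanese with a configured character using the OpenAI API.
chatgpt-slackbot - A Slackbot script for using OpenAI's ChatGPT API on Slack (assumes use in Japanese)
chatgpt-prompt-sample-japanese - Sample prompts for ChatGPT.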
kanji-flashcard-app-gpt4 - A Japanese Kanji Flashcard App built using Python and Langchain, enhanced with the intelligence of GPT-4.
IgakuQA - Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations
japagen - A study of generating pseudo training data with LLMs for Japanese tasks
generativeai-prompt-sample-japanese - Sample Japanese prompts for generative AI tools such as ChatGPT and Copilot

Name	downloads/week	total downloads	stars	last commit
🔗 VRChatGPT	-	-	⭐ 15	🔴 march 2023
🔗 AITuberDegikkoMirii	-	-	⭐ 5	🔴 march 2023
🔗 wanna	📥 35	📦 20k	⭐ 141	🔴 april 2023
🔗 ChatdollKit	-	-	⭐ 1.1k	🟢 last saturday
🔗 ChuanhuChatGPTJapanese	-	-	⭐ 1	🔴 march 2023
🔗 AISisterAIChan	-	-	⭐ 27	🔴 may 2023
🔗 vrchatbot	-	-	⭐ 28	🔴 december 2022
🔗 gptuber-by-langchain	-	-	⭐ 63	🔴 january 2023
🔗 openai-chatfriend	-	-	⭐ 16	🔴 april 2023
🔗 chrome-ext-translate-to-hiragana-with-chatgpt	-	-	⭐ 1	🔴 april 2023
🔗 azure-search-openai-demo	-	-	⭐ 46	🔴 december 2023
🔗 chatvrm	-	-	⭐ 818	🟡 may 2025
🔗 sftly-replace	-	-	⭐ 4	🔴 may 2023
🔗 summarize_arxv	-	-	⭐ 173	🔴 may 2023
🔗 aiavatarkit	-	-	⭐ 532	🟢 last wednesday
🔗 pva-aoai-integration-solution	-	-	⭐ repo not found	🔴 repo not found
🔗 jp-azureopenai-samples	-	-	⭐ 280	🟡 september 2025
🔗 character_chat	-	-	⭐ 16	🔴 june 2023
🔗 chatgpt-slackbot	-	-	⭐ 64	🔴 july 2024
🔗 chatgpt-prompt-sample-japanese	-	-	⭐ 420	🟢 today
🔗 kanji-flashcard-app-gpt4	-	-	⭐ 6	🔴 october 2023
🔗 IgakuQA	-	-	⭐ 48	🔴 march 2023
🔗 japagen	-	-	⭐ 1	🔴 october 2024
🔗 generativeai-prompt-sample-japanese	-	-	⭐ 420	🟢 today

Dictionary and IME

Resources for Japanese dictionaries and input method editors (IME)

mecab-ipadic-neologd - Neologism dictionary based on the language resources on the Web for mecab-ipadic
tdmelodic - A Japanese accent dictionary generator
jamdict - Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings
unidic-py - Unidic packaged for installation via pip.
Japanese-Company-Lexicon - Japanese Company Lexicon (JCLdic)
manbyo-sudachi - Manbyo dictionary (万病辞書) for Sudachi
jawiki-kana-kanji-dict - Generate SKK/MeCab dictionaries from Wikipedia (Japanese edition)
JIWC-Dictionary - Dictionary for finding emotions related to text
JumanDIC - This repository contains source dictionary files to build dictionaries for JUMAN and Juman++.
ipadic-py - IPAdic packaged for easy use from Python.
unidic-lite - A small version of UniDic for easy pip installs.
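emoji-ime-dictionary - An add-on IME dictionary for typing emoji in Japanese. It enables Japanese-to-emoji conversion in Google Japanese Input and similar IMEs.
google-ime-dictionary - An add-on IME dictionary for Japanese-to-English conversion and English abbreviation expansion. It enables both in Google Japanese Input, ATOK, and similar IMEs.
dic-nico-intersection-pixiv - An IME dictionary of the entries common to Niconico Daihyakka (ニコニコ大百科) and Pixiv Encyclopedia (ピクシブ百科事典)
google-ime-user-dictionary-ja-en - Project archive of a Google IME user dictionary mapping katakana words (Japanese loanwords) to English.
emoticon - Kaomoji (emoticon) dictionary for Google Japanese Input ∩(,,Ò‿Ó,,)∩
mecab-mozcdic - The open source mozc dictionary converted to the MeCab dictionary format.
denonbu-ime-dic - Denon IME (電音IME): a dictionary of terms related to 電音部 (Denonbu), intended for use with Microsoft IME and similar IMEs
nijisanji-ime-dic - A dictionary of terms related to にじさんじ (Nijisanji), intended for use with Microsoft IME and similar IMEs.
pokemon-ime-dic - A term dictionary covering the names of all currently known Pokémon, intended for use with Microsoft IME and similar IMEs.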
EJDict - English-Japanese Dictionary data (Public Domain) EJDict-hand
Ayashiy-Nipongo-Dic - 贵樣ばこゐ辞畫を使て正レい日本语を使ラことが出來ゑ。 (Deliberately garbled Japanese for "With this dictionary, you can use correct Japanese.")
genshin-dict - A word dictionary for Genshin Impact (原神) usable on Windows/macOS
jmdict-simplified - JMdict and JMnedict in JSON format
mozcdict-ext - Convert external words into Mozc system dictionary
mh-dict-jp - I want to make a user dictionary for MonsterHunter…
jitenbot - Convert data from Japanese dictionary websites and applications into portable file formats
mecab-unidic-neologd - Neologism dictionary based on the language resources on the Web for mecab-unidic
hololive-dictionary - Dictionary files related to hololive (hololive production). You can add words to your IME using the text files in the ./dictionary folder. See README.md for details.
jmdict-yomitan - JMdict, JMnedict, KANJIDIC for Yomitan/Yomichan.
yomichan-jlpt-vocab - JLPT level tags for words in Yomichan
Jitendex - A free and openly licensed Japanese-to-English dictionary compatible with multiple dictionary clients
jiten - Japanese Android/CLI/web dictionary based on JMdict/KANJIDIC (Japanese, Japanese-English, kanji-English, Japanese-German, and Japanese-Dutch dictionaries)
pixiv-yomitan - Pixiv Encyclopedia Dictionary for Yomitan
uchinaaguchi_dict - Uchinaaguchi dictionary (Okinawan language dictionary)
yomitan-dictionaries - Japanese and Chinese dictionaries for Yomitan.
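mouse_over_dictionary - A general-purpose dictionary tool that automatically reads the word under the mouse cursor
jisyo - A new dictionary format for the SKK kana-kanji conversion engine
skk-jisyo.emoji-ja - An SKK dictionary for converting Japanese readings to emoji 😂
anthy - Anthy is a kana-kanji conversion engine for Japanese. It converts romaji to kana, and kana text to mixed kana and kanji text.
aws_dic_for_google_ime - A dictionary of AWS service names for Google Japanese Input
cl-skkserv - An SKK dictionary server written in Common Lisp, with extensions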
anthy - Anthy maintenance
anthy-unicode - Anthy Unicode - Another Anthy
azooKey - azooKey: A Japanese Keyboard iOS Application Fully Developed in Swift
azookey-desktop - Japanese Input Method "azooKey" for Desktop, supporting macOS
fcitx5-hazkey - Japanese input method for fcitx5, powered by azooKey engine
mozcdic-ut-place-names - Mozc UT Place Name Dictionary is a dictionary converted from the Japan Post's ZIP code data for Mozc.
azookeykanakanjiconverter - Kana-Kanji Conversion Module written in Swift
libkkc - Japanese Kana Kanji conversion input method library
libskk - Japanese SKK input method library
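kanayomi-dict - User dictionary in OpenJTalk format
cjkvi-dict - Dictionary-related data from the Kanji Database
wlsp-classical - Word List by Semantic Principles (分類語彙表) data for classical Japanese
kanji-dict - A kanji dictionary for looking up stroke order, readings, stroke counts, radicals, usage examples, and etymology. It covers all 98,682 kanji in Unicode 15.1.
Kaomoji_proj - (๑ ᴖ ᴑ ᴖ ๑) みょんかおもじ (Myon Kaomoji, formerly Kaomoji_proj) is a project to create a kaomoji dictionary for Microsoft IME, Microsoft's input software.
kotlin-kana-kanji-converter - A kana-kanji conversion program in Kotlin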
alfred-japanese-dictionary - Japanese-English Dictionary using jisho.org with audio, csv export of entries, and preview of dictionary sites.
ichiran - Linguistic tools for texts in Japanese language
mikan - A Japanese input method.
colloquial-kansai-dictionary - A quick reference for the material taught in Colloquial Kansai Japanese.
jisho-open - Web frontend for the JMdict Japanese-English dictionary project, with study list support!
macskk - Yet Another macOS SKK Input Method
nandoku - A dictionary of hard-to-read kanji organized by school grade.
japanese_android_ime - A FOSS Japanese IME for Android
anthywl - Japanese input method for Sway using libanthy
sekka - Yet another Japanese Input Method inspired by SKK.
sumibi - Japanese input method powered by ChatGPT API
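jinmei-dict - Extracts only personal names from dictionary data and formats them as JSON, keyed by reading (katakana), with the candidate written forms held as a list.
japanesekeyboard - Sumire (スミレ): a fully offline Japanese keyboard app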
japanesearabic - Japanese-Arabic Dictionary (Yomitan)
o-dic - Okinawan dictionary
skk-emoji-jisyo - SKK emoji dictionary
mozcdic-ut-personal-names - A personal name dictionary for Mozc.
mozcdic-ut-sudachidict - A dictionary converted from SudachiDict for Mozc.
nihongo - Japanese language data and dictionary
kagome-dict - Dictionary Library for Kagome v2
canna - Canna Japanese input system
kansai-accent-dictionary - Keihan-type accent (Kansai dialect) dictionary: a Japanese dialect accent dictionary containing 4,615 words

Name	downloads/week	total downloads	stars	last commit
🔗 mecab-ipadic-neologd	-	-	⭐ 2.8k	🔴 september 2020
🔗 tdmelodic	-	-	⭐ 123	🔴 march 2024
🔗 jamdict	📥 469	📦 51k	⭐ 164	🔴 june 2021
🔗 unidic-py	📥 86k	📦 9M	⭐ 108	🔴 february 2025
🔗 Japanese-Company-Lexicon	-	-	⭐ 100	🔴 january 2023
🔗 manbyo-sudachi	-	-	⭐ 7	🔴 april 2021
🔗 jawiki-kana-kanji-dict	-	-	⭐ 59	🟢 february
🔗 JIWC-Dictionary	-	-	⭐ 40	🔴 january 2021
🔗 JumanDIC	-	-	⭐ 4	🔴 august 2022
🔗 ipadic-py	📥 89k	📦 6M	⭐ 24	🔴 october 2021
🔗 unidic-lite	📥 86k	📦 9M	⭐ 49	🔴 september 2020
🔗 emoji-ime-dictionary	-	-	⭐ 365	🔴 january 2023
🔗 google-ime-dictionary	-	-	⭐ 101	🔴 january 2023
🔗 dic-nico-intersection-pixiv	-	-	⭐ 83	🔴 september 2024
🔗 google-ime-user-dictionary-ja-en	-	-	⭐ 58	🔴 december 2016
🔗 emoticon	-	-	⭐ 43	🔴 may 2020
🔗 mecab-mozcdic	-	-	⭐ 10	🔴 january 2018
🔗 denonbu-ime-dic	-	-	⭐ 2	🔴 november 2022
🔗 nijisanji-ime-dic	-	-	⭐ 35	🟢 february
🔗 pokemon-ime-dic	-	-	⭐ 0	🔴 january 2020
🔗 EJDict	-	-	⭐ 241	🟡 november 2025
🔗 Ayashiy-Nipongo-Dic	-	-	⭐ 26	🔴 may 2024
🔗 genshin-dict	-	-	⭐ 126	🟢 february
🔗 jmdict-simplified	-	-	⭐ 329	🟢 today
🔗 mozcdict-ext	-	-	⭐ 68	🟡 september 2025
🔗 mh-dict-jp	-	-	⭐ 5	🟡 april 2025
🔗 jitenbot	-	-	⭐ repo not found	🔴 repo not found
🔗 mecab-unidic-neologd	-	-	⭐ 87	🔴 september 2020
🔗 hololive-dictionary	-	-	⭐ 24	🔴 december 2024
🔗 jmdict-yomitan	-	-	⭐ 237	🟢 today
🔗 yomichan-jlpt-vocab	-	-	⭐ 119	🟡 august 2025
🔗 Jitendex	-	-	⭐ 446	🟢 today
🔗 jiten	-	-	⭐ 125	🔴 december 2023
🔗 pixiv-yomitan	-	-	⭐ 49	🟢 january
🔗 uchinaaguchi_dict	-	-	⭐ 4	🟢 february
🔗 yomitan-dictionaries	-	-	⭐ 712	🟡 august 2025
🔗 mouse_over_dictionary	-	-	⭐ 72	🔴 january 2020
🔗 jisyo	-	-	⭐ 28	🔴 september 2023
🔗 skk-jisyo.emoji-ja	-	-	⭐ 30	🔴 march 2018
🔗 aws_dic_for_google_ime	-	-	⭐ 7	🔴 november 2019
🔗 cl-skkserv	-	-	⭐ 31	🔴 october 2024
🔗 anthy	-	-	⭐ 3	🔴 july 2013
🔗 anthy-unicode	-	-	⭐ 41	🟢 last friday
🔗 azooKey	-	-	⭐ 633	🟢 yesterday
🔗 azookey-desktop	-	-	⭐ 790	🟢 yesterday
🔗 fcitx5-hazkey	-	-	⭐ 160	🟢 last friday
🔗 mozcdic-ut-place-names	-	-	⭐ 20	🟢 february
🔗 azookeykanakanjiconverter	-	-	⭐ 135	🟢 last saturday
🔗 libkkc	-	-	⭐ 112	🔴 august 2024
🔗 libskk	-	-	⭐ 98	🟢 last friday
🔗 kanayomi-dict	-	-	⭐ repo not found	🔴 repo not found
🔗 cjkvi-dict	-	-	⭐ 108	🔴 september 2017
🔗 wlsp-classical	-	-	⭐ 2	🟡 november 2025
🔗 kanji-dict	-	-	⭐ 6	🟢 february
🔗 Kaomoji_proj	-	-	⭐ 11	🟡 october 2025
🔗 kotlin-kana-kanji-converter	-	-	⭐ 5	🟡 november 2025
🔗 alfred-japanese-dictionary	-	-	⭐ 6	🟢 february
🔗 ichiran	-	-	⭐ 386	🟢 january
🔗 mikan	-	-	⭐ 24	🟡 june 2025
🔗 colloquial-kansai-dictionary	-	-	⭐ 9	🟢 last saturday
🔗 jisho-open	-	-	⭐ 57	🟡 june 2025
🔗 macskk	-	-	⭐ 258	🟢 february
🔗 nandoku	-	-	⭐ 1	🟢 february
🔗 japanese_android_ime	-	-	⭐ 2	🟡 september 2025
🔗 anthywl	-	-	⭐ 34	🟡 april 2025
🔗 sekka	-	-	⭐ 24	🟡 july 2025
🔗 sumibi	-	-	⭐ 36	🟢 last saturday
🔗 jinmei-dict	-	-	⭐ 6	🔴 april 2020
🔗 japanesekeyboard	-	-	⭐ 202	🟢 last wednesday
🔗 japanesearabic	-	-	⭐ 17	🟡 may 2025
🔗 o-dic	-	-	⭐ 5	🟡 march 2025
🔗 skk-emoji-jisyo	-	-	⭐ 140	🔴 january 2025
🔗 mozcdic-ut-personal-names	-	-	⭐ 24	🟢 february
🔗 mozcdic-ut-sudachidict	-	-	⭐ 20	🟢 february
🔗 nihongo	-	-	⭐ 19	🔴 january 2025
🔗 kagome-dict	-	-	⭐ 15	🟢 last tuesday
🔗 canna	-	-	⭐ 3	🟡 august 2025
🔗 kansai-accent-dictionary	-	-	⭐ 1	🟢 december 2025

Corpus

Part-of-speech tagging / Named entity recognition

Corpora annotated with part-of-speech tags and named entities

ner-wikipedia-dataset - Japanese named entity recognition dataset built from Wikipedia
IOB2Corpus - Japanese IOB2 tagged corpus for Named Entity Recognition.
TwitterCorpus - Tokyo Metropolitan University Japanese Twitter corpus
UD_Japanese-PUD - Parallel Universal Dependencies.
UD_Japanese-GSD - Japanese data from the Google UDT 2.0.
KWDLC - Kyoto University Web Document Leads Corpus
AnnotatedFKCCorpus - Annotated Fuman Kaitori Center Corpus
UD_Japanese-GSDLUW - Long-unit-word version of UD_Japanese-GSD
ud_japanese-bccwj - This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation.

Name	downloads/week	total downloads	stars	last commit
🔗 ner-wikipedia-dataset	-	-	⭐ 142	🔴 september 2023
🔗 IOB2Corpus	-	-	⭐ 61	🔴 february 2020
🔗 TwitterCorpus	-	-	⭐ 21	🔴 march 2016
🔗 UD_Japanese-PUD	-	-	⭐ 0	🔴 may 2020
🔗 UD_Japanese-GSD	-	-	⭐ 28	🔴 may 2022
🔗 KWDLC	-	-	⭐ 83	🔴 december 2023
🔗 AnnotatedFKCCorpus	-	-	⭐ 18	🔴 december 2023
🔗 anthy	-	-	⭐ 15	🔴 february 2023
🔗 UD_Japanese-GSDLUW	-	-	⭐ 3	🟡 november 2025
🔗 ud_japanese-bccwj	-	-	⭐ 26	🟡 november 2025

Parallel corpus

Bilingual corpora containing aligned sentences for translation tasks

small_parallel_enja - 50k English-Japanese Parallel Corpus for Machine Translation Benchmark.
Web-Crawled-Corpus-for-Japanese-Chinese-NMT - A Web Crawled Corpus for Japanese-Chinese NMT
CourseraParallelCorpusMining - Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
JESC - A large parallel corpus of English and Japanese
AMI-Meeting-Parallel-Corpus - AMI Meeting Parallel Corpus
giant_ja-en_parallel_corpus - This directory includes a giant Japanese-English subtitle corpus. The raw data comes from Stanford's JESC project.
jesc_small - Small Japanese-English Subtitle Corpus
graded-enja-corpus - A Japanese-English parallel corpus that takes prohibited terms and word difficulty levels into account.
cjk-compsci-terms - CJK computer science terms comparison / 中日韓電腦科學術語對照 / 日中韓のコンピュータ科学の用語対照 / 한·중·일 전산학 용어 대조
Laboro-ParaCorpus - Scripts for creating a Japanese-English parallel corpus and training NMT models
google-vs-deepl-je - google-vs-deepl-je
matcha - A dataset for Japanese text simplification built from articles published by MATCHA, a media outlet for visitors to Japan.
en-ja-el - EnJaEL: En-Ja Parallel Entity Linking Dataset (Version 1.0)

Name	downloads/week	total downloads	stars	last commit
🔗 small_parallel_enja	-	-	⭐ 98	🔴 september 2019
🔗 Web-Crawled-Corpus-for-Japanese-Chinese-NMT	-	-	⭐ 15	🔴 september 2023
🔗 CourseraParallelCorpusMining	-	-	⭐ 15	🔴 august 2024
🔗 JESC	-	-	⭐ 87	🔴 november 2017
🔗 AMI-Meeting-Parallel-Corpus	-	-	⭐ 11	🔴 december 2020
🔗 giant_ja-en_parallel_corpus	-	-	⭐ 5	🔴 august 2019
🔗 jesc_small	-	-	⭐ 3	🔴 july 2019
🔗 graded-enja-corpus	-	-	⭐ 6	🟡 august 2025
🔗 cjk-compsci-terms	-	-	⭐ 145	🟢 january
🔗 Laboro-ParaCorpus	-	-	⭐ 18	🔴 november 2021
🔗 google-vs-deepl-je	-	-	⭐ 4	🔴 march 2020
🔗 matcha	-	-	⭐ 6	🔴 january 2025
🔗 en-ja-el	-	-	⭐ 2	🔴 january 2025

Dialog corpus

Collections of conversation data for training dialogue systems

JMRD - Japanese Movie Recommendation Dialogue dataset
open2ch-dialogue-corpus - A dialogue corpus created by crawling Open2ch (おーぷん2ちゃんねる)
BSD - The Business Scene Dialogue corpus
asdc - Accommodation Search Dialog Corpus (宿泊施設探索対話コーパス)
japanese-corpus - Japanese dialogue data for seq2seq, etc.
BPersona-chat - This repository contains the Japanese–English bilingual chat corpus BPersona-chat published in the paper Chat Translation Error Detection for Assisting Cross-lingual Communications at AACL-IJCNLP 2022's Workshop Eval4NLP 2022.
japanese-daily-dialogue - Japanese Daily Dialogue, or 日本語日常対話コーパス in Japanese, is a high-quality multi-turn dialogue dataset containing daily conversations on five topics: daily life, school, travel, health, and entertainment.
llm-japanese-dataset - Japanese chat dataset for building LLMs
kokorochat - A Japanese counseling dialogue dataset collected through role-play

Name	downloads/week	total downloads	stars	last commit
🔗 JMRD	-	-	⭐ 28	🔴 july 2022
🔗 open2ch-dialogue-corpus	-	-	⭐ 98	🔴 june 2021
🔗 BSD	-	-	⭐ 73	🔴 november 2021
🔗 asdc	-	-	⭐ 25	🔴 august 2023
🔗 japanese-corpus	-	-	⭐ 3	🔴 october 2018
🔗 BPersona-chat	-	-	⭐ 5	🔴 january 2023
🔗 japanese-daily-dialogue	-	-	⭐ 54	🔴 march 2023
🔗 llm-japanese-dataset	-	-	⭐ 87	🔴 january 2024
🔗 kokorochat	-	-	⭐ 16	🟡 august 2025

Others

Corpora for tasks such as question answering or entailment recognition

jrte-corpus - Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
kanji-data - A JSON kanji dataset with updated JLPT levels and WaniKani information
JapaneseWordSimilarityDataset - Japanese Word Similarity Dataset
simple-jppdb - A paraphrase database for Japanese text simplification
chABSA-dataset - chakki's Aspect-Based Sentiment Analysis dataset
JaQuAD - JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension (2022, Skelter Labs)
JaNLI - Japanese Adversarial Natural Language Inference Dataset
ebe-dataset - Evidence-based Explanation Dataset (AACL-IJCNLP 2020)
emoji-ja - Dictionary of Japanese readings, keywords, and categories for Unicode emoji
nayose-wikipedia-ja - Japanese name-matching (nayose, 名寄せ) dataset created from Wikipedia
ja.text8 - Japanese text8 corpus for word embedding.
ThreeLineSummaryDataset - Three-line summary dataset
japanese - This repo contains a list of the 44,998 most common Japanese words in order of frequency, as determined by the University of Leeds Corpus.
kanji-frequency - Kanji usage frequency data collected from various sources
TEDxJP-10K - TEDxJP-10K ASR Evaluation Dataset
CoARiJ - Corpus of Annual Reports in Japan
technological-book-corpus-ja - A raw corpus of technical books written in Japanese, with collection tools
ita-corpus-chuwa - Chunked word annotation for ITA corpus
wikipedia-utils - Utility scripts for preprocessing Wikipedia texts for NLP
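inappropriate-words-ja - A collection of inappropriate expressions in Japanese, intended for uses such as data cleaning in natural language processing.
house-of-councillors - Organized data on parliamentary groups, members, bills, and written questions from the official website of the House of Councillors.
house-of-representatives - Diet bill database: House of Representatives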
STAIR-captions - STAIR captions: large-scale Japanese image caption dataset
Winograd-Schema-Challenge-Ja - Japanese Translation of Winograd Schema Challenge
speechBSD - An extension of the BSD corpus with audio and speaker attribute information
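ita-corpus - Sentence list of the ITA corpus
rohan4600 - A mora-balanced Japanese corpus
anlp-jp-history - Complete list of presentations at the annual meetings of the Association for Natural Language Processing (言語処理学会), including machine-readable versions
keigo_transfer_task - Evaluation dataset for the honorific (keigo) conversion task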
loanwords_gairaigo - English loanwords in Japanese
jawikicorpus - Japanese-Wikipedia Wikification Corpus
GeneralPolicySpeechOfPrimeMinisterOfJapan - A corpus of Japanese text from general policy speeches by the Prime Minister of Japan
wrime - WRIME: a dataset for subjective and objective sentiment analysis
jtubespeech - JTubeSpeech: Corpus of Japanese speech collected from YouTube
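WikipediaWordFrequencyList - List of frequently used words in Japanese Wikipedia
kokkosho_data - Dataset of vehicle defect information
pdmocrdataset-part1 - OCR training dataset created in the project for OCR text conversion of digitized materials
huriganacorpus-ndlbib - Furigana dataset created from the Japanese national bibliography data (全国書誌データ)
jvs_hiho - Self-made labels for the JVS (Japanese versatile speech) corpus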
hirakanadic - Allows Sudachi to normalize from hiragana to katakana from any compound word list
animedb - A database listing anime works spanning about 100 years
security_words - Japanese-English correspondence of the names of public organizations related to cybersecurity
Data-on-Japanese-Diet-Members - Data on members of the Japanese Diet
honkoku-data - Transcription texts created on Minna de Honkoku (https://honkoku.org), a crowdsourced transcription platform for historical Japanese documents.
wikihow_japanese - wikiHow dataset (Japanese version)
engineer-vocabulary-list - Engineer Vocabulary List in Japanese/English
JSICK - Japanese Sentences Involving Compositional Knowledge (JSICK) Dataset/JSICK-stress Test Set
phishurl-list - Phishing URL dataset from JPCERT/CC
jcms - A Japanese Corpus of Many Specialized Domains (JCMS)
aozorabunko_text - text-only archives of www.aozora.gr.jp
friendly_JA-Corpus - friendly_JA is a parallel Japanese-to-Japanese corpus aimed at making Japanese easier by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
topokanji - Topologically ordered lists of kanji for effective learning
isbn4groups - Data on Japanese-language publications in ISBN-13 (978-4-XXXXXXXXX)
NMeCab - NMeCab: About Japanese morphological analyzer on .NET
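ndlngramdata - Dataset of n-gram frequency statistics for OCR text data created from digitized materials
ndlngramviewer_v2 - Source code and related materials for NDL Ngram Viewer, relaunched in January 2023
data_set - Datasets related to laws and court precedents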
huggingface-datasets_wrime - WRIME for huggingface datasets
ndl-minhon-ocrdataset - NDL OCR training dataset for classical Japanese books (data processed from Minna de Honkoku)
PAX_SAPIENTICA - GIS & Archaeological Simulator. 2023 in development.
j-liwc2015 - Japanese version of LIWC2015
huggingface-datasets_livedoor-news-corpus - Japanese Livedoor news corpus for huggingface datasets
huggingface-datasets_JGLUE - JGLUE: Japanese General Language Understanding Evaluation for huggingface datasets
commonsense-moral-ja - JCommonsenseMorality is a dataset created through crowdsourcing that reflects the commonsense morality of Japanese annotators.
comet-atomic-ja - COMET-ATOMIC ja
dcsg-ja - Dialogue Commonsense Graph in Japanese
japanese-toxic-dataset - "Proposal and Evaluation of Japanese Toxicity Schema" provides a schema and dataset for toxicity in the Japanese language.
camera - CAMERA (CyberAgent Multimodal Evaluation for Ad Text GeneRAtion) is the Japanese ad text generation dataset.
Japanese-Fakenews-Dataset - Japanese fake news dataset
jpn_explainable_qa_dataset - jpn_explainable_qa_dataset
copa-japanese - COPA Dataset in Japanese
WLSP-familiarity - Word Familiarity Rate for 'Word List by Semantic Principles (WLSP)'
ProSub - A cross-linguistic study of pronoun substitutes and address terms
ramendb - Scraping tools for, and data collected from, the なんとかデータベース sites (https://supleks.jp/)
huggingface-datasets_CAMERA - CAMERA (CyberAgent Multimodal Evaluation for Ad Text GeneRAtion) for huggingface datasets
FactCheckSentenceNLI-FCSNLI- - FactCheckSentenceNLI dataset
databricks-dolly-15k-ja - A Japanese translation of databricks-dolly-15k.jsonl, the training data used for databricks/dolly-v2-12b.
EaST-MELD - EaST-MELD is an English-Japanese dataset for emotion-aware speech translation based on MELD.
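meconaudio - Mecon Audio (Medical Conference Audio) is a dataset of read-aloud minutes from the Advanced Medical Care Council (先進医療会議) hosted by the Ministry of Health, Labour and Welfare.
japanese-addresses - Open data of nationwide address records at the town/chome level (277,191 entries)
aozorasearch - A full-text search system for Aozora Bunko powered by Groonga, provided as both a library and a web app.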
llm-jp-corpus - This repository contains scripts to reproduce the LLM-jp corpus.
alpaca_ja - A Japanese version of the Alpaca dataset
instruction_ja - Japanese instruction data
japanese-family-names - Top 5000 Japanese family names, with readings, ordered by frequency.
kanji-data-media - Japanese language data on kanji, radicals, media files, fonts and related resources from Kanji alive
reazonspeech - Construct large-scale Japanese audio corpus at home
huriganacorpus-aozora - Furigana dataset created from Aozora Bunko and Sapie (サピエ) braille data
koniwa - An open collection of annotated voices in Japanese language
JMMLU - Japanese Massive Multitask Language Understanding Benchmark
hurigana-speech-corpus-aozora - Dataset of a speech corpus with furigana annotations from Aozora Bunko
jqara - JQaRA: Japanese Question Answering with Retrieval Augmentation, a Japanese Q&A dataset for evaluating retrieval-augmented generation (RAG)
jemhopqa - JEMHopQA (Japanese Explainable Multi-hop Question Answering) is a Japanese multi-hop QA dataset that can evaluate internal reasoning.
jacred - Repository for Japanese Document-level Relation Extraction Dataset (plan to be released in March).
jades - JADES is a dataset for text simplification in Japanese, described in "JADES: New Text Simplification Dataset in Japanese Targeted at Non-Native Speakers" (the paper will be available soon).
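do-not-answer-ja - A dataset made by machine-translating "Do-Not-Answer", a safety evaluation dataset released by the University of Melbourne in August 2023, into Japanese and then revising it with Japanese culture in mind, so that it can also be used to evaluate Japanese LLMs.
oasst1-89k-ja - A Japanese translation of OASST1, the open-source data from OpenAssistant.
jacwir - JaCWIR: Japanese Casual Web IR, a small, casual dataset of web titles and summaries for evaluating Japanese information retrieval
japanese-technical-dict - A table mapping katakana terms commonly used in science and technology to their source words, for learners of Japanese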
j-unimorph - Dataset of UniMorph in Japanese
GazeVQA - Dataset for the LREC-COLING 2024 paper "A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions"
J-CRe3 - Code for J-CRe3 experiments (Ueda et al., LREC-COLING, 2024)
jmed-llm - JMED-LLM: Japanese Medical Evaluation Dataset for Large Language Models
lawtext - Plain text format for Japanese law
pdmocrdataset-part2 - OCR training dataset created in the OCR processing program research and development project
japanesetopicwsd - Evaluation set for topic-based word sense disambiguation
temporalNLI_dataset - Jamp: Controlled Japanese Temporal Inference Dataset for Evaluating Generalization Capacity of Language Models
JSeM - Japanese semantic test suite (FraCaS counterpart and extensions)
niilc-qa - NIILC QA data
chain-of-thought-ja-dataset - Dataset of paper "Verification of Chain-of-Thought Prompting in Japanese"
WikipediaAnnotatedCorpus - This is a Japanese text corpus that consists of Wikipedia articles with various linguistic annotations.
elaws-history - Periodically downloads and archives the "all laws and regulations data" (全ての法令データ) distributed by e-Gov law search
Japanese-RP-Bench - Japanese-RP-Bench is a benchmark for measuring the Japanese role-play ability of LLMs.
hdic - HDIC : Integrated Database of Hanzi Dictionaries in Early Japan
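awesome-japan-opendata - Awesome Japan Open Data: a curated list of open data resources in Japan
kanji-data - Data on kanji, including the Jōyō kanji list
openchj-genji - Morphological information data for The Tale of Genji (源氏物語)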
AdParaphrase - This repository contains data for our paper "AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts".
Jamp_sp - Jamp_sp: Controlled Japanese Temporal Inference Dataset Considering Aspect
jnli-neg - Public repository for JNLI-Neg, a Japanese natural language inference dataset for evaluating the understanding of negation.
swallow-corpus - This repository provides Python implementation for building Swallow Corpus Version 1, a large Japanese web corpus (Okazaki et al., 2024), from Common Crawl archives.
jalecon - A Dataset of Japanese Lexical Complexity for Non-Native Readers
multils-japanese - MultiLS-Japanese Lexical Complexity Prediction and Lexical Simplification Dataset for Japanese: annotator profiles, unaggregated annotation, and annotation guidelines.
nwjc - NINJAL Web Japanese Corpus
open-mantra-dataset - Dataset introduced in the paper "Towards Fully Automated Manga Translation" presented in AAAI21
public-annotations - Various annotations of Manga109 dataset
gimei - random Japanese name and address generator
safety-boundary-test - A test set for evaluating the safety behavior of Japanese language models
j-ono-data - A simple, open-source collection of Japanese onomatopoeic and mimetic sound words in JSON format. With manga samples.
kanji - List of Japanese kanji radicals to learn
jethics - Overview page for JETHICS, a dataset for evaluating Japanese moral understanding (to be updated)
waon - WAON: Large-Scale and High-Quality Japanese Image-Text Dataset for Vision-Language Models
kuci - Kyoto University Commonsense Inference dataset (KUCI)
japanese-address-testdata - Test dataset of Japanese addresses that are hard to parse
jlpt-word-list - Japanese word list from JLPT vocabulary
hiragana_mojigazo - Character image dataset (73-character hiragana version)
lawqa_jp - Multiple-choice QA dataset on Japanese laws and regulations
yjcaptions - YJ Captions 26k Dataset
ja-vg-vqa - Japanese Visual Genome VQA dataset
lawhub - Repository to track Japanese Law in text format
japanese-subtitles-word-kanji-frequency-lists - A word frequency list derived from subtitles from Japanese drama, anime and films.
jconj - A table-based Japanese word conjugator
extract_jawp_names - Extracts personal names from Japanese Wikipedia.
cejc_yomichan_freq_dict - Frequency dictionary for yomichan based on the Corpus of Everyday Japanese Conversation dataset
wikidict-ja - Wikipedia Bilingual Reference Data (Japanese)
ajimee-bench - AJIMEE-Bench (Advanced Japanese IME Evaluation Benchmark)
j-spaw - J-SpAW: Japanese speech corpus for speaker verification and anti-spoofing
camera3 - CAMERA3: An Evaluation Dataset for Controllable Ad Text Generation in Japanese
jgpqa - Japanese translation of the GPQA dataset
tanaka-corpus-plus - The Tanaka Corpus with noise removed.
emotioncorpusjapanesetokushimaa2lab - Japanese emotion corpus Tokushima Univ. A-2 Lab.
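osworld-jp - A Japanese computer-use benchmark for language-aware evaluation
quasi_japanese_reviews - Quasi Japanese Reviews (pseudo review data)
psychiatry-clinical-notes - Survey dataset on writing psychiatric initial-visit medical records
merged-town-names - Mapping table between old place names that disappeared through municipal mergers and similar changes, and their new names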
japanesetextemoticondata - Japanese text-emoticon data.
mishearing-corpus - Mishearing corpus: a Japanese dataset of about 10,000 entries managed with CSV and Table Schema, and automatically validated with VS Code, pre-commit, Frictionless, and GitHub Actions
kotowaza - Structured JSON dataset of Japanese proverbs (kotowaza) with meanings in Indonesian & English, examples, JLPT levels, and tags.
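selective-rag-kasensabo - An MVP of a practical agentic RAG system that automatically classifies the expertise granularity (fine/coarse) of questions about construction technical standards with 96% accuracy and selects the optimal RAG system (ColBERT/Naive). Four RAG systems were built on the technical standards for rivers, erosion control, and dams published in November 2025, and their accuracy and speed were compared on 200 questions of varying expertise granularity.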

Name	downloads/week	total downloads	stars	last commit
🔗 jrte-corpus	-	-	⭐ 77	🔴 june 2023
🔗 kanji-data	-	-	⭐ 201	🔴 december 2019
🔗 JapaneseWordSimilarityDataset	-	-	⭐ 102	🔴 december 2021
🔗 simple-jppdb	-	-	⭐ 32	🔴 march 2017
🔗 chABSA-dataset	-	-	⭐ 141	🔴 september 2018
🔗 JaQuAD	-	-	⭐ 108	🔴 january 2022
🔗 JaNLI	-	-	⭐ 17	🔴 may 2023
🔗 ebe-dataset	-	-	⭐ 18	🔴 december 2020
🔗 emoji-ja	-	-	⭐ 82	🟡 march 2025
🔗 nayose-wikipedia-ja	-	-	⭐ 35	🔴 march 2020
🔗 ja.text8	-	-	⭐ 111	🔴 october 2017
🔗 ThreeLineSummaryDataset	-	-	⭐ 31	🔴 april 2018
🔗 japanese	-	-	⭐ 83	🔴 september 2018
🔗 kanji-frequency	-	-	⭐ 155	🟢 january
🔗 TEDxJP-10K	-	-	⭐ 24	🔴 january 2021
🔗 CoARiJ	-	-	⭐ 94	🔴 december 2020
🔗 technological-book-corpus-ja	-	-	⭐ 26	🔴 july 2023
🔗 ita-corpus-chuwa	-	-	⭐ 5	🔴 august 2021
🔗 wikipedia-utils	-	-	⭐ 78	🔴 april 2024
🔗 inappropriate-words-ja	-	-	⭐ 201	🔴 december 2021
🔗 house-of-councillors	-	-	⭐ 105	🟢 yesterday
🔗 house-of-representatives	-	-	⭐ 176	🟢 yesterday
🔗 STAIR-captions	-	-	⭐ 90	🔴 july 2018
🔗 Winograd-Schema-Challenge-Ja	-	-	⭐ 6	🔴 january 2019
🔗 speechBSD	-	-	⭐ 3	🔴 february 2024
🔗 ita-corpus	-	-	⭐ 222	🔴 december 2024
🔗 rohan4600	-	-	⭐ 66	🔴 february 2023
🔗 anlp-jp-history	-	-	⭐ 3	🔴 april 2024
🔗 keigo_transfer_task	-	-	⭐ 21	🔴 november 2022
🔗 loanwords_gairaigo	-	-	⭐ 19	🔴 january 2021
🔗 jawikicorpus	-	-	⭐ 4	🔴 november 2018
🔗 GeneralPolicySpeechOfPrimeMinisterOfJapan	-	-	⭐ 6	🔴 january 2020
🔗 wrime	-	-	⭐ 174	🟡 september 2025
🔗 jtubespeech	-	-	⭐ 229	🔴 march 2023
🔗 WikipediaWordFrequencyList	-	-	⭐ 2	🔴 april 2022
🔗 kokkosho_data	-	-	⭐ 1	🔴 july 2019
🔗 pdmocrdataset-part1	-	-	⭐ 75	🔴 june 2024
🔗 huriganacorpus-ndlbib	-	-	⭐ 28	🔴 september 2021
🔗 jvs_hiho	-	-	⭐ 31	🔴 february 2021
🔗 hirakanadic	📥 61	📦 14k	⭐ 7	🔴 july 2023
🔗 animedb	-	-	⭐ 328	🔴 january 2023
🔗 security_words	-	-	⭐ 27	🔴 august 2023
🔗 Data-on-Japanese-Diet-Members	-	-	⭐ 3	🔴 september 2022
🔗 honkoku-data	-	-	⭐ 17	🟢 yesterday
🔗 wikihow_japanese	-	-	⭐ 35	🔴 december 2020
🔗 engineer-vocabulary-list	-	-	⭐ 1.9k	🔴 november 2020
🔗 JSICK	-	-	⭐ 45	🔴 may 2023
🔗 phishurl-list	-	-	⭐ 199	🟡 november 2025
🔗 jcms	-	-	⭐ 9	🔴 november 2022
🔗 aozorabunko_text	-	-	⭐ 85	🔴 march 2023
🔗 friendly_JA-Corpus	-	-	⭐ repo not found	🔴 repo not found
🔗 topokanji	-	-	⭐ 199	🔴 january 2016
🔗 isbn4groups	-	-	⭐ 1	🔴 june 2024
🔗 NMeCab	-	-	⭐ 97	🔴 march 2024
🔗 ndlngramdata	-	-	⭐ 14	🔴 january 2023
🔗 ndlngramviewer_v2	-	-	⭐ 3	🔴 july 2023
🔗 data_set	-	-	⭐ 49	🔴 january 2025
🔗 huggingface-datasets_wrime	-	-	⭐ 4	🔴 january 2023
🔗 ndl-minhon-ocrdataset	-	-	⭐ 18	🔴 february 2025
🔗 PAX_SAPIENTICA	-	-	⭐ 180	🟢 december 2025
🔗 j-liwc2015	-	-	⭐ 13	🔴 november 2024
🔗 huggingface-datasets_livedoor-news-corpus	-	-	⭐ 2	🔴 october 2023
🔗 huggingface-datasets_JGLUE	-	-	⭐ 12	🟡 march 2025
🔗 commonsense-moral-ja	-	-	⭐ 15	🟡 november 2025
🔗 comet-atomic-ja	-	-	⭐ 31	🔴 march 2024
🔗 dcsg-ja	-	-	⭐ 6	🔴 march 2023
🔗 japanese-toxic-dataset	-	-	⭐ 21	🔴 january 2023
🔗 camera	-	-	⭐ 26	🔴 august 2024
🔗 Japanese-Fakenews-Dataset	-	-	⭐ 20	🔴 may 2021
🔗 jpn_explainable_qa_dataset	-	-	⭐ repo not found	🔴 repo not found
🔗 copa-japanese	-	-	⭐ 1	🔴 february 2023
🔗 WLSP-familiarity	-	-	⭐ 12	🔴 january 2025
🔗 ProSub	-	-	⭐ 5	🟡 april 2025
🔗 ramendb	-	-	⭐ 7	🟢 today
🔗 huggingface-datasets_CAMERA	-	-	⭐ 3	🔴 march 2023
🔗 FactCheckSentenceNLI-FCSNLI-	-	-	⭐ 0	🔴 march 2021
🔗 databricks-dolly-15k-ja	-	-	⭐ 89	🔴 july 2023
🔗 EaST-MELD	-	-	⭐ 0	🔴 june 2023
🔗 meconaudio	-	-	⭐ 9	🔴 october 2023
🔗 japanese-addresses	-	-	⭐ 758	🟢 december 2025
🔗 aozorasearch	-	-	⭐ 21	🔴 september 2020
🔗 llm-jp-corpus	-	-	⭐ 43	🔴 october 2023
🔗 alpaca_ja	-	-	⭐ 86	🔴 may 2023
🔗 instruction_ja	-	-	⭐ 24	🔴 july 2023
🔗 japanese-family-names	-	-	⭐ 18	🔴 june 2017
🔗 kanji-data-media	-	-	⭐ 401	🔴 november 2023
🔗 reazonspeech	-	-	⭐ 360	🟢 january
🔗 huriganacorpus-aozora	-	-	⭐ 17	🔴 january 2024
🔗 koniwa	-	-	⭐ 56	🟡 april 2025
🔗 JMMLU	-	-	⭐ 38	🟡 october 2025
🔗 hurigana-speech-corpus-aozora	-	-	⭐ 42	🟡 march 2025
🔗 jqara	-	-	⭐ 42	🟡 september 2025
🔗 jemhopqa	-	-	⭐ 29	🟡 april 2025
🔗 jacred	-	-	⭐ 7	🔴 march 2024
🔗 jades	-	-	⭐ 0	🔴 december 2022
🔗 do-not-answer-ja	-	-	⭐ 24	🔴 december 2023
🔗 oasst1-89k-ja	-	-	⭐ 16	🔴 november 2023
🔗 jacwir	-	-	⭐ 8	🟡 september 2025
🔗 japanese-technical-dict	-	-	⭐ 3	🔴 november 2024
🔗 j-unimorph	-	-	⭐ 9	🟢 january
🔗 GazeVQA	-	-	⭐ 0	🔴 september 2024
🔗 J-CRe3	-	-	⭐ 9	🔴 january 2025
🔗 jmed-llm	-	-	⭐ 56	🔴 september 2024
🔗 lawtext	-	-	⭐ 91	🟢 january
🔗 pdmocrdataset-part2	-	-	⭐ 14	🔴 june 2024
🔗 japanesetopicwsd	-	-	⭐ 2	🔴 september 2018
🔗 temporalNLI_dataset	-	-	⭐ 1	🔴 july 2023
🔗 JSeM	-	-	⭐ 13	🔴 november 2024
🔗 niilc-qa	-	-	⭐ 18	🔴 november 2015
🔗 chain-of-thought-ja-dataset	-	-	⭐ 5	🔴 september 2023
🔗 WikipediaAnnotatedCorpus	-	-	⭐ 12	🟢 last thursday
🔗 elaws-history	-	-	⭐ 4	🟢 yesterday
🔗 Japanese-RP-Bench	-	-	⭐ 18	🔴 september 2024
🔗 hdic	-	-	⭐ 40	🟢 today
🔗 awesome-japan-opendata	-	-	⭐ 156	🟢 february
🔗 kanji-data	-	-	⭐ 11	🟢 yesterday
🔗 openchj-genji	-	-	⭐ 2	🟡 march 2025
🔗 AdParaphrase	-	-	⭐ 1	🟡 may 2025
🔗 Jamp_sp	-	-	⭐ 0	🔴 june 2024
🔗 jnli-neg	-	-	⭐ 0	🟢 december 2025
🔗 swallow-corpus	-	-	⭐ 5	🔴 november 2024
🔗 jalecon	-	-	⭐ 5	🔴 july 2023
🔗 multils-japanese	-	-	⭐ 0	🟢 january
🔗 nwjc	-	-	⭐ 10	🔴 april 2022
🔗 open-mantra-dataset	-	-	⭐ 198	🔴 march 2023
🔗 gimei	-	-	⭐ 424	🟢 january
🔗 safety-boundary-test	-	-	⭐ 9	🟡 july 2025
🔗 j-ono-data	-	-	⭐ 6	🟢 february
🔗 kanji	-	-	⭐ 28	🟢 yesterday
🔗 jethics	-	-	⭐ 2	🟡 june 2025
🔗 waon	-	-	⭐ 6	🟡 november 2025
🔗 kuci	-	-	⭐ 5	🔴 february 2024
🔗 japanese-address-testdata	-	-	⭐ 14	🔴 september 2023
🔗 jlpt-word-list	-	-	⭐ 58	🔴 february 2022
🔗 hiragana_mojigazo	-	-	⭐ 18	🔴 april 2020
🔗 lawqa_jp	-	-	⭐ 264	🟢 last friday
🔗 yjcaptions	-	-	⭐ 60	🔴 november 2016
🔗 ja-vg-vqa	-	-	⭐ 30	🔴 november 2018
🔗 lawhub	-	-	⭐ 152	🔴 november 2020
🔗 japanese-subtitles-word-kanji-frequency-lists	-	-	⭐ 39	🔴 december 2023
🔗 jconj	-	-	⭐ 35	🔴 may 2020
🔗 extract_jawp_names	-	-	⭐ 21	🔴 december 2022
🔗 cejc_yomichan_freq_dict	-	-	⭐ 9	🔴 june 2023
🔗 wikidict-ja	-	-	⭐ 5	🔴 june 2016
🔗 ajimee-bench	-	-	⭐ 18	🔴 january 2025
🔗 j-spaw	-	-	⭐ 5	🟡 august 2025
🔗 camera3	-	-	⭐ 4	🔴 may 2024
🔗 jgpqa	-	-	⭐ 2	🟡 september 2025
🔗 tanaka-corpus-plus	-	-	⭐ 2	🔴 june 2021
🔗 emotioncorpusjapanesetokushimaa2lab	-	-	⭐ 2	🔴 september 2024
🔗 osworld-jp	-	-	⭐ 2	🟡 november 2025
🔗 quasi_japanese_reviews	-	-	⭐ 1	🔴 july 2023
🔗 psychiatry-clinical-notes	-	-	⭐ 1	🟡 october 2025
🔗 merged-town-names	-	-	⭐ 1	🔴 may 2022
🔗 japanesetextemoticondata	-	-	⭐ 1	🔴 march 2021
🔗 mishearing-corpus	-	-	⭐ 1	🟢 january
🔗 kotowaza	-	-	⭐ 1	🟢 yesterday
🔗 selective-rag-kasensabo	-	-	⭐ 1	🟡 november 2025

Tutorial

Guides and tutorials for learning Japanese NLP tools and techniques

spacy_tutorial - spaCy tutorial in English and Japanese. spacy-transformers, BERT, GiNZA.
fastTextJapaneseTutorial - Tutorial for training fastText on a Japanese corpus
allennlp-NER-ja - AllenNLP-NER-ja: Named entity recognition for Japanese with AllenNLP
chariot-PyTorch-Japanese-text-classification - Experiments in Japanese text classification using chariot and PyTorch
ginza-examples - A recommendation of GiNZA, the Japanese NLP library (日本語NLPライブラリGiNZAのすゝめ)
DocumentClassificationUsingBERT-Japanese - DocumentClassificationUsingBERT-Japanese
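BERT_Japanese_Google_Colaboratory - How to run Japanese BERT on Google Colaboratory.
bert-book - Support page for the book "Introduction to Natural Language Processing with BERT: Practical Programming with Transformers" (BERTによる自然言語処理入門: Transformersを使った実践プログラミング)
janome-tutorial - An introductory text mining tutorial using Janome.
handson-language-models - Hands-on materials for Japanese language models
JapaneseNLI - Try Japanese textual inference on Google Colab
deep-learning-with-pytorch-ja - Japanese-edition repository for deep-learning-with-pytorch.
bert-classification-tutorial - [2023 edition] Text classification with BERT
python-nlp-book - Support page for the book "Natural Language Processing with Deep Learning" (ディープラーニングによる自然言語処理, Kyoritsu Shuppan)
llm-book - GitHub repository for the book "Introduction to Large Language Models" (大規模言語モデル入門, Gijutsu-Hyoron, 2023)
nlp2024-tutorial-3 - NLP2024 Tutorial 3, "Learning Japanese Large Language Models by Building Them" (作って学ぶ日本語大規模言語モデル): environment setup instructions and source code
japanese-ir-tutorial - Japanese information retrieval tutorial
nlpbook - Support site for the book "A Textbook of Natural Language Processing" (自然言語処理の教科書)
kantan-regex-book - Learn by building a regular expression engine (作って学ぶ正規表現エンジン)
bert-classification-tutorial-2024 - [2024 edition] Text classification with BERT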
Gemma2_2b_Japanese_finetuning_colab.ipynb - Fine-Tuning Google Gemma for Japanese Instructions
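nlp100v2020 - Solving "NLP 100 Exercise 2020" (言語処理100本ノック 2020) in Python
textmining-ja - Practice in natural language processing and text analysis with R
nlp2025-tutorial-2 - Materials and source code for the NLP2025 tutorial "A Practical Introduction to Geographic Information and Language Processing" (地理情報と言語処理実践入門)
nlp100v2025 - Solving "NLP 100 Exercise 2025" (言語処理100本ノック 2025) in Python
topic-models-ao - Notes on the book "Topic Models" (トピックモデル, Machine Learning Professional Series)
slp2025 - Materials for the Ongaku Symposium 2025 (音学シンポジウム2025) tutorial "Introduction to Multimodal Large Language Models" (マルチモーダル大規模言語モデル入門)
book_impress_it-basic-education-ai - The Impress book "IT Basic Education: Natural Language Processing & Image Analysis" (IT基礎教養 自然言語処理＆画像解析)
genai-agent-advanced-book - Source code used in the book "A Practical Introduction to Generative AI Agents for Use in the Field" (現場で活用するための生成AIエージェント実践入門, Kodansha Scientific)
course2024-nlp - "Advanced Topics in Natural Language Processing" (自然言語処理特論), academic year 2024, Graduate School of Artificial Intelligence and Science, Rikkyo University
support-genai-book - Support page for the book "Generative AI Explained from the Original Papers" (原論文から解き明かす生成AI, Gijutsu-Hyoron)
ir100 - Information Retrieval 100 Exercises (情報検索100本ノック)
kaggle_llm_book - Support site for the book "Introduction to Large Language Models with Kaggle: Practical Natural Language Processing Programming" (Kaggle ではじめる大規模言語モデル入門 ～自然言語処理〈実践〉プログラミング～)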

Name	downloads/week	total downloads	stars	last commit
🔗 spacy_tutorial	-	-	⭐ 63	🔴 january 2020
🔗 fastTextJapaneseTutorial	-	-	⭐ 205	🔴 september 2016
🔗 allennlp-NER-ja	-	-	⭐ 5	🔴 may 2022
🔗 chariot-PyTorch-Japanese-text-classification	-	-	⭐ 5	🔴 march 2019
🔗 ginza-examples	-	-	⭐ 16	🔴 january 2021
🔗 DocumentClassificationUsingBERT-Japanese	-	-	⭐ 0	🟡 august 2025
🔗 BERT_Japanese_Google_Colaboratory	-	-	⭐ 29	🔴 january 2022
🔗 bert-book	-	-	⭐ 264	🔴 february 2024
🔗 janome-tutorial	-	-	⭐ 31	🔴 march 2019
🔗 handson-language-models	-	-	⭐ 3	🔴 march 2021
🔗 JapaneseNLI	-	-	⭐ 6	🔴 june 2021
🔗 deep-learning-with-pytorch-ja	-	-	⭐ 142	🔴 may 2021
🔗 bert-classification-tutorial	-	-	⭐ 235	🔴 may 2024
🔗 python-nlp-book	-	-	⭐ 10	🔴 may 2023
🔗 llm-book	-	-	⭐ 461	🟢 december 2025
🔗 nlp2024-tutorial-3	-	-	⭐ 112	🔴 april 2024
🔗 japanese-ir-tutorial	-	-	⭐ 3	🔴 june 2024
🔗 nlpbook	-	-	⭐ 14	🟡 april 2025
🔗 kantan-regex-book	-	-	⭐ 22	🔴 march 2024
🔗 bert-classification-tutorial-2024	-	-	⭐ 30	🔴 july 2024
🔗 Gemma2_2b_Japanese_finetuning_colab.ipynb	-	-	⭐ repo not found	🔴 august 2024
🔗 nlp100v2020	-	-	⭐ 91	🟡 april 2025
🔗 textmining-ja	-	-	⭐ 3	🟡 october 2025
🔗 nlp2025-tutorial-2	-	-	⭐ 17	🟢 january
🔗 nlp100v2025	-	-	⭐ 91	🟡 april 2025
🔗 public-annotations	-	-	⭐ 13	🟡 april 2025
🔗 topic-models-ao	-	-	⭐ 4	🟡 may 2025
🔗 slp2025	-	-	⭐ 62	🟢 february
🔗 book_impress_it-basic-education-ai	-	-	⭐ 4	🟡 june 2025
🔗 genai-agent-advanced-book	-	-	⭐ 185	🟡 september 2025
🔗 course2024-nlp	-	-	⭐ repo not found	🔴 repo not found
🔗 support-genai-book	-	-	⭐ 89	🟢 january
🔗 ir100	-	-	⭐ 93	🟢 december 2025
🔗 kaggle_llm_book	-	-	⭐ 19	🟢 last saturday

Research summary

Summaries of studies and papers in Japanese NLP research

awesome-bert-japanese - A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
GEC-Info-ja - Repository for collecting and classifying Japanese-language literature on grammatical error correction
dataset-list - Lists of text corpora and more (mainly Japanese)
tuning_playbook_ja - A playbook for systematically maximizing the performance of deep learning models
japanese-pitch-accent-resources - An attempt to consolidate Japanese phonetic resources, pitch accent resources in particular, into one list
awesome-japanese-llm - Overview of open-source Japanese LLMs

Name	downloads/week	total downloads	stars	last commit
🔗 awesome-bert-japanese	-	-	⭐ 131	🔴 march 2023
🔗 GEC-Info-ja	-	-	⭐ 12	🟡 april 2025
🔗 dataset-list	-	-	⭐ 117	🔴 july 2024
🔗 tuning_playbook_ja	-	-	⭐ 190	🔴 january 2023
🔗 japanese-pitch-accent-resources	-	-	⭐ 124	🔴 february 2024
🔗 awesome-japanese-llm	-	-	⭐ 1.3k	🟢 yesterday

Reference

Contributors
