catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.

Stars: 706

Catalyst is a fast, modern, pure-C# Natural Language Processing library inspired by spaCy's design. It targets .NET Standard 2.0 and runs cross-platform on Windows, Linux, macOS, and ARM. It offers non-destructive tokenization, named entity recognition, part-of-speech tagging, language detection, and efficient binary serialization, along with pre-built models for language packages and lemmatization, support for training word and document embeddings, and flexible entity recognition models. Models can be stored and loaded using streams.

To get started, install the NuGet package and set the storage to use the online repository; models are then lazy loaded from disk or downloaded from the online repository. C# lazy evaluation and native multi-threading support make it possible to process documents in parallel. Training a new FastText word2vec embedding model is straightforward, and Catalyst also provides algorithms for fast embedding search and dimensionality reduction.

README:

catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.

⚡ Features

Language Packages ✨

All language-specific data and models are provided as NuGet packages; the full set of packages is listed on NuGet.

The new models are trained on Universal Dependencies v2.7.

We've also added the option to store and load models using streams:

// Creates and stores the model
var isApattern = new PatternSpotter(Language.English, 0, tag: "is-a-pattern", captureTag: "IsA");
isApattern.NewPattern(
    "Is+Noun",
    mp => mp.Add(
        new PatternUnit(P.Single().WithToken("is").WithPOS(PartOfSpeech.VERB)),
        new PatternUnit(P.Multiple().WithPOS(PartOfSpeech.NOUN, PartOfSpeech.PROPN, PartOfSpeech.AUX, PartOfSpeech.DET, PartOfSpeech.ADJ))
));
// File.Create truncates an existing file; File.OpenWrite would leave stale trailing bytes behind
using(var f = File.Create("my-pattern-spotter.bin"))
{
    await isApattern.StoreAsync(f);
}

// Load the model back from disk
var isApattern2 = new PatternSpotter(Language.English, 0, tag: "is-a-pattern", captureTag: "IsA");

using(var f = File.OpenRead("my-pattern-spotter.bin"))
{
    await isApattern2.LoadAsync(f);
}

✨ Getting Started

Using catalyst is as simple as installing its NuGet package and setting the storage to use our online repository. Models are then lazy loaded, either from disk or by downloading them from the online repository. Also check out the sample projects for more examples of how to use catalyst.

Catalyst.Models.English.Register(); //You need to pre-register each language (and install the respective NuGet Packages)

Storage.Current = new DiskStorage("catalyst-models");
var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.ProcessSingle(doc);
Console.WriteLine(doc.ToJson());

You can also take advantage of C# lazy evaluation and native multi-threading support to process a large number of documents in parallel:

var docs = GetDocuments();
var parsed = nlp.Process(docs);
DoSomething(parsed);

IEnumerable<IDocument> GetDocuments()
{
    //Generates a few documents, to demonstrate multi-threading & lazy evaluation
    for(int i = 0; i < 1000; i++)
    {
        yield return new Document("The quick brown fox jumps over the lazy dog", Language.English);
    }
}

void DoSomething(IEnumerable<IDocument> docs)
{
    foreach(var doc in docs)
    {
        Console.WriteLine(doc.ToJson());
    }
}
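
The lazy-evaluation and multi-threading pattern used above can be tried without any models installed. The following is a minimal sketch of the general C# mechanism only: it uses the .NET base class library rather than catalyst, `GetTexts` and `CountTokens` are hypothetical stand-ins for a document source and for per-document work, and the use of PLINQ is an assumption made for illustration, not a description of how `nlp.Process` is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyParallelSketch
{
    // Hypothetical document source: each text is produced only when the consumer asks for it.
    public static IEnumerable<string> GetTexts(int count)
    {
        for (int i = 0; i < count; i++)
        {
            yield return $"The quick brown fox jumps over the lazy dog #{i}";
        }
    }

    // Hypothetical per-document work: a whitespace token count.
    public static int CountTokens(string text) =>
        text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;

    public static void Main()
    {
        // Nothing runs yet: both the iterator and the PLINQ query are deferred.
        var counts = GetTexts(1000)
            .AsParallel()   // spread the work across the available cores
            .AsOrdered()    // return results in input order
            .Select(CountTokens);

        // Enumerating the query pulls texts from the iterator and processes them in parallel.
        long total = 0;
        foreach (var c in counts)
        {
            total += c;
        }

        Console.WriteLine(total);
    }
}
```

Because the source is an iterator, the 1000 texts are never materialized as a list up front; they are generated as the workers request them, which is the same reason a lazily evaluated document stream keeps memory use low on large corpora.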

Training a new FastText word2vec embedding model is as simple as this:

var nlp = await Pipeline.ForAsync(Language.English);
var ft = new FastText(Language.English, 0, "wiki-word2vec");
ft.Data.Type = FastText.ModelType.CBow;
ft.Data.Loss = FastText.LossType.NegativeSampling;
ft.Train(nlp.Process(GetDocs()));
await ft.StoreAsync(); // await the store so the model is fully written before continuing

For fast embedding search, we have also released a C# version of the "Hierarchical Navigable Small World" (HNSW) algorithm on NuGet, based on our fork of Microsoft's HNSW.Net. We have also released a C# version of the "Uniform Manifold Approximation and Projection" (UMAP) algorithm for dimensionality reduction on GitHub and on NuGet.

📖 Links

Documentation
Contribute: How to contribute to the catalyst codebase
Samples: Sample projects demonstrating catalyst capabilities
Gitter: Join our Gitter channel
