imodels
Interpretable ML package π for concise, transparent, and accurate predictive modeling (sklearn-compatible).
Stars: 1378
Python package for concise, transparent, and accurate predictive modeling. All sklearn-compatible and easy to use. _For interpretability in NLP, check out our new package:imodelsX _
README:
Python package for concise, transparent, and accurate predictive modeling.
All sklearn-compatible and easy to use.
For interpretability in NLP, check out our new package: imodelsX
π docs β’ π demo notebooks
Modern machine-learning models are increasingly complex, often making them difficult to interpret. This package provides a simple interface for fitting and using state-of-the-art interpretable models, all compatible with scikit-learn. These models can often replace black-box models (e.g. random forests) with simpler models (e.g. rule lists) while improving interpretability and computational efficiency, all without sacrificing predictive accuracy! Simply import a classifier or regressor and use the fit
and predict
methods, same as standard scikit-learn models.
from sklearn.model_selection import train_test_split
from imodels import get_clean_dataset, HSTreeClassifierCV # import any imodels model here
# prepare data (a sample clinical dataset)
X, y, feature_names = get_clean_dataset('csi_pecarn_pred')
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42)
# fit the model
model = HSTreeClassifierCV(max_leaf_nodes=4) # initialize a tree model and specify only 4 leaf nodes
model.fit(X_train, y_train, feature_names=feature_names) # fit model
preds = model.predict(X_test) # discrete predictions: shape is (n_test, 1)
preds_proba = model.predict_proba(X_test) # predicted probabilities: shape is (n_test, n_classes)
print(model) # print the model
------------------------------
Decision Tree with Hierarchical Shrinkage
Prediction is made by looking at the value in the appropriate leaf of the tree
------------------------------
|--- FocalNeuroFindings2 <= 0.50
| |--- HighriskDiving <= 0.50
| | |--- Torticollis2 <= 0.50
| | | |--- value: [0.10]
| | |--- Torticollis2 > 0.50
| | | |--- value: [0.30]
| |--- HighriskDiving > 0.50
| | |--- value: [0.68]
|--- FocalNeuroFindings2 > 0.50
| |--- value: [0.42]
Install with pip install imodels
(see here for help).
ποΈ Docs β π Research paper β π Reference code implementation
Model | Reference | Description |
---|---|---|
Rulefit rule set | ποΈ, π, π | Fits a sparse linear model on rules extracted from decision trees |
Skope rule set | ποΈ, π | Extracts rules from gradient-boosted trees, deduplicates them, then linearly combines them based on their OOB precision |
Boosted rule set | ποΈ, π, π | Sequentially fits a set of rules with Adaboost |
Slipper rule set | ποΈ, π | Sequentially learns a set of rules with SLIPPER |
Bayesian rule set | ποΈ, π, π | Finds concise rule set with Bayesian sampling (slow) |
Optimal rule list | ποΈ, π, π | Fits rule list using global optimization for sparsity (CORELS) |
Bayesian rule list | ποΈ, π, π | Fits compact rule list distribution with Bayesian sampling (slow) |
Greedy rule list | ποΈ, π | Uses CART to fit a list (only a single path), rather than a tree |
OneR rule list | ποΈ, π | Fits rule list restricted to only one feature |
Optimal rule tree | ποΈ, π, π | Fits succinct tree using global optimization for sparsity (GOSDT) |
Greedy rule tree | ποΈ, π, π | Greedily fits tree using CART |
C4.5 rule tree | ποΈ, π, π | Greedily fits tree using C4.5 |
TAO rule tree | ποΈ, π | Fits tree using alternating optimization |
Iterative random forest |
ποΈ, π, π | Repeatedly fit random forest, giving features with high importance a higher chance of being selected |
Sparse integer linear model |
ποΈ, π | Sparse linear model with integer coefficients |
Tree GAM | ποΈ, π, π | Generalized additive model fit with short boosted trees |
Greedy treesums (FIGS) | ποΈ,γ €π | Sum of small trees with very few total rules (FIGS) |
Hierarchical shrinkage wrapper |
ποΈ, π | Improve a decision tree, random forest, or gradient-boosting ensemble with ultra-fast, post-hoc regularization |
RF+ (MDI+) | ποΈ, π | Flexible random forest-based feature importance |
Distillation wrapper |
ποΈ | Train a black-box model, then distill it into an interpretable model |
AutoML wrapper | ποΈ | Automatically fit and select an interpretable model |
More models | β | (Coming soon!) Lightweight Rule Induction, MLRules, ... |
Demos are contained in the notebooks folder.
Quickstart demo
Shows how to fit, predict, and visualize with different interpretable modelsAutogluon demo
Fit/select an interpretable model automatically using Autogluon AutoMLQuickstart colab demo
Shows how to fit, predict, and visualize with different interpretable modelsClinical decision rule notebook
Shows an example of usingimodels
for deriving a clinical decision rule
Posthoc analysis
We also include some demos of posthoc analysis, which occurs after fitting models: posthoc.ipynb shows different simple analyses to interpret a trained model and uncertainty.ipynb contains basic code to get uncertainty estimates for a modelThe final form of the above models takes one of the following forms, which aim to be simultaneously simple to understand and highly predictive:
Rule set | Rule list | Rule tree | Algebraic models |
---|---|---|---|
Different models and algorithms vary not only in their final form but also in different choices made during modeling, such as how they generate, select, and postprocess rules:
Rule candidate generation | Rule selection | Rule postprocessing |
---|---|---|
Ex. RuleFit vs. SkopeRules
RuleFit and SkopeRules differ only in the way they prune rules: RuleFit uses a linear model whereas SkopeRules heuristically deduplicates rules sharing overlap.Ex. Bayesian rule lists vs. greedy rule lists
Bayesian rule lists and greedy rule lists differ in how they select rules; bayesian rule lists perform a global optimization over possible rule lists while Greedy rule lists pick splits sequentially to maximize a given criterion.Ex. FPSkope vs. SkopeRules
FPSkope and SkopeRules differ only in the way they generate candidate rules: FPSkope uses FPgrowth whereas SkopeRules extracts rules from decision trees.Different models support different machine-learning tasks. Current support for different models is given below (each of these models can be imported directly from imodels (e.g. from imodels import RuleFitClassifier
):
Model | Binary classification | Regression | Notes |
---|---|---|---|
Rulefit rule set | RuleFitClassifier | RuleFitRegressor | |
Skope rule set | SkopeRulesClassifier | ||
Boosted rule set | BoostedRulesClassifier | BoostedRulesRegressor | |
SLIPPER rule set | SlipperClassifier | ||
Bayesian rule set | BayesianRuleSetClassifier | Fails for large problems | |
Optimal rule list (CORELS) | OptimalRuleListClassifier | Requires corels, fails for large problems | |
Bayesian rule list | BayesianRuleListClassifier | ||
Greedy rule list | GreedyRuleListClassifier | ||
OneR rule list | OneRClassifier | ||
Optimal rule tree (GOSDT) | OptimalTreeClassifier | Requires gosdt, fails for large problems | |
Greedy rule tree (CART) | GreedyTreeClassifier | GreedyTreeRegressor | |
C4.5 rule tree | C45TreeClassifier | ||
TAO rule tree | TaoTreeClassifier | TaoTreeRegressor | |
Iterative random forest | IRFClassifier | Requires irf | |
Sparse integer linear model | SLIMClassifier | SLIMRegressor | Requires extra dependencies for speed |
Tree GAM | TreeGAMClassifier | TreeGAMRegressor | |
Greedy tree sums (FIGS) | FIGSClassifier | FIGSRegressor | |
Hierarchical shrinkage | HSTreeClassifierCV | HSTreeRegressorCV | Wraps any sklearn tree-based model |
Distillation | DistilledRegressor | Wraps any sklearn-compatible models | |
AutoML model | AutoInterpretableClassifierοΈ | AutoInterpretableRegressorοΈ |
Data-wrangling functions for working with popular tabular datasets (e.g. compas).
These functions, in conjunction with imodels-data and imodels-experiments, make it simple to download data and run experiments on new models.Explain classification errors with a simple posthoc function.
Fit an interpretable model to explain a previous model's errors (ex. in this notebookπ).Fast and effective discretizers for data preprocessing.
Discretizer | Reference | Description |
---|---|---|
MDLP | ποΈ, π, π | Discretize using entropy minimization heuristic |
Simple | ποΈ, π | Simple KBins discretization |
Random Forest | ποΈ | Discretize into bins based on random forest split popularity |
Rule-based utils for customizing models
The code here contains many useful and customizable functions for rule-based learning in the util folder. This includes functions / classes for rule deduplication, rule screening, and converting between trees, rulesets, and neural networks.After developing and playing with imodels
, we developed a few new models to overcome limitations of existing interpretable models.
π Paper, π Post, π Citation
Fast Interpretable Greedy-Tree Sums (FIGS) is an algorithm for fitting concise rule-based models. Specifically, FIGS generalizes CART to simultaneously grow a flexible number of trees in a summation. The total number of splits across all the trees can be restricted by a pre-specified threshold, keeping the model interpretable. Experiments across a wide array of real-world datasets show that FIGS achieves state-of-the-art prediction performance when restricted to just a few splits (e.g. less than 20).
Example FIGS model. FIGS learns a sum of trees with a flexible number of trees; to make its prediction, it sums the result from each tree.
π Paper (ICML 2022), π Post, π Citation
Hierarchical shrinkage is an extremely fast post-hoc regularization method which works on any decision tree (or tree-based ensemble, such as Random Forest). It does not modify the tree structure, and instead regularizes the tree by shrinking the prediction over each node towards the sample means of its ancestors (using a single regularization parameter). Experiments over a wide variety of datasets show that hierarchical shrinkage substantially increases the predictive performance of individual decision trees and decision-tree ensembles.
HS Example. HS applies post-hoc regularization to any decision tree by shrinking each node towards its parent.
π Paper, π Post, π Citation
MDI+ is a novel feature importance framework, which generalizes the popular mean decrease in impurity (MDI) importance score for random forests. At its core, MDI+ expands upon a recently discovered connection between linear regression and decision trees. In doing so, MDI+ enables practitioners to (1) tailor the feature importance computation to the data/problem structure and (2) incorporate additional features or knowledge to mitigate known biases of decision trees. In both real data case studies and extensive real-data-inspired simulations, MDI+ outperforms commonly used feature importance measures (e.g., MDI, permutation-based scores, and TreeSHAP) by substantional margins.
Readings
Reference implementations (also linked above)
The code here heavily derives from the wonderful work of previous projects. We seek to to extract out, unify, and maintain key parts of these projects.- pycorels - by @fingoldin and the original CORELS team
- sklearn-expertsys - by @tmadl and @kenben based on original code by Ben Letham
- rulefit - by @christophM
- skope-rules - by the skope-rules team (including @ngoix, @floriangardin, @datajms, Bibi Ndiaye, Ronan Gautier)
- boa - by @wangtongada
Related packages
- gplearn: symbolic regression/classification
- pysr: fast symbolic regression
- pygam: generative additive models
- interpretml: boosting-based gam
- h20 ai: gams + glms (and more)
- optbinning: data discretization / scoring models
Updates
- For updates, star the repo, see this related repo, or follow @csinva_
- Please make sure to give authors of original methods / base implementations appropriate credit!
- Contributing: pull requests very welcome!
Please cite the package if you use it in an academic work :)
@software{
imodels2021,
title = {imodels: a python package for fitting interpretable models},
journal = {Journal of Open Source Software},
publisher = {The Open Journal},
year = {2021},
author = {Singh, Chandan and Nasseri, Keyan and Tan, Yan Shuo and Tang, Tiffany and Yu, Bin},
volume = {6},
number = {61},
pages = {3192},
doi = {10.21105/joss.03192},
url = {https://doi.org/10.21105/joss.03192},
}
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for imodels
Similar Open Source Tools
imodels
Python package for concise, transparent, and accurate predictive modeling. All sklearn-compatible and easy to use. _For interpretability in NLP, check out our new package:imodelsX _
amber-train
Amber is the first model in the LLM360 family, an initiative for comprehensive and fully open-sourced LLMs. It is a 7B English language model with the LLaMA architecture. The model type is a language model with the same architecture as LLaMA-7B. It is licensed under Apache 2.0. The resources available include training code, data preparation, metrics, and fully processed Amber pretraining data. The model has been trained on various datasets like Arxiv, Book, C4, Refined-Web, StarCoder, StackExchange, and Wikipedia. The hyperparameters include a total of 6.7B parameters, hidden size of 4096, intermediate size of 11008, 32 attention heads, 32 hidden layers, RMSNorm Ξ΅ of 1e^-6, max sequence length of 2048, and a vocabulary size of 32000.
LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing
LLM-PowerHouse is a comprehensive and curated guide designed to empower developers, researchers, and enthusiasts to harness the true capabilities of Large Language Models (LLMs) and build intelligent applications that push the boundaries of natural language understanding. This GitHub repository provides in-depth articles, codebase mastery, LLM PlayLab, and resources for cost analysis and network visualization. It covers various aspects of LLMs, including NLP, models, training, evaluation metrics, open LLMs, and more. The repository also includes a collection of code examples and tutorials to help users build and deploy LLM-based applications.
Groma
Groma is a grounded multimodal assistant that excels in region understanding and visual grounding. It can process user-defined region inputs and generate contextually grounded long-form responses. The tool presents a unique paradigm for multimodal large language models, focusing on visual tokenization for localization. Groma achieves state-of-the-art performance in referring expression comprehension benchmarks. The tool provides pretrained model weights and instructions for data preparation, training, inference, and evaluation. Users can customize training by starting from intermediate checkpoints. Groma is designed to handle tasks related to detection pretraining, alignment pretraining, instruction finetuning, instruction following, and more.
nncf
Neural Network Compression Framework (NNCF) provides a suite of post-training and training-time algorithms for optimizing inference of neural networks in OpenVINOβ’ with a minimal accuracy drop. It is designed to work with models from PyTorch, TorchFX, TensorFlow, ONNX, and OpenVINOβ’. NNCF offers samples demonstrating compression algorithms for various use cases and models, with the ability to add different compression algorithms easily. It supports GPU-accelerated layers, distributed training, and seamless combination of pruning, sparsity, and quantization algorithms. NNCF allows exporting compressed models to ONNX or TensorFlow formats for use with OpenVINOβ’ toolkit, and supports Accuracy-Aware model training pipelines via Adaptive Compression Level Training and Early Exit Training.
floneum
Floneum is a graph editor that makes it easy to develop your own AI workflows. It uses large language models (LLMs) to run AI models locally, without any external dependencies or even a GPU. This makes it easy to use LLMs with your own data, without worrying about privacy. Floneum also has a plugin system that allows you to improve the performance of LLMs and make them work better for your specific use case. Plugins can be used in any language that supports web assembly, and they can control the output of LLMs with a process similar to JSONformer or guidance.
AV-Deepfake1M
The AV-Deepfake1M repository is the official repository for the paper AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset. It addresses the challenge of detecting and localizing deepfake audio-visual content by proposing a dataset containing video manipulations, audio manipulations, and audio-visual manipulations for over 2K subjects resulting in more than 1M videos. The dataset is crucial for developing next-generation deepfake localization methods.
dl_model_infer
This project is a c++ version of the AI reasoning library that supports the reasoning of tensorrt models. It provides accelerated deployment cases of deep learning CV popular models and supports dynamic-batch image processing, inference, decode, and NMS. The project has been updated with various models and provides tutorials for model exports. It also includes a producer-consumer inference model for specific tasks. The project directory includes implementations for model inference applications, backend reasoning classes, post-processing, pre-processing, and target detection and tracking. Speed tests have been conducted on various models, and onnx downloads are available for different models.
langchain_dart
LangChain.dart is a Dart port of the popular LangChain Python framework created by Harrison Chase. LangChain provides a set of ready-to-use components for working with language models and a standard interface for chaining them together to formulate more advanced use cases (e.g. chatbots, Q&A with RAG, agents, summarization, extraction, etc.). The components can be grouped into a few core modules: * **Model I/O:** LangChain offers a unified API for interacting with various LLM providers (e.g. OpenAI, Google, Mistral, Ollama, etc.), allowing developers to switch between them with ease. Additionally, it provides tools for managing model inputs (prompt templates and example selectors) and parsing the resulting model outputs (output parsers). * **Retrieval:** assists in loading user data (via document loaders), transforming it (with text splitters), extracting its meaning (using embedding models), storing (in vector stores) and retrieving it (through retrievers) so that it can be used to ground the model's responses (i.e. Retrieval-Augmented Generation or RAG). * **Agents:** "bots" that leverage LLMs to make informed decisions about which available tools (such as web search, calculators, database lookup, etc.) to use to accomplish the designated task. The different components can be composed together using the LangChain Expression Language (LCEL).
SemanticFinder
SemanticFinder is a frontend-only live semantic search tool that calculates embeddings and cosine similarity client-side using transformers.js and SOTA embedding models from Huggingface. It allows users to search through large texts like books with pre-indexed examples, customize search parameters, and offers data privacy by keeping input text in the browser. The tool can be used for basic search tasks, analyzing texts for recurring themes, and has potential integrations with various applications like wikis, chat apps, and personal history search. It also provides options for building browser extensions and future ideas for further enhancements and integrations.
pipeline
Pipeline is a Python library designed for constructing computational flows for AI/ML models. It supports both development and production environments, offering capabilities for inference, training, and finetuning. The library serves as an interface to Mystic, enabling the execution of pipelines at scale and on enterprise GPUs. Users can also utilize this SDK with Pipeline Core on a private hosted cluster. The syntax for defining AI/ML pipelines is reminiscent of sessions in Tensorflow v1 and Flows in Prefect.
recommenders
Recommenders is a project under the Linux Foundation of AI and Data that assists researchers, developers, and enthusiasts in prototyping, experimenting with, and bringing to production a range of classic and state-of-the-art recommendation systems. The repository contains examples and best practices for building recommendation systems, provided as Jupyter notebooks. It covers tasks such as preparing data, building models using various recommendation algorithms, evaluating algorithms, tuning hyperparameters, and operationalizing models in a production environment on Azure. The project provides utilities to support common tasks like loading datasets, evaluating model outputs, and splitting training/test data. It includes implementations of state-of-the-art algorithms for self-study and customization in applications.
Pearl
Pearl is a production-ready Reinforcement Learning AI agent library open-sourced by the Applied Reinforcement Learning team at Meta. It enables researchers and practitioners to develop Reinforcement Learning AI agents that prioritize cumulative long-term feedback over immediate feedback and can adapt to environments with limited observability, sparse feedback, and high stochasticity. Pearl offers a diverse set of unique features for production environments, including dynamic action spaces, offline learning, intelligent neural exploration, safe decision making, history summarization, and data augmentation.
TableLLM
TableLLM is a large language model designed for efficient tabular data manipulation tasks in real office scenarios. It can generate code solutions or direct text answers for tasks like insert, delete, update, query, merge, and chart operations on tables embedded in spreadsheets or documents. The model has been fine-tuned based on CodeLlama-7B and 13B, offering two scales: TableLLM-7B and TableLLM-13B. Evaluation results show its performance on benchmarks like WikiSQL, Spider, and self-created table operation benchmark. Users can use TableLLM for code and text generation tasks on tabular data.
llm-datasets
LLM Datasets is a repository containing high-quality datasets, tools, and concepts for LLM fine-tuning. It provides datasets with characteristics like accuracy, diversity, and complexity to train large language models for various tasks. The repository includes datasets for general-purpose, math & logic, code, conversation & role-play, and agent & function calling domains. It also offers guidance on creating high-quality datasets through data deduplication, data quality assessment, data exploration, and data generation techniques.
PIXIU
PIXIU is a project designed to support the development, fine-tuning, and evaluation of Large Language Models (LLMs) in the financial domain. It includes components like FinBen, a Financial Language Understanding and Prediction Evaluation Benchmark, FIT, a Financial Instruction Dataset, and FinMA, a Financial Large Language Model. The project provides open resources, multi-task and multi-modal financial data, and diverse financial tasks for training and evaluation. It aims to encourage open research and transparency in the financial NLP field.
For similar tasks
imodels
Python package for concise, transparent, and accurate predictive modeling. All sklearn-compatible and easy to use. _For interpretability in NLP, check out our new package:imodelsX _
vectordb-recipes
This repository contains examples, applications, starter code, & tutorials to help you kickstart your GenAI projects. * These are built using LanceDB, a free, open-source, serverless vectorDB that **requires no setup**. * It **integrates into python data ecosystem** so you can simply start using these in your existing data pipelines in pandas, arrow, pydantic etc. * LanceDB has **native Typescript SDK** using which you can **run vector search** in serverless functions! This repository is divided into 3 sections: - Examples - Get right into the code with minimal introduction, aimed at getting you from an idea to PoC within minutes! - Applications - Ready to use Python and web apps using applied LLMs, VectorDB and GenAI tools - Tutorials - A curated list of tutorials, blogs, Colabs and courses to get you started with GenAI in greater depth.
For similar jobs
lollms-webui
LoLLMs WebUI (Lord of Large Language Multimodal Systems: One tool to rule them all) is a user-friendly interface to access and utilize various LLM (Large Language Models) and other AI models for a wide range of tasks. With over 500 AI expert conditionings across diverse domains and more than 2500 fine tuned models over multiple domains, LoLLMs WebUI provides an immediate resource for any problem, from car repair to coding assistance, legal matters, medical diagnosis, entertainment, and more. The easy-to-use UI with light and dark mode options, integration with GitHub repository, support for different personalities, and features like thumb up/down rating, copy, edit, and remove messages, local database storage, search, export, and delete multiple discussions, make LoLLMs WebUI a powerful and versatile tool.
Azure-Analytics-and-AI-Engagement
The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (Containing a demo web application, Power BI reports, Synapse resources, AML Notebooks etc.) that can be deployed in a customerβs subscription using the CAPE tool within a matter of few hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.
minio
MinIO is a High Performance Object Storage released under GNU Affero General Public License v3.0. It is API compatible with Amazon S3 cloud storage service. Use MinIO to build high performance infrastructure for machine learning, analytics and application data workloads.
mage-ai
Mage is an open-source data pipeline tool for transforming and integrating data. It offers an easy developer experience, engineering best practices built-in, and data as a first-class citizen. Mage makes it easy to build, preview, and launch data pipelines, and provides observability and scaling capabilities. It supports data integrations, streaming pipelines, and dbt integration.
AiTreasureBox
AiTreasureBox is a versatile AI tool that provides a collection of pre-trained models and algorithms for various machine learning tasks. It simplifies the process of implementing AI solutions by offering ready-to-use components that can be easily integrated into projects. With AiTreasureBox, users can quickly prototype and deploy AI applications without the need for extensive knowledge in machine learning or deep learning. The tool covers a wide range of tasks such as image classification, text generation, sentiment analysis, object detection, and more. It is designed to be user-friendly and accessible to both beginners and experienced developers, making AI development more efficient and accessible to a wider audience.
tidb
TiDB is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.
airbyte
Airbyte is an open-source data integration platform that makes it easy to move data from any source to any destination. With Airbyte, you can build and manage data pipelines without writing any code. Airbyte provides a library of pre-built connectors that make it easy to connect to popular data sources and destinations. You can also create your own connectors using Airbyte's no-code Connector Builder or low-code CDK. Airbyte is used by data engineers and analysts at companies of all sizes to build and manage their data pipelines.
labelbox-python
Labelbox is a data-centric AI platform for enterprises to develop, optimize, and use AI to solve problems and power new products and services. Enterprises use Labelbox to curate data, generate high-quality human feedback data for computer vision and LLMs, evaluate model performance, and automate tasks by combining AI and human-centric workflows. The academic & research community uses Labelbox for cutting-edge AI research.