Data-and-AI-Concepts

This repository contains Data Science interview questions covered on my Threads page.

Stars: 152

This repository is a curated collection of data science and AI concepts and interview questions (IQs), covering topics from foundational mathematics to cutting-edge generative AI. It aims to support learners and professionals preparing for data science roles by providing a detailed explanation and a notebook for each concept.

README:

Data and AI concepts and IQs by @AIinMinutes 🎯

This repository contains a curated collection of data science, data analysis, and AI concepts and interview questions (IQs), shared on my Threads page @AIinMinutes. Topics range from foundational mathematics to cutting-edge generative AI, with the aim of supporting learners and professionals preparing for data science roles. 📚

The book based on this repo is currently under development. Check it out here: AI in Minutes.

Generative AI 🤖

Concept #	Concept Name	Notebook
1	Causal Attention	View Notebook
2	Text Decoding Strategies: Greedy vs Beam	View Notebook
3	Layer vs RMS Normalization	View Notebook
4	Multi-head Attention	View Notebook
5	Energy	View Notebook
6	Gaussian Mixture Models	View Notebook
7	Hyperplanes	View Notebook
8	Inner Product	View Notebook
9	Moore-Penrose Inverse	View Notebook
10	Jacobians and Gradients behind Multi-class Classification	View Notebook
11	Norm and Metric	View Notebook
12	Rank One Matrices	View Notebook
13	Autoencoder Latent Space	View Notebook
14	PCA for Anomaly Detection	View Notebook
15	Variational Autoencoder for Anomaly Detection	View Notebook
16	Variational Autoencoder Loss Function	View Notebook
17	Attention Mechanism	View Notebook
18	GELU	View Notebook
19	Orthogonality	View Notebook
20	Perplexity	View Notebook

Machine Learning 🔧

Concept #	Concept Name	Notebook
1	Gini Impurity vs Entropy	View Notebook
2	Agglomerative Clustering	View Notebook
3	Elastic Net	View Notebook
4	Huber Loss	View Notebook
5	Mahalanobis Distance	View Notebook
6	Natural Breaks	View Notebook
7	Oversampling	View Notebook
8	PCA vs Feature Agglomeration	View Notebook
9	Permutation Importance	View Notebook
10	Pseudo R^2	View Notebook

Deep Learning 🧠

Concept #	Concept Name	Notebook
1	Balanced Focal Loss	View Notebook
2	Jensen's Inequality	View Notebook
3	Reparametrization Trick	View Notebook
4	Temperature Scaled Softmax	View Notebook

Interpretable AI 🔍

Concept #	Concept Name	Notebook
1	Logistic Regression Coefficient Interpretation	View Notebook
2	Shapley values and SHAP for ML	View Notebook
3	Counterfactuals	View Notebook

Applied Statistics 📊

Concept #	Concept Name	Notebook
1	Autocorrelation Function vs Partial Autocorrelation Function	View Notebook
2	Adjusted R^2	View Notebook
3	Condition Number	View Notebook
4	Cramér's V	View Notebook
5	Exponentially Weighted Average and Bias Correction	View Notebook
6	Kendall's Tau Rank Correlation	View Notebook
7	Kruskal-Wallis Test	View Notebook
8	Spurious Correlation	View Notebook
9	Leave-One-Out Cross-Validation and PRESS	View Notebook

Multivariate Statistics 📈

Concept #	Concept Name	Notebook
1	Canonical Correlation Analysis	View Notebook
2	Correspondence Analysis	View Notebook
3	Factor Analysis	View Notebook
4	Hotelling's T^2	View Notebook
5	Principal Component Analysis	View Notebook

Mathematical Statistics 🎲

Concept #	Concept Name	Notebook
1	Chebyshev's Inequality	View Notebook
2	Distribution of Minimum	View Notebook
3	Matrix Calculus Jacobians and Gradients	View Notebook
4	Multivariate Normal Distribution	View Notebook
5	Mutual Information	View Notebook
6	Point-Biserial Correlation Coefficient	View Notebook
7	Unbiased vs Consistent Estimator	View Notebook
8	ECDF	View Notebook

Graph Data Science 🌐

Concept #	Concept Name	Notebook
1	User Item Interaction Matrix	View Notebook

Prerequisite Mathematics ➗

Concept #	Concept Name	Notebook
1	Spectral Decomposition	View Notebook

Programming 💻

Concept #	Concept Name	Notebook
1	Kadane's Algorithm	View Notebook
2	Prefix Sum and Sliding Window	View Notebook
3	Pivoting in Pandas	View Notebook

Visualization 📉

Concept #	Concept Name	Notebook
1	Plotnine: Python's ggplot2	View Notebook

Current Concept and IQ Generating Process 🔄

I follow a structured approach to sharing knowledge on Threads: I post concepts and thought-provoking questions (IQs) that stem from my professional experience, academic background, and interview scenarios. Each question is one I have encountered in my work, been asked in an interview, or would consider posing in a data science discussion. I refine each question to maximize conceptual coverage and, at times, deliberately choose intellectually stimulating topics to encourage deeper engagement.

Contributing 🤝

Contributions are welcome! If you have suggestions for new questions, additional resources, or improvements to the current answers, feel free to submit a pull request or open an issue.

License

Code in this repository is licensed under the MIT License.
Content (text, explanations, visualizations, etc.) is licensed under Creative Commons Attribution 4.0 (CC BY 4.0).

This means:

You are free to use and modify the code as per the MIT license.
You may reuse and share content, but you must provide proper attribution.

For details, check the LICENSE file.

Contact 📫

Email: [email protected]

For more updates, follow me on Threads @AIinMinutes.
