PIXIU
This repository introduces PIXIU, an open-source resource featuring the first financial large language models (LLMs), instruction tuning data, and evaluation benchmarks to holistically assess financial LLMs. Our goal is to continually push forward the open-source development of financial artificial intelligence (AI).
Stars: 525
PIXIU is a project designed to support the development, fine-tuning, and evaluation of Large Language Models (LLMs) in the financial domain. It includes components like FinBen, a Financial Language Understanding and Prediction Evaluation Benchmark, FIT, a Financial Instruction Dataset, and FinMA, a Financial Large Language Model. The project provides open resources, multi-task and multi-modal financial data, and diverse financial tasks for training and evaluation. It aims to encourage open research and transparency in the financial NLP field.
README:
Pixiu Paper | FinBen Leaderboard
Disclaimer
This repository and its contents are provided for academic and educational purposes only. None of the material constitutes financial, legal, or investment advice. No warranties, express or implied, are offered regarding the accuracy, completeness, or utility of the content. The authors and contributors are not responsible for any errors, omissions, or any consequences arising from the use of the information herein. Users should exercise their own judgment and consult professionals before making any financial, legal, or investment decisions. The use of the software and information contained in this repository is entirely at the user's own risk.
By using or accessing the information in this repository, you agree to indemnify, defend, and hold harmless the authors, contributors, and any affiliated organizations or persons from any and all claims or damages.
📢 Update (Date: 09-22-2023)
🚀 We're thrilled to announce that our paper, "PIXIU: A Comprehensive Benchmark, Instruction Dataset and Large Language Model for Finance", has been accepted by NeurIPS 2023 Track Datasets and Benchmarks!
📢 Update (Date: 10-08-2023)
🌏 We're proud to share that the enhanced versions of FinBen, which now support both Chinese and Spanish!
📢 Update (Date: 02-20-2024)
🌏 We're delighted to share that our paper, "The FinBen: An Holistic Financial Benchmark for Large Language Models", is now available at FinBen.
📢 Update (Date: 05-02-2024)
🌏 We're pleased to invite you to attend the IJCAI2024-challenge, "Financial Challenges in Large Language Models - FinLLM", the starter-kit is available at Starter-kit.
Checkpoints:
Languages
Papers
- PIXIU: A Comprehensive Benchmark, Instruction Dataset and Large Language Model for Finance
- The FinBen: An Holistic Financial Benchmark for Large Language Models
- No Language is an Island: Unifying Chinese and English in Financial Large Language Models, Instruction Data, and Benchmarks
- Dólares or Dollars? Unraveling the Bilingual Prowess of Financial LLMs Between Spanish and English
Evaluations:
- English Evaluation Datasets (More details on FinBen section)
- Spanish Evaluation Datasets
- Chinese Evaluation Datasets
Sentiment Analysis
Classification
- Headlines (flare_headlines)
- FinArg ECC Task1 (flare_finarg_ecc_auc)
- FinArg ECC Task2 (flare_finarg_ecc_arc)
- CFA (flare_cfa)
- MultiFin EN (flare_multifin_en)
- M&A (flare_ma)
- MLESG EN (flare_mlesg)
Knowledge Extraction
- NER (flare_ner)
- Finer Ord (flare_finer_ord)
- FinRED (flare_finred)
- FinCausal20 Task1 (flare_causal20_sc)
- FinCausal20 Task2 (flare_cd)
Number Understanding
Text Summarization
Credit Scoring
- German (flare_german)
- Australian (flare_australian)
- Lendingclub (flare_cra_lendingclub)
- Credit Card Fraud (flare_cra_ccf)
- ccFraud (flare_cra_ccfraud)
- Polish (flare_cra_polish)
- Taiwan Economic Journal (flare_cra_taiwan)
- PortoSeguro (flare_cra_portoseguro)
- Travle Insurance (flare_cra_travelinsurance)
Forecasting
- BigData22 for Stock Movement (flare_sm_bigdata)
- ACL18 for Stock Movement (flare_sm_acl)
- CIKM18 for Stock Movement (flare_sm_cikm)
Welcome to the PIXIU project! This project is designed to support the development, fine-tuning, and evaluation of Large Language Models (LLMs) in the financial domain. PIXIU is a significant step towards understanding and harnessing the power of LLMs in the financial domain.
The repository is organized into several key components, each serving a unique purpose in the financial NLP pipeline:
-
FinBen: Our Financial Language Understanding and Prediction Evaluation Benchmark. FinBen serves as the evaluation suite for financial LLMs, with a focus on understanding and prediction tasks across various financial contexts.
-
FIT: Our Financial Instruction Dataset. FIT is a multi-task and multi-modal instruction dataset specifically tailored for financial tasks. It serves as the training ground for fine-tuning LLMs for these tasks.
-
FinMA: Our Financial Large Language Model (LLM). FinMA is the core of our project, providing the learning and prediction power for our financial tasks.
-
Open resources: PIXIU openly provides the financial LLM, instruction tuning data, and datasets included in the evaluation benchmark to encourage open research and transparency.
-
Multi-task: The instruction tuning data and benchmark in PIXIU cover a diverse set of financial tasks, including four financial NLP tasks and one financial prediction task.
-
Multi-modality: PIXIU's instruction tuning data and benchmark consist of multi-modality financial data, including time series data from the stock movement prediction task. It covers various types of financial texts, including reports, news articles, tweets, and regulatory filings.
-
Diversity: Unlike previous benchmarks focusing mainly on financial NLP tasks, PIXIU's evaluation benchmark includes critical financial prediction tasks aligned with real-world scenarios, making it more challenging.
In this section, we provide a detailed performance analysis of FinMA compared to other leading models, including ChatGPT, GPT-4, and BloombergGPT et al. For this analysis, we've chosen a range of tasks and metrics that span various aspects of financial Natural Language Processing and financial prediction. All model results of FinBen can be found on our leaderboard!
Data | Task | Raw | Data Types | Modalities | License | Paper |
---|---|---|---|---|---|---|
FPB | sentiment analysis | 4,845 | news | text | CC BY-SA 3.0 | [1] |
FiQA-SA | sentiment analysis | 1,173 | news headlines, tweets | text | Public | [2] |
TSA | sentiment analysis | 561 | news headlines | text | CC BY-NC-SA 4.0 | [3] |
FOMC | hawkish-dovish classification | 496 | FOMC transcripts | text | CC BY-NC 4.0 | [4] |
Headlines | news headline classification | 11,412 | news headlines | text | CC BY-SA 3.0 | [5] |
FinArg-ECC-Task1 | argument unit classification | 969 | earnings conference call | text | CC BY-NC-SA 4.0 | [6] |
FinArg-ECC-Task2 | argument relation classification | 690 | earnings conference call | text | CC BY-NC-SA 4.0 | [6] |
Multifin EN | multi-class classification | 546 | article headlines | text | Public | [7] |
M&A | deal completeness classification | 500 | news articles, tweets | text | Public | [8] |
MLESG EN | ESG Issue Identification | 300 | news articles | text | CC BY-NC-ND | [9] |
NER | named entity recognition | 1,366 | financial agreements | text | CC BY-SA 3.0 | [10] |
Finer Ord | named entity recognition | 1,080 | news articles | text | CC BY-NC 4.0 | [11] |
FinRED | relation extraction | 1,070 | earning call transcipts | text | Public | [12] |
FinCausual 2020 Task1 | causal classification | 8,630 | news articles, SEC | text | CC BY 4.0 | [13] |
FinCausual 2020 Task2 | causal detection | 226 | news articles, SEC | text | CC BY 4.0 | [13] |
FinQA | question answering | 8,281 | earnings reports | text, table | MIT License | [14] |
TatQA | question answering | 1,670 | financial reports | text, table | MIT License | [15] |
FNXL | numeric labeling | 318 | SEC | text | Public | [16] |
FSRL | token classification | 97 | news articles | text | MIT License | [17] |
ECTSUM | text summarization | 495 | earning call transcipts | text | Public | [18] |
EDTSUM | text summarization | 2000 | news articles | text | Public | [19] |
German | credit scoring | 1000 | credit records | table | CC BY 4.0 | [20] |
Australian | credit scoring | 690 | credit records | table | CC BY 4.0 | [21] |
Lending Club | credit scoring | 1,3453 | financial information | table | CC0 1.0 | [22] |
BigData22 | stock movement prediction | 7,164 | tweets, historical prices | text, time series | Public | [23] |
ACL18 | stock movement prediction | 27,053 | tweets, historical prices | text, time series | MIT License | [24] |
CIKM18 | stock movement prediction | 4,967 | tweets, historical prices | text, time series | Public | [25] |
ConvFinQA | multi-turn question answering | 1,490 | earnings reports | text, table | MIT License | [26] |
Credit Card Fraud | Fraud Detection | 11,392 | financial information | table | (DbCL) v1.0 | [22] |
ccFraud | Fraud Detection | 10,485 | financial information | table | Public | [22] |
Polish | Financial Distress Identification | 8,681 | financial status features | table | CC BY 4.0 | [22] |
Taiwan Economic Journal | Financial Distress Identification | 6,819 | financial status features | table | CC BY 4.0 | [22] |
PortoSeguro | Claim Analysis | 11,904 | claim and financial information | table | Public | [22] |
Travel Insurance | Claim Analysis | 12,665 | claim and financial information | table | (ODbL) v1.0 | [22] |
1. Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. 2014. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology 65, 4 (2014), 782–796.
2. Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. Www’18 open challenge: financial opinion mining and question answering. In Companion proceedings of the the web conference 2018. 1941–1942.
3. Keith Cortis, André Freitas, Tobias Daudert, Manuela Huerlimann, Manel Zarrouk, Siegfried Handschuh, and Brian Davis. 2017. SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs and News. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 519–535, Vancouver, Canada. Association for Computational Linguistics.
4. Agam Shah, Suvan Paturi, and Sudheer Chava. 2023. Trillion Dollar Words: A New Financial Dataset, Task & Market Analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6664–6679, Toronto, Canada. Association for Computational Linguistics.
5. Ankur Sinha and Tanmay Khandait. 2021. Impact of news on the commodity market: Dataset and results. In Advances in Information and Communication: Proceedings of the 2021 Future of Information and Communication Conference (FICC), Volume 2. Springer, 589–601.
6. Chen C C, Lin C Y, Chiu C J, et al. Overview of the NTCIR-17 FinArg-1 Task: Fine-grained argument understanding in financial analysis[C]//Proceedings of the 17th NTCIR Conference on Evaluation of Information Access Technologies, Tokyo, Japan. 2023.
7. Rasmus Jørgensen, Oliver Brandt, Mareike Hartmann, Xiang Dai, Christian Igel, and Desmond Elliott. 2023. MultiFin: A Dataset for Multilingual Financial NLP. In Findings of the Association for Computational Linguistics: EACL 2023, pages 894–909, Dubrovnik, Croatia. Association for Computational Linguistics.
8. Yang, L., Kenny, E.M., Ng, T.L., Yang, Y., Smyth, B., & Dong, R. (2020). Generating Plausible Counterfactual Explanations for Deep Transformers in Financial Text Classification. International Conference on Computational Linguistics.
9. Chung-Chi Chen, Yu-Min Tseng, Juyeon Kang, Anaïs Lhuissier, Min-Yuh Day, Teng-Tsai Tu, and Hsin-Hsi Chen. 2023. Multi-lingual esg issue identification. In Proceedings of the Fifth Workshop on Financial Tech- nology and Natural Language Processing (FinNLP) and the Second Multimodal AI For Financial Fore- casting (Muffin).
10. Julio Cesar Salinas Alvarado, Karin Verspoor, and Timothy Baldwin. 2015. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015. 84–90.
11. Shah A, Vithani R, Gullapalli A, et al. Finer: Financial named entity recognition dataset and weak-supervision model[J]. arXiv preprint arXiv:2302.11157, 2023.
12. Sharma, Soumya et al. “FinRED: A Dataset for Relation Extraction in Financial Domain.” Companion Proceedings of the Web Conference 2022 (2022): n. pag.
13. Dominique Mariko, Hanna Abi-Akl, Estelle Labidurie, Stephane Durfort, Hugues De Mazancourt, and Mahmoud El-Haj. 2020. The Financial Document Causality Detection Shared Task (FinCausal 2020). In Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation, pages 23–32, Barcelona, Spain (Online). COLING.
14. Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al . 2021. FinQA: A Dataset of Numerical Reasoning over Financial Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 3697–3711.
15. Zhu, Fengbin, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng and Tat-Seng Chua. “TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance.” ArXiv abs/2105.07624 (2021): n. pag.
16. Soumya Sharma, Subhendu Khatuya, Manjunath Hegde, Afreen Shaikh, Koustuv Dasgupta, Pawan Goyal, and Niloy Ganguly. 2023. Financial Numeric Extreme Labelling: A dataset and benchmarking. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3550–3561, Toronto, Canada. Association for Computational Linguistics.
17. Matthew Lamm, Arun Chaganty, Christopher D. Manning, Dan Jurafsky, and Percy Liang. 2018. Textual Analogy Parsing: What’s Shared and What’s Compared among Analogous Facts. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 82–92, Brussels, Belgium. Association for Computational Linguistics.
18. Rajdeep Mukherjee, Abhinav Bohra, Akash Banerjee, Soumya Sharma, Manjunath Hegde, Afreen Shaikh, Shivani Shrivastava, Koustuv Dasgupta, Niloy Ganguly, Saptarshi Ghosh, and Pawan Goyal. 2022. ECTSum: A New Benchmark Dataset For Bullet Point Summarization of Long Earnings Call Transcripts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10893–10906, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
19. Zhihan Zhou, Liqian Ma, and Han Liu. 2021. Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2114–2124, Online. Association for Computational Linguistics.
20. Hofmann,Hans. (1994). Statlog (German Credit Data). UCI Machine Learning Repository. https://doi.org/10.24432/C5NC77.
21. Quinlan,Ross. Statlog (Australian Credit Approval). UCI Machine Learning Repository. https://doi.org/10.24432/C59012.
22. Duanyu Feng, Yongfu Dai, Jimin Huang, Yifang Zhang, Qianqian Xie, Weiguang Han, Alejandro Lopez-Lira, Hao Wang. 2023. Empowering Many, Biasing a Few: Generalist Credit Scoring through Large Language Models. ArXiv abs/2310.00566 (2023): n. pag.
23. Yejun Soun, Jaemin Yoo, Minyong Cho, Jihyeong Jeon, and U Kang. 2022. Accurate Stock Movement Prediction with Self-supervised Learning from Sparse Noisy Tweets. In 2022 IEEE International Conference on Big Data (Big Data). IEEE, 1691–1700.
24. Yumo Xu and Shay B Cohen. 2018. Stock movement prediction from tweets and historical prices. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1970–1979.
25. Huizhe Wu, Wei Zhang, Weiwei Shen, and Jun Wang. 2018. Hybrid deep sequential modeling for social text-driven stock prediction. In Proceedings of the 27th ACM international conference on information and knowledge management. 1627–1630.
26. Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022. ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6279–6292, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
git clone https://github.com/The-FinAI/PIXIU.git --recursive
cd PIXIU
pip install -r requirements.txt
cd src/financial-evaluation
pip install -e .[multilingual]
sudo bash scripts/docker_run.sh
Above command starts a docker container, you can modify docker_run.sh
to fit your environment. We provide pre-built image by running sudo docker pull tothemoon/pixiu:latest
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
--network host \
--env https_proxy=$https_proxy \
--env http_proxy=$http_proxy \
--env all_proxy=$all_proxy \
--env HF_HOME=$hf_home \
-it [--rm] \
--name pixiu \
-v $pixiu_path:$pixiu_path \
-v $hf_home:$hf_home \
-v $ssh_pub_key:/root/.ssh/authorized_keys \
-w $workdir \
$docker_user/pixiu:$tag \
[--sshd_port 2201 --cmd "echo 'Hello, world!' && /bin/bash"]
Arguments explain:
-
[]
means ignoreable arguments -
HF_HOME
: huggingface cache dir -
sshd_port
: sshd port of the container, you can runssh -i private_key -p $sshd_port root@$ip
to connect to the container, default to 22001 -
--rm
: remove the container when exit container (ie.CTRL + D
)
Before evaluation, please download BART checkpoint to src/metrics/BARTScore/bart_score.pth
.
For automated evaluation, please follow these instructions:
-
Huggingface Transformer
To evaluate a model hosted on the HuggingFace Hub (for instance, finma-7b-full), use this command:
python eval.py \
--model "hf-causal-llama" \
--model_args "use_accelerate=True,pretrained=TheFinAI/finma-7b-full,tokenizer=TheFinAI/finma-7b-full,use_fast=False" \
--tasks "flare_ner,flare_sm_acl,flare_fpb"
More details can be found in the lm_eval documentation.
- Commercial APIs
Please note, for tasks such as NER, the automated evaluation is based on a specific pattern. This might fail to extract relevant information in zero-shot settings, resulting in relatively lower performance compared to previous human-annotated results.
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python eval.py \
--model gpt-4 \
--tasks flare_ner,flare_sm_acl,flare_fpb
- Self-Hosted Evaluation
To run inference backend:
bash scripts/run_interface.sh
Please adjust run_interface.sh according to your environment requirements.
To evaluate:
python data/*/evaluate.py
Creating a new task for FinBen involves creating a Huggingface dataset and implementing the task in a Python file. This guide walks you through each step of setting up a new task using the FinBen framework.
Your dataset should be created in the following format:
{
"query": "...",
"answer": "...",
"text": "..."
}
In this format:
-
query
: Combination of your prompt and text -
answer
: Your label
For Multi-turn tasks (such as )
For Classification tasks (such as FPB (FinBen_fpb)), additional keys should be defined:
-
choices
: Set of labels -
gold
: Index of the correct label in choices (Start from 0)
For Sequential Labeling tasks (such as Finer Ord (FinBen_finer_ord)), additional keys should be defined:
-
label
: List of token labels -
token
: List of tokens
For Extractive Summarization tasks (such as ECTSUM (FinBen_ectsum)), additional keys should be defined:
-
label
: List of sentence labels
For abstractive Summarization and Question Answering tasks (such as EDTSUM (FinBen_edtsum)), no additional keys should be defined
Once your dataset is ready, you can start implementing your task. Your task should be defined within a new class in flare.py or any other Python file located within the tasks directory.
To cater to a range of tasks, we offer several specialized base classes, including Classification
, SequentialLabeling
, RelationExtraction
, ExtractiveSummarization
, AbstractiveSummarization
and QA
.
For instance, if you are embarking on a classification task, you can directly leverage our Classification
base class. This class allows for efficient and intuitive task creation. To better demonstrate this, let's delve into an example of crafting a task named FinBen-FPB using the Classification
base class:
class flareFPB(Classification):
DATASET_PATH = "flare-fpb"
And that's it! Once you've created your task class, the next step is to register it in the src/tasks/__init__.py
file. To do this, add a new line following the format "task_name": module.ClassName
. Here is how it's done:
TASK_REGISTRY = {
"flare_fpb": flare.FPB,
"your_new_task": your_module.YourTask, # This is where you add your task
}
Task | Metric | Illustration |
---|---|---|
Classification | Accuracy | This metric represents the ratio of correctly predicted observations to total observations. It is calculated as (True Positives + True Negatives) / Total Observations. |
Classification | F1 Score | The F1 Score represents the harmonic mean of precision and recall, thereby creating an equilibrium between these two factors. It proves particularly useful in scenarios where one factor bears more significance than the other. The score ranges from 0 to 1, with 1 signifying perfect precision and recall, and 0 indicating the worst case. Furthermore, we provide both 'weighted' and 'macro' versions of the F1 score. |
Classification | Missing Ratio | This metric calculates the proportion of responses where no options from the given choices in the task are returned. |
Classification | Matthews Correlation Coefficient (MCC) | The MCC is a metric that assesses the quality of binary classifications, producing a score ranging from -1 to +1. A score of +1 signifies perfect prediction, 0 denotes a prediction no better than random chance, and -1 indicates a completely inverse prediction. |
Sequential Labeling | F1 score | In the context of Sequential Labeling tasks, we utilize the F1 Score as computed by the seqeval library, a robust entity-level evaluation metric. This metric mandates an exact match of both the entity's span and type between the predicted and ground truth entities for a correct evaluation. True Positives (TP) represent correctly predicted entities, False Positives (FP) denote incorrectly predicted entities or entities with mismatched spans/types, and False Negatives (FN) signify missed entities from the ground truth. Precision, recall, and F1-score are then computed using these quantities, with the F1 Score representing the harmonic mean of precision and recall. |
Sequential Labeling | Label F1 score | This metric evaluates model performance based solely on the correctness of the labels predicted, without considering entity spans. |
Relation Extraction | Precision | Precision measures the proportion of correctly predicted relations out of all predicted relations. It is calculated as the number of True Positives (TP) divided by the sum of True Positives and False Positives (FP). |
Relation Extraction | Recall | Recall measures the proportion of correctly predicted relations out of all actual relations. It is calculated as the number of True Positives (TP) divided by the sum of True Positives and False Negatives (FN). |
Relation Extraction | F1 score | The F1 Score is the harmonic mean of precision and recall, and it provides a balance between these two metrics. The F1 Score is at its best at 1 (perfect precision and recall) and worst at 0. |
Extractive and Abstractive Summarization | Rouge-N | This measures the overlap of N-grams (a contiguous sequence of N items from a given sample of text) between the system-generated summary and the reference summary. 'N' can be 1, 2, or more, with ROUGE-1 and ROUGE-2 being commonly used to assess unigram and bigram overlaps respectively. |
Extractive and Abstractive Summarization | Rouge-L | This metric evaluates the longest common subsequence (LCS) between the system and the reference summaries. LCS takes into account sentence level structure similarity naturally and identifies longest co-occurring in-sequence n-grams automatically. |
Question Answering | EmACC | EMACC assesses the exact match between the model-generated response and the reference answer. In other words, the model-generated response is considered correct only if it matches the reference answer exactly, word-for-word. |
Additionally, you can determine if the labels should be lowercased during the matching process by specifying
LOWER_CASE
in your class definition. This is pertinent since labels are matched based on their appearance in the generated output. For tasks like examinations where the labels are a specific set of capitalized letters such as 'A', 'B', 'C', this should typically be set to False.
Our instruction dataset is uniquely tailored for the domain-specific LLM, FinMA. This dataset has been meticulously assembled to fine-tune our model on a diverse range of financial tasks. It features publicly available multi-task and multi-modal data derived from the multiple open released financial datasets.
The dataset is multi-faceted, featuring tasks including sentiment analysis, news headline classification, named entity recognition, question answering, and stock movement prediction. It covers both textual and time-series data modalities, offering a rich variety of financial data. The task specific instruction prompts for each task have been carefully degined by domain experts.
The table below summarizes the different tasks, their corresponding modalities, text types, and examples of the instructions used for each task:
Task | Modalities | Text Types | Instructions Examples |
---|---|---|---|
Sentiment Analysis | Text | news headlines,tweets | "Analyze the sentiment of this statement extracted from a financial news article.Provide your answer as either negative, positive or neutral. For instance, 'The company's stocks plummeted following the scandal.' would be classified as negative." |
News Headline Classification | Text | News Headlines | "Consider whether the headline mentions the price of gold. Is there a Price or Not in the gold commodity market indicated in the news headline? Please answer Yes or No." |
Named Entity Recognition | Text | financial agreements | "In the sentences extracted from financial agreements in U.S. SEC filings, identify the named entities that represent a person ('PER'), an organization ('ORG'), or a location ('LOC'). The required answer format is: 'entity name, entity type'. For instance, in 'Elon Musk, CEO of SpaceX, announced the launch from Cape Canaveral.', the entities would be: 'Elon Musk, PER; SpaceX, ORG; Cape Canaveral, LOC'" |
Question Answering | Text | earnings reports | "In the context of this series of interconnected finance-related queries and the additional information provided by the pretext, table data, and post text from a company's financial filings, please provide a response to the final question. This may require extracting information from the context and performing mathematical calculations. Please take into account the information provided in the preceding questions and their answers when formulating your response:" |
Stock Movement Prediction | Text, Time-Series | tweets, Stock Prices | "Analyze the information and social media posts to determine if the closing price of {tid} will ascend or descend at {point}. Please respond with either Rise or Fall." |
The dataset contains a vast amount of instruction data samples (136K), allowing FinMA to capture the nuances of the diverse financial tasks. The table below provides the statistical details of the instruction dataset:
Data | Task | Raw | Instruction | Data Types | Modalities | License | Original Paper |
---|---|---|---|---|---|---|---|
FPB | sentiment analysis | 4,845 | 48,450 | news | text | CC BY-SA 3.0 | [1] |
FiQA-SA | sentiment analysis | 1,173 | 11,730 | news headlines, tweets | text | Public | [2] |
Headline | news headline classification | 11,412 | 11,412 | news headlines | text | CC BY-SA 3.0 | [3] |
NER | named entity recognition | 1,366 | 13,660 | financial agreements | text | CC BY-SA 3.0 | [4] |
FinQA | question answering | 8,281 | 8,281 | earnings reports | text, table | MIT License | [5] |
ConvFinQA | question answering | 3,892 | 3,892 | earnings reports | text, table | MIT License | [6] |
BigData22 | stock movement prediction | 7,164 | 7,164 | tweets, historical prices | text, time series | Public | [7] |
ACL18 | stock movement prediction | 27,053 | 27,053 | tweets, historical prices | text, time series | MIT License | [8] |
CIKM18 | stock movement prediction | 4,967 | 4,967 | tweets, historical prices | text, time series | Public | [9] |
- Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. 2014. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology 65, 4 (2014), 782–796.
- Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. Www’18 open challenge: financial opinion mining and question answering. In Companion proceedings of the the web conference 2018. 1941–1942
- Ankur Sinha and Tanmay Khandait. 2021. Impact of news on the commodity market: Dataset and results. In Advances in Information and Communication: Proceedings of the 2021 Future of Information and Communication Conference (FICC), Volume 2. Springer, 589–601
- Julio Cesar Salinas Alvarado, Karin Verspoor, and Timothy Baldwin. 2015. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015. 84–90.
- Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan R Routledge, et al . 2021. FinQA: A Dataset of Numerical Reasoning over Financial Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 3697–3711.
- Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022. Convfinqa: Exploring the chain of numerical reasoning in conversational finance question answering. arXiv preprint arXiv:2210.03849 (2022).
- Yejun Soun, Jaemin Yoo, Minyong Cho, Jihyeong Jeon, and U Kang. 2022. Accurate Stock Movement Prediction with Self-supervised Learning from Sparse Noisy Tweets. In 2022 IEEE International Conference on Big Data (Big Data). IEEE, 1691–1700.
- Yumo Xu and Shay B Cohen. 2018. Stock movement prediction from tweets and historical prices. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1970–1979.
- Huizhe Wu, Wei Zhang, Weiwei Shen, and Jun Wang. 2018. Hybrid deep sequential modeling for social text-driven stock prediction. In Proceedings of the 27th ACM international conference on information and knowledge management. 1627–1630.
When you are working with the Financial Instruction Dataset (FIT), it's crucial to follow the prescribed format for training and testing models.
The format should look like this:
{
"id": "unique id",
"conversations": [
{
"from": "human",
"value": "Your prompt and text"
},
{
"from": "agent",
"value": "Your answer"
}
],
"text": "Text to be classified",
"label": "Your label"
}
Here's what each field means:
- "id": a unique identifier for each example in your dataset.
- "conversations": a list of conversation turns. Each turn is represented as a dictionary, with "from" representing the speaker, and "value" representing the text spoken in the turn.
- "text": the text to be classified.
- "label": the ground truth label for the text.
The first turn in the "conversations" list should always be from "human", and contain your prompt and the text. The second turn should be from "agent", and contain your answer.
We are pleased to introduce the first version of FinMA, including three models FinMA-7B, FinMA-7B-full, FinMA-30B, fine-tuned on LLaMA 7B and LLaMA-30B. FinMA-7B and FinMA-30B are trained with the NLP instruction data, while FinMA-7B-full is trained with the full instruction data from FIT covering both NLP and prediction tasks.
FinMA v0.1 is now available on Huggingface for public use. We look forward to the valuable contributions that this initial version will make to the financial NLP field and encourage users to apply it to various financial tasks and scenarios. We also invite feedback and shared experiences to help improve future versions.
Coming soon.
FinMem is a novel LLM-based agent framework devised for financial decision-making, encompasses three core modules: Profiling, to outline the agent's characteristics; Memory, with layered processing, to aid the agent in assimilating realistic hierarchical financial data; and Decision-making, to convert insights gained from memories into investment decisions. Currently, FinMem can trade single stocks with high returns after a simple mode warm-up. Below is a quick start for a dockerized version framework, with TSLA as sample input.
Step 1: Set environmental variables
in .env
add HUGGINGFACE TOKEN and OPENAI API KEY as needed.
OPENAI_API_KEY = "<Your OpenAI Key>"
HF_TOKEN = "<Your HF token>"
Step 2: Set endpoint URL in config.toml
Use endpoint URL to deploy models based on the model of choice (OPENAI, Gemini, open source models on HuggingFace, etc.). For open-source models on HuggingFace, one choice for generating TGI endpoints is through RunPod.
[chat]
model = "tgi"
end_point = "<set the your endpoint address>"
tokenization_model_name = "<model name>"
...
Step 3: Build Docker Image and Container
docker build -t test-finmem .devcontainer/.
start container:
docker run -it --rm -v $(pwd):/finmem test-finmem bash
Step 4: Start Simulation!
Usage: run.py sim [OPTIONS]
Start Simulation
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --market-data-path -mdp TEXT The environment data pickle path [default: data/06_input/subset_symbols.pkl] │
│ --start-time -st TEXT The training or test start time [default: 2022-06-30 For Ticker 'TSLA'] │
│ --end-time -et TEXT The training or test end time [default: 2022-10-11] │
│ --run-model -rm TEXT Run mode: train or test [default: train] │
│ --config-path -cp TEXT config file path [default: config/config.toml] │
│ --checkpoint-path -ckp TEXT The checkpoint save path [default: data/10_checkpoint_test] │
│ --result-path -rp TEXT The result save path [default: data/11_train_result] │
│ --trained-agent-path -tap TEXT Only used in test mode, the path of trained agent [default: None. Can be changed to data/05_train_model_output OR data/06_train_checkpoint] │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Example Usage:
python run.py sim --market-data-path data/03_model_input/tsla.pkl --start-time 2022-06-30 --end-time 2022-10-11 --run-model train --config-path config/tsla_tgi_config.toml --checkpoint-path data/06_train_checkpoint --result-path data/05_train_model_output
There are also checkpoint functionalities. For more details please visit FinMem Repository directly.
If you use PIXIU in your work, please cite our paper.
@misc{xie2023pixiu,
title={PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance},
author={Qianqian Xie and Weiguang Han and Xiao Zhang and Yanzhao Lai and Min Peng and Alejandro Lopez-Lira and Jimin Huang},
year={2023},
eprint={2306.05443},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{xie2024FinBen,
title={The FinBen: An Holistic Financial Benchmark for Large Language Models},
author={Qianqian Xie and Weiguang Han and Zhengyu Chen and Ruoyu Xiang and Xiao Zhang and Yueru He and Mengxi Xiao and Dong Li and Yongfu Dai and Duanyu Feng and Yijing Xu and Haoqiang Kang and Ziyan Kuang and Chenhan Yuan and Kailai Yang and Zheheng Luo and Tianlin Zhang and Zhiwei Liu and Guojun Xiong and Zhiyang Deng and Yuechen Jiang and Zhiyuan Yao and Haohang Li and Yangyang Yu and Gang Hu and Jiajia Huang and Xiao-Yang Liu and Alejandro Lopez-Lira and Benyou Wang and Yanzhao Lai and Hao Wang and Min Peng and Sophia Ananiadou and Jimin Huang},
year={2024},
eprint={2402.12659},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
PIXIU is licensed under [MIT]. For more details, please see the MIT file.
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for PIXIU
Similar Open Source Tools
PIXIU
PIXIU is a project designed to support the development, fine-tuning, and evaluation of Large Language Models (LLMs) in the financial domain. It includes components like FinBen, a Financial Language Understanding and Prediction Evaluation Benchmark, FIT, a Financial Instruction Dataset, and FinMA, a Financial Large Language Model. The project provides open resources, multi-task and multi-modal financial data, and diverse financial tasks for training and evaluation. It aims to encourage open research and transparency in the financial NLP field.
specification
OWASP CycloneDX is a full-stack Bill of Materials (BOM) standard that provides advanced supply chain capabilities for cyber risk reduction. The specification supports various types of Bill of Materials including Software, Hardware, Machine Learning, Cryptography, Manufacturing, and Operations. It also includes support for Vulnerability Disclosure Reports, Vulnerability Exploitability eXchange, and CycloneDX Attestations. CycloneDX helps organizations accurately inventory all components used in software development to identify risks, enhance transparency, and enable rapid impact analysis. The project is managed by the CycloneDX Core Working Group under the OWASP Foundation and is supported by the global information security community.
Odyssey
Odyssey is a framework designed to empower agents with open-world skills in Minecraft. It provides an interactive agent with a skill library, a fine-tuned LLaMA-3 model, and an open-world benchmark for evaluating agent capabilities. The framework enables agents to explore diverse gameplay opportunities in the vast Minecraft world by offering primitive and compositional skills, extensive training data, and various long-term planning tasks. Odyssey aims to advance research on autonomous agent solutions by providing datasets, model weights, and code for public use.
Pearl
Pearl is a production-ready Reinforcement Learning AI agent library open-sourced by the Applied Reinforcement Learning team at Meta. It enables researchers and practitioners to develop Reinforcement Learning AI agents that prioritize cumulative long-term feedback over immediate feedback and can adapt to environments with limited observability, sparse feedback, and high stochasticity. Pearl offers a diverse set of unique features for production environments, including dynamic action spaces, offline learning, intelligent neural exploration, safe decision making, history summarization, and data augmentation.
FlagEmbedding
FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently: * **Long-Context LLM** : Activation Beacon * **Fine-tuning of LM** : LM-Cocktail * **Embedding Model** : Visualized-BGE, BGE-M3, LLM Embedder, BGE Embedding * **Reranker Model** : llm rerankers, BGE Reranker * **Benchmark** : C-MTEB
amber-train
Amber is the first model in the LLM360 family, an initiative for comprehensive and fully open-sourced LLMs. It is a 7B English language model with the LLaMA architecture. The model type is a language model with the same architecture as LLaMA-7B. It is licensed under Apache 2.0. The resources available include training code, data preparation, metrics, and fully processed Amber pretraining data. The model has been trained on various datasets like Arxiv, Book, C4, Refined-Web, StarCoder, StackExchange, and Wikipedia. The hyperparameters include a total of 6.7B parameters, hidden size of 4096, intermediate size of 11008, 32 attention heads, 32 hidden layers, RMSNorm ε of 1e^-6, max sequence length of 2048, and a vocabulary size of 32000.
AV-Deepfake1M
The AV-Deepfake1M repository is the official repository for the paper AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset. It addresses the challenge of detecting and localizing deepfake audio-visual content by proposing a dataset containing video manipulations, audio manipulations, and audio-visual manipulations for over 2K subjects resulting in more than 1M videos. The dataset is crucial for developing next-generation deepfake localization methods.
imodels
Python package for concise, transparent, and accurate predictive modeling. All sklearn-compatible and easy to use. _For interpretability in NLP, check out our new package:imodelsX _
pipeline
Pipeline is a Python library designed for constructing computational flows for AI/ML models. It supports both development and production environments, offering capabilities for inference, training, and finetuning. The library serves as an interface to Mystic, enabling the execution of pipelines at scale and on enterprise GPUs. Users can also utilize this SDK with Pipeline Core on a private hosted cluster. The syntax for defining AI/ML pipelines is reminiscent of sessions in Tensorflow v1 and Flows in Prefect.
nncf
Neural Network Compression Framework (NNCF) provides a suite of post-training and training-time algorithms for optimizing inference of neural networks in OpenVINO™ with a minimal accuracy drop. It is designed to work with models from PyTorch, TorchFX, TensorFlow, ONNX, and OpenVINO™. NNCF offers samples demonstrating compression algorithms for various use cases and models, with the ability to add different compression algorithms easily. It supports GPU-accelerated layers, distributed training, and seamless combination of pruning, sparsity, and quantization algorithms. NNCF allows exporting compressed models to ONNX or TensorFlow formats for use with OpenVINO™ toolkit, and supports Accuracy-Aware model training pipelines via Adaptive Compression Level Training and Early Exit Training.
recommenders
Recommenders is a project under the Linux Foundation of AI and Data that assists researchers, developers, and enthusiasts in prototyping, experimenting with, and bringing to production a range of classic and state-of-the-art recommendation systems. The repository contains examples and best practices for building recommendation systems, provided as Jupyter notebooks. It covers tasks such as preparing data, building models using various recommendation algorithms, evaluating algorithms, tuning hyperparameters, and operationalizing models in a production environment on Azure. The project provides utilities to support common tasks like loading datasets, evaluating model outputs, and splitting training/test data. It includes implementations of state-of-the-art algorithms for self-study and customization in applications.
KwaiAgents
KwaiAgents is a series of Agent-related works open-sourced by the [KwaiKEG](https://github.com/KwaiKEG) from [Kuaishou Technology](https://www.kuaishou.com/en). The open-sourced content includes: 1. **KAgentSys-Lite**: a lite version of the KAgentSys in the paper. While retaining some of the original system's functionality, KAgentSys-Lite has certain differences and limitations when compared to its full-featured counterpart, such as: (1) a more limited set of tools; (2) a lack of memory mechanisms; (3) slightly reduced performance capabilities; and (4) a different codebase, as it evolves from open-source projects like BabyAGI and Auto-GPT. Despite these modifications, KAgentSys-Lite still delivers comparable performance among numerous open-source Agent systems available. 2. **KAgentLMs**: a series of large language models with agent capabilities such as planning, reflection, and tool-use, acquired through the Meta-agent tuning proposed in the paper. 3. **KAgentInstruct**: over 200k Agent-related instructions finetuning data (partially human-edited) proposed in the paper. 4. **KAgentBench**: over 3,000 human-edited, automated evaluation data for testing Agent capabilities, with evaluation dimensions including planning, tool-use, reflection, concluding, and profiling.
GIMP-ML
A.I. for GNU Image Manipulation Program (GIMP-ML) is a repository that provides Python plugins for using computer vision models in GIMP. The code base and models are continuously updated to support newer and more stable functionality. Users can edit images with text, outpaint images, and generate images from text using models like Dalle 2 and Dalle 3. The repository encourages citations using a specific bibtex entry and follows the MIT license for GIMP-ML and the original models.
ReST-MCTS
ReST-MCTS is a reinforced self-training approach that integrates process reward guidance with tree search MCTS to collect higher-quality reasoning traces and per-step value for training policy and reward models. It eliminates the need for manual per-step annotation by estimating the probability of steps leading to correct answers. The inferred rewards refine the process reward model and aid in selecting high-quality traces for policy model self-training.
NSMusicS
NSMusicS is a local music software that is expected to support multiple platforms with AI capabilities and multimodal features. The goal of NSMusicS is to integrate various functions (such as artificial intelligence, streaming, music library management, cross platform, etc.), which can be understood as similar to Navidrome but with more features than Navidrome. It wants to become a plugin integrated application that can almost have all music functions.
litgpt
LitGPT is a command-line tool designed to easily finetune, pretrain, evaluate, and deploy 20+ LLMs **on your own data**. It features highly-optimized training recipes for the world's most powerful open-source large-language-models (LLMs).
For similar tasks
nlp-llms-resources
The 'nlp-llms-resources' repository is a comprehensive resource list for Natural Language Processing (NLP) and Large Language Models (LLMs). It covers a wide range of topics including traditional NLP datasets, data acquisition, libraries for NLP, neural networks, sentiment analysis, optical character recognition, information extraction, semantics, topic modeling, multilingual NLP, domain-specific LLMs, vector databases, ethics, costing, books, courses, surveys, aggregators, newsletters, papers, conferences, and societies. The repository provides valuable information and resources for individuals interested in NLP and LLMs.
adata
AData is a free and open-source A-share database that focuses on transaction-related data. It provides comprehensive data on stocks, including basic information, market data, and sentiment analysis. AData is designed to be easy to use and integrate with other applications, making it a valuable tool for quantitative trading and AI training.
PIXIU
PIXIU is a project designed to support the development, fine-tuning, and evaluation of Large Language Models (LLMs) in the financial domain. It includes components like FinBen, a Financial Language Understanding and Prediction Evaluation Benchmark, FIT, a Financial Instruction Dataset, and FinMA, a Financial Large Language Model. The project provides open resources, multi-task and multi-modal financial data, and diverse financial tasks for training and evaluation. It aims to encourage open research and transparency in the financial NLP field.
hezar
Hezar is an all-in-one AI library designed specifically for the Persian community. It brings together various AI models and tools, making it easy to use AI with just a few lines of code. The library seamlessly integrates with Hugging Face Hub, offering a developer-friendly interface and task-based model interface. In addition to models, Hezar provides tools like word embeddings, tokenizers, feature extractors, and more. It also includes supplementary ML tools for deployment, benchmarking, and optimization.
text-embeddings-inference
Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for popular models like FlagEmbedding, Ember, GTE, and E5. It implements features such as no model graph compilation step, Metal support for local execution on Macs, small docker images with fast boot times, token-based dynamic batching, optimized transformers code for inference using Flash Attention, Candle, and cuBLASLt, Safetensors weight loading, and production-ready features like distributed tracing with Open Telemetry and Prometheus metrics.
CodeProject.AI-Server
CodeProject.AI Server is a standalone, self-hosted, fast, free, and open-source Artificial Intelligence microserver designed for any platform and language. It can be installed locally without the need for off-device or out-of-network data transfer, providing an easy-to-use solution for developers interested in AI programming. The server includes a HTTP REST API server, backend analysis services, and the source code, enabling users to perform various AI tasks locally without relying on external services or cloud computing. Current capabilities include object detection, face detection, scene recognition, sentiment analysis, and more, with ongoing feature expansions planned. The project aims to promote AI development, simplify AI implementation, focus on core use-cases, and leverage the expertise of the developer community.
spark-nlp
Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides simple, performant, and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 36000+ pretrained pipelines and models in more than 200+ languages. It offers tasks such as Tokenization, Word Segmentation, Part-of-Speech Tagging, Named Entity Recognition, Dependency Parsing, Spell Checking, Text Classification, Sentiment Analysis, Token Classification, Machine Translation, Summarization, Question Answering, Table Question Answering, Text Generation, Image Classification, Image to Text (captioning), Automatic Speech Recognition, Zero-Shot Learning, and many more NLP tasks. Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, CamemBERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Llama-2, M2M100, BART, Instructor, E5, Google T5, MarianMT, OpenAI GPT2, Vision Transformers (ViT), OpenAI Whisper, and many more not only to Python and R, but also to JVM ecosystem (Java, Scala, and Kotlin) at scale by extending Apache Spark natively.
scikit-llm
Scikit-LLM is a tool that seamlessly integrates powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks. It allows users to leverage large language models for various text analysis applications within the familiar scikit-learn framework. The tool simplifies the process of incorporating advanced language processing capabilities into machine learning pipelines, enabling users to benefit from the latest advancements in natural language processing.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
agentcloud
AgentCloud is an open-source platform that enables companies to build and deploy private LLM chat apps, empowering teams to securely interact with their data. It comprises three main components: Agent Backend, Webapp, and Vector Proxy. To run this project locally, clone the repository, install Docker, and start the services. The project is licensed under the GNU Affero General Public License, version 3 only. Contributions and feedback are welcome from the community.
oss-fuzz-gen
This framework generates fuzz targets for real-world `C`/`C++` projects with various Large Language Models (LLM) and benchmarks them via the `OSS-Fuzz` platform. It manages to successfully leverage LLMs to generate valid fuzz targets (which generate non-zero coverage increase) for 160 C/C++ projects. The maximum line coverage increase is 29% from the existing human-written targets.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
Azure-Analytics-and-AI-Engagement
The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (Containing a demo web application, Power BI reports, Synapse resources, AML Notebooks etc.) that can be deployed in a customer’s subscription using the CAPE tool within a matter of few hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.