OpenAI-CLIP-Feature
Easy-to-use, user-friendly, and efficient code for extracting OpenAI CLIP (global/grid) features from images and text.
This repository provides code for extracting image and text features with OpenAI CLIP models, supporting both global and local grid visual features, and lets users easily customize the input and output grid resolution for vision-and-language downstream tasks. The extracted features achieve comparable or superior results on MSCOCO image captioning without hyperparameter tuning. The repo supports several CLIP models, documents the supported settings and MSCOCO captioning results, and experiments with the extracted features can be set up in X-modaler.
README:
- Updated CLIP source code to support two new models: "RN50x64" and "ViT-L/14".
- Added examples for three public datasets: Flickr30k, NoCaps, and SynthImgCap.
This repo aims to provide easy-to-use and efficient code for extracting image and text features with the official OpenAI CLIP models, and it is optimized for multi-process GPU feature extraction.
The official OpenAI CLIP repo only supports extracting global visual features, yet the local grid features from CLIP visual models can carry more detailed semantic information that benefits vision-and-language downstream tasks [1][2]. As an alternative, this repo wraps slightly modified CLIP code so that both global visual features and local grid visual features can be extracted from the different CLIP visual models. Moreover, the repo is designed in a user-friendly, object-oriented fashion: users can easily add their own visual_extractor classes to customize the input and output grid resolution.
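As a rough illustration of that pattern (the class name, constructor arguments, and the backbone's return convention below are assumptions made for this sketch, not the repo's actual API), a customized extractor mainly needs to fix its input resolution and an optional pooling layer for the output feature map:

```python
import torch
import torch.nn as nn
from torchvision import transforms


class CustomVisualExtractor(nn.Module):
    """Hypothetical sketch of a customized visual_extractor-style class.

    Illustrates the pattern described above: fix an input resolution,
    run a CLIP visual backbone, optionally downsample the grid feature map.
    """

    def __init__(self, clip_visual: nn.Module, input_resolution: int = 448,
                 downsample: bool = True):
        super().__init__()
        self.preprocess = transforms.Resize((input_resolution, input_resolution))
        # Assumption for this sketch: the backbone returns (global_feat, grid_feat).
        self.backbone = clip_visual
        # Optional AvgPool2d to shrink the output grid, as in the table below.
        self.pool = nn.AvgPool2d(kernel_size=(2, 2), stride=2) if downsample else nn.Identity()

    def forward(self, images: torch.Tensor):
        images = self.preprocess(images)
        global_feat, grid_feat = self.backbone(images)   # grid_feat: (N, C, H, W), assumed
        grid_feat = self.pool(grid_feat)                 # e.g. 14 x 14 -> 7 x 7
        n, c, h, w = grid_feat.shape
        return global_feat, grid_feat.reshape(n, c, h * w).permute(0, 2, 1)  # (N, H*W, C)
```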
To verify the semantic content of the extracted visual grid features, we also fed the grid features extracted from MSCOCO images by different official CLIP models into a standard image captioning task. By simply replacing BUTD features with the extracted CLIP grid features, we obtained comparable or superior results with a transformer baseline, without heavy hyperparameter tuning. Surprisingly, with the ViT-B/32 CLIP model we reached a 116.9 CIDEr score in the teacher-forcing setting and 129.6 in the reinforcement-learning setting, which conflicts with the results reported in the CLIP-ViL paper [1], where the authors observed a large performance degradation for CLIP-ViT-B grid features compared with other models (58.0 CIDEr in the CLIP-ViT-B_Transformer setting on COCO captioning).
We provide supported CLIP models, results on MSCOCO image captioning, and other information below. We believe this repo can facilitate the usage of powerful CLIP models.
Currently this repo supports five visual extractor settings: three standard pipelines used in the official OpenAI CLIP repo and two additional customized pipelines that support a larger input resolution. You can refer to this file for more details about customizing your own visual backbones for different input and output resolutions. To improve training efficiency in the image captioning task, in some settings we apply AvgPool2d to the output feature map to reduce the grid feature size, without large performance degradation. We will support more CLIP models in the future.
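For concreteness, here is the shape arithmetic behind that pooling step as a standalone PyTorch snippet (the tensors are random dummies, not code from the repo): a 14 x 14 ViT-B/16 feature map is averaged down to 7 x 7 and then flattened into 49 grid vectors.

```python
import torch
import torch.nn as nn

# Dummy ViT-B/16 output feature map at 224 x 224 input: 14 x 14 grid of 768-d vectors.
feature_map = torch.randn(1, 768, 14, 14)          # (N, C, H, W)

pool = nn.AvgPool2d(kernel_size=(2, 2), stride=2)
pooled = pool(feature_map)                         # -> (1, 768, 7, 7)

grid_features = pooled.flatten(2).transpose(1, 2)  # -> (1, 49, 768), i.e. "49 x 768"
print(pooled.shape, grid_features.shape)
```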
| | Visual Backbone | CLIP Model | Input Resolution | Output Resolution | Feature Map Downsample | Grid Feature Shape | Global Feature Shape |
|---|---|---|---|---|---|---|---|
| Standard | RN101 | RN101 | 224 x 224 | 7 x 7 | None | 49 x 2048 | 1 x 512 |
| | ViT-B/32 | ViT-B/32 | 224 x 224 | 7 x 7 | None | 49 x 768 | 1 x 512 |
| | ViT-B/16 | ViT-B/16 | 224 x 224 | 14 x 14 | AvgPool2d(kernel_size=(2,2), stride=2) | 49 x 768 | 1 x 512 |
| Customized | RN101_448 | RN101 | 448 x 448 | 14 x 14 | AvgPool2d(kernel_size=(2,2), stride=2) | 49 x 2048 | 1 x 512 |
| | ViT-B/32_448 | ViT-B/32 | 448 x 448 | 14 x 14 | AvgPool2d(kernel_size=(2,2), stride=2) | 49 x 768 | 1 x 512 |
We ran image captioning experiments on X-modaler with the extracted CLIP grid features. Using the default hyperparameters of X-modaler's transformer baseline, we easily obtained comparable or superior results, except that we set SOLVER.BASE_LR=2e-4 in the ViT-B/16 and ViT-B/32_448 teacher-forcing settings. The performance of the transformer baseline with BUTD features is taken from the X-modaler paper.
Teacher-forcing (cross-entropy) setting:

| Name | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|
| BUTD_feat | 76.4 | 60.3 | 46.5 | 35.8 | 28.2 | 56.7 | 116.6 | 21.3 |
| RN101 | 77.3 | 61.3 | 47.7 | 36.9 | 28.7 | 57.5 | 120.6 | 21.8 |
| ViT-B/32 | 76.4 | 60.3 | 46.5 | 35.6 | 28.1 | 56.7 | 116.9 | 21.2 |
| ViT-B/16 | 78.0 | 62.1 | 48.2 | 37.2 | 28.8 | 57.6 | 122.3 | 22.1 |
| RN101_448 | 78.0 | 62.4 | 48.9 | 38.0 | 29.0 | 57.9 | 123.6 | 22.1 |
| ViT-B/32_448 | 75.8 | 59.6 | 45.9 | 35.1 | 27.8 | 56.3 | 114.2 | 21.0 |
Reinforcement-learning setting:

| Name | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|
| BUTD_feat | 80.5 | 65.4 | 51.1 | 39.2 | 29.1 | 58.7 | 130.0 | 23.0 |
| RN101 | 81.3 | 66.4 | 52.1 | 40.3 | 29.6 | 59.6 | 134.2 | 23.4 |
| ViT-B/32 | 79.9 | 64.6 | 50.4 | 38.5 | 29.0 | 58.6 | 129.6 | 22.8 |
| ViT-B/16 | 82.0 | 67.3 | 53.1 | 41.1 | 29.9 | 59.8 | 136.6 | 23.8 |
| RN101_448 | 81.6 | 66.9 | 52.6 | 40.6 | 29.9 | 59.8 | 136.2 | 23.9 |
| ViT-B/32_448 | 79.9 | 64.6 | 50.4 | 38.7 | 28.8 | 58.4 | 127.8 | 22.6 |
Note: The extracted feature files are compatible with X-modaler, so you can conveniently set up cross-modal analytics experiments there.
- PyTorch ≥ 1.9 and a torchvision version that matches the PyTorch installation. Install them together at pytorch.org to ensure compatibility.
- timm ≥ 0.4.5
- Use the CLIP `ViT-B/32` model to extract global textual features of MSCOCO sentences from `dataset_coco.json` in Karpathy's released annotations (a rough sketch of the underlying computation follows the command).
CUDA_VISIBLE_DEVICES=0 python3 clip_textual_feats.py \
--anno dataset_coco.json \
--output_dir ${TXT_OUTPUT_DIR} \
--model_type_or_path 'ViT-B/32'
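Under the hood, the global textual feature for each caption corresponds to the standard `encode_text` output of the official `clip` package. The sketch below shows that computation for a single sentence; the example caption, and the omission of batching and file saving, are simplifications, and the script's actual internals may differ.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# One example caption; in the script the sentences come from dataset_coco.json.
tokens = clip.tokenize(["a man riding a wave on top of a surfboard"]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(tokens)   # (1, 512) global textual feature
print(text_feat.shape)
```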
- Use the CLIP `ViT-B/16` model to extract global and grid visual features of MSCOCO images (an illustrative sketch of where the grid features come from follows the command).
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
--image_list 'example/MSCOCO/image_list_2017.txt' \
--image_dir ${IMG_DIR} \
--output_dir ${IMG_OUTPUT_DIR} \
--ve_name 'ViT-B/16' \
--model_type_or_path 'ViT-B/16'
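This repo ships modified CLIP code to expose the grid features directly. Purely as an illustration of where they live in a ViT backbone, the sketch below grabs the patch tokens from the official `clip` model with a forward hook; the hook target, the example image path, and the un-projected 768-d token width are assumptions tied to the current official implementation, not the repo's actual mechanism.

```python
import clip
import torch
import torch.nn.functional as F
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

captured = {}

def grab_tokens(module, inputs, output):
    # The official CLIP transformer runs in (L, N, D) layout; keep the raw output.
    captured["tokens"] = output

hook = model.visual.transformer.register_forward_hook(grab_tokens)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # any local image
with torch.no_grad():
    global_feat = model.encode_image(image)        # (1, 512) global feature

tokens = captured["tokens"].permute(1, 0, 2)       # (1, 1 + 196, 768) at 224 x 224 input
grid_feat = tokens[:, 1:, :]                       # drop the class token -> (1, 196, 768)

# Optionally pool 14 x 14 -> 7 x 7 as in the table above, giving 49 x 768.
grid_map = grid_feat.transpose(1, 2).reshape(1, 768, 14, 14)
grid_49 = F.avg_pool2d(grid_map, 2).flatten(2).transpose(1, 2)

hook.remove()
print(global_feat.shape, grid_feat.shape, grid_49.shape)
```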
- Use the CLIP `RN101` model to extract global and grid visual features of MSCOCO images.
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
--image_list 'example/MSCOCO/image_list_2017.txt' \
--image_dir ${IMG_DIR} \
--output_dir ${IMG_OUTPUT_DIR} \
--ve_name 'RN101' \
--model_type_or_path 'RN101'
- Use the CLIP `RN101` model to extract global and grid visual features of MSCOCO images at 448 x 448 resolution.
CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
--image_list 'example/MSCOCO/image_list_2017.txt' \
--image_dir ${IMG_DIR} \
--output_dir ${IMG_OUTPUT_DIR} \
--ve_name 'RN101_448' \
--model_type_or_path 'RN101'
You can run the same script with the same input list (i.e. `--image_list` or `--anno`) on another GPU, possibly on a different machine, provided the disk that receives the output features is shared between the machines. The new process only handles items that have not been processed yet, without overlapping with the extraction processes already running.
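The exact bookkeeping lives in the scripts themselves; conceptually it is a skip-if-output-exists loop over the input list, so concurrent workers writing to the same shared output directory naturally split the remaining items. Below is a minimal sketch of that pattern with an assumed `.npz` naming scheme, not the repo's actual code.

```python
import os
import numpy as np

def extract_missing(image_ids, output_dir, extract_fn):
    """Process only items whose feature file does not exist yet.

    Several copies of this loop can run on different GPUs or machines against
    the same shared output_dir without duplicating work (modulo rare races).
    """
    for image_id in image_ids:
        out_path = os.path.join(output_dir, f"{image_id}.npz")  # assumed naming scheme
        if os.path.exists(out_path):
            continue  # another worker already handled (or is handling) this item
        feats = extract_fn(image_id)  # user-supplied: returns a dict of numpy arrays
        np.savez_compressed(out_path, **feats)
```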
License: MIT
This repo uses resources from OpenAI CLIP, timm, CLIP-ViL, and X-modaler, and is implemented in PyTorch. We thank the authors for open-sourcing their awesome projects.
[1] How Much Can CLIP Benefit Vision-and-Language Tasks? Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer. arXiv, 2021.
[2] In Defense of Grid Features for Visual Question Answering. Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen. In CVPR 2020.
[3] X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics. Yehao Li, Yingwei Pan, Jingwen Chen, Ting Yao, Tao Mei. In ACM MM 2021 Open Source Software Competition.