CVPR2024-Papers-with-Code-Demo

Collects the latest CVPR (Conference on Computer Vision and Pattern Recognition) results, including papers, code, and demo videos. Recommendations are welcome!

Stars: 1166

This repository contains a collection of papers and code for the CVPR 2024 conference. The papers cover a wide range of topics in computer vision, including object detection, image segmentation, image generation, and video analysis. The code provides implementations of the algorithms described in the papers, making it easy for researchers and practitioners to reproduce the results and build upon the work of others. The repository is maintained by a team of researchers at the University of California, Berkeley.

README:

CVPR2024-Papers-with-Code-Demo

☪️ Add WeChat: nvshenj125, and note your research area to join the discussion and study group

Follow the WeChat official account: AI算法与图像处理 (AI Algorithms and Image Processing)

🌟 CVPR 2024: continuously updated with the latest papers and their open-source code!

Bilibili demos: https://space.bilibili.com/288489574

✋ Note: everyone is welcome to open an issue to share CVPR 2024 papers and open-source projects, and to help improve this project together!

Paper collections from previous top conferences:

CVPR2021

CVPR2022

CVPR2023

ICCV2021

ECCV2022

🎆 Welcome to join the group

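The CVPR 2024 paper discussion group is now open! If your paper has been accepted, add WeChat nvshenj125 with the note: CVPR + name + school/company name. Requests must follow this format to be added to the group.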

🔨 Table of Contents (click to jump)

Contents (click on the right to collapse)

Backbone
数据集/Dataset
Diffusion Model
Text-to-Image
NAS
NeRF
Knowledge Distillation
多模态 / Multimodal
对比学习/Contrastive Learning
图神经网络 / Graph Neural Networks
胶囊网络 / Capsule Network
图像分类 / Image Classification
目标检测/Object Detection
目标跟踪/Object Tracking
轨迹预测/Trajectory Prediction
语义分割/Segmentation
弱监督语义分割/Weakly Supervised Semantic Segmentation
医学图像分割/Medical Image Segmentation
视频目标分割/Video Object Segmentation
交互式视频目标分割/Interactive Video Object Segmentation
Visual Transformer
深度估计/Depth Estimation
人脸识别/Face Recognition
人脸检测/Face Detection
人脸活体检测/Face Anti-Spoofing
人脸年龄估计/Age Estimation
人脸表情识别/Facial Expression Recognition
人脸属性识别/Facial Attribute Recognition
人脸编辑/Facial Editing
人脸重建/Face Reconstruction
Talking Face
换脸/Face Swap
姿态估计/Pose Estimation
手势姿态估计（重建）/Hand Pose Estimation (Hand Mesh Recovery)
视频动作检测/Video Action Detection
手语翻译/Sign Language Translation
3D人体重建/3D Human Reconstruction
行人重识别/Person Re-identification
行人搜索/Person Search
人群计数 / Crowd Counting
GAN
彩妆迁移 / Color-Pattern Makeup Transfer
字体生成 / Font Generation
场景文本检测、识别/Scene Text Detection/Recognition
图像、视频检索 / Image Retrieval/Video retrieval
Image Animation
抠图/Image Matting
超分辨率/Super Resolution
图像复原/Image Restoration
图像补全/Image Inpainting
图像去噪/Image Denoising
图像编辑/Image Editing
图像拼接/Image stitching
图像匹配/Image Matching
图像融合/Image Blending
图像去雾/Image Dehazing
图像去模糊/Image Deblur
图像压缩/Image Compression
反光去除/Reflection Removal
车道线检测/Lane Detection
自动驾驶 / Autonomous Driving
流体重建/Fluid Reconstruction
场景重建 / Scene Reconstruction
3D Reconstruction
视频插帧/Frame Interpolation
视频超分 / Video Super-Resolution
3D点云/3D point cloud
标签噪声 / Label-Noise
对抗样本/Adversarial Examples
Anomaly Detection
其他/Other

Backbone

返回目录/back

数据集/Dataset

HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative

论文/Paper: http://arxiv.org/pdf/2403.02640
代码/Code: None

Traffic Scene Parsing through the TSP6K Dataset

论文/Paper: https://arxiv.org/pdf/2303.02835.pdf
代码/Code: https://github.com/PengtaoJiang/TSP6K

返回目录/back

Diffusion Model

Balancing Act: Distribution-Guided Debiasing in Diffusion Models

论文/Paper: http://arxiv.org/pdf/2402.18206
代码/Code: None

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

论文/Paper: http://arxiv.org/pdf/2402.19481
代码/Code: https://github.com/mit-han-lab/distrifuser

DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly

论文/Paper: http://arxiv.org/pdf/2402.19302
代码/Code: https://github.com/iit-pavis/diffassemble

Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks

论文/Paper: http://arxiv.org/pdf/2403.00644
代码/Code: None

Few-shot Learner Parameterization by Diffusion Time-steps

论文/Paper: http://arxiv.org/pdf/2403.02649
代码/Code: https://github.com/yue-zhongqi/tif

MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant

论文/Paper: http://arxiv.org/pdf/2403.04290
代码/Code: None

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

论文/Paper: https://arxiv.org/abs/2403.06951
代码/Code: https://github.com/Tianhao-Qi/DEADiff_code

Face2Diffusion for Fast and Editable Face Personalization

论文/Paper: http://arxiv.org/pdf/2403.05094
代码/Code: https://github.com/mapooon/Face2Diffusion

MACE: Mass Concept Erasure in Diffusion Models

论文/Paper: http://arxiv.org/pdf/2403.06135
代码/Code: https://github.com/Shilin-LU/MACE

It's All About Your Sketch: Democratising Sketch Control in Diffusion Models

论文/Paper: http://arxiv.org/pdf/2403.07234
代码/Code: https://github.com/subhadeepkoley/demosketch2rgb

SemCity: Semantic Scene Generation with Triplane Diffusion

论文/Paper: http://arxiv.org/pdf/2403.07773
代码/Code: https://github.com/zoomin-lee/semcity

返回目录/back

Text-to-Image

RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization

论文/Paper: http://arxiv.org/pdf/2403.00483
代码/Code: None

NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging

论文/Paper: http://arxiv.org/pdf/2403.03485
代码/Code: https://github.com/univ-esuty/noisecollage

Discriminative Probing and Tuning for Text-to-Image Generation

论文/Paper: http://arxiv.org/pdf/2403.04321
代码/Code: None

Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation

论文/Paper: http://arxiv.org/pdf/2403.05239
代码/Code: None

Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation

论文/Paper: http://arxiv.org/pdf/2403.06452
代码/Code: https://github.com/mulns/Text2QR

Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers

论文/Paper: http://arxiv.org/pdf/2403.07214
代码/Code: None

返回目录/back

NAS

返回目录/back

NeRF

GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding

论文/Paper: http://arxiv.org/pdf/2403.03608
代码/Code: None

DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization

论文/Paper: http://arxiv.org/pdf/2403.06912
代码/Code: https://github.com/fictionarry/dngaussian

S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes

论文/Paper: http://arxiv.org/pdf/2403.06205
代码/Code: None

返回目录/back

Knowledge Distillation

PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

论文/Paper: http://arxiv.org/pdf/2403.02781
代码/Code: https://github.com/zhengli97/PromptKD

Logit Standardization in Knowledge Distillation

论文/Paper: https://arxiv.org/abs/2403.01427
代码/Code: https://github.com/sunshangquan/logit-standardization-KD

RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features

论文/Paper: http://arxiv.org/pdf/2403.05061
代码/Code: None

$V_kD$: Improving Knowledge Distillation using Orthogonal Projections

论文/Paper: http://arxiv.org/pdf/2403.06213
代码/Code: https://github.com/roymiles/vkd

返回目录/back

多模态 / Multimodal

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

论文/Paper: https://arxiv.org/abs/2312.07472
代码/Code: https://github.com/IranQin/MP5
主页/Website: https://iranqin.github.io/MP5.github.io/

Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

论文/Paper: http://arxiv.org/pdf/2402.18091
代码/Code: None

MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

论文/Paper: http://arxiv.org/pdf/2403.02991
代码/Code: None

Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval

论文/Paper: http://arxiv.org/pdf/2403.05105
代码/Code: https://github.com/hhc1997/L2RM

MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric

论文/Paper: http://arxiv.org/pdf/2403.07839
代码/Code: None

Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Matching Framework

论文/Paper: http://arxiv.org/pdf/2403.07636
代码/Code: https://github.com/hieuphan33/mavl

Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations

论文/Paper: http://arxiv.org/pdf/2403.07241
代码/Code: None

返回目录/back

Contrastive Learning

Style Blind Domain Generalized Semantic Segmentation via Covariance Alignment and Semantic Consistence Contrastive Learning

论文/Paper: http://arxiv.org/pdf/2403.06122
代码/Code: https://github.com/root0yang/blindnet

返回目录/back

胶囊网络 / Capsule Network

返回目录/back

图像分类 / Image Classification

返回目录/back

目标检测/Object Detection

UniMODE: Unified Monocular 3D Object Detection

论文/Paper: http://arxiv.org/pdf/2402.18573
代码/Code: None

CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoors Object Detection from Multi-view Images

论文/Paper: http://arxiv.org/pdf/2403.04198
代码/Code: https://github.com/SerCharles/CN-RMA

Memory-based Adapters for Online 3D Scene Perception

论文/Paper: https://arxiv.org/abs/2403.06974
代码/Code: https://github.com/xuxw98/Online3D

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

论文/Paper: https://arxiv.org/abs/2403.16131
代码/Code: https://github.com/xiuqhou/Salience-DETR

Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors

论文/Paper: http://arxiv.org/pdf/2403.06093
代码/Code: https://github.com/nullmax-vision/QAF2D

SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection

论文/Paper: http://arxiv.org/pdf/2403.05817
代码/Code: https://github.com/zhanggang001/hednet

返回目录/back

目标跟踪/Object Tracking

DeconfuseTrack: Dealing with Confusion for Multi-Object Tracking

论文/Paper: http://arxiv.org/pdf/2403.02767
代码/Code: None

Delving into the Trajectory Long-tail Distribution for Muti-object Tracking

论文/Paper: http://arxiv.org/pdf/2403.04700
代码/Code: https://github.com/chen-si-jia/Trajectory-Long-tail-Distribution-for-MOT

返回目录/back

3D Object Tracking

返回目录/back

轨迹预测/Trajectory Prediction

返回目录/back

语义分割/Segmentation

PEM: Prototype-based Efficient MaskFormer for Image Segmentation

论文/Paper: http://arxiv.org/pdf/2402.19422
代码/Code: https://github.com/niccolocavagnero/pem

Towards the Uncharted: Density-Descending Feature Perturbation for Semi-supervised Semantic Segmentation

论文/Paper: http://arxiv.org/pdf/2403.06462
代码/Code: https://github.com/Gavinwxy/DDFP

Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation

论文/Paper: http://arxiv.org/pdf/2403.06247
代码/Code: None

返回目录/back

弱监督语义分割/Weakly Supervised Semantic Segmentation

返回目录/back

医学图像/Medical Image

Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration

论文/Paper: http://arxiv.org/pdf/2402.18933
代码/Code: None

返回目录/back

视频目标分割/Video Object Segmentation

Depth-aware Test-Time Training for Zero-shot Video Object Segmentation

论文/Paper: http://arxiv.org/pdf/2403.04258
代码/Code: None

返回目录/back

交互式视频目标分割/Interactive Video Object Segmentation

返回目录/back

Visual Transformer

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

论文/Paper: http://arxiv.org/pdf/2403.05419
代码/Code: https://github.com/techmn/satmae_pp

返回目录/back

深度估计/Depth Estimation

Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving

论文/Paper: https://arxiv.org/pdf/2403.07535.pdf
代码/Code: https://github.com/Junda24/AFNet

返回目录/back

图像、视频检索 / Image Retrieval/Video retrieval

Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval

论文/Paper: http://arxiv.org/pdf/2403.00272
代码/Code: None

Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval

论文/Paper: http://arxiv.org/pdf/2403.05105
代码/Code: https://github.com/hhc1997/L2RM

How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval?

论文/Paper: http://arxiv.org/pdf/2403.07203
代码/Code: None

返回目录/back

超分辨率/Super Resolution

SeD: Semantic-Aware Discriminator for Image Super-Resolution

论文/Paper: http://arxiv.org/pdf/2402.19387
代码/Code: None

Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts

论文/Paper: http://arxiv.org/pdf/2402.19215
代码/Code: https://github.com/mandalinadagi/wgsr

CAMixerSR: Only Details Need More "Attention"

论文/Paper: http://arxiv.org/pdf/2402.19289
代码/Code: https://github.com/icandle/camixersr

Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning

论文/Paper: http://arxiv.org/pdf/2403.02601
代码/Code: None

返回目录/back

图像复原/Image Restoration

Boosting Image Restoration via Priors from Pre-trained Models

论文/Paper: http://arxiv.org/pdf/2403.06793
代码/Code: None

返回目录/back

图像去噪/Image Denoising

返回目录/back

图像编辑/Image Editing

Doubly Abductive Counterfactual Inference for Text-based Image Editing

论文/Paper: http://arxiv.org/pdf/2403.02981
代码/Code: https://github.com/xuesong39/DAC

返回目录/back

图像压缩/Image Compression

返回目录/back

图像去模糊/Image Deblur

A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning

论文/Paper: http://arxiv.org/pdf/2403.02611
代码/Code: https://github.com/PieceZhang/MPT-CataBlur

返回目录/back

自动驾驶 / Autonomous Driving

Abductive Ego-View Accident Video Understanding for Safe Driving Perception

论文/Paper: http://arxiv.org/pdf/2403.00436
代码/Code: None

Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving

论文/Paper: http://arxiv.org/pdf/2403.07535
代码/Code: https://github.com/Junda24/AFNet

返回目录/back

人脸识别/Face Recognition

返回目录/back

人脸检测/Face Detection

返回目录/back

人脸活体检测/Face Anti-Spoofing

Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing

论文/Paper: http://arxiv.org/pdf/2402.19298
代码/Code: https://github.com/omggggg/mmdg

返回目录/back

人脸重建/Face Reconstruction

返回目录/back

视频动作检测/Video Action Detection

返回目录/back

手语翻译/Sign Language Translation

返回目录/back

行人重识别/Person Re-identification

返回目录/back

Talking Face

返回目录/back

姿态估计/Pose Estimation

FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation

论文/Paper: http://arxiv.org/pdf/2403.03221
代码/Code: None

Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation

论文/Paper: http://arxiv.org/pdf/2403.04381
代码/Code: https://github.com/MickeyLLG/S2DHand

Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

论文/Paper: https://arxiv.org/pdf/2311.12028.pdf
代码/Code: https://github.com/NationalGAILab/HoT

返回目录/back

GAN

返回目录/back

人脸年龄估计/Age Estimation

返回目录/back

人脸表情识别/Facial Expression Recognition

返回目录/back

手势姿态估计（重建）/Hand Pose Estimation (Hand Mesh Recovery)

返回目录/back

3D Reconstruction

UFORecon: Generalizable Sparse-View Surface Reconstruction from Arbitrary and UnFavOrable Data Sets

论文/Paper: http://arxiv.org/pdf/2403.05086
代码/Code: https://github.com/Youngju-Na/UFORecon

DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction

论文/Paper: http://arxiv.org/pdf/2403.05005
代码/Code: None

Memory-based Adapters for Online 3D Scene Perception

论文/Paper: http://arxiv.org/pdf/2403.06974
代码/Code: https://github.com/xuxw98/Online3D

Bayesian Diffusion Models for 3D Shape Reconstruction

论文/Paper: http://arxiv.org/pdf/2403.06973
代码/Code: None

返回目录/back

视频插帧/Frame Interpolation

返回目录/back

3D点云/3D point cloud

Rethinking Few-shot 3D Point Cloud Semantic Segmentation

论文/Paper: http://arxiv.org/pdf/2403.00592
代码/Code: https://github.com/ZhaochongAn/COSeg

Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension

论文/Paper: http://arxiv.org/pdf/2403.03532
代码/Code: https://github.com/liuquan98/eyoc

Hide in Thicket: Generating Imperceptible and Rational Adversarial Perturbations on 3D Point Clouds

论文/Paper: http://arxiv.org/pdf/2403.05247
代码/Code: https://github.com/TRLou/HiT-ADV

返回目录/back

Anomaly Detection

Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts

论文/Paper: http://arxiv.org/pdf/2403.06495
代码/Code: https://github.com/mala-lab/inctrl

RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection

论文/Paper: http://arxiv.org/pdf/2403.05897
代码/Code: https://github.com/cnulab/realnet

返回目录/back

其他/Other

DisCo: Disentangled Control for Realistic Human Dance Generation

论文/Paper: https://arxiv.org/abs/2307.00040
代码/Code: https://github.com/Wangt-CN/DisCo

Gradient Reweighting: Towards Imbalanced Class-Incremental Learning

论文/Paper: http://arxiv.org/pdf/2402.18528
代码/Code: None

TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding

论文/Paper: http://arxiv.org/pdf/2402.18490
代码/Code: None

Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting

论文/Paper: http://arxiv.org/pdf/2402.18330
代码/Code: https://github.com/tho-kn/egotap

Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing

论文/Paper: http://arxiv.org/pdf/2402.18277
代码/Code: None

Misalignment-Robust Frequency Distribution Loss for Image Transformation

论文/Paper: http://arxiv.org/pdf/2402.18192
代码/Code: https://github.com/eezkni/FDL

3DSFLabelling: Boosting 3D Scene Flow Estimation by Pseudo Auto-labelling

论文/Paper: http://arxiv.org/pdf/2402.18146
代码/Code: https://github.com/jiangchaokang/3dsflabelling

OccTransformer: Improving BEVFormer for 3D camera-only occupancy prediction

论文/Paper: http://arxiv.org/pdf/2402.18140
代码/Code: None

UniVS: Unified and Universal Video Segmentation with Prompts as Queries

论文/Paper: http://arxiv.org/pdf/2402.18115
代码/Code: https://github.com/minghanli/univs

Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis

论文/Paper: http://arxiv.org/pdf/2402.18078
代码/Code: https://github.com/YanzuoLu/CFLD

Boosting Neural Representations for Videos with a Conditional Decoder

论文/Paper: http://arxiv.org/pdf/2402.18152
代码/Code: None

Classes Are Not Equal: An Empirical Study on Image Recognition Fairness

论文/Paper: http://arxiv.org/pdf/2402.18133
代码/Code: None

QN-Mixer: A Quasi-Newton MLP-Mixer Model for Sparse-View CT Reconstruction

论文/Paper: http://arxiv.org/pdf/2402.17951
代码/Code: None

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

论文/Paper: http://arxiv.org/pdf/2402.19479
代码/Code: None

SeMoLi: What Moves Together Belongs Together

论文/Paper: http://arxiv.org/pdf/2402.19463
代码/Code: None

Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction

论文/Paper: http://arxiv.org/pdf/2402.19326
代码/Code: None

CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

论文/Paper: http://arxiv.org/pdf/2402.19231
代码/Code: https://github.com/lu-feng/cricavpr

MemoNav: Working Memory Model for Visual Navigation

论文/Paper: http://arxiv.org/pdf/2402.19161
代码/Code: None

VideoMAC: Video Masked Autoencoders Meet ConvNets

论文/Paper: http://arxiv.org/pdf/2402.19082
代码/Code: https://github.com/nust-machine-intelligence-laboratory/videomac

Theoretically Achieving Continuous Representation of Oriented Bounding Boxes

论文/Paper: http://arxiv.org/pdf/2402.18975
代码/Code: https://github.com/Jittor/JDet

OHTA: One-shot Hand Avatar via Data-driven Implicit Priors

论文/Paper: http://arxiv.org/pdf/2402.18969
代码/Code: None

WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts

论文/Paper: http://arxiv.org/pdf/2402.18956
代码/Code: None

Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation

论文/Paper: http://arxiv.org/pdf/2402.18920
代码/Code: None

SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting

论文/Paper: http://arxiv.org/pdf/2402.18848
代码/Code: None

ViewFusion: Towards Multi-View Consistency via Interpolated Denoising

论文/Paper: http://arxiv.org/pdf/2402.18842
代码/Code: None

OpticalDR: A Deep Optical Imaging Model for Privacy-Protective Depression Recognition

论文/Paper: http://arxiv.org/pdf/2402.18786
代码/Code: None

NARUTO: Neural Active Reconstruction from Uncertain Target Observations

论文/Paper: http://arxiv.org/pdf/2402.18771
代码/Code: None

Towards Generalizable Tumor Synthesis

论文/Paper: http://arxiv.org/pdf/2402.19470
代码/Code: None

Rethinking Multi-domain Generalization with A General Learning Objective

论文/Paper: http://arxiv.org/pdf/2402.18853
代码/Code: None

Rethinking Inductive Biases for Surface Normal Estimation

论文/Paper: http://arxiv.org/pdf/2403.00712
代码/Code: https://github.com/baegwangbin/DSINE

SURE: SUrvey REcipes for building reliable and robust deep networks

论文/Paper: http://arxiv.org/pdf/2403.00543
代码/Code: https://github.com/YutingLi0606/SURE

Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching

论文/Paper: http://arxiv.org/pdf/2403.00486
代码/Code: https://github.com/Windsrain/Selective-Stereo

Deformable One-shot Face Stylization via DINO Semantic Guidance

论文/Paper: http://arxiv.org/pdf/2403.00459
代码/Code: https://github.com/zichongc/DoesFS

CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation

论文/Paper: http://arxiv.org/pdf/2403.00274
代码/Code: None

NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors

论文/Paper: http://arxiv.org/pdf/2403.03122
代码/Code: None

Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos

论文/Paper: http://arxiv.org/pdf/2403.02782
代码/Code: None

HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes

论文/Paper: http://arxiv.org/pdf/2403.02769
代码/Code: None

Learning Group Activity Features Through Person Attribute Prediction

论文/Paper: http://arxiv.org/pdf/2403.02753
代码/Code: https://github.com/chihina/GAFL-CVPR2024

Interactive Continual Learning: Fast and Slow Thinking

论文/Paper: http://arxiv.org/pdf/2403.02628
代码/Code: None

Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation

论文/Paper: http://arxiv.org/pdf/2403.03890
代码/Code: None

DART: Implicit Doppler Tomography for Radar Novel View Synthesis

论文/Paper: http://arxiv.org/pdf/2403.03896
代码/Code: None

MeaCap: Memory-Augmented Zero-shot Image Captioning

论文/Paper: http://arxiv.org/pdf/2403.03715
代码/Code: https://github.com/joeyz0z/MeaCap

HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations

论文/Paper: http://arxiv.org/pdf/2403.03561
代码/Code: None

Continual Segmentation with Disentangled Objectness Learning and Class Recognition

论文/Paper: http://arxiv.org/pdf/2403.03477
代码/Code: https://github.com/jordangong/CoMasTRe

HDRFlow: Real-Time HDR Video Reconstruction with Large Motions

论文/Paper: http://arxiv.org/pdf/2403.03447
代码/Code: None

LEAD: Learning Decomposition for Source-free Universal Domain Adaptation

论文/Paper: http://arxiv.org/pdf/2403.03421
代码/Code: https://github.com/ispc-lab/lead

F$^3$Loc: Fusion and Filtering for Floorplan Localization

论文/Paper: http://arxiv.org/pdf/2403.03370
代码/Code: None

Enhancing Vision-Language Pre-training with Rich Supervisions

论文/Paper: http://arxiv.org/pdf/2403.03346
代码/Code: None

Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed

论文/Paper: http://arxiv.org/pdf/2403.04765
代码/Code: None

Discriminative Sample-Guided and Parameter-Efficient Feature Space Adaptation for Cross-Domain Few-Shot Learning

论文/Paper: http://arxiv.org/pdf/2403.04492
代码/Code: https://github.com/rashindrie/dipa

Learning to Remove Wrinkled Transparent Film with Polarized Prior

论文/Paper: http://arxiv.org/pdf/2403.04368
代码/Code: https://github.com/jqtangust/filmremoval

LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking

论文/Paper: http://arxiv.org/pdf/2403.04303
代码/Code: None

Active Generalized Category Discovery

论文/Paper: http://arxiv.org/pdf/2403.04272
代码/Code: https://github.com/mashijie1028/activegcd

MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection

论文/Paper: http://arxiv.org/pdf/2403.04149
代码/Code: https://github.com/ispc-lab/map

A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

论文/Paper: http://arxiv.org/pdf/2403.04245
代码/Code: https://github.com/dalision/modalbiasavsr

Seamless Human Motion Composition with Blended Positional Encodings

论文/Paper: https://arxiv.org/abs/2402.15509
代码/Code: https://github.com/BarqueroGerman/FlowMDM

DiffusionLight: Light Probes for Free by Painting a Chrome Ball

论文/Paper: https://arxiv.org/abs/2312.09168
代码/Code: https://github.com/DiffusionLight/DiffusionLight

SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting

论文/Paper: http://arxiv.org/pdf/2403.05087
代码/Code: https://github.com/initialneil/SplattingAvatar

Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

论文/Paper: http://arxiv.org/pdf/2403.06946
代码/Code: https://github.com/tl-uestc/unimos

Real-Time Simulated Avatar from Head-Mounted Sensors

论文/Paper: http://arxiv.org/pdf/2403.06862
代码/Code: None

DiaLoc: An Iterative Approach to Embodied Dialog Localization

论文/Paper: http://arxiv.org/pdf/2403.06846
代码/Code: None

FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation

论文/Paper: http://arxiv.org/pdf/2403.06775
代码/Code: https://github.com/modelscope/facechain

EarthLoc: Astronaut Photography Localization by Indexing Earth from Space

论文/Paper: http://arxiv.org/pdf/2403.06758
代码/Code: https://github.com/gmberton/earthloc

CAM Back Again: Large Kernel CNNs from a Weakly Supervised Object Localization Perspective

论文/Paper: http://arxiv.org/pdf/2403.06676
代码/Code: https://github.com/snskysk/cam-back-again

Distributionally Generative Augmentation for Fair Facial Attribute Classification

论文/Paper: http://arxiv.org/pdf/2403.06606
代码/Code: https://github.com/heqianpei/diga

Exploiting Style Latent Flows for Generalizing Deepfake Video Detection

论文/Paper: http://arxiv.org/pdf/2403.06592
代码/Code: None

MoST: Motion Style Transformer between Diverse Action Contents

论文/Paper: http://arxiv.org/pdf/2403.06225
代码/Code: https://github.com/Boeun-Kim/MoST

Coherent Temporal Synthesis for Incremental Action Segmentation

论文/Paper: http://arxiv.org/pdf/2403.06102
代码/Code: None

Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?

论文/Paper: http://arxiv.org/pdf/2403.06092
代码/Code: None

LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content

论文/Paper: http://arxiv.org/pdf/2403.05854
代码/Code: None

PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor

论文/Paper: http://arxiv.org/pdf/2403.06668
代码/Code: None

SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection

论文/Paper: http://arxiv.org/pdf/2403.03170
代码/Code: None

Multi-Task Dense Prediction via Mixture of Low-Rank Experts

论文/Paper: https://arxiv.org/abs/2403.17749
代码/Code: https://github.com/YuqiYang213/MLoRE

Beyond Text: Frozen Large Language Models in Visual Signal Comprehension

论文/Paper: http://arxiv.org/pdf/2403.07874
代码/Code: https://github.com/zh460045050/v2l-tokenizer

Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis

论文/Paper: http://arxiv.org/pdf/2403.07719
代码/Code: https://github.com/wonderlandxd/wikg

Robust Synthetic-to-Real Transfer for Stereo Matching

论文/Paper: http://arxiv.org/pdf/2403.07705
代码/Code: https://github.com/jiaw-z/dkt-stereo

CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers

论文/Paper: http://arxiv.org/pdf/2403.07700
代码/Code: https://github.com/shahaf-arica/cuvler

Masked AutoDecoder is Effective Multi-Task Vision Generalist

论文/Paper: http://arxiv.org/pdf/2403.07692
代码/Code: https://github.com/hanqiu-hq/mad

PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution

论文/Paper: http://arxiv.org/pdf/2403.07589
代码/Code: None

Unleashing Network Potentials for Semantic Scene Completion

论文/Paper: http://arxiv.org/pdf/2403.07560
代码/Code: https://github.com/fereenwong/ammnet

Open-World Semantic Segmentation Including Class Similarity

论文/Paper: http://arxiv.org/pdf/2403.07532
代码/Code: https://github.com/PRBonn/ContMAV

ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

论文/Paper: http://arxiv.org/pdf/2403.07392
代码/Code: https://github.com/Traffic-X/ViT-CoMer

FSC: Few-point Shape Completion

论文/Paper: http://arxiv.org/pdf/2403.07359
代码/Code: None

Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture

论文/Paper: http://arxiv.org/pdf/2403.07347
代码/Code: https://github.com/jiafei127/fd4mm

A Bayesian Approach to OOD Robustness in Image Classification

论文/Paper: http://arxiv.org/pdf/2403.07277
代码/Code: None

返回目录/back

For Tasks:

Click tags to check more tools for each tasks

object detection image segmentation image generation video analysis

For Jobs:

computer vision engineer machine learning engineer data scientist research scientist software engineer

Alternative AI tools for CVPR2024-Papers-with-Code-Demo

Similar Open Source Tools

CVPR2024-Papers-with-Code-Demo

github

: 1.2k

Wegent

Wegent is an open-source AI-native operating system designed to define, organize, and run intelligent agent teams. It offers various core features such as a chat agent with multi-model support, conversation history, group chat, attachment parsing, follow-up mode, error correction mode, long-term memory, sandbox execution, and extensions. Additionally, Wegent includes a code agent for cloud-based code execution, AI feed for task triggers, AI knowledge for document management, and AI device for running tasks locally. The platform is highly extensible, allowing for custom agents, agent creation wizard, organization management, collaboration modes, skill support, MCP tools, execution engines, YAML config, and an API for easy integration with other systems.

github

: 417

unilm

The 'unilm' repository is a collection of tools, models, and architectures for Foundation Models and General AI, focusing on tasks such as NLP, MT, Speech, Document AI, and Multimodal AI. It includes various pre-trained models, such as UniLM, InfoXLM, DeltaLM, MiniLM, AdaLM, BEiT, LayoutLM, WavLM, VALL-E, and more, designed for tasks like language understanding, generation, translation, vision, speech, and multimodal processing. The repository also features toolkits like s2s-ft for sequence-to-sequence fine-tuning and Aggressive Decoding for efficient sequence-to-sequence decoding. Additionally, it offers applications like TrOCR for OCR, LayoutReader for reading order detection, and XLM-T for multilingual NMT.

github

: 19.6k

agenta

Agenta is an open-source LLM developer platform for prompt engineering, evaluation, human feedback, and deployment of complex LLM applications. It provides tools for prompt engineering and management, evaluation, human annotation, and deployment, all without imposing any restrictions on your choice of framework, library, or model. Agenta allows developers and product teams to collaborate in building production-grade LLM-powered applications in less time.

github

: 3.8k

timeline-studio

Timeline Studio is a next-generation professional video editor with AI integration that automates content creation for social media. It combines the power of desktop applications with the convenience of web interfaces. With 257 AI tools, GPU acceleration, plugin system, multi-language interface, and local processing, Timeline Studio offers complete video production automation. Users can create videos for various social media platforms like TikTok, YouTube, Vimeo, Telegram, and Instagram with optimized versions. The tool saves time, understands trends, provides professional quality, and allows for easy feature extension through plugins. Timeline Studio is open source, transparent, and offers significant time savings and quality improvements for video editing tasks.

github

: 56

stable-diffusion.cpp

The stable-diffusion.cpp repository provides an inference implementation of Stable Diffusion in pure C/C++. It supports different versions of Stable Diffusion, is lightweight and dependency-free, and offers several quantization options, memory-efficient CPU inference, GPU acceleration, and more. Users can download the prebuilt executable or build it manually. The repository also includes instructions for downloading weights, building from scratch, using different acceleration methods, running the tool, converting weights, and using features such as Flash Attention, ESRGAN upscaling, and PhotoMaker support. Additionally, it lists future TODOs and provides information on memory requirements, bindings, UIs, contributors, and references.

github

: 5.4k

AI-on-the-edge-device

AI-on-the-edge-device is a project that enables users to digitize analog water, gas, power, and other meters using an ESP32 board with a supported camera. It integrates TensorFlow Lite for AI processing, runs on a small and affordable device with integrated camera and illumination, provides a web interface for administration and control, and supports Home Assistant, InfluxDB, MQTT, and a REST API. The device captures meter images, extracts Regions of Interest (ROIs), runs them through AI for digitization, and lets users send the data to MQTT or InfluxDB or access it via the REST API. The project also includes 3D-printable housing options and tools for logfile management.

github

: 7.5k
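The capture, ROI extraction, and digitization steps described for AI-on-the-edge-device can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `(x, y, width, height)` ROI layout, the function names `crop` and `read_meter`, and the toy classifier are all assumptions made for the example, and the real device runs a TensorFlow Lite model on the ESP32 instead.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]            # grayscale image as rows of pixel values
ROI = Tuple[int, int, int, int]    # hypothetical layout: (x, y, width, height)


def crop(image: Image, roi: ROI) -> Image:
    """Cut one rectangular region out of the image."""
    x, y, w, h = roi
    return [row[x:x + w] for row in image[y:y + h]]


def read_meter(image: Image, rois: List[ROI],
               classify: Callable[[Image], int]) -> str:
    """Classify each ROI as one digit and join the digits in ROI order."""
    return "".join(str(classify(crop(image, roi))) for roi in rois)


# Toy data: a 2x6 "image" whose pixel values stand in for digit shapes.
image = [[1, 1, 4, 4, 7, 7],
         [1, 1, 4, 4, 7, 7]]
rois = [(0, 0, 2, 2), (2, 0, 2, 2), (4, 0, 2, 2)]

# Stand-in for the AI model: it simply returns the top-left pixel value.
reading = read_meter(image, rois, classify=lambda patch: patch[0][0])
print(reading)
```

Passing the classifier in as a function keeps the cropping logic separate from whatever model does the digit recognition.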

Awesome-LVLM-Hallucination

github

: 93

Interview-for-Algorithm-Engineer

This repository provides a collection of interview questions and answers for algorithm engineers. The questions are organized by topic, and each question includes a detailed explanation of the answer. This repository is a valuable resource for anyone preparing for an algorithm engineering interview.

github

: 1.4k

llm-agents.nix

This repository provides Nix packages for AI coding agents and development tools, automatically updated daily. The packaged agents and tools are meant for use in the terminal and cover code assistance, AI-powered development agents, CLI tools for AI coding, workflow and project management, code review, utilities such as search tools and browser automation, and usage analytics for AI coding sessions. The repository also includes experimental features such as sandboxed execution, provider abstraction, and tool composition to explore how Nix can enhance AI-powered development.

github

: 629

robusta

Robusta is a tool designed to enhance Prometheus notifications for Kubernetes environments. It offers features such as smart grouping to reduce notification spam, AI investigation for alert analysis, alert enrichment with additional data like pod logs, self-healing capabilities for defining auto-remediation rules, advanced routing options, problem detection without PromQL, change-tracking for Kubernetes resources, auto-resolve functionality, and integration with various external systems like Slack, Teams, and Jira. Users can utilize Robusta with or without Prometheus, and it can be installed alongside existing Prometheus setups or as part of an all-in-one Kubernetes observability stack.

github

: 2.9k

Embodied-AI-Guide

Embodied-AI-Guide is a comprehensive guide for beginners to understand Embodied AI, focusing on the path of entry and useful information in the field. It covers topics such as Reinforcement Learning, Imitation Learning, Large Language Model for Robotics, 3D Vision, Control, Benchmarks, and provides resources for building cognitive understanding. The repository aims to help newcomers quickly establish knowledge in the field of Embodied AI.

github

: 4.1k

fastapi-admin

智元 Fast API is a one-stop API management system that unifies various LLM APIs in terms of format, standards, and management to achieve the ultimate in functionality, performance, and user experience. It includes model management with intelligent and regex matching, backup models, key management, proxy management, company management, user management, and chat management for both the admin and user ends. The project supports cluster deployment, multi-site deployment, and cross-region deployment. It also provides a public API site where users can register and contact the author for a 10 million quota. The tool offers a comprehensive dashboard along with model, application, key, and chat management for users.

github

: 114

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. Its key features: it is self-contained, with no need for a DBMS or cloud service; it exposes an OpenAPI interface that is easy to integrate with existing infrastructure (e.g., a Cloud IDE); and it supports consumer-grade GPUs.

github

: 32.9k

presenton

Presenton is an open-source AI presentation generator and API that allows users to create professional presentations locally on their devices. It offers complete control over the presentation workflow, including custom templates, AI template generation, flexible generation options, and export capabilities. Users can use their own API keys for various models, integrate with Ollama for local model running, and connect to OpenAI-compatible endpoints. The tool supports multiple providers for text and image generation, runs locally without cloud dependencies, and can be deployed as a Docker container with GPU support.

github

: 4.0k

Awesome_papers_on_LLMs_detection

This repository is a curated list of papers focused on the detection of Large Language Models (LLMs)-generated content. It includes the latest research papers covering detection methods, datasets, attacks, and more. The repository is regularly updated to include the most recent papers in the field.

github

: 147

For similar tasks

CVPR2024-Papers-with-Code-Demo

github

: 1.2k

ezlocalai

ezlocalai is an artificial intelligence server that simplifies running multimodal AI models locally. It handles model downloading and server configuration based on hardware specs. It offers OpenAI-style endpoints for integration, along with voice cloning, text-to-speech, voice-to-text, and offline image generation. Users can modify environment variables for customization. It supports NVIDIA GPU and CPU setups and provides a demo UI and workflow visualization for easy usage.

github

: 67

ms-copilot-play

Microsoft Copilot Play is a Cloudflare Worker service that accelerates Microsoft Copilot functionalities in China. It allows high-speed access to Microsoft Copilot features like chatting, notebook, plugins, image generation, and sharing. The service filters out meaningless requests used for statistics, saving up to 80% of Cloudflare Worker requests. Users can deploy the service easily with Cloudflare Worker, ensuring fast and unlimited access with no additional operations. The service leverages the power of Microsoft Copilot, based on OpenAI GPT-4, and utilizes Bing search to answer questions.

github

: 221

oh-my-pi

oh-my-pi is an AI coding agent for the terminal, providing tools for interactive coding, AI-powered git commits, Python code execution, LSP integration, time-traveling streamed rules, interactive code review, task management, interactive questioning, custom TypeScript slash commands, universal config discovery, MCP & plugin system, web search & fetch, SSH tool, Cursor provider integration, multi-credential support, image generation, TUI overhaul, edit fuzzy matching, and more. It offers a modern terminal interface with smart session management, supports multiple AI providers, and includes various tools for coding, task management, code review, and interactive questioning.

github

: 831

kaapana

Kaapana is an open-source toolkit for state-of-the-art platform provisioning in the field of medical data analysis. The applications comprise AI-based workflows and federated learning scenarios with a focus on radiological and radiotherapeutic imaging. Obtaining large amounts of medical data necessary for developing and training modern machine learning methods is an extremely challenging effort that often fails in a multi-center setting, e.g. due to technical, organizational and legal hurdles. A federated approach where the data remains under the authority of the individual institutions and is only processed on-site is, in contrast, a promising approach ideally suited to overcome these difficulties. Following this federated concept, the goal of Kaapana is to provide a framework and a set of tools for sharing data processing algorithms, for standardized workflow design and execution as well as for performing distributed method development. This will facilitate data analysis in a compliant way enabling researchers and clinicians to perform large-scale multi-center studies. By adhering to established standards and by adopting widely used open technologies for private cloud development and containerized data processing, Kaapana integrates seamlessly with the existing clinical IT infrastructure, such as the Picture Archiving and Communication System (PACS), and ensures modularity and easy extensibility.

github

: 176

MONAI

MONAI is a PyTorch-based, open-source framework for deep learning in healthcare imaging. It provides a comprehensive set of tools for medical image analysis, including data preprocessing, model training, and evaluation. MONAI is designed to be flexible and easy to use, making it a valuable resource for researchers and developers in the field of medical imaging.

github

: 6.2k

PyTorch-Tutorial-2nd

The second edition of "PyTorch Practical Tutorial" was completed after 5 years, 4 years, and 2 years. Building on the essence of the first edition, it adds rich and detailed deep learning application cases and inference deployment frameworks, so that the book more systematically covers the knowledge a deep learning engineer needs. As artificial intelligence technology continues to develop, the second edition of "PyTorch Practical Tutorial" is not the end but the beginning, opening up new technologies, new fields, and new chapters. I hope to continue learning and making progress in artificial intelligence technology with you in the future.

github

: 2.8k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models, from images to sound.

github

: 94

For similar jobs

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. It provides a common API to deliver inference solutions on various platforms, including CPU, GPU, NPU, and heterogeneous devices. OpenVINO™ supports pre-trained models from Open Model Zoo and popular frameworks like TensorFlow, PyTorch, and ONNX. Key components of OpenVINO™ include the OpenVINO™ Runtime, plugins for different hardware devices, frontends for reading models from native framework formats, and the OpenVINO Model Converter (OVC) for adjusting models for optimal execution on target devices.

github

: 8.9k

peft

PEFT (Parameter-Efficient Fine-Tuning) is a collection of state-of-the-art methods that enable efficient adaptation of large pretrained models to various downstream applications. By only fine-tuning a small number of extra model parameters instead of all the model's parameters, PEFT significantly decreases the computational and storage costs while achieving performance comparable to fully fine-tuned models.

github

: 20.6k
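To make the PEFT claim about "a small number of extra model parameters" concrete, the arithmetic below compares a full weight matrix with a low-rank adapter added next to it. Low-rank adaptation is only one family of parameter-efficient methods and is used here as an assumption, since the description above names no specific method. The layer size (4096 x 4096) and rank (8) are illustrative values, and the code does not use the PEFT library itself.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in two adapter matrices: (rank, d_in) and (d_out, rank)."""
    return rank * (d_in + d_out)


# Illustrative sizes only; real models and adapter ranks vary.
d_in, d_out, rank = 4096, 4096, 8

full = full_params(d_in, d_out)
extra = low_rank_params(d_in, d_out, rank)
fraction = extra / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"low-rank adapter: {extra:,} trainable parameters")
print(f"fraction trained: {fraction:.4%}")
```

Under these assumed sizes the adapter trains well under one percent of the layer's parameters, which is the kind of saving in compute and storage that the description refers to.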

jetson-generative-ai-playground

This repo hosts tutorial documentation for running generative AI models on NVIDIA Jetson devices. The documentation is auto-generated and hosted on GitHub Pages using their CI/CD feature to automatically generate/update the HTML documentation site upon new commits.

github

: 94

emgucv

Emgu CV is a cross-platform .NET wrapper for the OpenCV image-processing library. It allows OpenCV functions to be called from .NET-compatible languages. The wrapper can be compiled with Visual Studio, Unity, or the "dotnet" command, and it runs on Windows, macOS, Linux, iOS, and Android.

github

: 2.1k

MMStar

MMStar is an elite vision-indispensable multi-modal benchmark comprising 1,500 challenge samples meticulously selected by humans. It addresses two key issues in current LLM evaluation: the unnecessary use of visual content in many samples and the existence of unintentional data leakage in LLM and LVLM training. MMStar evaluates 6 core capabilities across 18 detailed axes, ensuring a balanced distribution of samples across all dimensions.

github

: 84

VLMEvalKit

VLMEvalKit is an open-source evaluation toolkit for large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of preparing data across multiple repositories. In VLMEvalKit, we adopt generation-based evaluation for all LVLMs, and provide the evaluation results obtained with both exact matching and LLM-based answer extraction.

github

: 3.1k

llava-docker

This Docker image for LLaVA (Large Language and Vision Assistant) provides a convenient way to run LLaVA locally or on RunPod. LLaVA is a powerful AI tool that combines natural language processing and computer vision capabilities. With this Docker image, you can easily access LLaVA's functionalities for various tasks, including image captioning, visual question answering, text summarization, and more. The image comes pre-installed with LLaVA v1.2.0, Torch 2.1.2, xformers 0.0.23.post1, and other necessary dependencies. You can customize the model used by setting the MODEL environment variable. The image also includes a Jupyter Lab environment for interactive development and exploration. Overall, this Docker image offers a comprehensive and user-friendly platform for leveraging LLaVA's capabilities.

github

: 59