Description: |
Multimodal Language Models (MMLMs), such as LLaVA and GPT-4V, have shown zero-shot generalization capabilities for understanding images and text across various domains. However, their effectiveness in open-world visual tasks, particularly anomaly detection under challenging conditions such as low light or poor image quality, has yet to be thoroughly investigated. Assessing the robustness and limitations of MMLMs in these scenarios is essential for ensuring their reliability and safety in real-world applications, where input image quality can vary significantly. To address this gap, we propose a benchmark of 460 images captured under challenging conditions, including low light and blurring, designed specifically to evaluate the anomaly detection capabilities of MMLMs. We assess the performance of state-of-the-art MMLMs, such as Qwen-VL-Max-0809, GPT-4V, Gemini-1.5, Claude3-opus, ERNIE-Bot-4, and SparkDesk-v3.5, across six diverse scenes. Our evaluations indicate that these MMLMs struggle with anomaly detection in adverse scenarios, highlighting the need for further investigation into the underlying causes and potential improvement strategies. To tackle these limitations, we introduce ADAGENT, a novel anomaly detection agent framework that combines the “Chain of Critical Self-Reflection (CCS)”, specialized toolsets, and “Heuristic Retrieval-Augmented Generation (RAG)” to enhance anomaly detection performance with MMLMs. ADAGENT sequentially evaluates abilities such as text generation, semantic understanding, contextual comprehension, key information extraction, reasoning, and logical thinking. By implementing this framework, we demonstrate a $15\%\sim 30\%$ improvement in top-3 accuracy on anomaly detection tasks under adverse conditions compared with baseline approaches.
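
  Purely as an illustration of the kind of agent loop the description refers to, the sketch below shows one possible wiring of image-enhancement tools, heuristic retrieval, and critical self-reflection around a generic MMLM. Every function name here (query_mmlm, retrieve_examples, enhance_image) is a hypothetical placeholder introduced for this sketch, not the paper's actual interface.

    # Minimal sketch of an ADAGENT-style loop around a generic chat-style MMLM client.
    # All callables passed in (query_mmlm, retrieve_examples, enhance_image) are
    # hypothetical placeholders, not the framework's real API.

    def adagent_detect(image_path, query_mmlm, retrieve_examples, enhance_image, rounds=2):
        """Return an anomaly report for one image captured under adverse conditions."""
        # Specialized toolset: pre-process the degraded image (e.g., low-light
        # enhancement, deblurring) before it reaches the MMLM.
        image = enhance_image(image_path)

        # Heuristic RAG: fetch a few scene-specific exemplars or rules to ground the prompt.
        context = retrieve_examples(image, k=3)

        prompt = (
            "You are inspecting an image captured under difficult conditions.\n"
            f"Reference notes: {context}\n"
            "List any anomalies you can identify and explain your reasoning."
        )
        answer = query_mmlm(image, prompt)

        # Chain of Critical Self-Reflection: the model critiques and then revises
        # its own report for a fixed number of rounds.
        for _ in range(rounds):
            critique = query_mmlm(
                image,
                f"Critically review this anomaly report for missed or spurious findings:\n{answer}",
            )
            answer = query_mmlm(
                image,
                f"Revise the report using this critique.\nReport: {answer}\nCritique: {critique}",
            )
        return answer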