Anomaly Detection in Visual Question Answering
NeurIPS Workshop on Safety and Robustness in Decision-making
Because vision and language are the two major modalities of human understanding, visual question answering (VQA) models have great potential for real-world applications. Treating VQA as a classification problem, recent studies have improved answer accuracy through comprehensive understanding of multimodal inputs from large-scale datasets. However, we point out that previous studies do not consider the unexpected situations that VQA applications can encounter after deployment in the real world. In this study, we define five types of anomalies in VQA, evaluate the robustness of typical models, and propose attention-based anomaly detection. The common approach to anomaly detection, the maximum softmax probability of the answer, does not guarantee high performance across the various types of anomalies. However, we show that the robustness of a VQA model can increase when the model contains attention modules, which learn high-order correlations between image and question. We expect this study to serve as a starting point for evaluating the robustness of VQA models and for establishing safe VQA applications in the real world.
Doyup Lee (POSTECH), Yeongjae Cheon (Kakao Brain), Woo-Sung Jung (POSTECH)
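
The abstract refers to the maximum softmax probability of the answer as the common anomaly detection baseline. The following is a minimal sketch of that baseline only, not the authors' implementation and not the attention-based method they propose. It assumes the VQA model outputs one logit per candidate answer. The function names, the example logits, and the threshold value of 0.5 are hypothetical and chosen purely for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of answer logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_anomaly_score(logits):
    """Return 1 - max softmax probability; higher means more anomalous."""
    return 1.0 - max(softmax(logits))


def is_anomalous(logits, threshold=0.5):
    """Flag an input as anomalous when its score exceeds the threshold.

    The default threshold is a hypothetical value; in practice it would
    be tuned on held-out data.
    """
    return msp_anomaly_score(logits) > threshold


# Hypothetical answer logits for two (image, question) pairs.
confident = [8.0, 0.5, 0.2, 0.1]  # one answer clearly dominates
uncertain = [1.0, 0.9, 1.1, 1.0]  # probability mass is spread out

print(is_anomalous(confident))
print(is_anomalous(uncertain))
```

A score of this kind depends only on the answer distribution, which may be one reason the abstract reports that it does not perform well on every anomaly type.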