Title
The meaning of "most" for visual question answering models
Abstract
The correct interpretation of quantifier statements in the context of a visual scene requires non-trivial inference mechanisms. For the example of "most", we discuss two strategies which rely on fundamentally different cognitive concepts. Our aim is to identify what strategy deep learning models for visual question answering learn when trained on such questions. To this end, we carefully design data to replicate experiments from psycholinguistics where the same question was investigated for humans. Focusing on the FiLM visual question answering model, our experiments indicate that a form of approximate number system emerges whose performance declines with more difficult scenes as predicted by Weber's law. Moreover, we identify confounding factors, like spatial arrangement of the scene, which impede the effectiveness of this system.
Year: 2018
DOI: 10.18653/v1/w19-4806
Venue: BLACKBOXNLP WORKSHOP ON ANALYZING AND INTERPRETING NEURAL NETWORKS FOR NLP AT ACL 2019
DocType:
Volume: abs/1812.11737
Citations: 0
Journal:
PageRank: 0.34
References: 14
Authors: 2
Name: Alexander Kuhnle, Order: 1, Citations: 7, PageRank: 1.11
Name: Ann Copestake, Order: 2, Citations: 862, PageRank: 95.10