Benchmarking Robustness of Compositional Image Retrieval

Abstract

Compositional image retrieval (CIR) aims to retrieve a target image through a query specified by an input image paired with text describing desired modifications to that image. CIR has recently attracted attention for its ability to specify precise adjustments to the input image by leveraging both information-rich images and concise natural language instructions. In real-world applications, reference images can deviate from the original distribution, and textual descriptions can also vary across users. Despite its current success, the robustness of popular CIR methodologies to either (1) real-world corruptions or (2) variations of the textual descriptions has never been systematically evaluated. In this paper, we perform the first robustness study of CIR, establishing three new diverse benchmarks for a systematic evaluation of robustness to corruption in both the visual and textual domains, and further probing textual understanding. For analysis of natural image corruption, we introduce two new large-scale benchmark datasets, CIRR-C and FashionIQ-C, for the open domain and the fashion domain, respectively, both featuring 75 visual corruptions and 35 textual corruptions. Finally, to facilitate evaluation of textual understanding, we introduce a new diagnostic dataset, CIRR-D, by expanding the CIRR dataset with synthetic data. CIRR-D's textual descriptions are carefully modified to better probe text understanding across a range of factors: numerical variation, attribute variation, object removal, and background variation. We provide a testbed to benchmark ten published models, revealing how the composition of visual and textual modalities contributes to robustness.
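
To give a concrete sense of what a corrupted CIR query looks like, the following is a minimal sketch (not the official benchmark code) of one visual corruption and one textual corruption applied to a query pair. The corruption choices, severity-to-noise mapping, file name, and helper functions are illustrative assumptions rather than the exact CIRR-C / FashionIQ-C definitions.

```python
# Illustrative sketch of corrupting a CIR query (reference image + modification text).
# All names and parameter values below are assumptions for demonstration only.
import random
import numpy as np
from PIL import Image

def gaussian_noise(image: Image.Image, severity: int = 3) -> Image.Image:
    """Add zero-mean Gaussian noise; higher severity means a larger standard deviation."""
    std = [0.04, 0.06, 0.08, 0.12, 0.18][severity - 1]  # assumed severity-to-std mapping
    arr = np.asarray(image).astype(np.float32) / 255.0
    noisy = np.clip(arr + np.random.normal(0.0, std, arr.shape), 0.0, 1.0)
    return Image.fromarray((noisy * 255).astype(np.uint8))

def char_swap_typos(text: str, num_swaps: int = 2, seed: int = 0) -> str:
    """Corrupt the modification text by swapping adjacent characters inside words."""
    rng = random.Random(seed)
    chars = list(text)
    candidates = [i for i in range(len(chars) - 1)
                  if chars[i].isalpha() and chars[i + 1].isalpha()]
    for i in rng.sample(candidates, min(num_swaps, len(candidates))):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

if __name__ == "__main__":
    reference = Image.open("reference.jpg").convert("RGB")  # hypothetical query image
    corrupted_reference = gaussian_noise(reference, severity=3)
    corrupted_text = char_swap_typos("make the dog face the camera and remove the leash")
    # A robustness evaluation would feed (corrupted_reference, corrupted_text) to a
    # trained CIR model and compare Recall@K against the clean-query baseline.
```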


Acknowledgement

We appreciate the following codebase: