Abstract
Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning oriented datasets such as MSVD, MSR-VTT, and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed and of short duration, while the provided captions describe the gist of the video content well.
Consequently, for a given paired video and caption, the video is supposed to be fully relevant to the caption. In reality, however, as queries are not known a priori, pre-trimmed video clips may not contain sufficient content to fully meet the query. This suggests a gap between the literature and the real world.
To fill the gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An untrimmed video is considered to be partially relevant w.r.t. a given textual query if it contains a moment relevant to the query. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos.
PRVR differs from single video moment retrieval and video corpus moment retrieval, as the latter two aim to retrieve moments rather than untrimmed videos.
We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames. Clips and frames represent video content at different time scales.
We propose a Multi-Scale Similarity Learning (MS-SL) network that jointly learns clip-scale and frame-scale similarities for PRVR.
Our Method
We formulate the PRVR subtask as a MIL problem, simultaneously viewing a video as a bag of video clips and a bag of video frames. Clips and frames represent video content at different temporal scales. Based on this multi-scale video representation, we propose MS-SL to compute the relevance between videos and queries in a coarse-to-fine manner. The structure of our proposed model is shown in the figure below.
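Complementing the figure, here is a minimal sketch of how a clip-scale and a frame-scale similarity could be fused into one video-query score. It is an illustration under simplified assumptions (max-pooled cosine similarities and a fixed fusion weight `alpha`), not the official MS-SL implementation; all tensor shapes, names, and the fusion rule are placeholders that only convey the multi-scale idea.

```python
# Illustrative sketch only, not the official MS-SL code: the fusion weight,
# feature dimensions, and max-pooling choices are assumptions.
import torch
import torch.nn.functional as F

def multi_scale_similarity(query, frame_feats, clip_feats, alpha=0.5):
    """Fuse clip-scale (coarse) and frame-scale (fine) similarities.

    query:       (d,)           sentence embedding
    frame_feats: (n_frames, d)  frame-level video embeddings
    clip_feats:  (n_clips, d)   clip-level video embeddings
    """
    q = F.normalize(query, dim=-1)
    frames = F.normalize(frame_feats, dim=-1)
    clips = F.normalize(clip_feats, dim=-1)

    # Clip scale: score of the best-matching clip (MIL-style max over the bag).
    clip_sim = (clips @ q).max()
    # Frame scale: score of the best-matching frame.
    frame_sim = (frames @ q).max()

    # Combine the two scales into one partial-relevance score.
    return alpha * clip_sim + (1 - alpha) * frame_sim

# Toy usage with random features (d = 512).
query = torch.randn(512)
frame_feats = torch.randn(96, 512)   # e.g. 96 sampled frames
clip_feats = torch.randn(32, 512)    # e.g. 32 clips of varying lengths
print(multi_scale_similarity(query, frame_feats, clip_feats).item())
```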
Group Comparison
To gain a deeper understanding of the individual models, we define the moment-to-video ratio (M/V) of a query as the length of its corresponding moment divided by the length of the entire video. According to M/V, queries can be automatically classified into different groups, which enables a fine-grained analysis of how a specific model responds to different types of queries, as illustrated by the sketch below.
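As a concrete illustration of M/V and of the automatic grouping, the snippet below computes the ratio from a query's annotated moment and bins queries by its value. The annotation format and the bin edges are assumptions made for this example, not the exact settings used in our experiments.

```python
# A minimal sketch, assuming each annotation gives the moment span and the
# video duration in seconds; the bin edges are illustrative placeholders.
def moment_to_video_ratio(moment_start, moment_end, video_duration):
    """M/V: length of the relevant moment divided by the length of the video."""
    return (moment_end - moment_start) / video_duration

def group_queries(annotations, bins=(0.2, 0.4, 0.6, 0.8)):
    """Assign each query to a group index according to its M/V value."""
    groups = {}
    for query_id, (start, end, duration) in annotations.items():
        mv = moment_to_video_ratio(start, end, duration)
        group = sum(mv > b for b in bins)  # group index in 0..len(bins)
        groups.setdefault(group, []).append(query_id)
    return groups

# Toy example: two queries on videos of 100 s and 60 s.
annotations = {"q1": (5.0, 15.0, 100.0),   # M/V = 0.10 -> group 0
               "q2": (10.0, 58.0, 60.0)}   # M/V = 0.80 -> group 3
print(group_queries(annotations))
```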
For the baseline models, performance is the lowest in the group with the smallest M/V and the highest in the group with the largest M/V. From this result we conclude that the current video retrieval baselines better handle queries that are relevant to a larger portion of the corresponding video. By contrast, the performance of our model is more balanced across all groups, which shows that it is less sensitive to irrelevant content in videos.
Acknowledgement
This work was supported by the National Key R&D Program of China (2018YFB1404102), NSFC (62172420, 61902347, 61976188, 62002323), the Public Welfare Technology Research Project of Zhejiang Province (LGF21F020010), the Open Projects Program of the National Laboratory of Pattern Recognition, the Fundamental Research Funds for the Provincial Universities of Zhejiang, and Public Computing Cloud of RUC.
References
- Jianfeng Dong, Xirong Li, and Cees GM Snoek. 2018. Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia 20, 12 (2018), 3377–3388.
- Shizhe Chen, Yida Zhao, Qin Jin, and Qi Wu. 2020. Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10638–10647.
- Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2630–2640.
- Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. 2019. Use what you have: Video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487 (2019).
- Xirong Li, Chaoxi Xu, Gang Yang, Zhineng Chen, and Jianfeng Dong. 2019. W2VV++: Fully deep learning for ad-hoc video search. In Proceedings of the 27th ACM International Conference on Multimedia. 1786–1794.
- Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving visual-semantic embeddings with hard negatives. In Proceedings of the British Machine Vision Conference. 935–943.
- Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, and Xun Wang. 2019. Dual encoding for zero-example video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9346–9355.
- Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, and Meng Wang. 2021. Dual encoding for video retrieval by text. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021), 1–1.
- Jianfeng Dong, Yabing Wang, Xianke Chen, Xiaoye Qu, Xirong Li, Yuan He, and Xun Wang. 2022. Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval. IEEE Transactions on Circuits and Systems for Video Technology (2022).
- Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. 2020. TVR: A large-scale dataset for video-subtitle moment retrieval. In European Conference on Computer Vision. 447–463.
- Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. 2021. Video corpus moment retrieval with contrastive learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 685–695.
- Jianfeng Dong, Xianke Chen, Minsong Zhang, Xun Yang, Shujie Chen, Xirong Li, and Xun Wang. 2022. Partially Relevant Video Retrieval. In Proceedings of the 30th ACM International Conference on Multimedia.