Partially Relevant Video Retrieval

Jianfeng Dong¹      Xianke Chen¹      Minsong Zhang¹      Xun Yang²      Shujie Chen¹      Xirong Li*³      Xun Wang*¹

¹School of Computer and Information Engineering, Zhejiang Gongshang University

²School of Information Science and Technology, University of Science and Technology of China

³Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China

Paper
Slides
Data
Code
Leaderboard
@inproceedings{dong2022prvr,
  title     = {Partially Relevant Video Retrieval},
  author    = {Jianfeng Dong and Xianke Chen and Minsong Zhang and Xun Yang and Shujie Chen and Xirong Li and Xun Wang},
  booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
  year      = {2022}
}
Abstract
Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning oriented datasets such as MSVD, MSR-VTT and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed with short duration, whilst the provided captions well describe the gist of the video content. Consequently, for a given paired video and caption, the video is supposed to be fully relevant to the caption. In reality, however, as queries are not known a priori, pre-trimmed video clips may not contain sufficient content to fully meet the query. This suggests a gap between the literature and the real world. To fill the gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An untrimmed video is considered to be partially relevant w.r.t. a given textual query if it contains a moment relevant to the query. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos. PRVR differs from single video moment retrieval and video corpus moment retrieval, as the latter two are to retrieve moments rather than untrimmed videos. We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames. Clips and frames represent video content at different time scales. We propose a Multi-Scale Similarity Learning (MS-SL) network that jointly learns clip-scale and frame-scale similarities for PRVR.
Formulation of PRVR
Given a natural language query, the task of PRVR aims to retrieve videos containing a moment that is semantically relevant to the given query, from a large corpus of untrimmed videos. As the moment referred to by the query is typically a small part of a video, we argue that the query is partially relevant to the video. It is worth pointing out that PRVR differs from conventional T2VR, where videos are pre-trimmed and much shorter, and queries are usually fully relevant to the whole video. To build a PRVR model, a set of untrimmed videos is given for training, where each video is associated with multiple natural language sentences. Each sentence describes the content of a specific moment in the corresponding video. Note that we do not have access to the start/end time points of the moments (moment annotations) referred to by the sentences. An untrimmed video is considered to be partially relevant w.r.t. a given textual query as long as the video contains a (short) moment relevant w.r.t. the query, as illustrated in the figure above.
Figure: Two textual queries partially relevant to a given video. Only a specific moment in the video is relevant to the corresponding query, while the other frames are irrelevant.
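To make the training setup concrete, below is a minimal Python sketch of how PRVR training data could be organized. The class and field names (PRVRVideo, video_id, sentences, etc.) are illustrative and not the schema of any dataset used here; the key point is that each untrimmed video comes with several sentences and no moment timestamps.

from dataclasses import dataclass, field
from typing import List

@dataclass
class PRVRVideo:
    """One untrimmed training video in PRVR (illustrative schema)."""
    video_id: str
    duration: float  # total length in seconds
    # Each sentence describes SOME moment inside the video, but the
    # start/end time of that moment is NOT annotated for training.
    sentences: List[str] = field(default_factory=list)

# A query is partially relevant to a video as long as the video contains
# a moment matching the query, however short that moment is.
example = PRVRVideo(
    video_id="v_0001",
    duration=182.4,
    sentences=[
        "A man opens the fridge and pours a glass of milk.",
        "Two people argue in the kitchen before one of them leaves.",
    ],
)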
Our method
We formulate the PRVR subtask as an MIL problem, simultaneously viewing a video as a bag of video clips and a bag of video frames. Clips and frames represent video content at different temporal scales. Based on this multi-scale video representation, we propose MS-SL to compute the relevance between videos and queries in a coarse-to-fine manner. The structure of our proposed model is shown in the figure below.
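To illustrate the coarse-to-fine idea, here is a minimal Python (PyTorch) sketch that scores a video against a query at two scales: a clip-scale score taken as the maximum similarity over clip embeddings, and a frame-scale score obtained by attention-weighted aggregation of frame similarities, fused with a weight alpha. The cosine similarity, the query-derived attention, and the fixed alpha are simplifying assumptions; in the actual MS-SL network the clip construction, the frame attention (guided by the best-matching clip), and the fusion are learned.

import torch
import torch.nn.functional as F

def multi_scale_similarity(query_emb, clip_embs, frame_embs, alpha=0.7):
    """Coarse-to-fine video-query similarity (illustrative sketch).

    query_emb : (d,)      sentence embedding
    clip_embs : (n_c, d)  embeddings of multi-scale video clips
    frame_embs: (n_f, d)  embeddings of individual frames
    alpha     : weight of the clip-scale (coarse) similarity
    """
    q = F.normalize(query_emb, dim=-1)
    clips = F.normalize(clip_embs, dim=-1)
    frames = F.normalize(frame_embs, dim=-1)

    # Clip scale: the video is a bag of clips; keep the best-matching clip.
    clip_sims = clips @ q                    # (n_c,)
    clip_score = clip_sims.max()

    # Frame scale: aggregate frame similarities with softmax attention
    # (a stand-in for the learned, key-clip-guided attention in MS-SL).
    frame_sims = frames @ q                  # (n_f,)
    attn = torch.softmax(frame_sims, dim=0)
    frame_score = (attn * frame_sims).sum()

    # Fuse the coarse and fine scores.
    return alpha * clip_score + (1 - alpha) * frame_score

# Toy usage with random features: videos are ranked by this score at retrieval time.
d, n_c, n_f = 256, 8, 64
score = multi_scale_similarity(torch.randn(d), torch.randn(n_c, d), torch.randn(n_f, d))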
Performance Comparison
As videos in popular T2VR datasets such as MSR-VTT, MSVD and VATEX are supposed to be fully relevant to the queries, these datasets are not suited for our experiments. Instead, we re-purpose three datasets commonly used for VCMR, i.e., TVR, ActivityNet Captions, and Charades-STA, treating their natural language queries as partially relevant to the corresponding videos (a query is typically associated with a specific moment in a video). The results on the three datasets are summarized in the tables below; a sketch of how the reported R@K and SumR metrics are computed follows the tables.
On TVR:
Model R@1 R@5 R@10 R@100 SumR
T2VR models:
W2VV, TMM18 [1] 2.6 5.6 7.5 20.6 36.3
HGR, CVPR20 [2] 1.7 4.9 8.3 35.2 50.1
HTM, ICCV19 [3] 3.8 12.0 19.1 63.2 98.2
CE, BMVC19 [4] 3.7 12.8 20.1 64.5 101.1
W2VV++, MM19 [5] 5.0 14.7 21.7 61.8 103.2
VSE++, BMVC19 [6] 7.5 19.9 27.7 66.0 121.1
DE, CVPR19 [7] 7.6 20.1 28.1 67.6 123.4
DE++, TPAMI21 [8] 8.8 21.9 30.2 67.4 128.3
RIVRL, TCSVT22 [9] 9.4 23.4 32.2 70.6 135.6
VCMR models w/o moment localization:
XML, ECCV20 [10] 10.0 26.5 37.3 81.3 155.1
ReLoCLNet, SIGIR21 [11] 10.7 28.1 38.1 80.3 157.1
MS-SL(Ours) [12] 13.5 32.1 43.4 83.4 172.3
On ActivityNet Captions:
Model R@1 R@5 R@10 R@100 SumR
T2VR models:
W2VV [1] 2.2 9.5 16.6 45.5 73.8
HTM [3] 3.7 13.7 22.3 66.2 105.9
HGR [2] 4.0 15.0 24.8 63.2 107.0
RIVRL [9] 5.2 18.0 28.2 66.4 117.8
VSE++ [6] 4.9 17.7 28.2 67.1 117.9
DE++ [8] 5.3 18.4 29.2 68.0 121.0
DE [7] 5.6 18.8 29.4 67.8 121.7
W2VV++ [5] 5.4 18.7 29.7 68.8 122.6
CE [4] 5.5 19.1 29.9 71.1 125.6
VCMR models w/o moment localization:
XML [10] 5.7 18.9 30.0 72.0 126.6
ReLoCLNet [11] 5.3 19.4 30.6 73.1 128.4
MS-SL(Ours) [12] 7.1 22.5 34.7 75.8 140.1
On Charades-STA:
Model R@1 R@5 R@10 R@100 SumR
T2VR models:
W2VV [1] 0.5 2.9 4.7 24.5 32.6
VSE++ [6] 0.8 3.9 7.2 31.7 43.6
W2VV++ [5] 0.9 3.5 6.6 34.3 45.3
HGR [2] 1.2 3.8 7.3 33.4 45.7
CE [4] 1.3 4.5 7.3 36.0 49.1
DE [7] 1.5 5.7 9.5 36.9 53.7
DE++ [8] 1.7 5.6 9.6 37.1 54.1
RIVRL [9] 1.6 5.6 9.4 37.7 54.3
HTM [3] 1.2 5.4 9.2 44.2 60.0
VCMR models w/o moment localization:
XML [10] 1.2 5.4 10.0 45.6 62.3
ReLoCLNet [11] 1.6 6.0 10.1 46.9 64.6
MS-SL(Ours) [12] 1.8 7.1 11.8 47.7 68.4
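For reference, R@K is the percentage of queries for which the relevant video is ranked within the top K results, and SumR is the sum of R@1, R@5, R@10 and R@100. Below is a minimal Python sketch of these metrics, assuming each query has exactly one relevant video and a precomputed query-by-video similarity matrix; the function name and arguments are illustrative.

import numpy as np

def recall_at_k(sim, gt_index, ks=(1, 5, 10, 100)):
    """sim: (n_queries, n_videos) similarity matrix.
    gt_index: (n_queries,) index of the single relevant video per query.
    Returns R@K values (in %) and their sum (SumR)."""
    ranking = np.argsort(-sim, axis=1)                 # best video first
    # 0-based rank of the ground-truth video for each query.
    ranks = np.array([np.where(ranking[i] == gt_index[i])[0][0]
                      for i in range(sim.shape[0])])
    recalls = {k: 100.0 * np.mean(ranks < k) for k in ks}
    recalls["SumR"] = sum(recalls[k] for k in ks)
    return recalls

# Toy usage: 1000 queries scored against 500 videos with random similarities.
rng = np.random.default_rng(0)
print(recall_at_k(rng.standard_normal((1000, 500)), rng.integers(0, 500, 1000)))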
Group Comparison
To gain a further understanding of the individual models, we define a moment-to-video ratio (M/V) for each query, measured as the length of the query's corresponding moment divided by the length of the entire video. Based on M/V, queries can be automatically classified into different groups, which enables a fine-grained analysis of how a specific model responds to different types of queries. For the baseline models, performance is lowest in the group with the smallest M/V and highest in the group with the largest M/V, which indicates that current video retrieval baselines better handle queries that are relevant to a larger portion of the corresponding video. By contrast, our model achieves more balanced performance across all groups, showing that it is less sensitive to irrelevant content in videos.
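The grouping itself is straightforward to reproduce; below is a minimal Python sketch that computes M/V per query and assigns group ids. The bin edges are illustrative, not the exact group boundaries used in the paper.

import numpy as np

def group_by_mv(moment_lengths, video_lengths, bin_edges=(0.0, 0.2, 0.4, 0.6, 1.0)):
    """Assign each query to an M/V group.

    moment_lengths: (n,) duration of the moment each query refers to
    video_lengths : (n,) duration of the corresponding untrimmed video
    bin_edges     : illustrative M/V boundaries (not the paper's exact bins)
    Returns the M/V ratio and a group id per query.
    """
    mv = np.asarray(moment_lengths) / np.asarray(video_lengths)  # moment-to-video ratio
    group = np.digitize(mv, bin_edges[1:-1])                     # 0 .. len(bin_edges)-2
    return mv, group

# Toy usage: per-group SumR can then be obtained by running the recall_at_k
# sketch above on the queries that fall into each group.
mv, group = group_by_mv([12.0, 45.0, 90.0], [180.0, 60.0, 100.0])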
Acknowledgement
This work was supported by the National Key R&D Program of China (2018YFB1404102), NSFC (62172420, 61902347, 61976188, 62002323), the Public Welfare Technology Research Project of Zhejiang Province (LGF21F020010), the Open Projects Program of the National Laboratory of Pattern Recognition, the Fundamental Research Funds for the Provincial Universities of Zhejiang, and Public Computing Cloud of RUC.
Contact
References
  1. Jianfeng Dong, Xirong Li, and Cees GM Snoek. 2018. Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia. 20, 12 (2018), 3377–3388.
  2. Shizhe Chen, Yida Zhao, Qin Jin, and Qi Wu. 2020. Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10638–10647.
  3. Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2630–2640.
  4. Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. 2019. Use what you have: Video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487 (2019).
  5. Xirong Li, Chaoxi Xu, Gang Yang, Zhineng Chen, and Jianfeng Dong. 2019. W2VV++: Fully deep learning for ad-hoc video search. In Proceedings of the 27th ACM International Conference on Multimedia. 1786–1794.
  6. Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving visual-semantic embeddings with hard negatives. In Proceedings of the British Machine Vision Conference. 935–943.
  7. Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, and Xun Wang. 2019. Dual encoding for zero-example video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9346–9355.
  8. Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, and Meng Wang. 2021. Dual encoding for video retrieval by text. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021), 1-1.
  9. Jianfeng Dong, Yabing Wang, Xianke Chen, Xiaoye Qu, Xirong Li, Yuan He, and Xun Wang. 2022. Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval. IEEE Transactions on Circuits and Systems for Video Technology (2022).
  10. Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. 2020. TVR: A large-scale dataset for video-subtitle moment retrieval. In European Conference on Computer Vision. 447–463.
  11. Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. 2021. Video corpus moment retrieval with contrastive learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 685–695.
  12. Jianfeng Dong, Xianke Chen, Minsong Zhang, Xun Yang, Shujie Chen, Xirong Li and Xun Wang. 2022. Partially Relevant Video Retrieval. In Proceedings of the 30th ACM International Conference on Multimedia.