¹School of Computer and Information Engineering, Zhejiang Gongshang University
²School of Information Science and Technology, University of Science and Technology of China
³Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China
```bibtex
@inproceedings{dong2022prvr,
  title = {Partially Relevant Video Retrieval},
  author = {Jianfeng Dong and Xianke Chen and Minsong Zhang and Xun Yang and Shujie Chen and Xirong Li and Xun Wang},
  booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
  year = {2022}
}
```
An untrimmed video is considered partially relevant w.r.t. a given textual query as long as the video contains a (short) moment relevant to that query, as illustrated in the figure above: two textual queries are partially relevant to the same video, yet for each query only a specific moment in the video is relevant, while the remaining frames are irrelevant.
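All tables below report Recall@K (R@K, in percent) at K = 1, 5, 10, 100, plus their sum (SumR). As a quick reference, here is a minimal sketch of how these metrics can be computed from the rank each query's relevant video obtains; the function names and the toy ranks are our own illustration, not code from this repository.

```python
import numpy as np

def recall_at_k(ranks, k):
    """R@K: percentage of queries whose relevant video is ranked
    within the top-k retrieved results (`ranks` is 1-based)."""
    ranks = np.asarray(ranks)
    return 100.0 * np.mean(ranks <= k)

def sum_recall(ranks, ks=(1, 5, 10, 100)):
    """SumR: sum of R@1, R@5, R@10 and R@100, as in the tables below."""
    return sum(recall_at_k(ranks, k) for k in ks)

# Toy example: ranks of the relevant video for five queries.
ranks = [1, 3, 12, 80, 200]
print(recall_at_k(ranks, 10))  # 40.0  (2 of 5 queries in the top 10)
print(sum_recall(ranks))       # 180.0 (20 + 40 + 40 + 80)
```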
Results on TVR:

| Model | R@1 | R@5 | R@10 | R@100 | SumR |
|---|---|---|---|---|---|
| T2VR models: | | | | | |
| W2VV, TMM18 [1] | 2.6 | 5.6 | 7.5 | 20.6 | 36.3 |
| HGR, CVPR20 [2] | 1.7 | 4.9 | 8.3 | 35.2 | 50.1 |
| HTM, ICCV19 [3] | 3.8 | 12.0 | 19.1 | 63.2 | 98.2 |
| CE, BMVC19 [4] | 3.7 | 12.8 | 20.1 | 64.5 | 101.1 |
| W2VV++, MM19 [5] | 5.0 | 14.7 | 21.7 | 61.8 | 103.2 |
| VSE++, BMVC19 [6] | 7.5 | 19.9 | 27.7 | 66.0 | 121.1 |
| DE, CVPR19 [7] | 7.6 | 20.1 | 28.1 | 67.6 | 123.4 |
| DE++, TPAMI21 [8] | 8.8 | 21.9 | 30.2 | 67.4 | 128.3 |
| RIVRL, TCSVT22 [9] | 9.4 | 23.4 | 32.2 | 70.6 | 135.6 |
| VCMR models w/o moment localization: | | | | | |
| XML, ECCV20 [10] | 10.0 | 26.5 | 37.3 | 81.3 | 155.1 |
| ReLoCLNet, SIGIR21 [11] | 10.7 | 28.1 | 38.1 | 80.3 | 157.1 |
| MS-SL (Ours) [12] | 13.5 | 32.1 | 43.4 | 83.4 | 172.3 |
Results on ActivityNet Captions:

| Model | R@1 | R@5 | R@10 | R@100 | SumR |
|---|---|---|---|---|---|
| T2VR models: | | | | | |
| W2VV [1] | 2.2 | 9.5 | 16.6 | 45.5 | 73.8 |
| HTM [3] | 3.7 | 13.7 | 22.3 | 66.2 | 105.9 |
| HGR [2] | 4.0 | 15.0 | 24.8 | 63.2 | 107.0 |
| RIVRL [9] | 5.2 | 18.0 | 28.2 | 66.4 | 117.8 |
| VSE++ [6] | 4.9 | 17.7 | 28.2 | 67.1 | 117.9 |
| DE++ [8] | 5.3 | 18.4 | 29.2 | 68.0 | 121.0 |
| DE [7] | 5.6 | 18.8 | 29.4 | 67.8 | 121.7 |
| W2VV++ [5] | 5.4 | 18.7 | 29.7 | 68.8 | 122.6 |
| CE [4] | 5.5 | 19.1 | 29.9 | 71.1 | 125.6 |
| VCMR models w/o moment localization: | | | | | |
| XML [10] | 5.7 | 18.9 | 30.0 | 72.0 | 126.6 |
| ReLoCLNet [11] | 5.3 | 19.4 | 30.6 | 73.1 | 128.4 |
| MS-SL (Ours) [12] | 7.1 | 22.5 | 34.7 | 75.8 | 140.1 |
Results on Charades-STA:

| Model | R@1 | R@5 | R@10 | R@100 | SumR |
|---|---|---|---|---|---|
| T2VR models: | | | | | |
| W2VV [1] | 0.5 | 2.9 | 4.7 | 24.5 | 32.6 |
| VSE++ [6] | 0.8 | 3.9 | 7.2 | 31.7 | 43.6 |
| W2VV++ [5] | 0.9 | 3.5 | 6.6 | 34.3 | 45.3 |
| HGR [2] | 1.2 | 3.8 | 7.3 | 33.4 | 45.7 |
| CE [4] | 1.3 | 4.5 | 7.3 | 36.0 | 49.1 |
| DE [7] | 1.5 | 5.7 | 9.5 | 36.9 | 53.7 |
| DE++ [8] | 1.7 | 5.6 | 9.6 | 37.1 | 54.1 |
| RIVRL [9] | 1.6 | 5.6 | 9.4 | 37.7 | 54.3 |
| HTM [3] | 1.2 | 5.4 | 9.2 | 44.2 | 60.0 |
| VCMR models w/o moment localization: | | | | | |
| XML [10] | 1.2 | 5.4 | 10.0 | 45.6 | 62.3 |
| ReLoCLNet [11] | 1.6 | 6.0 | 10.1 | 46.9 | 64.6 |
| MS-SL (Ours) [12] | 1.8 | 7.1 | 11.8 | 47.7 | 68.4 |
Grouping the test queries by moment-to-video ratio (M/V), i.e., the fraction of the video that is relevant to the query, the baseline models perform worst on the group with the lowest M/V and best on the group with the highest M/V. This result allows us to conclude that the current video retrieval baselines better handle queries that are relevant to a larger portion of the corresponding video. By contrast, our performance is more balanced across all groups, showing that the proposed model is less sensitive to irrelevant content in videos.
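For readers who want to reproduce this kind of analysis, below is a minimal sketch that buckets queries by M/V and reports SumR per bucket; the bin edges and all names are illustrative assumptions, not the paper's exact splits.

```python
import numpy as np

def sumr_by_mv_group(ranks, mv, edges=(0.0, 0.2, 0.4, 0.6, 1.0)):
    """Bucket queries by moment-to-video ratio (M/V) and report SumR
    per bucket. `mv[i]` is moment length / video length for query i,
    `ranks[i]` the 1-based rank of that query's relevant video."""
    ranks, mv = np.asarray(ranks), np.asarray(mv)
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins, except the last one which includes M/V = 1.0.
        mask = (mv >= lo) & ((mv < hi) if hi < edges[-1] else (mv <= hi))
        if mask.any():
            out[f"{lo:.1f}-{hi:.1f}"] = sum(
                100.0 * np.mean(ranks[mask] <= k) for k in (1, 5, 10, 100))
    return out

# Toy usage: a model that is balanced across M/V buckets would show
# similar SumR values in every bucket.
print(sumr_by_mv_group(ranks=[2, 15, 4, 120], mv=[0.1, 0.15, 0.7, 0.65]))
```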