Showing 1 - 10 of 1,984 results for search: '"Li, GuangYao"'
The Segment Anything Model (SAM) has demonstrated strong performance in image segmentation of natural scene images. However, its effectiveness diminishes markedly when applied to specific scientific domains, such as Scanning Probe Microscope (SPM) im…
External link:
http://arxiv.org/abs/2410.12562
Cross-view geo-localization in GNSS-denied environments aims to determine an unknown location by matching drone-view images with the correct geo-tagged satellite-view images from a large gallery. Recent research shows that learning discriminative ima…
External link:
http://arxiv.org/abs/2408.02408
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. Such naturally multimodal videos contain rich and complex dynamic audio-visual components, with only…
External link:
http://arxiv.org/abs/2407.20693
Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Vi…
External link:
http://arxiv.org/abs/2407.10957
Author:
Li, Guangyao, Brydon, P. M. R.
The $j = 3/2$ fermions in cubic crystals or cold atomic gases can form Cooper pairs in both singlet ($J = 0$) and unconventional quintet ($J = 2$) $s$-wave states. Our study utilizes analytical field theory to examine fluctuations in these states wit…
External link:
http://arxiv.org/abs/2405.06111
Author:
Guo, Ruohao, Ying, Xianghua, Chen, Yaru, Niu, Dantong, Li, Guangyao, Qu, Liao, Qi, Yanyu, Zhou, Jinxing, Xing, Bowei, Yue, Wenzhen, Shi, Ji, Wang, Qixun, Zhang, Peiliang, Liang, Buwen
In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment and track individual sounding object instances in audible videos. To facilitate this research, we intro…
External link:
http://arxiv.org/abs/2310.18709
Audio-visual video parsing is the task of categorizing a video at the segment level with weak labels, and predicting them as audible or visible events. Recent methods for this task leverage the attention mechanism to capture the semantic correlations…
External link:
http://arxiv.org/abs/2310.07517
Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks but under the deman…
External link:
http://arxiv.org/abs/2309.07929
The Audio-Visual Question Answering (AVQA) task aims to answer questions about different visual objects, sounds, and their associations in videos. Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, where m…
External link:
http://arxiv.org/abs/2308.05421
We live in a world filled with never-ending streams of multimodal information. As a more natural recording of the real scenario, long form audio-visual videos are expected as an important bridge for better exploring and understanding the world. In th…
External link:
http://arxiv.org/abs/2306.09431