Showing 1 - 10 of 60,912 for search: '"A. Hanan"'
Author:
Munasinghe, Shehan, Gani, Hanan, Zhu, Wenqi, Cao, Jiale, Xing, Eric, Khan, Fahad Shahbaz, Khan, Salman
Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in vi
External link:
http://arxiv.org/abs/2411.04923
Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to
External link:
http://arxiv.org/abs/2410.18607
Multi-speaker localization and tracking using microphone array recording is of importance in a wide range of applications. One of the challenges with multi-speaker tracking is to associate direction estimates with the correct speaker. Most existing a
External link:
http://arxiv.org/abs/2410.11453
Neural multi-channel speech enhancement models, in particular those based on the U-Net architecture, demonstrate promising performance and generalization potential. These models typically encode input channels independently, and integrate the channel
External link:
http://arxiv.org/abs/2410.05019
In recent years, interest in vision-language tasks has grown, especially those involving chart interactions. These tasks are inherently multimodal, requiring models to process chart images, accompanying text, underlying data tables, and often user qu
External link:
http://arxiv.org/abs/2410.13883
Author:
Nawaz, Umair, Awais, Muhammad, Gani, Hanan, Naseer, Muzammal, Khan, Fahad, Khan, Salman, Anwer, Rao Muhammad
Capitalizing on the vast amount of image-text data, large-scale vision-language pre-training has demonstrated remarkable zero-shot capabilities and has been utilized in several applications. However, models trained on general everyday web-crawled data of
External link:
http://arxiv.org/abs/2410.01407
Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks, which match features of audio waveforms with class-specific text prompt features, inspired by advancements in Vision-Language Models (VLMs).
External link:
http://arxiv.org/abs/2409.19806
Author:
Ghaleb, Ali, ElSadawy, Eslam, Essam, Ihab, Abdelhakim, Mohamed, Zaki, Seif-Eldin, Fahim, Natalie, Bayoumi, Razan, Hindy, Hanan
The automation of guitar tablature generation from video inputs holds significant promise for enhancing music education, transcription accuracy, and performance analysis. Existing methods face challenges with consistency and completeness, particularl
External link:
http://arxiv.org/abs/2409.08618
Author:
Khallaghi, Sam, Abedi, Rahebe, Ali, Hanan Abou, Alemohammad, Hamed, Asipunu, Mary Dziedzorm, Alatise, Ismail, Ha, Nguyen, Luo, Boka, Mai, Cat, Song, Lei, Wussah, Amos, Xiong, Sitian, Yao, Yao-Ting, Zhang, Qi, Estes, Lyndon D.
The accuracy of mapping agricultural fields across large areas is steadily improving with high-resolution satellite imagery and deep learning (DL) models, even in regions where fields are small and geometrically irregular. However, developing effecti
External link:
http://arxiv.org/abs/2408.06467
Author:
Van Tuan, Dinh, Dery, Hanan
Treating the trion problem as an effective two-body system with exciton and electron components, we identify component exchange as the reason leading to trion formation. This mechanism can be visualized as a hole that toggles back and forth between t
External link:
http://arxiv.org/abs/2407.17445