Showing 1 - 10 of 48 for search: '"Ko, Byungsoo"'
Recently, large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor…
External link:
http://arxiv.org/abs/2410.04751
Author:
Lee, Young-Jun, Lee, Dokyong, Youn, Junyoung, Oh, Kyeongjin, Ko, Byungsoo, Hyeon, Jonghwan, Choi, Ho-Jin
Humans share a wide variety of images related to their personal experiences within conversations via instant messaging tools. However, existing works focus on (1) image-sharing behavior in singular sessions, leading to limited long-term social intera…
External link:
http://arxiv.org/abs/2407.03958
In content-based video retrieval (CBVR), dealing with large-scale collections, efficiency is as important as accuracy; thus, several video-level feature-based studies have actively been conducted. Nevertheless, owing to the severe difficulty of embed…
External link:
http://arxiv.org/abs/2303.08906
As sharing images in an instant message is a crucial factor, there has been active research on learning image-text multi-modal dialogue models. However, training a well-generalized multi-modal dialogue model remains challenging due to the low qual…
External link:
http://arxiv.org/abs/2212.04119
Author:
Ko, Byungsoo, Kim, Han-Gyu, Heo, Byeongho, Yun, Sangdoo, Chun, Sanghyuk, Gu, Geonmo, Kim, Wonjae
Vision Transformer (ViT) extracts the final representation from either the class token or an average of all patch tokens, following the architecture of the Transformer in Natural Language Processing (NLP) or Convolutional Neural Networks (CNNs) in computer v…
External link:
http://arxiv.org/abs/2212.04114
Strong image search models can be learned for a specific domain, i.e., a set of labels, provided that some labeled images of that domain are available. A practical visual search model, however, should be versatile enough to solve multiple retrieval tasks…
External link:
http://arxiv.org/abs/2210.02254
Author:
Ko, Byungsoo, Gu, Geonmo
This paper is a technical report to share our experience and findings building a Korean and English bilingual multimodal model. While many of the multimodal datasets focus on English and multilingual multimodal research uses machine-translated texts,…
External link:
http://arxiv.org/abs/2203.14463
In hash-based image retrieval systems, degraded or transformed inputs usually generate different codes from the original, deteriorating the retrieval accuracy. To mitigate this issue, data augmentation can be applied during training. However, even if…
External link:
http://arxiv.org/abs/2112.08816
Previous deep learning-based line segment detection (LSD) methods suffer from immense model size and high computational cost for line prediction. This constrains them from real-time inference in computationally restricted environments. In this paper, we…
External link:
http://arxiv.org/abs/2106.00186
In this paper, we study the compositional learning of images and texts for image retrieval. The query is given in the form of an image and text that describes the desired modifications to the image; the goal is to retrieve the target image that satis…
External link:
http://arxiv.org/abs/2104.03015