Showing 1 - 10 of 125 for search: '"Song, Xuemeng"'
Composed Image Retrieval (CIR) is a challenging task that aims to retrieve the target image based on a multimodal query, i.e., a reference image and its corresponding modification text. While previous supervised or zero-shot learning paradigms all fa
External link:
http://arxiv.org/abs/2407.06001
Multimodal recommendation aims to recommend user-preferred candidates based on her/his historically interacted items and associated multimodal information. Previous studies commonly employ an embed-and-retrieve paradigm: learning user and item repres
External link:
http://arxiv.org/abs/2404.16555
Composed image retrieval (CIR) aims to retrieve the target image based on a multimodal query, i.e., a reference image paired with corresponding modification text. Recent CIR studies leverage vision-language pre-trained (VLP) methods as the feature ex
External link:
http://arxiv.org/abs/2404.15875
Authors:
Becattini, Federico; Chen, Xiaolin; Puccia, Andrea; Wen, Haokun; Song, Xuemeng; Nie, Liqiang; Del Bimbo, Alberto
Recommending fashion items often leverages rich user profiles and makes targeted suggestions based on past history and previous purchases. In this paper, we work under the assumption that no prior knowledge is given about a user. We propose to build
External link:
http://arxiv.org/abs/2402.11627
Sarcasm Explanation in Dialogue (SED) is a new yet challenging task, which aims to generate a natural language explanation for the given sarcastic dialogue that involves multiple modalities (i.e., utterance, video, and audio). Although existing studi
External link:
http://arxiv.org/abs/2402.03658
Multi-interest learning method for sequential recommendation aims to predict the next item according to user multi-faceted interests given the user historical interactions. Existing methods mainly consist of a multi-interest extractor that embeds the
External link:
http://arxiv.org/abs/2401.04312
Existing sign language translation methods follow a two-stage pipeline: first converting the sign language video to a gloss sequence (i.e. Sign2Gloss) and then translating the generated gloss sequence into a spoken language sentence (i.e. Gloss2Text)
External link:
http://arxiv.org/abs/2312.10210
Published in:
ACM Multimedia 2023
Composed image retrieval (CIR) is a new and flexible image retrieval paradigm, which can retrieve the target image for a multimodal query, including a reference image and its corresponding modification text. Although existing efforts have achieved co
External link:
http://arxiv.org/abs/2309.01366
Published in:
ACL 2023
Multimodal Sarcasm Explanation (MuSE) is a new yet challenging task, which aims to generate a natural language sentence for a multimodal social post (an image as well as its caption) to explain why it contains sarcasm. Although the existing pioneer s
External link:
http://arxiv.org/abs/2306.16650