Showing 1 - 10 of 396,932 for search: '"Quality of data"'
Synthetic data generation has become an increasingly popular way of training models without the need for large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse and scalable training examples, significantly redu
External link:
http://arxiv.org/abs/2410.18634
Author:
Gu, Shuhao, Zhang, Jialing, Zhou, Siyuan, Yu, Kevin, Xing, Zhaohu, Wang, Liangdong, Cao, Zhou, Jia, Jintao, Zhang, Zhuoyi, Wang, Yixuan, Hu, Zhenchong, Zhang, Bo-Wen, Li, Jijie, Liang, Dong, Zhao, Yingli, Ao, Yulong, Liu, Yaoqi, Feng, Fangxiang, Liu, Guang
Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance compared to closed-source models. In this work, we address this limitation by introducin
External link:
http://arxiv.org/abs/2410.18558
Author:
Liu, Zhongtao, Riley, Parker, Deutsch, Daniel, Lui, Alison, Niu, Mengmeng, Shah, Apu, Freitag, Markus
Collecting high-quality translations is crucial for the development and evaluation of machine translation systems. However, traditional human-only approaches are costly and slow. This study presents a comprehensive investigation of 11 approaches for
External link:
http://arxiv.org/abs/2410.11056
Despite the significant progress made in code generation with large language models, challenges persist, especially with hardware description languages such as Verilog. This paper first presents an analysis of fine-tuned LLMs on Verilog coding, with
External link:
http://arxiv.org/abs/2409.12993
Fine-grained air quality (AQ) mapping is made possible by the proliferation of cheap AQ micro-stations (MSs). However, their measurements are often inaccurate and sensitive to local disturbances, in contrast to standardized stations (SSs) that provid
External link:
http://arxiv.org/abs/2408.09526
Author:
Shulakov, Volodymyr
Synthetic tabular data is becoming a necessity as concerns about data privacy intensify in the world. Tabular data can be useful for testing various systems, simulating real data, analyzing the data itself or building predictive models. Unfortunately
External link:
http://arxiv.org/abs/2407.13016
Recent research in neural machine translation (NMT) has shown that training on high-quality machine-generated data can outperform training on human-generated data. This work accompanies the first-ever release of an LLM-generated, MBR-decoded and QE-re
External link:
http://arxiv.org/abs/2408.06537
Instruction fine-tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual
External link:
http://arxiv.org/abs/2408.01323
Author:
Liu, Zheng, Liang, Hao, Huang, Xijie, Xiong, Wentao, Yu, Qinhan, Sun, Linzhuang, Chen, Chong, He, Conghui, Cui, Bin, Zhang, Wentao
Recently, with the rise of web images, managing and understanding large-scale image datasets has become increasingly important. Vision Large Language Models (VLLMs) have recently emerged due to their robust vision-understanding capabilities. However,
External link:
http://arxiv.org/abs/2407.20756
The implementation of modern monitoring systems for power quality disturbances has the potential to generate substantial amounts of data, reaching a point where transmission and storage of high-frequency measurements become impractical. This researc
External link:
http://arxiv.org/abs/2407.01112