Showing 1 - 10 of 6,500 for search: '"Morcos AS"'
Data curation is commonly considered a "secret-sauce" for LLM training, with higher quality data usually leading to better LLM performance. Given the scale of internet-scraped corpora, data pruning has become a larger and larger focus. Specifically,
External link:
http://arxiv.org/abs/2407.00434
Author:
Abbas, Amro, Rusak, Evgenia, Tirumala, Kushal, Brendel, Wieland, Chaudhuri, Kamalika, Morcos, Ari S.
Utilizing massive web-scale datasets has led to unprecedented performance gains in machine learning models, but also imposes outlandish compute requirements for their training. In order to improve training and data efficiency, we here push the limits
External link:
http://arxiv.org/abs/2401.04578
Author:
Yang, Yu, Singh, Aaditya K., Elhoushi, Mostafa, Mahmoud, Anas, Tirumala, Kushal, Gloeckle, Fabian, Rozière, Baptiste, Wu, Carole-Jean, Morcos, Ari S., Ardalani, Newsha
Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. Prev
External link:
http://arxiv.org/abs/2312.02418
Author:
Sofia Sheikh, Brent Vela, Pejman Honarmandi, Peter Morcos, David Shoukr, Ibrahim Karaman, Alaa Elwany, Raymundo Arróyave
Published in:
npj Computational Materials, Vol 10, Iss 1, Pp 1-19 (2024)
Abstract In metal additive manufacturing (AM), processing parameters can affect the probability of macroscopic defect formation (lack-of-fusion, keyholing, balling), which can, in turn, jeopardize the final product’s integrity. A printability map c
External link:
https://doaj.org/article/f249ce0fd0ea4525ae43b714abecbcc8
Author:
Golrokhian-Sani, Amir-Ali, Morcos, Maya, Philippi, Alecco, Al-Rawi, Reem, Morcos, Marc marc.morcos1@hotmail.com, Fu, Rui
Published in:
PLoS ONE, 12/30/2024, Vol. 19, Issue 12, pp. 1-11
Author:
Aboushelib, Mohamed F., Morcos, Abdelfady B., Nawar, Samir, Shalabiea, Osama M., Awad, Zainab
Published in:
Nature portfolio, Scientific Reports (2023), volume 13, page 16754
Photoelectric observations of night sky brightness (NSB) at different zenith distances and azimuths, covering all the sky, at the Egyptian Kottamia Astronomical Observatory (KAO) site of coordinates φ = 29°55.9′N and λ = 31°49.
External link:
http://arxiv.org/abs/2310.05429
Author:
Mahmoud, Anas, Elhoushi, Mostafa, Abbas, Amro, Yang, Yu, Ardalani, Newsha, Leather, Hugh, Morcos, Ari
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. This underscores the critical need for dataset pruning, as the quality of these datasets is strongly correlated with the performance of VLMs on downstream
External link:
http://arxiv.org/abs/2310.02110
Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While training on
External link:
http://arxiv.org/abs/2308.12284
Author:
Bordes, Florian, Shekhar, Shashank, Ibrahim, Mark, Bouchacourt, Diane, Vincent, Pascal, Morcos, Ari S.
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and c
External link:
http://arxiv.org/abs/2308.03977
It is commonly observed that deep networks trained for classification exhibit class-selective neurons in their early and intermediate layers. Intriguingly, recent studies have shown that these class-selective neurons can be ablated without deteriorat
External link:
http://arxiv.org/abs/2305.17409