Showing 1 - 10 of 316 for search: '"MISHRA, MAYANK"'
Author:
Gupta, Sonam, Nandwani, Yatin, Yehudai, Asaf, Mishra, Mayank, Pandey, Gaurav, Raghu, Dinesh, Joshi, Sachindra
Fine-tuning Large Language Models (LLMs) on specific datasets is a common practice to improve performance on target tasks. However, this performance gain often leads to overfitting, where the model becomes too specialized in either the task or the…
External link:
http://arxiv.org/abs/2409.04787
Author:
Shen, Yikang, Stallone, Matthew, Mishra, Mayank, Zhang, Gaoyuan, Tan, Shawn, Prasad, Aditya, Soria, Adriana Meza, Cox, David D., Panda, Rameswar
Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters…
External link:
http://arxiv.org/abs/2408.13359
Author:
Stallone, Matt, Saxena, Vaibhav, Karlinsky, Leonid, McGinn, Bridget, Bula, Tim, Mishra, Mayank, Soria, Adriana Meza, Zhang, Gaoyuan, Prasad, Aditya, Shen, Yikang, Surendran, Saptha, Guttula, Shanmukha, Patel, Hima, Selvam, Parameswaran, Dang, Xuan-Hong, Koyfman, Yan, Sood, Atin, Feris, Rogerio, Desai, Nirmit, Cox, David D., Puri, Ruchir, Panda, Rameswar
This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling context length of Granite 3B/8B code models from 2K/4K to 128K consists of a light-weight continual pretraining…
External link:
http://arxiv.org/abs/2407.13739
Padding is often used in tuning LLM models by adding special tokens to shorter training examples to match the length of the longest sequence in each batch. While this ensures uniformity for batch processing, it introduces inefficiencies by including… (a short illustrative padding sketch follows this entry's link).
External link:
http://arxiv.org/abs/2407.09105
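The padding overhead this abstract refers to is easy to see in code. The snippet below is only an illustrative sketch, not the paper's method; the pad id, helper names, and toy batch are invented here for demonstration.

```python
# Illustrative only: right-pad a batch of tokenized examples to the longest
# sequence and measure how many positions are spent on padding.
PAD_ID = 0  # assumed pad token id for this toy example

def pad_batch(batch):
    """Right-pad every example to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, mask

batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27], [31, 32]]
padded, mask = pad_batch(batch)
total = sum(len(row) for row in padded)
real = sum(sum(row) for row in mask)
print(f"fraction of positions spent on padding: {1 - real / total:.0%}")
```

Packing several short examples into one sequence, with an attention mask that keeps them independent, is the usual way to recover the compute wasted on pad tokens.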
Author:
Gershon, Talia, Seelam, Seetharami, Belgodere, Brian, Bonilla, Milton, Hoang, Lan, Barnett, Danny, Chung, I-Hsin, Mohan, Apoorve, Chen, Ming-Hung, Luo, Lixiang, Walkup, Robert, Evangelinos, Constantinos, Salaria, Shweta, Dombrowa, Marc, Park, Yoonho, Kayi, Apo, Schour, Liran, Alim, Alim, Sydney, Ali, Maniotis, Pavlos, Schares, Laurent, Metzler, Bernard, Karacali-Akyamac, Bengi, Wen, Sophia, Chiba, Tatsuhiro, Choochotkaew, Sunyanan, Yoshimura, Takeshi, Misale, Claudia, Elengikal, Tonia, Connor, Kevin O, Liu, Zhuoran, Molina, Richard, Schneidenbach, Lars, Caden, James, Laibinis, Christopher, Fonseca, Carlos, Tarasov, Vasily, Sundararaman, Swaminathan, Schmuck, Frank, Guthridge, Scott, Cohn, Jeremy, Eshel, Marc, Muench, Paul, Liu, Runyu, Pointer, William, Wyskida, Drew, Krull, Bob, Rose, Ray, Wolfe, Brent, Cornejo, William, Walter, John, Malone, Colm, Perucci, Clifford, Franco, Frank, Hinds, Nigel, Calio, Bob, Druyan, Pavel, Kilduff, Robert, Kienle, John, McStay, Connor, Figueroa, Andrew, Connolly, Matthew, Fost, Edie, Roma, Gina, Fonseca, Jake, Levy, Ido, Payne, Michele, Schenkel, Ryan, Malki, Amir, Schneider, Lion, Narkhede, Aniruddha, Moshref, Shekeba, Kisin, Alexandra, Dodin, Olga, Rippon, Bill, Wrieth, Henry, Ganci, John, Colino, Johnny, Habeger-Rose, Donna, Pandey, Rakesh, Gidh, Aditya, Gaur, Aditya, Patterson, Dennis, Salmani, Samsuddin, Varma, Rambilas, Rumana, Rumana, Sharma, Shubham, Mishra, Mayank, Panda, Rameswar, Prasad, Aditya, Stallone, Matt, Zhang, Gaoyuan, Shen, Yikang, Cox, David, Puri, Ruchir, Agrawal, Dakshi, Thorstensen, Drew, Belog, Joel, Tang, Brent, Gupta, Saurabh Kumar, Biswas, Amitabha, Maheshwari, Anup, Gampel, Eran, Van Patten, Jason, Runion, Matthew, Kaki, Sai, Bogin, Yigal, Reitz, Brian, Pritko, Steve, Najam, Shahan, Nambala, Surya, Chirra, Radhika, Welp, Rick, DiMitri, Frank, Telles, Felipe, Arvelo, Amilcar, Chu, King, Seminaro, Ed, Schram, Andrew, Eickhoff, Felix, Hanson, William, Mckeever, Eric, Joseph, Dinakaran, Chaudhary, Piyush, Shivam, Piyush, Chaudhary, Puneet, Jones, Wesley, Guthrie, Robert, Bostic, Chris, Islam, Rezaul, Duersch, Steve, Sawdon, Wayne, Lewars, John, Klos, Matthew, Spriggs, Michael, McMillan, Bill, Gao, George, Kamra, Ashish, Singh, Gaurav, Curry, Marc, Katarki, Tushar, Talerico, Joe, Shi, Zenghui, Malleni, Sai Sindhur, Gallen, Erwan
AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational…
External link:
http://arxiv.org/abs/2407.05467
Author:
Brandon, William, Mishra, Mayank, Nrusimha, Aniruddha, Panda, Rameswar, Kelly, Jonathan Ragan
Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence lengths… (a back-of-the-envelope cache-size estimate follows this entry's link).
External link:
http://arxiv.org/abs/2405.12981
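For a sense of scale, the memory the abstract refers to can be estimated from the model shape. The configuration below is a generic 7B-class shape chosen for illustration, not a figure taken from the paper.

```python
# Back-of-the-envelope KV-cache size: every layer stores a K and a V tensor
# of shape [batch, kv_heads, seq_len, head_dim].
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    per_layer = 2 * batch * kv_heads * seq_len * head_dim * bytes_per_value  # K and V
    return layers * per_layer

# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16 values.
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=32_768, batch=1)
print(f"{size / 2**30:.1f} GiB for a single 32K-token sequence")  # ~16 GiB
```

The cache grows linearly with sequence length and batch size, which is why long-sequence decoding quickly becomes memory-bound.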
Author:
Mishra, Mayank, Stallone, Matt, Zhang, Gaoyuan, Shen, Yikang, Prasad, Aditya, Soria, Adriana Meza, Merler, Michele, Selvam, Parameswaran, Surendran, Saptha, Singh, Shivdeep, Sethi, Manish, Dang, Xuan-Hong, Li, Pengyuan, Wu, Kun-Lung, Zawad, Syed, Coleman, Andrew, White, Matthew, Lewis, Mark, Pavuluri, Raju, Koyfman, Yan, Lublinsky, Boris, de Bayser, Maximilien, Abdelaziz, Ibrahim, Basu, Kinjal, Agarwal, Mayank, Zhou, Yi, Johnson, Chris, Goyal, Aanchal, Patel, Hima, Shah, Yousaf, Zerfos, Petros, Ludwig, Heiko, Munawar, Asim, Crouse, Maxwell, Kapanipathi, Pavan, Salaria, Shweta, Calio, Bob, Wen, Sophia, Seelam, Seetharami, Belgodere, Brian, Fonseca, Carlos, Singhee, Amith, Desai, Nirmit, Cox, David D., Puri, Ruchir, Panda, Rameswar
Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents…
External link:
http://arxiv.org/abs/2405.04324
Author:
Pan, Bowen, Shen, Yikang, Liu, Haokun, Mishra, Mayank, Zhang, Gaoyuan, Oliva, Aude, Raffel, Colin, Panda, Rameswar
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4× compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally require 2-4×… (a toy routing sketch follows this entry's link).
External link:
http://arxiv.org/abs/2404.05567
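The compute/parameter trade-off the abstract describes comes from sparse routing: each token runs through only k of E experts, so FLOPs scale with k while the parameter count scales with E. Below is a toy top-k router in NumPy, purely illustrative and not the architecture studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, experts, router_w, k=2):
    """x: [d] token; experts: list of (W1, W2) MLPs; router_w: [d, E]."""
    logits = x @ router_w                       # router score per expert
    top = np.argsort(logits)[-k:]               # pick the k highest-scoring experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = np.zeros_like(x)
    for g, e in zip(gates, top):                # only k experts are evaluated
        w1, w2 = experts[e]
        out += g * (np.maximum(x @ w1, 0.0) @ w2)
    return out

d, hidden, E = 16, 64, 8
experts = [(rng.normal(size=(d, hidden)), rng.normal(size=(hidden, d))) for _ in range(E)]
router_w = rng.normal(size=(d, E))
y = moe_layer(rng.normal(size=d), experts, router_w, k=2)
print(y.shape, f"- ran 2 of {E} experts, but all {E} must stay in memory")
```

Only the selected experts contribute FLOPs per token, yet every expert's weights must be resident, which is the memory cost the abstract points to.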
Author:
Nrusimha, Aniruddha, Mishra, Mayank, Wang, Naigang, Alistarh, Dan, Panda, Rameswar, Kim, Yoon
We consider the problem of accurate quantization for language models, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bitwidth format natively supported by GPU hardware. In this context, the key challenge… (a minimal round-to-nearest sketch follows this entry's link).
External link:
http://arxiv.org/abs/2404.03605
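As a reference point for the 4-bit setting mentioned above, here is a minimal symmetric round-to-nearest quantizer. It is a generic baseline sketch under an assumed per-tensor scale, not the method proposed in the paper.

```python
import numpy as np

def quantize_int4(x):
    """Map floats to integers in [-8, 7] (signed 4-bit) with one per-tensor scale."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int4(w)
print("max abs rounding error:", float(np.abs(w - dequantize(q, scale)).max()))
```

A single large outlier inflates the scale and crushes the resolution left for the remaining values, which is why activation outliers are the central difficulty at this bitwidth.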
Author:
Vas, Joseph Vimal, Medwal, Rohit, Manna, Sourabh, Mishra, Mayank, Muller, Aaron, Mohan, John Rex, Fukuma, Yasuhiro, Duchamp, Martial, Rawat, Rajdeep Singh
Exploring new strategies for controlling the magnetic domain propagation is the key to realize ultrafast, high-density domain wall-based memory and logic devices for next generation computing. These strategies include strain modulation in multiferroi…
External link:
http://arxiv.org/abs/2404.03177