$FPDM$: Domain-Specific Fast Pre-training Technique using Document-Level Metadata
Author: | Nandy, Abhilash; Kapadnis, Manav Nitin; Patnaik, Sohan; Butala, Yash Parag; Goyal, Pawan; Ganguly, Niloy |
Publication Year: | 2023 |
Subject: | |
DOI: | 10.48550/arxiv.2306.06190 |
Description: | Pre-training Transformers has shown promising results on open-domain and domain-specific downstream tasks. However, state-of-the-art Transformers require an unreasonably large amount of pre-training data and compute. In this paper, we propose $FPDM$ (Fast Pre-training Technique using Document-Level Metadata), a novel, compute-efficient framework that uses document metadata and a domain-specific taxonomy as supervision signals to pre-train a transformer encoder on a domain-specific corpus. The main innovation is that during domain-specific pre-training, an open-domain encoder is continually pre-trained with sentence-level embeddings as inputs (to accommodate long documents), whereas fine-tuning feeds token-level embeddings to the same encoder (a minimal illustrative sketch of this two-granularity scheme appears after this record). We show that $FPDM$ outperforms several transformer-based baselines in terms of character-level F1 scores and other automated metrics in the Customer Support, Scientific, and Legal domains, with a negligible drop in performance on open-domain benchmarks. Importantly, the novel use of document-level supervision together with sentence-level embedding inputs for pre-training reduces pre-training compute by around $1,000$, $4,500$, and $500$ times compared to MLM and/or NSP in the Customer Support, Scientific, and Legal domains, respectively. Code and datasets are available at https://bit.ly/FPDMCode. Comment: 23 pages, 7 figures |
Database: | OpenAIRE |
External link: |
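
The description above centers on one mechanism: the same encoder consumes sentence-level embeddings (one vector per sentence) during continual pre-training under document-level supervision, and ordinary token-level embeddings during fine-tuning. Below is a minimal, hypothetical PyTorch sketch of that two-granularity idea only; it is not the authors' implementation, and the encoder size, the placeholder sentence embeddings, the assumed vocabulary size, and the cosine-embedding loss standing in for the metadata/taxonomy supervision are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the two input granularities described
# in the abstract: sentence-level embeddings during continual pre-training vs.
# token-level embeddings during fine-tuning. Sizes and losses are assumptions.
import torch
import torch.nn as nn

EMB_DIM = 768

class SharedEncoder(nn.Module):
    """Transformer encoder whose weights are shared between both phases."""
    def __init__(self, dim=EMB_DIM, layers=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):          # x: (batch, seq_len, dim)
        return self.encoder(x)

encoder = SharedEncoder()

# --- Continual pre-training: one embedding per *sentence* keeps a long document
# within the encoder's sequence budget. Random placeholders stand in for vectors
# that would come from an open-domain sentence encoder in practice.
doc_sentence_embs = torch.randn(2, 128, EMB_DIM)    # 2 docs, 128 sentences each
doc_reprs = encoder(doc_sentence_embs).mean(dim=1)  # one representation per document

# Document-level supervision sketch: pull documents that share a (hypothetical)
# taxonomy label together, a simple contrastive-style proxy for the metadata signal.
same_category = torch.tensor([1.0])                 # 1.0 = same label, -1.0 = different
pretrain_loss = nn.functional.cosine_embedding_loss(
    doc_reprs[0:1], doc_reprs[1:2], same_category)

# --- Fine-tuning: the same encoder now receives *token*-level embeddings
# produced by an ordinary word-piece embedding table.
token_embedding = nn.Embedding(30522, EMB_DIM)      # assumed vocabulary size
token_ids = torch.randint(0, 30522, (2, 64))        # 2 sequences, 64 tokens
token_states = encoder(token_embedding(token_ids))  # per-token contextual states

print(pretrain_loss.item(), doc_reprs.shape, token_states.shape)
```

The point of the sketch is only that the encoder's weights are reused across phases while the granularity of its inputs changes; the paper's actual supervision from document metadata and the domain-specific taxonomy is richer than the single pairwise loss shown here.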