Can Machines Tell Stories? A Comparative Study of Deep Neural Language Models and Metrics

Authors: Avisha Das, Rakesh M. Verma
Language: English
Year of publication: 2020
Source: IEEE Access, Vol. 8, pp. 181258-181292 (2020)
Document type: article
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2020.3023421
Description: Massive textual content has enabled rapid advances in natural language modeling. The use of pre-trained deep neural language models has significantly improved natural language understanding tasks. However, the extent to which these systems can be applied to content generation is unclear. While a few informal studies have claimed that these models can generate 'high quality' readable content, no prior study has analyzed the content generated by these models as a function of sampling and fine-tuning hyperparameters. We conduct an in-depth comparison of several language models for open-ended story generation from given prompts. Using a diverse set of automated metrics, we compare transformer-based generative models, OpenAI's GPT-2 (pre-trained and fine-tuned) and Google's pre-trained Transformer-XL and XLNet, against human-written textual references. Studying inter-metric correlation along with metric ranking reveals interesting insights, notably a high correlation between readability scores and word usage in the text. A study of statistical significance and empirical evaluations between human and machine-generated scores at higher sampling hyperparameter combinations (temperature t = {0.75, 1.0}, top-k k = {100, 150, 250}) reveals that the top pre-trained and fine-tuned models generate samples that condition well on the prompt, with an increased occurrence of unique and difficult words. The GPT-2 medium model fine-tuned on the 1024-token Byte-Pair Encoding (BPE) tokenized version of the dataset, along with the pre-trained Transformer-XL models, generated samples close to human-written content on three metrics: prompt-based overlap, coherence, and variation in sentence length. A study of overall model stability and performance shows that fine-tuned GPT-2 language models deviate least from human performance in metric scores.
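
To make the sampling setup concrete, below is a minimal sketch of prompt-conditioned generation with a pre-trained GPT-2 medium model at one of the hyperparameter combinations studied (t = 0.75, k = 100). The paper does not name an implementation, so the Hugging Face transformers library is an assumption here; the prompt and generation length are purely illustrative.

    # Prompt-conditioned sampling with temperature t and top-k k.
    # Assumes the Hugging Face transformers library (not specified by the paper).
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
    model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
    model.eval()

    prompt = "The old lighthouse keeper heard a knock at midnight."  # illustrative prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    with torch.no_grad():
        output = model.generate(
            input_ids,
            do_sample=True,                        # sample rather than decode greedily
            temperature=0.75,                      # t, from the study's {0.75, 1.0}
            top_k=100,                             # k, from the study's {100, 150, 250}
            max_length=200,                        # illustrative continuation length
            pad_token_id=tokenizer.eos_token_id,
        )

    print(tokenizer.decode(output[0], skip_special_tokens=True))

Raising the temperature flattens the next-token distribution and a larger top-k widens the candidate pool, which is consistent with the abstract's observation that higher sampling combinations increase the occurrence of unique and difficult words.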
Database: Directory of Open Access Journals