Data Schema to Formalize Education Research & Development Using Natural Language Processing

Author:	Haizhu Hong, Margaret Williams, Hannah Frederick, Amanda West, Brian Wright
Year of publication:	2021
Subject:	Vocabulary; Bigram; Database schema; Data dictionary; Semantics; Latent Dirichlet allocation; Schema (psychology); Reading (process); Artificial intelligence; Natural language processing
Source:	2021 Systems and Information Engineering Design Symposium (SIEDS).
DOI:	10.1109/sieds52267.2021.9483781
Description:	Our work aims to aid the development of an open source data schema for educational interventions by applying natural language processing (NLP) techniques to publications within the What Works Clearinghouse (WWC) and the Education Resources Information Center (ERIC). A data schema describes the relationships between individual elements of interest (in this case, research in education) and documents those elements collectively in a data dictionary. To facilitate the creation of this educational data schema, we first fit a two-topic latent Dirichlet allocation (LDA) model to the titles and abstracts of papers that met WWC standards without reservation and of papers that did not, separated by math and reading subdomains. The distributions of allocation to these two topics suggest structural differences between WWC and non-WWC literature. We then apply Term Frequency-Inverse Document Frequency (TF-IDF) scoring to study the vocabulary of WWC titles and abstracts and determine the most relevant unigrams and bigrams currently present in WWC. Finally, we use an LDA model again to cluster WWC titles and abstracts into topics, that is, sets of words grouped by underlying semantic similarity. We find that 11 is the optimal number of subtopics in WWC: 39 of 50 models returned 11 as the optimal number of topics, with an average coherence score of 0.4096 among those 39 models. Based on the TF-IDF and LDA methods presented, we can begin to identify core themes of high-quality literature that will better inform the creation of a universal data schema for education research.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::c585acd5112f45877574215342e25447 https://doi.org/10.1109/sieds52267.2021.9483781
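
Illustrative note: the description above mentions TF-IDF scoring of unigrams and bigrams in titles and abstracts. The record does not say which TF-IDF variant or library the authors used, so the following is only a minimal plain-Python sketch under stated assumptions: term frequency is normalized by document length, inverse document frequency is the unsmoothed log(N / df), and the three toy "abstracts" are hypothetical, not WWC or ERIC data. The function names `ngrams`, `tfidf`, and `top_terms` are likewise invented for illustration.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase the text, keep alphabetic tokens, return unigrams plus n-grams up to n_max."""
    tokens = re.findall(r"[a-z]+", text.lower())
    grams = list(tokens)
    for n in range(2, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(docs, n_max=2):
    """Return one {term: score} dict per document.

    Assumed weighting (not taken from the paper):
      tf  = term count in document / number of terms in document
      idf = log(N / df), df = number of documents containing the term
    """
    tokenized = [ngrams(d, n_max) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for grams in tokenized:
        df.update(set(grams))

    scores = []
    for grams in tokenized:
        counts = Counter(grams)
        total = len(grams) or 1  # guard against empty documents
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def top_terms(doc_scores, k=3):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical toy abstracts, for illustration only.
docs = [
    "reading intervention improves reading fluency",
    "math intervention improves algebra scores",
    "reading tutoring study",
]
scores = tfidf(docs)
print(top_terms(scores[0]))
```

With this weighting, a term that occurs in every document scores zero, while terms and bigrams confined to one document rank highest; a production pipeline would more likely rely on an established library and add stop-word handling.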