Statistical Methods for Generating Synthetic Email Data Sets

Authors: Esteban Urdiales, Otis B. Jennings, Karolyn O. Babalola, James A. DeBardelaben
Publication year: 2018
Subject:
Source: IEEE BigData
DOI: 10.1109/bigdata.2018.8622601
Description: This document outlines and demonstrates an approach to generating a synthetic email dataset using the Enron email dataset as a reference and the Synthetic Transaction Data Generator (STDG) simulator application. With statistical measures extracted from the Enron dataset, we generate synthetic email threads within the confines of a fictitious corporate structure. Our approach extrapolates the network structure of who communicates with whom and the timing of these communications, but it ignores semantic content. To this end, we first harvest the statistical network and temporal features of the Enron reference dataset. With these features, we use the agent-based STDG application to stochastically generate corporations of arbitrary sizes and email transactions within the corporations over a specified time period. We evaluate our methodology by comparing features of the synthetically generated datasets with those of the reference dataset.
Database: OpenAIRE
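
Illustrative sketch: the harvest-then-generate workflow summarized in the description can be pictured with a toy example. This is not the STDG application and not the authors' method; it is a minimal sketch that assumes the reference data is a list of (timestamp, sender, recipient) tuples and that resampling empirical sender-recipient frequencies and inter-arrival gaps is an acceptable stand-in for the network and temporal features. The function names `harvest_features` and `generate_synthetic`, and the sample records, are hypothetical.

```python
import random
from collections import Counter


def harvest_features(reference):
    """Extract toy network and temporal features from a reference log.

    reference: list of (timestamp_seconds, sender, recipient) tuples.
    Returns sender-recipient pair counts and the gaps between
    consecutive messages (assumed stand-ins for the real features).
    """
    ordered = sorted(reference)
    pair_counts = Counter((s, r) for _, s, r in ordered)
    gaps = [b[0] - a[0] for a, b in zip(ordered, ordered[1:])]
    return pair_counts, gaps


def generate_synthetic(pair_counts, gaps, n_emails, start=0.0, seed=0):
    """Stochastically generate n_emails (timestamp, sender, recipient) records."""
    rng = random.Random(seed)
    pairs = list(pair_counts)
    weights = [pair_counts[p] for p in pairs]
    t = start
    emails = []
    for _ in range(n_emails):
        # Timing: resample an observed gap between messages.
        t += rng.choice(gaps)
        # Network: pick who emails whom in proportion to observed counts.
        sender, recipient = rng.choices(pairs, weights=weights, k=1)[0]
        emails.append((t, sender, recipient))
    return emails


# Hypothetical reference records; message content is ignored entirely.
reference = [
    (0.0, "alice", "bob"),
    (60.0, "bob", "alice"),
    (90.0, "alice", "carol"),
    (300.0, "alice", "bob"),
]
pair_counts, gaps = harvest_features(reference)
synthetic = generate_synthetic(pair_counts, gaps, n_emails=1000)
```

An evaluation in the spirit of the description would then compare features of `synthetic`, such as the share of messages per sender-recipient pair, with the same features of `reference`.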