Description: |
Two elements have been essential to AI's recent boom: (1) deep neural nets and the theory and practice behind them; and (2) cloud computing with its abundant labeled data and large computing resources. Abundant labeled data is available for key domains such as images, speech, natural language processing, and recommendation engines. However, there are many other domains where such data is not available, or where access to it is highly restricted for privacy reasons, as with health and financial data. Even when abundant data is available, it is often not labeled, and labeling it is labor-intensive and does not scale. As a result, to the best of our knowledge, key domains still lack labeled data or have at most toy data, or existing synthetic-data approaches require access to real data to mimic. This paper outlines work to generate realistic synthetic data for an important domain: credit card transactions. The challenges include the many patterns and correlations in real purchases, the millions of merchants and innumerable locations, and the wide variety of goods those merchants offer. Who shops where and when? How much do people pay? What is a realistic fraudulent transaction? We use a mixture of technical approaches and domain knowledge, including the mechanics of credit card processing and a broad set of consumer domains such as electronics, clothing, and hair styling. Connecting everything is a virtual world. This paper outlines some of our key techniques and provides evidence that the data generated is indeed realistic. Beyond the scope of this paper are (1) use of our data to develop and train models to predict fraud; and (2) coupling models and the synthetic dataset to assess performance in designing accelerators such as GPUs and TPUs. |