Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms
Author: | Meyer, Jordan; Padgett, Nick; Miller, Cullen; Exline, Laura |
---|---|
Year of publication: | 2024 |
Subject: | |
Document type: | Working Paper |
Description: | We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time. Comment: Project Page: https://source.plus/pd12m |
Database: | arXiv |
External link: |