Synthetic data is one of the most practical solutions to AI's data problem, and also one of the least discussed.
IBM's synthetic data explainer (https://www.ibm.com/think/topics/synthetic-data) covers a topic that gets far less attention than it deserves, given how AI products are actually being developed and deployed.
The data problem synthetic data is solving: real-world training data for AI systems is expensive to collect, often thin on the rare cases that matter most, carries genuine privacy risks, and in many domains is too sensitive to share across teams or organisations without significant legal and compliance overhead.
Synthetically generated data that mimics the statistical properties of real data, while containing no records of real individuals, addresses all four problems at once. The applications range from privacy-preserving model training, to generating examples of rare events that appear infrequently in real data, to creating test datasets for safety-critical systems where real data collection is infeasible.
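To make "mimics the statistical properties" concrete, here is a minimal sketch, not taken from the IBM article. It assumes the simplest possible generator: summarise one numeric column by its mean and standard deviation, then sample fresh values from a Gaussian with those parameters. The column name and values are invented for illustration, and real synthetic data tools model far richer structure (correlations between columns, categorical fields, privacy guarantees) that this sketch ignores.

```python
import random
import statistics


def fit_gaussian(values):
    """Summarise a numeric column by its mean and sample standard deviation."""
    return statistics.fmean(values), statistics.stdev(values)


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic values from the fitted Gaussian (seeded for repeatability)."""
    mean, std = params
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


# Hypothetical "real" column, invented for this example.
real_amounts = [12.5, 40.0, 33.2, 18.9, 27.4, 55.1, 21.0, 30.3]

params = fit_gaussian(real_amounts)
synthetic_amounts = sample_synthetic(params, n=5000)

# The synthetic column should have roughly the same mean and spread as the
# real one, without reusing any real record.
print("real mean/std:     ", params)
print("synthetic mean/std:", fit_gaussian(synthetic_amounts))
```

Note that the generator only knows what the real data told it: if the eight real values were skewed or unrepresentative, the 5000 synthetic ones would be too.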
The risks the article covers are the ones worth taking seriously. Models trained on synthetic data derived from biased real data inherit that bias. And model collapse, where AI models trained on AI-generated data progressively lose fidelity to the real world, is an active research concern rather than a purely theoretical one.
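The feedback loop behind model collapse can be illustrated with a toy simulation. This is a hedged sketch under strong assumptions, not the setup used in the research literature: the "model" is just a Gaussian, each generation is fitted only to a small sample drawn from the previous generation's model, and the function name and parameters are hypothetical.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=5, seed=0):
    """Refit a Gaussian, generation after generation, on its own samples."""
    rng = random.Random(seed)
    # Generation 0 stands in for the real-world distribution.
    mean, std = 0.0, 1.0
    history = [std]
    for _ in range(generations):
        # Each new "model" sees only data generated by the previous one.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean, std = statistics.fmean(samples), statistics.stdev(samples)
        history.append(std)
    return history


history = simulate_collapse()
print(f"initial spread: {history[0]:.3g}, final spread: {history[-1]:.3g}")
```

In this toy setting the estimated spread tends to shrink across generations, because sampling error compounds and nothing pulls the model back toward the original distribution. Mixing real data back in at each generation is the obvious thing to experiment with here.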
Does synthetic data improve AI fairness by giving teams control over dataset composition, or does it primarily create new risks by introducing an additional layer of potential bias and feedback loops?