RB06. Methods for using AI to create a synthetic digital twin of the Estonian population

Primary focus area – F5: AI for e-governance
Secondary focus areas – F1: hybrid AI pipelines, F2: adaptation of foundation models, F6: AI for healthcare, F8: AI for cybersecurity

Abstract

This project develops AI-based methods for generating a realistic, privacy-preserving synthetic digital twin of the Estonian population. Initial efforts focus on synthesizing data from the population registry, education, healthcare, and tax systems. The output will enable safe testing, research, and development of public-sector digital services without using real personal data.

Research Gap

The GDPR restricts the use of real population data for development and testing. Current methods cannot generate coherent, multi-table synthetic datasets that reflect complex societal interactions over time. Previous research has focused on single-table or healthcare-specific data synthesis and offers no methods for creating interconnected datasets across domains. Furthermore, methods for evaluating the utility and privacy of such synthetic data are still underdeveloped.

Objectives

  1. Create a prototype framework for generating synthetic population data, reflecting individuals, organisations, and their interactions.
  2. Ensure compatibility with microsimulation models to enable policy testing.
  3. Build a utility and privacy assessment methodology to tune synthetic data generation and ensure GDPR compliance.

Approach

We will build a modular pipeline for synthetic data generation, deployable in public-sector institutions. The system will include:

  • Rule-based modules for structured identifiers (e.g. ID codes, bank accounts)
  • ML-based modules tailored for tabular, temporal, and image data
  • Hybrid synthesis combining rules and ML, based on real-life user pathways in e-government systems
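As an illustration of what a rule-based module for structured identifiers might look like, the following minimal sketch generates format-valid synthetic Estonian personal identification codes. It assumes the publicly described 11-digit layout (century and sex digit, birth date, three-digit serial, check digit) and the two-stage modulo-11 checksum. The function names and the uniformly random serial number are hypothetical simplifications and are not part of the project's design.

```python
import random
from datetime import date

# Weights for the two-stage mod-11 checksum (assumption: taken from the
# publicly described Estonian personal identification code format).
WEIGHTS_1 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 1)
WEIGHTS_2 = (3, 4, 5, 6, 7, 8, 9, 1, 2, 3)


def check_digit(first_ten: str) -> int:
    """Return the check digit for the first ten digits of an ID code."""
    digits = [int(ch) for ch in first_ten]
    remainder = sum(d * w for d, w in zip(digits, WEIGHTS_1)) % 11
    if remainder < 10:
        return remainder
    # Second stage is used only when the first stage yields 10.
    remainder = sum(d * w for d, w in zip(digits, WEIGHTS_2)) % 11
    return remainder if remainder < 10 else 0


def synthetic_id_code(rng: random.Random, birth: date, female: bool) -> str:
    """Build a format-valid synthetic ID code (hypothetical helper)."""
    # First digit encodes century and sex: 1/2 (1800s), 3/4 (1900s), 5/6 (2000s).
    century_base = {18: 1, 19: 3, 20: 5}[birth.year // 100]
    first = century_base + (1 if female else 0)
    # Simplification: the serial is drawn uniformly at random.
    serial = rng.randint(1, 999)
    first_ten = (
        f"{first}{birth.year % 100:02d}{birth.month:02d}{birth.day:02d}{serial:03d}"
    )
    return first_ten + str(check_digit(first_ten))


def is_valid(code: str) -> bool:
    """Check length, digits-only content, and the checksum."""
    return (
        len(code) == 11
        and code.isdigit()
        and int(code[10]) == check_digit(code[:10])
    )
```

A call such as `synthetic_id_code(random.Random(0), date(1985, 7, 14), female=True)` returns a code that passes `is_valid`. Because format-valid codes can coincide with codes issued to real people, a deployed module would presumably need an explicit policy for handling such collisions; that policy is outside the scope of this sketch.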

The pipeline will also include a privacy-utility assessment tool for evaluating and fine-tuning generated data for usability and compliance.
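To make the idea of a privacy-utility assessment concrete, the sketch below computes one simple utility proxy and one simple privacy proxy for categorical data. Both metrics are assumptions chosen for illustration, not the project's methodology: utility is approximated by the total variation distance between the category frequencies of a single column, and privacy risk by the share of synthetic rows that reproduce a real row exactly. The function names are hypothetical.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values) -> float:
    """Utility proxy: distance between category frequencies of one column.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in categories
    )


def exact_match_rate(real_rows, synthetic_rows) -> float:
    """Privacy proxy: share of synthetic rows that copy a real row exactly."""
    real_set = set(real_rows)
    return sum(row in real_set for row in synthetic_rows) / len(synthetic_rows)
```

Rows are expected as hashable tuples, for example `("Tallinn", "student")`. A low distance suggests that a column's marginal distribution is preserved, and a low match rate suggests that fewer real records are copied verbatim. Neither figure on its own demonstrates GDPR compliance, and a fuller assessment would likely also cover cross-table consistency and stronger attack models.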

Impact

Synthetic population data will enable safe, GDPR-compliant development and testing of digital services. It will support policy evaluation, R&D, and education by simulating realistic socio-economic dynamics. It will also allow researchers and policymakers to conduct microsimulations without accessing sensitive data, which supports transparency and innovation. The privacy-utility tool is intended to ensure that the synthetic data balances usefulness with legal safeguards.