Data Privacy

Private data are collected for a specific purpose (e.g. to send a product to a customer), and using them for any other purpose is forbidden. As a side effect, you are not allowed to use personal data stored in your databases for activities like testing, training or commercial demonstrations. Besides, the same issue arises when companies want to keep strategic data confidential.

Masking-based anonymization

To address this problem, several anonymization techniques exist. In particular, data substitution masks private data with fake data to produce anonymized databases that look like the original ones. That sounds simple, but it raises many problems.

Data relevance. The fake data must be relevant: if your application handles first names, you do not want to mask these names with numbers, because the application screens showing these data would become quite misleading.

Data correlations. Producing substitution data gets much harder when your data are strongly correlated. For example, a person's first name is correlated with their salutation: if you generate a "Mrs", you have to give her a female name.
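
As a minimal sketch of correlated generation, the Python snippet below draws the salutation first and then picks a first name from a list compatible with it. The lookup table and the fake_person helper are hypothetical names chosen for illustration; they do not describe how any particular tool works.

```python
import random

# Hypothetical lookup table: each salutation maps to compatible first names.
NAMES_BY_SALUTATION = {
    "Mr": ["James", "Paul", "David"],
    "Mrs": ["Mary", "Anna", "Laura"],
}

def fake_person(rng):
    """Draw the salutation first, then a first name conditioned on it."""
    salutation = rng.choice(sorted(NAMES_BY_SALUTATION))
    first_name = rng.choice(NAMES_BY_SALUTATION[salutation])
    return salutation, first_name

rng = random.Random(42)  # seeded so that runs are repeatable
people = [fake_person(rng) for _ in range(5)]
```

Because the name is always drawn from the list attached to the chosen salutation, a "Mrs" can never receive a name from the "Mr" list.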

Intelligent keys. Some data include validity fields that make them tricky to generate. For example, if you have to mask credit card numbers, you do not want to produce random 16-digit numbers: almost all of them would be invalid cards.
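
To illustrate the point, payment card numbers conventionally end with a check digit computed with the Luhn algorithm. The Python sketch below generates 15 random digits and appends the matching check digit, so the result passes a Luhn check. It deliberately ignores issuer prefixes and any other validity rule, and the function names are illustrative assumptions.

```python
import random

def luhn_check_digit(payload):
    """Return the Luhn check digit for a digit string (check digit excluded)."""
    total = 0
    # Walk the payload from right to left and double every second digit,
    # starting with the rightmost one.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def is_luhn_valid(number):
    """Check that the last digit matches the Luhn digit of the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]

def fake_card_number(rng, length=16):
    """Build a random digit string of the given length that passes Luhn."""
    payload = "".join(str(rng.randrange(10)) for _ in range(length - 1))
    return payload + luhn_check_digit(payload)

rng = random.Random(7)  # seeded so that runs are repeatable
cards = [fake_card_number(rng) for _ in range(3)]
```

A purely random 16-digit number passes this check only about one time in ten, which is why the check digit has to be computed rather than drawn.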

Distribution preservation. Data distributions are almost never uniform. For example, a dataset describing persons might contain 70% women and only 30% men. When producing fake data, you will often need to preserve the original distribution, so simple uniform random generation is not enough.
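
One simple way to approximate this, sketched below in Python, is to count how often each value occurs in the original column and use those counts as sampling weights. The sample_like helper and the 70/30 toy column are assumptions made for the example.

```python
import random
from collections import Counter

def fit_distribution(values):
    """Return the distinct categories and their observed counts."""
    counts = Counter(values)
    categories = sorted(counts)
    weights = [counts[c] for c in categories]
    return categories, weights

def sample_like(values, n, rng):
    """Draw n fake values following the empirical distribution of values."""
    categories, weights = fit_distribution(values)
    return rng.choices(categories, weights=weights, k=n)

# Toy original column: 70% "F" and 30% "M".
original = ["F"] * 70 + ["M"] * 30
rng = random.Random(0)  # seeded so that runs are repeatable
fake = sample_like(original, 10_000, rng)
share_f = fake.count("F") / len(fake)  # close to 0.70, not exactly equal
```

The generated shares only approach the original ones as the sample grows; if exact proportions matter, a different strategy (such as allocating fixed quotas per category) would be needed.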

Isolated case phenomena. In a large dataset, there are always isolated cases that stand out from the mass. For example, an IT company might have thousands of developers and testers but only one CEO, say Mr John. In this case, masking the name of the CEO with "James" is pointless: you will immediately guess that the information concerning Mr James actually refers to Mr John. For that reason, you sometimes want to bias distributions (e.g. generating no CEO or many CEOs) to guard against the isolated case phenomenon.
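
A first step is to detect such isolated cases. The Python sketch below flags every value that occurs fewer than k times in a column and then applies the "no CEO" option by dropping those values. The threshold k=5 and the helper names are arbitrary assumptions for illustration.

```python
from collections import Counter

def isolated_values(values, k=5):
    """Return the values that occur fewer than k times, with their counts."""
    counts = Counter(values)
    return {v: c for v, c in counts.items() if c < k}

def suppress_isolated(values, k=5):
    """Drop the isolated values so that they cannot single anyone out."""
    rare = isolated_values(values, k)
    return [v for v in values if v not in rare]

# Toy column of job titles: one CEO among many developers and testers.
roles = ["developer"] * 1000 + ["tester"] * 400 + ["CEO"]
rare = isolated_values(roles, k=5)   # the single CEO is flagged
safe_roles = suppress_isolated(roles, k=5)
```

The opposite bias, generating many CEOs, would instead raise the count of the rare value until it reaches the threshold.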

Generating high-quality anonymized data

With GEDIS Studio, you can produce the exact dataset that you need. You choose the data domain values, express the correlations you need and enjoy full control over demographic distributions.

The data produced with GEDIS Studio are anonymized but still incredibly realistic. Last but not least, GEDIS Studio can also generate data from scratch: design your own generators, produce non-sensitive data and save the costs of the tedious data extraction process.

Follow the link to see how GEDIS Studio helps you produce high-quality data that will keep your production databases safe: data anonymization with GEDIS Studio