Anonymizing data and preserving foreign key relationships

The simplest way to populate your database with test data.

Moderators: David Atkinson, Anu Deshpande, Lionel

Anonymizing data and preserving foreign key relationships

Postby BenDippenaar » Sat Oct 15, 2011 12:27 am

We are using data generator to anonymize some existing data in a database.

Right now, when i use the SQL foreign key data generator, it looks like the data that is generated has to be unique, or I have to specify the amount of repeats from within the interface.

It would be really nice if the data generator could generate data that matches the referential ‘footprint’ of the data it is generating from.

for example, it doesn't seem like i can preserve existing many-to-one relationships in the anonymized database.
BenDippenaar
 
Posts: 1
Joined: Sat Oct 15, 2011 12:24 am

Postby james.billings » Mon Oct 17, 2011 12:26 pm

Hi, and thanks for your post.

Usually for anonymizing data, you'd work on the basis of "copying" your current database and then generating random data for just the columns you want to remove sensitive information on. We have a document describing the process here

I'd suggest having a look through that and seeing if it helps, if not, please let us know!
james.billings
 
Posts: 1144
Joined: Wed Jun 16, 2010 11:10 am
Location: My desk.

Postby Scriber » Mon Oct 17, 2011 6:24 pm

Hey James,

What if your DB had 3 tables in it and in all three tables there was a column lets call it creditcardnumber the columns all need to be anonymized the same way so that the one to many or the one to one relationship can be maintained and you can query the new anonymizing data and get the same row count and reult set as the live data. Can the tool do that?

Thanks,
Scriber
Scriber
 
Posts: 2
Joined: Mon Oct 17, 2011 6:17 pm

Postby james.billings » Mon Oct 17, 2011 6:27 pm

For most of the random generators, you get the option of a "seed" value. AFAIK, the same seed should produce the same ordered data.

If the 3 tables are actually a master, and then there's a FK on the other two back to that, it should just bring the values in automatically using the FK generator.
james.billings
 
Posts: 1144
Joined: Wed Jun 16, 2010 11:10 am
Location: My desk.

Postby Scriber » Mon Oct 17, 2011 7:10 pm

Thanks James I'll give it a try, and it should work in a one to many relationship? Were table A has one entry for creditcardnumber 123 and table B has say ten entries for the same creditcardnumber 123 when it changes table A to say ABC it will chnage the other ten entries in Table B to ABC so when I query for creditcard number ABC I'll get the proper row count?
Scriber
 
Posts: 2
Joined: Mon Oct 17, 2011 6:17 pm

Postby james.billings » Mon Oct 17, 2011 7:13 pm

Hmm, I'm not sure if it's that clever when just generating random records. If your FK relationships are set up on that column then you have the choice in the FK generator for whether to generate all rows at random, make them unique, or to repeat a certain number of times, but I'm not sure if that's going to help.

SQL Data Generator doesn't have any kind of cross-column rules, so you are limited in that sense.

If yuo get stuck, then a simple repro may be useful... you can email it to support with F0053299 in the subjectline
james.billings
 
Posts: 1144
Joined: Wed Jun 16, 2010 11:10 am
Location: My desk.

Re: Anonymizing data and preserving foreign key relationships

Postby mark__a » Thu Nov 10, 2011 9:39 pm

BenDippenaar wrote:It would be really nice if the data generator could generate data that matches the referential ‘footprint’ of the data it is generating from.

for example, it doesn't seem like i can preserve existing many-to-one relationships in the anonymized database.


We are trying to do exactly the same thing. We have three tables:
Address, Customer, CustomerAddresses (contains only 2 FKs: AddressId and CustomerId). A Customer can have 0 to many Addresses.

The goal is to generate new address info and customer names but keep the same FKs in the CustomerAddresses table. I am using the "Use existing data source" option and am referencing a copy of the database to generate/import the data from Address and Customer. What I want to do is to import the CustomerAddresses table EXACTLY as it is. Same Id values in each row.

In the "real" database, the data looks like this:

Address/AddressId
1 (and then street1, city, state, zip, etc.)
2
3
...

Customer/CustomerId
1 (and then name, credit limit, etc.)
2
3

CustomerAddresses
CustomerId / AddressId
1 1
1 2
2 3
2 4
3 5
3 6

But when we generate the CustomerAddresses table with "Use existing data source" it forces us to use the Foreign Key Generator for these two columns. We can't change it. How can we just use the existing data and not generate anything?

We cannot de-select the CustomerAddresses table for generation because it chokes when trying to delete all the rows from Customer.
mark__a
 
Posts: 2
Joined: Thu Nov 10, 2011 9:25 pm

Postby james.billings » Thu Nov 10, 2011 9:48 pm

Unfortunately if the foreign keys are there, then it'll always use the FK generator. If you're copying records across for the other two tables does this matter though? Presumably, the primaries in the other two tables are what you want anyway?

The other option is to drop the FK's before starting up Data Generator; then you can use whatever generation option you like on those columns. You can recreate them afterwards then. This is a little more inconvenient of course, but the only way around it I can think of right now.
james.billings
 
Posts: 1144
Joined: Wed Jun 16, 2010 11:10 am
Location: My desk.

Re:

Postby mark__a » Thu Nov 10, 2011 11:47 pm

james.billings wrote:Unfortunately if the foreign keys are there, then it'll always use the FK generator. If you're copying records across for the other two tables does this matter though? Presumably, the primaries in the other two tables are what you want anyway?


This is a serious limitation in your product. All I really want to do is to "UPDATE" a few columns to wipe out some of the production data columns so that we can have a realistic test database.

As a concrete example for why this is a serious limitation, I put forth the following:

Customer can have 0 or more related Address records (via CustomerAddresses). If the CustomerAddresses CustomerId and/or AddressId FKs are randomly assigned, then the Addresses won't make much sense at all. You'll have Address records from different states or countries. Which is fine if you're trying to test for all available inputs but terrible if your goal is a realistic-yet-cleansed set of data.

There are a number of other examples in our database where many-to-many relationships only make sense if you keep the same keys. As far as I can tell, the only way to do this is to run a post-SQL Generator script that deletes all the FKs in these m2n tables, sets IDENTITY INSERT and then re-imports the existing data.

Why can't we just import all of the CustomerAddresses FKs as-is? Can't we just set the generation order so that Customer and Address are generated before CustomerAddresses?
mark__a
 
Posts: 2
Joined: Thu Nov 10, 2011 9:25 pm

Postby james.billings » Fri Nov 11, 2011 3:57 pm

We do have a feature request to allow simple updates which would obviously help in your situation, but as yet it's not been scheduled for inclusion. The ref. is SDG-850 and I've added a link back to this thread on the job for when it's next evaluated.
james.billings
 
Posts: 1144
Joined: Wed Jun 16, 2010 11:10 am
Location: My desk.

I second this!!

Postby anna.p » Wed Oct 17, 2012 12:21 am

A feature to allow simple updates would be fantastic.

I'm getting around this limitation by creating views against the source tables that select only the columns I don't want to "update". I use the Table/View generator to populate those columns in the target and randomizing generators for the columns I want to "update". The only columns for which this doesn't work are those with a Foreign Key constraint and IDENTITY columns.

2 possible enhancements to address this, which would help my situation but possibly not the other posters', are:
    1. For columns with FK constraints, default to the "Foreign Key" generator, but allow us to change it to a different one.

    2. For IDENTITY columns, default to the "Server Assigned" generator, but allow us to change it to another integer-type generator with an option to SET IDENTITY_INSERT ON.
anna.p
 
Posts: 23
Joined: Tue Jul 03, 2012 10:59 pm


Return to SQL Data Generator 1

Who is online

Users browsing this forum: No registered users and 1 guest