04
Merges from the Crypt - or, how I learned to stop worrying and love SQL
0 Comments | Tags: django databases postgresql
I was recently tasked with merging two databases. There was a decent amount of overlap between the two, as they both were backing Django sites and also shared some of the same installed apps, but there was also a lot that was unique, as the smaller of the two was running GeoDjango. I opted to use a combination of django-multidb and command line postgres to accomplish this.
The Plan
1. diff the two database schemas, finding tables that are:
- in both databases
- unique to the smaller database
2. create 2 SQL dumps:
- one for overlapping tables without CREATE statements
- one for unique tables with all the CREATEs
3. fix all PK & FKs
- iterate over the overlap dump file and compare rows against the big database:
- if the record is duplicate and has the same PK, just remove that line from the file
as any other rows referencing it will work automatically
- if the record is duplicate with a different PK, remove that line from the file,
get the PK, and update all FK references in the dump to use the new PK
- if the record is unique, offset the PK and update all FK references
- iterate over the two dump files
- update all FK references
4. update id_seqs, general clean-up
1. Diff Schemas
The idea behind this is that, in the end, I am going to be modifying dump files and then importing those dumps into the bigger database. Since the databases have so much shared data (django_content_types, auth_permissions, etc), it just seemed to make sense to divide the work into two processes: those tables that are unique and those that overlap. The basic process I followed was to psql database_name -c "\dt" > filename and then using a regex extract the table names and create diffs.
Regex:
match = re.search(r'\s*(public|private)\s*\|\s*([a-zA-Z0-9_]+)\s*\|\s*table\s*\|.*', line)
2. Create SQL dumps
Getting good data into the dumps means less processing after-the-fact. By storing the overlapping tables in a list, I generated some pg_dump commands:
overlap_tables_cmd = ' --table='.join(overlaps)
diff_tables_cmd = ' --exclude-table='.join(overlaps)
os.system('pg_dump %s --exclude-table=%s -O -x > %s_diff.sql' % (primary_db, diff_tables_cmd, primary_db))
os.system('pg_dump %s --data-only --table=%s > %s_overlap.sql' % (primary_db, overlap_tables_cmd, primary_db))
If that looks gross, its because it is, but in the end it saved me a lot of trouble. The only problem is that the "Diff" dump contains CREATE SEQUENCE and SELECT pg_catalog.setval() - there may be a switch to get rid of these, but since I needed to keep the CREATE & setval's for the new tables, I just wrote a function to remove the ones I didn't need.
3. Fix all PKs and FKs
This is the big one. I ended up writing some complicated stuff, as there were some interesting conflicts that arose. Primary keys will not need to be altered in the "diff" dump, as those tables are being freshly created and there won't be any conflicts. The biggest idea here is that, with the overlapping tables, a determination will need to be made on whether the row is unique or not, and if it IS unique, to add it but otherwise use the pre-existing row and update all foreign key references. I thought of rows in the overlap dump as having 3 states:
Duplicate row with same primary key: These are the easy to deal with - simply remove them from the dump. All rows that reference them will automatically work since the primary key will be the same.
Duplicate row with a different primary key: With these rows, I remove them from the dump but make a note to update all foreign key references to use the pre-existing row rather than the one being removed.
Unique row: I find the maximum primary key value for each table and simply offset each unique row's ID, so there are no PK collisions. These are the only rows that stay in the dump file. Like the rows that are duplicate but with different PKs, I make a note to update all FK references to use the newly-generated PK.
Updating foreign keys was not too difficult in this merge, as the number of circular references was minimal. The key here is what storage method is used for tracking which PKs need to be updated. I used dictionaries for their super-fast lookup time, storing changes in subdictionaries:
CHANGE_LIST = {
'user_id': {
'123': '234', # original is on left, new is on right
'124': '235'
},
'category_id': {
...
}
}
I use a single pass approach to perform the updates. I iterated over both dump files, and whenever I entered a table I used a regex to extract the column names, and if any of them matched one of the keys in CHANGE_LIST, I checked whether the values in the data matched up and if so swapped the old for the new. Here's the regex I used:
match_copy = re.compile(r'COPY (.*) \(([^\)]+)\).*;')
end_of_table = re.compile(r'\\\\.')
# then, something like this:
columns = match.group(2).split(', ')
for key in CHANGE_LIST.keys():
if key in columns:
keys_to_fix.append((key, columns.index(key)))
Note that if you've got something where a FK column has a non-standard name, or if there are two FKs to the same table, you can just copy the keys, as was the case for categories:
CHANGE_LIST['parent_id'] = CHANGE_LIST['category_id']
This is a glaring weakness in my merge process, and I was lucky enough to have good standardized FK column names and only one instance where I needed to do a hack like the above. If there were another table that used parent_id, I would've been in trouble.
I would like to mention another step I added here, which was to resolve conflicts. I was fairly strict in how I determined the 'uniqueness' of rows, but sometimes there would be overlap. Like for instance records in auth_user where the username was the same in both tables but the emails were different. I wrote a function that scans for conflicts by using a single lookup field, in this case 'username', then it passes the two rows to a special conflict handler function that determines which row to use and logs the conflict. Once again, I got lucky with the merge in that the number of conflicts was manageable.
4. Clean up
As I mentioned before, the diff dump contains CREATE SEQUENCE's for all the tables, which obviously I don't want. Also, since I've increased the PKs on many of the tables, I will need to update the SELECT pg_catalog.setval()'s to store the new maximum PK. I used the following two regexes to capture that table name and then compare it with tables that already existed in the database, and if a match was found, either remove the SEQUENCE or update the id_seq value:
match_setval = re.compile(r'SELECT pg_catalog\.setval\(\'([a-zA-Z0-9_]+)_id_seq\',\s*(\d+),\s*.*\);')
match_sequence = re.compile(r'CREATE SEQUENCE ([a-zA-Z0-9_]+)_id_seq')
5. Restore!
The last step is to combine the dumps, restore, and hope for the best!
Get Django Multidb
svn checkout http://code.djangoproject.com/svn/django/branches/soc2009/multidb/ django-multidb
Comments (0)
Commenting has been disabled for this entry
