Hacking a huge Drupal database import
This article describes how to use regexes, some bash tricks, and bad voodoo to split a SQL file into pieces so it can be imported into MySQL in a reasonable amount of time. Parsing SQL is notoriously painful and error-prone, so if you see mistakes here (I'm sure there are some), please comment!
I recently got called in to work on a project for a customer I had never worked with before. It's a very well-known media/retail company, and the job was to build a new interface for a media player / video library they required. Pretty standard stuff: some views, a couple of quicktabs, a bit of ajax coding, and a lot of pixel-f**king. Anyway, near the end of the week-long project, when it came time to stage it, I realized I had a pretty big problem. This site had 2 million users and several tables of profile data and other information. My SQL file was over 3GB and took 6-8 hours to import. Eek!
So, knowing that I wasn't going to write my update functions 100% perfectly the first time, and that I'd likely have to import several times until I was confident things were working okay, I needed to find a way to get the DB to import in a more reasonable amount of time. All I had at this point was an SQL file of the production DB and six hours to get the code ready.
Here's what I did:
(Keep in mind this is on OS X; the GNU tools may be slightly different.)
split -a 5 -p '^DROP TABLE IF EXISTS [^;]+;$' db_backup.sql _db_split
What this does is look for lines matching DROP TABLE IF EXISTS, split the dump at each one (that's the -p pattern option of BSD split), and create files called _db_splitXXXXX, where XXXXX is a unique five-letter suffix (that's what -a 5 gives you).
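If you're on Linux, GNU split has no -p option; a roughly equivalent (untested) csplit invocation would be something like this, though it numbers the pieces instead of lettering them:

csplit -z -n 5 -f _db_split db_backup.sql '/^DROP TABLE IF EXISTS/' '{*}'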
Okay, now I have a bunch of files that look like this:
ls -l _db_split*
-rw-r--r-- 1 jacob staff 837 Jul 16 23:24 _db_splitaaaaa
-rw-r--r-- 1 jacob staff 679 Jul 16 23:24 _db_splitaaaab
-rw-r--r-- 1 jacob staff 937 Jul 16 23:24 _db_splitaaaac
-rw-r--r-- 1 jacob staff 1666 Jul 16 23:24 _db_splitaaaad
-rw-r--r-- 1 jacob staff 601 Jul 16 23:24 _db_splitaaaae
-rw-r--r-- 1 jacob staff 99276 Jul 16 23:24 _db_splitaaaaf
It goes on and on.
Let's make those a little easier to read:
for f in _db_split*; do new_f=$(egrep -o 'CREATE TABLE `([a-z0-9_]+)' "$f" | cut -d' ' -f 3 | cut -c2-100); mv "$f" "dump_$new_f.sql"; done
This one finds the table name from inside each file (I'm sure there is a nicer way to do this, probably should just use ruby, but this works; a sed-based variant is sketched after the listing below). Then it renames the file from _db_splitXXXXX to dump_tablename.sql. Now I have something like this:
ls -l dump*
-rw-r--r-- 1 jacob staff 679 Jul 16 23:24 dump_access.sql
-rw-r--r-- 1 jacob staff 937 Jul 16 23:24 dump_accesslog.sql
-rw-r--r-- 1 jacob staff 1666 Jul 16 23:24 dump_actions.sql
-rw-r--r-- 1 jacob staff 601 Jul 16 23:24 dump_actions_aid.sql
-rw-r--r-- 1 jacob staff 99276 Jul 16 23:24 dump_apachesolr_search_node.sql
-rw-r--r-- 1 jacob staff 728 Jul 16 23:24 dump_authmap.sql
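As an aside, that rename could probably be done a little more robustly with sed alone. This is an untested sketch; it also skips any piece (like the dump header) that has no CREATE TABLE line:

for f in _db_split*; do
  # pull the table name out of the CREATE TABLE line
  t=$(sed -n 's/^CREATE TABLE `\([a-z0-9_]*\)`.*/\1/p' "$f" | head -n 1)
  # only rename pieces that actually contain a table
  [ -n "$t" ] && mv "$f" "dump_$t.sql"
done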
Okay, we don't want to import the cache or watchdog data, but we do want the table definitions.
So I defined a function to do this:
function sql_remove_insert() { out=$(cat "$1" | ruby -e 'puts STDIN.read.split("Dumping data for table")[0]'); echo "$out" > "$1"; }
And I run it against the tables I don't need to be importing:
for t in dump_cache* dump_watchdog* dump_settings_audit_log_*; do sql_remove_insert "$t"; done
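If ruby isn't on the box, a sed one-liner that deletes everything from the "Dumping data for table" comment to the end of each file should amount to the same thing. Untested sketch, BSD sed syntax (GNU sed wants -i without the empty string):

sed -i '' '/Dumping data for table/,$d' dump_cache*.sql dump_watchdog*.sql dump_settings_audit_log_*.sql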
That makes a difference, but we're still at > 2GB. The real issue is the 2 million users. So, assuming that the important users who create the content (i.e. administrators) are all within the first 10,000 or so accounts, I wrote a hacky (and probably fragile) regular expression to remove any insert record whose uid has five or more digits, i.e. uid >= 10,000. Then I ran it on every table that has uid as its first column:
for t in {dump_users,dump_users_roles,dump_users_uuid,dump_password_reset_users}; do perl -pi -e 's/,\([0-9]{5,10},(?!\),).*\),/,/g' $t.sql; done;
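Worth noting: if you can re-dump from production rather than carving up an existing file, mysqldump's --where option will do this filtering for you. Something like the following (the database name and exact table list here are placeholders; the WHERE clause is applied to every table you name):

mysqldump --where='uid < 10000' production_db users users_roles > dump_users_small.sql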
That last regex pass brought the DB down to only 300MB or so, and it imported in a couple of hours, allowing me to stage my changes, mess them up, restore, and get it working in the end.
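For completeness, re-importing the pieces is just a matter of feeding them back to mysql; something like this, where mydb and the credentials are placeholders:

cat dump_*.sql | mysql -u root -p mydb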
Bash-FU! Hope this helps someone trying to import a huge SQL file into MySQL.