Using Git for efficient field storage in Drupal
Versionable content in Drupal has always suffered from inefficient storage. Each new revision of a content item (historically a "node", but now an "entity", which is anything fieldable) is stored in its entirety in the database. For sites with a lot of content and continuous updates (new revisions) to that content, the database can grow extremely large over time. This becomes a serious problem for maintenance and disk usage.
Ideally, the database would store only the differences between revisions, saving disk space, but at the time of this writing, the architecture does not support this. The idea has been discussed previously in "Managing node revisions with a backend like subversion?", but it was never implemented.
As of Drupal 7, we now have a means by which to solve this problem. The Field Storage API can be used to provide alternative storage backends. So all we need to do is:
- Find an efficient storage backend.
- Plug it in.
The good news is that we already have a solution for the first item, and we're already using it. Git, the distributed version control system, is being used by drupal.org and many other Drupal software projects. We simply need to re-purpose it. If Git can be used for efficient storage of code revisions, why not use it for efficient storage of content revisions?
Plugging it in would involve writing a module to implement the Field Storage API hooks for Git use. After thinking about this for a while, I don't believe this would be too difficult. There are some items that would require careful consideration, however.
Considerations
File structure and directory hierarchy
As Git is essentially a file system, it's necessary to develop a standard for how the field data is stored within it. A tree structure like the following comes to mind:
- field machine name
  - entity type
    - bundle
      - entity ID
        - language
          - delta (the sequence number for this data item, used for multi-value fields)
            - file representing field contents for value 1
            - file representing field contents for value 2
            - file representing field contents for value 3
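To make the hierarchy above concrete, here is a minimal sketch in Python (chosen for brevity; an actual Drupal module would be written in PHP) of how the directory for one field value might be derived. The function name `field_value_path` and the validation rule are assumptions made for illustration, not part of any existing module.

```python
from pathlib import PurePosixPath


def field_value_path(field_name, entity_type, bundle, entity_id, language, delta):
    """Return the repository-relative directory holding one field value.

    The ordering mirrors the proposed tree: field machine name, entity type,
    bundle, entity ID, language, delta. (Hypothetical helper.)
    """
    parts = [field_name, entity_type, bundle, str(entity_id), language, str(delta)]
    for part in parts:
        # Reject anything that could escape the intended directory level.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    return PurePosixPath(*parts)


print(field_value_path("body", "node", "article", 42, "en", 0))
# body/node/article/42/en/0
```

The files holding the field's contents would then live inside that directory. Validating each component matters because machine names and language codes end up as directory names in the repository.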
Mapping revision IDs to commit IDs
Drupal identifies revisions by revision ID, while Git identifies them by commit ID, so something has to translate between the two. For now, we could add a database table that associates each revision with a commit ID. When a particular revision of a field is requested for display, the commit ID could be looked up, and we could then pull the corresponding file out of Git as it existed at that commit in the repository history.
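The lookup described above could be sketched roughly as follows, again in Python with an in-memory SQLite database standing in for Drupal's database layer. The table name `field_revision_commit`, its columns, the function names, the file name `value`, and the commit ID are all made up for illustration. The sketch only builds the `git show <commit>:<path>` command rather than running it, so it does not need a repository.

```python
import sqlite3


def create_mapping_table(conn):
    # Hypothetical schema: one row per (entity type, revision ID) pair.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS field_revision_commit (
            entity_type TEXT NOT NULL,
            revision_id INTEGER NOT NULL,
            commit_id   TEXT NOT NULL,
            PRIMARY KEY (entity_type, revision_id)
        )
        """
    )


def record_commit(conn, entity_type, revision_id, commit_id):
    # Called after a revision's field data has been committed to Git.
    conn.execute(
        "INSERT OR REPLACE INTO field_revision_commit VALUES (?, ?, ?)",
        (entity_type, revision_id, commit_id),
    )


def git_show_command(conn, entity_type, revision_id, path):
    """Build the Git command that prints `path` as of the mapped commit."""
    row = conn.execute(
        "SELECT commit_id FROM field_revision_commit "
        "WHERE entity_type = ? AND revision_id = ?",
        (entity_type, revision_id),
    ).fetchone()
    if row is None:
        raise KeyError(f"no commit recorded for {entity_type} revision {revision_id}")
    # `git show <commit>:<path>` outputs the file as it existed at that commit.
    return ["git", "show", f"{row[0]}:{path}"]


conn = sqlite3.connect(":memory:")
create_mapping_table(conn)
record_commit(conn, "node", 17, "3f2a9c1")  # placeholder commit ID
print(git_show_command(conn, "node", 17, "body/node/article/42/en/0/value"))
```

Keying the table on entity type plus revision ID is an assumption here; it reflects the fact that revision IDs are only guaranteed to be unique within an entity type.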
There are probably other issues with this approach, but the two considerations above were the first to come to mind.
Thoughts?