Fields in Core Code Sprint Final Report

Executive Summary

In mid-December, eight Drupal developers held a code sprint at the Acquia offices in Andover, MA with the goal of “implementing Fields in Drupal core.” The attendees were Dries Buytaert, Yves Chedemois, Florian Loretan, Barry Jaspan, Károly Négyesi, David Rothstein, Karen Stevenson, David Strauss, and Moshe Weitzman.

During the sprint, we hashed out some major architectural breakthroughs, wrote lots of code (including unit tests to make sure the code keeps working!), and organized ourselves to help push through the remaining items which were not able to be completed during the sprint. We achieved our major goal: implement a "Field API" in core that cleanly allows attaching fields to both nodes and users and, by extension, any other entity type that wants them. The rest of the post contains more details about what we accomplished.

There is still plenty to do and we need your help! To jump in, visit our Drupal core improvements: Field API page.

We would like to give a huge thanks to the Drupal community and all of our contributors that made this sprint possible. This experience has proven how important sprints are and how effective they can be, and we hope that each of you will consider donating to see further sprints happen, such as the forthcoming Drupal.org Redesign sprints.

Background

The Fields in Core sprint was only possible because of a lot of thought, discussion, and design work completed previously.

The Drupal community has been discussing “comments as nodes,” “users as nodes,” and “CCK (content construction kit) in core” for years. Eventually we agreed that we needed some kind of unified “Data API” to organize operations on these and other types of entities. A single “Data API” has not yet been created, but particularly starting with Drupal 6, many different aspects of Drupal have moved towards a more unified and regular approach. For example, Drupal 6’s Schema API provides a consistent way to manage database tables; Drupal 6’s new menu (“callback router”) system provides a unified way to load objects of any type based on URL components and provide the objects to callbacks; Drupal 7 is moving towards a more organized method for rendering content from multiple object types into multiple output formats (such as XHTML, XML, JSON, …).

In February 2008, the Data Architecture Design Sprint (DADS) took place in Chicago. Starting (as always) with an unrealistically large to-do list, DADS clarified the idea that Drupal’s strategic advantage in the marketplace is an architecture that allows adding value to content, concluded that supporting Fields (a la CCK) on local and remote data directly in Drupal core would enhance that advantage, and specified how CCK could be re-factored to achieve Fields in core with minimum effort.

Throughout 2008, Barry began implementing the Field API for core and presented progress at Drupalcon Boston and Drupalcon Szeged. At the same time, Karen and Yves were working feverishly on CCK 2.0 for Drupal 6, leaving them no time specifically for Fields in Core (though much of the CCK 2.0 work they did prove extremely useful for the Field API later). The end result is that the Field API and CCK 2.0 diverged, leading to all the expected problems with forked and fragmented code.

Finally, everyone agreed that getting Fields in Core done absolutely required getting the Field API and CCK development to be synchronized, and that the best way to accomplish that was to get at least Karen, Yves, and Barry in the same room to bang it out. Thus the Fields in Core code sprint was declared.

The Field API follows the designs decided on at DADS remarkably closely, and the DADS Final Report is therefore a good introduction and explanation to the concepts involved. The Field API does not directly implemented either Content Model 1 nor 2, though it is probably closer to 2; that part of the DADS results were the least well understood and are probably worth ignoring.

Goals of the Sprint

As always, we started with an unrealistically large list of desired goals. We quickly reduced it to a single hard requirement: by the end of week, implement a "Field API" in core that cleanly allows attaching fields to both nodes and users and, by extension, any other entity type that wants them. After a seemingly slow start, we achieved this goal and, it turns out, several others.

Define Field API Architecture

The Fields in Core sprint was mostly about implementing the Field API. Documentation for the API is under development; see Current status and how to help. This section provides a brief introduction to the Field API so the rest of the document makes sense.

The Field API has three components and therefore three separate focuses of functionality (note that the names for these API sub-groups are still under discussion):

Field CRUD (Create/Read/Update/Delete). Functions used by modules to create content types and/or fields for themselves or other modules to use. For example, the Image module might define a node type called Image that contains a single image field. Generally, this occurs inside module install hooks. Also, the CCK UI uses the Field API CRUD functions for its major functionality. This part of the API first appeared in CCK D6, but gets massively redesigned for 'Fields in D7'.
Field types. Hooks that let modules define field types (text, image, nodereference...) and their 'plugins' (form widgets, display formatters). The Field module controls the database schema, storing and loading data, generating Form API data structure, etc. based on the behavior of these hooks. This is currently implemented in the field modules via hook_field and friends. This part of the API has been "the" CCK API from the early D4.7 releases up to D6, and sees a few adjustments and Developer Experience enhancements but no fundamental change.
Field attachment. Functions used by modules that implement object types that want to support having fields attached, like Node and User. For example, node_load() passes the node object under construction to the Field API Attachment functions to retrieve all fields for that node (based on its content type, node id (nid), and node version id (vid)). user_load() can do the same thing. This part of the API is brand new for 'Fields in D7'.

Developer experience design goals

The Field API is intended to meet a number of developer experience (DX) goals, hopefully as part of a trend to meet these goals throughout all of Drupal. Any failure to meet these goals represents a bug to be fixed in the Field API:

Well-defined data structures with a unique public representation.
Eliminate all $op hook parameters, putting them into the hook function name instead.
No Form API assumptions; drupal_execute() is never required.
Maintain intra-process API integrity; no static caches that require manual clearing.
Use defined constants instead of literal constants for arguments; e.g. FIELD_FOO_ENABLED or FIELD_FOO_DISABLED instead of TRUE or FALSE.
Full support for command-line or daemon processes. Do not assume the Field API is being used from within a page request.
Use exceptions instead of return values for error conditions. Drupal is one big failure to check return values.

Accomplishments

During the sprint, we accomplished goals in three categories: design discussions and decisions, code written, and developer/community engagement.

Design Decisions

Everyone came to the sprint wanting to get straight to writing code but we found ourselves having a lot of “before we can get started…” discussions. Some of these came from not everyone being familiar with all the background, some were re-hashing of previous discussions, and others were genuinely new topics. We ended up spending about two days in total in non-coding design discussions. Topics included:

Split between core Field and contrib CCK.
Terminology: Field vs Instance; Field CRUD/Definition/Attachment API; “Content type” versus “Bundle”
What properties are per Field vs. per Instance
Required vs. optional attributes for creating fields and instances
Field CRUD API creation paradigm
Field object representation: Array vs class stdClass vs class Field
Field Attach is pull (e.g. called from node_load) vs. hooked
Field display, core vs. contrib. (will need a separate core patch to cleanup the notion of 'context / build mode')
All per-field storage tables (see below for more details)
Integer ID keys vs. varchar for entity types in field data tables
Field data tables keyed by entity type and ID (e.g. node/vid or user/uid). Bundle names are also stored for delete as a special case.
Revisions in a separate table
Batch deletes and the "deleted" column
Materialized views
Test methodology

The sprint team thought it was crazy to have spent two of our five days talking instead of coding until we made a list of everything we covered and realized how much ground it covered. We intend to document each of the discussions and decisions in a set of handbook pages on drupal.org for future reference.

Field storage

The most contentious and long-standing design debate surrounding Field API is the way field data is stored in the local SQL database.

CCK uses a hybrid database schema. All fields that are single-valued and not shared between multiple content types are stored in a single wide "per-content-type" table for the one content type they are assigned to. All fields that are multiple-valued and/or shared between multiple content types are stored in separate "per-field" tables. For example, suppose you have two text fields, A and B. If they are both single-valued and assigned only to the Story content type, we get one database table:

Table: content_type_story
nid
vid
A_value
B_value
12
12
Hello
world

On the other hand, if fields A and B are both multiple-valued, we get two database tables:

Table: content_field_A
nid
vid
delta
A_value
12
12
0
Hello
Table: content_field_B
nid
vid
0
B_value
12
12
0
world

The advantage of per-content-type tables is that many fields can be loaded and saved with a single query, whereas with per-field tables every field requires a separate load or save query. The advantage of the CCK's hybrid storage model is that automatically it uses this more efficient form of storage when possible and falls back to per-field storage when necessary.

Some of the disadvantages of the hybrid storage are:

It requires complex, fragile, and difficult to maintain code.
Changing a field's cardinality or sharedness, a fairly frequent operation on live sites, requires changing the database schema. Particularly for large tables, it is difficult or impossible to make these changes safely without risking race conditions, lost data, PHP page request timeouts, and other problems. The CCK maintainers (Yves and Karen) who have the most experience working with this code say that it is a nightmare.

The advantage of per-field storage is simplicitly. All the tables have the same basic layout, so the code to generate and use them is simple. Changing a field's cardinality or sharedness is trivial. Fundamentally, it is possible to implement per-field storage correctly and reliably, whereas with the hybrid model it may not be.

The disadvantage of all per-field storage is increased performance costs. Field operations involving N fields always requires N queries, even if the fields are single-value and not shared. For object load operations (such as node_load()) this is not a big deal because loaded fields are cached per-object, though custom queries (either manually written or from Views) can end up requiring a lot more joins. For object save operations, the additional write queries are unavoidable, though particularly busy write-heavy sites can mitigate the problem with master/slave database servers.

The Field API's default storage module, field_sql_storage.module, uses all per-field storage. The code sprint team debated the issue at length and the final vote was not unanimous, but a decision had to be made and we opted for the approach that we knew would work and could be implemented efficiently.

The default field storage module is not the last word, however. Three alternatives exist:

A contrib module, Per-Bundle Storage (PBS), re-introduces the hybrid storage model by automatically maintaining redundant per-bundle (the equivalent of per-content-type) tables for all limited-value fields. PBS tables actually contain fixed-multiple-value fields (e.g. 3 values, but not unlimited values) as well as shared fields, and thus allow more field data to be loaded in a single query even than CCK used to. PBS completely eliminates all read-time performance penalties, though it does leave the write-time performance penalty in place.
Another optional module, Materialized Views, provides an even more powerful and flexible method of denormalized data storage that improves read performance not just for field data but for any of Drupal's tables. See Introducing Materialized Views for more details.
The Field API uses a pluggable storage engine. While the default storage engine uses all per-field storage, it is probably possible to write an optional field storage engine that re-introduces directly hybrid field storage.

Code Implementation

Code implemented includes:

Most of the Field CRUD API
Most of the Field Attach API
Most of the Field Types API
A great deal of supporting code such as the Field Info API which collates information about the fields, instances, widgets, formatters, and bundles in the system.
Doxygen documentation for most of the above.
Changes to node.module and user.module to use the Field Attach API, thus “field-enabling” node and users.
Unit tests for all of the above that pass, though they are by no means complete.
Updated the CCK UI contrib module to use the Field API to provide the same kind of field management functionality available in D6 CCK but now for nodes and well as users.

We’re proud of the decisions we made and code we wrote. However, we believe our most important accomplishment is getting eight major Drupal developers, including both CCK maintainers, fully engaged in the Fields in Core process and synchronized with each other. Only so much can be accomplished in one week, particularly with such a complex topic, but we did enough and built enough momentum to get the project done in time for Drupal 7.

Sample code

As a quick introduction, here is sample module install file that uses the new Field API to create a new text field and assign it to “article” nodes and all users:

<?php   function demofield_install() {     // Create a text field.     $field_name = 'demofield';     $field = array('field_name' => $field_name, 'type' => 'text');     field_create_field($field);       // Create a field instance structure.     $instance = array(       'field_name' => $field_name,       'label' => 'Demofield Demo',       'description' => 'This is a demonstration text field.'     );     $instance['widget'] = array(       'type' => 'text_textfield',     );     $instance['display'] = array(       'full' => array(         'label' => 'above',         'type' => 'text_default',       )     );     // Attach the field to all users (because user.module does not     // define multiple "content types").     $instance['bundle'] = 'user';     field_create_instance($instance);       // Attach the field to "articles" (a name defined by node.module).     $instance['bundle'] = 'article';     field_create_instance($instance);   } ?>
Screen shots

This screen shot shows the page to manage and add new fields to users, with the custom field "Favorite Saying" already added:

Once a field is defined, it appears on the user edit form and profile display page:

Of course, there is still plenty of user interface and theme work to be done here.

Current status and How to Help

The initial Field API patch has been submitted to d.o. You can find this and all other Field API patches at http://drupal.org/project/issues?projects=3060&components=field%20system.

To help with Field API, please visit http://drupal.org/node/361849.

Code Sprint Best (?) Practices

Many of the participants in this code sprint also attended the DADS sprint and we benefited greatly by following all of the Lessons Learned listed in the DADS sprint final report. However, that was a design sprint. This was a code sprint and we developed some additional practices that we recommend for future core code sprints:

We had 6-7 developers working on a large core patch simultaneously. We needed to be able to share our incremental changes constantly and there is no way we could have gotten patches committed to Drupal HEAD in a timely fashion (especially early on when the changes had syntax errors!). We set up our own separate HEAD repository so that we could all make frequent commits. We used Bazaar but other version-control systems would have worked as well. However, CVS really sucks for everything.
We spent the time to break up our to-do list into a number of small tasks (we used the Scrum model of User Stories and Tasks). Using the d.o issue queue for all these small tasks would have had too much overhead, so we used a Google Spreadsheet to track them all. This made it very easy for each developer to grab individual tasks to work on without worrying about duplicating someone else's work.
When the sprint was over and the initial rapid pace of change declined, we began posting patches for review against HEAD to the d.o issue queue for community review. However, during the review process before the initial patch was committed, we kept using the Bazaar repository for development because it is just impractical to have multiple developers working on a large patch file simultaneously.
At the same time we posted the initial patch for review, we migrated our to-do list from the Google Spreadsheet to the d.o issue queue. This made it much easier for the rest of Drupal community to contribute because it is the issue tracking system they are familiar with.
Finally, we realized that for a large and complicated core patch, it is very difficult for many developers to take the time to understand and review it. However, it is much easier to read and review API documentation, and by reviewing the documentation they would implicitly be reviewing the design and code. Therefore, we wrote Doxygen API documentation, posted it on a publicly accessible site (using the API Module), and created a separate d.o issue for reviewing the documentation.

Sprint Contributors

Many individuals and businesses came together to make the fields in core sprint a success. Most notable were Yves, Karen, and Moshe, who donated a week of unpaid personal time. Achieve Internet, Acquia, Four Kitchens, and NowPublic all donated a week of an employee’s time. Acquia allowed us to camp out in their offices for a week (secretly, we know they loved having everyone there!). We also received over $5,600 in donations from 146 sponsors throughout the community. We and everyone in the Drupal community deeply appreciate all this support to benefit the future of Drupal. We will provide a financial report of how the money was spent once all the details and receipts are in.

This code sprint generated an enormous return for a fairly small investment. Hiring these eight developers and bringing them together in one place to work for a week would easily cost over $50,000 USD. Instead, we spent barely one-tenth as much and accomplished a goal the Drupal community has sought for several years. There are several other key Drupal sprints coming up, and our experience with this one shows important and effective sponsoring them can be.

Drupal version: Drupal 7.x

Original Article:

Fields in Core Code Sprint Final Report