Foreign Affairs Launches on Drupal 6
Foreign Affairs is the journal of the Council on Foreign Relations, a non-partisan member organization dedicated to improving the understanding of U.S. foreign policy and international affairs through the free exchange of ideas. The Council was founded in 1921 by academics and diplomats who had advised President Woodrow Wilson on how to deal with the aftermath of World War I. Its 3,400 members include Presidents, Cabinet members, and other high-ranking U.S. government officials, as well as renowned academics, journalists, and major leaders of business, media, human rights, and other non-governmental groups.
Since its inception in 1922, articles and essays published in Foreign Affairs have helped shape political debate and policy on some of the most important issues of the day. Authors who have written for the journal have included influential intellectuals and political leaders ranging from W.E.B. DuBois to Henry Kissinger to Hillary Clinton.
The new Foreign Affairs Web site, developed by Palantir.net, showcases Drupal’s wide variety of capabilities. In addition to leveraging existing modules, the site features a significant amount of new functionality developed specifically for this project. The magazine's management strongly supports the open source development model, and much of the work done for the site has been or will soon be contributed back to the Drupal community.
Theming
The site's theme, built by Palantir from designs by Concentric Studio, is a Zen sub-theme built using sustainable theming methods. The new site design makes extensive use of custom typography, which is achieved in a standards-compliant, accessible manner using the sIFR 3 Flash replacement method.
Foreign Affairs' sophisticated user interface required JavaScript-based enhancements for many elements. From styling rounded corners to improving form usability, Palantir engineered a richer user experience using JavaScript technologies. jQuery, Drupal behaviors, and a few select jQuery plugins (labelify, linkselect, and corners, to be precise) provided the JavaScript toolbox used to build the site.
Taxonomy, Dates, and Filtering
Making more than 60 years' worth of content easy to navigate was one of the project's primary goals. Organized along multiple taxonomy vocabularies, the site provides several categorizations through which readers can drill down from general categories to specific issues. Content can now be navigated by date, region, topic, and author (among others), and any given article, essay, or book review is cataloged across these dimensions. Multi-dimensional search and filtering were key requirements from day one, as was the ability to view all articles and books written by a specific author.
Each individual article is tagged with one or more authors using CCK node reference fields, one or more topics using taxonomy terms, and a publication date, specified using a CCK date field instead of the standard Drupal "published on" time stamp. Articles are also assigned to specific issues using a CCK node reference.
As those in the Drupal community know, Views is made for requirements like these. Since the data is stored as CCK fields, making Views filters to create complex pages was fairly simple.
For example, an author page uses Panels to rewrite the display. Instead of just showing information about the author, a View is loaded that searches for all of the author's articles, books, and reviews of those books. Any individual article that lists that author links back to this page, allowing readers to gain historical context not just for the author's works, but also for the different ways those works have been covered over the years, making each author page an invaluable research tool.
For more complex pages, like Topics and Regions, the challenge was to provide faceted article filtering. Again, Views makes this possible through the use of exposed filters, allowing readers to research topics like the 1967 Arab-Israeli War by selecting a few options.
As the site's editorial staff continues to populate the site with content from the journal's archives, material from some years is missing from the site's database. While the Date filter allows date ranges (like 1947-present), it does not support calculated skip-dates. To work around this, a dynamic filter layered on top of the normal Date filter presents visitors with only those year options that exist in the currently available online collection. Visitors can drill down through the journal's archives using dynamic menus, such as those seen when you click the arrow next to the year "2009" on the In the Magazine page.
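The specifics of that dynamic filter live in the site's custom code, but the general pattern can be sketched as a form alter on the Views exposed filter form. The field and table names below (field_pub_date, the 'year' filter key, and the example_ module prefix) are illustrative assumptions rather than the production code:

function example_form_alter(&$form, $form_state, $form_id) {
  if ($form_id == 'views_exposed_form' && isset($form['year'])) {
    // Offer only the publication years that actually exist in the
    // online collection, rather than a static 1922-present range.
    $options = array('All' => t('- Any -'));
    $result = db_query("SELECT DISTINCT SUBSTRING(field_pub_date_value, 1, 4) AS year
      FROM {content_field_pub_date} ORDER BY year DESC");
    while ($row = db_fetch_object($result)) {
      $options[$row->year] = $row->year;
    }
    $form['year']['#options'] = $options;
  }
}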
Similarly, readers can also drill down to view subtopics and subregions using dynamic menu options. For example, if a reader is looking at Regions > Middle East, only subregions of the Middle East are shown in the selection options.
Data Import
Foreign Affairs wanted to not only provide their entire back catalog through the new Web site, but also have a way to quickly and easily publish all of the articles for each new issue at the same time.
More than 60 years' worth of existing content, comprising over 14,000 XML documents, lived in Foreign Affairs' in-house article management system. To import, debug, and edit those documents, Palantir built a custom import system that not only processed all of the existing data, but also made editorial content review easier by flagging possibly incorrect data, providing detailed feedback on how content was processed, and retaining the original content for re-processing when necessary. To accommodate large-batch processing, an advanced Drupal Web interface was crafted to load tens of thousands of pieces of content at once.
Traditional import models in Drupal could not support the data model used, since the XML definition for an article also contains information regarding its issue, its authors, and any books that are mentioned within the article. Under the Foreign Affairs data model, issues, articles, authors, and books are all separate content types, linked together using the CCK node reference system.
The parser needed to be able to look at an XML file and make the following determinations:
- What type of article is this?
- What issue does this article belong to?
- Does this issue node exist or should we create it?
- Who are the authors of this article? (There can be multiple authors.)
- Do author nodes exist or should we create them?
- What are the books referenced in this article (indicated by ISBN)?
- Do book nodes exist or should we create them?
When reading an XML article file, it was necessary to keep track of the nodes that were referenced by the article, and create new ones as needed. In the end, the 14,000 XML documents generated over 35,000 nodes.
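The reference-resolution step can be sketched as follows; the content type name, field name, and helper function are hypothetical, and real articles carry many more fields, but the find-or-create pattern is the heart of the importer:

// Find the nid of an existing author node, or create one so the
// article's node reference field has something to point at.
function example_author_nid($name) {
  $nid = db_result(db_query(
    "SELECT nid FROM {node} WHERE type = 'author' AND title = '%s'", $name));
  if (!$nid) {
    $author = new stdClass();
    $author->type = 'author';
    $author->title = $name;
    $author->uid = 1;
    node_save($author);
    $nid = $author->nid;
  }
  return $nid;
}

// During import, resolve each <author> element in the XML document.
$xml = simplexml_load_file($filepath);
foreach ($xml->authors->author as $author) {
  $article->field_author[] = array('nid' => example_author_nid((string) $author));
}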
Additionally, this import application needed to give feedback and reports that editors could understand and use as a guide to make corrections to the imported data. Drupal 6 and the Batch API to the rescue!
The XML files for each import were placed into specific directories on the Web server (outside of the htdocs root for security), and then loaded through a manual form. The import would take entire years or decades of content, register the files, and queue a Batch API request.
The Batch API also allows real-time feedback, so an editor can see how many files are being processed, the status of each file, and any success or error messages. Each document was first run through SimpleXML. If the document failed to parse, or if any required XML data fields (like the issue number) were missing, the file was marked as invalid and moved to a designated directory for review. If other XML data was missing or malformed (such as non-matching taxonomy terms), the file was flagged for review and similarly moved. Views was then used to generate a report of each import, so that an editor could see at a glance whether any files failed or were flagged for review, and which new content pages were created.
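A stripped-down version of that batch setup might look like the following; the function names are illustrative, and the real importer performs far more validation per file:

function example_import_form_submit($form, &$form_state) {
  $batch = array(
    'title' => t('Importing Foreign Affairs archive'),
    'operations' => array(),
    'finished' => 'example_import_finished',
  );
  // One batch operation per registered XML file.
  foreach ($form_state['values']['files'] as $filepath) {
    $batch['operations'][] = array('example_import_file', array($filepath));
  }
  batch_set($batch);
}

function example_import_file($filepath, &$context) {
  $xml = @simplexml_load_file($filepath);
  if ($xml === FALSE || empty($xml->issue)) {
    // Unparseable, or missing a required field: set aside for review.
    $context['results']['failed'][] = $filepath;
    $context['message'] = t('Flagged @file for review.', array('@file' => $filepath));
    return;
  }
  // ... create or update the issue, author, book, and article nodes ...
  $context['results']['created'][] = $filepath;
  $context['message'] = t('Processed @file.', array('@file' => $filepath));
}

function example_import_finished($success, $results, $operations) {
  $created = isset($results['created']) ? count($results['created']) : 0;
  $failed = isset($results['failed']) ? count($results['failed']) : 0;
  drupal_set_message(t('@count files imported, @failed flagged for review.',
    array('@count' => $created, '@failed' => $failed)));
}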
Being able to review badly formed XML documents made the import process painless. Editors could see the cause of each error (for instance, "Title is in all CAPS" or "No matching taxonomy terms were found"), prioritize their manual corrections accordingly, or fix the XML document and re-run the import.
In addition to being able to populate the new site with decades worth of back issues prior to launch, this import system also makes it possible to load a new issue of the magazine onto the site in a matter of minutes.
Content Access
Through real-time access over secure Web services, Palantir integrated Drupal with two remote user databases, allowing more than 150,000 existing subscribers to seamlessly access the new Foreign Affairs site.
A set of strict yet flexible rules provides access to Foreign Affairs' rich historical archive of content. In general terms, the site's free content breaks down into the following categories:
- A selection of articles from the current issue, which are viewable by anyone and may be viewed paginated or in single-page view.
- Articles from past issues that have been marked as free to all site visitors.
Any article that does not match the above criteria is available only as a 500-word excerpt, followed by a prompt to subscribe to the magazine. This rule is enforced through a user 'class' flag that exists in parallel to the traditional Drupal account role system, conforming to the magazine's business rules (a sketch of the access check follows the list):
- If a user is an authenticated member of the Council on Foreign Relations, then all content is visible.
- If a user is a current subscriber as identified by the EclipseNet SOAP service, then all content is visible.
- If a user visits the site from the computer network of a licensed institution, such as a college or university, then all content is visible.
- If the user is a registered search engine crawler (such as the Google Search Appliance used on the site), then all content is visible and pagination is disabled.
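A reduced sketch of that check, with hypothetical property and helper names (the real rules also handle pagination and the crawler case listed above):

// Can this account see the full article, or only the 500-word excerpt?
function example_full_access($node, $account) {
  // The 'class' flag lives alongside Drupal's role system and is set
  // at login (Council member, subscriber) or per request (site
  // license, registered crawler).
  if (!empty($account->fa_class)) {
    return TRUE;
  }
  // Free content: selected current-issue articles, plus archive
  // articles explicitly marked as free.
  if (example_in_current_issue($node) || !empty($node->field_free[0]['value'])) {
    return TRUE;
  }
  return FALSE;
}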
Editors have the option of allowing the system to determine the most recent issue, based on publication date, or using a simple form to select the current issue from a list of all Issue nodes. This flexibility allows editors to preview new content without making it available for free to all users.
EclipseNet and Remote Authentication
For authenticated users, two separate sources of user information had to be merged. Some users, upon login, would be authenticated against one source, while other users would be authenticated against a different source (a final, smaller class of users authenticated directly against Drupal). This tripartite module system was built using Drupal's existing authmap structure.
Staff and affiliates of the Council on Foreign Relations use a network-wide user management system to provide authentication services for all Council-managed Web sites. This system exposes a SOAP Web service, and an authentication gateway was built that seamlessly integrated the existing SOAP-based authentication system with the new Foreign Affairs Web site. Existing Council users continue to use their existing authentication credentials to log in, and their user information continues to be maintained by the back-office user management system. This covered the needs of Council users, but the bulk of the Foreign Affairs user base comes from outside of the Council, and information for these users exists in a different data repository.
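A minimal sketch of such a gateway, assuming a hypothetical helper that wraps the Council's SOAP service: when the remote service accepts the credentials, Drupal's authmap mechanism creates or loads the matching local account.

function example_cfr_login_validate($form, &$form_state) {
  $name = $form_state['values']['name'];
  if (example_cfr_soap_authenticate($name, $form_state['values']['pass'])) {
    // Registers the user locally on first login, records the mapping
    // in Drupal's {authmap} table for this module, and logs them in.
    user_external_login_register($name, 'example_cfr_auth');
  }
  // Otherwise fall through to the next source: EclipseNet, then
  // Drupal's own account table.
}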
The new Web site employs a multi-tiered access system. Anonymous users can see some content. Members can see more. Print subscribers have even more tools and information at their disposal. The information about these users is maintained in Via Subscription's EclipseNet subscription management service, which is accessible through an advanced SOAP service providing over 60 functions, each requiring a complex set of parameters. Not only authentication, but also e-commerce and subscription management features are handled through this Web service. Palantir worked closely with the Council and Via Subscription to build a robust, performant, feature-rich bridge between Drupal and this Web service.
With Via Subscription integration, visitors can sign up for membership, subscribe to the print edition, manage subscription renewals, purchase PDF reprints, and manage their personal information. And all of these features are integrated into the user account management tools included with Drupal.
Anytime a site begins integrating with remote services, one issue immediately surfaces: performance. While Web site users readily accepted thirty-second download times in the 1990s, today's users demand fast load times.
In performance terms, SOAP transactions can be costly, especially when executing complex requests that operate on large datasets. The costliest transactions, those that required complex calculations and the transmission of large amounts of data, could take more than twenty seconds, so streamlining this process was essential. To optimize performance, Palantir built a sophisticated caching layer that pre-calculates and stores frequently requested (but rarely modified) information. Large amounts of data can be retrieved and recalculated during off-peak times. Visitors benefit from snappy response times, and the Council on Foreign Relations benefits from lower overall network traffic as duplicate SOAP calls are eliminated.
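The caching wrapper can be sketched like this, with illustrative function and table names; frequently requested results are served from a dedicated cache table and only refreshed when stale:

function example_soap_cached($method, $params, $ttl = 3600) {
  $cid = $method . ':' . md5(serialize($params));
  $cache = cache_get($cid, 'cache_soap');
  // Serve from cache while the entry is fresher than the TTL.
  if ($cache && $cache->created > time() - $ttl) {
    return $cache->data;
  }
  // Cache miss or stale data: make the expensive remote call once,
  // then store the result for every subsequent visitor.
  $result = example_soap_call($method, $params);
  cache_set($cid, $result, 'cache_soap');
  return $result;
}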
As with most of the other major features developed for the site, the remote services integration was a project in its own right. This endeavor produced four modules and well over 5,000 lines of code. While most of this code is too project-specific to be of value outside of Foreign Affairs, it allowed for the development of a robust strategy for utilizing Web services that has allowed Palantir to tremendously cut the amount of time it takes to implement SOAP-based modules.
IP-based Content Licenses
An IP address-based access system was built to allow users from institutions with existing site licenses to access premium content without having to log into the site. While these users can view all content on the site, they do not have all of the same privileges as paid subscribers, such as the ability to post comments. In Drupal terms, these are anonymous users who have to be treated as a special case.
The solution was to build a parallel role system, maintained within the $user account information for a given page request. Users marked as 'site license' holders pass the permission check to view an entire article, regardless of their login status.
But how to track IP-based access rules? To complicate matters, some of the licensing institutions had dedicated IP ranges from an in-house host, while others used proxy hosts or third-party servers, which meant their IP ranges could change. The following conditions therefore had to be accounted for:
- Does the user IP address completely match a known IP string (e.g. 127.0.0.1) assigned to a licensing institution?
- Does the user IP address match a known hostname (e.g. example.com) when run through gethostbyaddr()?
- Does the user IP address match a regular expression indicating a known IP string (e.g. 127.0.\d, where \d is any digit)?
Note: We only deal with IPv4 here, and there is an additional security check not detailed here.
The third rule exists for ease of managing the IP lookups. Many Web hosts run a collection of related IPs for their users. Many university networks, for example, assign users' IPs in a range such as 127.0.0.11 => 127.0.0.17. This can be expressed as a regular expression in the format 127.0.0.1[1-7], which means that it's not necessary to create seven separate records (one for each unique IP in the range).
The first step was to write the IP validator sequence. It turns out that the normal PHP methods for validation require a full octet string for both sides of the calculation (see ip2long() and related functions for details). Since it's necessary to check against regex patterns, it's not possible to use PHP's built-in logic. Instead, each IP address has to be broken down into a sequence of four octets. The first two octets are always required and must be numeric, as they form the basis for any pattern matching. The second two octets are optional, and may be numeric or contain regex logic, recognizing numbers, brackets [], and the digit metacharacter \d.
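The matcher itself reduces to a small routine; this sketch assumes the patterns have already been validated on input, and the function names are illustrative:

function example_ip_match($ip, $patterns) {
  foreach ($patterns as $pattern) {
    // Escape the literal dots, then anchor the expression so that
    // 127.0.0.1 cannot also match 127.0.0.10.
    $regex = '/^' . str_replace('.', '\.', $pattern) . '$/';
    if (preg_match($regex, $ip)) {
      return $pattern;
    }
  }
  return FALSE;
}

// Matches 127.0.0.11 through 127.0.0.17, per the earlier example.
$match = example_ip_match(ip_address(), array('127.0.0.1[1-7]'));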
Inbound IPs are gathered by the Drupal ip_address() function and then checked against the registered octet patterns. From there, the license data is loaded from the main table to get the hostname, licensee name, and other data needed for the site's reporting and tracking systems, notably Omniture.
This data could not be stored in sessions or cookies, because it needed to work for users with browsers set to disallow cookies.
Because it is necessary to query DNS servers, the cost of performing this check on every page load can be quite high. To alleviate this, the system leverages the Drupal cache system by creating a {cache_license} table. Every time an IP is checked, the {cache_license} table is first queried. If no record is found, we run our calculations on the IP address and then cache the result for future visits. This cache is persistent, and is only cleared when the base IP license records are updated or deleted.
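In sketch form, with the real calculation hidden behind a hypothetical helper, the persistent cache looks like this:

function example_license_lookup($ip) {
  $cache = cache_get($ip, 'cache_license');
  if ($cache) {
    return $cache->data;
  }
  // First visit from this IP: run the octet, hostname, and regex
  // checks once, then store the result permanently. Negative results
  // are cached too, so unlicensed IPs also cost only one lookup.
  $license = example_ip_check($ip);
  cache_set($ip, $license, 'cache_license', CACHE_PERMANENT);
  return $license;
}

// When license records are updated or deleted:
// cache_clear_all('*', 'cache_license', TRUE);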
Finally, Palantir built a user interface to allow editors to create, update, and delete license records at will. This framework is planned to be released as the IP API, a system for creating access control lists based on IP-matching patterns.
Tracking: Google Search, Zedo, Omniture and Popularity
Integrating the multiple analytics, advertising, and search systems used on the Foreign Affairs site called for developing a unified framework to handle data requests. The information needed by each of the four systems used is remarkably similar and required checking and storing data for every page load, including cached pages.
The search, analytics, and advertising systems used on Foreign Affairs are:
- Google Search, an instance of the commercial Google Search Appliance specifically licensed to the Council on Foreign Relations and running on a dedicated server. For usage on the Foreign Affairs site, Google Search cares about the content type, taxonomy, author, and date information for an article.
- Zedo, a third-party advertising server that delivers ads via iframe based on HTML inserted in page output. Zedo cares about the content type, taxonomy, user status, and ad size.
- Omniture, a hosted analytics suite used widely in the media industry, which uses JavaScript tags to record Web traffic. Omniture tags care about the content type, user status, taxonomy, and some specific e-commerce variables for tracking ad campaigns.
- Popularity, an internal statistical tracking module used to chart the most popular articles on the site. Since the requirements called for tracking popularity across content types and taxonomy terms, the core Statistics module would not work here.
The four systems share several common items: content type, taxonomy, and user status being the most common. Since these data points needed to be captured for each page load -- including cached pages -- streamlining the process as much as possible was a key goal.
The solution lies in Drupal's hook_boot(). This hook fires on every page load and was already being used to track user information for site licenses and remote authentication. It was used much like the IP handler: loading cached data about a page if it existed, or creating that data if it did not. The process worked like so (a sketch follows the list):
- Inbound page requests were mapped to their menu handlers (node, panels, views) as needed.
- Data about the current page request was parsed from the URL (in the case of nodes, by loading specific data based on the node ID).
- The data was then set in a static variable for use by other modules, which could add additional data points if necessary.
- The data was then cached permanently, if needed.
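In outline, that collector looks something like the sketch below (the cache table and parser helper are illustrative; note that hook_boot() runs before the full bootstrap, so only lightweight operations belong here):

function example_boot() {
  // Prime the shared data for this request, even on cached pages.
  example_page_data();
}

function example_page_data() {
  static $data;
  if (!isset($data)) {
    // The raw path; full path normalization happens later in the
    // bootstrap, so the cache is keyed on the inbound request.
    $path = $_GET['q'];
    $cache = cache_get('page_data:' . $path, 'cache_tracking');
    if ($cache) {
      $data = $cache->data;
    }
    else {
      // Parse once: content type, taxonomy, author, date, and so on.
      $data = example_parse_request($path);
      cache_set('page_data:' . $path, $data, 'cache_tracking');
    }
  }
  // Other modules read (and may extend) this static copy.
  return $data;
}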
This approach made it possible to parse data once and use it many times and for many purposes, rather than requiring each of the four modules to have its own parsing system.
This approach also easily solved a late feature request: the Google Search Appliance needed full access to all articles, and those articles could not be shown in paginated form. By integrating with the Site License module, it became possible to assign the Google Search Appliance a known IP and set a special flag indicating whether that licensee should be shown paginated or unpaginated versions of an article. (The pagination state also had to be passed to Omniture, making this integration a double win.)
Deploy
Finally, following best-practices guidelines for editorial environments, Palantir built a content deployment system for Foreign Affairs that allows content to progress from the development server to the staging server, and from the staging server to the production server, in a secure, reliable, and robust manner. The Deploy system allows the three Drupal instances to "talk to each other" and share content without relying upon an external system. With this strategy, advanced features like selective bidirectional syncing of user-submitted content are handled effectively and securely, a feat that external synchronization tools struggle to achieve.
For the user, deployment is simple. Nodes and other Drupal content are added to a "Deployment Plan", which is just a way to group content together to be pushed at the same time. After all the content has been added to a plan, the user can push it to a remote server. This can be new content or updates to existing content.
While deployment is simple for the user, under the hood it is reasonably complicated and has a lot of moving parts. There are three major components that comprise the framework that deployment operates in.
UUID - This module maps primary keys for "things" to UUIDs, which are guaranteed to be unique across servers. This is how the deployment system keeps things straight between servers. When Deploy is installed, a set of tables is created which map keys to UUIDs. For instance, node_uuid contains the fields uuid and nid. If Deploy is being installed on a site with existing content, UUIDs are generated for all the existing records. UUIDs are generated using PHP's uniqid() function.
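For nodes, that mapping is a thin wrapper around hook_nodeapi(); the sketch below is a plausible reduction of what the module does, not its exact code:

function uuid_nodeapi(&$node, $op) {
  switch ($op) {
    case 'insert':
      // Pair the local primary key with a globally unique ID.
      db_query("INSERT INTO {node_uuid} (nid, uuid) VALUES (%d, '%s')",
        $node->nid, uniqid('', TRUE));
      break;
    case 'load':
      // Attach the UUID so Deploy can address this node remotely.
      return array('uuid' => db_result(db_query(
        "SELECT uuid FROM {node_uuid} WHERE nid = %d", $node->nid)));
  }
}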
Deploy - This includes not only deploy.module which manages the setup of plans and servers, but also individual deployment modules which manage Drupal "things" and generally map to their Drupal module equivalents (for instance, Drupal uses taxonomy.module to manage both taxonomy terms and vocabularies, so deploy has a taxonomy_deploy.module which does the same).
Services - The Services module is used to receive the "things" sent by deploy.module. Services defines its own service modules in the same way that Deploy does (there is a taxonomy_service.module in services).
Let us take the example of deploying a single node. Here is a stripped down node object:
stdClass Object (
    [nid] => 8556
    [type] => book
    [uid] => 3
    [title] => Kittens Are Awesome
    [body] => They are so cute and furry and OMG!
    [taxonomy] => Array (
        [112] => stdClass Object (
            [tid] => 112
            [vid] => 2
            [name] => Fuzzy
        )
        [285] => stdClass Object (
            [tid] => 285
            [vid] => 2
            [name] => Petting
        )
    )
    [uuid] => 17916866454935b60a142499.7768560
)
In order for this node to be properly deployed, not only do fields like the body and title need to reach their destination, but so do two taxonomy terms and a user. The taxonomy terms have their own dependencies in the form of parent terms, related terms (which have their own dependencies), and the vocabulary they belong to. In order to deploy this node, the following process has to take place.
- Note that this node has a UUID. When a new node is created, uuid.module acts on hook_nodeapi('insert') by adding an entry to the table node_uuid. In cases where no 'load' operation can be hooked into (like taxonomy) this has to be managed by hand.
- For each item in the plan, a dependency checking hook is called (sketched after this list). In this example, this means taxonomy_deploy.module and user_deploy.module (for a real node deployment there may be more moving parts, like node references to other nodes, the content type of the node, etc.).
- Each of these modules checks to see whether or not its "thing" already exists on the destination server. This is managed by passing the UUID to a Web service on the destination server. The destination looks it up in the appropriate mapping table and returns the remote key or false if not present.
- If the "thing" does not exist on the remote server, then it is added to the deployment plan. In some cases this may cause the "thing" in question to in turn fire off its own dependency checking hook. The taxonomy term Fuzzy above has a parent term which needs to be checked, and that term may have a parent as well, and they all belong to a vocabulary. This all happens recursively until the dependencies have been accounted for. Dependencies are weighted, as much as possible, such that nothing gets pushed until everything it needs has already been pushed. So for instance, taxonomy vocabularies are pushed before taxonomy terms, which are pushed before nodes.
- Finally, each item in the plan is pushed to the remote server. Items are pushed in order of their weight according to standard Drupal practice.
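A dependency checking hook for taxonomy might be sketched as follows; the service method name and plan helper are illustrative assumptions, not Deploy's actual API:

function example_taxonomy_deploy_check(&$plan, $node) {
  foreach ($node->taxonomy as $term) {
    // Ask the destination server whether it already has this term.
    $remote_tid = xmlrpc($plan->url, 'uuid.getTid', $term->uuid);
    if (!$remote_tid) {
      // Missing remotely: queue the term ahead of the node. Adding it
      // fires the term's own dependency checks (vocabulary, parents),
      // which recurse until everything is accounted for.
      example_plan_add($plan, 'term', $term->tid, -10);
    }
  }
}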
The process for deploying a "thing" generally looks like this:
- Load "thing".
- Does "thing" exist on remote server? If so, get the remote ID and replace the one in the "thing's" object with the remote ID. If not, unset the object's ID (which Drupal generally interprets as "this thing is new").
- Call a hook which allows all of the dependencies to update their IDs as well.
- What you now have is a "thing" with all of its IDs representing their remote equivalents. This "thing" can now simply be passed to a service on the destination server and saved through the appropriate mechanism (node_save(), user_save(), etc.); a sketch of this remap-and-push step follows the list.
- Since the dependency weighting is managed so carefully, you can (hopefully) always be assured that your dependencies are on the other side before actually pushing your "thing". This is the great benefit of having a dependency checking step separate from the actual deployment step.
- If any item in the plan fails, then all deployment is halted and the error is reported. Again, this prevents any dependent items from being pushed after something has failed.
- User gets a pretty table which reports all their goodness (or bummers as the case may be).
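That remap-and-push step, reduced to a sketch with illustrative service method and hook names:

function example_node_push($plan, $nid) {
  $node = node_load($nid);
  $remote_nid = xmlrpc($plan->url, 'uuid.getNid', $node->uuid);
  if ($remote_nid) {
    // Already deployed once: overwrite the remote copy.
    $node->nid = $remote_nid;
  }
  else {
    // New to the destination: unset the ID so node_save() creates it.
    unset($node->nid);
  }
  // Let dependent modules swap in remote IDs for terms, users, and
  // node references before the object crosses the wire.
  foreach (module_implements('node_deploy') as $module) {
    $function = $module . '_node_deploy';
    $function($node);
  }
  return xmlrpc($plan->url, 'node.save', $node);
}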
Note that this system has no concept of "dev" or "live" servers; it just pushes things from one place to another. This allows Foreign Affairs to have a workflow of dev->staging<->live. Code and content are pushed live from the staging server, while at the same time user-generated content like comments is pushed to staging from live. Deploy has a well-defined API for managing and pushing plans, so writing a module to manage something like deployment of comments on creation is easy.
The Deploy system was presented at DrupalCon DC, and is now available on Drupal.org.
Module list
Along with several custom modules that provided site-specific functionality, the following contributed modules were used on the Foreign Affairs site:
- admin_menu
- CCK
- Date
- Deploy
- ImageAPI
- ImageCache
- GMap
- Nodequeue
- AddThis
- Advanced help
- Autoload
- CacheExclude
- Composite Layout
- Google Search Appliance
- Menu Block
- Pagination
- Pathauto
- Public Preview
- Secure Pages
- String Overrides
- Talk
- Token
- User Protect
- Webform
- Panels
- Services
- WYSIWYG API
- Views
The new Foreign Affairs Web site is not just a great example of Drupal's capabilities; it's a cutting-edge resource that allows visitors from all walks of life to access, share, and engage in discussion on some of the most important issues faced by our world today. The new site allows the magazine to reach out to its readership in new ways while still upholding its role as the leading forum for serious discussion of American foreign policy and international affairs.
Case study written by Evan Clossin, George DeMet, Matt Butcher, Ken Rickard, Greg Dunlap, and Larry Garfield
Drupal version: Drupal 6.x