Handling HTML With Drupal's Migrate API

Parsing HTML versus processing with regular expressions

Suppose you need to extract the URL from a bit of HTML markup, for example the value of the href attribute of a link (an <a> element).

Using regular expressions

If you are familiar with regular expressions, then it is pretty easy to come
up with a pattern that matches the opening <a> tag and captures the value of
its href attribute, and use it with built-in PHP functions like preg_match().

Unfortunately, it is more complicated than that. For example:

The HTML tags are case-insensitive: you have to match a or A.
There might be other attributes, such as class, id, or name, before or
after the href attribute.
The URL (value of the href attribute) might be enclosed in single quotes
instead of double quotes.
There might be newlines within the HTML element.
Are you sure that an escaped quote (like \") is not allowed in a URL?

Before you start researching that last question, the point is that you should
not spend your time reinventing the wheel.
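To make the list above concrete, here is a minimal sketch of how a simple pattern breaks down. The pattern and the sample strings are invented for illustration; they are not taken from a real site.

```php
<?php
// A deliberately naive pattern: lowercase tag, href as the first attribute,
// double quotes only. It is an illustration, not a recommendation.
$naive_pattern = '/<a href="([^"]*)">/';

// Hypothetical sample inputs. All four are links to the same URL.
$samples = [
  'plain' => '<a href="https://example.com/">link</a>',
  'uppercase' => '<A HREF="https://example.com/">link</A>',
  'extra attribute' => '<a class="button" href="https://example.com/">link</a>',
  'single quotes' => "<a href='https://example.com/'>link</a>",
];

$results = [];
foreach ($samples as $label => $html) {
  // preg_match() returns 1 when the pattern matches and 0 when it does not.
  $results[$label] = preg_match($naive_pattern, $html, $matches) === 1;
  printf("%-16s %s\n", $label, $results[$label] ? 'matched' : 'missed');
}
```

Only the first sample matches. Each missed case can be patched, but every patch makes the pattern harder to read and there is always one more case.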

There is an
amusing answer on StackOverflow
describing the dangers of trying to process HTML with regular expressions, and
this practice has come to be known as
Parsing Html The Cthulhu Way.
The StackOverflow answer ends with the suggestion,

Have you tried using an XML parser instead?

Using the DOMDocument class

In PHP, we can use the DOMDocument and related classes to parse HTML markup.
These classes use an HTML parser in the background rather than regular
expressions.
There are some steps to set things up:

$document = new \DOMDocument();
$document->loadHTML($html_string);
$xpath = new \DOMXPath($document);

After this bit of boilerplate code, we can search the $xpath object with any
XPath query and extract whatever attributes we need.
For example, to find the href attribute of each <a> element in the source,

foreach ($xpath->query('//a') as $html_node) {
  $href = $html_node->getAttribute('href');
  // Your processing goes here.
}

Using XPath queries gives us a lot of flexibility: we can find elements
having a specific class, or we can select those that are nested inside some
other HTML element.
We did not even think about these possibilities when discussing regular
expressions above.
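As a rough sketch of that flexibility, the following queries select links by class and by nesting. The sample markup and the class name "button" are assumptions made up for this example.

```php
<?php
// Hypothetical sample markup, invented for this sketch.
$html_string = <<<'HTML'
<div class="teaser"><a class="more button" href="/node/1">Read more</a></div>
<p>See <a href="/node/2">another page</a> or <A HREF='/node/3'>a third</A>.</p>
HTML;

$document = new \DOMDocument();
// Collect parser warnings internally instead of printing them; real content
// is rarely perfectly valid.
libxml_use_internal_errors(TRUE);
$document->loadHTML($html_string);
libxml_clear_errors();
$xpath = new \DOMXPath($document);

// Collect the href values of all nodes matching an XPath query.
$hrefs = function (string $query) use ($xpath): array {
  $values = [];
  foreach ($xpath->query($query) as $html_node) {
    $values[] = $html_node->getAttribute('href');
  }
  return $values;
};

// Links that have "button" among their space-separated classes.
$by_class = $hrefs('//a[contains(concat(" ", normalize-space(@class), " "), " button ")]');
// Links nested anywhere inside a paragraph.
$in_paragraph = $hrefs('//p//a');

print_r($by_class);
print_r($in_paragraph);
```

Note that the uppercase tag with single quotes needs no special handling: the parser normalizes it before the query runs.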

When you are finished processing your DOMDocument object, you can convert it
back to a string:

$processed_html = $document->saveHTML();
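One detail to keep in mind when the input is a fragment, such as the value of a text field: loadHTML() wraps it in a full document, and saveHTML() with no argument returns that full document. The sketch below, using made-up sample markup, shows one way to get just the fragment back.

```php
<?php
// Hypothetical sample fragment.
$html_string = '<p>Read the <a href="/docs">documentation</a>.</p>';

$document = new \DOMDocument();
$document->loadHTML($html_string);

// With no argument, saveHTML() serializes the whole document, including the
// wrapper elements that the parser added around the fragment.
$whole_document = $document->saveHTML();

// Passing a node serializes just that node, so looping over the children of
// <body> recovers the fragment.
$fragment = '';
$body = $document->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
  $fragment .= $document->saveHTML($child);
}

print $whole_document . "\n";
print $fragment . "\n";
```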

Migrate API and the ETL paradigm

In Drupal 8, the Migrate API follows the standard Extract, Transform, Load
(ETL) structure, and we also keep the terminology from the contributed Migrate
module in earlier versions of Drupal:

Extract (source plugin): read data from the source
Transform (process plugins): change data to match the site’s structure
Load (destination plugin): save the data

Each migration has a single source plugin and a single destination plugin, but
each field uses at least one process plugin and may use several.
I think this is the fun part: creating new, easy-to-configure process plugins
is the best way to add reusable code to the framework.

The Transform/process phase is also the right place to handle HTML processing.

New process plugins for managing HTML

So far, Marco and I have contributed four process plugins to the Migrate Plus
module.
The goal of these plugins is to make it easy to process text fields with
proper HTML parsing.
The plugins create the required DOMDocument and related objects, so the person
writing the migration only has to supply the XPath expression and other configuration.

The dom plugin

This plugin handles creating the DOMDocument object from a string, and then
converting back to a string at the end.
The other plugins go between these two steps, so they take a DOMDocument
object as input, do some processing on it, and return the same object.
This is what it looks like in practice:

process:
  'body/value':
    -
      plugin: dom
      method: import
      source: 'body/0/value'
    # Other plugins do their work here.
    -
      plugin: dom
      method: export
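For readers who think in code rather than configuration, here is a plain-PHP analogy of that pipeline. It is not the plugin implementation; the function names are hypothetical, and the middle step (adding rel="nofollow") is invented just to have something between import and export.

```php
<?php
// Import step: string in, DOMDocument out.
function example_dom_import(string $html): \DOMDocument {
  $document = new \DOMDocument();
  libxml_use_internal_errors(TRUE);
  $document->loadHTML($html);
  libxml_clear_errors();
  return $document;
}

// A hypothetical middle step: it receives the document, changes it in place,
// and returns the same object.
function example_add_nofollow(\DOMDocument $document): \DOMDocument {
  foreach ((new \DOMXPath($document))->query('//a') as $link) {
    $link->setAttribute('rel', 'nofollow');
  }
  return $document;
}

// Export step: DOMDocument in, string out (only the contents of <body>).
function example_dom_export(\DOMDocument $document): string {
  $html = '';
  $body = $document->getElementsByTagName('body')->item(0);
  if ($body !== NULL) {
    foreach ($body->childNodes as $child) {
      $html .= $document->saveHTML($child);
    }
  }
  return $html;
}

// Each step receives the return value of the previous one, as in a process
// pipeline.
$steps = ['example_dom_import', 'example_add_nofollow', 'example_dom_export'];
$value = '<p><a href="/about">About us</a></p>';
foreach ($steps as $step) {
  $value = $step($value);
}
print $value . "\n";
```

The design point is that parsing and serializing happen once each, no matter how many steps sit in the middle.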

The dom_str_replace plugin

Suppose, as part of your site upgrade, you decide to change the subdomain.
For example, you might decide to change documentation.example.com to
help.example.com.
If you have any links in your text fields, then you need to update them.
You can do this with the dom_str_replace plugin:

-
  plugin: dom_str_replace
  mode: attribute
  # This key was called 'expression' in migrate_plus 8.x-4.2.
  xpath: '//a'
  attribute_options:
    name: href
  search: 'documentation.example.com'
  replace: 'help.example.com'

Warning:
The xpath key was called expression in version 8.x-4.2 of the Migrate
Plus module. Use xpath starting with the recently released version 8.x-5.0-rc1.

Like the str_replace plugin that is already part of the Migrate Plus module,
this plugin supports either basic string replacement, using the PHP
str_replace() or str_ireplace() function, or regular expressions, using
preg_replace().
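To see what this kind of configuration amounts to, here is a rough plain-PHP equivalent of the basic string-replacement case. It is a sketch of the idea, not the plugin's code, and the sample markup is made up.

```php
<?php
// Hypothetical sample: the old host name appears in the link and in the text.
$html_string = '<p>See <a href="https://documentation.example.com/install">the install guide</a> at documentation.example.com.</p>';

$document = new \DOMDocument();
$document->loadHTML($html_string);
$xpath = new \DOMXPath($document);

// Replace the host name in the href attribute of every link.
foreach ($xpath->query('//a') as $html_node) {
  $old = $html_node->getAttribute('href');
  $new = str_replace('documentation.example.com', 'help.example.com', $old);
  $html_node->setAttribute('href', $new);
}

$link_href = $xpath->query('//a')->item(0)->getAttribute('href');
$paragraph_text = $xpath->query('//p')->item(0)->textContent;
print $link_href . "\n";
print $paragraph_text . "\n";
```

Only the attribute changes. The same host name in the visible text is left alone, which is hard to guarantee with a search and replace on the raw string.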

The dom_apply_styles plugin

If you are using the Migrate API to import data from an external source, then
you want the imported data to have formatting consistent with the rest of your
site.
Perhaps you have configured Drupal’s Editor module to add certain CSS classes
from the Styles menu of the WYSIWYG editor, but you cannot add those classes
to the external source.

This plugin lets you search for an XPath expression and replace the
corresponding HTML elements with whatever is configured in the Editor module.
For example,

-
  plugin: dom_apply_styles
  format: full_html
  rules:
    -
      xpath: '//b'
      style: Bold

This will replace each <b>...</b> element with whatever element and CSS
classes are configured for the style labeled “Bold” in the Full HTML text
format.
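The underlying DOM operation, replacing one element with another while keeping its contents, can be sketched in plain PHP. The target used here, a strong element with the class "emphasis", is an assumption standing in for whatever the Editor module configuration defines; the real plugin reads that from configuration.

```php
<?php
// Hypothetical sample markup.
$html_string = '<p>This is <b>very <i>important</i></b>.</p>';

$document = new \DOMDocument();
$document->loadHTML($html_string);
$xpath = new \DOMXPath($document);

// Assumption for this sketch: the "Bold" style means <strong class="emphasis">.
$new_tag = 'strong';
$new_class = 'emphasis';

// Copy the query result to an array before changing the tree.
foreach (iterator_to_array($xpath->query('//b')) as $old_node) {
  $new_node = $document->createElement($new_tag);
  $new_node->setAttribute('class', $new_class);
  // Move the children across, so nested markup is preserved.
  while ($old_node->firstChild !== NULL) {
    $new_node->appendChild($old_node->firstChild);
  }
  $old_node->parentNode->replaceChild($new_node, $old_node);
}

$result = $document->saveHTML($xpath->query('//p')->item(0));
print $result . "\n";
```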

The dom_migration_lookup plugin

If you are migrating from a Drupal 7 site, then perhaps node/123 on the old
site becomes node/456 on the new site.
If you have entity-reference fields, then you can update references like these
using the migration_lookup plugin from the core Migrate module.

If those references are in links in a text field, then you can now use the
dom_migration_lookup plugin:

-
  plugin: dom_migration_lookup
  mode: attribute
  xpath: '//a'
  attribute_options:
    name: href
  search: '@/node/(\d+)@'
  replace: '/node/[mapped-id]'
  migrations:
    - article
    - page

If either the article or page migration has mapped 123 to 456, then
this will replace /node/123 in any href attributes with /node/456.

Like the core migration_lookup plugin, this one violates the strict ETL
paradigm, since a process plugin (i.e., code in the Transform stage) has to
“peek” at the destination database.
Ditto for the dom_apply_styles plugin, which reads configuration from the
destination database.
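The search-and-replace part of this idea can be sketched in plain PHP, with a hard-coded array standing in for the lookup. The $id_map array and the sample markup are hypothetical, and leaving unmapped links unchanged is a choice made for this sketch, not a statement about how the plugin behaves.

```php
<?php
// Hypothetical ID map standing in for the migration lookup:
// old node ID => new node ID.
$id_map = [123 => 456];

// Hypothetical sample markup with one mapped and one unmapped link.
$html_string = '<p><a href="/node/123">Mapped</a> and <a href="/node/999">unmapped</a>.</p>';

$document = new \DOMDocument();
$document->loadHTML($html_string);
$xpath = new \DOMXPath($document);

foreach ($xpath->query('//a') as $html_node) {
  $href = $html_node->getAttribute('href');
  $href = preg_replace_callback('@/node/(\d+)@', function (array $matches) use ($id_map) {
    $old_id = (int) $matches[1];
    // Sketch-only choice: leave the link alone when no mapping is known.
    return isset($id_map[$old_id]) ? '/node/' . $id_map[$old_id] : $matches[0];
  }, $href);
  $html_node->setAttribute('href', $href);
}

$hrefs = [];
foreach ($xpath->query('//a') as $html_node) {
  $hrefs[] = $html_node->getAttribute('href');
}
print_r($hrefs);
```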

References

Migrate API documentation on drupal.org
Migrate Plus module home page
Release notes for migrate_plus 8.x-5.0-rc1
amusing answer on StackOverflow
Parsing Html The Cthulhu Way
Change record describing the new DOMDocument-based plugins
XPath documentation on MDN
