What Big Data Means To Us
By: Brian Boyko – April 4, 2013
You ever get the feeling that people are using the same words and talking about completely different things?
For example, when I talk to my dad about Cloud Computing, he often asks “But what happens if it rains?”
Defining Big Data
One of the more confusing buzzwords, according to this report by Omar Gallaga at the Austin American-Statesman, is “Big Data.” With good reason: How big does the data have to be to be Big Data? Can you use Big Data tools on smaller data that’s still pretty big? How does Big Data relate to data mining? Is Big Data all video and stuff? Is “Big” Data data that’s influential enough to lobby Congress?
The problem with “Big Data” is not that the term itself is overused, but that it is very, very broad. As a general rule, Big Data consists of datasets that are just too big to be analyzed, monitored, manipulated, or accessed by standard tools. Big Data, therefore, requires specialized tools and methods to make the most of it – tools that are only now being developed. So, really, it should be called “Too Big Data.” But too big for what? Well – it all depends on context. When we put on our “Drupal Developer” hats, we tend to think of Big Data as data beyond what Drupal is designed to handle effectively on its own. And while Drupal is a very extensible framework, solving “Big Data” problems takes quite a bit of know-how.
Standard solutions and caching
There’s the brute-force method of throwing more memory, bandwidth, and processing power at the database server itself, but we typically don’t find this method to be cost-effective or future-friendly. A step up from brute force is caching. And by caching, we mean caching everything – from using Varnish to cache whole pages before serving them up to the visitor, to the various caching layers further down the stack. Good for some things, but not really an elegant solution.
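To make the reverse-proxy idea concrete, here’s a minimal sketch – assuming a hypothetical Node handler in front of the site, with an illustrative port and TTL – of how an application marks anonymous pages as cacheable so a proxy like Varnish can serve repeat visits without touching Drupal or the database:

```typescript
// Minimal sketch, assuming a Node handler behind a Varnish-style proxy.
// The port, TTL, and markup below are illustrative, not a real configuration.
import * as http from "http";

const server = http.createServer((req, res) => {
  // Requests without a session cookie are anonymous and safe to cache.
  const isAnonymous = !req.headers.cookie;

  if (isAnonymous) {
    // s-maxage tells shared caches (Varnish, CDNs) to keep this page for
    // five minutes, so repeat visitors never hit the application at all.
    res.setHeader("Cache-Control", "public, s-maxage=300");
  } else {
    // Logged-in pages are personalized; don't let the proxy reuse them.
    res.setHeader("Cache-Control", "private, no-cache");
  }

  res.end("<html><body>Rendered page</body></html>");
});

server.listen(8080);
```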
Drupal’s Big Data problems: bootstrapping and relational databases
In order to solve the problem of Big Data on Drupal, we have to look at the specific issues that bottleneck data on Drupal. First is bootstrapping. As Drupal initializes on every pageload, it asks every module whether it has something to contribute to each part of the page-building process. In many if not most cases, this is a redundancy of effort, and the more modules it has to ask, the slower the bootstrap. This is not bad design – it’s one of the reasons Drupal is so extensible. But it does compound the problem.

Under the hood, Drupal stores everything in a normalized, relational database (usually MySQL). There’s no duplication of data – everything is stored in different tables, and tables refer to each other in order to access the correct data. As soon as you have to query data, you’re pulling from multiple tables using the SQL keyword JOIN. While this ability to join tables makes MySQL powerful, JOIN commands are resource-expensive, can be extremely slow, reduce the maintainability of code, and increase complexity.
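For a sense of why those JOINs add up, here’s a hypothetical sketch of the kind of multi-table query a single content listing can force – the table and column names are illustrative, not Drupal’s actual schema:

```typescript
// Hypothetical sketch: the sort of query a normalized schema forces for one
// article listing. Table and column names are illustrative, not Drupal's schema.
const articleListingQuery = `
  SELECT n.title, u.name AS author, f.uri AS image
  FROM node AS n
  JOIN users AS u       ON u.uid = n.author_id   -- who wrote it
  JOIN node_image AS ni ON ni.node_id = n.id     -- which image row belongs to it
  JOIN files AS f       ON f.fid = ni.file_id    -- where that image file lives
  WHERE n.status = 1
  ORDER BY n.created DESC
  LIMIT 10;
`;

// Each JOIN is extra work the database has to coordinate for every row it
// returns; on large tables, that coordination is where the time goes.
```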
NoSQL and “front end-gineering” to the rescue
We use a combination of technologies, including (but not limited to) MongoDB, Node, and Ember to get past some of the specific problems with Big Data by bypassing bootstrapping whenever possible, and by using more efficient data-access methods. MongoDB is a NoSQL database; NoSQL databases are designed for better horizontal scaling and higher availability, and are optimized for retrieval and append operations. The data is de-normalized – meaning you’ll have duplicate data – so, if you have the available disk space, the performance tradeoff makes it worth it.
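As a sketch of what de-normalization looks like in practice (the field names, database name, and local connection string are assumptions for illustration), the article, its author, and its image can all live in one MongoDB document, so a single read replaces the JOINs above:

```typescript
// Hypothetical sketch: the same article stored as one denormalized MongoDB
// document. Field names are illustrative; the duplicated author/image data
// trades disk space for a single, join-free read.
import { MongoClient } from "mongodb";

async function main() {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  const articles = client.db("site").collection("articles");

  // Author and image details are copied straight into the document itself.
  await articles.insertOne({
    title: "What Big Data Means To Us",
    author: { uid: 42, name: "Brian Boyko" },
    image: { uri: "public://big-data.png" },
    status: 1,
    created: new Date(),
  });

  // One lookup, no JOINs: everything the page needs comes back in one document.
  const latest = await articles
    .find({ status: 1 })
    .sort({ created: -1 })
    .limit(10)
    .toArray();
  console.log(latest);

  await client.close();
}

main().catch(console.error);
```

The duplicated author data is the disk-space cost mentioned above; the single find() is the payoff.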
Sometimes your data is relational and it makes more sense to store it in a relational database – MySQL is still there in the *AMP stack if we need it. On the front end, Node is a JavaScript platform running on the server, and Ember is a JavaScript framework running in the browser. Whenever possible, we have Ember talk to Node rather than to Drupal. This bypasses most of the Drupal bootstrap process: aside from the initial pageload, future requests skip the bootstrap entirely.
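Here’s a minimal sketch of that pattern – the endpoint path, port, and collection name are assumptions, not our production setup. A thin Node service reads from MongoDB and returns JSON, and after the first Drupal-rendered pageload the Ember app fetches from it directly, so repeat requests never run the bootstrap:

```typescript
// Hypothetical sketch: a thin Node endpoint the Ember app calls directly,
// so repeat requests never touch Drupal's bootstrap. The path, port, and
// collection name are assumptions for illustration.
import * as http from "http";
import { MongoClient } from "mongodb";

async function start() {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  const articles = client.db("site").collection("articles");

  http
    .createServer(async (req, res) => {
      if (req.url === "/api/articles") {
        // Serve JSON straight from MongoDB: no module hooks, no bootstrap.
        const docs = await articles.find({ status: 1 }).limit(10).toArray();
        res.writeHead(200, { "Content-Type": "application/json" });
        res.end(JSON.stringify(docs));
      } else {
        res.writeHead(404);
        res.end();
      }
    })
    .listen(3000);
}

start().catch(console.error);
```

The Ember app would simply request /api/articles for fresh data instead of asking Drupal to rebuild a page.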
Why implement Drupal if we’re just going to bypass it?
You may ask: if the goal in all this is to bypass Drupal, well… why use Drupal at all? By all means, if you don’t need Drupal – don’t use it! But there are some things, like content management and user permissions, which Drupal does exceptionally well. Getting back to our main point, Big Data for Drupal really just means increasing Drupal’s capacity to scale. As specialized solutions become less necessary, we’ll have to redefine how big Big Data is. Or we could just say any dataset that requires a Mack truck full of hard drives to move is Big Data.
You know what, it’s simple, it’s direct… I like that one better. Let’s go with that.
Find out how we use assistive technologies to help Drupal manage large data sets smoothly...
Category: Drupal, Drupal Planet