Sunday, 16 November 2014

MongoDB

Mongolicious


MongoDB training, provided by MongoDB themselves, was up for grabs recently, so I put my name in the hat to get a keener insight into MongoDB and NoSQL databases, having used them on only a few occasions. The training was attended by a near 50/50 split of devs and DBAs, and this led to some interesting debates and reactions.

Like lots of the stuff I've been introduced to or have started looking at recently (I am late to nearly every technology party there is), Mongo and NoSQL have been around for a while and have become established as a viable alternative persistence solution to the likes of MSSQL.

The main features:
  • It's schema-less
  • Document oriented/No support for JOINs 
  • Querying performed using JS/JSON
  • Indexing
  • Replication/Redundancy/High availability
  • Scaling out via Sharding 
  • Aggregation (unfortunately we ran out of time so this is not covered here)
  • Authorisation (again ran out of time on this one) 

It's schema-less


You'll no doubt be aware that relational databases use a schema to determine the shape and type of the data that is going to be stored in them. MongoDB, quite simply, doesn't. You take an object from some application (most likely) and it is serialised (typically) into JSON, then into BSON, and then stored on disk.

MongoDB does not care about the type or shape of the data: a field stored with one data type in one document may have a different type in the next (I'm not sure how this would impact indexing). The look of sheer horror and the gasps from the DBAs were priceless (more on this later).
In practice, data type changes like this are probably as rare in the database as they are in application code. How often (apart from during development) do you change a fundamental data type (that is to say, not just its precision or max size) in production code or databases?
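
To make that concrete, here is a small sketch in plain JavaScript. No MongoDB server is involved; the customers array and the postcode field are made up for illustration. It simply shows three documents destined for the same collection where one field changes type or is missing, which a schema-less store would accept as-is.

```javascript
// Hypothetical documents headed for the same "customers" collection.
// Nothing here talks to MongoDB; the array stands in for a collection.
const customers = [
  { customer_name: "bob", postcode: "LS1 4AP" },  // postcode stored as a string
  { customer_name: "alice", postcode: 90210 },     // same field, stored as a number
  { customer_name: "carol" }                       // field missing altogether
];

// Report the JavaScript type each document uses for a given field.
function fieldTypes(docs, field) {
  return docs.map(doc => typeof doc[field]);
}

const postcodeTypes = fieldTypes(customers, "postcode");
console.log(postcodeTypes); // [ 'string', 'number', 'undefined' ]
```

A relational table would reject the second and third rows (or force a NULL); here it is up to the application to cope with whatever shapes it has written.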

Document oriented


As there is no schema,  things we think of as objects in applications (e.g. a Customer object)  can be serialised into JSON and then as a MongoDB document added to a collection of customers.  A collection is roughly analogous to a table and therefore a collection can have a number of documents like a table can have a number of rows. 

Documents can be entire object graphs with other embedded documents (e.g. a customer with a collection of addresses).  Access to documents can be expressed through dot notation when querying and projecting.  
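
As a rough illustration of dot notation, the sketch below resolves dotted paths against a nested document in plain JavaScript. The customer document and the getPath helper are hypothetical, and this is not how MongoDB implements it; in the shell the equivalent query would be along the lines of db.customers.find({"contact.email" : "bob@example.com"}).

```javascript
// A hypothetical customer document with embedded address documents.
const customer = {
  customer_name: "bob",
  contact: { email: "bob@example.com" },
  addresses: [
    { city: "Leeds", type: "home" },
    { city: "York", type: "work" }
  ]
};

// Resolve a dotted path such as "contact.email" or "addresses.0.city".
// A simplified stand-in for how dot notation reaches into a document;
// it is not MongoDB's own implementation.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

const email = getPath(customer, "contact.email");
const firstCity = getPath(customer, "addresses.0.city");
console.log(email, firstCity); // bob@example.com Leeds
```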

With this arrangement, the notion of relationships expressed through queries over many sets of tables (as in SQL) disappears; relationships are instead expressed through the parent-child relationships you see in a typical object graph. Joins are killed off at the DB layer and would generally then be performed in the application (if at all). However, a document can reference documents elsewhere in the database by storing their _id values, much like a foreign key. There is a cost to this though: MongoDB does not enforce referential integrity for you, and following a reference means an extra query, which comes at some cost to query performance.

With document orientation, though, it seems like a good opportunity to embrace small transactional boundaries, perhaps closely aligned to aggregate roots/entities as in DDD. Each single-document write is atomic, so getting the aggregate root right would, I'm sure, improve concurrency and reduce the size of working sets.

Querying performed using JS/JSON


One of the particularly great features of MongoDB is the simplicity of its querying. 
For example, finding all of the documents for a customer whose name is bob, within the customers collection of a reports database (the db context is reports), is a call along the lines of db.customers.find({customer_name : "bob"}).



The find method is plain JavaScript and its first argument is a JSON-style document; straightforward. The predicate is expressed as a key and value pair: the key names a field, and documents whose value for that field matches the specified value are returned. Just like in SQL, more complex expressions can be built up using built-in query operators such as $lt, $gt, $or and $in.
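
To get a feel for how these query documents read, here is a minimal in-memory matcher in plain JavaScript that runs under Node without a server. The matches and find functions and the sample data are hypothetical stand-ins covering only equality, $gt, $lt, $in and top-level $or; MongoDB's real query engine does far more. The comments show the shell call each query corresponds to.

```javascript
// Minimal in-memory matcher covering equality, $gt, $lt, $in and $or.
// A hypothetical stand-in for illustration, not how MongoDB evaluates queries.
function matchesCondition(value, condition) {
  // A plain value means "field equals this value".
  if (condition === null || typeof condition !== "object") {
    return value === condition;
  }
  // Otherwise every operator in the condition object must hold.
  return Object.entries(condition).every(([op, operand]) => {
    switch (op) {
      case "$gt": return value > operand;
      case "$lt": return value < operand;
      case "$in": return operand.includes(value);
      default: throw new Error("Unsupported operator in this sketch: " + op);
    }
  });
}

function matches(doc, query) {
  return Object.entries(query).every(([key, condition]) => {
    if (key === "$or") {
      // $or takes an array of sub-queries; at least one must match.
      return condition.some(subQuery => matches(doc, subQuery));
    }
    return matchesCondition(doc[key], condition);
  });
}

// Made-up sample data standing in for the customers collection.
const customers = [
  { customer_name: "bob", age: 42, city: "Leeds" },
  { customer_name: "alice", age: 29, city: "York" },
  { customer_name: "carol", age: 67, city: "Hull" }
];

const find = query => customers.filter(doc => matches(doc, query));

// Shell equivalent: db.customers.find({ customer_name: "bob" })
const bobs = find({ customer_name: "bob" });

// Shell equivalent: db.customers.find({ age: { $gt: 30, $lt: 65 } })
const midAged = find({ age: { $gt: 30, $lt: 65 } });

// Shell equivalent:
// db.customers.find({ $or: [ { city: { $in: ["York", "Hull"] } }, { age: { $lt: 30 } } ] })
const either = find({
  $or: [{ city: { $in: ["York", "Hull"] } }, { age: { $lt: 30 } }]
});

console.log(bobs.length, midAged.length, either.length); // 1 1 2
```

Several operators on the same field are ANDed together, which is why the age range needs no explicit "and".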


Taking it a little further, we can choose which parts of the document we would like to view by passing a second argument, e.g. db.customers.find({customer_name : "bob"}, {customer_name : 1}).


The second argument to the find method, {customer_name : 1}, is a projection that says which fields of each matching document to return, analogous to the column list of a select statement in SQL. Here customer_name will be returned (along with _id, which is included unless explicitly excluded). We could invert the statement by passing 0 for customer_name, and all of the fields EXCEPT customer_name would be returned.
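
The include/exclude behaviour can be sketched in a few lines of plain JavaScript. The project function below is a hypothetical, simplified imitation that only handles top-level fields; it assumes the default that _id comes back unless it is explicitly set to 0.

```javascript
// Simplified top-level projection: 1 includes a field, 0 excludes it.
// Mirrors the default that _id comes back unless explicitly excluded.
// A sketch only; nested fields and mixed include/exclude rules are ignored.
function project(doc, projection) {
  const fields = Object.keys(projection).filter(f => f !== "_id");
  const including = fields.some(f => projection[f] === 1);
  const result = {};
  for (const [key, value] of Object.entries(doc)) {
    if (key === "_id") {
      if (projection._id !== 0) result._id = value;
    } else if (including ? projection[key] === 1 : projection[key] !== 0) {
      result[key] = value;
    }
  }
  return result;
}

// Made-up sample document.
const doc = { _id: 1, customer_name: "bob", age: 42, city: "Leeds" };

// Like find({...}, { customer_name: 1 })
const onlyName = project(doc, { customer_name: 1 });
// Like find({...}, { customer_name: 0 })
const allButName = project(doc, { customer_name: 0 });

console.log(onlyName);   // { _id: 1, customer_name: 'bob' }
console.log(allButName); // { _id: 1, age: 42, city: 'Leeds' }
```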

As queries get more complicated there can be a tendency to unbalance the brackets in the expression, so be mindful of this. A good tip was to add in all of the sub-expressions first and expand them out. Alternatively, use a text editor which has highlighting for brackets/curly braces etc.

Indexing

Indexing is supported via single-field or compound indexes on document fields. They can be configured differently on primary and secondary nodes depending on needs, with a bit of jiggery-pokery (taking nodes offline and configuring them differently).
I can't speak for the absolute speed and performance of the indexes, but queries are much quicker with them than without.
Under the hood they are implemented using B-Trees; whether this is optimal is a question for some performance analysis and comparisons to answer.

Replication/Redundancy/High availability


Apart from development sandboxes running in standalone mode, the recommendation is to run MongoDB databases in a replica set, which should contain at least 3 nodes to be effective (a primary and 2 secondaries) and always an odd number for larger replica sets (explained in a moment).

By default, all writes and reads are performed at the primary, and the instructions and data are then replicated to the secondaries.

Primary node failure and subsequent failover is handled by an election process that determines which secondary should become the new primary. A candidate needs the votes of a majority of the set's voting members to win, and this is where the odd number comes in: adding a fourth node to a three-node set raises the majority from 2 to 3 but still only lets you lose one node, and a network split down the middle of an even-sized set can leave neither half with a majority, in which case there would be no primary and all of the nodes would be read-only. As far as I know MongoDB does not enforce the odd number rule (we didn't try it out in the examples); the usual advice for an even-sized set is to add an arbiter, a lightweight voting member that holds no data.

In the 3 node replica set arrangement, if the primary node were to fail, an election would take place between the two secondaries, and whichever node obtained a majority of the votes first would become primary. A candidate votes for itself, so it only needs the other secondary's vote to reach 2 out of 3. As long as there are no priorities configured and both secondaries are equally up to date, there is roughly a 50/50 chance of either one becoming the new primary.
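
The arithmetic behind the odd number recommendation is easy to sketch. Assuming an election needs a strict majority of the configured voting members (the function names below are mine, not MongoDB's), going from 3 to 4 members raises the majority without letting you lose any more nodes:

```javascript
// Votes needed to win: a strict majority of all voting members,
// counted against the configured set size, not just the nodes still up.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// How many members can be lost while a primary can still be elected.
function faultTolerance(votingMembers) {
  return votingMembers - majority(votingMembers);
}

for (const n of [2, 3, 4, 5]) {
  console.log(`${n} members: majority ${majority(n)}, can lose ${faultTolerance(n)}`);
}
// 2 members: majority 2, can lose 0
// 3 members: majority 2, can lose 1
// 4 members: majority 3, can lose 1
// 5 members: majority 3, can lose 2
```

So an even-sized set pays for an extra node without gaining any extra tolerance to failure.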

Using secondaries purely as standbys for the primary makes this arrangement more akin to Mirroring in MSSQL, where exact copies of the primary's data are farmed out to slaves (in essence) to ensure redundancy. So replication here is not necessarily the same thing as replication in MSSQL.

However, a slight tweak to the configuration (the read preference) means that not all reads need go through the primary, and specific secondaries can be read from directly, bearing in mind that a secondary may lag slightly behind the primary. This provides an element of scaling out, where writes and reads are performed on different servers. Indeed, indexing can be tuned separately on any of the nodes to help with optimal query execution.

The configuration of replica sets is an absolute doddle and this was the general consensus amongst both devs and DBAs.
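
For a flavour of how little there is to it, a three-member configuration document might look something like the sketch below. The set name rs0 and the host names are placeholders I have made up, and each mongod would need to be started with a matching --replSet option. The object is plain JavaScript so it runs anywhere; the shell calls that would consume it are shown in comments.

```javascript
// Hypothetical three-member replica set configuration document.
// The set name and host names are placeholders.
const replicaSetConfig = {
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1.example.net:27017" },
    { _id: 1, host: "mongo2.example.net:27017" },
    { _id: 2, host: "mongo3.example.net:27017" }
  ]
};

// In the mongo shell, connected to one of the members, this would be passed to:
//   rs.initiate(replicaSetConfig)
// and the state of the set inspected afterwards with:
//   rs.status()
console.log(replicaSetConfig.members.map(m => m.host));
```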


Scaling out via Sharding


This got the trainer hot under the collar and is one of the more prominent features, particularly for the larger software projects you may find in an enterprise. Sharding only requires a  tad more work  than is required for adding a replica set to get going.    

Analogous to partitioning in MSSQL, sharding is, in essence, a way of making use of a distributed index.   It also helps distribute storage and retrieval across MongoDB instances, thereby performing a kind of load balancing. 

Sharding occurs at a collection level and an appropriate shard key is used to shard upon. So, for example, if we were to take a customer collection we could shard it on the email address of customers.

Consequently, documents with email addresses starting with A through R and S through Z could end up in 2 different shards, with all reads and writes for email addresses falling within each range being routed to the appropriate shard (a mongos server and configuration servers are required for this).
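
The routing idea can be sketched with a lookup table of key ranges. Everything below is hypothetical (the chunk boundaries, the shard names and the routeByEmail function), and a real mongos works from metadata held on the config servers rather than a hard-coded array; the point is only that each range maps to exactly one shard.

```javascript
// Hypothetical chunk ranges for a customers collection sharded on email.
// Each range is half-open: min is inclusive, max is exclusive.
// "" and "\uffff" are crude stand-ins for "lowest" and "highest" possible keys.
const chunks = [
  { min: "",  max: "s",      shard: "shard0" },
  { min: "s", max: "\uffff", shard: "shard1" }
];

// Pick the shard whose range contains the shard key value.
// Lower-casing the key is a simplification made by this sketch.
function routeByEmail(email) {
  const key = email.toLowerCase();
  const chunk = chunks.find(c => key >= c.min && key < c.max);
  if (!chunk) throw new Error("No chunk covers key: " + key);
  return chunk.shard;
}

console.log(routeByEmail("bob@example.com"));   // shard0
console.log(routeByEmail("sally@example.com")); // shard1
console.log(routeByEmail("zed@example.com"));   // shard1
```

A query that includes the shard key can be sent to a single shard this way; one that does not has to be sent to every shard.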
  
Shards contain chunks of a pre-defined maximum size. If a chunk grows beyond this threshold it is split, and chunks are moved between shards in an attempt to maintain a close to equal split of data amongst them. This process occurs in the background; it will most likely impact performance, but it's not something the user has to worry about (that is to say, this would be a non-functional requirement). This allows auto-scaling to a degree, until you realise you need, or plan for, more shards as your data needs grow.


DBA Horror

I think the developers found using and configuring MongoDB a more edifying experience than the DBAs. I would hope the DBAs involved would agree I was being fair in saying they seemed to find the use of the shell, command prompt and JSON a little challenging. Higher level concepts that are similar in the MongoDB and SQL worlds were easily digested by attendees from both sides. The consensus though, generally, was that MongoDB was a low friction route into administering and working with a highly scalable and available data persistence mechanism for developers! Having only been a SQL hacker and not having been involved in enterprise persistence choices, I'm not best placed to say whether MongoDB would be a better solution than, or a replacement for, SQL, but it does seem bloody easy to use and administer.

To this end I would recommend attending the MongoDB course if it wings its way into your organisation, or taking the MongoDB intro courses on Pluralsight or MongoDB University, to help make an informed decision before defaulting to SQL.



 


5 comments:

  1. The worst blog I have ever seen!

  2. Well, Blade, you are entitled to your opinion.

  3. Nice post Colin. I've been using MongoDB extensively over the past year and have come to love it, especially when paired with the CQRS and Event Sourcing architecture design patterns. I particularly like the schema-less aspect of MongoDB and the reduced impedance mismatch offered by being able to serialise your objects directly into a collection. This allows you to use the MongoDB database for write operations, complemented by a SQL read model populated using Event Sourcing for querying and aggregating. It opens up so many options! No more are we trying to optimise a single data store for both write and read operations, winner! These technologies and patterns also make your life easier when implementing domain driven design, which is a double winner in my eyes.

  4. Chris Dung. This is Blade. I don't do schemas. I don't queue. I'm VIP.

  5. Cheers Chris. I agree about the reduction in object-impedance mismatch being a great thing and one thing I didn't mention is that things like NHibernate and Entity Framework attempt to get away from this by allowing code-first design but - in my experience - I have still had to mess around with persistence problems and gnarly SQL issues you were hoping to be abstracted away from. Good nod to CQRS as well, I'd spoken about this on the course and wondered whether we could leverage pointing to the primary for writes and nominated secondaries for reads, not sure how this would work in practice though. Also agree about DDD, it ties in nicely to modelling and persisting aggregates/entities. I mistakenly referred to this in the post as bounded contexts, I meant aggregates/entities (I've updated this)
