Monday, February 7, 2011

About notes on document-oriented NoSQL by Curt Monash

I feel that a couple of observations need to be added to the Notes on document-oriented NoSQL by Curt Monash.

MongoDB does not store JSON documents, but rather JSON-style documents - specifically BSON (http://bsonspec.org/). This has important performance benefits for mostly numeric flexible-schema stores (read: health and social statistics, finance). Effectively, the data does not need to be serialized into and out of a character stream on its way between application objects and the store.
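To make the point about numeric data concrete, here is a minimal sketch of how a flat document of numbers looks in BSON versus JSON text. It hand-encodes only two BSON element types (64-bit double and 32-bit integer) as laid out at bsonspec.org. The function names and the sample reading are my own illustration, not MongoDB's driver code.

```python
import json
import struct


def encode_bson_numbers(doc):
    """Encode a flat dict of str -> float or int32 as a BSON document."""
    body = b""
    for name, value in doc.items():
        key = name.encode("utf-8") + b"\x00"  # element name is a C string
        if isinstance(value, float):
            # Type 0x01: IEEE 754 double, 8 bytes, little-endian.
            body += b"\x01" + key + struct.pack("<d", value)
        elif isinstance(value, int) and not isinstance(value, bool):
            # Type 0x10: signed 32-bit integer, little-endian.
            body += b"\x10" + key + struct.pack("<i", value)
        else:
            raise TypeError("this sketch handles only float and int32 values")
    total = 4 + len(body) + 1  # length prefix + elements + trailing null byte
    return struct.pack("<i", total) + body + b"\x00"


def decode_bson_numbers(data):
    """Decode a document produced by encode_bson_numbers."""
    (total,) = struct.unpack_from("<i", data, 0)
    if total != len(data):
        raise ValueError("length prefix does not match the buffer size")
    pos, doc = 4, {}
    while data[pos] != 0:
        kind = data[pos]
        end = data.index(b"\x00", pos + 1)
        name = data[pos + 1:end].decode("utf-8")
        pos = end + 1
        if kind == 0x01:
            (doc[name],) = struct.unpack_from("<d", data, pos)
            pos += 8
        elif kind == 0x10:
            (doc[name],) = struct.unpack_from("<i", data, pos)
            pos += 4
        else:
            raise ValueError("unsupported element type in this sketch")
    return doc


# Hypothetical sample record, used only for illustration.
reading = {"systolic": 120, "glucose": 5.4}
as_bson = encode_bson_numbers(reading)
as_json = json.dumps(reading).encode("utf-8")
```

In the BSON form the numbers are copied as raw machine values, so reading them back is a fixed-width unpack. In the JSON form every number has to be printed to digits and parsed again.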

That also allows MongoDB to manage storage as a set of memory-mapped files, so that the DB server has little need to manage data persistence on disk itself and carries little of the associated overhead. A side effect of the efficiency of memory-mapped files is that objects are capped in size. I believe the current limit is 8MB, but do not quote me on that.
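For readers who have not used memory-mapped files, the idea can be sketched with Python's standard mmap module. This is only an illustration of the general technique under my own assumptions (a temporary file, fixed-width double records, a hypothetical function name); it says nothing about how MongoDB lays out its data files.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<d")  # one little-endian double per record


def write_and_read_via_mmap(values):
    """Store values through a memory map, then read them back from the file.

    Assumes a non-empty list: an empty file cannot be mapped.
    """
    size = RECORD.size * len(values)
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\x00" * size)  # give the file its final length
        with mmap.mmap(fd, 0) as view:  # length 0 maps the whole file
            for i, value in enumerate(values):
                # A plain memory write; the OS decides when pages hit disk.
                RECORD.pack_into(view, i * RECORD.size, value)
            view.flush()  # ask the OS to write dirty pages back now
        # Read through the ordinary file API to show the data reached the file.
        os.lseek(fd, 0, os.SEEK_SET)
        raw = os.read(fd, size)
        return [RECORD.unpack_from(raw, i * RECORD.size)[0]
                for i in range(len(values))]
    finally:
        os.close(fd)
        os.remove(path)
```

The application code never issues an explicit write of the records to the file. It stores into memory and leaves paging to the operating system, which is the division of labour described above.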

Many RDBMS implementations can have explicit (foreign keys) and implicit (joins) references between data items. That allows one to build an arbitrary, albeit complex, data graph and have it persisted in the data's meta-data, or at least somewhere between the application and the DB: for example, in queries, views, and stored procedures.
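A small sketch of both kinds of reference, using the sqlite3 module from the Python standard library. The patient and reading tables are hypothetical, and it assumes an SQLite build with foreign key support, which has to be switched on per connection.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one explicit reference."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript("""
        CREATE TABLE patient (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE reading (
            id         INTEGER PRIMARY KEY,
            patient_id INTEGER NOT NULL REFERENCES patient(id),
            value      REAL NOT NULL
        );
        INSERT INTO patient VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO reading VALUES (10, 1, 5.4), (11, 1, 6.1), (12, 2, 4.9);
    """)
    return conn


def readings_by_patient(conn):
    """Implicit reference: the edge exists only in the text of the query."""
    return conn.execute(
        "SELECT p.name, r.value FROM patient p "
        "JOIN reading r ON r.patient_id = p.id ORDER BY r.id"
    ).fetchall()


def insert_orphan_reading(conn):
    """Explicit reference: the engine itself rejects a dangling edge."""
    try:
        conn.execute("INSERT INTO reading VALUES (13, 99, 7.0)")
    except sqlite3.IntegrityError:
        return False
    return True
```

The foreign key lives in the schema, so the engine knows about that edge of the graph. The join lives in the query, which is the "somewhere between the application and the DB" case.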

BSON, like JSON, represents inherently acyclic data graphs - effectively directed trees. It has no built-in mechanism to keep a record of any relationships in the data except for containment at or below the object level. That seems to be consistent with MongoDB's philosophy of disclaiming any significant responsibility for meta-data. If schema management is not in the DB engine, then why should meta-schema be in the DB engine? That is a blessing and a curse, as one needs to use a proprietary format if one wants to persist the structural information in the data store.
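The tree-only nature of JSON is easy to see with Python's json module: a cyclic object graph cannot be serialized at all, so any edge that is not containment has to be stored as a plain value and resolved by the application. The peer_id field and the follow helper below are a made-up convention of exactly that kind, not something JSON, BSON, or MongoDB defines.

```python
import json


def try_serialize_cycle():
    """Return the error text from serializing a two-node cycle, or None."""
    a = {"name": "a"}
    b = {"name": "b", "peer": a}
    a["peer"] = b  # a -> b -> a is a cycle, not a tree
    try:
        json.dumps(a)
    except ValueError as error:
        return str(error)
    return None


# The same relationship flattened into an application-level convention:
# each document carries the key of its peer as an ordinary string.
documents = {
    "a": {"_id": "a", "peer_id": "b"},
    "b": {"_id": "b", "peer_id": "a"},
}


def follow(store, doc, field):
    """Resolve a reference by looking the stored key up in the store."""
    return store[doc[field]]
```

Nothing in the stored data says that peer_id is a reference. That knowledge lives only in follow, which is to say in the application.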

In XML-based stores one has the option of using the family of XML-related standards to record and query the edges of a data graph. One can check an XML document's validity against a schema. I doubt MarkLogic does it natively today; however, it is quite conceivable to have XPath references from inside one document into the innards of another and to have the relationship followed as part of a query. This is a bit more than the "true document" notion, as it is a cross-document relationship.
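What such a cross-document reference might look like can be sketched with xml.etree.ElementTree, which implements only a limited subset of XPath. The doc and path attributes, the file names, and the resolve helper are all hypothetical conventions invented for this example; this is not MarkLogic's API or any XML standard for linking.

```python
import xml.etree.ElementTree as ET

# A toy "store": document name -> parsed XML tree.
LIBRARY = {
    "people.xml": ET.fromstring(
        "<people>"
        "<person id='p1'><name>Ann</name></person>"
        "<person id='p2'><name>Bob</name></person>"
        "</people>"
    ),
    "orders.xml": ET.fromstring(
        "<orders>"
        "<order id='o1'>"
        "<buyer doc='people.xml' path=\".//person[@id='p2']/name\"/>"
        "</order>"
        "</orders>"
    ),
}


def resolve(library, ref):
    """Follow a reference element into the document it names."""
    target = library[ref.get("doc")]
    return target.findall(ref.get("path"))


def buyer_names(library, order_id):
    """Query one document and follow its references into another."""
    orders = library["orders.xml"]
    refs = orders.findall(".//order[@id='%s']/buyer" % order_id)
    return [node.text for ref in refs for node in resolve(library, ref)]
```

Here the edge of the graph is itself data inside the document, written in a standard query language, so a query engine could in principle follow it without application code.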

Same thing - it is a blessing and a curse, as it brings to the table a character serialization penalty and an easy way to make a very convoluted data design. And I do not even want to go into the performance issues of random graph traversal.