DocumentDB revisited Part 1 – The theory

Last time I spent with DocumentDB was, although pleasant, a PaaS service in preview and did not contain all the features it does today. Therefor I decided to revisit the component again to update my experience.

I will discuss this topic it in several posts. One theory based one (this one) and one hands-on example and how to get started with Document DB and several advanced topics later on.

The theory (and classical solution) – Relational Databases

Relational SQL databases have been around for ages now and for all things that survive this long can be seen as a symbol for a life worthy component. Relational SQL databases are simple to use, cross- platform Query language, efficient search, transactional etc. It is also an extremely good investment for professionals to study because you will make solutions with SQL databases during your career (for most IT professional more often than not.

But is it always the most efficient and logical way to structure your data this way? Consider the classical order scenario and you’ll quickly end up with a lot of relational tables. The more you normalize the more tables you will get for just managing your order.

Order Scenario Relational Databases

The relational database also is restricted by structure by its columns and tables. Before I continue any further I would like to say that DocumentDB will not remove all your needs for relational databases in the future but it may in some cases be a good choise. In some cases one could consider a combination and why not in this scenario. We keep orders in progress in DocumentDB and put the archived orders in a relational database that can be used for data warehousing with KPIs (with Azure Datawarehouse perhaps).

The theory – “No SQL” databases

But our objective is to take an order from A->Z (order to delivery) – we are not interested in the CustomerInvoice table and how the parts of the order is stored. We want to consider the order as one object instead (or maybe just test the word Document). The reason why I sometimes avoid using the word Document is because it is often confused with word and excel documents so it can confuse newcomers to the technology. DocumentDB is not related to word documents as this exercise will show later.

The same object in DocumentDB can be handled as a single instance “object”. Described in DocumentDB (a non-SQL database in Azure) the order is consisted by JSON properties/objects but all of the order is represented and stored as a single unit.

{
“id”: “webShop635874294383904028”,
“transactionDateUTC”: “2016-01-03T14:50:38.3904028Z”,
“lines”: [
{
“articleNo”: “12345678”,
“articleName”: “Bookcase White Rodney 233×100”,
“quantity”: 1,
“itemPrice”: 44.45
}
],
“invoiceAddress”: {
“addressType”: “DELIVERY”,
“addressLine1”: “The big City Street 2233”,
“addressLine2”: “Dept Invoice”,
“city”: “Malmö”,
“zipCode”: 11111
},
“customer”: {
“firstName”: “Peter”,
“lastName”: “Mannerhult”,
“email”: “peter@example.com”,
“mobilePhone”: “+4642000000”
}
}

One order may not have all the same properties as all others. This is also one of the major differences to the relational database – there is no schema restriction in the objects in a non-sql database.

When to use DocumentDB instead of relational database (from MS documentation)

I have added comments after each statement to show why my Order example complies with the scenario)

  • There are contains relationships between entities. (Order contains OrderLines)
  • There are one-to-few relationships between entities. (Limited number of relations)
  • There is embedded data that changes infrequently. (The orderlines and delivery address is unlikely to change often)
  • There is embedded data won’t grow without bound. (there will not be a continuous growth of orderlines, it is likely that there will be a limited number of order lines)
  • There is embedded data that is integral to data in a document (all the data is related to the order).

 

But what about Table Storage?

Compared to the hyper-scalable Table storage that was introduced when  Azure was in first release, which is also a noSQL alternative, there are some key differences.

When it comes to similarities I would first mention that both can store non-schema enforced data, i.e. in many cases you could store the same information. You can also make transactional work in both with different limitations depending on the service chosen.

The key differences are:

  • Table Storage is a noSQL key/value storage while DocumentDB is more of a noSQL database.
  • You can save an amazing 100 TB in Table Storage while Document only can have 10 GB per collection, mitigated by the fact that you can have many collections in the same database.
  • Table Storage is based on Azure Storage while DocumentDB is SSD disk based storage.
  • It is much cheaper with Table Storage
  • Table Storage is extremely fast if searching for a certain “record” (using partitionkey and rowid) but poor att range searches as it does not build complete indexes. DocumentDB on the other hand is indexing all values and allows searches on all columns (with some performance hits of course).
  • DocumentDB is your choise for hierarchical data saved as one entity. In table storage you should save the orderlines and orderhead in separate records in different (or the same) table. I assume you could serialize a more hierarchical object to a string property in table storage but it would effectively disable any efficient queries on the data.

 

Azure DocumentDB for developers

DocumentDB is a super fast, low latency, scalable, SSD, automatic replicated service to store JSON entities in the database. It is easy to code against and its REST interface enables it to be used from a variety of platforms. There are SDKs for JavaScript, Java, Node.js, Python, and .NET. It is a PaaS service so you don’t have to deal with any low-level disk management.

Azure DocumentDB supports querying documents using a SQL language, which is rooted in the JavaScript type system, and expressions with support for rich hierarchical queries. The DocumentDB query language is a simple yet powerful interface to query JSON documents. The language supports a subset of ANSI SQL grammar and adds deep integration of JavaScript object, arrays, object construction, and function invocation. DocumentDB provides its query model without any explicit schema or indexing hints from the developer.

User Defined Functions (UDFs) can be registered with DocumentDB and referenced as part of a SQL query, thereby extending the grammar to support custom application logic. These UDFs are written as JavaScript programs and executed within the database.

For .NET developers, DocumentDB also offers a LINQ query provider as part of the .NET SDK.

Storage

Azure storage is based on an account, which in its turn contains databases, which contains collections(which is also the transactional border) and each collections contains Documents. Apart from that you can add attachments to the document.

Import Data

You have many options for importing data from other data sources into DocumentDB for migrations such as JSON content, SQL databases, CSV Files, MongoDB files at https://azure.microsoft.com/en-gb/documentation/articles/documentdb-import-data/

 

In the next part of this I will set up a DocumentDB from the ground up and create a simple program. Stay tuned…

One thought on “DocumentDB revisited Part 1 – The theory

Leave a comment