Azure DocumentDB in production!

We recently had the opportunity to try a new database, released in April 2015. With DocumentDB, Microsoft has added its first NoSQL document database to its cloud services.

What is DocumentDB?

In brief, Microsoft describes DocumentDB as a “Database as a Service” that stores its data as JSON and supports flexible queries (for example, with SQL). Because it is a managed database service, there is almost no infrastructure to manage and, according to Microsoft, it scales without practical limit.

How does DocumentDB work overall?

DocumentDB uses the terms “database”, “collection” and “document”. A document represents a single entry, that is, an entity. Documents are stored in collections, which act as containers. Collections belong to a database, and an account can have several databases.

To persist an entity, you just have to convert the object to JSON and use the SDK to create a new document in the collection. As NoSQL databases are schema-free, each document can have a different structure. For example, there could be a document that represents a user, then another document that represents a shopping cart in the same collection.
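As a rough, language-neutral illustration of the paragraph above (written in Python rather than with the DocumentDB SDK), the sketch below serializes two differently shaped entities to JSON and stores them in the same collection. The collection list and the create_document helper are hypothetical stand-ins for a real collection and for the SDK call, and the field names are invented for the example.

```python
import json

# Hypothetical in-memory stand-in for a DocumentDB collection.
collection = []


def create_document(collection, entity):
    """Serialize the entity to JSON and store it as a new document."""
    document = json.dumps(entity)
    collection.append(document)
    return document


# Two documents with different structures live in the same collection.
create_document(collection, {"id": "user-1", "type": "user", "name": "Alice"})
create_document(collection, {
    "id": "cart-1",
    "type": "cart",
    "userId": "user-1",
    "items": [{"sku": "A42", "quantity": 2}],
})

for raw in collection:
    print(json.loads(raw)["type"])
```

A "type" field like the one above is one common way to tell the document kinds apart when they share a collection; it is our assumption here, not something DocumentDB requires.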

Challenges encountered

Because DocumentDB was released only recently, there is unfortunately very little documentation so far. Luckily for you, we decided to list some of the problems we ran into and the solutions we found.

Document scope

Each collection is limited to a certain number of requests per second. This can be a disadvantage at times, but it also gives flexibility. To stay under that limit, we found that keeping each document as small as possible greatly improves query performance, mainly because each query then consumes less of the per-second quota. We therefore decided that a document should never contain more than one aggregate.
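To make the "one aggregate per document" guideline concrete, here is a minimal Python sketch that splits a nested customer document into one document per aggregate. The split_by_aggregate helper and the customer/order fields are hypothetical; the sketch only shows the restructuring and does not measure DocumentDB request costs.

```python
import json


def split_by_aggregate(customer):
    """Return one document per aggregate instead of one nested document."""
    orders = customer.get("orders", [])
    # The customer document keeps everything except the embedded orders.
    customer_doc = {k: v for k, v in customer.items() if k != "orders"}
    # Each order becomes its own document and points back to its customer.
    order_docs = [dict(order, customerId=customer["id"]) for order in orders]
    return [customer_doc] + order_docs


nested = {
    "id": "customer-1",
    "name": "Alice",
    "orders": [
        {"id": "order-1", "total": 30},
        {"id": "order-2", "total": 12},
    ],
}

documents = split_by_aggregate(nested)
# Serialized size of each resulting document, for comparison with the original.
sizes = [len(json.dumps(doc)) for doc in documents]
```

Each resulting document is smaller than the nested original, at the price of an extra lookup when a customer and its orders are needed together.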

Data partitioning

At the time of writing, each collection is limited to 10 GB of storage, so you have to implement a data partitioning strategy yourself. There are several strategies: spillover, hashing, partitioning by time period, etc.
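As an illustration of the hashing strategy, the Python sketch below maps a partition key to one of several collections with a stable hash. The pick_collection helper, the key, and the collection names are hypothetical, and this is not how the DocumentDB SDK implements its own resolvers.

```python
import hashlib


def pick_collection(partition_key, collections):
    """Map a partition key to one collection using a stable hash."""
    # hashlib gives the same digest in every process, unlike Python's hash().
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(collections)
    return collections[index]


collections = ["orders-0", "orders-1", "orders-2"]
chosen = pick_collection("customer-42", collections)
```

With this simple modulo scheme, the same key always lands in the same collection, but adding a collection changes the mapping for most keys, so existing documents would have to be moved.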

The simplest strategy to implement is spillover. It consists of adding documents to the most recently created collection and, when that collection approaches a certain quota (90% of its capacity, for example), creating a new collection. See an example of Microsoft’s implementation.
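The spillover logic can be sketched in a few lines of Python. This is an in-memory model under our own assumptions, not Microsoft's implementation: the SpilloverPartitioner class is hypothetical, collections are plain lists, and a new collection is created when a write would push the latest one past the threshold.

```python
class SpilloverPartitioner:
    """Write to the newest collection; create another one near the quota."""

    def __init__(self, capacity_bytes, threshold=0.9):
        self.capacity_bytes = capacity_bytes
        self.threshold = threshold
        self.collections = [[]]  # each collection is a list of documents
        self.used_bytes = [0]    # bytes stored per collection

    def add(self, document):
        size = len(document.encode("utf-8"))
        limit = self.capacity_bytes * self.threshold
        # Spill over when this write would push the latest collection past the limit.
        if self.collections[-1] and self.used_bytes[-1] + size > limit:
            self.collections.append([])
            self.used_bytes.append(0)
        self.collections[-1].append(document)
        self.used_bytes[-1] += size
        return len(self.collections) - 1  # index of the collection written to


# Tiny capacity so the spillover is visible: 100 bytes, 90% threshold.
partitioner = SpilloverPartitioner(capacity_bytes=100)
placements = [partitioner.add("x" * 40) for _ in range(5)]
print(placements)  # prints [0, 0, 1, 1, 2]
```

Reads are the weak point of this strategy: since a document can be in any collection, a query generally has to fan out to all of them.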

The impact of the connection mode

DocumentDB supports two connection modes: gateway mode and direct mode. In gateway mode, every request goes through the DocumentDB gateway, which is also what happens when we call DocumentDB’s API directly. Direct mode, on the other hand, is used with the SDK: the client has direct access to the routing tables and the collections, which makes this mode faster.

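The connection mode is easy to change: it is set through the ConnectionPolicy passed to the DocumentClient constructor, by choosing ConnectionMode.Direct or ConnectionMode.Gateway.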

Avoiding the delay on the first request

By default, the first request takes much longer to run, mainly because the client first has to retrieve the routing table. To avoid this delay, you can open the client when the server starts by calling the client’s OpenAsync method.
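The idea of warming up the client at startup can be modeled as follows. This Python sketch uses a hypothetical RoutingClient class and a fake fetch function; it illustrates eager initialization in general and is not the DocumentDB SDK.

```python
class RoutingClient:
    """Hypothetical client that needs a routing table before serving requests."""

    def __init__(self, fetch_routing_table):
        self._fetch = fetch_routing_table
        self._routes = None

    def open(self):
        """Load the routing table eagerly, for example at server start."""
        if self._routes is None:
            self._routes = self._fetch()

    def request(self, key):
        # Without a prior open(), the first request pays for the fetch.
        self.open()
        return self._routes[key]


fetch_calls = []


def fetch_routing_table():
    fetch_calls.append("fetch")  # stands in for a slow network call
    return {"users": "collection-a"}


client = RoutingClient(fetch_routing_table)
client.open()                     # done once during startup
calls_after_startup = len(fetch_calls)
target = client.request("users")  # no extra fetch on the first request
```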

Fully asynchronous requests

Every request made to DocumentDB with the SDK is asynchronous. To use the SDK efficiently, we must understand the basics of asynchronous methods in C#. This can be a real challenge when starting out with DocumentDB, because it is easy to run into deadlocks. There is good reading material available (MSDN: Asynchronous Programming with Async and Await, and Stephen Cleary’s blog: Async and Await) to better understand asynchronous methods in C#.

Returning a large set of data

By default, query results are returned in chunks of 100 items. To limit the number of round trips, we can increase the number of items returned per trip, which the SDK exposes as the MaxItemCount property of FeedOptions.
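The effect of the chunk size on the number of round trips can be shown with a small Python model. The fetch_all helper and its max_item_count parameter are hypothetical names, and slicing a list stands in for one request to the server.

```python
def fetch_all(items, max_item_count=100):
    """Read every item chunk by chunk and count the round trips."""
    results, trips, offset = [], 0, 0
    while offset < len(items):
        page = items[offset:offset + max_item_count]  # one round trip
        results.extend(page)
        offset += len(page)
        trips += 1
    return results, trips


data = list(range(250))
_, default_trips = fetch_all(data)                      # chunks of 100
_, larger_trips = fetch_all(data, max_item_count=1000)  # one large chunk
print(default_trips, larger_trips)  # prints "3 1"
```

Larger chunks mean fewer trips but bigger responses, so the right value depends on how much of the result set is actually needed.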

Ideally, we would minimize the amount of data returned by using some kind of pagination. Unfortunately, DocumentDB does not offer that option so far.

Obviously, we have not covered all the problems we encountered. In the coming weeks, we will publish other articles on more complex problems. Stay tuned to our blog to follow the advancements of this technology!

Overall impression of DocumentDB

DocumentDB has been available to the public for nearly six months. Unfortunately, it is not yet mature and lacks some essential features. At the time of writing, it does not support paging (limiting results), which becomes frustrating once you have a lot of data. In addition, there are currently no aggregate functions (Sum, Count, Average, Group, etc.).

Despite these flaws, Microsoft did an incredible job bringing this new database to market, and it ships new releases very quickly. DocumentDB remains a technology worth considering when we need to pick new infrastructure for our projects.

Comments (3)

  • Ryan

    I’m a bit confused … “client.CreateDocumentAsync(database.SelfLink, json);” if database is injected in the constructor, where does the collection get set? a create of document is expecting a Collection Link, not a Database Link. If I tried creating a document by specifying a database link I will get an error

    November 30, 2015 at 2:56 pm
    • Marc-Olivier Duval

      Hi Ryan,
      As mentioned in the documentation, the method CreateDocumentAsync takes either the documents feed or the database link. When you provide a partitioner, it actually finds the collection for you, so you can simply use the database link.

      November 30, 2015 at 3:48 pm
      • Ryan

        Yeah, aware that you can supply a database link IF you have a partition resolver registered. I didn’t get that from your code snippet and hence the confusion.

        November 30, 2015 at 3:55 pm