How to Create an LRU Cache in Python

What is an LRU Cache?

For any software product, application performance is one of the key quality attributes. One common technique used for improving the performance of a software application is to use a memory cache. A memory cache keeps frequently used application data in the fast RAM of the computing device. Reading or writing data to an in-memory cache is usually much faster than reading/writing data from a database or a file. However, there is a limit to the amount of data that you can cache in memory since system RAM is a limited and expensive resource.

So in practical applications, you set a limit to the cache size and then try to optimise the cache for your data access patterns. One approach used for managing a bounded cache is the LRU cache. In an LRU (Least Recently Used) cache, whenever the cache runs out of space, the program replaces the least recently used item in the cache with the data you want to cache. To do this, the algorithm keeps track of all cache items and how recently each one was used relative to the others.

Any generic cache implementation has two main operations,

  • get(key) - Get a cache entry given its unique identifier key.
  • put(key, value) - Insert a cache entry along with its unique identifier key.

In an LRU cache, put() and get() have additional internal logic to track how recently each cache entry was accessed. In the put() operation, the LRU cache checks the size of the cache and, if it is running out of space, evicts the least recently used entry to make room for the new one.

If you are using Python 3, you can either build your own LRU cache using basic data structures or use the built-in LRU cache implementation available as functools.lru_cache(). In this article, I will start with the basic data structure solution since it helps you understand the LRU concept better.

How to Create an LRU Cache in Python?

Let us now create a simple LRU cache implementation in Python. It is relatively easy and concise thanks to the features of Python. The following program is tested on Python 3.6 and above.

Python provides an ordered dictionary called OrderedDict (in the collections module) which retains the insertion order of keys. This order can be used to indicate which entries are the most recently used. Here is the strategy followed in the Python program given below,

  • Whenever get() is invoked, the item is removed from the dictionary and then re-inserted at the end of the ordered keys. This ensures that recently used items are always at the end of the dictionary.
  • Whenever put() is invoked and we run out of space, the first entry in the ordered keys (the least recently used one) is removed and the new entry is added at the end. This works because every get() moves the accessed item to the end of the ordered keys, so the first item is always the least recently used one!
import collections

class SimpleLRUCache:
  def __init__(self, size):
    self.size = size
    self.lru_cache = collections.OrderedDict()

  def get(self, key):
    try:
      # Remove the entry and re-insert it so that it moves
      # to the end (the most recently used position)
      value = self.lru_cache.pop(key)
      self.lru_cache[key] = value
      return value
    except KeyError:
      # Cache miss
      return -1

  def put(self, key, value):
    try:
      self.lru_cache.pop(key)
    except KeyError:
      # New key - evict the least recently used entry (the first item)
      # if the cache is already full
      if len(self.lru_cache) >= self.size:
        self.lru_cache.popitem(last=False)
    self.lru_cache[key] = value

  def show_entries(self):
    print(self.lru_cache)



# Create an LRU Cache with a size of 3
cache = SimpleLRUCache(3)


cache.put("1","1")
cache.put("2","2")
cache.put("3","3")

cache.get("1")
cache.get("3")

cache.put("4","4") # This will replace 2
cache.show_entries() # shows 1,3,4
cache.put("5","5") # This will replace 1
cache.show_entries() # shows 3,4,5

The following diagram shows how the LRU cache works in the above implementation.

[Diagram: LRU cache in Python]

How to Create an LRU Cache in Python Using functools?

Since an LRU cache is a common application need, Python (from version 3.2 onwards) provides a built-in LRU cache decorator as part of the functools module. This decorator can be applied to any function which takes a potential key as input and returns the corresponding data object. When the function is called again with the same key, the decorator will not execute the function body if the data corresponding to that key already exists in the cache!

from functools import lru_cache

@lru_cache(maxsize=3)
def get_data(key):
  print("Cache miss with "+key)
  # A costly I/O usually implemented below
  return key + ":value"

print(get_data("1"))
print(get_data("2"))
print(get_data("3"))
print(get_data("4"))
print(get_data("4"))
print(get_data("3"))
print(get_data("1"))

How to Download Blobs from Azure Storage Using Python

Azure blob storage offers a cheap and reliable solution for storing large amounts of unstructured data (such as images). Blob storage has no hierarchical structure, but you can emulate folders using blob names with slashes (/) in them. In this article, I will explore how we can use the Azure Python SDK to bulk download blob files from an Azure storage account.

When it comes to the Python SDK for Azure storage services, there are two options: the older Azure Python SDK (v2.1), which is now deprecated, and the newer Azure Python SDK (v12).

The following code samples will be using the latest Azure Python SDK (v12).

Pre-requisites for Sample Python Programs

The samples below require Python 3.6 or above. On Windows, you can download it from the official Python website. On a Mac, use Homebrew to install Python 3,

brew install python3

Next, you will need the Azure Python SDK for blob storage access. Use pip to install it,

pip3 install azure-storage-blob --user

Now you are all set to run the following Python programs.

How to Bulk Download Files from Azure Blob Storage Using Python

The following Python program uses the Azure storage SDK to download all blobs in a storage container to a specified local folder. The program creates local folders for blobs which use virtual folder names (names containing slashes).

Before running the program, ensure you give proper values for MY_CONNECTION_STRING, MY_BLOB_CONTAINER and LOCAL_BLOB_PATH. The connection string can be obtained from the Azure portal and it contains the account URL and access key inside it. Note that the program may take a while if your storage account contains a large number of blob files. See the next program below to see how this can be sped up using Python's ThreadPool class.

# download_blobs.py
# Python program to bulk download blob files from azure storage
# Uses the latest Azure Python SDK (v12) for blob storage
# Requires python 3.6 or above
import os
from azure.storage.blob import BlobServiceClient

# IMPORTANT: Replace connection string with your storage account connection string
# Usually starts with DefaultEndpointsProtocol=https;...
MY_CONNECTION_STRING = "REPLACE_THIS"

# Replace with blob container
MY_BLOB_CONTAINER = "myimages"

# Replace with the local folder where you want files to be downloaded
LOCAL_BLOB_PATH = "REPLACE_THIS"

class AzureBlobFileDownloader:
  def __init__(self):
    print("Intializing AzureBlobFileDownloader")

    # Initialize the connection to Azure storage account
    self.blob_service_client =  BlobServiceClient.from_connection_string(MY_CONNECTION_STRING)
    self.my_container = self.blob_service_client.get_container_client(MY_BLOB_CONTAINER)


  def save_blob(self,file_name,file_content):
    # Get full path to the file
    download_file_path = os.path.join(LOCAL_BLOB_PATH, file_name)

    # for nested blobs, create local path as well!
    os.makedirs(os.path.dirname(download_file_path), exist_ok=True)

    with open(download_file_path, "wb") as file:
      file.write(file_content)

  def download_all_blobs_in_container(self):
    my_blobs = self.my_container.list_blobs()
    for blob in my_blobs:
      print(blob.name)
      file_content = self.my_container.get_blob_client(blob).download_blob().readall()
      self.save_blob(blob.name, file_content)

# Initialize class and download files
azure_blob_file_downloader = AzureBlobFileDownloader()
azure_blob_file_downloader.download_all_blobs_in_container()

Fast/Parallel File Downloads from Azure Blob Storage Using Python

The following program uses the ThreadPool class in Python to download files in parallel from Azure storage. This substantially speeds up your download if you have good bandwidth. The program currently uses 10 threads, but you can increase the count if you want faster downloads. Don't forget to change the MY_CONNECTION_STRING, LOCAL_BLOB_PATH and MY_BLOB_CONTAINER variables.

# download_blobs_parallel.py
# Python program to bulk download blobs from azure storage
# Uses the latest Azure Python SDK (v12) for blob storage
# Requires python 3.6 or above

import os
from multiprocessing.pool import ThreadPool
from azure.storage.blob import BlobServiceClient

# IMPORTANT: Replace connection string with your storage account connection string
# Usually starts with DefaultEndpointsProtocol=https;...
MY_CONNECTION_STRING = "REPLACE_THIS_CONNECTION"

# Replace with blob container name
MY_BLOB_CONTAINER = "myimages"

# Replace with the local folder where you want downloaded files to be stored
LOCAL_BLOB_PATH = "REPLACE_THIS_PATH"

class AzureBlobFileDownloader:
  def __init__(self):
    print("Intializing AzureBlobFileDownloader")

    # Initialize the connection to Azure storage account
    self.blob_service_client =  BlobServiceClient.from_connection_string(MY_CONNECTION_STRING)
    self.my_container = self.blob_service_client.get_container_client(MY_BLOB_CONTAINER)

  def download_all_blobs_in_container(self):
    # get a list of blobs
    my_blobs = self.my_container.list_blobs()
    result = self.run(my_blobs)
    print(result)

  def run(self,blobs):
    # Download 10 files at a time!
    with ThreadPool(processes=10) as pool:
      return pool.map(self.save_blob_locally, blobs)

  def save_blob_locally(self,blob):
    file_name = blob.name
    print(file_name)
    file_content = self.my_container.get_blob_client(blob).download_blob().readall()

    # Get full path to the file
    download_file_path = os.path.join(LOCAL_BLOB_PATH, file_name)
    # for nested blobs, create local path as well!
    os.makedirs(os.path.dirname(download_file_path), exist_ok=True)

    with open(download_file_path, "wb") as file:
      file.write(file_content)
    return file_name

# Initialize class and download files
azure_blob_file_downloader = AzureBlobFileDownloader()
azure_blob_file_downloader.download_all_blobs_in_container()

How to Upload Files to Azure Storage Blobs Using Python

Azure storage blobs offer a very cost effective and fast storage solution for unstructured data. It is an ideal solution if you want to serve content such as images. It is also possible to speed up content delivery performance by using the Azure CDN service with it.

Before running the following programs, ensure that you have the pre-requisites ready. In the following sample python programs, I will be using the latest Python SDK v12 for Azure storage blob.

Install Python 3.6 or above. On a Mac, use Homebrew to install Python 3,

brew install python3

Install the Azure Blob storage client library for Python package,

pip3 install azure-storage-blob --user

Using the Azure portal, create an Azure storage v2 account and a container before running the following programs. You will also need to copy the connection string for your storage account from the Azure portal. If you want public access to uploaded images, set the container public access level to "Blob (anonymous read access for blobs only)".

How to Upload Files to Azure Storage Blobs Using Python

The following program demonstrates a typical use case where you want to bulk upload a set of jpg images from a local folder to an Azure blob storage container. Note that for a large number of files, this program may not be efficient since it uploads the images sequentially. Replace MY_CONNECTION_STRING, MY_IMAGE_CONTAINER and LOCAL_IMAGE_PATH before running the program.

# upload_blob_images.py
# Python program to bulk upload jpg image files as blobs to azure storage
# Uses the latest Azure Python SDK (v12) for blob storage
# Requires python 3.6 or above
import os
from azure.storage.blob import BlobServiceClient, ContentSettings

# IMPORTANT: Replace connection string with your storage account connection string
# Usually starts with DefaultEndpointsProtocol=https;...
MY_CONNECTION_STRING = "REPLACE_THIS"

# Replace with blob container. This should be already created in azure storage.
MY_IMAGE_CONTAINER = "myimages"

# Replace with the local folder which contains the image files for upload
LOCAL_IMAGE_PATH = "REPLACE_THIS"

class AzureBlobFileUploader:
  def __init__(self):
    print("Intializing AzureBlobFileUploader")

    # Initialize the connection to Azure storage account
    self.blob_service_client =  BlobServiceClient.from_connection_string(MY_CONNECTION_STRING)

  def upload_all_images_in_folder(self):
    # Get all files with jpg extension and exclude directories
    all_file_names = [f for f in os.listdir(LOCAL_IMAGE_PATH)
                    if os.path.isfile(os.path.join(LOCAL_IMAGE_PATH, f)) and ".jpg" in f]

    # Upload each file
    for file_name in all_file_names:
      self.upload_image(file_name)

  def upload_image(self,file_name):
    # Create blob with same name as local file name
    blob_client = self.blob_service_client.get_blob_client(container=MY_IMAGE_CONTAINER,
                                                          blob=file_name)
    # Get full path to the file
    upload_file_path = os.path.join(LOCAL_IMAGE_PATH, file_name)

    # Create blob on storage
    # Overwrite if it already exists!
    image_content_setting = ContentSettings(content_type='image/jpeg')
    print(f"uploading file - {file_name}")
    with open(upload_file_path, "rb") as data:
      blob_client.upload_blob(data,overwrite=True,content_settings=image_content_setting)


# Initialize class and upload files
azure_blob_file_uploader = AzureBlobFileUploader()
azure_blob_file_uploader.upload_all_images_in_folder()

Parallel Bulk Upload of Files to Azure Storage Blobs Using Python

The following Python program is an improved version of the above program. It uses a thread pool to upload a predefined number of images in parallel. Note that the program uses 10 as the thread pool count, but you can increase it for faster uploads if you have sufficient network bandwidth. If you don't specify a content type, it will default to application/octet-stream.

# upload_blob_images_parallel.py
# Python program to bulk upload jpg image files as blobs to azure storage
# Uses ThreadPool for faster parallel uploads!
# Uses the latest Azure Python SDK (v12) for blob storage
# Requires python 3.6 or above
import os
from multiprocessing.pool import ThreadPool
from azure.storage.blob import BlobServiceClient, ContentSettings

# IMPORTANT: Replace connection string with your storage account connection string
# Usually starts with DefaultEndpointsProtocol=https;...
MY_CONNECTION_STRING = "REPLACE_THIS"

# Replace with blob container
MY_IMAGE_CONTAINER = "myimages"

# Replace with the local folder which contains the image files for upload
LOCAL_IMAGE_PATH = "REPLACE_THIS"

class AzureBlobFileUploader:
  def __init__(self):
    print("Intializing AzureBlobFileUploader")

    # Initialize the connection to Azure storage account
    self.blob_service_client =  BlobServiceClient.from_connection_string(MY_CONNECTION_STRING)

  def upload_all_images_in_folder(self):
    # Get all files with jpg extension and exclude directories
    all_file_names = [f for f in os.listdir(LOCAL_IMAGE_PATH)
                    if os.path.isfile(os.path.join(LOCAL_IMAGE_PATH, f)) and ".jpg" in f]

    result = self.run(all_file_names)
    print(result)

  def run(self,all_file_names):
    # Upload 10 files at a time!
    with ThreadPool(processes=10) as pool:
      return pool.map(self.upload_image, all_file_names)

  def upload_image(self,file_name):
    # Create blob with same name as local file name
    blob_client = self.blob_service_client.get_blob_client(container=MY_IMAGE_CONTAINER,
                                                          blob=file_name)
    # Get full path to the file
    upload_file_path = os.path.join(LOCAL_IMAGE_PATH, file_name)

    # Create blob on storage
    # Overwrite if it already exists!
    image_content_setting = ContentSettings(content_type='image/jpeg')
    print(f"uploading file - {file_name}")
    with open(upload_file_path, "rb") as data:
      blob_client.upload_blob(data,overwrite=True,content_settings=image_content_setting)
    return file_name

# Initialize class and upload files
azure_blob_file_uploader = AzureBlobFileUploader()
azure_blob_file_uploader.upload_all_images_in_folder()

Using the ContentSettings object, it is possible to set the content type, content encoding, content MD5 or cache control for the blobs. See the Azure SDK documentation for the full set of content_settings attributes.
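For example, here is a minimal sketch of a ContentSettings object for a gzip-compressed JavaScript file with a one day cache lifetime; the attribute values shown are only illustrative assumptions and should be adjusted to your own content,

from azure.storage.blob import ContentSettings

# Illustrative settings - adjust the values to match your own content
js_content_setting = ContentSettings(
    content_type="application/javascript",
    content_encoding="gzip",
    cache_control="public, max-age=86400"
)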

How to Change Content Type of Azure Storage Blobs Using Python SDK

Microsoft Azure cloud platform offers a comprehensive portfolio of storage services. Each of these services is suitable for specific use cases. Here is a quick overview of Azure storage services,

  • Azure Storage Account - An Azure storage account can be used for a wide variety of workloads. Using this account, it is possible to create blob storage or file storage. Blob storage is suitable for storing large sets of unstructured data (images, for example). File storage makes content available over the SMB protocol. It is also possible to create storage queues or tables using this storage account. For analytical workloads, storage accounts can be configured as Azure Data Lake Storage Gen2 with a hierarchical namespace. Finally, there is also the option of enabling an archive mode for the storage, offering a lower cost option.
  • Azure Managed Disk - This offers high performance block storage. There are plenty of choices for managed disks, such as ultra disk storage, premium SSD, standard SSD and standard HDD.
  • Azure HPC Cache - Powered by an Azure storage account, this service offers caching of files in the cloud for high throughput needs.

Out of these services, the Azure storage account is the most useful and powerful one. Blob storage in particular is suited for a wide range of use cases where large sets of unstructured binary data such as images are involved. It can also integrate with CDNs for faster content delivery.

There are 4 different ways of accessing Azure blob storage. These are,

  • Azure portal - Use the web based azure portal
  • Azure CLI - Command line interface
  • Azure Powershell - Powershell based command line interface
  • Azure SDKs (Python, .NET etc.) - SDKs based on various languages

In this article, we are looking at accessing blob storage using Python. When it comes to the Python SDK for Azure storage services, there are two options: the older Azure Python SDK (v2.1) and the newer Azure Python SDK (v12).

Since Azure Python SDK v2.1 is deprecated, we will be using the Azure Python SDK v12 for the following examples. Python 3.6 or above is required.

How to Change Content Type of Azure Storage Blobs Using Python SDK

Before using the Python SDK for Azure storage, ensure that you have the following pre-requisites installed.

Step 1: Install Python 3.6 or above. On a Mac, use Homebrew to install Python 3,

brew install python3

Step 2: Install the Azure Blob storage client library for Python package,

pip3 install azure-storage-blob --user

Run the following program to convert the content type of all files with the .jpg extension to image/jpeg in Azure blob storage using the Python SDK. You can also customize the same program for changing the content type, content encoding, content MD5 or cache control for the blobs. See the Azure SDK documentation for the full set of content_settings attributes.

The program assumes the existence of an Azure storage account with a container containing image files with the wrong content type (application/octet-stream).

# change_blob_content_type.py
# Python program to change content type of .jpg files to image/jpeg in Azure blob storage
# This is useful if you accidentally uploaded jpg files with application/octet-stream content type
# Requires python 3.6 or above
from azure.storage.blob import BlobServiceClient

# IMPORTANT: Replace connection string with your storage account connection string
# This usually starts with DefaultEndpointsProtocol=https;..
MY_CONNECTION_STRING = "REPLACE_THIS"
MY_IMAGE_CONTAINER = "myimages"

class AzureBlobImageProcessor:
  def __init__(self):
    print("Initializing AzureBlobImageProcessor")

    # Ensure that the container mentioned exists in the storage account (myimages in this example)
    self.container_client = BlobServiceClient.from_connection_string(MY_CONNECTION_STRING).get_container_client(MY_IMAGE_CONTAINER)

  def change_content_type_for_jpg_files(self):
    print("changing content type for all files with .jpg extension")

    # You can optionally pass a prefix for blob names to list_blobs()
    # This is useful if you want to process a large number of files
    # You can run multiple instances of this program with different prefixes!
    blob_list = self.container_client.list_blobs()
    file_count = 0
    for blob in blob_list:
      if ".jpg" in blob.name:

        # Print file name and current content type to monitor progress
        print(f"For file {blob.name}  current type is {blob.content_settings.content_type}")

        # Note that in addition to content_type, you can also set the following values,
        # content_encoding, content_language, content_disposition, cache_control and content_md5
        blob.content_settings.content_type = "image/jpeg"
        self.container_client.get_blob_client(blob).set_http_headers(blob.content_settings)
        file_count += 1

    print(f"changing content type completed. Processed {file_count} files")

# Initialize class and change content type for all files
azure_blob_image_processor = AzureBlobImageProcessor()
azure_blob_image_processor.change_content_type_for_jpg_files()

Save the above program in file change_blob_content_type.py and then run the following command. Don't forget to update MY_CONNECTION_STRING and MY_IMAGE_CONTAINER variables before running the program,

python3 change_blob_content_type.py

I used this program for converting around 200,000 image files which were wrongly uploaded as application/octet-stream. I ran the program from a VM located in the same region as the storage account to speed up processing. Running it locally may take a lot of time. This program is useful for bulk conversion of Azure blob content types.

If you can partition the set of files in storage using blob name prefixes, there is a way to run the content type conversion in parallel. For each instance of the Python program, you can pass a different blob name prefix to the list_blobs() method. Assume you have blobs with the following full names,

  • subset1/a.jpg
  • subset1/b.jpg
  • subset2/c.jpg
  • subset2/d.jpg

You can run two Python program instances with list_blobs("subset1") and list_blobs("subset2") to speed up conversion, as shown in the sketch below.
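Here is a minimal sketch of this approach, assuming the same MY_CONNECTION_STRING and MY_IMAGE_CONTAINER placeholders as in the program above; name_starts_with is the list_blobs() parameter that filters the listing by blob name prefix:

from azure.storage.blob import BlobServiceClient

MY_CONNECTION_STRING = "REPLACE_THIS"
MY_IMAGE_CONTAINER = "myimages"

container_client = BlobServiceClient.from_connection_string(
    MY_CONNECTION_STRING).get_container_client(MY_IMAGE_CONTAINER)

# Process only blobs whose names start with "subset1"
# (run another instance with "subset2" to convert the rest in parallel)
for blob in container_client.list_blobs(name_starts_with="subset1"):
    print(blob.name)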

MongoDB History

When it comes to modern web application development, MongoDB is the king. If you are a full stack programmer, you hear about MERN or MEAN stacks every day. The M is in every one of them and it stands for MongoDB. The free and open source community version of MongoDB powers a large number of Web applications. From its humble beginnings in 2007, MongoDB has come a long way. It is the primary product behind the company MongoDB Inc. with over 10 billion dollars in market capitalisation. Like many other products before and after, online advertising was the key catalyst behind the vision and development of MongoDB. The story of MongoDB is an interesting one and in this article, I will take you on a journey through the MongoDB land and its history.

The Idea of the Humongous Database - The Beginnings

The story of MongoDB has its beginnings much earlier than 2007. In 1995, Dwight Merriman and Kevin O'Connor created the famous online advertising company DoubleClick. Kevin Ryan joined the team soon after (Dwight and Kevin later cofounded 5 companies together - Gilt, 10gen, Panther Express, ShopWiki and Business Insider). DoubleClick soon took off and within a few years it was serving as much as 400,000 ads/second. Such large scale traffic was not anticipated by the relational database technologies available at that time. Configuring relational databases for such scale also required a substantial amount of money and hardware resources. So Dwight (who was the CTO at the time) and his team wrote custom database implementations to scale DoubleClick for the increased traffic. In one of his early talks on MongoDB, Dwight talks about getting network hardware with serial number 7 and wondering whether it would work! This was even before the invention of load balancers.

[Image: People behind MongoDB]

In 2003, Eliot Horowitz joined the DoubleClick R&D division as a software engineer immediately after college. Within 2 years he left DoubleClick to start ShopWiki along with Dwight. Both of them realised that they were solving the same horizontal scalability issues again and again. So in 2007, Dwight, Eliot and Kevin Ryan started a new company called 10gen. 10gen was focused on creating a PaaS hosting solution with its own application and database stack. 10gen soon got the attention of the venture capitalist Albert Wenger (Union Square Ventures) and he invested $1.5 million in it. This is what Albert Wenger wrote in 2008 about the 10gen investment,

Today we are excited to announce that we are backing a team working on an alternative, the amazingly talented folks at 10gen. They bring together experience in building Internet scale systems, such as DART and the Panther Express CDN, with extensive Open Source involvement, including the Apache Software Foundation. They are building an open source stack for cloud computing that includes an appserver and a database both written from scratch based on the capabilities of modern hardware and the many lessons learned in what it takes to build a web site or service. The appserver initially supports server side Javascript and (experimentally) Ruby. The database stores objects using an interesting design that balances fast random access with efficient scanning of collections.

What Albert was referring to as "the database with interesting design" was in fact MongoDB. Rapid development on the new database took place from 2007 to 2009. The first commit of the MongoDB database server by Dwight can be seen here. The core engine was written in C++. The database was named MongoDB since the idea was to use it to store and serve the humongous amount of data required in typical use cases such as content serving. Initially the team had only 4 engineers (including Dwight and Eliot) and decided to focus just on the MongoDB database instead of the initial PaaS product. The business idea was to release the database as an open source free download and offer commercial support and training services on top of it.

MongoDB 1.0 was released in February 2009. The initial version focused on providing a usable query language with a document model, indexing and basic support for replication. It also had an experimental version of sharding, but production ready sharding clusters were available only in version 1.6, released a year later.

Here is how Dwight responded to a question on the suitability of Mongo for a highly scalable system (MongoDB user group - September 2009),

For horizontal scaling, one would use auto-sharding to build out large MongoDB clusters. This is in alpha now, but if your project is just getting started, it will be in production by the time you need it.

Early MongoDB Design Philosophy

In the early years, the basic design principles behind MongoDB development were the following,

  • Quick and easy data model for faster programming - document model with CRUD.
  • Use of familiar language and format - JavaScript/JSON.
  • Schema less documents for agile iterative development.
  • Only essential features for faster development and easy scaling. No join, no transactions across collections.
  • Support easy horizontal scaling and durability/availability (replication/sharding).

In his ZendCon 2011 presentation titled "NoSQL and why we created MongoDB", Dwight talks about these principles in detail. Around the 42 minute mark, there is also an interesting discussion on the difference between replication and sharding. As the database server code matured and once MongoDB hit the mainstream, many of these principles were obviously diluted. The latest MongoDB server versions support joins to some extent and since MongoDB 4.2, even distributed transactions are supported!

What is MongoDB?

Before we get into detailed MongoDB history and how it evolved over the years, let us briefly look at what exactly it is!

MongoDB is a document based NoSQL database. It can run on all major platforms (Windows, Linux, Mac) and the open source version is available as a free download. MongoDB stores data entities in a container called a collection, and each piece of data is stored as a JSON document. For example, if a customer submits an online order, the entire details of that order (order number, order line items, delivery address etc.) are kept in a single hierarchical document in JSON format. It is then saved to a collection named "customer_order".
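To make this concrete, here is a small illustrative sketch using the pymongo driver; it assumes a MongoDB instance running locally on the default port, and the database name and document fields below are made up for the example,

from pymongo import MongoClient

# Connect to a local MongoDB instance (assumed to be running on the default port 27017)
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# The whole order is stored as a single hierarchical document
order = {
    "order_number": "ORD-1001",
    "delivery_address": {"street": "1 Main St", "city": "Springfield"},
    "line_items": [
        {"sku": "BOOK-42", "quantity": 1, "price": 12.50},
        {"sku": "PEN-07", "quantity": 3, "price": 1.20},
    ],
}
db.customer_order.insert_one(order)

# Fetch the order back by its order number
print(db.customer_order.find_one({"order_number": "ORD-1001"}))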

MongoDB also comes with a console client called MongoDB shell. This is a fully functional JavaScript environment using which you can add, remove, edit or query document data in the database.

MongoDB Architecture

The following MongoDB architecture diagram provides a high level view of the major components in the MongoDB server.

[Diagram: MongoDB architecture]

MongoDB currently offers drivers for 13 languages including Java, Node.js, Python, PHP and Swift. The MMAPv1 storage engine has been removed since version 4.2. The encrypted storage engine is only supported in the commercial enterprise server.

The beauty of MongoDB is that using the same open source free community server you can,

  • Run a simple single machine instance suitable for most small applications.
  • Run a multi-machine instance with durability/high availability suitable for most business applications.
  • Run a large horizontally scaled cluster of machines (shard cluster) handling both very large sets of data and a high volume of query traffic. MongoDB provides automatic infrastructure to distribute both data and its processing across machines. A typical use case would be running a popular ad service with thousands of customers and millions of impressions.

The following diagrams show various options available for running MongoDB instances.

Single Server/Fault Tolerant Setup

For small applications, a single server setup is enough with frequent data backups. For installations that require fault tolerance, a replica set can be deployed. In a fault tolerant deployment, there are usually 3 or more MongoDB instances. Only one of them works as the primary instance and if it fails, one of the secondaries takes over as the primary. The data is identical in all instances.

[Diagram: Single server and fault tolerant (replica set) setups]

Shard Cluster for Horizontal Scalability

For a large database with both horizontal scalability and fault tolerance requirements, a MongoDB shard cluster is configured. As can be seen from the diagram below, the minimum recommended number of machines for a fault tolerant shard cluster is 14! Each fault tolerant replica set in this case handles only a subset of the data. This data partitioning is automatically done by the MongoDB engine.

[Diagram: Shard cluster with partitioned data]

When you download the latest version of MongoDB (4.4) and extract it, you will find that it contains only the following 3 main files,

  • mongo - MongoDB shell for interacting with your server using JavaScript based commands.
  • mongod - The MongoDB main executable. This can run as a single database instance, as a database member of a sharded cluster or as a configuration server of a sharded cluster.
  • mongos - A router application only needed for sharded horizontally scaled cluster of database servers.

On a Mac, the total size of these 3 executables is around 150MB. These are the only components you need for any type of MongoDB deployment! In a world of bloated software, this is a welcome change! This simplicity and elegance is what makes MongoDB so powerful and reliable.
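For illustration, a standalone single machine instance could be started and explored like this (a minimal sketch, assuming the extracted binaries are on your PATH and that a data directory such as /data/db already exists),

# Start a standalone MongoDB server using /data/db as the data directory
mongod --dbpath /data/db

# In another terminal, open the JavaScript shell (connects to localhost:27017 by default)
mongo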

Evolution of MongoDB (2009 to 2020)

MongoDB 1.0 was released in February 2009 and it had most of the basic query functionalities. MongoDB 1.2 was released in December 2009 and it introduced large scale data processing using map-reduce. Realising that MongoDB has good potential, 10gen quickly ramped up the team. MongoDB 1.4 (March 2010) introduced background index creation and MongoDB 1.6 (August 2010) introduced some major features such as production ready sharding for horizontal scaling, replica sets with automatic failover and IPv6 support.

[Timeline: MongoDB 2009 to 2013]

By 2012, 10gen had 100 employees and the company started providing 24/7 support. The MongoDB 2.2 release (August 2012) introduced the aggregation pipeline, enabling multiple data processing steps as a chain of operations. By 2013, 10gen had over 250 employees and 1000 customers. Realising the true business potential, 10gen was renamed MongoDB Inc. to focus fully on the database product. The MongoDB 2.4 release (March 2013) introduced text search and Google's V8 JS engine in the Mongo shell, among other enhancements. Along with 2.4, a commercial version of the database called MongoDB Enterprise was released. It included additional features such as monitoring and security integrations.

One of the major problems with early MongoDB versions was the relatively weak storage engine used for saving and managing the data on disk. MongoDB Inc.'s first acquisition was WiredTiger, the company behind the super stable storage engine of the same name. MongoDB acquired both the team and the product, and its main architect Michael Cahill (also one of the architects of Berkeley DB) became director of engineering (storage) in the company. WiredTiger is an efficient storage engine. Using a variety of programming techniques such as hazard pointers, lock-free algorithms, fast latching and message passing, WiredTiger performs more work per CPU core than alternative engines. To minimize on-disk overhead and I/O, WiredTiger uses compact file formats and, optionally, compression.

The next major release of MongoDB was 3.0 (March 2015), which featured the new WiredTiger storage engine, a pluggable storage engine API, an increased replica set member limit of 50 and security improvements. The same year Glassdoor featured MongoDB Inc. as one of the best places to work. Later in the year version 3.2 was released, and it supported document validation, partial indexes and some major aggregation enhancements.

In 2017, Microsoft released a proprietary, globally distributed, multi-model NoSQL database service called CosmosDB as part of the Microsoft Azure cloud platform. This offered protocol compatibility with MongoDB 3.2 so that queries written in MongoDB 3.2 could be run on CosmosDB. I think this accelerated the adoption of CosmosDB among developers.

By 2016, MongoDB Inc. had 500 employees and the database itself had been downloaded over 20 million times. In October 2017, MongoDB Inc. went public with over 1 billion dollars in market capitalisation. MongoDB 3.6 was released a month later (November 2017) and it included better support for $lookup for multi-collection joins, change streams and the $jsonSchema operator to support document validation using JSON Schema. Notably, MongoDB 3.6 is the latest version supported by Microsoft Azure CosmosDB as of August 2020.

[Timeline: MongoDB 2013 to 2020]

In 2018, MongoDB Inc. went for its second acquisition by taking over mLab for 68 million dollars. At the time mLab was providing MongoDB as a service (DBaaS) on the cloud and had a large number of customers. Cloud was the future and MongoDB Inc. moved quickly to acquire and integrate mLab as part of the MongoDB Atlas cloud platform. They then decided to address the issue of more competitors appearing in the DBaaS space by changing the licensing terms of the open source version!

The MongoDB open source community version and the premium enterprise version were both powered by the same underlying engine. This meant that anyone could take the community version and then offer a paid cloud version on top of it. This was a major problem for MongoDB Inc. since it meant direct competition for their cloud product, MongoDB Atlas. So in a controversial move, MongoDB Inc. changed the license of the community version from GNU AGPLv3 (AGPL) to the Server Side Public License (SSPL) in October 2018. The license has a clause to prevent future SaaS competitors from using MongoDB to offer their own SaaS version,

If you make the functionality of the Program or a modified version available to third parties as a service, you must make the Service Source Code available via network download to everyone at no charge, under the terms of this License.

This was a license written by MongoDB Inc. itself and they claimed it was an OSI compliant license. The license was later withdrawn from the approval process of the Open Source Initiative (OSI), but the open source version is still licensed under SSPL.

By 2018, the company had over 1000 employees. The next major release, MongoDB 4.0 (June 2018), came with the capability to run transactions across multiple documents. It was a major milestone and MongoDB was getting ready for use cases with high data integrity needs.

The cloud ecosystem was rapidly growing and soon MongoDB Inc. realised the need for a full fledged cloud platform instead of just offering the database service. In 2019, MongoDB Inc. went for its third acquisition by taking over Realm, a cloud based mobile database company, for $39 million. This was interesting since MongoDB originally started as a PaaS hosting solution and after 12 years, it was back in the same direction. In the same year MongoDB 4.2 was released with distributed transaction support.

The current version of the MongoDB community server as of August 2020 is MongoDB 4.4. It is notable for splitting the MongoDB database tools into a separate download. MongoDB 4.4 also contains some major feature enhancements such as union aggregation from multiple collections, refinable/compound hashed shard keys and hedged reads/mirrored reads.

MongoDB Today

As of 2020, MongoDB has been downloaded 110 million times worldwide. MongoDB Inc. currently has 2000+ employees and over 18,000 paying customers, many of whom will be using both MongoDB Atlas and MongoDB Enterprise. Most large companies are likely using the community version internally for some use case. The MongoDB community server is still open source and, except for some key features, it is still on par with MongoDB Enterprise.

The MongoDB enterprise server (which seems to be priced in the range of $10k per year per server) offers the following additional features,

  • In-memory storage engine. This is for use cases where fast data access is needed and persistent storage is not required.
  • Auditing. This allows administrators to track system activity for deployments.
  • Authentication & Authorization. Supports Kerberos authentication and LDAP authentication and authorization.
  • Encryption at Rest. The WiredTiger engine has a native encryption option. The default is AES256 using OpenSSL.

In addition to the community server, MongoDB Inc. offers the following products,

  • MongoDB Database Tools - A collection of command line tools for working with a MongoDB installation. This includes import/export (mongodump, mongorestore, etc.) and diagnostic tools (mongostat, mongotop).
  • MongoDB Enterprise Server - The enterprise version with additional security and auditing features.
  • MongoDB Atlas - A premium cloud based SaaS version of the MongoDB server.
  • Atlas Data Lake - A cloud based data lake tool powered by the MongoDB query language that allows you to query and analyse data across MongoDB Atlas and AWS S3 buckets.
  • Atlas Search - A cloud based full-text search engine that works on top of MongoDB Atlas.
  • MongoDB Realm - A managed cloud service offering backend services for mobile apps.
  • MongoDB Charts - A cloud tool to create visual representations of MongoDB data.
  • MongoDB Compass - A downloadable GUI tool for connecting to the MongoDB database and querying data.
  • MongoDB Ops Manager - An on-premise management platform for deployment, backup and scaling of MongoDB on custom infrastructure.
  • MongoDB Cloud Manager - The cloud version of Ops Manager.
  • MongoDB Connectors - A set of drivers for other platforms/tools to connect to MongoDB.

The Road Ahead

Since the SSPL license controversy, some in the developer community are wary of the MongoDB ecosystem. There is also investor pressure to generate revenue around the ecosystem. This is very evident if you compare the MongoDB home page side by side between the 2008 version and the 2020 version (see image below). The MongoDB community server download page is actually showing the features available in the commercial enterprise version!

[Image: MongoDB home page in 2008 and 2020]

There is also heavy competition from cloud vendors who are offering competing products. The main problem for MongoDB Inc. is that data storage is just one part of the enterprise application landscape. Without a compelling full stack of cloud services, MongoDB may find it hard in the future to compete with cloud vendors.

Eliot Horowitz (a key figure behind MongoDB) left the company in July 2020. He seems to still be in an advisory role, but there is a risk of the product losing its focus, reduced support for the free community version or further changes in licensing terms.

My Thoughts

MongoDB is a perfect example of how successful companies are formed around a focused open source technology product. It is also a brilliant example of how to pivot at the right time in a product lifecycle. With its simplicity and small install footprint, the MongoDB server demonstrates that it is still possible to build complex software without adding a lot of overhead. I hope that MongoDB Inc. will continue to support the community version in the coming years.

References/Further Reading