How to Use Azure CosmosDB Management API SDK for Python

Azure provides an excellent portal for managing various cloud resources. If you are working with CosmosDB database accounts, you will be familiar with the powerful interface available in the Azure portal. Sometimes you may need to access some of the CosmosDB management features programmatically. If you plan to build something on the command line, you can use either the Azure CLI or Azure PowerShell to access CosmosDB management features.

However, sometimes you may want direct programmatic access from your preferred programming language. The good news is that Azure management SDKs are available for common languages such as Python, .NET, Java and JavaScript. In the following sections I will show you how to use the Azure management APIs using the Azure SDK for Python. For this example I will specifically use the CosmosDB management API to list CosmosDB accounts and print the connection strings (access keys) for one of the database accounts.

How to List CosmosDB Accounts in an Azure Subscription Using Azure Management API for Python

The following code snippet assumes that you already have the Azure CLI set up so that user credentials are available through the AzureCliCredential class. See this page for other methods of providing user credentials or a service principal for API access. If you get the error "SubscriptionNotFound", check the subscription id used in the code and verify that you have logged into the subscription using the Azure CLI.
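
If you cannot use the Azure CLI login (for example, in an automated job), other credential types from the azure.identity package can be used instead of AzureCliCredential. Below is a minimal sketch using a service principal via ClientSecretCredential; the tenant id, client id and secret are placeholders that you must supply yourself.

# Sketch: using a service principal instead of the Azure CLI login
# The tenant/client ids and the secret below are placeholders, not real values
from azure.identity import ClientSecretCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient

credential = ClientSecretCredential(
    tenant_id="YOUR-TENANT-ID",
    client_id="YOUR-CLIENT-ID",
    client_secret="YOUR-CLIENT-SECRET",
)
cosmosdb_mgmt_client = CosmosDBManagementClient(credential, "YOUR-SUBSCRIPTION-ID")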

# This program will list all the CosmosDB database accounts under a subscription.
# Uses Azure Management API for Python
# Also assumes Azure CLI is installed and configured with user authorization. Hence this code doesn't expose Azure user id and password.

from azure.mgmt.cosmosdb import CosmosDBManagementClient
from azure.identity import AzureCliCredential

# Replace the following variable with your subscription id
subscription_id = 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX'

# Connect to CosmosDB management API
cosmosdb_mgmt_client = CosmosDBManagementClient(AzureCliCredential(), subscription_id)

# Query the list of CosmosDB accounts under the subscription specified above
cosmosdb_accounts = cosmosdb_mgmt_client.database_accounts.list()

# For each account, let us print the account name, id and access endpoint. Please note that the object has many more attributes available.
for db_account in cosmosdb_accounts:
    print(f"Name={db_account.name},Id={db_account.id},Endpoint={db_account.document_endpoint}")

# There are additional APIs that can be used to get more specific details of CosmosDB accounts
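
For example, once you know an account's resource group, the same client can fetch the full details of a single account. Here is a small sketch; the resource_group and account_name values are assumed to be known at this point, and the next section shows how to derive the resource group from the account id.

# Sketch: fetch detailed properties of a single account
# resource_group and account_name are assumed to be known here
account_details = cosmosdb_mgmt_client.database_accounts.get(resource_group, account_name)
print(account_details.kind, account_details.document_endpoint)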

How to List Connection Strings for a CosmosDB Account Using Azure CosmosDB SDK for Python

The following code snippet assumes that you already have the Azure CLI set up so that user credentials are available through the AzureCliCredential class. If you get the error "SubscriptionNotFound", check the subscription id used in the code and verify that you have logged into the subscription using the Azure CLI.

# This program will print the connection strings (access keys) for a CosmosDB database account under a subscription.
# Uses Azure Management API for Python
# Also assumes Azure CLI is installed and configured with user authorization. Hence this code doesn't expose Azure user id and password.

from azure.mgmt.cosmosdb import CosmosDBManagementClient
from azure.identity import AzureCliCredential
import re

# Replace the following variable with your subscription id
subscription_id = 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX'

# Connect to CosmosDB management API
cosmosdb_mgmt_client = CosmosDBManagementClient(AzureCliCredential(), subscription_id)


# The following code snippet shows how to print connection strings with access keys of an account

# Let us get the first account from the accounts list
first_account = next(cosmosdb_mgmt_client.database_accounts.list())
account_id = first_account.id
account_name = first_account.name

# Extract resource group name for the CosmosDB account since the next API needs resource group
resource_group = re.search('resourceGroups/(.*)/providers', account_id).group(1)


# Call API to list connection strings. Note that the API returns an object
cs_list_obj = cosmosdb_mgmt_client.database_accounts.list_connection_strings(
    resource_group,
    account_name
)

# This lists all 4 keys of CosmosDB accounts
# This includes primary read-write, secondary read-write, primary read only and secondary read only
for connection_string_obj in cs_list_obj.connection_strings:
    print(connection_string_obj.connection_string)

The CosmosDBManagementClient used above is quite powerful and can also be used for getting detailed metrics for CosmosDB accounts, databases or collections. This is useful if you plan to use your own solution for monitoring and alerting instead of Azure's built-in monitoring services. For one of my solutions, I use this API to fetch Max RUs Per Second, Mongo Query Request Charge and Throttled Requests for a CosmosDB account provisioned for the MongoDB API.
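
A rough sketch of how such a metrics query could look is shown below. The metric name, time grain and time window are illustrative, and the exact OData filter syntax accepted by database_accounts.list_metrics may vary between SDK versions, so treat this as a starting point rather than a definitive implementation. It reuses cosmosdb_mgmt_client, resource_group and account_name from the program above.

# Sketch: fetching account level metrics via the management client
# The filter string is illustrative; adjust the metric name, time grain and window to your needs
from datetime import datetime, timedelta

end_time = datetime.utcnow()
start_time = end_time - timedelta(hours=1)
metric_filter = (
    "(name.value eq 'Total Requests') and timeGrain eq duration'PT5M' "
    f"and startTime eq '{start_time.isoformat()}Z' and endTime eq '{end_time.isoformat()}Z'"
)

metrics = cosmosdb_mgmt_client.database_accounts.list_metrics(
    resource_group, account_name, metric_filter
)
for metric in metrics:
    print(metric.name.value, [value.maximum for value in metric.metric_values])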


How to Create LRU Cache in Python

What is an LRU Cache?

For any software product, application performance is one of the key quality attributes. One common technique used for improving the performance of a software application is to use a memory cache. A memory cache puts frequently used application data in the fast RAM of the computing device. Reading or writing data in an in-memory cache is usually much faster than reading or writing data from a database or a file. However, there is a limit to the amount of data that you can cache in memory since system RAM is a limited and expensive resource.

So in practical applications, you set a limit on the cache size and then try to make the best use of that space for your data requirements. One approach to working within a bounded cache size is the LRU (Least Recently Used) cache. In an LRU cache, whenever the cache runs out of space, the program evicts the least recently used item to make room for the data you want to cache. To do this, the algorithm keeps track of all cache items and how recently each one was used relative to the others.

Any generic cache implementation has two main operations,

  • get(key) - Get a cache entry given its unique identifier key.
  • put(key,value) - Insert a cache entry with its unique identifier.

In an LRU cache, put() and get() carry additional internal logic to track how recently each cache entry was accessed. During a put() operation, the cache checks its current size and, if it is out of space, evicts the least recently used entry before inserting the new one.

If you are using Python 3, you can either build your own LRU cache using basic data structures or you can use the built-in LRU cache implementation available in functools.lru_cache(). In this article, I will start with the basic data structure solution since it enables you to understand the LRU concept better.

How to Create an LRU Cache in Python?

Let us now create a simple LRU cache implementation using Python. It is relatively easy and concise due to the features of Python. The following program is tested on Python 3.6 and above.

Python provides an ordered hash table called OrderedDict (in the collections module) which retains the order in which keys were inserted. This order can therefore be used to track which entries were used most recently. Here is the strategy followed in the Python program given below,

  • Whenever get() is invoked, the item is removed from the dictionary and re-inserted, which moves it to the end of the ordered keys. This ensures that recently used items are always at the end of the dictionary.
  • Whenever put() is invoked, if we have run out of space, the first entry in the ordered keys is evicted to make room for the latest entry. This works because every get() moves items to the end of the ordered keys, and hence the first item is the least recently used one!

import collections

class SimpleLRUCache:
  def __init__(self, size):
    self.size = size
    self.lru_cache = collections.OrderedDict()

  def get(self, key):
    try:
      # Re-insert the entry so that it moves to the end (most recently used position)
      value = self.lru_cache.pop(key)
      self.lru_cache[key] = value
      return value
    except KeyError:
      return -1

  def put(self, key, value):
    try:
      # If the key already exists, remove it so the new value moves to the end
      self.lru_cache.pop(key)
    except KeyError:
      # New key: evict the least recently used entry (the first one) if the cache is full
      if len(self.lru_cache) >= self.size:
        self.lru_cache.popitem(last=False)
    self.lru_cache[key] = value

  def show_entries(self):
    print(self.lru_cache)



# Create an LRU Cache with a size of 3
cache = SimpleLRUCache(3)


cache.put("1","1")
cache.put("2","2")
cache.put("3","3")

cache.get("1")
cache.get("3")

cache.put("4","4") # This will replace 2
cache.show_entries() # shows 1,3,4
cache.put("5","5") # This will replace 1
cache.show_entries() # shows 3,4,5

[Diagram: LRU Cache in Python, illustrating how the LRU cache works in the above implementation]

How to Create an LRU Cache in Python Using functools?

Since an LRU cache is a common application need, Python from version 3.2 onwards provides a built-in LRU cache decorator as part of the functools module. This decorator can be applied to any function which takes a hashable key as input and returns the corresponding data object. When the function is called again with the same argument, the decorator returns the cached result instead of executing the function body, provided the entry is still in the cache!

from functools import lru_cache

@lru_cache(maxsize=3)
def get_data(key):
  print("Cache miss with "+key)
  # A costly I/O usually implemented below
  return key + ":value"

print(get_data("1"))
print(get_data("2"))
print(get_data("3"))
print(get_data("4"))
print(get_data("4"))
print(get_data("3"))
print(get_data("1"))

How to Download Blobs from Azure Storage Using Python

Azure blob storage offers a cheap and reliable solution for storing large amounts of unstructured data (such as images). Blob storage has no hierarchical structure, but you can emulate folders using blob names with slashes (/) in them. In this article, I will explore how we can use the Azure Python SDK to bulk download blob files from an Azure storage account.
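
For example, a blob name containing slashes behaves like a nested path even though the container itself is flat. Here is a minimal sketch; it assumes a container_client object has already been created, as shown in the full programs below.

# The container is flat, but a name like "photos/2021/trip/beach.jpg"
# appears as nested folders in the Azure portal and in storage explorers
blob_client = container_client.get_blob_client("photos/2021/trip/beach.jpg")
with open("beach.jpg", "rb") as data:
  blob_client.upload_blob(data, overwrite=True)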

When it comes to a Python SDK for Azure storage services, there are two options,

  • Azure Python SDK v2.1 - the older SDK, now deprecated
  • Azure Python SDK v12 - the latest SDK

The following code samples will be using the latest Azure Python SDK (v12).

Pre-requisites for Sample Python Programs

The samples below require Python 3.6 or above. On Windows, you can download it from the official Python website. On a Mac, use Homebrew to install Python 3,

brew install python3

Next, you will need the Azure Python SDK for blob storage access. Use pip to install it,

pip3 install azure-storage-blob --user

Now you are all set to run the following Python programs.

How to Bulk Download Files from Azure Blob Storage Using Python

The following Python program uses the Azure Python SDK for storage to download all blobs in a storage container to a specified local folder. The program will create local folders for blobs which use virtual folder names (names containing slashes).

Before running the program, ensure that you give proper values for MY_CONNECTION_STRING, MY_BLOB_CONTAINER and LOCAL_BLOB_PATH. The connection string can be obtained from the Azure portal and contains the account URL and access key. Note that the program may take a while if your storage account contains a large number of blob files. See the next program below for how this can be sped up using Python's ThreadPool class.

# download_blobs.py
# Python program to bulk download blob files from azure storage
# Uses the latest Python SDK (v12) for Azure blob storage
# Requires python 3.6 or above
import os
from azure.storage.blob import BlobServiceClient, BlobClient
from azure.storage.blob import ContentSettings, ContainerClient

# IMPORTANT: Replace connection string with your storage account connection string
# Usually starts with DefaultEndpointsProtocol=https;...
MY_CONNECTION_STRING = "REPLACE_THIS"

# Replace with blob container
MY_BLOB_CONTAINER = "myimages"

# Replace with the local folder where you want files to be downloaded
LOCAL_BLOB_PATH = "REPLACE_THIS"

class AzureBlobFileDownloader:
  def __init__(self):
    print("Intializing AzureBlobFileDownloader")

    # Initialize the connection to Azure storage account
    self.blob_service_client =  BlobServiceClient.from_connection_string(MY_CONNECTION_STRING)
    self.my_container = self.blob_service_client.get_container_client(MY_BLOB_CONTAINER)


  def save_blob(self,file_name,file_content):
    # Get full path to the file
    download_file_path = os.path.join(LOCAL_BLOB_PATH, file_name)

    # for nested blobs, create local path as well!
    os.makedirs(os.path.dirname(download_file_path), exist_ok=True)

    with open(download_file_path, "wb") as file:
      file.write(file_content)

  def download_all_blobs_in_container(self):
    my_blobs = self.my_container.list_blobs()
    for blob in my_blobs:
      print(blob.name)
      file_content = self.my_container.get_blob_client(blob).download_blob().readall()
      self.save_blob(blob.name, file_content)

# Initialize class and download files
azure_blob_file_downloader = AzureBlobFileDownloader()
azure_blob_file_downloader.download_all_blobs_in_container()

Fast/Parallel File Downloads from Azure Blob Storage Using Python

The following program uses the ThreadPool class in Python to download files in parallel from Azure storage. This substantially speeds up your download if you have good bandwidth. The program currently uses 10 threads, but you can increase the count if you want faster downloads. Don't forget to change the MY_CONNECTION_STRING, LOCAL_BLOB_PATH and MY_BLOB_CONTAINER variables.

# download_blobs_parallel.py
# Python program to bulk download blobs from azure storage
# Uses the latest Python SDK (v12) for Azure blob storage
# Requires python 3.6 or above

import os
from multiprocessing.pool import ThreadPool
from azure.storage.blob import BlobServiceClient, BlobClient
from azure.storage.blob import ContentSettings, ContainerClient

# IMPORTANT: Replace connection string with your storage account connection string
# Usually starts with DefaultEndpointsProtocol=https;...
MY_CONNECTION_STRING = "REPLACE_THIS_CONNECTION"

# Replace with blob container name
MY_BLOB_CONTAINER = "myimages"

# Replace with the local folder where you want downloaded files to be stored
LOCAL_BLOB_PATH = "REPLACE_THIS_PATH"

class AzureBlobFileDownloader:
  def __init__(self):
    print("Intializing AzureBlobFileDownloader")

    # Initialize the connection to Azure storage account
    self.blob_service_client =  BlobServiceClient.from_connection_string(MY_CONNECTION_STRING)
    self.my_container = self.blob_service_client.get_container_client(MY_BLOB_CONTAINER)

  def download_all_blobs_in_container(self):
    # get a list of blobs
    my_blobs = self.my_container.list_blobs()
    result = self.run(my_blobs)
    print(result)

  def run(self,blobs):
    # Download 10 files at a time!
    with ThreadPool(processes=10) as pool:
      return pool.map(self.save_blob_locally, blobs)

  def save_blob_locally(self,blob):
    file_name = blob.name
    print(file_name)
    file_content = self.my_container.get_blob_client(blob).download_blob().readall()

    # Get full path to the file
    download_file_path = os.path.join(LOCAL_BLOB_PATH, file_name)
    # for nested blobs, create local path as well!
    os.makedirs(os.path.dirname(download_file_path), exist_ok=True)

    with open(download_file_path, "wb") as file:
      file.write(file_content)
    return file_name

# Initialize class and download files
azure_blob_file_downloader = AzureBlobFileDownloader()
azure_blob_file_downloader.download_all_blobs_in_container()

How to Upload Files to Azure Storage Blobs Using Python

Azure storage blobs offer a very cost-effective and fast storage solution for unstructured data. It is an ideal solution if you want to serve content such as images. It is also possible to speed up content delivery performance by using the Azure CDN service with it.

Before running the following programs, ensure that you have the pre-requisites ready. In the following sample Python programs, I will be using the latest Python SDK (v12) for Azure storage blobs.

Install Python 3.6 or above. On a Mac, use Homebrew to install Python 3,

brew install python3

Install the Azure Blob storage client library for Python package,

pip3 install azure-storage-blob --user

Using Azure portal, create an Azure storage v2 account and a container before running the following programs. You will also need to copy the connection string for your storage account from the Azure portal. If you want public access to uploaded images, set the container public access level to "Blob (anonymous read access for blobs only)".

How to Upload Files to Azure Storage Blobs Using Python

The following program demonstrates a typical use case where you want to bulk upload a set of jpg images from a local folder to an Azure blob storage container. Note that for a large number of files, this program may not be efficient since it uploads the images sequentially. Replace MY_CONNECTION_STRING, MY_IMAGE_CONTAINER and LOCAL_IMAGE_PATH before running the program.

# upload_blob_images.py
# Python program to bulk upload jpg image files as blobs to azure storage
# Uses the latest Python SDK (v12) for Azure blob storage
# Requires python 3.6 or above
import os
from azure.storage.blob import BlobServiceClient, BlobClient
from azure.storage.blob import ContentSettings, ContainerClient

# IMPORTANT: Replace connection string with your storage account connection string
# Usually starts with DefaultEndpointsProtocol=https;...
MY_CONNECTION_STRING = "REPLACE_THIS"

# Replace with blob container. This should be already created in azure storage.
MY_IMAGE_CONTAINER = "myimages"

# Replace with the local folder which contains the image files for upload
LOCAL_IMAGE_PATH = "REPLACE_THIS"

class AzureBlobFileUploader:
  def __init__(self):
    print("Intializing AzureBlobFileUploader")

    # Initialize the connection to Azure storage account
    self.blob_service_client =  BlobServiceClient.from_connection_string(MY_CONNECTION_STRING)

  def upload_all_images_in_folder(self):
    # Get all files with jpg extension and exclude directories
    all_file_names = [f for f in os.listdir(LOCAL_IMAGE_PATH)
                    if os.path.isfile(os.path.join(LOCAL_IMAGE_PATH, f)) and f.lower().endswith(".jpg")]

    # Upload each file
    for file_name in all_file_names:
      self.upload_image(file_name)

  def upload_image(self,file_name):
    # Create blob with same name as local file name
    blob_client = self.blob_service_client.get_blob_client(container=MY_IMAGE_CONTAINER,
                                                          blob=file_name)
    # Get full path to the file
    upload_file_path = os.path.join(LOCAL_IMAGE_PATH, file_name)

    # Create blob on storage
    # Overwrite if it already exists!
    image_content_setting = ContentSettings(content_type='image/jpeg')
    print(f"uploading file - {file_name}")
    with open(upload_file_path, "rb") as data:
      blob_client.upload_blob(data,overwrite=True,content_settings=image_content_setting)


# Initialize class and upload files
azure_blob_file_uploader = AzureBlobFileUploader()
azure_blob_file_uploader.upload_all_images_in_folder()

Parallel Bulk Upload of Files to Azure Storage Blobs Using Python

The following Python program is an improved version of the above program. This program uses a thread pool to upload a predefined number of images in parallel. Note that the program uses 10 as the thread pool count, but you can increase it for faster uploads if you have sufficient network bandwidth. If you don't specify a content type, it will default to application/octet-stream.

# upload_blob_images_parallel.py
# Python program to bulk upload jpg image files as blobs to azure storage
# Uses ThreadPool for faster parallel uploads!
# Uses the latest Python SDK (v12) for Azure blob storage
# Requires python 3.6 or above
import os
from multiprocessing.pool import ThreadPool
from azure.storage.blob import BlobServiceClient, BlobClient
from azure.storage.blob import ContentSettings, ContainerClient

# IMPORTANT: Replace connection string with your storage account connection string
# Usually starts with DefaultEndpointsProtocol=https;...
MY_CONNECTION_STRING = "REPLACE_THIS"

# Replace with blob container
MY_IMAGE_CONTAINER = "myimages"

# Replace with the local folder which contains the image files for upload
LOCAL_IMAGE_PATH = "REPLACE_THIS"

class AzureBlobFileUploader:
  def __init__(self):
    print("Intializing AzureBlobFileUploader")

    # Initialize the connection to Azure storage account
    self.blob_service_client =  BlobServiceClient.from_connection_string(MY_CONNECTION_STRING)

  def upload_all_images_in_folder(self):
    # Get all files with jpg extension and exclude directories
    all_file_names = [f for f in os.listdir(LOCAL_IMAGE_PATH)
                    if os.path.isfile(os.path.join(LOCAL_IMAGE_PATH, f)) and f.lower().endswith(".jpg")]

    result = self.run(all_file_names)
    print(result)

  def run(self,all_file_names):
    # Upload 10 files at a time!
    with ThreadPool(processes=10) as pool:
      return pool.map(self.upload_image, all_file_names)

  def upload_image(self,file_name):
    # Create blob with same name as local file name
    blob_client = self.blob_service_client.get_blob_client(container=MY_IMAGE_CONTAINER,
                                                          blob=file_name)
    # Get full path to the file
    upload_file_path = os.path.join(LOCAL_IMAGE_PATH, file_name)

    # Create blob on storage
    # Overwrite if it already exists!
    image_content_setting = ContentSettings(content_type='image/jpeg')
    print(f"uploading file - {file_name}")
    with open(upload_file_path, "rb") as data:
      blob_client.upload_blob(data,overwrite=True,content_settings=image_content_setting)
    return file_name

# Initialize class and upload files
azure_blob_file_uploader = AzureBlobFileUploader()
azure_blob_file_uploader.upload_all_images_in_folder()

Using the ContentSettings object, it is possible to set the content type, content encoding, content MD5 or cache control for the blobs. See here for the full set of content_settings attributes.
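
For example, a ContentSettings object covering several of these attributes could look like the sketch below. The values are illustrative, and blob_client and data refer to the objects used in the upload programs above.

from azure.storage.blob import ContentSettings

# Illustrative values; set only the attributes you actually need
settings = ContentSettings(
    content_type="image/jpeg",
    cache_control="public, max-age=86400",
    content_disposition="inline",
)
blob_client.upload_blob(data, overwrite=True, content_settings=settings)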

How to Change Content Type of Azure Storage Blobs Using Python SDK

Microsoft Azure cloud platform offers a comprehensive portfolio of storage services. Each of these services is suitable for specific use cases. Here is a quick overview of Azure storage services,

  • Azure Storage Account - An Azure storage account can be used for a wide variety of workloads. Using this account, it is possible to create blob storage or file storage. Blob storage is suitable for storing large sets of unstructured data (images, for example). File storage makes content available over the SMB protocol. It is also possible to create storage queues or tables using this storage account. For analytical workloads, storage accounts can be configured as Azure Data Lake Storage Gen2 with a hierarchical namespace. Finally, there is also the archive access tier, offering a lower-cost option.
  • Azure Managed Disk - This offers high-performance block storage. There are plenty of choices for managed disks, such as ultra disk storage, premium SSD, standard SSD and standard HDD.
  • Azure HPC Cache - Powered by an Azure storage account, this service offers caching of files in the cloud for high-throughput needs.

Out of these services, the Azure storage account is the most useful and powerful one. Blob storage in particular is suited for a wide range of use cases where large sets of unstructured binary data such as images are involved. It can also integrate with CDNs for faster content delivery.

There are 4 different ways of accessing Azure blob storage. These are,

  • Azure portal - Use the web based azure portal
  • Azure CLI - Command line interface
  • Azure PowerShell - PowerShell-based command line interface
  • Azure SDKs (Python, .NET etc.) - SDKs based on various languages

In this article, we are looking at accessing blob storage using Python. When it comes to a Python SDK for Azure storage services, there are two options,

  • Azure Python SDK v2.1 - the older SDK, now deprecated
  • Azure Python SDK v12 - the latest SDK

Since the Azure Python SDK v2.1 is deprecated, we will be using the Microsoft Azure Python SDK v12 for the following examples. Python 3.6 or above is required.
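
If you are not sure which generation of the SDK is installed in your environment, a quick way to check the installed package version is shown below (a small sketch; importlib.metadata requires Python 3.8 or above).

from importlib.metadata import version

# A 12.x.y version indicates the current azure-storage-blob SDK
print(version("azure-storage-blob"))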

How to Change Content Type of Azure Storage Blobs Using Python SDK

Before using the Python SDK for Azure storage, ensure that you have the following pre-requisites installed.

Step 1: Install Python 3.6 or above. On a Mac, use Homebrew to install Python 3,

brew install python3

Step 2: Install the Azure Blob storage client library for Python package,

pip3 install azure-storage-blob --user

Run the following program to convert the content type of all files with the .jpg extension to image/jpeg in Azure blob storage using the Python SDK. You can also customize the same program for changing the content encoding, content MD5 or cache control of the blobs. See here for the full set of content_settings attributes.

The program assumes the existence of an Azure storage account with a container containing image files that have the wrong content type (application/octet-stream).

# Python program to change content type of .jpg files to image/jpeg in Azure blob storage
# This is useful if you accidentally uploaded jpg files with application/octet-stream content type
# Requires python 3.6 or above
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

# IMPORTANT: Replace connection string with your storage account connection string
# This usually starts with DefaultEndpointsProtocol=https;..
MY_CONNECTION_STRING = "REPLACE_THIS"
MY_IMAGE_CONTAINER = "myimages"

class AzureBlobImageProcessor:
  def __init__(self):
    print("Intializing AzureBlobImageProcessor")

    # Ensure that the container mentioned exists in storage account (myimages in this example)
    self.container_client = BlobServiceClient.from_connection_string(MY_CONNECTION_STRING).get_container_client(MY_IMAGE_CONTAINER)

  def change_content_type_for_jpg_files(self):
    print("changing content type for all files with .jpg extension")

    # You can optionally pass a prefix for blob names to list_blobs()
    # This is useful if you want to process a large number of files
    # You can run multiple instances of this program with different prefixes!
    blob_list = self.container_client.list_blobs()
    file_count = 0
    for blob in blob_list:
      if ".jpg" in blob.name:

        # Print file name and current content type to monitor progress
        print(f"For file {blob.name}  current type is {blob.content_settings.content_type}")

        # Note that in addition to content_type, you can also set the following values,
        # content_encoding, content_language, content_disposition, cache_control and content_md5
        blob.content_settings.content_type = "image/jpeg"
        self.container_client.get_blob_client(blob).set_http_headers(blob.content_settings)
        file_count += 1

    print(f"changing content type completed. Processed {file_count} files")

# Initialize class and change content type for all files
azure_blob_image_processor = AzureBlobImageProcessor()
azure_blob_image_processor.change_content_type_for_jpg_files()

Save the above program in a file named change_blob_content_type.py and then run the following command. Don't forget to update the MY_CONNECTION_STRING and MY_IMAGE_CONTAINER variables before running the program,

python3 change_blob_content_type.py

I used this program for converting around 200,000 image files which were wrongly uploaded as application/octet-stream. I ran the program from a VM located in the same region as the storage account to speed up processing. Running it locally may take a lot of time. This program is useful for bulk conversion of Azure blob file content types.

If you can partition the set of files in storage using the blob name prefix, there is a way to run the content type conversion in parallel. For each instance of the Python program, you can pass a different blob name prefix to the list_blobs() method. Assume you have blobs with the following full names,

  • subset1/a.jpg
  • subset1/b.jpg
  • subset2/c.jpg
  • subset2/d.jpg

You can run two Python program instances, one using list_blobs(name_starts_with="subset1") and the other using list_blobs(name_starts_with="subset2"), to speed up the conversion.
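
Here is a minimal sketch of how the prefix could be supplied to each instance. The command line argument handling is an illustrative addition and not part of the original program.

# Sketch: run one instance per blob name prefix, for example
#   python3 change_blob_content_type.py subset1
#   python3 change_blob_content_type.py subset2
import sys

blob_name_prefix = sys.argv[1] if len(sys.argv) > 1 else None

# ...and inside change_content_type_for_jpg_files(), list only the matching blobs:
# blob_list = self.container_client.list_blobs(name_starts_with=blob_name_prefix)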