How to Get Distinct Keys of All Documents in a Collection in MongoDB

MongoDB is a document database designed for applications where not all the fields of a stored entity are known when the application is built. This means that MongoDB collections can contain documents with different schemas. Some documents may have additional or different keys compared to other documents. This flexibility allows a system to evolve rapidly since the documents are not constrained by a fixed schema.

However, such flexibility and power come with their own headaches. Since different documents can have different keys and structure, the application logic must be robust enough to handle all the combinations.

When an application is rapidly evolving and new documents are being added, we may need a consolidated view of all the keys present across the documents in a collection. The following script iterates through all the documents in the collection and builds a distinct set of the top level keys used across them. You can use this to check whether any document contains a key that your application does not expect. Such a key may have been added by an intermediate production release of the application.

Script to Get Distinct Set of Top Level Keys of All Documents in a MongoDB Collection

Let us first create a sample collection named user which is intended to capture the user details of an application. Let us create 3 documents which represent 3 releases of the application, each progressively adding a new key to the document.

// create a user collection
db.createCollection("user");

// Initially the user collection has 3 fields
db.user.insertOne({ name:"jack", 
email:"jack@quickprogrammingtips.com", age:40 });

// In release 2, we added a new field for title. Hence newer user records contain it
db.user.insertOne({ name:"tom", 
email:"tom@quickprogrammingtips.com", age:42, title:"mr" });

// In release 3, we added a new field for mobile number.
db.user.insertOne({ name:"ted", 
email:"ted@quickprogrammingtips.com", age:44, title:"mr", mobile:"99999999" });

When there are a large number of changes in the data model across releases, we need a script to find the complete list of distinct keys across the documents in the collection. The following script finds all the distinct top level keys in our user collection.

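// Collects every distinct top level key (including _id)
var keys = {};

db.user.find().forEach(function(doc){
    for (var key in doc){
        if(!(key in keys)){
           keys[key] = key;
        }
    }
});

// printjson shows the content of the object
printjson(keys)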
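
For the three sample documents above, the script should report name, email, age, title and mobile, plus _id, which MongoDB adds to every stored document. The sketch below reproduces the same loop in plain JavaScript on an in-memory array, so it can be tried without a database, and then compares the result with the keys the application expects. The names sampleDocs, collectTopLevelKeys and findUnexpectedKeys, as well as the list of expected keys, are made up for this illustration.

```javascript
// Hypothetical in-memory copies of the three sample documents (no _id here,
// since they were never stored in MongoDB).
var sampleDocs = [
    { name: "jack", email: "jack@quickprogrammingtips.com", age: 40 },
    { name: "tom", email: "tom@quickprogrammingtips.com", age: 42, title: "mr" },
    { name: "ted", email: "ted@quickprogrammingtips.com", age: 44, title: "mr", mobile: "99999999" }
];

// Same loop as the script above, applied to an array instead of a cursor.
function collectTopLevelKeys(docs) {
    var keys = {};
    docs.forEach(function (doc) {
        for (var key in doc) {
            keys[key] = key;
        }
    });
    return Object.keys(keys);
}

// Return the keys that were found but are not in the expected list.
function findUnexpectedKeys(foundKeys, expectedKeys) {
    return foundKeys.filter(function (key) {
        return expectedKeys.indexOf(key) === -1;
    });
}

var foundKeys = collectTopLevelKeys(sampleDocs);
// Assumption: the application only knows about the release 2 fields.
var unexpectedKeys = findUnexpectedKeys(foundKeys, ["name", "email", "age", "title"]);

console.log(foundKeys);      // [ 'name', 'email', 'age', 'title', 'mobile' ]
console.log(unexpectedKeys); // [ 'mobile' ]
```

Here mobile is flagged because it was introduced in release 3, after the assumed list of expected keys was written.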

One limitation of the above script is that it only prints the top level keys of the documents. The following MongoDB script prints the superset of all keys, including the nested keys. However, there are limitations when it comes to arrays.

How to Get Distinct Set of All Keys of All Documents in a MongoDB Collection

var keys = {};

// Recursive function to collect all nested keys of a document
function getKeys(doc, keys, prefix) {
    for (var key in doc){
        // note that we are ignoring _id
        if(key == '_id') continue;

        var value = doc[key];
        // recurse into non-empty sub documents and arrays
        if(value !== null && typeof value == "object" && Object.keys(value).length > 0) {
            getKeys(value, keys, prefix+key+".")
        }else {
            // leaf value (also null, dates and empty objects): record the full path
            keys[prefix+key] = prefix+key;
        }
    }
}

// print all keys. Replace "user" with your collection name.
db.user.find().forEach(function(doc) {
    getKeys(doc,keys,"")
});
printjson(keys)

Given the following document,

{
    "_id" : ObjectId("5f1b277b49eebe76ce081e19"),
    "name" : "jack",
    "email" : "jack@quickprogrammingtips.com",
    "age" : 40.0,
    "address" : {
        "line1" : "l1",
        "line2" : "l2"
    },
    "GlossSeeAlso" : [
        "GML",
        {
            "x" : "1"
        },
        "XML"
    ],
    "abc" : [
        {
            "id" : 28,
            "Title" : "Sweden"
        },
        {
            "id" : 56,
            "Title" : "USA"
        },
        {
            "id" : 89,
            "Title" : "England"
        }
    ]
}

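The output for the whole collection is shown below. The title and mobile keys come from the other documents in the collection.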

{
    "name" : "name",
    "email" : "email",
    "age" : "age",
    "title" : "title",
    "address.line1" : "address.line1",
    "address.line2" : "address.line2",
    "GlossSeeAlso.0" : "GlossSeeAlso.0",
    "GlossSeeAlso.1.x" : "GlossSeeAlso.1.x",
    "GlossSeeAlso.2" : "GlossSeeAlso.2",
    "abc.0.id" : "abc.0.id",
    "abc.0.Title" : "abc.0.Title",
    "abc.1.id" : "abc.1.id",
    "abc.1.Title" : "abc.1.Title",
    "abc.2.id" : "abc.2.id",
    "abc.2.Title" : "abc.2.Title",
    "mobile" : "mobile"
}

Note that array elements are reported by index, so the output contains a separate key for every array position, up to the length of the longest array across all documents.
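
If the individual array positions are not of interest, one option is to leave the index out of the path so that every element of an array contributes to the same key. The following is a minimal sketch of that idea in plain JavaScript. It is a variant written for illustration, not part of the script above; the function name getKeysIgnoringArrayIndexes and the sample document are assumptions.

```javascript
// Variant of getKeys that does not add array indexes to the key path.
function getKeysIgnoringArrayIndexes(doc, keys, prefix) {
    for (var key in doc) {
        // ignore _id, as in the script above
        if (key === "_id") continue;

        var value = doc[key];
        // inside an array, drop the index so all elements share one path
        var path = Array.isArray(doc) ? prefix : prefix + key + ".";

        if (value !== null && typeof value === "object" && Object.keys(value).length > 0) {
            getKeysIgnoringArrayIndexes(value, keys, path);
        } else {
            // remove the trailing "." to get the final key name
            var name = path.slice(0, -1);
            keys[name] = name;
        }
    }
}

// Hypothetical sample document, modelled on the one shown earlier.
var collapsedKeys = {};
getKeysIgnoringArrayIndexes({
    name: "jack",
    address: { line1: "l1", line2: "l2" },
    GlossSeeAlso: ["GML", { x: "1" }, "XML"],
    abc: [{ id: 28, Title: "Sweden" }, { id: 56, Title: "USA" }]
}, collapsedKeys, "");

console.log(Object.keys(collapsedKeys));
```

With this variant, abc.0.id and abc.1.id are both reported as abc.id. The trade-off is that a field holding an array can no longer be told apart from a field holding a single sub document.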