How to Find the Document with the Longest Array in a MongoDB Collection?

When working with MongoDB collections, you may sometimes want to find the document that contains the longest array. For example, you may have a collection of shirts in which the array “sizes” lists all the sizes in which each shirt is available. In this case, how do you find the shirt with the largest number of sizes? In other words, which shirt model offers the largest choice of sizes?

Let us assume that we use the collection “shirts” to store our shirt models available for sale. Let us populate it with some sample data.

db.shirts.insertMany([
{ "_id" : 1, "name" : "Blue Shirt", sizes: [ "S", "M", "L"] },
{ "_id" : 2, "name" : "Striped Shirt", sizes: [ "S", "M", "L","XL","XXL"] },
{ "_id" : 3, "name" : "Black Shirt", sizes: [ "S", "M", "L","XL","XXL","XXXL"] },
{ "_id" : 4, "name" : "Grey Shirt", sizes: [ "S", "M", "L","XXL"] },
{ "_id" : 5, "name" : "Green Shirt", sizes: [ "S", "M", "L"] },
]);

From the data, it is clear that the Black Shirt has the largest number of sizes. How do you find this using a MongoDB query? We can use a MongoDB aggregation pipeline to get the document with the longest array. The following script returns the _id of the Black Shirt document, which has the largest number of sizes, along with the length of its array.

db.shirts.aggregate( [
  { $unwind : "$sizes" },
  { $group : { _id : "$_id", length : { $sum : 1 } } },
  { $sort : { length : -1 } },
  { $limit : 1 }
] )

How does the above script work? Here is what happens:

  • The $unwind stage deconstructs the array field of each input document and outputs one document per array element. Documents whose array is missing or empty produce no output
  • The $group stage groups the documents by _id
  • The $sum accumulator counts the documents in each group (because of the _id grouping, this equals the array size)
  • The $sort stage sorts the groups in descending order of array length
  • Finally, $limit returns the document id with the largest array and the size of that array
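The $unwind approach creates one intermediate document per array element. As an alternative, the array length can be computed directly with the $size operator. The following is a minimal sketch, assuming the array field is named "sizes" as in the sample data; the variable name longestArrayPipeline is only illustrative. The typeof check lets the snippet load outside the shell without failing.

```javascript
// Assumes the array field is named "sizes", as in the sample data
const longestArrayPipeline = [
  {
    $project: {
      name: 1,
      // $size raises an error for missing or non-array fields,
      // so fall back to a length of 0 in that case
      length: { $cond: [{ $isArray: "$sizes" }, { $size: "$sizes" }, 0] }
    }
  },
  // Longest array first
  { $sort: { length: -1 } },
  { $limit: 1 }
];

// Run the pipeline only when a shell-style `db` object is available
if (typeof db !== "undefined") {
  printjson(db.shirts.aggregate(longestArrayPipeline).toArray());
}
```

With the sample data, this should return the Black Shirt document with a length of 6. Unlike the $unwind version, documents with an empty or missing array are kept and reported with length 0.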

How to Find Largest Document in MongoDB

MongoDB collections are usually schemaless. This means that a collection can contain documents with different structures. Even when the structure is the same, document size can vary substantially because the stored data differs. Sometimes you may want to find the largest document in a MongoDB collection. This is required if you are planning to migrate your MongoDB database to a cloud database such as Azure CosmosDB, which restricts the maximum size of a single document in a collection (currently 2MB).

How to Find the Largest Document in a MongoDB Collection?

The following script finds the largest document in a collection. The example uses the “store” collection in my database. Replace it with your collection name.

var max = 0, id = null;

// iterate through each document
db.store.find().forEach(doc => {
    var size = Object.bsonsize(doc);
    if(size > max) {
        max = size;
        id = doc._id;
    }
});

// document id and size of the largest doc in bytes
print(""+id+", "+max);

Note that the above script loads every document in the collection to compute its size. Hence this can be very slow in large collections.

How to Find the Largest Document in a MongoDB Database Across Collections?

The following script finds the largest document across all collections in the MongoDB database. It prints the collection name, document id and size in bytes. For large databases, this may take minutes or even hours. Also note that it loads every document to compute its size and hence can incur a lot of network traffic.

var max = 0, id = null, c = null ;

// Iterate through all the collections!
db.getCollectionNames().forEach(function(cName) {

    // Now iterate through each document in the collection
    db[cName].find().forEach(doc => {
        var size = Object.bsonsize(doc);
        if(size > max) {
            max = size;
            id = doc._id;
            c = cName
        }
    });
});

// Collection name, document id and size of the largest doc in bytes
print(c+", "+id+", "+max);

If you are using MongoDB 4.4 or later, you can use the aggregation operator $bsonSize to find the size of a document. Since the size is then computed on the server and the documents need not be transferred to the client, this can substantially speed up the computation.
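The $bsonSize approach can be sketched as follows, assuming MongoDB 4.4 or later. The names largestDocPipeline and findLargestDocument are hypothetical helpers, and the database argument is expected to behave like the shell's db object.

```javascript
// Compute each document's BSON size on the server and keep only the largest.
// Assumes MongoDB 4.4+, where $bsonSize is available.
const largestDocPipeline = [
  { $project: { size: { $bsonSize: "$$ROOT" } } },
  { $sort: { size: -1 } },
  { $limit: 1 }
];

// Hypothetical helper: returns { collection, id, size } for the largest
// document across all collections, or null if every collection is empty.
function findLargestDocument(database) {
  let largest = null;
  database.getCollectionNames().forEach(function (cName) {
    const top = database.getCollection(cName).aggregate(largestDocPipeline).toArray();
    if (top.length > 0 && (largest === null || top[0].size > largest.size)) {
      largest = { collection: cName, id: top[0]._id, size: top[0].size };
    }
  });
  return largest;
}

// Run only when a shell-style `db` object is available
if (typeof db !== "undefined") {
  printjson(findLargestDocument(db));
}
```

Only one small result document per collection travels over the network, although the server still has to read every document to measure it.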

How to Get Sorted List of Collection and Index Sizes in a MongoDB Database

If you have a large MongoDB database in production with a large number of collections and indexes, you may need to analyse their sizes periodically. Here are a few MongoDB script snippets that you may find useful.

How to Get Sorted List of MongoDB Collections by Size?

The following script prints the collections in descending order of size. The size here refers to the total size in memory of all records in a collection. This is different from the storage size of the collection. Note that the size printed is in kilobytes.

var sizeArray = [];

// Find statistics of every collection and add to an array
db.getCollectionNames().forEach(function(cName) {
    sizeArray.push(db[cName].stats());
});

// Sort the stats array by size in descending order. The field 'size' is used
sizeArray.sort(function(a, b) { return b['size'] - a['size']; });

// Print output in kb
sizeArray.forEach(function(stat) {
    print(stat['ns'] + ": " + (stat['size']/1024).toFixed(2) + " kb");
});

The following is sample output:

testdb.store: 3797.75 kb
testdb.customer: 526.26 kb
testdb.user: 0.52 kb

In the above snippet, change the field from “size” to “storageSize”, in both the sort and the print statement, if you want to sort by storage size.
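Switching between “size” and “storageSize” is easier if the field name is a parameter. The following is a minimal sketch; sortStatsBy and formatKb are hypothetical helper names, not part of the MongoDB shell.

```javascript
// Hypothetical helper: returns a new array sorted by the given numeric field
// (for example "size" or "storageSize"), largest first.
function sortStatsBy(statsArray, field) {
  return statsArray.slice().sort(function (a, b) {
    return (b[field] || 0) - (a[field] || 0);
  });
}

// Hypothetical helper: converts a byte count to a string in kilobytes
function formatKb(bytes) {
  return (bytes / 1024).toFixed(2) + " kb";
}

// In the shell, collect the stats first and then sort by the field of interest
if (typeof db !== "undefined") {
  const stats = db.getCollectionNames().map(function (cName) {
    return db.getCollection(cName).stats();
  });
  sortStatsBy(stats, "storageSize").forEach(function (s) {
    print(s.ns + ": " + formatKb(s.storageSize));
  });
}
```

Because sortStatsBy copies the array before sorting, the same stats can be printed once by “size” and once by “storageSize” without collecting them again.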

How to Get Sorted List of All MongoDB Indexes by Size in a Database?

The following script iterates through all indexes of all collections and then prints them in descending order of index size.

var sizeMap = {};

// Iterate through every collection
db.getCollectionNames().forEach(function(cName) {
   var indexes = db[cName].stats().indexSizes;

   // Now iterate through every index of the current collection
   // We create a map here with key as combination of collection name and index name
   for (var i in indexes) sizeMap[cName + " - " + i] = indexes[i];
});

var sizeArray = [];

// Map is converted to an array, each element of which is a two member array
// This inner array contains the collection+index name key and the size itself
for (var key in sizeMap) sizeArray.push([key, sizeMap[key]]);

// Now sort outer array using the second column of inner array
sizeArray.sort(function(a, b) { return b[1] - a[1]; });

// Print list of index sizes in sorted form
sizeArray.forEach(function(entry) {
    print(entry[0] + ": " + (entry[1]/1024).toFixed(2) + " kb");
});

The following is sample output:

store - country_1_city_1: 848.00 kb
store - _id_: 448.00 kb
customer - _id_: 96.00 kb
customer - age_1: 80.00 kb
user - _id_: 32.00 kb

How to Get Distinct Keys of All Documents in a Collection in MongoDB

MongoDB is intended as a document storage database for cases where we don’t know all the fields required in a stored entity when we build an application. This means that MongoDB collections can contain documents with different schemas. Some documents may have additional keys or different keys compared to other documents. This flexibility allows the system to evolve rapidly since there is no constrained schema for the document.

However, such flexibility and power come with their own headaches. Since different documents can have different keys and structure, the application logic should be robust enough to handle all the combinations.

When an application is rapidly evolving and new documents are being added, we may need to get a consolidated view of all the keys present across the documents in a collection. The following script iterates through all the documents in the collection and creates a distinct set of top level keys used across them. You can use this to check if there is any key present in any document which you are not expecting in your application. Such a key may have been added by an intermediate production release of the application.

Script to Get Distinct Set of Top Level Keys of All Documents in a MongoDB Collection

Let us first create a sample collection named user, which is intended to capture the user details of an application. Let us create 3 documents representing 3 releases of the application, each progressively adding a new key to the document.

// create a user collection
db.createCollection("user");

// Initially user collection has 3 fields
db.user.insertOne({ name:"jack",
    email:"jack@quickprogrammingtips.com", age:40 });

// In release 2, we added a new field for title. Hence newer user records contain it
db.user.insertOne({ name:"tom",
    email:"tom@quickprogrammingtips.com", age:42, title:"mr" });

// In release 3, we added a new field for mobile number.
db.user.insertOne({ name:"ted",
    email:"ted@quickprogrammingtips.com", age:44, title:"mr", mobile:"99999999" });

When there are a large number of changes in the data model across releases, we need a script to find the complete list of distinct keys across the documents in the collection. The following script finds all the distinct top-level keys in our user collection.

var keys = {};

db.user.find().forEach(function(doc){
    for (var key in doc){
        if(!(key in keys)){
           keys[key] = key;
        }
    }
});

// printjson shows the contents of the object
printjson(keys)
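The same set of top-level keys can also be computed on the server with an aggregation pipeline. The following is a minimal sketch, assuming MongoDB 3.4.4 or later, where the $objectToArray operator is available; the variable name distinctKeysPipeline is only illustrative.

```javascript
// Assumes MongoDB 3.4.4 or later, where $objectToArray is available
const distinctKeysPipeline = [
  // Turn each document into an array of { k: <key>, v: <value> } pairs
  { $project: { kv: { $objectToArray: "$$ROOT" } } },
  // One output document per key
  { $unwind: "$kv" },
  // Collect the distinct key names into a single set
  { $group: { _id: null, keys: { $addToSet: "$kv.k" } } }
];

// Run only when a shell-style `db` object is available
if (typeof db !== "undefined") {
  printjson(db.user.aggregate(distinctKeysPipeline).toArray());
}
```

Like the loop above, this pipeline only reports top-level keys, and the order of the names in the resulting set is not guaranteed.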

One limitation of the above script is that it only prints the top-level keys of the document. The following MongoDB script prints the superset of all keys, including the nested keys. However, there are limitations when it comes to arrays.

How to Get Distinct Set of All Keys of All Documents in a MongoDB Collection

var keys = {};

// Recursive function to collect all nested keys of a document
function getKeys(doc, keys, prefix) {
    for (var key in doc) {
        // note that we are ignoring _id
        if (key == '_id') continue;

        var value = doc[key];
        if (value !== null && typeof value == "object") {
            // nested document or array: recurse with the dotted prefix
            getKeys(value, keys, prefix + key + ".");
        } else {
            // record the full dotted path so that nested keys are not
            // confused with top-level keys of the same name
            keys[prefix + key] = prefix + key;
        }
    }
}

// print all keys. Replace "user" with your collection name.
db.user.find().forEach(function(doc) {
    getKeys(doc, keys, "");
});
printjson(keys)

Suppose the user collection created earlier also contains the following document:

{
    "_id" : ObjectId("5f1b277b49eebe76ce081e19"),
    "name" : "jack",
    "email" : "jack@quickprogrammingtips.com",
    "age" : 40.0,
    "address" : {
        "line1" : "l1",
        "line2" : "l2"
    },
    "GlossSeeAlso" : [
        "GML",
        {
            "x" : "1"
        },
        "XML"
    ],
    "abc" : [
        {
            "id" : 28,
            "Title" : "Sweden"
        },
        {
            "id" : 56,
            "Title" : "USA"
        },
        {
            "id" : 89,
            "Title" : "England"
        }
    ]
}

The output is:

{
    "name" : "name",
    "email" : "email",
    "age" : "age",
    "title" : "title",
    "address.line1" : "address.line1",
    "address.line2" : "address.line2",
    "GlossSeeAlso.0" : "GlossSeeAlso.0",
    "GlossSeeAlso.1.x" : "GlossSeeAlso.1.x",
    "GlossSeeAlso.2" : "GlossSeeAlso.2",
    "abc.0.id" : "abc.0.id",
    "abc.0.Title" : "abc.0.Title",
    "abc.1.id" : "abc.1.id",
    "abc.1.Title" : "abc.1.Title",
    "abc.2.id" : "abc.2.id",
    "abc.2.Title" : "abc.2.Title",
    "mobile" : "mobile"
}

Note that array elements are reported by index, so the output contains every array index up to the highest one found across all documents.
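If the per-index entries are too noisy, array elements can be recorded under a single marker instead. The following is a minimal sketch of that idea; collectKeyPaths is a hypothetical variant of getKeys, and the "[]" marker is an arbitrary choice.

```javascript
// Hypothetical variant of getKeys: array elements are recorded under "[]"
// instead of their numeric index. Empty arrays and empty objects
// contribute no paths.
function collectKeyPaths(value, prefix, paths) {
  if (Array.isArray(value)) {
    value.forEach(function (item) {
      collectKeyPaths(item, prefix + "[]", paths);
    });
  } else if (value !== null && typeof value === "object" && !(value instanceof Date)) {
    Object.keys(value).forEach(function (key) {
      if (key === "_id") return; // ignore _id, as in the script above
      collectKeyPaths(value[key], prefix === "" ? key : prefix + "." + key, paths);
    });
  } else if (prefix !== "") {
    // scalar value (or Date): record the path that led here
    paths.add(prefix);
  }
  return paths;
}

// In the shell, accumulate the paths of every document into one Set
if (typeof db !== "undefined") {
  const paths = new Set();
  db.user.find().forEach(function (doc) {
    collectKeyPaths(doc, "", paths);
  });
  printjson(Array.from(paths).sort());
}
```

For the sample document above, the three "abc" elements collapse into "abc[].id" and "abc[].Title". Shell BSON types such as ObjectId are objects as well, so nested values of those types may need special handling.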

How to Get All Indexes in a MongoDB Database as a Script

When you have collections with a large amount of data, it is very important to create proper indexes. MongoDB queries can be substantially faster with proper indexes created on collections to meet the application's data sorting and filtering needs. MongoDB automatically creates a unique index on the _id field when the collection is created. All other indexes must be created manually.

Sometimes programmers don’t maintain a separate script for all the indexes they create during the project. To resolve performance issues, they may create indexes directly on the database without adding them to a separate script file. This is a bad practice: when a similar database is created for testing purposes, it may be missing some important indexes, causing all kinds of application errors. In such situations, the following script is handy for extracting all indexes from a running MongoDB database.

The output of the following MongoDB script is itself a MongoDB script that can be executed on a different database instance to recreate all indexes. I also recommend committing the output to your version control system and requiring developers to update it before applying an index on the database. You can also use the script to compare indexes on databases in different environments such as dev, sit, uat and production.

How to Export All Indexes in a MongoDB Database as a Script

// iterate through every collection in MongoDB database
db.getCollectionNames().forEach(function(collection) {
    var indexes = db.getCollection(collection).getIndexes();
    // now iterate through every index in the collection
    indexes.forEach(function(index) {
        // we don't need these as it will be auto created
        delete index.v;delete index.ns;
        var key = index.key;
        delete index.key
        var options = {};
        // let us also copy all options associated with the index
        // index property unique is an example
        for (var option in index) {
            options[option] = index[option]
        }
        // Create script output
        print("db.getCollection(\""+collection+"\").createIndex("+tojson(key)+", "+tojson(options)+");");
    });
});

You can run the above script as a file using the MongoDB shell, or you can simply copy and paste it into a MongoDB client such as RoboMongo. The output can be saved as a script file for creating the indexes of the database.
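To compare the indexes of two environments, the saved outputs can be diffed. The following is a minimal sketch; diffIndexScripts is a hypothetical helper, and it assumes each createIndex statement occupies a single line. A statement that spans several lines would be compared line by line.

```javascript
// Hypothetical helper: each argument is the text printed by the export
// script. Assumes one createIndex statement per line.
function diffIndexScripts(scriptA, scriptB) {
  const toLines = function (text) {
    return text
      .split("\n")
      .map(function (line) { return line.trim(); })
      .filter(function (line) { return line.length > 0; });
  };
  const a = new Set(toLines(scriptA));
  const b = new Set(toLines(scriptB));
  return {
    // statements present in the first script but not in the second
    onlyInA: Array.from(a).filter(function (line) { return !b.has(line); }),
    // statements present in the second script but not in the first
    onlyInB: Array.from(b).filter(function (line) { return !a.has(line); })
  };
}
```

An empty onlyInA and onlyInB mean that both environments define the same indexes with the same options. The comparison is textual, so the two outputs should come from the same shell version.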

Sometimes you may want to apply the indexes of an existing collection to a copy of that collection in the same database. In such cases you can use the following script. You need to replace the collection names c1 and c2 with the actual names of the collections in your database. Note the use of the background option (only needed in MongoDB versions earlier than 4.2) to ensure that index creation does not block operations on the collection.

How to Copy Indexes from One Collection to Another in MongoDB?

var indexes = db.c1.getIndexes();
// now iterate through every index in the collection
indexes.forEach(function(index) {
    // we don't need these as they will be auto created
    delete index.v; delete index.ns;
    var key = index.key;
    delete index.key;
    var options = {};
    // let us also copy all options associated with the index
    // index property unique is an example
    for (var option in index) {
        options[option] = index[option];
    }
    // build in the background so that operations on c2 are not blocked
    // (only needed in MongoDB versions earlier than 4.2)
    options.background = true;
    // Copy indexes to a new collection!
    db.c2.createIndex(key, options);
});

Please note that all the examples given are tested on MongoDB 3.6. These scripts should also work if you are using MongoDB 3.6 API on an Azure CosmosDB database.