Using a Hash to Remove Duplicates in Mongoose, MongoDB – Aggregate Exceeded Document Size Workaround/Fix


One way that has been proposed for removing duplicates in MongoDB is to use the MongoDB aggregate function. It's a straightforward process: 1.) specify the criteria for comparison (i.e. the field you want to match in order to determine a duplicate), 2.) group the records on that criteria (so each record belongs to exactly one group), and 3.) once you know the duplicate groups, weed out the offenders – keep the first element in each group and delete the others. See this excerpt from an aggregate solution on Stack Overflow. When developing Node.js apps that need MongoDB interaction, I usually use the Mongoose library, "elegant MongoDB object modeling for Node.js". This post describes an alternate (and naive) approach to removing duplicates for situations where memory issues make the aggregate option non-viable.

var duplicates = [];

db.collectionName.aggregate([
  // Selection criteria; you can remove this "$match" stage if you want to scan all documents
  { $match: {
    "source_references.key": { "$ne": '' }
  }},
  { $group: {
    _id: { source_references_key: "$source_references.key" }, // can be grouped on multiple properties
    dups: { "$addToSet": "$_id" },
    count: { "$sum": 1 }
  }},
  { $match: {
    count: { "$gt": 1 }    // a count greater than one means the group contains duplicates
  }}
])               // You can print the result up to this point to inspect the duplicates
// Depending on your shell/driver, the documents may be wrapped in a "result" field;
// if they are not, drop ".result" from the chain below
.result
.forEach(function(doc) {
    doc.dups.shift();      // skip the first element so one copy of each group is kept
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);   // collect all duplicate ids
    });
});

// Optional: print all the "_id" values that are about to be deleted
printjson(duplicates);

// Remove all duplicates in one go    
db.collectionName.remove({_id:{$in:duplicates}})

When aggregate doesn't work – Maximum Document Size Exceeded

exception: aggregation result exceeds maximum document size (16MB)

When your dataset is large (e.g. millions of records), or the fields you use for duplicate comparison are text heavy (in my case these were email fields – sender, receiver, body, timestamp), you can run into the MongoDB maximum document size error above. A workaround (when writing native MongoDB queries) is to set the allowDiskUse option to true, which allows MongoDB to write to temporary files to manage memory use. However, Mongoose does not seem to support this option well, and even after setting allowDiskUse to true, I still got the maximum document size error. Note that you can still run this aggregation as a native query on the mongodb shell.
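For reference, a minimal sketch (using the same grouping idea, not my full pipeline) of how the option is passed as the second argument to aggregate in the mongo shell:

// mongo shell: pass allowDiskUse as an option document to aggregate
db.collectionName.aggregate(
  [
    { $group: { _id: "$source_references.key", dups: { "$addToSet": "$_id" }, count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } }
  ],
  { allowDiskUse: true }   // lets aggregation stages spill to temporary files on disk
);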

A Naive Workaround – Using Hashes.

The alternative approach I propose here is naive, runs in linear time, but requires multiple steps. For a process like this, which will run only once in my current application, I can hazard the multiple steps.

Step 1: Compute a hash on your comparison fields. Preferably, add this as a field in your db. I used a Node.js md5 hash library to compute a new database attribute.

  md5("");
 var hash = md5.create();
 hash.update(record.from + record.sent_at + record.to + record.body); // concatenate comparison fields and compute hash
 var hashhex = hash.hex() ; 

// Update your db, add hashhex as an attribute for each record

Note: You will need to add the hash to your db records, either while you are creating your dataset or by writing an update script that inserts the new hash field.
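For illustration, here is a minimal sketch of such an update script. It assumes a hypothetical Mongoose model named Email whose schema has the comparison fields used above plus a hash field; adjust the names to your own schema.

var md5 = require('js-md5');            // md5 library exposing create()/update()/hex()
var Email = require('./models/email');  // hypothetical Mongoose model: from, to, sent_at, body, hash

// Backfill the hash field for records that do not have one yet
Email.find({ hash: { $exists: false } }, function (err, records) {
  if (err) throw err;
  records.forEach(function (record) {
    var hash = md5.create();
    hash.update(record.from + record.sent_at + record.to + record.body);
    record.hash = hash.hex();
    record.save(function (saveErr) {
      if (saveErr) console.log('failed to save hash for', record._id);
    });
  });
});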

Step 2: Aggregate based on the single hash field (a much lighter operation that shouldn't trigger the maximum document size error). In my case, this turned out to be a fairly fast operation. After aggregating on the hash field, you can generate the list of duplicate ids that will subsequently be removed.

var duplicates = []; // store duplicate ids

// With Mongoose, "db.collectionName" is your model, e.g. Email.aggregate([...])
db.collectionName.aggregate([
  { $group: {
      _id: { hash: "$hash" },
      dups: { "$addToSet": "$_id" },
      count: { $sum: 1 }
  }},
  { $match: { count: { $gt: 1 } } }
])
.allowDiskUse(true)
.exec(function(err, data) {
  if (err) {
    throw err;
  } else {
    data.forEach(function(doc) {
      doc.dups.shift();      // skip the first element so one copy is kept
      doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);   // collect all duplicate ids
        // console.log("pushing id ", dupId);
      });
    });
  }
});

Step 3: Delete your duplicates.

db.collectionName.remove({_id:{$in:duplicates}}, function (err, count) { console.log( " done removing duplicates ")});
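One caveat: the aggregation in Step 2 runs asynchronously, so the duplicates array is only fully populated inside its exec callback. A minimal sketch of chaining the two steps (same placeholder collection name as above) so the remove only fires once all duplicate ids are known:

db.collectionName.aggregate([
  { $group: { _id: { hash: "$hash" }, dups: { "$addToSet": "$_id" }, count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }
])
.allowDiskUse(true)
.exec(function (err, data) {
  if (err) throw err;
  var duplicates = [];
  data.forEach(function (doc) {
    doc.dups.shift();                         // keep one copy per duplicate group
    doc.dups.forEach(function (dupId) { duplicates.push(dupId); });
  });
  // Only remove once the full list of duplicate ids is known
  db.collectionName.remove({ _id: { $in: duplicates } }, function (removeErr) {
    if (removeErr) throw removeErr;
    console.log("done removing duplicates");
  });
});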

All done! Was this approach useful for your particular scenario?
