Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove Duplicates from MongoDB

hi I have a ~5 million documents in mongodb (replication) each document 43 fields. how to remove duplicate document. I tryed

db.testkdd.ensureIndex({
        duration  : 1 , protocol_type  : 1 , service  : 1 ,
        flag  : 1 , src_bytes  : 1 , dst_bytes  : 1 ,
        land  : 1 , wrong_fragment  : 1 , urgent  : 1 ,
        hot  : 1 , num_failed_logins  : 1 , logged_in  : 1 ,
        num_compromised  : 1 , root_shell  : 1 , su_attempted  : 1 ,
        num_root  : 1 , num_file_creations  : 1 , num_shells  : 1 ,
        num_access_files  : 1 , num_outbound_cmds  : 1 , is_host_login  : 1 ,
        is_guest_login  : 1 , count  : 1 ,  srv_count  : 1 ,
        serror_rate  : 1 , srv_serror_rate  : 1 , rerror_rate  : 1 ,
        srv_rerror_rate  : 1 , same_srv_rate  : 1 , diff_srv_rate  : 1 ,
        srv_diff_host_rate  : 1 , dst_host_count  : 1 , dst_host_srv_count  : 1 ,
        dst_host_same_srv_rate  : 1 , dst_host_diff_srv_rate  : 1 ,
        dst_host_same_src_port_rate  : 1 ,  dst_host_srv_diff_host_rate  : 1 ,
        dst_host_serror_rate  : 1 , dst_host_srv_serror_rate  : 1 ,
        dst_host_rerror_rate  : 1 , dst_host_srv_rerror_rate  : 1 , lable  : 1 
    },
    {unique: true, dropDups: true}
)

run this code i get a error "errmsg" : "namespace name generated from index ..

{
    "ok" : 0,
    "errmsg" : "namespace name generated from index name \"project.testkdd.$duration_1_protocol_type_1_service_1_flag_1_src_bytes_1_dst_bytes_1_land_1_wrong_fragment_1_urgent_1_hot_1_num_failed_logins_1_logged_in_1_num_compromised_1_root_shell_1_su_attempted_1_num_root_1_num_file_creations_1_num_shells_1_num_access_files_1_num_outbound_cmds_1_is_host_login_1_is_guest_login_1_count_1_srv_count_1_serror_rate_1_srv_serror_rate_1_rerror_rate_1_srv_rerror_rate_1_same_srv_rate_1_diff_srv_rate_1_srv_diff_host_rate_1_dst_host_count_1_dst_host_srv_count_1_dst_host_same_srv_rate_1_dst_host_diff_srv_rate_1_dst_host_same_src_port_rate_1_dst_host_srv_diff_host_rate_1_dst_host_serror_rate_1_dst_host_srv_serror_rate_1_dst_host_rerror_rate_1_dst_host_srv_rerror_rate_1_lable_1\" is too long (127 byte max)",
    "code" : 67
}

how can solve the problem ?

like image 652
mohamedzajith Avatar asked Mar 16 '23 08:03

mohamedzajith


1 Answers

The "dropDups" syntax for index creation has been "deprecated" as of MongoDB 2.6 and removed in MongoDB 3.0. It is not a very good idea in most cases to use this as the "removal" is arbitrary and any "duplicate" could be removed. Which means what gets "removed" may not be what you really want removed.

Anyhow, you are running into an "index length" error since the value of the index key here would be longer that is allowed. Generally speaking, you are not "meant" to index 43 fields in any normal application.

If you want to remove the "duplicates" from a collection then your best bet is to run an aggregation query to determine which documents contain "duplicate" data and then cycle through that list removing "all but one" of the already "unique" _id values from the target collection. This can be done with "Bulk" operations for maximum efficiency.

NOTE: I do find it hard to believe that your documents actually contain 43 "unique" fields. It is likely that "all you need" is to simply identify only those fields that make the document "unique" and then follow the process as outlined below:

var bulk = db.testkdd.initializeOrderedBulkOp(),
    count = 0;

// List "all" fields that make a document "unique" in the `_id`
// I am only listing some for example purposes to follow
db.testkdd.aggregate([
    { "$group": {
        "_id": {
           "duration" : "$duration",
          "protocol_type": "$protocol_type", 
          "service": "$service",
          "flag": "$flag"
        },
        "ids": { "$push": "$_id" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
],{ "allowDiskUse": true}).forEach(function(doc) {
    doc.ids.shift();     // remove first match
    bulk.find({ "_id": { "$in": doc.ids } }).remove();  // removes all $in list
    count++;

    // Execute 1 in 1000 and re-init
    if ( count % 1000 == 0 ) {
       bulk.execute();
       bulk = db.testkdd.initializeOrderedBulkOp();
    }
});

if ( count % 1000 != 0 ) 
    bulk.execute();

If you have a MongoDB version "lower" than 2.6 and don't have bulk operations then you can try with standard .remove() inside the loop as well. Also noting that .aggregate() will not return a cursor here and the looping must change to:

db.testkdd.aggregate([
   // pipeline as above
]).result.forEach(function(doc) {
    doc.ids.shift();  
    db.testkdd.remove({ "_id": { "$in": doc.ids } });
});

But do make sure to look at your documents closely and only include "just" the "unique" fields you expect to be part of the grouping _id. Otherwise you end up removing nothing at all, since there are no duplicates there.

like image 64
Blakes Seven Avatar answered Apr 06 '23 00:04

Blakes Seven