Hi, I have ~5 million documents in MongoDB (a replica set), each document with 43 fields. How can I remove duplicate documents? I tried
db.testkdd.ensureIndex({
    duration : 1 , protocol_type : 1 , service : 1 ,
    flag : 1 , src_bytes : 1 , dst_bytes : 1 ,
    land : 1 , wrong_fragment : 1 , urgent : 1 ,
    hot : 1 , num_failed_logins : 1 , logged_in : 1 ,
    num_compromised : 1 , root_shell : 1 , su_attempted : 1 ,
    num_root : 1 , num_file_creations : 1 , num_shells : 1 ,
    num_access_files : 1 , num_outbound_cmds : 1 , is_host_login : 1 ,
    is_guest_login : 1 , count : 1 , srv_count : 1 ,
    serror_rate : 1 , srv_serror_rate : 1 , rerror_rate : 1 ,
    srv_rerror_rate : 1 , same_srv_rate : 1 , diff_srv_rate : 1 ,
    srv_diff_host_rate : 1 , dst_host_count : 1 , dst_host_srv_count : 1 ,
    dst_host_same_srv_rate : 1 , dst_host_diff_srv_rate : 1 ,
    dst_host_same_src_port_rate : 1 , dst_host_srv_diff_host_rate : 1 ,
    dst_host_serror_rate : 1 , dst_host_srv_serror_rate : 1 ,
    dst_host_rerror_rate : 1 , dst_host_srv_rerror_rate : 1 , lable : 1
},
{ unique: true, dropDups: true }
)
Running this code I get the error "errmsg" : "namespace name generated from index ..
{
"ok" : 0,
"errmsg" : "namespace name generated from index name \"project.testkdd.$duration_1_protocol_type_1_service_1_flag_1_src_bytes_1_dst_bytes_1_land_1_wrong_fragment_1_urgent_1_hot_1_num_failed_logins_1_logged_in_1_num_compromised_1_root_shell_1_su_attempted_1_num_root_1_num_file_creations_1_num_shells_1_num_access_files_1_num_outbound_cmds_1_is_host_login_1_is_guest_login_1_count_1_srv_count_1_serror_rate_1_srv_serror_rate_1_rerror_rate_1_srv_rerror_rate_1_same_srv_rate_1_diff_srv_rate_1_srv_diff_host_rate_1_dst_host_count_1_dst_host_srv_count_1_dst_host_same_srv_rate_1_dst_host_diff_srv_rate_1_dst_host_same_src_port_rate_1_dst_host_srv_diff_host_rate_1_dst_host_serror_rate_1_dst_host_srv_serror_rate_1_dst_host_rerror_rate_1_dst_host_srv_rerror_rate_1_lable_1\" is too long (127 byte max)",
"code" : 67
}
How can I solve this problem?
The "dropDups" syntax for index creation has been "deprecated" as of MongoDB 2.6 and removed in MongoDB 3.0. It is not a very good idea in most cases to use this as the "removal" is arbitrary and any "duplicate" could be removed. Which means what gets "removed" may not be what you really want removed.
Anyhow, you are running into an "index length" error, since the namespace name generated from the index name here would be longer than is allowed. Generally speaking, you are not "meant" to index 43 fields in any normal application.
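To see why the limit is hit so easily: when you don't name an index yourself, MongoDB generates a name by joining every field with its sort direction, and that generated name is what blew past the (old, pre-4.2) 127-byte namespace limit in your error. A rough sketch of that name generation (the function below is illustrative, not MongoDB's actual code):

```javascript
// Sketch of how MongoDB builds a default index name: each key and its
// direction joined by underscores. With 43 fields this easily exceeds
// the old 127-byte namespace limit reported in the error above.
function defaultIndexName(keys) {
    return Object.keys(keys).map(function (k) {
        return k + "_" + keys[k];
    }).join("_");
}

var name = defaultIndexName({ "duration": 1, "protocol_type": 1, "service": 1 });
// "duration_1_protocol_type_1_service_1" -- already 36 characters
// for just three of the 43 fields
```

Passing an explicit short string in the options, e.g. `{ "name": "myidx" }`, sidesteps the generated-name length issue, but it does not bring back "dropDups", so it would not solve your actual deduplication problem.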
If you want to remove the "duplicates" from a collection, then your best bet is to run an aggregation query to determine which documents contain "duplicate" data, and then cycle through that list removing "all but one" of the already "unique" _id values from the target collection. This can be done with "Bulk" operations for maximum efficiency.
NOTE: I do find it hard to believe that your documents actually contain 43 "unique" fields. It is likely that "all you need" is to simply identify only those fields that make the document "unique" and then follow the process as outlined below:
var bulk = db.testkdd.initializeOrderedBulkOp(),
    count = 0;

// List "all" fields that make a document "unique" in the `_id`
// I am only listing some for example purposes to follow

db.testkdd.aggregate([
    { "$group": {
        "_id": {
            "duration": "$duration",
            "protocol_type": "$protocol_type",
            "service": "$service",
            "flag": "$flag"
        },
        "ids": { "$push": "$_id" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
], { "allowDiskUse": true }).forEach(function(doc) {
    doc.ids.shift();                                   // remove first match
    bulk.find({ "_id": { "$in": doc.ids } }).remove(); // removes all in $in list
    count++;

    // Execute 1 in 1000 and re-init
    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.testkdd.initializeOrderedBulkOp();
    }
});

if ( count % 1000 != 0 )
    bulk.execute();
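On MongoDB 3.2 and later you can also express the same removal with .bulkWrite() and "deleteMany" operations rather than the Bulk API. The sketch below factors out the op-building so it can be shown on its own; the `groups` array stands in for what the $group/$match pipeline above emits, and in the shell you would build `ops` from the aggregation cursor instead:

```javascript
// Sketch: turn grouped duplicate results into bulkWrite deleteMany ops.
// Each group keeps its first _id and deletes the rest.
function buildDeleteOps(groups) {
    return groups.map(function (doc) {
        var ids = doc.ids.slice(1); // keep the first _id, drop the rest
        return { deleteMany: { filter: { _id: { $in: ids } } } };
    });
}

// Example grouped result with two duplicate sets (shape matches the
// pipeline output above; the values are made up for illustration)
var groups = [
    { _id: { duration: 0, protocol_type: "tcp" }, ids: [1, 2, 3], count: 3 },
    { _id: { duration: 5, protocol_type: "udp" }, ids: [4, 5], count: 2 }
];

var ops = buildDeleteOps(groups);
// In the shell: db.testkdd.bulkWrite(ops, { ordered: false })
```

With `{ ordered: false }` the server is free to apply the deletes in any order, which is fine here since the operations are independent.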
If you have a MongoDB version "lower" than 2.6 and don't have bulk operations, then you can try with a standard .remove() inside the loop as well. Also note that .aggregate() will not return a cursor here, and the looping must change to:
db.testkdd.aggregate([
    // pipeline as above
]).result.forEach(function(doc) {
    doc.ids.shift();
    db.testkdd.remove({ "_id": { "$in": doc.ids } });
});
But do make sure to look at your documents closely and only include "just" the "unique" fields you expect to be part of the grouping _id. Otherwise you end up removing nothing at all, since there are no duplicates there.
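Once the duplicates are gone, a plain unique index on just those identifying fields (not all 43) will keep new ones from being inserted. A sketch, assuming the four fields used in the pipeline above really are the identifying set; the explicit "dedup_key_idx" name is illustrative and also avoids the generated-name length problem from the original error:

```javascript
// Assumes duration/protocol_type/service/flag uniquely identify a document.
// "dedup_key_idx" is just an illustrative short name.
var indexKeys = { "duration": 1, "protocol_type": 1, "service": 1, "flag": 1 };
var indexOptions = { "unique": true, "name": "dedup_key_idx" };

// In the shell, after the duplicates have been removed:
// db.testkdd.createIndex(indexKeys, indexOptions)
```

Create it only after the cleanup, since building a unique index over existing duplicates will fail (and without "dropDups" there is no longer any option to have it delete them for you).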