
Remove duplicate JSON object in file using C#

Tags:

json

c#

bigdata

I have a few thousand JSON objects in a file. I want to clean it up by removing entries with a duplicate id.

JSON File Structure:

{
"quote": [{
"id":1,
"title": "Mahatma Gandhi",
"quote":"A man is but a product of his thoughts. What he thinks he becomes.",
"videourl": "https://www.youtube.com/watch?v=2GgK_Nq9NLw",
"websiteurl":"http://www.mkgandhi.org/",
"otherurl":"https://en.wikipedia.org/wiki/Mahatma_Gandhi",
"updatedon":"19-Aug-2017",
"username":"[email protected]",
"details":"Mohandas Karamchand Gandhi (2 October 1869 – 30 January 1948), commonly known as Mahatma Gandhi (Sanskrit: महात्मा mahātmā 'Great Soul'). In India he is generally regarded as Bapu (Gujarati: બાપુ bāpu 'father'), Jathi Pitha and Raashtra Pita; he was an advocate and pioneer of nonviolent social protest and direct action in the form he called Satyagraha. He led the struggle for India's independence from British colonial rule."
},
{
"id":2,
"title": "Ernest Hemingway",
"quote":"The world breaks everyone, and afterward, some are strong at the broken places.",
"videourl": "https://www.youtube.com/watch?v=35vrg7a64nY",
"websiteurl":"https://www.biography.com/people/ernest-hemingway-9334498",
"otherurl":"https://en.wikipedia.org/wiki/Ernest_Hemingway",
"updatedon":"20-Aug-2017",
"username":"[email protected]",
"details":"Ernest Miller Hemingway was an American novelist, short story writer, and journalist. His economical and understated style had a strong influence on 20th-century fiction, while his life of adventure and his public image influenced later generations."
},
...
]
}

I created a Quote entity, deserialized the file into objects, used LINQ to find the duplicates, and wrote the result back to the file, all in C#. This takes a considerable amount of time to complete.

Please suggest a more efficient method if you know one.

Edit: Here is the code I tried:

string json = "";

using (StreamReader file = File.OpenText(@"D:\motivation-pyramid\data\demo\quote.json"))
{
    json = file.ReadToEnd();
}

var managedObjectCollection = JsonConvert.DeserializeObject<ManagedObjectCollection>(json);

var distinctManagedObjects = managedObjectCollection.quote.GroupBy(a => a.quote).Select(b => b.First());

json = JsonConvert.SerializeObject(distinctManagedObjects.ToArray());

File.WriteAllText(@"D:\motivation-pyramid\data\demo\quote.json", json);
Bharathkumar V asked Mar 14 '26 20:03



1 Answer

Using the Newtonsoft Json.NET library, this can be achieved pretty easily:

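// Needs: using System.Collections.Generic; using System.Linq; using Newtonsoft.Json.Linq;
JObject jObj = JObject.Parse(json_string);

// Keep the first entry found for each id.
List<JToken> uni = jObj["quote"].GroupBy(x => x["id"]).Select(x => x.First()).ToList();

// Walk the array backwards so that removing an entry does not shift
// the indices that are still to be visited.
for (Int32 i = jObj["quote"].Count() - 1; i >= 0; --i)
{
    JToken token = jObj["quote"][i];

    // Anything that is not the first entry for its id is a duplicate.
    if (!uni.Contains(token))
        token.Remove();
}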
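
Since the question is about speed, note that the loop above calls uni.Contains once per entry, and each call scans the list. A minimal single-pass sketch that tracks the ids already kept in a HashSet might look like the following. The class and method names (QuoteJson, RemoveDuplicateIds) are made up for illustration, and the code assumes every entry has an integer "id":

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class QuoteJson
{
    // Hypothetical helper: returns the JSON text with duplicate ids removed
    // from the top-level "quote" array, keeping the first entry per id.
    public static string RemoveDuplicateIds(string json)
    {
        JObject root = JObject.Parse(json);
        JArray quotes = (JArray)root["quote"];

        var seen = new HashSet<long>();   // ids that have already been kept
        var unique = new JArray();

        foreach (JToken item in quotes)
        {
            // HashSet.Add returns false when the id was seen before.
            if (seen.Add((long)item["id"]))
                unique.Add(item);         // Json.NET copies a token that already has a parent
        }

        // Swap the original array for the de-duplicated one.
        root["quote"] = unique;
        return root.ToString(Formatting.None);
    }
}
```

The returned string keeps the outer { "quote": [...] } wrapper, so it can be written straight back with File.WriteAllText.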

If you don't want to use that library, once your JSON data is deserialized into a list, you can use the following LINQ expression:

list.GroupBy(x => x.ID).Select(x => x.First()).ToList();

Alternatively, with MoreLINQ (or the built-in DistinctBy in .NET 6 and later):

list.DistinctBy(x => x.ID).ToList();
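
If neither MoreLINQ nor a recent framework is available, the same "first entry per ID" filter can be sketched by hand with a HashSet. The Quote class below is a hypothetical stand-in for your entity (only ID and Title are shown), and QuoteFilters.DistinctById is a name invented for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity; only the members needed for the example are shown.
public class Quote
{
    public int ID { get; set; }
    public string Title { get; set; }
}

public static class QuoteFilters
{
    // Returns the first item seen for each ID, preserving the original order.
    public static List<Quote> DistinctById(IEnumerable<Quote> source)
    {
        var seen = new HashSet<int>();
        var result = new List<Quote>();

        foreach (var q in source)
        {
            // HashSet.Add returns false when the ID is already present.
            if (seen.Add(q.ID))
                result.Add(q);
        }

        return result;
    }
}
```

It would be called as var unique = QuoteFilters.DistinctById(list); and makes a single pass over the list.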
Tommaso Belluzzo answered Mar 17 '26 08:03
