I have a few thousand JSON objects in a file. I want to clean it up by removing duplicate entries by id.
JSON File Structure:
{
"quote": [{
"id":1,
"title": "Mahatma Gandhi",
"quote":"A man is but a product of his thoughts. What he thinks he becomes.",
"videourl": "https://www.youtube.com/watch?v=2GgK_Nq9NLw",
"websiteurl":"http://www.mkgandhi.org/",
"otherurl":"https://en.wikipedia.org/wiki/Mahatma_Gandhi",
"updatedon":"19-Aug-2017",
"username":"[email protected]",
"details":"Mohandas Karamchand Gandhi (2 October 1869 – 30 January 1948), commonly known as Mahatma Gandhi (Sanskrit: महात्मा mahātmā 'Great Soul'). In India he is generally regarded as Bapu (Gujarati: બાપુ bāpu 'father'), Jathi Pitha and Raashtra Pita; he was an advocate and pioneer of nonviolent social protest and direct action in the form he called Satyagraha. He led the struggle for India's independence from British colonial rule."
},
{
"id":2,
"title": "Ernest Hemingway",
"quote":"The world breaks everyone, and afterward, some are strong at the broken places.",
"videourl": "https://www.youtube.com/watch?v=35vrg7a64nY",
"websiteurl":"https://www.biography.com/people/ernest-hemingway-9334498",
"otherurl":"https://en.wikipedia.org/wiki/Ernest_Hemingway",
"updatedon":"20-Aug-2017",
"username":"[email protected]",
"details":"Ernest Miller Hemingway was an American novelist, short story writer, and journalist. His economical and understated style had a strong influence on 20th-century fiction, while his life of adventure and his public image influenced later generations."
},
...
]
}
I created a Quote entity in C#, deserialized the objects, found the duplicates using LINQ, and wrote the result back to the file. It takes a considerable amount of time to complete.
Please suggest a more efficient method if you know one.
Edit: Here is the code I tried:
string json = "";
using (StreamReader file = File.OpenText(@"D:\motivation-pyramid\data\demo\quote.json"))
{
json = file.ReadToEnd();
}
var managedObjectCollection = JsonConvert.DeserializeObject<ManagedObjectCollection>(json);
var distinctManagedObjects = managedObjectCollection.quote.GroupBy(a => a.quote).Select(b => b.First());
json = JsonConvert.SerializeObject(distinctManagedObjects.ToArray());
File.WriteAllText(@"D:\motivation-pyramid\data\demo\quote.json", json);
Using the Newtonsoft Json.NET library, this can be achieved pretty easily:
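// Requires: using System.Collections.Generic; using System.Linq; using Newtonsoft.Json.Linq;
JObject jObj = JObject.Parse(json_string);
JArray quotes = (JArray)jObj["quote"];
// Keep the first entry for each id; a HashSet avoids a linear List.Contains scan per element.
var uni = new HashSet<JToken>(quotes.GroupBy(x => (int)x["id"]).Select(x => x.First()));
// Walk backwards so that removals do not shift the indices still to be visited.
for (int i = quotes.Count - 1; i >= 0; --i)
{
JToken token = quotes[i];
if (!uni.Contains(token))
token.Remove();
}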
If you don't want to work with JObject, once your JSON data is deserialized into a list you can use the following LINQ expression:
list.GroupBy(x => x.ID).Select(g => g.First()).ToList();
Alternatively, with MoreLINQ (or the built-in Enumerable.DistinctBy on .NET 6 and later):
list.DistinctBy(x => x.ID).ToList();
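If neither MoreLINQ nor .NET 6 is available, the same "first entry per id wins" idea can be sketched with a plain HashSet. This is only an illustration under assumptions: the Quote class, its Id and Title properties, and the QuoteDeduplicator name are hypothetical stand-ins for whatever entity the JSON is deserialized into.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical entity; only Id matters for de-duplication here.
public class Quote
{
    public int Id { get; set; }
    public string Title { get; set; }
}

public static class QuoteDeduplicator
{
    // Returns the first quote seen for each Id, preserving input order.
    // HashSet<int>.Add returns false when the id is already present,
    // so the whole pass is a single scan over the input.
    public static List<Quote> DistinctById(IEnumerable<Quote> quotes)
    {
        var seen = new HashSet<int>();
        var result = new List<Quote>();
        foreach (var q in quotes)
        {
            if (seen.Add(q.Id))
            {
                result.Add(q);
            }
        }
        return result;
    }
}
```

It would be called on the deserialized list, for example QuoteDeduplicator.DistinctById(list). Because the set lookup is constant time on average, the cost grows linearly with the number of entries.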