Does string exist check 20k times

Question

I have an ITunes library XML file backup file - about 15 MB.

I have 20K music files on my C drive and about 25K files on E drive under exactly similar folder structures.

I am traversing the first location and going file by file and checking if the file exiss in the second location. That part works for me.

Now, for all such duplicate files, if the file path from E drive exists in the XML, but the C drive path does not exist in the XML, then I want to delete the file from the C drive.

What is my best way of checking if a string exists in the XML file (I have to do this atleast 20K times)?

Chris Laplante · Accepted Answer

Depending on if you want to count how many times a string occurs, or if you are merely checking for the existence of the strings, your approach will be slightly different. But, these are the two ways I would consider doing it:

If you want to do it with minimal memory:

Load the file line by line (or, if your XML is not formatted like this, node by node using an XML parser... I believe there are XML parsers that can do this). Do a search operation on the line for each string. No more than one line/node will be in memory at a time, if you correctly overwrite the last line. The downside to this is that it will take longer and the file will be open longer.

If you want to do it fast:

Load the entire file into memory, don't bother parsing it, and just search for each string.

EDIT

Based on your clarifications, I would first collect all duplicate file names in an array, and then proceed to scan each line of the XML file using my first method (above). If you are already storing 20K file names in memory, I would hesitate to load the entire 15MB XML at the same time.

Eduardo Mauro · Answer

A suggestion: load as text, use a regular expression to extract the desired strings (I suppose they are enclosed with a specific tag) and build a hash list with them. You can use the list to check the existence.

Dirk Vollmar · Answer

Here is a simple solution using Linq. Runs sufficiently fast enough for one-time use:

using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class ITunesChecker
{
    static void Main(string[] args)
    {
        // retrieve file names
        string baseFolder = @"E:\My Music\";
        string[] filesM4a = Directory.GetFiles(baseFolder, "*.m4a", SearchOption.AllDirectories);
        string[] filesMp3 = Directory.GetFiles(baseFolder, "*.mp3", SearchOption.AllDirectories);
        string[] files = new string[filesM4a.Length + filesMp3.Length];
        Array.Copy(filesM4a, 0, files, 0, filesM4a.Length);
        Array.Copy(filesMp3, 0, files, filesM4a.Length, filesMp3.Length);

        // convert to the format used by iTunes
        for (int i = 0; i < files.Length; i++)
        {
            Uri uri = null;
            if (Uri.TryCreate(files[i], UriKind.Absolute, out uri))
            {
                files[i] = uri.AbsoluteUri.Replace("file:///", "file://localhost/");
            }
        }

        // read the files from iTunes library.xml
        XDocument library = XDocument.Load(@"E:\My Music\iTunes\iTunes Music Library.xml");
        var q = from node in library.Document.Descendants("string")
                where node.ElementsBeforeSelf("key").Where(n => n.Parent == node.Parent).Last().Value == "Location"
                select node.Value;

        // do the set operations you are interested in
        var missingInLibrary = files.Except(q, StringComparer.InvariantCultureIgnoreCase);
        var missingInFileSystem = q.Except(files, StringComparer.InvariantCultureIgnoreCase);
        var presentInBoth = files.Intersect(q, StringComparer.InvariantCultureIgnoreCase);
    }
}

Jimmy Hoffa · Answer

Alphabetically sort your list of strings that you're matching on, then build an index array which tells you where the start of your list is for each character that is a starting character for one of the strings, maybe indexing to the second character depending on the breadth of variety and if your match is case sensitive or not.

Read the file character by character with a stream to minimize memory footprint, checking into the index array to see where that character starts and ends in the list of strings so you can pull out that characters page, if there's anything starting with those character combinations. Then continue filtering inside of the page until you have one match left and the next character makes matches 0.

Remove that string from the list of strings to match, put it in another list if you want. Then start checking your index on the next character and continue doing so each time you run into no matches.

The index gives you a more efficient aggregate to minimize number of items iterated against.

This could give you a two character depth index:

Dictionary<string,int> stringIndex = new Dictionary<char,int>();
for(int i = 0; i < sortedSearchStrings.Length; i++;)
{
    if (!stringIndex.Keys.Contains(sortedSearchStrings[i][0])) stringIndex[sortedSearchStrings[i][0]] = i;
    if (!stringIndex.Keys.Contains(sortedSearchStrings[i][0] + sortedSearchStrings[i][1])) stringIndex[sortedSearchStrings[i][0] + sortedSearchStrings[i][1]] = i;
}

Then to find the starting index in your list you just access:

int startOfCurrentCharPage = stringIndex[string.Format("{0}{1}", lastChar, currentChar)];

Does string exist check 20k times

Tags:

string

c#

search

Raj More

4 Answers

Chris Laplante

Eduardo Mauro

Dirk Vollmar

Jimmy Hoffa

Recent Activity

Donate For Us

Does string exist check 20k times

Tags:

string

c#

search

Raj More

4 Answers

Chris Laplante

Eduardo Mauro

Dirk Vollmar

Jimmy Hoffa

Related questions

Recent Activity

Donate For Us