Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find digits in file names and cross reference them with others

First, I will quickly describe my motivation for this and the actual problem:
I deal with large batches of files constantly and more specifically, I find myself having to rename them according to the following rule:
They may all contain words and digits, but only one set of digits is incrementing and not 'constant'. I need to extract those and only those digits and rename the files accordingly. For example:

Foo_1_Bar_2015.jpg
Foo_2_Bar_2015.jpg
Foo_03_Bar_2015.jpg
Foo_4_Bar_2015.jpg

Will be renamed:

1.jpg
2.jpg
3.jpg or 03.jpg (The leading zero can stay or go)
4.jpg

So what we start with is a vector with std::wstring objects for all the filenames in the specified directory. I urge you to stop reading for 3 minutes and think about how to approach this before I continue with my attempts and questions. I don't want my ideas to nudge you in one direction or another and I've always found fresh ideas are the best.

Now, here are two ways that I can think of:

1) Old style C string manipulation and comparisons:
In my mind, this entails parsing each filename and remembering each digit sequence position and length. This is easily stored in a vector or what-not for each file. This works well (basically uses string searches with increasing offsets):

while((offset = filename_.find_first_of(L"0123456789", offset)) != filename.npos)
{
    size = filename.find_first_not_of(L"0123456789", offset) - offset;
    digit_locations_vec.emplace_back(offset, size);
    offset += size;
}

What I have after this is a vector of (Location, Size) pairs for all the digits in the filename, constant (by using the definition in the motivation) or not.
After this, chaos ensues, as you need to cross reference the strings and find out which digits are the ones that need to be extracted. This will grow exponentially with the number of files (which tends to be huge) not to mentioned multiplied by the number of digit sequences in each string. Also, not very readable, maintainable or elegant. No go.

2) Regular Expressions

If there was ever a use for regex's, it's this. Create a regex object out of the first filename and try to match it with what comes next. Success? Instant ability to extract the required number. Failure? Add the offending filename as a new regex object and try to match against the two existing regex's. Rinse and repeat. The regex's would look something like this:

Foo_(\d+)_Bar_(\d+).jpg

or create a regex for each digit sequence separately:

Foo_(\d+)_Bar_2015.jpg
Foo_1_Bar_(\d+).jpg

The rest is cake. Just keep on matching as you go, and in the best case, it might require only one pass! Question is...

What I need to know:

1) Can you think of any other superior way to achieve this? I've been banging my head against the wall for days.
2) Although the cost of string manipulation and vector constructing\destructing may be substantial in the first method, perhaps it pales in comparison to the cost of regex objects. Second method, worst case: as many regex objects as files. Would this be disastrous with potentially thousands of files?
3) The second method can be adjusted for one of two possibilities: Few std::regex object constructions, many regex_match calls or the other way around. Which is more expensive, the construction of the regex object or trying to match a string with it?

like image 270
Mark Avatar asked Jun 05 '15 17:06

Mark


2 Answers

For me (gcc4.6.2 32-bit optimizations O3), manual string manipulation was about 2x faster than regular expressions. Not worth the cost.

Example runnable complete code (link with boost_system and boost_regex, or change include if you have regex in the compiler already):

#include <ctime>
#include <cctype>
#include <algorithm>
#include <string>
#include <iostream>
#include <vector>
#include <sstream>
using namespace std;

#include <boost/regex.hpp>
using namespace boost;

/*
Foo_1_Bar_2015.jpg
Foo_1_Bar_2016.jpg
Foo_2_Bar_2016.jpg
Foo_2_Bar_2015.jpg
...
*/
vector<string> generateNames(int lenPerYear, int yearStart, int years);

/*
Foo_1_Bar_2015.jpg -> 1_2015.jpg
Foo_7_Bar_2016.jpg -> 7_2016.jpg
*/

void rename_method_string(const vector<string> & names, vector<string> & renamed);
void rename_method_regex(const vector<string> & names, vector<string> & renamed);

typedef void rename_method_t(const vector<string> & names, vector<string> & renamed);
void testMethod(const vector<string> & names, const string & description, rename_method_t method);

int main()
{
    vector<string> names = generateNames(10000, 2014, 100);
    cout << "names.size() = " << names.size() << '\n';
    cout << '\n';
    testMethod(names, "method 1 - string manipulation: ", rename_method_string);
    cout << '\n';
    testMethod(names, "method 2 - regular expressions: ", rename_method_regex);

    return 0;
}

void testMethod(const vector<string> & names, const string & description, rename_method_t method)
{
    vector<string> renamed(names.size());
    clock_t timeStart = clock();
    method(names, renamed);
    clock_t timeEnd = clock();
    cout << "renamed examples:\n";
    for (int i = 0; i < 10 && i < names.size(); ++i)
        cout << names[i] << " -> " << renamed[i] << '\n';
    cout << description << 1000 * (timeEnd - timeStart) / CLOCKS_PER_SEC << " ms\n";
}

vector<string> generateNames(int lenPerYear, int yearStart, int years)
{
    vector<string> result;
    for (int year = yearStart, yearEnd = yearStart + years; year < yearEnd; ++year)
    {
        for (int i = 0; i < lenPerYear; ++i)
        {
            ostringstream oss;
            oss << "Foo_" << i << "_Bar_" << year << ".jpg";
            result.push_back(oss.str());
        }
    }
    return result;
}

template<typename T>
bool equal_safe(T itShort, T itShortEnd, T itLong, T itLongEnd)
{
    if (itLongEnd - itLong < itShortEnd - itShort)
        return false;
    return equal(itShort, itShortEnd, itLong);
}

void rename_method_string(const vector<string> & names, vector<string> & renamed)
{
    //manually: "Foo_(\\d+)_Bar_(\\d+).jpg" -> \1_\2.jpg
    const string foo = "Foo_", bar = "_Bar_", jpg = ".jpg";

    for (int i = 0; i < names.size(); ++i)
    {
        const string & name = names[i];
        //starts with foo?
        if (!equal_safe(foo.begin(), foo.end(), name.begin(), name.end()))
        {
            renamed[i] = "ERROR no foo";
            continue;
        }
        //extract number
        auto it = name.begin() + foo.size();
        for (; it != name.end() && isdigit(*it); ++it) {}
        string str_num1(name.begin() + foo.size(), it);
        //continues with bar?
        if (!equal_safe(bar.begin(), bar.end(), it, name.end()))
        {
            renamed[i] = "ERROR no bar";
            continue;
        }
        //extract number
        it += bar.size();
        auto itStart = it;
        for (; it != name.end() && isdigit(*it); ++it) {}
        string str_num2(itStart, it);
        //check *.jpg
        if (!equal_safe(jpg.begin(), jpg.end(), it, name.end()))
        {
            renamed[i] = "ERROR no .jpg";
            continue;
        }
        renamed[i] = str_num1 + "_" + str_num2 + ".jpg";
    }
}

void rename_method_regex(const vector<string> & names, vector<string> & renamed)
{
    regex searching("Foo_(\\d+)_Bar_(\\d+).jpg");
    smatch found;
    for (int i = 0; i < names.size(); ++i)
    {
        if (regex_search(names[i], found, searching))
        {
            if (3 != found.size())
                renamed[i] = "ERROR weird match";
            else
                renamed[i] = found[1].str() + "_" + found[2].str() + ".jpg";
        }
        else renamed[i] = "ERROR no match";
    }
}

It produces output for me:

names.size() = 1000000

renamed examples:
Foo_0_Bar_2014.jpg -> 0_2014.jpg
Foo_1_Bar_2014.jpg -> 1_2014.jpg
Foo_2_Bar_2014.jpg -> 2_2014.jpg
Foo_3_Bar_2014.jpg -> 3_2014.jpg
Foo_4_Bar_2014.jpg -> 4_2014.jpg
Foo_5_Bar_2014.jpg -> 5_2014.jpg
Foo_6_Bar_2014.jpg -> 6_2014.jpg
Foo_7_Bar_2014.jpg -> 7_2014.jpg
Foo_8_Bar_2014.jpg -> 8_2014.jpg
Foo_9_Bar_2014.jpg -> 9_2014.jpg
method 1 - string manipulation: 421 ms

renamed examples:
Foo_0_Bar_2014.jpg -> 0_2014.jpg
Foo_1_Bar_2014.jpg -> 1_2014.jpg
Foo_2_Bar_2014.jpg -> 2_2014.jpg
Foo_3_Bar_2014.jpg -> 3_2014.jpg
Foo_4_Bar_2014.jpg -> 4_2014.jpg
Foo_5_Bar_2014.jpg -> 5_2014.jpg
Foo_6_Bar_2014.jpg -> 6_2014.jpg
Foo_7_Bar_2014.jpg -> 7_2014.jpg
Foo_8_Bar_2014.jpg -> 8_2014.jpg
Foo_9_Bar_2014.jpg -> 9_2014.jpg
method 2 - regular expressions: 796 ms

Also, I think it's completely pointless, because actual I/O (getting file name, renaming file) will be much slower than any CPU string manipulation, in your example. So to answer your questions:

  1. I don't see any superior way, I/O is what's slow, don't bother with superiority
  2. regex object wasn't expensive in my experience, within 2x slow-down vs manual method, that's constant slow-down and negligible, compared to how much work it saves
  3. How many std::regex objects for how many regex_match calls? Depends on the amount of regex_match calls: more matches there are, more it's worth to create specific std::regex object. However this will be very library-dependant. If there are lots of match calls, create separate, if you are not sure, do not bother.
like image 188
peenut Avatar answered Oct 25 '22 23:10

peenut


Why dont you use split to split the string between letters and numbers:

Regex.Split(fileName, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");

then get whichever index you need for the numbers, maybe using a Where clause, to find the ones increasing in value, while the others indices are matching, then you can use .Last() to get the extension.

like image 28
maksymiuk Avatar answered Oct 25 '22 23:10

maksymiuk