First, I will quickly describe my motivation for this and the actual problem:
I deal with large batches of files constantly and more specifically, I find myself having to rename them according to the following rule:
They may all contain words and digits, but only one set of digits is incrementing and not 'constant'. I need to extract those and only those digits and rename the files accordingly. For example:
Foo_1_Bar_2015.jpg
Foo_2_Bar_2015.jpg
Foo_03_Bar_2015.jpg
Foo_4_Bar_2015.jpg
Will be renamed:
1.jpg
2.jpg
3.jpg or 03.jpg (The leading zero can stay or go)
4.jpg
So what we start with is a vector with std::wstring
objects for all the filenames in the specified directory. I urge you to stop reading for 3 minutes and think about how to approach this before I continue with my attempts and questions. I don't want my ideas to nudge you in one direction or another and I've always found fresh ideas are the best.
Now, here are two ways that I can think of:
1) Old style C string manipulation and comparisons:
In my mind, this entails parsing each filename and remembering each digit sequence position and length. This is easily stored in a vector or what-not for each file. This works well (basically uses string searches with increasing offsets):
while((offset = filename_.find_first_of(L"0123456789", offset)) != filename.npos)
{
size = filename.find_first_not_of(L"0123456789", offset) - offset;
digit_locations_vec.emplace_back(offset, size);
offset += size;
}
What I have after this is a vector of (Location, Size) pairs for all the digits in the filename, constant (by using the definition in the motivation) or not.
After this, chaos ensues, as you need to cross reference the strings and find out which digits are the ones that need to be extracted. This will grow exponentially with the number of files (which tends to be huge) not to mentioned multiplied by the number of digit sequences in each string. Also, not very readable, maintainable or elegant. No go.
2) Regular Expressions
If there was ever a use for regex's, it's this. Create a regex object out of the first filename and try to match it with what comes next. Success? Instant ability to extract the required number. Failure? Add the offending filename as a new regex object and try to match against the two existing regex's. Rinse and repeat. The regex's would look something like this:
Foo_(\d+)_Bar_(\d+).jpg
or create a regex for each digit sequence separately:
Foo_(\d+)_Bar_2015.jpg
Foo_1_Bar_(\d+).jpg
The rest is cake. Just keep on matching as you go, and in the best case, it might require only one pass! Question is...
What I need to know:
1) Can you think of any other superior way to achieve this? I've been banging my head against the wall for days.
2) Although the cost of string manipulation and vector constructing\destructing may be substantial in the first method, perhaps it pales in comparison to the cost of regex objects. Second method, worst case: as many regex objects as files. Would this be disastrous with potentially thousands of files?
3) The second method can be adjusted for one of two possibilities: Few std::regex
object constructions, many regex_match
calls or the other way around. Which is more expensive, the construction of the regex object or trying to match a string with it?
For me (gcc4.6.2 32-bit optimizations O3), manual string manipulation was about 2x faster than regular expressions. Not worth the cost.
Example runnable complete code (link with boost_system and boost_regex, or change include if you have regex in the compiler already):
#include <ctime>
#include <cctype>
#include <algorithm>
#include <string>
#include <iostream>
#include <vector>
#include <sstream>
using namespace std;
#include <boost/regex.hpp>
using namespace boost;
/*
Foo_1_Bar_2015.jpg
Foo_1_Bar_2016.jpg
Foo_2_Bar_2016.jpg
Foo_2_Bar_2015.jpg
...
*/
vector<string> generateNames(int lenPerYear, int yearStart, int years);
/*
Foo_1_Bar_2015.jpg -> 1_2015.jpg
Foo_7_Bar_2016.jpg -> 7_2016.jpg
*/
void rename_method_string(const vector<string> & names, vector<string> & renamed);
void rename_method_regex(const vector<string> & names, vector<string> & renamed);
typedef void rename_method_t(const vector<string> & names, vector<string> & renamed);
void testMethod(const vector<string> & names, const string & description, rename_method_t method);
int main()
{
vector<string> names = generateNames(10000, 2014, 100);
cout << "names.size() = " << names.size() << '\n';
cout << '\n';
testMethod(names, "method 1 - string manipulation: ", rename_method_string);
cout << '\n';
testMethod(names, "method 2 - regular expressions: ", rename_method_regex);
return 0;
}
void testMethod(const vector<string> & names, const string & description, rename_method_t method)
{
vector<string> renamed(names.size());
clock_t timeStart = clock();
method(names, renamed);
clock_t timeEnd = clock();
cout << "renamed examples:\n";
for (int i = 0; i < 10 && i < names.size(); ++i)
cout << names[i] << " -> " << renamed[i] << '\n';
cout << description << 1000 * (timeEnd - timeStart) / CLOCKS_PER_SEC << " ms\n";
}
vector<string> generateNames(int lenPerYear, int yearStart, int years)
{
vector<string> result;
for (int year = yearStart, yearEnd = yearStart + years; year < yearEnd; ++year)
{
for (int i = 0; i < lenPerYear; ++i)
{
ostringstream oss;
oss << "Foo_" << i << "_Bar_" << year << ".jpg";
result.push_back(oss.str());
}
}
return result;
}
template<typename T>
bool equal_safe(T itShort, T itShortEnd, T itLong, T itLongEnd)
{
if (itLongEnd - itLong < itShortEnd - itShort)
return false;
return equal(itShort, itShortEnd, itLong);
}
void rename_method_string(const vector<string> & names, vector<string> & renamed)
{
//manually: "Foo_(\\d+)_Bar_(\\d+).jpg" -> \1_\2.jpg
const string foo = "Foo_", bar = "_Bar_", jpg = ".jpg";
for (int i = 0; i < names.size(); ++i)
{
const string & name = names[i];
//starts with foo?
if (!equal_safe(foo.begin(), foo.end(), name.begin(), name.end()))
{
renamed[i] = "ERROR no foo";
continue;
}
//extract number
auto it = name.begin() + foo.size();
for (; it != name.end() && isdigit(*it); ++it) {}
string str_num1(name.begin() + foo.size(), it);
//continues with bar?
if (!equal_safe(bar.begin(), bar.end(), it, name.end()))
{
renamed[i] = "ERROR no bar";
continue;
}
//extract number
it += bar.size();
auto itStart = it;
for (; it != name.end() && isdigit(*it); ++it) {}
string str_num2(itStart, it);
//check *.jpg
if (!equal_safe(jpg.begin(), jpg.end(), it, name.end()))
{
renamed[i] = "ERROR no .jpg";
continue;
}
renamed[i] = str_num1 + "_" + str_num2 + ".jpg";
}
}
void rename_method_regex(const vector<string> & names, vector<string> & renamed)
{
regex searching("Foo_(\\d+)_Bar_(\\d+).jpg");
smatch found;
for (int i = 0; i < names.size(); ++i)
{
if (regex_search(names[i], found, searching))
{
if (3 != found.size())
renamed[i] = "ERROR weird match";
else
renamed[i] = found[1].str() + "_" + found[2].str() + ".jpg";
}
else renamed[i] = "ERROR no match";
}
}
It produces output for me:
names.size() = 1000000
renamed examples:
Foo_0_Bar_2014.jpg -> 0_2014.jpg
Foo_1_Bar_2014.jpg -> 1_2014.jpg
Foo_2_Bar_2014.jpg -> 2_2014.jpg
Foo_3_Bar_2014.jpg -> 3_2014.jpg
Foo_4_Bar_2014.jpg -> 4_2014.jpg
Foo_5_Bar_2014.jpg -> 5_2014.jpg
Foo_6_Bar_2014.jpg -> 6_2014.jpg
Foo_7_Bar_2014.jpg -> 7_2014.jpg
Foo_8_Bar_2014.jpg -> 8_2014.jpg
Foo_9_Bar_2014.jpg -> 9_2014.jpg
method 1 - string manipulation: 421 ms
renamed examples:
Foo_0_Bar_2014.jpg -> 0_2014.jpg
Foo_1_Bar_2014.jpg -> 1_2014.jpg
Foo_2_Bar_2014.jpg -> 2_2014.jpg
Foo_3_Bar_2014.jpg -> 3_2014.jpg
Foo_4_Bar_2014.jpg -> 4_2014.jpg
Foo_5_Bar_2014.jpg -> 5_2014.jpg
Foo_6_Bar_2014.jpg -> 6_2014.jpg
Foo_7_Bar_2014.jpg -> 7_2014.jpg
Foo_8_Bar_2014.jpg -> 8_2014.jpg
Foo_9_Bar_2014.jpg -> 9_2014.jpg
method 2 - regular expressions: 796 ms
Also, I think it's completely pointless, because actual I/O (getting file name, renaming file) will be much slower than any CPU string manipulation, in your example. So to answer your questions:
Why dont you use split to split the string between letters and numbers:
Regex.Split(fileName, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");
then get whichever index you need for the numbers, maybe using a Where clause, to find the ones increasing in value, while the others indices are matching, then you can use .Last() to get the extension.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With