Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Natural Sort of Directory Filenames in C++

I have a directory listing for which I want to retrieve the filenames and put them in a vector of strings such that they are sorted in a "natural" manner. e.g. { "10.txt" "0.txt" "2.txt" "1.m" "Jan12" "July13.txt" "Nov25.txt" "Jane" "John" } should be {"0.txt" "1.m" "2.txt" "10.txt" "Jan12" "July13.txt" "Nov25.txt" "Jane" "John" } . What is the easiest way to do this?

Elaborating on "natural" we assume for a string made up from parts of numbers (N) and text (T) such that ...(N)(T)... , then for ...(N1)(T1)... and ...(N2)(T2)... will be (N1<N2) (<) (T1<T2) where (<) implies left term precedence over right term. In this case, numbers take precedence over text fields if they are in the same position in the string, i.e. 1.z (<) 1_t.txt .

Is there already a library function to do that kind of sort for alphanumeric strings, or directory entries?

DESIRED order in which files should come. The file names will be stored in a vector of strings.

Abhinav@Abhinav-PC /cygdrive/c/AbhinavSamples/shell
$ ls -lv
total 8
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:51 1.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:55 1_t.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:50 3.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:51 4.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:53 10.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:56 10_t.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:56 13.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:53 20.txt

**Simple Sort**
Abhi@Abhi-PC /cygdrive/c/AbhinavSamples/shell
$ ls -l
total 8
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:51 1.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:53 10.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:56 10_t.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:56 13.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:55 1_t.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:53 20.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:50 3.txt
-rw-r--r--+ 1 Abhinav None 2 Mar 17 00:51 4.txt
like image 487
Abhinav Avatar asked Mar 16 '12 19:03

Abhinav


3 Answers

You need a function which makes the natural comparison between two strings. After that you can use std::sort with the comparison function as third argument (as @chac already pointed out). In the following I tried to implement such a function in a recursive way. Note, that it can handle arbitrary filenames which don't need to begin with a number part and end with a string part:

bool compareNat(const std::string& a, const std::string& b)
{
    if (a.empty())
        return true;
    if (b.empty())
        return false;
    if (std::isdigit(a[0]) && !std::isdigit(b[0]))
        return true;
    if (!std::isdigit(a[0]) && std::isdigit(b[0]))
        return false;
    if (!std::isdigit(a[0]) && !std::isdigit(b[0]))
    {
        if (std::toupper(a[0]) == std::toupper(b[0]))
            return compareNat(a.substr(1), b.substr(1));
        return (std::toupper(a[0]) < std::toupper(b[0]));
    }

    // Both strings begin with digit --> parse both numbers
    std::istringstream issa(a);
    std::istringstream issb(b);
    int ia, ib;
    issa >> ia;
    issb >> ib;
    if (ia != ib)
        return ia < ib;

    // Numbers are the same --> remove numbers and recurse
    std::string anew, bnew;
    std::getline(issa, anew);
    std::getline(issb, bnew);
    return (compareNat(anew, bnew));
}

Here's a simple test case:

#include <iostream> // std::cout
#include <string>
#include <algorithm> // std::sort, std::copy
#include <iterator> // std::ostream_iterator
#include <sstream> // std::istringstream
#include <vector>
#include <cctype> // std::isdigit

int main()
{
    std::vector<std::string> str;
    str.push_back("20.txt");
    str.push_back("10.txt");
    str.push_back("1.txt");
    str.push_back("z2.txt");
    str.push_back("z10.txt");
    str.push_back("z100.txt");
    str.push_back("1_t.txt");

    std::sort(str.begin(), str.end(), compareNat);
    std::copy(str.begin(), str.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}

The result is:

1.txt
1_t.txt
10.txt
20.txt
z2.txt
z10.txt
z100.txt
like image 53
Christian Ammer Avatar answered Oct 20 '22 12:10

Christian Ammer


There is a function that does exactly what you want in glibc. Unfortunately it is C, not C++, so if you can live with that here is the simplest possible solution "out of the box", without reimplementing anything and reinventing the wheel. BTW: this is exactly as ls -lv is implemented. The most important part of it is the versionsort function which does the natural sort for you. It is used here as a comparison function for scandir. The simple example below prints all files/directories in current directory sorted as you wish.

#define _GNU_SOURCE
#include <dirent.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    struct dirent **namelist;
    int n,i;

    n = scandir(".", &namelist, 0, versionsort);
    if (n < 0)
        perror("scandir");
    else
    {
        for(i =0 ; i < n; ++i)
        {
            printf("%s\n", namelist[i]->d_name);
            free(namelist[i]);
        }
        free(namelist);
    }
    return 0;
}
like image 26
sirgeorge Avatar answered Oct 20 '22 13:10

sirgeorge


I have come across an algorithm which works pretty nicely: http://sourcefrog.net/projects/natsort/

I have modified the source a little to meet my needs:

  1. pass std::string as parameter.
  2. use with std::sort and std::vector.

My version of source is here.

A use case example:

#include <iostream>
#include <vector>
#include <algorithm>
#include "strnatcmp.hpp"

int main(){
    std::vector<std::string> files;

    files.push_back("20.txt");
    files.push_back("10.txt");
    files.push_back("1.txt");
    files.push_back("z2.txt");
    files.push_back("z10.txt");
    files.push_back("z100.txt");
    files.push_back("1_t.txt ");
    files.push_back("ABc");
    files.push_back("aBCd");
    files.push_back("aBc");
    files.push_back("aaa");
    files.push_back("aBcd");
    files.push_back("aaA");

    std::sort(files.begin(),files.end(),compareNat);

    for(int i=0;i<(int)files.size();i++)std::cout<< files[i]+"\n";

    return 0;
}

Output:

1.txt
1_t.txt 
10.txt
20.txt
aaa
aaA
ABc
aBc
aBCd
aBcd
z2.txt
z10.txt
z100.txt

This is the strnatcmp.hpp header.

like image 24
Jahid Avatar answered Oct 20 '22 12:10

Jahid