Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently sorting a String by the number of occurrences of each of its characters

I'm trying to sort a string by the number of occurrences of each of its characters, with the most frequent at the beginning and rarest at the end. After sorting, I would need to remove all the character repeats. Because examples are always clearer, the program should do the following:

String str = "aebbaaahhhhhhaabbbccdfffeegh";
String output = sortByCharacterOccurrencesAndTrim(str);

In this case, the 'sortByCharacterOccurrencesAndTrim' method should return:

String output = "habefcdg"

In the case where 2 characters have the same occurrence, their order in the returned string doesn't matter. So "habefcdg" can also equal "habfecgd", because both 'f' and 'e' occur 3 times and both 'd' and 'g' occur once.

"habefcdg" would effectively be the same as "habfecgd"

Note: I would like to point out that performance matters in this case, so I would prefer the most efficient method possible. I say this because the string length can range from 1 to the max length (which I think is the same as Integer.MAX_VALUE, but not sure), so I want to minimize any potential bottlenecks.

like image 353
thearchitector Avatar asked Sep 30 '16 03:09

thearchitector


People also ask

How can I sort a string by the number of characters?

A simple approach will be to use sorting algorithms like quick sort or merge sort and sort the input string and print it. // Hash array to keep count of characters. // initialized to zero.

What is the complexity of sorting strings by comparison?

Bookmark this question. Show activity on this post. Sorting strings by comparisons (e.g. standard QuickSort + strcmp-like function) may be a bit slow, especially for long strings sharing a common prefix (the comparison function takes O (s) time, where s is the length of string), thus a standard solution has the complexity of O (s * nlog n).

Is there a way to sort a map by number of occurrences?

If you use Java's TreeMap implementation of the Map it will keep your values sorted by key. There is also a constructor where you can pass a custom implementation of Comparator if you need your way of sorting. Thank you. But output is sorted in alphabetical order. I want to sort by number of occurrences. So 0=4, then e =3 and so on

Is there a faster string sort algorithm?

There is the burst sort algorithm which claims to be faster, but I don't know of any implementation. There is an article Fast String Sort in C# and F# that describes the algorithm and has a reference to Sedgewick's code as well as to C# code. (disclosure: it's an article and code that I wrote based on Sedgewick's paper).


2 Answers

"A map and several while loops" is certainly the easiest way to go, and probably would be very fast. The idea is:

for each character
    increment its count in the map
Sort the map in descending order
Output the map keys in that order

But 100,000,000 map lookups could get pretty expensive. You could potentially speed it up by creating an array of 65,536 integer counts (or 128 characters if it's ASCII). Then:

for each character
    array[(int)ch] += 1

Then, you go through that array and create a map of the characters with non-zero counts:

for i = 0 to 65535
    if array[i] > 0
        map.add((char)i, array[i])

Then sort the map in descending order and output the characters in that order.

That could perform quite a bit faster, simply because indexing into an array 100,000,000 times is likely a lot faster than doing 100,000,000 map lookups.

like image 162
Jim Mischel Avatar answered Oct 31 '22 12:10

Jim Mischel


Note: This is not an answer, but simply showing performance test code of answers by Jim Mischel and Óscar López (with parallel streams in response to comment by OP).

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;
import java.util.stream.Collectors;

public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        String s = buildString();
        System.out.println("buildString: " + (System.currentTimeMillis() - start) + "ms");

        start = System.currentTimeMillis();
        String result1 = testUsingArray(s);
        System.out.println("testUsingArray: " + (System.currentTimeMillis() - start) + "ms");

        start = System.currentTimeMillis();
        String result2 = testUsingMap(s);
        System.out.println("testUsingMap: " + (System.currentTimeMillis() - start) + "ms");

        start = System.currentTimeMillis();
        String result3 = testUsingStream(s);
        System.out.println("testUsingStream: " + (System.currentTimeMillis() - start) + "ms");

        start = System.currentTimeMillis();
        String result4 = testUsingParallelStream(s);
        System.out.println("testUsingParallelStream: " + (System.currentTimeMillis() - start) + "ms");

        System.out.println(result1);
        System.out.println(result2);
        System.out.println(result3);
        System.out.println(result4);
    }
    private static String buildString() {
        Random rnd = new Random();
        char[] buf = new char[100_000_000];
        for (int i = 0; i < buf.length; i++)
            buf[i] = (char)(rnd.nextInt(127 - 33) + 33);
        return new String(buf);
    }
    private static String testUsingArray(String s) {
        int[] count = new int[65536];
        for (int i = 0; i < s.length(); i++)
            count[s.charAt(i)]++;
        List<CharCount> list = new ArrayList<>();
        for (int i = 0; i < 65536; i++)
            if (count[i] != 0)
                list.add(new CharCount((char)i, count[i]));
        Collections.sort(list);
        char[] buf = new char[list.size()];
        for (int i = 0; i < buf.length; i++)
            buf[i] = list.get(i).ch;
        return new String(buf);
    }
    private static String testUsingMap(String s) {
        Map<Character, CharCount> map = new HashMap<>();
        for (int i = 0; i < s.length(); i++)
            map.computeIfAbsent(s.charAt(i), CharCount::new).count++;
        List<CharCount> list = new ArrayList<>(map.values());
        Collections.sort(list);
        char[] buf = new char[list.size()];
        for (int i = 0; i < buf.length; i++)
            buf[i] = list.get(i).ch;
        return new String(buf);
    }
    private static String testUsingStream(String s) {
        int[] output = s.codePoints()
                        .boxed()
                        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                        .entrySet()
                        .stream()
                        .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
                        .mapToInt(Map.Entry::getKey)
                        .toArray();
        return new String(output, 0, output.length);
    }
    private static String testUsingParallelStream(String s) {
        int[] output = s.codePoints()
                        .parallel()
                        .boxed()
                        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                        .entrySet()
                        .parallelStream()
                        .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
                        .mapToInt(Map.Entry::getKey)
                        .toArray();
        return new String(output, 0, output.length);
    }
}
class CharCount implements Comparable<CharCount> {
    final char ch;
    int count;
    CharCount(char ch) {
        this.ch = ch;
    }
    CharCount(char ch, int count) {
        this.ch = ch;
        this.count = count;
    }
    @Override
    public int compareTo(CharCount that) {
        return Integer.compare(that.count, this.count); // descending
    }
}

Sample Output

buildString: 974ms
testUsingArray: 48ms
testUsingMap: 216ms
testUsingStream: 1279ms
testUsingParallelStream: 442ms
UOMP<FV{KHt`(-q6;Gl'R9nxy+.Y[=2a7^45v?E@e,>|AD_\ILpJ}8sow"Z&bCmNW1$!Sd0c]~g3BjX#fz:Q*Tkui%/r)h
UOMP<FV{KHt`(-q6;Gl'R9nxy+.Y[=2a7^45v?E@e,>|AD_\ILpJ}8sow"Z&bCmNW1$!Sd0c]~g3BjX#fz:Q*Tkui%/r)h
UOMP<FV{KHt`(-q6;Gl'R9nxy+.Y[=2a7^45v?E@e,>|AD_\ILpJ}8sow"Z&bCmNW1$!Sd0c]~g3BjX#fz:Q*Tkui%/r)h
UOMP<FV{KHt`(-q6;Gl'R9nxy+.Y[=2a7^45v?E@e,>|AD_\ILpJ}8sow"Z&bCmNW1$!Sd0c]~g3BjX#fz:Q*Tkui%/r)h
like image 24
Andreas Avatar answered Oct 31 '22 14:10

Andreas