
Scanner.findInLine() leaks memory massively

I'm running a simple Scanner to parse a string, but I've discovered that if the code is called often enough I get an OutOfMemoryError. It runs as part of the constructor of an object that is built repeatedly for an array of strings:

Edit: Here's the full constructor for more information; nothing else touches the Scanner outside the try/finally block.

    public Header(String headerText) {
        char[] charArr;
        charArr = headerText.toCharArray();
        // Check that all characters are printable characters
        if (charArr.length > 0 && !commonMethods.isPrint(charArr)) {
            throw new IllegalArgumentException(headerText);
        }
        // Check for header suffix
        Scanner sc = new Scanner(headerText);
        MatchResult res;
        try {
            sc.findInLine("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");
            res = sc.match();
        } finally {
            sc.close();
        }

        if (res.group(1) == null || res.group(1).isEmpty()) {
            throw new IllegalArgumentException("Missing header keyword");     // No header keyword to store
        } else {
            mnemonic = res.group(1).toLowerCase();                            // Store header
        }
        if (res.group(2) == null || res.group(2).isEmpty()) {
            suffix = -1;
        } else {
            try {
                suffix = Integer.parseInt(res.group(2));       // Store suffix if it exists
            }  catch (NumberFormatException e) {
                throw new NumberFormatException(headerText);
            }
        }
        if (res.group(3) == null || res.group(3).isEmpty()) {
            isQuery = false;
        } else {
            if (res.group(3).equals("?")) {
                isQuery = true;
            } else {
                throw new IllegalArgumentException(headerText);
            }
        }

        // If command was of the form *ABC, reject suffixes and prefixes
        if (mnemonic.contains("*") 
                && suffix != -1) {
            throw new IllegalArgumentException(headerText);
        }
    }

A profiler memory snapshot shows the read(Char) method called from Scanner.findInLine() allocating massive amounts of memory as I scan through a few hundred thousand strings; after a few seconds it has already allocated over 38 MB.


I would have thought that calling close() on the Scanner after using it in the constructor would make the old object eligible for garbage collection, but somehow it stays alive, and the read method accumulates gigabytes of data until the heap is full.

Can anybody point me in the right direction?

asked Oct 21 '22 14:10 by darkhelmet


1 Answer

You haven't posted all your code, but given that you are scanning for the same regex repeatedly, it would be much more efficient to compile a static Pattern beforehand and use this for the scanner's find:

static Pattern p = Pattern.compile("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");

and in the constructor:

sc.findInLine(p);

This may or may not be the source of the OOM issue, but it will definitely make your parsing a bit faster.
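To make the suggestion concrete, here is a minimal sketch of the parsing step with the Pattern compiled once. The class name HeaderParser, the method parse and the sample input are made up for illustration and are not from the question's code. The sketch also checks the return value of findInLine, since calling match() after a failed find throws IllegalStateException.

```java
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original Header class.
final class HeaderParser {
    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern HEADER =
            Pattern.compile("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");

    /** Returns {group 1, group 2, group 3}, or null if the text does not match. */
    static String[] parse(String headerText) {
        // try-with-resources closes the Scanner even if an exception is thrown.
        try (Scanner sc = new Scanner(headerText)) {
            // findInLine returns null when the pattern is not found in the line.
            if (sc.findInLine(HEADER) == null) {
                return null;
            }
            MatchResult res = sc.match();
            return new String[] { res.group(1), res.group(2), res.group(3) };
        }
    }
}
```

With a made-up header such as "MEAS12?", parse returns the three groups "MEAS", "12" and "?"; for "123", which contains no letter, it returns null instead of throwing.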

Related: java.util.regex - importance of Pattern.compile()?

Update: after you posted more of your code, I see some other issues. If you're calling this constructor repeatedly, it means you are probably tokenizing or breaking up the input beforehand. Why create a new Scanner to parse each line? They are expensive; you should be using the same Scanner to parse the entire file, if possible. Using one Scanner with a precompiled Pattern will be much faster than what you are doing now, which is creating a new Scanner and a new Pattern for each line you are parsing.
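As a rough sketch of the single-Scanner approach, assuming the headers arrive one per line in a single input: the class HeaderLines, the method mnemonics and the sample input are hypothetical, and the result is reduced to the lower-cased first group to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
final class HeaderLines {
    private static final Pattern HEADER =
            Pattern.compile("(\\D*[a-zA-Z]+)(\\d*)(\\D*)");

    /** Collects the lower-cased first group of every line that matches. */
    static List<String> mnemonics(String input) {
        List<String> out = new ArrayList<>();
        // One Scanner for the whole input instead of one per line.
        try (Scanner sc = new Scanner(input)) {
            while (sc.hasNextLine()) {
                // Searches only the current line; null means no match on it.
                if (sc.findInLine(HEADER) != null) {
                    out.add(sc.match().group(1).toLowerCase());
                }
                // Skip the rest of the line. The guard matters for a final line
                // without a terminator that findInLine has already consumed.
                if (sc.hasNextLine()) {
                    sc.nextLine();
                }
            }
        }
        return out;
    }
}
```

The same loop works with a Scanner built from a File or an InputStream; only the constructor argument changes.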

answered Oct 27 '22 23:10 by Andrew Mao
