Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Text Search based algorithm not behaving as intended

Update

I've updated the question with newer code suggested by fellow SO users and will be clarifying any ambiguous text that was previously there.

Update #2

I only have access to the log files generated by the application in question. Thus I'm constrained to work within the content of the log files and no solutions out of that scope is quite possible. I'll have modified the sample data a little bit. I would like to point out the following key variables.

Thread ID - Ranges from 0..19 - A thread is used multiple times. Thus ScriptExecThread(2) could show up multiple times within the logs.

Script - Every thread will run a script on a particular file. Once again, the same script may run on the same thread but won't run on the same thread AND file.

File - Every Thread ID runs a Script on a File. If Thread(10) is running myscript.script on myfile.file, then that EXACT line won't be executed again. A successful example using the above example would be something like so.

------START------

Thread(10) starting myscript.script on myfile.file

Thread(10) finished myscript.script on myfile.file

------END-------

An unsuccessful example using the above example would be:

------START------

Thread(10) starting myscript.script on myfile.file

------END------


Before addressing my query I'll give a rundown of the code used and the desired behavior.


Summary

I'm currently parsing large log files (take an average of 100k - 600k lines) and am attempting to retrieve certain information in a certain order. I've worked out the boolean algebra behind my request which seemed to work on paper but no so much on code (I must've missed something blatantly obvious). I would like to inform in advance that the code is not in any shape or form optimized, right now I simply want to get it to work.

In this log file you can see that certain threads hang up if they started but never finished. The number of possible thread IDs ranges. Here is some pseudo code:

    REGEX = "ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)" //in java
    Set started, finished
    for (int i=log.size()-1; i >=0; i--) {
    if(group(2).contains("starting")
        started.add(log.get(i))
    else if(group(2).contains("finished")
        finished.add(log.get(i)    
    }
    started.removeAll(finished);

Search Hung Threads

Set<String> started = new HashSet<String>(), finished = new HashSet<String>();
            
for(int i = JAnalyzer.csvlog.size()-1; i >= 0; i--) {
    if(JAnalyzer.csvlog.get(i).contains("ScriptExecThread")) 
        JUtility.hasThreadHung(JAnalyzer.csvlog.get(i), started, finished);     
}
started.removeAll(finished);
            
commonTextArea.append("Number of threads hung: " + noThreadsHung + "\n");
for(String s : started) { 
    JLogger.appendLineToConsole(s);
    commonTextArea.append(s+"\n");
}

Has Thread Hung

public static boolean hasThreadHung(final String str, Set<String> started, Set<String> finished) {      
    Pattern r = Pattern.compile("ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)");
    Matcher m = r.matcher(str);
    boolean hasHung = m.find();
    
        if(m.group(2).contains("starting"))
            started.add(str);
        else if (m.group(2).contains("finished"))
            finished.add(str);
        
        System.out.println("Started size: " + started.size());
        System.out.println("Finished size: " + finished.size());
        
    return hasHung;
}

Example Data

ScriptExecThread(1) started on afile.xyz

ScriptExecThread(2) started on bfile.abc

ScriptExecThread(3) started on cfile.zyx

ScriptExecThread(4) started on dfile.zxy

ScriptExecThread(5) started on efile.yzx

ScriptExecThread(1) finished on afile.xyz

ScriptExecThread(2) finished on bfile.abc

ScriptExecThread(3) finished on cfile.zyx

ScriptExecThread(4) finished on dfile.zxy

ScriptExecThread(5) finished on efile.yzy

ScriptExecThread(1) started on bfile.abc

ScriptExecThread(2) started on dfile.zxy

ScriptExecThread(3) started on afile.xyz

ScriptExecThread(1) finished on bfile.abc

END OF LOG

If you example this, you'll noticed Threads number 2 & 3 started but failed to finished (reason is not necessary, I simply need to get the line).

Sample Data

09.08 15:06.53, ScriptExecThread(7),Info,########### starting

09.08 15:06.54, ScriptExecThread(18),Info,###################### starting

09.08 15:06.54, ScriptExecThread(13),Info,######## finished in #########

09.08 15:06.54, ScriptExecThread(13),Info,########## starting

09.08 15:06.55, ScriptExecThread(9),Info,##### finished in ########

09.08 15:06.55, ScriptExecThread(0),Info,####finished in ###########

09.08 15:06.55, ScriptExecThread(19),Info,#### finished in ########

09.08 15:06.55, ScriptExecThread(8),Info,###### finished in 2777 #########

09.08 15:06.55, ScriptExecThread(19),Info,########## starting

09.08 15:06.55, ScriptExecThread(8),Info,####### starting

09.08 15:06.55, ScriptExecThread(0),Info,##########starting

09.08 15:06.55, ScriptExecThread(19),Info,Post ###### finished in #####

09.08 15:06.55, ScriptExecThread(0),Info,###### finished in #########

09.08 15:06.55, ScriptExecThread(19),Info,########## starting

09.08 15:06.55, ScriptExecThread(0),Info,########### starting

09.08 15:06.55, ScriptExecThread(9),Info,########## starting

09.08 15:06.56, ScriptExecThread(1),Info,####### finished in ########

09.08 15:06.56, ScriptExecThread(17),Info,###### finished in #######

09.08 15:06.56, ScriptExecThread(17),Info,###################### starting

09.08 15:06.56, ScriptExecThread(1),Info,########## starting


Currently the code simply displays the entire log file with lines started with "starting". Which does somewhat make sense when I review the code.

I have removed any redundant information that I don't wish to display. If there is anything that I might have left out feel free to let me know and I'll add it.

like image 477
Juxhin Avatar asked Apr 27 '15 13:04

Juxhin


People also ask

What are the main challenges of text analysis?

Basically, the challenge in text analysis is decoding the ambiguity of human language, while in text analytics it's detecting patterns and trends from the numerical results.

Which algorithm is suitable for text data?

The Naive Bayes family of statistical algorithms are some of the most used algorithms in text classification and text analysis, overall.


2 Answers

If I understand correctly, you have large files and are trying to find patterns of the form "X started (but no mention of X finished)" for all numerical values of X.

If I were to do this, I would use this pseudocode:

Pattern p = Pattern.compile(
   "ScriptExecThread\\(([0-9]+).*?(finished|started)");
Set<Integer> started, finished;
Search for p; for each match m,
     int n = Integer.parseInt(m.group(1));
     if (m.group(2).equals("started")) started.add(n);
     else finished.add(n);
started.removeAll(finished); // found 'em: contains started-but-not-finished

This requires a single regex pass over each file and an O(size-of-finished) set substraction; it should be 20x faster than your current approach. The regex would use optional (|) matching to look for both alternatives at once, reducing the number of passes.

Edit: made regex explicit. Compiling the regex once instead of once-per-line should shave off some extra run-time.


Edit 2: implemented pseudocode, works for me


Edit 3: replaced implementation to show file & line. Reduces memory requirements (does not load whole file into memory); but printing the line does require all "start" lines to be stored.

public class T {

    public static Collection<String> findHung(Iterable<String> data) {
        Pattern p = Pattern.compile(   
            "ScriptExecThread\\(([0-9]+).*?(finished|starting)");
        HashMap<Integer, String> started = new HashMap<Integer, String>();
        Set<Integer> finished = new HashSet<Integer>();
        for (String d : data) {
            Matcher m = p.matcher(d);
            if (m.find()) {
                int n = Integer.parseInt(m.group(1));
                if (m.group(2).equals("starting")) started.put(n, d);
                else finished.add(n);
            }                
        }
        for (int f : finished) {
            started.remove(f);
        }
        return started.values();
    }

    static Iterable<String> readFile(String path, String encoding) throws IOException {
        final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
        return new Iterable<String>() {
            public Iterator<String> iterator() { return sc; }
        };
    }

    public static void main(String[] args) throws Exception {
        for (String fileName : args) {
            for (String s : findHung(readFile(fileName, "UTF-8"))) {
                System.out.println(fileName + ": '" + s + "' unfinished");
            }
        }
    }
}

Input: sample data above, as the first argument, called "data.txt". The same data in another file called "data2.txt" as the second argument (javac T.java && java T data.txt data2.txt). Output:

data.txt: '    09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data.txt: '    09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished
data2.txt: '    09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data2.txt: '    09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished
like image 159
tucuxi Avatar answered Nov 15 '22 19:11

tucuxi


Keeping two separate sets of started and finished threads (as described by @tucuxi) can't work. If a thread with ID 5 starts, runs, and finishes, then 5 will appear in the finished set, forever. If another thread with ID 5 starts, and hangs, it won't be reported.

Suppose, though, for a moment, that thread IDs aren't reused. Every thread ever created receives a new ID. By keeping separate started and finished sets, you'll have hundreds of thousands of elements in each by the time you're done. Performance on those data structures is proportional to what they've got in them at the time of the operation. It's unlikely that performance will matter for your use case, but if you were performing more expensive operations, or running on data 100 times larger, it might.

Preamble out of the way, here is a working version of @tucuxi's code:

import java.util.*;
import java.io.*;
import java.util.regex.*;

public class T {
    public static Collection<String> findHung(Iterable<String> data) {
        Pattern p = Pattern.compile(
            "ScriptExecThread\\(([0-9]+).*?(finished|starting)");
        HashMap<Integer, String> running = new HashMap<Integer, String>();
        for (String d : data) {
            Matcher m = p.matcher(d);
            if (m.find()) {
                int n = Integer.parseInt(m.group(1));
                if (m.group(2).equals("starting"))
                    running.put(n, d);
                else
                    running.remove(n);
            }
        }
        return running.values();
    }

    static Iterable<String> readFile(String path, String encoding) throws IOException {
        final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
        return new Iterable<String>() {
            public Iterator<String> iterator() { return sc; }
        };
    }

    public static void main(String[] args) throws Exception {
        for (String fileName : args) {
            for (String s : findHung(readFile(fileName, "UTF-8"))) {
                System.out.println(fileName + ": '" + s + "' unfinished");
            }
        }
    }
}

Note that I've dropped the finished set, and the HashMap is now called running. When new threads start, they go in, and when the thread finishes, it is pulled out. This means that the HashMap will always be the size of the number of currently running threads, which will always be less than (or equal) to the total number of threads ever run. So the operations on it will be faster. (As a pleasant side effect, you can now keep track of how many threads are running on a log line by log line basis, which might be useful in determining why the threads are hanging.)

Here's a Python program I used to generate huge test cases:

#!/usr/bin/python

from random import random, choice
from datetime import datetime
import tempfile

all_threads = set([])
running = []
hung = []
filenames = { }

target_thread_count = 16
hang_chance = 0.001

def log(id, msg):
    now = datetime.now().strftime("%m:%d %H:%M:%S")
    print "%s, ScriptExecThread(%i),Info,%s" % (now, id, msg)

def new_thread():
    if len(all_threads)>0:
        for t in range(0, 2+max(all_threads)):
            if t not in all_threads:
                all_threads.add(t)
                return t
    else:
        all_threads.add(0)
        return 0

for i in range(0, 100000):
    if len(running) > target_thread_count:
        new_thread_chance = 0.25
    else:
        new_thread_chance = 0.75
        pass

    if random() < new_thread_chance:
        t = new_thread()
        name = next(tempfile._get_candidate_names())+".txt"
        filenames[t] = name
        log(t, "%s starting" % (name,))
        if random() < hang_chance:
            hung.append(t)
        else:
            running.append(t)
    elif len(running)>0:
        victim = choice(running)
        all_threads.remove(victim)
        running.remove(victim)
        log(t, "%s finished" % (filenames[victim],))
like image 23
Jay Kominek Avatar answered Nov 15 '22 20:11

Jay Kominek