I've updated the question with newer code suggested by fellow SO users and will be clarifying any ambiguous text that was previously there.
I only have access to the log files generated by the application in question. Thus I'm constrained to work within the content of the log files and no solutions out of that scope is quite possible. I'll have modified the sample data a little bit. I would like to point out the following key variables.
Thread ID
- Ranges from 0..19 - A thread is used multiple times. Thus ScriptExecThread(2)
could show up multiple times within the logs.
Script
- Every thread will run a script on a particular file. Once again, the same script may run on the same thread but won't run on the same thread AND file.
File
- Every Thread ID
runs a Script
on a File
. If Thread(10)
is running myscript.script
on myfile.file
, then that EXACT line won't be executed again. A successful example using the above example would be something like so.
------START------
Thread(10) starting myscript.script on myfile.file
Thread(10) finished myscript.script on myfile.file
------END-------
An unsuccessful example using the above example would be:
------START------
Thread(10) starting myscript.script on myfile.file
------END------
Before addressing my query I'll give a rundown of the code used and the desired behavior.
I'm currently parsing large log files (take an average of 100k - 600k lines) and am attempting to retrieve certain information in a certain order. I've worked out the boolean algebra behind my request which seemed to work on paper but no so much on code (I must've missed something blatantly obvious). I would like to inform in advance that the code is not in any shape or form optimized, right now I simply want to get it to work.
In this log file you can see that certain threads hang up if they started but never finished. The number of possible thread IDs ranges. Here is some pseudo code:
REGEX = "ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)" //in java
Set started, finished
for (int i=log.size()-1; i >=0; i--) {
if(group(2).contains("starting")
started.add(log.get(i))
else if(group(2).contains("finished")
finished.add(log.get(i)
}
started.removeAll(finished);
Set<String> started = new HashSet<String>(), finished = new HashSet<String>();
for(int i = JAnalyzer.csvlog.size()-1; i >= 0; i--) {
if(JAnalyzer.csvlog.get(i).contains("ScriptExecThread"))
JUtility.hasThreadHung(JAnalyzer.csvlog.get(i), started, finished);
}
started.removeAll(finished);
commonTextArea.append("Number of threads hung: " + noThreadsHung + "\n");
for(String s : started) {
JLogger.appendLineToConsole(s);
commonTextArea.append(s+"\n");
}
public static boolean hasThreadHung(final String str, Set<String> started, Set<String> finished) {
Pattern r = Pattern.compile("ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)");
Matcher m = r.matcher(str);
boolean hasHung = m.find();
if(m.group(2).contains("starting"))
started.add(str);
else if (m.group(2).contains("finished"))
finished.add(str);
System.out.println("Started size: " + started.size());
System.out.println("Finished size: " + finished.size());
return hasHung;
}
ScriptExecThread(1) started on afile.xyz
ScriptExecThread(2) started on bfile.abc
ScriptExecThread(3) started on cfile.zyx
ScriptExecThread(4) started on dfile.zxy
ScriptExecThread(5) started on efile.yzx
ScriptExecThread(1) finished on afile.xyz
ScriptExecThread(2) finished on bfile.abc
ScriptExecThread(3) finished on cfile.zyx
ScriptExecThread(4) finished on dfile.zxy
ScriptExecThread(5) finished on efile.yzy
ScriptExecThread(1) started on bfile.abc
ScriptExecThread(2) started on dfile.zxy
ScriptExecThread(3) started on afile.xyz
ScriptExecThread(1) finished on bfile.abc
END OF LOG
If you example this, you'll noticed Threads number 2 & 3 started but failed to finished (reason is not necessary, I simply need to get the line).
09.08 15:06.53, ScriptExecThread(7),Info,########### starting
09.08 15:06.54, ScriptExecThread(18),Info,###################### starting
09.08 15:06.54, ScriptExecThread(13),Info,######## finished in #########
09.08 15:06.54, ScriptExecThread(13),Info,########## starting
09.08 15:06.55, ScriptExecThread(9),Info,##### finished in ########
09.08 15:06.55, ScriptExecThread(0),Info,####finished in ###########
09.08 15:06.55, ScriptExecThread(19),Info,#### finished in ########
09.08 15:06.55, ScriptExecThread(8),Info,###### finished in 2777 #########
09.08 15:06.55, ScriptExecThread(19),Info,########## starting
09.08 15:06.55, ScriptExecThread(8),Info,####### starting
09.08 15:06.55, ScriptExecThread(0),Info,##########starting
09.08 15:06.55, ScriptExecThread(19),Info,Post ###### finished in #####
09.08 15:06.55, ScriptExecThread(0),Info,###### finished in #########
09.08 15:06.55, ScriptExecThread(19),Info,########## starting
09.08 15:06.55, ScriptExecThread(0),Info,########### starting
09.08 15:06.55, ScriptExecThread(9),Info,########## starting
09.08 15:06.56, ScriptExecThread(1),Info,####### finished in ########
09.08 15:06.56, ScriptExecThread(17),Info,###### finished in #######
09.08 15:06.56, ScriptExecThread(17),Info,###################### starting
09.08 15:06.56, ScriptExecThread(1),Info,########## starting
Currently the code simply displays the entire log file with lines started with "starting". Which does somewhat make sense when I review the code.
I have removed any redundant information that I don't wish to display. If there is anything that I might have left out feel free to let me know and I'll add it.
Basically, the challenge in text analysis is decoding the ambiguity of human language, while in text analytics it's detecting patterns and trends from the numerical results.
The Naive Bayes family of statistical algorithms are some of the most used algorithms in text classification and text analysis, overall.
If I understand correctly, you have large files and are trying to find patterns of the form "X started (but no mention of X finished)" for all numerical values of X.
If I were to do this, I would use this pseudocode:
Pattern p = Pattern.compile(
"ScriptExecThread\\(([0-9]+).*?(finished|started)");
Set<Integer> started, finished;
Search for p; for each match m,
int n = Integer.parseInt(m.group(1));
if (m.group(2).equals("started")) started.add(n);
else finished.add(n);
started.removeAll(finished); // found 'em: contains started-but-not-finished
This requires a single regex pass over each file and an O(size-of-finished) set substraction; it should be 20x faster than your current approach. The regex would use optional (|
) matching to look for both alternatives at once, reducing the number of passes.
Edit: made regex explicit. Compiling the regex once instead of once-per-line should shave off some extra run-time.
Edit 2: implemented pseudocode, works for me
Edit 3: replaced implementation to show file & line. Reduces memory requirements (does not load whole file into memory); but printing the line does require all "start" lines to be stored.
public class T {
public static Collection<String> findHung(Iterable<String> data) {
Pattern p = Pattern.compile(
"ScriptExecThread\\(([0-9]+).*?(finished|starting)");
HashMap<Integer, String> started = new HashMap<Integer, String>();
Set<Integer> finished = new HashSet<Integer>();
for (String d : data) {
Matcher m = p.matcher(d);
if (m.find()) {
int n = Integer.parseInt(m.group(1));
if (m.group(2).equals("starting")) started.put(n, d);
else finished.add(n);
}
}
for (int f : finished) {
started.remove(f);
}
return started.values();
}
static Iterable<String> readFile(String path, String encoding) throws IOException {
final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
return new Iterable<String>() {
public Iterator<String> iterator() { return sc; }
};
}
public static void main(String[] args) throws Exception {
for (String fileName : args) {
for (String s : findHung(readFile(fileName, "UTF-8"))) {
System.out.println(fileName + ": '" + s + "' unfinished");
}
}
}
}
Input: sample data above, as the first argument, called "data.txt". The same data in another file called "data2.txt" as the second argument (javac T.java && java T data.txt data2.txt
). Output:
data.txt: ' 09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data.txt: ' 09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished
data2.txt: ' 09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data2.txt: ' 09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished
Keeping two separate sets of started
and finished
threads (as described by @tucuxi) can't work. If a thread with ID 5 starts, runs, and finishes, then 5 will appear in the finished
set, forever. If another thread with ID 5 starts, and hangs, it won't be reported.
Suppose, though, for a moment, that thread IDs aren't reused. Every thread ever created receives a new ID. By keeping separate started
and finished
sets, you'll have hundreds of thousands of elements in each by the time you're done. Performance on those data structures is proportional to what they've got in them at the time of the operation. It's unlikely that performance will matter for your use case, but if you were performing more expensive operations, or running on data 100 times larger, it might.
Preamble out of the way, here is a working version of @tucuxi's code:
import java.util.*;
import java.io.*;
import java.util.regex.*;
public class T {
public static Collection<String> findHung(Iterable<String> data) {
Pattern p = Pattern.compile(
"ScriptExecThread\\(([0-9]+).*?(finished|starting)");
HashMap<Integer, String> running = new HashMap<Integer, String>();
for (String d : data) {
Matcher m = p.matcher(d);
if (m.find()) {
int n = Integer.parseInt(m.group(1));
if (m.group(2).equals("starting"))
running.put(n, d);
else
running.remove(n);
}
}
return running.values();
}
static Iterable<String> readFile(String path, String encoding) throws IOException {
final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
return new Iterable<String>() {
public Iterator<String> iterator() { return sc; }
};
}
public static void main(String[] args) throws Exception {
for (String fileName : args) {
for (String s : findHung(readFile(fileName, "UTF-8"))) {
System.out.println(fileName + ": '" + s + "' unfinished");
}
}
}
}
Note that I've dropped the finished
set, and the HashMap
is now called running
. When new threads start, they go in, and when the thread finishes, it is pulled out. This means that the HashMap
will always be the size of the number of currently running threads, which will always be less than (or equal) to the total number of threads ever run. So the operations on it will be faster. (As a pleasant side effect, you can now keep track of how many threads are running on a log line by log line basis, which might be useful in determining why the threads are hanging.)
Here's a Python program I used to generate huge test cases:
#!/usr/bin/python
from random import random, choice
from datetime import datetime
import tempfile
all_threads = set([])
running = []
hung = []
filenames = { }
target_thread_count = 16
hang_chance = 0.001
def log(id, msg):
now = datetime.now().strftime("%m:%d %H:%M:%S")
print "%s, ScriptExecThread(%i),Info,%s" % (now, id, msg)
def new_thread():
if len(all_threads)>0:
for t in range(0, 2+max(all_threads)):
if t not in all_threads:
all_threads.add(t)
return t
else:
all_threads.add(0)
return 0
for i in range(0, 100000):
if len(running) > target_thread_count:
new_thread_chance = 0.25
else:
new_thread_chance = 0.75
pass
if random() < new_thread_chance:
t = new_thread()
name = next(tempfile._get_candidate_names())+".txt"
filenames[t] = name
log(t, "%s starting" % (name,))
if random() < hang_chance:
hung.append(t)
else:
running.append(t)
elif len(running)>0:
victim = choice(running)
all_threads.remove(victim)
running.remove(victim)
log(t, "%s finished" % (filenames[victim],))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With