I have written a Perl script to process a huge number of CSV files and produce output; it completes in 0.8326 seconds.
my $opname = $ARGV[0];
my @files  = `find . -name "*${opname}*.csv" -mtime -10 -type f`;
my %hash;

foreach my $file (@files) {
    chomp $file;
    my $time = $file;
    $time =~ s/.*\~(.*?)\..*/$1/;
    open( my $in, '<', $file ) or do { warn "Can't open $file\n"; next };
    while (<$in>) {
        my $line = $_;
        chomp $line;
        my $severity = ( split(",", $line) )[6];
        next if $severity =~ m/NORMAL/i;
        $hash{$time}{$severity}++;
    }
    close($in);
}

foreach my $time ( sort { $b <=> $a } keys %hash ) {
    foreach my $severity ( keys %{ $hash{$time} } ) {
        print $time . ',' . $severity . ',' . $hash{$time}{$severity} . "\n";
    }
}
I then wrote the same logic in Java, but it takes 2600 ms, i.e. 2.6 s, to complete. My question: why is Java taking so much longer, and how can I achieve the same speed as Perl? Note: I excluded JVM initialization and class-loading time from the measurement.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MonitoringFileReader {
    static Map<String, Map<String, Integer>> store = new TreeMap<String, Map<String, Integer>>();
    static String opname;

    public static void testRead(String filepath) throws IOException {
        File file = new File(filepath);
        FileFilter fileFilter = new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                // 86400000 ms = 1 day, so despite the name this is a difference in days
                int timediffinhr = (int) ((System.currentTimeMillis() - pathname.lastModified()) / 86400000);
                if (timediffinhr < 10 && pathname.getName().endsWith(".csv") && pathname.getName().contains(opname)) {
                    return true;
                } else {
                    return false;
                }
            }
        };
        File[] listoffiles = file.listFiles(fileFilter);
        long time = System.currentTimeMillis();
        for (File mf : listoffiles) {
            String timestamp = mf.getName().split("~")[5].replace(".csv", "");
            BufferedReader br = new BufferedReader(new FileReader(mf), 1024 * 500);
            String line;
            Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
            while ((line = br.readLine()) != null) {
                String severity = line.split(",")[6];
                if (!severity.equals("NORMAL")) {
                    tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                }
            }
            br.close(); // was missing: the reader (and its 500 KB buffer) leaked once per file
            store.put(timestamp, tmp);
        }
        time = System.currentTimeMillis() - time;
        System.out.println(time + "ms");
        System.out.println(store);
    }

    public static void main(String[] args) throws IOException {
        opname = args[0];
        long time = System.currentTimeMillis();
        testRead("./SMF/data/analyser/archive");
        time = System.currentTimeMillis() - time;
        System.out.println(time + "ms");
    }
}
File name format: A~B~C~D~E~20150715080000.csv; there are around 500 files of ~1 MB each. The file contents look like:
A,B,C,D,E,F,CRITICAL,G
A,B,C,D,E,F,NORMAL,G
A,B,C,D,E,F,INFO,G
A,B,C,D,E,F,MEDIUM,G
A,B,C,D,E,F,CRITICAL,G
Java Version: 1.7
////////////////////Update///////////////////
As per the comments below, I replaced split with a precompiled regex, and performance improved a lot. I am now running the logic in a loop, and after 3-10 iterations performance is quite acceptable.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MonitoringFileReader {
    static Map<String, Map<String, Integer>> store = new HashMap<String, Map<String, Integer>>();
    static String opname = "Etis_Egypt";
    static Pattern pattern1 = Pattern.compile("(\\d+)\\."); // group(1) is the timestamp, without the trailing dot
    static Pattern pattern2 = Pattern.compile("(?:\"([^\"]*)\"|([^,]*))(?:[,])");
    static long currentsystime = System.currentTimeMillis();

    public static void testRead(String filepath) throws IOException {
        File file = new File(filepath);
        FileFilter fileFilter = new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                // 86400000 ms = 1 day, so despite the name this is a difference in days
                int timediffinhr = (int) ((currentsystime - pathname.lastModified()) / 86400000);
                if (timediffinhr < 10 && pathname.getName().endsWith(".csv") && pathname.getName().contains(opname)) {
                    return true;
                } else {
                    return false;
                }
            }
        };
        File[] listoffiles = file.listFiles(fileFilter);
        long time = System.currentTimeMillis();
        for (File mf : listoffiles) {
            Matcher matcher = pattern1.matcher(mf.getName());
            matcher.find();
            //String timestamp=mf.getName().split("~")[5].replace(".csv", "");
            String timestamp = matcher.group(1);
            BufferedReader br = new BufferedReader(new FileReader(mf));
            String line;
            Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
            while ((line = br.readLine()) != null) {
                matcher = pattern2.matcher(line);
                for (int i = 0; i < 7; i++) { // advance to the 7th field
                    matcher.find();
                }
                //String severity=line.split(",")[6];
                // group() would include the trailing comma, so "NORMAL" would never match;
                // use the capture groups instead (group 1 for quoted fields, group 2 otherwise)
                String severity = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
                if (!severity.equals("NORMAL")) {
                    tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                }
            }
            br.close();
            store.put(timestamp, tmp);
        }
        time = System.currentTimeMillis() - time;
        //System.out.println(time+"ms");
        //System.out.println(store);
    }

    public static void main(String[] args) throws IOException {
        //opname = args[0];
        for (int i = 0; i < 20; i++) {
            long time = System.currentTimeMillis();
            testRead("./SMF/data/analyser/archive");
            time = System.currentTimeMillis() - time;
            System.out.println("Time taken for " + i + " is " + time + "ms");
        }
    }
}
But I have another question now.
See the results while running on a small dataset:
**Time taken for 0 is 218ms**
**Time taken for 1 is 134ms**
**Time taken for 2 is 127ms**
Time taken for 3 is 98ms
Time taken for 4 is 90ms
Time taken for 5 is 77ms
Time taken for 6 is 71ms
Time taken for 7 is 72ms
Time taken for 8 is 62ms
Time taken for 9 is 57ms
Time taken for 10 is 53ms
Time taken for 11 is 58ms
Time taken for 12 is 59ms
Time taken for 13 is 46ms
Time taken for 14 is 44ms
Time taken for 15 is 45ms
Time taken for 16 is 53ms
Time taken for 17 is 45ms
Time taken for 18 is 61ms
Time taken for 19 is 42ms
For the first few iterations the time taken is higher, and then it drops. Why?
Thanks,
A few seconds are not enough for Java to get to its full speed because of JIT compilation. Java is optimized for servers running for hours (or years), not for tiny utilities taking just a few seconds.
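For what it's worth, a minimal warm-up harness looks like the sketch below (the iteration counts are arbitrary; testRead and the path are from the question): discard a few runs so the JIT has compiled the hot paths, then time the steady state.

    // Hypothetical benchmark skeleton, e.g. inside main().
    String path = "./SMF/data/analyser/archive";
    for (int i = 0; i < 10; i++) {
        testRead(path); // warm-up runs, results discarded
    }
    long start = System.nanoTime();
    testRead(path);     // measured steady-state run
    System.out.println((System.nanoTime() - start) / 1000000 + " ms");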
Concerning class loading: I guess you don't know that classes such as Pattern and Matcher, which you use indirectly through split, get loaded as needed (you can watch this with the JVM's -verbose:class flag).
static Map<String, Map<String, Integer>> store = new TreeMap<String, Map<String, Integer>>();
A Perl hash corresponds most closely to a Java HashMap, but you're using a TreeMap, which is slower. I guess this doesn't matter much here; just note that there are far more differences between the two programs than you think.
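If the TreeMap is only there for the sorted output, a sketch of the alternative (same store layout as in the question): count into a HashMap and sort the keys just once when printing, mirroring Perl's sort { $b <=> $a }.

    // Needs java.util.ArrayList, java.util.Collections, java.util.List.
    Map<String, Map<String, Integer>> store = new HashMap<String, Map<String, Integer>>();
    // ... counting happens as before ...
    List<String> times = new ArrayList<String>(store.keySet());
    Collections.sort(times, Collections.reverseOrder()); // descending, sorted once
    for (String time : times) {
        for (Map.Entry<String, Integer> e : store.get(time).entrySet()) {
            System.out.println(time + "," + e.getKey() + "," + e.getValue());
        }
    }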
int timediffinhr = (int) ((System.currentTimeMillis() - pathname.lastModified()) / 86400000);
You're reading the modification time for each file again and again, and you're doing it even for files whose names don't end with ".csv". That's surely not what find does: test the cheap name conditions first and stat the file only when they match.
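A sketch of the reordered filter, keeping your 10-day window and computing the cutoff once instead of per file:

    final long cutoff = System.currentTimeMillis() - 10L * 86400000L; // 10 days back, computed once
    FileFilter fileFilter = new FileFilter() {
        @Override
        public boolean accept(File pathname) {
            String name = pathname.getName();
            return name.endsWith(".csv")
                    && name.contains(opname)
                    && pathname.lastModified() >= cutoff; // stat only when the name matches
        }
    };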
String timestamp=mf.getName().split("~")[5].replace(".csv", "");
Unlike Perl, Java doesn't cache regexes. As far as I know a split on a single character is optimized separately, but otherwise you'd be much better off precompiling the pattern, using something like
private static final Pattern FILENAME_PATTERN =
        Pattern.compile("(?:[^~]*~){5}([^~]*)\\.csv");

Matcher m = FILENAME_PATTERN.matcher(mf.getName());
if (!m.matches()) {
    // handle an unexpected file name however you want
}
String timestamp = m.group(1);
BufferedReader br = new BufferedReader(new FileReader(mf), 1024 * 500);
This could be the culprit. By default, FileReader uses the platform encoding, which may be UTF-8; decoding UTF-8 is usually slower than ASCII or Latin-1. As far as I know, Perl works directly with bytes unless instructed otherwise.
A buffer of half a megabyte is insanely big for anything taking just a few seconds, especially when you allocate a new one for every file. Note that there's nothing like this in your Perl code.
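A sketch of the reader with an explicit single-byte charset and the default buffer size (ISO-8859-1 is an assumption here; use whatever encoding your files actually have):

    // Needs java.io.FileInputStream, java.io.InputStreamReader,
    // java.nio.charset.StandardCharsets (Java 7+).
    BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(mf), StandardCharsets.ISO_8859_1));

Decoding a fixed single-byte charset skips the per-character UTF-8 work, and the default 8 KB buffer is plenty for sequential reads of 1 MB files.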
All that said, Perl and find might indeed be faster for such short tasks.