How can I use Perl to process a file whose format is similar to Unicode?

Tags: file, encode, perl

I have a legacy program that generates a log file when it runs, and I now need to analyze that log file.

The file format is very strange, though (see the dumps below). When I open it in vi, it looks like a Unicode file, but it does not start with FFFE. After I opened it in Notepad, saved it, and opened it again, I found that Notepad had added FFFE. I can then use the command 'type log.txt > log1.txt' to convert the whole file to ANSI format, and afterwards use /TDD/ in Perl to search for what I need.

But I can't deal with the original file format directly.

Any comments or ideas would be much appreciated.

0000000: 5400 4400 4400 3e00 2000 4c00 6f00 6100  T.D.D.>. .L.o.a.

After saving it with Notepad:

0000000: fffe 5400 4400 4400 3e00 2000 4c00 6f00  ..T.D.D.>. .L.o.

open STDIN, "< log.txt";
while(<>)
{
  if (/TDD/)
  {
    # Add my logic.
  }
}

I have read the thread "How can I open a Unicode file with Perl?", which is very useful, but I still can't resolve my problem.

I can't add an answer, so I am editing my question instead.

Thanks, Michael. I tried your script but got the following error. My Perl version is 5.1 and the OS is Windows 2008.

* ascii
* ascii-ctrl
* iso-8859-1
* null
* utf-8-strict
* utf8
UTF-16:Unrecognised BOM 5400 at test.pl line 12.

Update

I tried UTF-16LE with the command:

perl.exe open.pl utf-16le utf-16 <my log file>.txt

but I still got an error like this:

UTF-16LE:Partial character at open.pl line 18, <$fh> line 1824.

I also tried utf-16be and got the same error.

If I use utf-16, I get this error:

UTF-16:Unrecognised BOM 5400 at open.pl line 18.

Line 18 of open.pl is "print while <$fh>;".

Any ideas?

Updated: 5/11/2011. Thank you all for your help. I resolved the problem. I found that the data in the log file is not UTF-16 after all, so I had to write a .NET project in Visual Studio. It reads the log file as UTF-16 and writes it to a new file as UTF-8. I then used a Perl script to parse that file and generate the result data. It works now.

So, if any of you know how to use Perl to read a file that contains a lot of garbage data, please tell me. Thank you very much.

For example, here is a sample of the garbage data:

tests.cpp:34)
਍吀䐀䐀㸀 䰀漀愀搀椀渀最 挀挀洀挀漀爀攀⸀搀氀

Opened in a hex viewer:

0000070: a88d e590 80e4 9080 e490 80e3 b880 e280  ................
0000080: 80e4 b080 e6bc 80e6 8480 e690 80e6 a480  ................
0000090: e6b8 80e6 9c80 e280 80e6 8c80 e68c 80e6  ................
00000a0: b480 e68c 80e6 bc80 e788 80e6 9480 e2b8  ................

asked May 06 '11 07:05

Orionpax

1 Answer

Your file seems to be encoded in UTF-16LE. The bytes Notepad adds are called a "byte order mark", or BOM for short.
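To see why plain UTF-16 is rejected for the original log, it can help to look at the first two bytes yourself. The following is only a minimal sketch and is not part of the script further down; sniff_utf16_bom is a hypothetical helper name. Encode's UTF-16 decoder needs a BOM to pick the byte order and dies when it finds none, which matches the "Unrecognised BOM 5400" error in the question, whereas UTF-16LE and UTF-16BE name the byte order explicitly and expect no BOM.

```perl
use strict;
use warnings;

# Hypothetical helper (an illustration, not from the original answer):
# report which UTF-16 BOM, if any, a file starts with.
sub sniff_utf16_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $got = read $fh, my $head, 2;    # first two raw bytes
    close $fh;
    return 'none' unless defined $got && $got == 2;
    return 'LE'   if $head eq "\xFF\xFE";   # what Notepad wrote in the question
    return 'BE'   if $head eq "\xFE\xFF";
    return 'none';                          # e.g. 54 00 ("T\0") in the original log
}

# Usage: perl sniff.pl log.txt
print sniff_utf16_bom($_), "  $_\n" for @ARGV;
```

If this prints "none", pass an explicit byte order such as utf16le; if it prints "LE" or "BE", plain utf16 should be able to work out the byte order from the BOM.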

Here's how you can read your file using Perl:

use strict;
use warnings;
use Encode;
# list loaded encodings
print STDERR map "* $_\n", Encode->encodings;
# read arguments
my $enc = shift || 'utf16';
die "no files :-(\n" unless @ARGV;
# process files
for ( @ARGV ) {
    open my $fh, "<:encoding($enc)", $_ or die "open $_: $!";
    print <$fh>;
    close $fh;
}
# loaded more encodings now
print STDERR map "* $_\n", Encode->encodings;

Proceed like this, taking care to supply the correct encoding for your file:

perl open.pl utf16 open.utf16be.txt
perl open.pl utf16 open.utf16le.txt
perl open.pl utf16le open.utf16le.nobom.txt
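Once the encoding is known, the /TDD/ search from the question can run on decoded text instead of raw bytes. This is a minimal sketch, assuming the whole file is BOM-less UTF-16LE with CRLF line endings (the question's later updates suggest the real log is not that clean); grep_utf16le is a hypothetical helper name. The :raw layer comes first so that, on Windows, the default CRLF translation is not applied to the raw bytes before decoding; :crlf is then re-applied on top of the decoded text.

```perl
use strict;
use warnings;

# Hypothetical helper (an illustration, not from the original answer):
# return the lines of a BOM-less UTF-16LE file that match a pattern.
sub grep_utf16le {
    my ($path, $pattern) = @_;
    # Layer order matters: raw bytes -> UTF-16LE decoding -> CRLF translation.
    open my $fh, '<:raw:encoding(UTF-16LE):crlf', $path
        or die "open $path: $!";
    my @hits = grep { /$pattern/ } <$fh>;
    close $fh;
    return @hits;
}

# Usage: perl grep16.pl log.txt
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print grep_utf16le($ARGV[0], qr/TDD/);
}
```

If the file contains stretches that are not valid UTF-16LE, this is likely to fail with a decoding error such as the "Partial character" message quoted in the question.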

Here's the revised version following tchrist's suggestions:

use strict;
use warnings;
use Encode;

# read arguments
my $enc_in  = shift || die 'pass file encoding as first parameter';
my $enc_out = shift || die 'pass STDOUT encoding as second parameter';
print STDERR "going to read files as encoded in: $enc_in\n";
print STDERR "going to write to standard output in: $enc_out\n";
die "no files :-(\n" unless @ARGV;

binmode STDOUT, ":encoding($enc_out)"; # latin1, cp1252, utf8, UTF-8

print STDERR map "* $_\n", Encode->encodings; # list loaded encodings

for ( @ARGV ) { # process files
    open my $fh, "<:encoding($enc_in)", $_ or die "open $_: $!";
    print while <$fh>;
    close $fh;
}

print STDERR map "* $_\n", Encode->encodings; # more encodings now
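The question's final update suggests the log mixes 8-bit text (such as "tests.cpp:34)") with UTF-16 text, which no single :encoding layer can decode. One pragmatic workaround is to skip decoding altogether: read raw bytes and delete the NUL bytes before matching. This is a rough sketch under loud assumptions, not a general solution: it assumes every character of interest is plain ASCII, so that its UTF-16 form is just the ASCII byte plus a NUL, and it will mangle anything else. grep_ignoring_nuls is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper (an illustration, not from the original answer):
# match against lines of a file that mixes ASCII and ASCII-range UTF-16.
sub grep_ignoring_nuls {
    my ($path, $pattern) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my @hits;
    # In both encodings a line ends at a 0x0A byte, so raw readline still
    # splits on line boundaries (ASSUMPTION: no other code unit contains 0x0A).
    while (my $line = <$fh>) {
        $line =~ tr/\0\r\n//d;    # drop NUL padding and line-ending bytes
        push @hits, $line if $line =~ $pattern;
    }
    close $fh;
    return @hits;
}

# Usage: perl grepraw.pl log.txt
if (@ARGV) {
    print "$_\n" for grep_ignoring_nuls($ARGV[0], qr/TDD/);
}
```

This trades correctness for robustness: it never dies on a partial character, but it is only safe for logs whose content stays within ASCII.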

answered Oct 13 '22 11:10

Lumi
