
Unicode-aware strings(1) program

Tags: string, unicode

Does anybody have a code sample for a Unicode-aware strings program? The programming language doesn't matter. I want something that essentially does the same thing as the Unix command "strings", but that also works on Unicode text (UTF-16 or UTF-8), pulling out runs of English-language characters and punctuation. (I only care about English characters, not any other alphabet.)

Thanks!

Evan asked Feb 23 '09 15:02


2 Answers

Do you just want to use it, or do you for some reason insist on the code?

On my Debian system, the strings command seems to do this out of the box. See this excerpt from the man page:

  --encoding=encoding
       Select the character encoding of the strings that are to be found.  Possible values for encoding are: s = single-7-bit-byte characters (ASCII, ISO  8859,
       etc.,  default),  S  = single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit bigendian, L = 32-bit littleendian. Useful
       for finding wide character strings.
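
To make those option letters concrete, here is a small Python illustration of how the same two ASCII characters are laid out in memory under each width and byte order. The dictionary labels are only descriptive names chosen for this example; they are not part of strings itself.

```python
# Illustration only: byte layout of the ASCII text "Hi" in the encodings
# that the option letters s/S, b, l, B and L refer to.
text = "Hi"
layouts = {
    "s/S (single byte)":        text.encode("ascii"),
    "b (16-bit big-endian)":    text.encode("utf-16-be"),
    "l (16-bit little-endian)": text.encode("utf-16-le"),
    "B (32-bit big-endian)":    text.encode("utf-32-be"),
    "L (32-bit little-endian)": text.encode("utf-32-le"),
}
for name, raw in layouts.items():
    # bytes.hex(sep) needs Python 3.8 or newer
    print(f"{name:26} {raw.hex(' ')}")
```

For English text, the wide encodings are just the ASCII byte padded with zero bytes, which is why a scanner for them only has to look for that padding pattern.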

Edit: OK. I don't know C# so this may be a bit hairy, but basically, you need to search for sequences of alternating zeros and English characters.

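byte b;
int i = 0;                     // length of the current run, in characters
while (!endOfInput()) {
  b = getNextByte();
LoopBegin:
  if (!isEnglish(b)) {
    if (i > 0) { /* report successful match of length i */ }
    i = 0;
    continue;
  }
  if (endOfInput()) break;     // lone trailing byte, not a full 16-bit unit
  if ((b = getNextByte()) != 0) {
    // high byte is not zero: the run ends here, so report and reset it,
    // then re-examine b as the possible start of a new run
    if (i > 0) { /* report successful match of length i */ }
    i = 0;
    goto LoopBegin;
  }
  i++; // found another character
}
if (i > 0) { /* report the run still pending at end of input */ }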

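This should work for little-endian UTF-16. For big-endian, the zero byte comes before the character byte, so the two tests swap places.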
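
The same idea can be sketched in Python with a regular expression over the raw bytes. This is a minimal sketch, not a replacement for strings: the function name utf16_strings, the min_len default of 4 and the choice of "printable ASCII plus tab" as the definition of English characters are all assumptions made for this example.

```python
import re

def utf16_strings(data: bytes, min_len: int = 4, little_endian: bool = True):
    """Yield (offset, text) for runs of printable ASCII stored as UTF-16.

    Hypothetical helper: "English" is taken to mean bytes 0x20-0x7e plus tab.
    """
    printable = rb"[\x20-\x7e\t]"
    # One UTF-16 code unit: character byte and zero byte, order set by endianness.
    unit = printable + rb"\x00" if little_endian else rb"\x00" + printable
    pattern = re.compile(rb"(?:" + unit + rb"){" + str(min_len).encode() + rb",}")
    codec = "utf-16-le" if little_endian else "utf-16-be"
    for m in pattern.finditer(data):
        yield m.start(), m.group().decode(codec)

sample = b"\x01\x02" + "Hello, world".encode("utf-16-le") + b"\xff\xfe"
for offset, text in utf16_strings(sample):
    print(offset, text)   # prints: 2 Hello, world
```

Because the regex is tried at every byte offset, strings that start on an odd address are found as well. The price is some ambiguity: little-endian data scanned with little_endian=False can yield a shifted match that loses its first character.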

jpalecek answered Nov 15 '22 18:11


I had a similar problem and tried "strings -e ...", but I only found options for fixed-width character encodings. (UTF-8 is a variable-width encoding.)

Remember that, by default, characters outside ASCII need extra strings options. This affects almost all non-English strings.

Nevertheless, the output of "-e S" (single 8-bit characters) includes UTF-8 characters.

I wrote a very simple (opinionated) Perl script that applies "strings -e S ... | iconv ..." to the input files.

I believe it is easy to tune it for specific restrictions. Usage: utf8strings [options] file*

#!/usr/bin/perl -s

our ($all,$windows,$enc);   ## use -all to ignore the "3-letter word" restriction
use strict;
use utf8::all;

$enc = "ms-ansi" if     $windows;  ## -windows selects the ms-ansi input encoding
$enc = "utf8"    unless $enc    ;  ## default encoding=utf8
my $iconv = "iconv -c -f $enc -t utf8 |";

for (@ARGV){ s/(.*)/strings -e S '$1'| $iconv/;}

my $word=qr/[a-zçáéíóúâêôàèìòùüãõ]{3}/i;   # adapt this to your case

while(<>){
   # next if /regular expressions for common garbage/; 
   print    if ($all or /$word/);
}

In some situations, this approach produces some extra garbage.
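
For readers without strings or iconv at hand, the pipeline above can be approximated in pure Python. This is a rough sketch under stated assumptions, not a port of the script: the names utf8_strings, RUN and WORD are made up for this example, the minimum run length of 4 mirrors the default of strings, and errors="ignore" stands in for "iconv -c".

```python
import re

# Runs of 4+ bytes that are printable ASCII, tab, or any byte >= 0x80
# (an approximation of what "strings -e S" keeps).
RUN = re.compile(rb"[\x20-\x7e\t\x80-\xff]{4,}")
# Same idea as the "3-letter word" filter in the script; adapt to your case.
WORD = re.compile(r"[a-zçáéíóúâêôàèìòùüãõ]{3}", re.IGNORECASE)

def utf8_strings(data: bytes, keep_all: bool = False):
    """Yield decoded runs; keep_all=True skips the word filter (like -all)."""
    for m in RUN.finditer(data):
        # errors="ignore" silently drops invalid UTF-8 sequences
        text = m.group().decode("utf-8", errors="ignore")
        if text and (keep_all or WORD.search(text)):
            yield text

sample = b"\x00\x01" + "ação válida".encode("utf-8") + b"\x00ab12\x00"
print(list(utf8_strings(sample)))   # prints: ['ação válida']
```

As with the Perl version, binary data that happens to form valid UTF-8 can still slip through, so the word filter is what keeps most of the noise out.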

JJoao answered Nov 15 '22 19:11