How to gather characters usage statistics in text file using Unix commands?

2 Answers

This should do what you're looking for:

cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c

The premise is that the sed puts each character in the file onto a line by itself, then the usual sort | uniq -c sequence strips out all but one of each unique character that occurs, and provides counts of how many times each occurred.

Also, you could append | sort -n to the end of the whole sequence to sort the output by how many times each character occurred. Example:

$ echo hello |  sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n
  1 
  1 e
  1 h
  1 o
  2 l

125

answered Sep 18 '22 06:09

Kamal

This will do it:

#!/usr/bin/perl -n
#
# charcounts - show how many times each code point is used
# Tom Christiansen <[email protected]>

use open ":utf8";

++$seen{ ord() } for split //;

END {
    for my $cp (sort {$seen{$b} <=> $seen{$a}} keys %seen) {
        printf "%04X %d\n", $cp, $seen{$cp};
    }
}

Run on itself, that program produces:

$ charcounts /tmp/charcounts | head
0020 46
0065 20
0073 18
006E 15
000A 14
006F 12
0072 11
0074 10
0063 9
0070 9

If you want the literal character and/or name of the character, too, that’s easy to add.

If you want something more sophisticated, this program figures out characters by Unicode property. It may be enough for your purposes, and if not, you should be able to adapt it.

#!/usr/bin/perl
#
# unicats - show character distribution by Unicode character property
# Tom Christiansen <[email protected]>

use strict;
use warnings qw<FATAL all>;

use open ":utf8";

my %cats;

our %Prop_Table;
build_prop_table();

if (@ARGV == 0 && -t STDIN) {
    warn <<"END_WARNING";
$0: reading UTF-8 character data directly from your tty
\tSo please type stuff...
\t and then hit your tty's EOF sequence when done.
END_WARNING

} 

while (<>) {
    for (split(//)) {

        $cats{Total}++;

        if (/\p{ASCII}/) { $cats{ASCII}++   } 
        else             { $cats{Unicode}++ } 

        my $gcat   = get_general_category($_);
        $cats{$gcat}++;

        my $subcat = get_general_subcategory($_);
        $cats{$subcat}++;

    } 
} 

my $width = length $cats{Total};

my $mask = "%*d %s\n";

for my $cat(qw< Total ASCII Unicode >) { 
    printf $mask, $width => $cats{$cat} || 0, $cat; 
}
print "\n";

my @catnames = qw[
    L Lu Ll Lt Lm Lo
    N Nd Nl No
    S Sm Sc Sk So
    P Pc Pd Ps Pe Pi Pf Po
    M Mn Mc Me
    Z Zs Zl Zp
    C Cc Cf Cs Co Cn
];

#for my $cat (sort keys %cats) {
for my $cat (@catnames) {
    next if length($cat) > 2;
    next unless $cats{$cat};

    my $prop = length($cat) == 1 
                 ? ( " " . q<\p> .   $cat          )
                 : (       q<\p> . "{$cat}" . "\t" )
             ;

    my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat});

    printf $mask, $width => $cats{$cat}, $desc;
} 

exit;

sub get_general_category {
    my $_ = shift();
    return "L" if /\pL/;
    return "S" if /\pS/;
    return "P" if /\pP/;
    return "N" if /\pN/;
    return "C" if /\pC/;
    return "M" if /\pM/;
    return "Z" if /\pZ/;

    die "not reached one: $_";
} 

sub get_general_subcategory {
    my $_ = shift();

    return "Lu" if /\p{Lu}/;
    return "Ll" if /\p{Ll}/;
    return "Lt" if /\p{Lt}/;
    return "Lm" if /\p{Lm}/;
    return "Lo" if /\p{Lo}/;

    return "Mn" if /\p{Mn}/;
    return "Mc" if /\p{Mc}/;
    return "Me" if /\p{Me}/;

    return "Nd" if /\p{Nd}/;
    return "Nl" if /\p{Nl}/;
    return "No" if /\p{No}/;

    return "Pc" if /\p{Pc}/;
    return "Pd" if /\p{Pd}/;
    return "Ps" if /\p{Ps}/;
    return "Pe" if /\p{Pe}/;
    return "Pi" if /\p{Pi}/;
    return "Pf" if /\p{Pf}/;
    return "Po" if /\p{Po}/;

    return "Sm" if /\p{Sm}/;
    return "Sc" if /\p{Sc}/;
    return "Sk" if /\p{Sk}/;
    return "So" if /\p{So}/;

    return "Zs" if /\p{Zs}/;
    return "Zl" if /\p{Zl}/;
    return "Zp" if /\p{Zp}/;

    return "Cc" if /\p{Cc}/;
    return "Cf" if /\p{Cf}/;
    return "Cs" if /\p{Cs}/;
    return "Co" if /\p{Co}/;
    return "Cn" if /\p{Cn}/;

    die "not reached two: <$_> " . sprintf("U+%vX", $_);

}

sub build_prop_table { 

    for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) {

       L           Letter
       Lu          Uppercase_Letter
       Ll          Lowercase_Letter
       Lt          Titlecase_Letter
       Lm          Modifier_Letter
       Lo          Other_Letter

       M           Mark  (combining characters, including diacritics)
       Mn          Nonspacing_Mark
       Mc          Spacing_Mark
       Me          Enclosing_Mark

       N           Number
       Nd          Decimal_Number (also Digit)
       Nl          Letter_Number
       No          Other_Number

       P           Punctuation
       Pc          Connector_Punctuation
       Pd          Dash_Punctuation
       Ps          Open_Punctuation
       Pe          Close_Punctuation
       Pi          Initial_Punctuation (may behave like Ps or Pe depending on usage)
       Pf          Final_Punctuation (may behave like Ps or Pe depending on usage)
       Po          Other_Punctuation

       S           Symbol
       Sm          Math_Symbol
       Sc          Currency_Symbol
       Sk          Modifier_Symbol
       So          Other_Symbol

       Z           Separator
       Zs          Space_Separator
       Zl          Line_Separator
       Zp          Paragraph_Separator

       C           Other (means not L/N/P/S/Z)
       Cc          Control (also Cntrl)
       Cf          Format
       Cs          Surrogate   (not usable)
       Co          Private_Use
       Cn          Unassigned

End_of_Property_List

            my($short_prop, $long_prop) = $line =~ m{ 
                \b 
                 ( \p{Lu}  \p{Ll}   ? ) 
                \s + 
                 ( \p{Lu} [\p{L&}_] + )
                \b
            }x;

            $Prop_Table{$short_prop} = $long_prop;

    }

}

For example:

$ unicats book.txt
2357232 Total
2357199 ASCII
     33 Unicode

1604949  \pL   Letter
  74455 \p{Lu}   Uppercase_Letter
1530485 \p{Ll}   Lowercase_Letter
      9 \p{Lo}   Other_Letter
  10676  \pN   Number
  10676 \p{Nd}   Decimal_Number
  19679  \pS   Symbol
  10705 \p{Sm}   Math_Symbol
   8365 \p{Sc}   Currency_Symbol
    603 \p{Sk}   Modifier_Symbol
      6 \p{So}   Other_Symbol
 111899  \pP   Punctuation
   2996 \p{Pc}   Connector_Punctuation
   6145 \p{Pd}   Dash_Punctuation
  11392 \p{Ps}   Open_Punctuation
  11371 \p{Pe}   Close_Punctuation
  79995 \p{Po}   Other_Punctuation
 548529  \pZ   Separator
 548529 \p{Zs}   Space_Separator
  61500  \pC   Other
  61500 \p{Cc}   Control

answered Sep 17 '22 06:09

tchrist

Related questions
                            
                                Why doesn't "yes | head" hang?
                            
                                Reducing JSON with jq
                            
                                Extracting just a phone number from a file
                            
                                Installing VIM in FreeBSD
                            
                                How to set memory limit to a process in Golang
                            
                                NPM: make package.json work under both Unix (Mac OSX) and Windows
                            
                                How to hide a password from a prompted osascript dialogue from a Mac/UNIX Shell Script
                            
                                Terminal - current directory symbol [closed]
                            
                                Shell script to copy and prepend folder name to files from multiple subdirectories
                            
                                Remove ^@ Characters in a Unix File
                            
                                How to output single character in C with write?
                            
                                How to remove newlines inside csv cells using regex/terminal tools?
                            
                                tee command not working as expected (with read and echo)
                            
                                Why is grep so slow and memory intensive with -w (--word-regexp) flag?
                            
                                Combine lines with matching first field
                            
                                Docker build image - glob error { Error: EPERM: operation not permitted, scandir
                            
                                emulate unix 'cut' using standard windows command line/batch commands
                            
                                How to execute a Unix shell script from Windows
                            
                                Get instruction pointer of running application on Unix
                            
                                Linux/UNIX install data files

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to gather characters usage statistics in text file using Unix commands?

Tags:

text

unix

command-line

statistics

ocr

Hali

People also ask

2 Answers

Kamal

tchrist

Recent Activity

Donate For Us