I have got a text file created using OCR software - about one megabyte in size. Some uncommon characters appears all over document and most of them are OCR errors.
I would like find all characters used in document to easily spot errors (like UNIQ
command but for characters, not for lines).
I am on Ubuntu. What Unix command I should use to display all characters used in text file?
The most easiest way to count the number of lines, words, and characters in text file is to use the Linux command “wc” in terminal. The command “wc” basically means “word count” and with different optional parameters one can use it to count the number of lines, words, and characters in a text file.
Use the wc command to count the number of lines, words, and bytes in the files specified by the File parameter. If a file is not specified for the File parameter, standard input is used. The command writes the results to standard output and keeps a total count for all named files.
Approach: The characters can be counted easily by reading the characters in the file using getc() method.
This should do what you're looking for:
cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c
The premise is that the sed
puts each character in the file onto a line by itself, then the usual sort | uniq -c
sequence strips out all but one of each unique character that occurs, and provides counts of how many times each occurred.
Also, you could append | sort -n
to the end of the whole sequence to sort the output by how many times each character occurred. Example:
$ echo hello | sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n
1
1 e
1 h
1 o
2 l
This will do it:
#!/usr/bin/perl -n
#
# charcounts - show how many times each code point is used
# Tom Christiansen <[email protected]>
use open ":utf8";
++$seen{ ord() } for split //;
END {
for my $cp (sort {$seen{$b} <=> $seen{$a}} keys %seen) {
printf "%04X %d\n", $cp, $seen{$cp};
}
}
Run on itself, that program produces:
$ charcounts /tmp/charcounts | head
0020 46
0065 20
0073 18
006E 15
000A 14
006F 12
0072 11
0074 10
0063 9
0070 9
If you want the literal character and/or name of the character, too, that’s easy to add.
If you want something more sophisticated, this program figures out characters by Unicode property. It may be enough for your purposes, and if not, you should be able to adapt it.
#!/usr/bin/perl
#
# unicats - show character distribution by Unicode character property
# Tom Christiansen <[email protected]>
use strict;
use warnings qw<FATAL all>;
use open ":utf8";
my %cats;
our %Prop_Table;
build_prop_table();
if (@ARGV == 0 && -t STDIN) {
warn <<"END_WARNING";
$0: reading UTF-8 character data directly from your tty
\tSo please type stuff...
\t and then hit your tty's EOF sequence when done.
END_WARNING
}
while (<>) {
for (split(//)) {
$cats{Total}++;
if (/\p{ASCII}/) { $cats{ASCII}++ }
else { $cats{Unicode}++ }
my $gcat = get_general_category($_);
$cats{$gcat}++;
my $subcat = get_general_subcategory($_);
$cats{$subcat}++;
}
}
my $width = length $cats{Total};
my $mask = "%*d %s\n";
for my $cat(qw< Total ASCII Unicode >) {
printf $mask, $width => $cats{$cat} || 0, $cat;
}
print "\n";
my @catnames = qw[
L Lu Ll Lt Lm Lo
N Nd Nl No
S Sm Sc Sk So
P Pc Pd Ps Pe Pi Pf Po
M Mn Mc Me
Z Zs Zl Zp
C Cc Cf Cs Co Cn
];
#for my $cat (sort keys %cats) {
for my $cat (@catnames) {
next if length($cat) > 2;
next unless $cats{$cat};
my $prop = length($cat) == 1
? ( " " . q<\p> . $cat )
: ( q<\p> . "{$cat}" . "\t" )
;
my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat});
printf $mask, $width => $cats{$cat}, $desc;
}
exit;
sub get_general_category {
my $_ = shift();
return "L" if /\pL/;
return "S" if /\pS/;
return "P" if /\pP/;
return "N" if /\pN/;
return "C" if /\pC/;
return "M" if /\pM/;
return "Z" if /\pZ/;
die "not reached one: $_";
}
sub get_general_subcategory {
my $_ = shift();
return "Lu" if /\p{Lu}/;
return "Ll" if /\p{Ll}/;
return "Lt" if /\p{Lt}/;
return "Lm" if /\p{Lm}/;
return "Lo" if /\p{Lo}/;
return "Mn" if /\p{Mn}/;
return "Mc" if /\p{Mc}/;
return "Me" if /\p{Me}/;
return "Nd" if /\p{Nd}/;
return "Nl" if /\p{Nl}/;
return "No" if /\p{No}/;
return "Pc" if /\p{Pc}/;
return "Pd" if /\p{Pd}/;
return "Ps" if /\p{Ps}/;
return "Pe" if /\p{Pe}/;
return "Pi" if /\p{Pi}/;
return "Pf" if /\p{Pf}/;
return "Po" if /\p{Po}/;
return "Sm" if /\p{Sm}/;
return "Sc" if /\p{Sc}/;
return "Sk" if /\p{Sk}/;
return "So" if /\p{So}/;
return "Zs" if /\p{Zs}/;
return "Zl" if /\p{Zl}/;
return "Zp" if /\p{Zp}/;
return "Cc" if /\p{Cc}/;
return "Cf" if /\p{Cf}/;
return "Cs" if /\p{Cs}/;
return "Co" if /\p{Co}/;
return "Cn" if /\p{Cn}/;
die "not reached two: <$_> " . sprintf("U+%vX", $_);
}
sub build_prop_table {
for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) {
L Letter
Lu Uppercase_Letter
Ll Lowercase_Letter
Lt Titlecase_Letter
Lm Modifier_Letter
Lo Other_Letter
M Mark (combining characters, including diacritics)
Mn Nonspacing_Mark
Mc Spacing_Mark
Me Enclosing_Mark
N Number
Nd Decimal_Number (also Digit)
Nl Letter_Number
No Other_Number
P Punctuation
Pc Connector_Punctuation
Pd Dash_Punctuation
Ps Open_Punctuation
Pe Close_Punctuation
Pi Initial_Punctuation (may behave like Ps or Pe depending on usage)
Pf Final_Punctuation (may behave like Ps or Pe depending on usage)
Po Other_Punctuation
S Symbol
Sm Math_Symbol
Sc Currency_Symbol
Sk Modifier_Symbol
So Other_Symbol
Z Separator
Zs Space_Separator
Zl Line_Separator
Zp Paragraph_Separator
C Other (means not L/N/P/S/Z)
Cc Control (also Cntrl)
Cf Format
Cs Surrogate (not usable)
Co Private_Use
Cn Unassigned
End_of_Property_List
my($short_prop, $long_prop) = $line =~ m{
\b
( \p{Lu} \p{Ll} ? )
\s +
( \p{Lu} [\p{L&}_] + )
\b
}x;
$Prop_Table{$short_prop} = $long_prop;
}
}
For example:
$ unicats book.txt
2357232 Total
2357199 ASCII
33 Unicode
1604949 \pL Letter
74455 \p{Lu} Uppercase_Letter
1530485 \p{Ll} Lowercase_Letter
9 \p{Lo} Other_Letter
10676 \pN Number
10676 \p{Nd} Decimal_Number
19679 \pS Symbol
10705 \p{Sm} Math_Symbol
8365 \p{Sc} Currency_Symbol
603 \p{Sk} Modifier_Symbol
6 \p{So} Other_Symbol
111899 \pP Punctuation
2996 \p{Pc} Connector_Punctuation
6145 \p{Pd} Dash_Punctuation
11392 \p{Ps} Open_Punctuation
11371 \p{Pe} Close_Punctuation
79995 \p{Po} Other_Punctuation
548529 \pZ Separator
548529 \p{Zs} Space_Separator
61500 \pC Other
61500 \p{Cc} Control
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With