Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unicode character usage statistics [closed]

Tags:

unicode

I am looking for some statistical data on the usage of Unicode characters in textual documents (with any markup). Googling brought no results.

Background: I am currently developing a finite state machine-based text processing tool. Statistical data on characters might help searching for the right transitions. For instance latin characters are probably most used so it might make sense to check for those first.

Did anyone by chance gathered or saw such statistics?

(I'm not focused on specific languages or locales. Think general-purpose parser like an XML parser.)

like image 762
lexicore Avatar asked Mar 04 '14 22:03

lexicore


1 Answers

To sum up current findings and ideas:

  • Tom Christiansen gathered such statistics for PubMed Open Access Corpus (see this question). I have asked if he could share these statistics, waiting for the answer.
  • As @Boldewyn and @nwellnhof suggested, I could run the analysis of the complete Wikipedia dump or CommonCrawl data. I think these are good suggestions, I'll probably go with the CommonCrawl.

So sorry, this is not an answer, but a good research direction.

UPDATE: I have written a small Hadoop job and ran it on one of the CommonCrawl segments. I have posted my results in a spreadsheet here. Below are the first 50 characters:

0x000020    14627262     
0x000065    7492745 e
0x000061    5144406 a
0x000069    4791953 i
0x00006f    4717551 o
0x000074    4566615 t
0x00006e    4296796 n
0x000072    4293069 r
0x000073    4025542 s
0x00000a    3140215 
0x00006c    2841723 l
0x000064    2132449 d
0x000063    2026755 c
0x000075    1927266 u
0x000068    1793540 h
0x00006d    1628606 m
0x00fffd    1579150 
0x000067    1279990 g
0x000070    1277983 p
0x000066    997775  f
0x000079    949434  y
0x000062    851830  b
0x00002e    844102  .
0x000030    822410  0
0x0000a0    797309  
0x000053    718313  S
0x000076    691534  v
0x000077    682472  w
0x000031    648470  1
0x000041    624279  @
0x00006b    555419  k
0x000032    548220  2
0x00002c    513342  ,
0x00002d    510054  -
0x000043    498244  C
0x000054    495323  T
0x000045    455061  E
0x00004d    426545  M
0x000050    423790  P
0x000049    405276  I
0x000052    393218  R
0x000044    381975  D
0x00004c    365834  L
0x000042    353770  B
0x000033    334689  E
0x00004e    325299  N
0x000029    302497  /
0x000028    301057  (
0x000035    298087  5
0x000046    295148  F

To be honest, I have no idea if these results are representative. As I said, I only analysed one segment. Looks quite plausible for me. One can also easily spot that the markup is already stripped off - so the distribution is not directly suitable for my XML parser. But it gives valuable hints on which character ranges to check first.

like image 126
lexicore Avatar answered Sep 23 '22 23:09

lexicore