
How to parse freedict files (*.dict and *.index)

I was searching for free translation dictionaries. FreeDict (freedict.org) provides the ones I need, but I don't know how to parse the *.index and *.dict files. I also don't really know what to search for to find useful information about these formats.

The *.index files look like this:

00databasealphabet  QdGI    l
00databasedictfmt1121   B   b
00databaseinfo  c   5o
00databaseshort 6E  u
00databaseurl   6y  c
00databaseutf8  A   B
a   BHO M
a bad risc  BHa u
a bag of nerves BII 2
[...]

and the *.dict files:

[Lot of info stuff]
German-English FreeDict Dictionary ver. 0.3.4
Pipi machen /piːpiːmaxən/
 to pee; to piss
(Aktien) zusammenlegen /aktsiːəntsuːzamənleːgən/
 to merge (with)
[...]

I would be glad to see some example projects (preferably in Python, but Java, C, or C++ are also OK) to understand how to handle these files.

BloodyD asked Oct 01 '15 12:10



2 Answers

This answer comes late, but I hope it is useful for others like me.

JGoerzen wrote a library called dictdlib. Its source shows in detail how he parses the .index and .dict files: https://github.com/jgoerzen/dictdlib/blob/master/dictdlib.py

Xuân-Lợi Vũ answered Oct 24 '22 06:10



dictd considers the format of its .index and .dict[.dz] files private, reserving the right to change it in the future.

If you want to process the files directly anyway: the index contains the headwords and the .dict[.dz] file contains the definitions. The .dict file is optionally compressed with a modified gzip algorithm (dictzip) that provides almost random access, which plain gzip does not. Each line of the index contains three tab-separated columns:

  1. The headword for looking up the definition.
  2. The absolute byte position of the definition in the .dict[.dz] file, base64 encoded.
  3. The length of the definition in bytes, base64 encoded.
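The three columns above can be read with a few lines of Python. This is a minimal sketch, assuming that "base64 encoded" here means a number written in base-64 digits (most significant digit first, alphabet A-Z, a-z, 0-9, +, /) rather than the RFC 4648 byte encoding that Python's base64 module implements. The sample lines in the question are consistent with that reading: each offset plus its length gives the next offset. The names decode_number and parse_index_line are made up for illustration.

```python
# Digit alphabet assumed for dictd index numbers: position in the string = digit value.
B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
B64_VALUE = {ch: i for i, ch in enumerate(B64_ALPHABET)}


def decode_number(text):
    """Decode a base-64 number such as 'BHO', most significant digit first."""
    value = 0
    for ch in text:
        value = value * 64 + B64_VALUE[ch]
    return value


def parse_index_line(line):
    """Split one index line into (headword, byte offset, byte length)."""
    fields = line.rstrip("\n").split("\t")
    headword, offset, length = fields[:3]  # ignore any extra columns
    return headword, decode_number(offset), decode_number(length)


# Three lines taken from the question, with the columns separated by tabs.
sample = "a\tBHO\tM\na bad risc\tBHa\tu\na bag of nerves\tBII\t2\n"
for line in sample.splitlines():
    print(parse_index_line(line))
```

For the sample this gives offsets 4558, 4570 and 4616 with lengths 12, 46 and 54, so the entries sit back to back in the .dict file (4558 + 12 = 4570, 4570 + 46 = 4616).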

For more details see the dictd(8) man page (section Database Format), which you should have found in your research before asking your question. To process the headwords correctly, you'd have to consider encoding and character collation.
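Once the offset and length of an entry are known, fetching the definition is a seek followed by a read. The sketch below assumes the database is UTF-8 encoded (the 00databaseutf8 entry in the question's index suggests so) and uses the hypothetical helper names open_dict and read_definition. For .dict.dz files it falls back on Python's gzip module, which should work because dictzip output is still a valid gzip file, but this gives up the fast random access that dictzip was designed for.

```python
import gzip
import io


def open_dict(path):
    """Open a .dict or .dict.dz file for binary reading."""
    if path.endswith(".dz"):
        # Plain gzip decompression: correct, but seeks are slow on large files.
        return gzip.open(path, "rb")
    return open(path, "rb")


def read_definition(dict_file, offset, length, encoding="utf-8"):
    """Return the definition stored at a byte offset with a byte length."""
    dict_file.seek(offset)
    return dict_file.read(length).decode(encoding)


# In-memory stand-in for a .dict file, so the sketch runs without real data.
header = b"[info]\n"
entry = "Pipi machen /pi\u02d0pi\u02d0max\u0259n/\n to pee; to piss\n".encode("utf-8")
demo = io.BytesIO(header + entry)
print(read_definition(demo, len(header), len(entry)))
```

Offsets and lengths count bytes, not characters, so the file has to be opened in binary mode and decoded only after the slice has been read; the IPA characters in the demo entry take more than one byte each.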

It might be better to use an existing library to read dictd databases, but that depends on whether the library is any good (I have no experience with any).

Finally, as you noted yourself, XML is made exactly for easy processing. You could extract the headwords and translations using XPath, leaving out all the grammatical information, without having to write a parser yourself.
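As a rough illustration of the XPath route, here is a sketch using Python's xml.etree.ElementTree, which supports a limited XPath subset. The element layout (entry, form/orth, cit type="trans" with quote) and the TEI namespace are assumptions based on the TEI dictionary vocabulary, and the sample document is hand-written, so check them against the actual XML source of the dictionary you use. extract_translations is a made-up name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the TEI source files.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample in the assumed layout, not copied from a real dictionary.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry>
      <form><orth>Pipi machen</orth></form>
      <gramGrp><pos>v</pos></gramGrp>
      <sense>
        <cit type="trans"><quote>to pee</quote></cit>
        <cit type="trans"><quote>to piss</quote></cit>
      </sense>
    </entry>
  </body></text>
</TEI>"""


def extract_translations(root):
    """Map each headword to its list of translations, skipping grammar info."""
    result = {}
    for entry in root.iterfind(".//tei:entry", TEI_NS):
        orth = entry.find("tei:form/tei:orth", TEI_NS)
        if orth is None or not orth.text:
            continue
        quotes = [
            q.text
            for q in entry.iterfind(".//tei:cit[@type='trans']/tei:quote", TEI_NS)
            if q.text
        ]
        result.setdefault(orth.text, []).extend(quotes)
    return result


print(extract_translations(ET.fromstring(SAMPLE_TEI)))
```

A headword that appears in several entries ends up with one merged list of translations, which already hints at the mapping problem mentioned next.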

After getting this far, the next problem would be that there is no one-to-one mapping between words in different languages...

micha137 answered Oct 24 '22 04:10
