How to parse freedict files (*.dict and *.index)

Question

I was searching for free translation dictionaries. FreeDict (freedict.org) provides the ones I need, but I don't know how to parse the *.index and *.dict files. I also don't really know what to search for to find useful information about these formats.

The *.index files look like this:

00databasealphabet  QdGI    l
00databasedictfmt1121   B   b
00databaseinfo  c   5o
00databaseshort 6E  u
00databaseurl   6y  c
00databaseutf8  A   B
a   BHO M
a bad risc  BHa u
a bag of nerves BII 2
[...]

and the *.dict files:

[Lot of info stuff]
German-English FreeDict Dictionary ver. 0.3.4
Pipi machen /piːpiːmaxən/
 to pee; to piss
(Aktien) zusammenlegen /aktsiːəntsuːzamənleːgən/
 to merge (with)
[...]

I would be glad to see some example projects (preferably in Python, but Java, C, or C++ are also OK) to understand how to handle these files.

Xuân-Lợi Vũ · Accepted Answer

This answer is late, but I hope it is useful for others like me.

jgoerzen wrote a library called dictdlib. Its source shows in detail how it parses .index and .dict files: https://github.com/jgoerzen/dictdlib/blob/master/dictdlib.py

micha137 · Answer

dictd considers the format of its .index and .dict[.dz] files private, reserving the right to change it in the future.

If you want to process the files directly anyway: the index contains the headwords, and the .dict[.dz] file contains the definitions. The .dict file is optionally compressed with dictzip (the .dz suffix), a gzip-compatible format that adds a chunk table to provide almost random access, which plain gzip does not. The index contains three tab-separated columns per line:

The headword for looking up the definition.
The absolute byte position of the definition in the uncompressed .dict file, written as a number in base64 digits.
The length of the definition in bytes, written the same way.
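The three columns above can be decoded with a few lines of Python. This is a minimal sketch, not dictd's own code: the function names `decode_number` and `parse_index_line` are made up for illustration, and the digit alphabet (A-Z, a-z, 0-9, +, / with the most significant digit first) is assumed from the dictd man page.

```python
import string

# Assumed digit alphabet: A-Z = 0..25, a-z = 26..51, 0-9 = 52..61, + = 62, / = 63
B64_DIGITS = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
B64_VALUE = {ch: i for i, ch in enumerate(B64_DIGITS)}


def decode_number(text):
    """Decode an index number written in base-64 digits, most significant first."""
    value = 0
    for ch in text:
        value = value * 64 + B64_VALUE[ch]
    return value


def parse_index_line(line):
    """Split one tab-separated index line into (headword, offset, length)."""
    fields = line.rstrip("\n").split("\t")
    headword, offset, length = fields[0], fields[1], fields[2]
    return headword, decode_number(offset), decode_number(length)


print(parse_index_line("a bad risc\tBHa\tu\n"))
```

Applied to the sample index in the question, "a" decodes to offset 4558 and length 12, and the next entry "a bad risc" starts at 4570, which is exactly 4558 + 12. That consistency suggests the assumed alphabet is the right one for these files.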

For more details, see the dictd(8) man page (section Database Format), which you should have found in your research before asking your question. To process the headwords correctly, you would also have to consider encoding and character collation.

It might be better to use an existing library to read dictd databases, but that really depends on whether the library is good (I have no experience here).
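Once an offset and a length are known, fetching a definition is a seek and a read. The sketch below assumes the database is UTF-8 (the `00databaseutf8` entry in the question's index hints at that); `read_definition` is a hypothetical helper, and the demo data is made up rather than taken from a real FreeDict file. For .dz files it falls back on Python's `gzip` module, which works because dictzip output is gzip-compatible, but it decompresses sequentially instead of using the chunk table.

```python
import gzip
import os
import tempfile


def read_definition(dict_path, offset, length, encoding="utf-8"):
    """Return the text stored at byte `offset`, `length` bytes long."""
    # gzip.open reads a .dz file like any gzip file (no random access).
    opener = gzip.open if dict_path.endswith(".dz") else open
    with opener(dict_path, "rb") as f:
        f.seek(offset)  # offsets refer to the uncompressed data
        raw = f.read(length)
    return raw.decode(encoding)


# Demo with a tiny made-up database: two entries back to back.
first = "Haus /haʊs/\n house\n".encode("utf-8")
second = "Baum /baʊm/\n tree\n".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "sample.dict")
    with open(plain_path, "wb") as f:
        f.write(first + second)
    from_plain = read_definition(plain_path, len(first), len(second))

    gz_path = os.path.join(tmp, "sample.dict.dz")
    with gzip.open(gz_path, "wb") as f:
        f.write(first + second)
    from_gz = read_definition(gz_path, len(first), len(second))

print(from_plain)
```

Note that offsets and lengths count bytes, not characters, so the file has to be opened in binary mode and decoded only after the slice has been read.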

Finally, as you noted yourself, XML is made exactly for easy processing. You could extract the headwords and translations using XPath, leaving out all the grammatical information, with no need to write a parser of your own.

After getting this far, the next problem would be that there is no one-to-one mapping between words in different languages...
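As a rough illustration of the XPath route, here is a sketch using Python's standard `xml.etree.ElementTree`, which supports a subset of XPath. It assumes the XML in question is a TEI file in which each `entry` holds the headword in `form/orth` and translations in `cit type="trans"` / `quote`; the sample document is made up, so check the element names against the actual file before relying on this.

```python
import xml.etree.ElementTree as ET

# TEI namespace; assumed to be the one used by the dictionary file.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up sample entry, not copied from a real dictionary.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<entry>
  <form><orth>Haus</orth><pron>haʊs</pron></form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense><cit type="trans"><quote>house</quote></cit></sense>
</entry>
</body></text></TEI>"""


def extract_pairs(root):
    """Return a list of (headword, [translations]), skipping grammar info."""
    pairs = []
    for entry in root.iterfind(".//tei:entry", TEI_NS):
        headword = entry.findtext("tei:form/tei:orth", default="", namespaces=TEI_NS)
        translations = [
            quote.text
            for quote in entry.iterfind(".//tei:cit[@type='trans']/tei:quote", TEI_NS)
        ]
        pairs.append((headword, translations))
    return pairs


pairs = extract_pairs(ET.fromstring(SAMPLE))
print(pairs)
```

Returning a list of translations per headword, rather than a single string, leaves room for entries that have several translations.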


Tags:

java

python

translation

language-translation

BloodyD
