I'm trying to isolate the individual words in a PDF file, but when I read the file with the pdf-reader gem, the text arrives fragmented, like this:
"A lit"
"tle "
"bit of tex"
"t"
So I'm planning to stitch these back together using some heuristics. For this, I need a library that checks whether a given string is a valid English word, like:
"tree".is_english? # => true
"askdjfah".is_english? # => false
Does this exist? Ideally, it would also work with German text.
If not, is there a freely available dictionary online? I guess I could write my own tree structure to do the lookup if I had to.
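On the fallback idea, a custom tree structure may not be necessary: for a plain yes/no lookup, loading a one-word-per-line list into a Set is usually enough. The following is a minimal sketch, not an existing library. The WordList class, its word? method, and the default path /usr/share/dict/words are all assumptions made for illustration.

```ruby
require "set"

# Hypothetical helper: holds a word list in a Set for fast membership tests.
class WordList
  # lines: any enumerable of strings, one word per entry.
  def initialize(lines)
    @words = Set.new
    lines.each do |line|
      word = line.strip.downcase
      @words << word unless word.empty?
    end
  end

  # The default path is an assumption; adjust it for your system.
  def self.from_file(path = "/usr/share/dict/words")
    File.open(path) { |f| new(f.each_line) }
  end

  # Case-insensitive lookup, ignoring surrounding whitespace.
  def word?(string)
    @words.include?(string.strip.downcase)
  end
end

# Demo with an inline list so the example runs without a dictionary file.
english = WordList.new(%w[cat dog tree trees])
puts english.word?("tree")     # true
puts english.word?("askdjfah") # false
```

A German list could be loaded the same way with WordList.from_file and a different path. Lookups are case-insensitive here, which also covers capitalized German nouns.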
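If you have the Unix tool look installed on your system, you can easily check whether a string is a dictionary word. Note that look prints every entry that starts with the given string, so the check has to look for an exact match in the output rather than rely on the exit status alone. Example:
require "open3"

strings = %w{ cat dog tree trees treez }
strings.each do |string|
  # Pass the string as a separate argument so no shell is involved.
  output, _stderr, _status = Open3.capture3("look", string)
  # look matches prefixes, so require an exact, case-insensitive match.
  if output.each_line.any? { |line| line.chomp.casecmp?(string) }
    puts "#{string} is a word"
  else
    puts "#{string} is not a word"
  end
end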
Here's more information on look: http://docstore.mik.ua/orelly/unix/upt/ch27_18.htm
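Since look reads the system word list (usually /usr/share/dict/words, or /usr/dict/words on older systems), I think it's possible to use a German word list as well. On Debian, look for the wgerman package. look also accepts a word-list file as an optional argument after the string, so you can point it at a specific list. I'm not sure how to install a German list on other systems.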
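Once some word check is available, the joining heuristic from the question could be sketched roughly as follows. This is only one possible approach, not something pdf-reader provides: join_fragments and the known_word callable are hypothetical names, and the tiny inline word set stands in for a real dictionary.

```ruby
require "set"

# Hypothetical heuristic: glue two fragments without a space when the
# partial words on either side of the boundary combine into a known word.
# known_word is any callable that takes a string and returns true or false.
def join_fragments(fragments, known_word)
  fragments.reduce("") do |text, fragment|
    next text if fragment.empty?
    next fragment if text.empty?
    # Whitespace at the boundary means the word break is already explicit.
    next text + fragment if text =~ /\s\z/ || fragment =~ /\A\s/

    left  = text[/\S+\z/]     # trailing partial word of the text so far
    right = fragment[/\A\S+/] # leading partial word of the new fragment
    if known_word.call(left + right)
      text + fragment
    else
      "#{text} #{fragment}"
    end
  end
end

# Stand-in dictionary for the demo; a real word list would go here.
words = Set.new(%w[a little bit of text])
known_word = ->(w) { words.include?(w.downcase) }

fragments = ["A lit", "tle ", "bit of tex", "t"]
puts join_fragments(fragments, known_word)
```

This sketch ignores punctuation and will wrongly merge neighbours that happen to form a word together (for example "in" followed by "to"), so it would need tuning against real extracted text.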