I have come up with a method to determine encoding (or at least a guess at it) for a file that I pass in:
def encoding_type(file_path)
  File.read(file_path).encoding.name
end
The problem with this is that I have a file that is 15GB, so the entire file is being read into memory.
Is there any way to accomplish what I am doing in this method without reading the entire file into memory?
The file --mime command will return the MIME type and encoding of the file:
file --mime myfile
myfile: text/plain; charset=iso-8859-1
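require 'shellwords'

def detect_charset(file_path)
  # Escape the path so spaces or shell metacharacters cannot break the command.
  `file --mime #{Shellwords.escape(file_path)}`.strip.split('charset=').last
rescue => e
  Rails.logger.warn "Unable to determine charset of #{file_path}"
  Rails.logger.warn "Error: #{e.message}"
  nil
end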
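If you would rather not go through a shell at all, here is a rough sketch using Open3 from the standard library. The helper names parse_charset and detect_charset_safely are made up for this example, and it assumes a file binary that supports --brief and --mime (as the common Linux one does):

```ruby
require 'open3'

# Hypothetical helper: pull the charset out of one line of `file --mime` output,
# e.g. "myfile: text/plain; charset=iso-8859-1". Returns nil if none is present.
def parse_charset(output)
  output[/charset=([^\s;]+)/, 1]
end

# Hypothetical helper: run `file` with an argument list instead of a shell
# string, so nothing in file_path is interpreted as shell syntax.
def detect_charset_safely(file_path)
  # --brief omits the file name from the output; "--" ends option parsing.
  output, status = Open3.capture2('file', '--brief', '--mime', '--', file_path)
  status.success? ? parse_charset(output) : nil
rescue SystemCallError
  nil # e.g. the `file` binary is not installed
end
```

Since file only inspects part of the file, this should not pull the whole 15GB into memory.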
The method you suggest in your question will not do what you think. File.read does not inspect the bytes at all. It simply labels the resulting string with Encoding.default_external, or, if Encoding.default_internal is set, transcodes the data from Encoding.default_external and labels it with Encoding.default_internal. In a typical setup the result is UTF-8 for every file. The encoding you get back always comes from those defaults after you run that code; it is not guessed or determined from the actual file.
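You can check this yourself with a quick experiment. This is only a sketch: it writes a few ISO-8859-1 bytes to a temp file and shows that the label on the string read back comes from Ruby's defaults, not from the file's contents.

```ruby
require 'tempfile'

file = Tempfile.new('latin1')
file.binmode
file.write("caf\xE9".b) # "café" in ISO-8859-1: four bytes, 0xE9 is "é"
file.close

text = File.read(file.path)
expected_label = Encoding.default_internal || Encoding.default_external

puts text.encoding                   # whatever the defaults say
puts text.encoding == expected_label # the label was assigned, not detected

file.unlink
```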
If you have a file and you really don't know what encoding it is, you will indeed have to guess. There's no way to be 100% sure you've gotten it right as the author intended (and some files are corrupt, use mixed encodings, or are not legal in any encoding).
There are libraries with heuristics meant to try and guess (they won't be right all the time).
Here's one, which I've never actually used myself, but it's the likeliest prospect I found in 10 minutes of googling: https://github.com/oleander/rchardet. There might be other Ruby gems for this. You could also shell out from Ruby to a Linux command-line utility that tries to do this as well (use backticks rather than system(), since system() does not return the command's output); someone above mentions the Linux file command.
If you don't want to load the entire file in to test it, you can certainly just load part of it. The chardet library will probably work more reliably the more data it has, but, sure, just read the first X bytes of the file and then ask chardet to guess its encoding.
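require 'rchardet19'
first_1000_bytes = File.read(file, 1000) # reads only the first 1000 bytes
cd = CharDet.detect(first_1000_bytes)
cd.encoding   # the guessed encoding name
cd.confidence # how sure the heuristic is, between 0.0 and 1.0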
You can also always check to see if any string in ruby is valid for the encoding it's set at:
str.valid_encoding?
So you could simply go through a variety of encodings and see if it's valid:
orig_encoding = str.encoding
str.force_encoding("ISO-8859-1").valid_encoding?
str.force_encoding("UTF-8").valid_encoding?
str.force_encoding(orig_encoding) # put it back to what it was
But it's certainly possible for a file to be valid in more than one encoding, or to be valid in a given encoding but read as nonsense by humans in that encoding.
If you have your best-guess encoding, but the string is still not valid_encoding? for that encoding, it may just have a few bad bytes in it. You can remove or replace them with String#scrub in Ruby 2.1 and later, or with this pure-Ruby backport of String#scrub in older Ruby versions.
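For example, here is a tiny sketch of scrubbing with a string that contains one byte that is never legal in UTF-8:

```ruby
bad = "abc\xFFdef".dup.force_encoding('UTF-8')
puts bad.valid_encoding?   # false: 0xFF cannot appear in UTF-8

clean = bad.scrub('?')     # replace each invalid byte sequence with "?"
puts clean                 # abc?def
puts clean.valid_encoding? # true
```

Called without an argument, scrub substitutes the Unicode replacement character (U+FFFD) for Unicode encodings instead of "?".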
Hope this helps give you some idea of what you're dealing with and what your options are.