
Determining encoding for a file in Ruby

I have come up with a method to determine encoding (or at least a guess at it) for a file that I pass in:

def encoding_type(file_path)
  File.read(file_path).encoding.name
end

The problem with this is that I have a file that is 15GB, so that means the entire file is being read into memory.

Is there any way to accomplish what I am doing in this method without needing to read the entire file into memory?

Jackson asked Jul 22 '14


2 Answers

The file --mime command reports the MIME type and encoding of a file:

file --mime myfile

myfile: text/plain; charset=iso-8859-1

require 'shellwords'

def detect_charset(file_path)
  # Escape the path so spaces or shell metacharacters can't break the command
  `file --mime #{Shellwords.escape(file_path)}`.strip.split('charset=').last
rescue => e
  Rails.logger.warn "Unable to determine charset of #{file_path}"
  Rails.logger.warn "Error: #{e.message}"
  nil # the caller gets nil when detection fails
end
Darren Hicks answered Oct 02 '22


The method you suggest in your question will not do what you think. File.read simply tags the string with Encoding.default_external (and, if Encoding.default_internal is set, transcodes it to that), both usually UTF-8. The encoding you get back is always that default; nothing is being guessed or detected from the file's actual bytes.

If you have a file and you really don't know what encoding it is, you indeed will have to guess. There's no way to be 100% sure you've gotten it right as the author intended (and some files are corrupt and mixed encoding or not legal in any encoding).

There are libraries with heuristics meant to try and guess (they won't be right all the time).

Here's one, which I've never actually used myself, but the likeliest prospect I found in ten minutes of googling: https://github.com/oleander/rchardet There may be other Ruby gems for this. You could also shell out from Ruby to a command-line utility that does the same job; the other answer here uses the Linux file command.

If you don't want to load the entire file to test it, you can certainly load just part of it. The chardet library will probably be more reliable the more data it gets, but, sure, just read the first X bytes of the file and then ask chardet to guess their encoding.

 require 'chardet19'

 first1000bytes = File.read(file, 1000)
 cd = CharDet.detect(first1000bytes)
 cd.encoding
 cd.confidence
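Another cheap partial-read check, not mentioned in the answers above, is a BOM sniff. This is only a sketch (the BOMS table and bom_encoding name are my own, not from any library), and it only helps for Unicode files that actually start with a byte-order mark:

```ruby
# Identify Unicode encodings from a byte-order mark by reading only the
# first four bytes of the file. Returns nil when there is no BOM.
BOMS = {
  "\xEF\xBB\xBF".b     => Encoding::UTF_8,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE, # must come before UTF-16LE
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE".b         => Encoding::UTF_16LE,
  "\xFE\xFF".b         => Encoding::UTF_16BE,
}.freeze

def bom_encoding(file_path)
  head = File.binread(file_path, 4) || "".b # reads at most 4 bytes
  BOMS.each { |bom, enc| return enc if head.start_with?(bom) }
  nil
end
```

Note the UTF-32LE entry is listed before UTF-16LE because its BOM begins with the same two bytes.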

You can also always check whether any Ruby string is valid for the encoding it is currently tagged with:

 str.valid_encoding?

So you could simply go through a variety of encodings and see if it's valid:

 orig_encoding = str.encoding

 str.force_encoding("ISO-8859-1").valid_encoding?
 str.force_encoding("UTF-8").valid_encoding?

 str.force_encoding(orig_encoding) # put it back to what it was

But it's certainly possible for a file to be valid in more than one encoding, or to be valid in a given encoding but read as nonsense by humans in that encoding.
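That trial-and-error check can be wrapped in a small helper. This is just a sketch; the helper name and candidate list are illustrations, not anything from a library:

```ruby
# Try a list of candidate encodings and return the name of the first one
# the bytes are valid in, or nil. A hit only means "decodable", not
# "correct": ISO-8859-1 accepts every possible byte, so it acts as a
# catch-all and should come after stricter encodings like UTF-8.
def first_valid_encoding(str, candidates = %w[UTF-8 ISO-8859-1 Windows-1252])
  bytes = str.dup # work on a copy so the caller's string keeps its encoding
  candidates.find { |enc| bytes.force_encoding(enc).valid_encoding? }
end
```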

If you have your best-guess encoding, but the string is still not valid_encoding? for it, it may just have a few bad bytes in it. You can remove them with String.scrub in Ruby 2.1, or with a pure-Ruby backport of String.scrub on older Ruby versions.
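For example, on Ruby 2.1+ (the sample string here is just an illustration):

```ruby
# A string tagged UTF-8 but containing a byte (\xE9) that is invalid in it:
s = "caf\xE9 ok"
s.valid_encoding?      # false
cleaned = s.scrub("")  # drop invalid bytes; pass "?" etc. to substitute instead
cleaned                # "caf ok", now valid UTF-8
```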

Hope this helps give you some idea of what you're dealing with and what your options are.

jrochkind answered Oct 02 '22