I have user entries as filenames. Of course this is not a good idea, so I want to drop everything except [a-z]
, [A-Z]
, [0-9]
, _
and -
.
For instance:
my§document$is°° very&interesting___thisIs%nice445.doc.pdf
should become
my_document_is_____very_interesting___thisIs_nice445_doc.pdf
and then ideally
my_document_is_very_interesting_thisIs_nice445_doc.pdf
Is there a nice and elegant way for doing this?
I'd like to suggest a solution that differs from the old one. Note that the old one uses the deprecated returning
. By the way, it's anyway specific to Rails, and you didn't explicitly mention Rails in your question (only as a tag). Also, the existing solution fails to encode .doc.pdf
into _doc.pdf
, as you requested. And, of course, it doesn't collapse the underscores into one.
Here's my solution:
def sanitize_filename(filename) # Split the name when finding a period which is preceded by some # character, and is followed by some character other than a period, # if there is no following period that is followed by something # other than a period (yeah, confusing, I know) fn = filename.split /(?<=.)\.(?=[^.])(?!.*\.[^.])/m # We now have one or two parts (depending on whether we could find # a suitable period). For each of these parts, replace any unwanted # sequence of characters with an underscore fn.map! { |s| s.gsub /[^a-z0-9\-]+/i, '_' } # Finally, join the parts with a period and return the result return fn.join '.' end
You haven't specified all the details about the conversion. Thus, I'm making the following assumptions:
A
–Z
, a
–z
, 0
–9
and -
should be collapsed into a single _
(i.e. underscore is itself regarded as a disallowed character, and the string '$%__°#'
would become '_'
– rather than '___'
from the parts '$%'
, '__'
and '°#'
)The complicated part of this is where I split the filename into the main part and extension. With the help of a regular expression, I'm searching for the last period, which is followed by something else than a period, so that there are no following periods matching the same criteria in the string. It must, however, be preceded by some character to make sure it's not the first character in the string.
My results from testing the function:
1.9.3p125 :006 > sanitize_filename 'my§document$is°° very&interesting___thisIs%nice445.doc.pdf' => "my_document_is_very_interesting_thisIs_nice445_doc.pdf"
which I think is what you requested. I hope this is nice and elegant enough.
From http://web.archive.org/web/20110529023841/http://devblog.muziboo.com/2008/06/17/attachment-fu-sanitize-filename-regex-and-unicode-gotcha/:
def sanitize_filename(filename) returning filename.strip do |name| # NOTE: File.basename doesn't work right with Windows paths on Unix # get only the filename, not the whole path name.gsub!(/^.*(\\|\/)/, '') # Strip out the non-ascii character name.gsub!(/[^0-9A-Za-z.\-]/, '_') end end
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With