
How to clean up a string to use as a filename in Perl?

Tags: regex, perl

I have a job application form where people fill in their name and contact info and attach a resume.

The contact info gets emailed, with the resume attached.

I would like to rename the file so that it is a combination of the competition number and their name.

How can I clean up my generated filename so that I can guarantee it has no invalid characters in it? So far I can remove all the spaces and lowercase the string.

I'd like to remove any punctuation (like apostrophes) and non-alphabetical characters (like accented letters).

For example if "André O'Hara" submitted his resume for job 555 using this form, I would be happy if all the questionable characters were removed and I ended up with a file name like:

555-andr-ohara-resume.doc

What regex can I use to remove all non-alphabetical characters?

Here is my code so far:

 # Build the filename from competition number + first name + last name
 my $hr_generated_filename = $cgi->param("competition")  . "-" . $cgi->param("first") . "-" . $cgi->param("last");

 # change to all lowercase
 $hr_generated_filename = lc( $hr_generated_filename );

 # remove all whitespace
 $hr_generated_filename =~ s/\s+//g;

 push @{ $msg->{attach} }, {
    Type        => 'application/octet-stream',
    Filename    => "$hr_generated_filename.$file_extension",  # assumes $file_extension holds the extension; "$file-extension" would interpolate $file
    Data        => $data,
    Disposition => 'attachment',
    Encoding    => 'base64',
 };
asked Aug 18 '10 at 19:08 by jeph perro



1 Answer

If you are trying to "white-list" characters, your basic approach should be to use a character class complement:

[...] defines a character class in Perl regexes, which will match any characters defined inside (including ranges such as a-z). If you add a ^, it becomes a complement, so it matches any characters not defined inside the brackets.

$hr_generated_filename =~ s/[^A-Za-z0-9\-\.]//g;

That removes every character that is not an unaccented Latin letter, a digit, a dash, or a dot. To extend your white-list, just add more characters inside the [^...].
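Putting the question's steps and the white-list substitution together, a minimal sketch might look like the following. The helper name clean_filename and its three arguments are hypothetical (they are not from the original post); the sketch assumes the competition number and names arrive as plain strings, as they would from $cgi->param.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original post: builds a safe base
# filename from the competition number and the applicant's name.
sub clean_filename {
    my ($competition, $first, $last) = @_;

    # Join the parts with dashes and lowercase the result.
    my $name = lc join '-', $competition, $first, $last;

    # Remove all whitespace, as in the question's code.
    $name =~ s/\s+//g;

    # White-list: delete everything except ASCII letters, digits, dash and dot.
    $name =~ s/[^A-Za-z0-9.-]//g;

    return $name;
}

print clean_filename('555', 'André', "O'Hara"), "-resume.doc\n";
# prints "555-andr-ohara-resume.doc"
```

Because the slash is not in the white-list, directory separators are stripped too. If the name fields should never contribute a dot, one option is to drop the dot from the class and append the extension afterwards.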

answered Oct 21 '22 at 13:10 by Bounderby
