I'm saving the recording of a set of sentences to a corresponding set of audio files.
Sentences include:
Ich weiß es nicht!
¡No lo sé!
Ég veit ekki!
How would you recommend I convert the sentence to a human readable filename which will later be served on an online server. I'm not sure right now as to what languages I might be dealing with in the future.
UPDATE:
Please note that two sentences can't clash with each other. For example:
É bär icke dej.
E bår icke dej.
can't resolve to the same filename as these will overwrite each other. This is the problem with the slugify function mentioned here: Turn a string into a valid filename?
The best I have come up with is to use urllib.parse.quote. However I think the resulting output is harder to read than I would have hoped. Any suggestions?:
Ich%20wei%C3%9F%20es%20nicht%21
%C2%A1No%20lo%20s%C3%A9%21
%C3%89g%20veit%20ekki%21
What about unidecode?
import unidecode
a = [u'Ich weiß es nicht!', u'¡No lo sé!', u'Ég veit ekki!']
for s in a:
print(unidecode.unidecode(s).replace(' ', '_'))
This gives pure ASCII strings that can readily be processed if they still contain unwanted characters. Keeping spaces distinct in the form of underscores helps with readability.
Ich_weiss_es_nicht!
!No_lo_se!
Eg_veit_ekki!
If uniqueness is a problem, a hash or something like that might be added to the strings.
Edit:
Some clarification seems to be required with respect to the hashing. Many hash functions are explicitely designed for giving very different outputs for close inputs. For example, the built-in hash function of python gives:
In [1]: hash('¡No lo sé!')
Out[1]: 6428242682022633791
In [2]: hash('¡No lo se!')
Out[2]: 4215591310983444451
With that you can do something like
unidecode.unidecode(s).replace(' ', '_') + '_' + str(hash(s))[:10]
in order to get not too long strings. Even with such shortened hashes, clashes are pretty unlikely.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With