Is there a convenience function for truncating strings to a certain length?
It would be equivalent to something like this:
test_str = "test"
if length(test_str) > 8
    out_str = test_str[1:8]
else
    out_str = test_str
end
In the naive ASCII world:
truncate_ascii(s,n) = s[1:min(sizeof(s),n)]
would do. If it's preferable to share memory with the original string and avoid copying, SubString can be used:
truncate_ascii(s,n) = SubString(s,1,min(sizeof(s),n))
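A quick check of the two ASCII variants above, plus the case where they break down. This is only a sketch and assumes Julia 1.x, where indexing a String at a byte position inside a multi-byte character throws a StringIndexError. The names truncate_copy and truncate_view are made up here so both variants can sit side by side.

```julia
# The two ASCII variants from above, renamed (hypothetical names) so they can coexist.
truncate_copy(s, n) = s[1:min(sizeof(s), n)]              # returns a new String
truncate_view(s, n) = SubString(s, 1, min(sizeof(s), n))  # shares memory with s

@show truncate_copy("truncate me", 8)
@show truncate_view("test", 8)

# 'é' occupies bytes 2 and 3 in UTF-8, so byte index 3 falls inside a character.
try
    truncate_copy("héllo", 3)
catch err
    println("non-ASCII input failed with ", typeof(err))
end
```

Byte counts and character counts only coincide for pure ASCII input, which is why the versions further down walk the string with nextind instead.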
But in a Unicode world (and it is a Unicode world) this is better:
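function truncate_utf8(s,n)
    eo = lastindex(s) ; neo = 0    # lastindex was called endof before Julia 0.7
    for i=1:n
        neo<eo || break            # stop early if s has fewer than n characters
        neo = nextind(s,neo)       # index of the next character
    end
    return SubString(s,1,neo)
end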
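For readers on Julia 1.0 or later, Base may already cover the original question: as far as I know, first(s, n) returns the first n characters of a string, or the whole string if it is shorter, as a new String (so it copies). The sketch below compares it with a non-copying one-liner built on the three-argument nextind. truncate_chars is a hypothetical name, not something from Base.

```julia
# Hypothetical helper: non-copying truncation to at most n characters (n >= 1).
# nextind(s, 0, n) is the index of the n-th character, or a value past the
# end when s has fewer than n characters, hence the min with lastindex(s).
truncate_chars(s, n) = SubString(s, 1, min(lastindex(s), nextind(s, 0, n)))

s = "αβγδε"                       # five characters, two bytes each

println(first(s, 3))              # built-in, returns a copy
println(truncate_chars(s, 3))     # view into s
println(first(s, 10))             # shorter than 10 characters: returned whole
```

Which of the two to use is a trade-off: first(s, n) is the shortest thing to write, while the SubString version avoids allocating a new string.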
Finally, @IsmaelVenegasCastelló reminded us of grapheme complexity (arrrgh), and then this is what's needed:
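function truncate_grapheme(s,n)
    # Before Julia 0.7 these were endof and Base.UTF8proc.isgraphemebreak
    eo = lastindex(s) ; neo = 0
    for i=1:n
        neo<eo || break                      # s has fewer than n graphemes
        neo = nextind(s,neo)                 # first code point of the next grapheme
        tt = nextind(s,neo)
        # absorb following code points (e.g. combining marks) that belong to the same grapheme
        while tt<=eo && !Base.Unicode.isgraphemebreak(s[neo],s[tt])
            (neo,tt) = (tt,nextind(s,tt))
        end
    end
    return SubString(s,1,neo)
end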
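To illustrate why the grapheme-aware version matters, here is a small sketch assuming Julia 1.x and its Unicode standard library. truncate_graphemes_copy is a hypothetical helper written only for this comparison. Unlike the SubString-based function above, it allocates a new String.

```julia
using Unicode  # standard library; provides graphemes()

# Hypothetical helper: keep at most n graphemes (user-perceived characters).
truncate_graphemes_copy(s, n) = join(Iterators.take(graphemes(s), n))

s = "e\u0301x"   # 'e' + combining acute accent + 'x': three code points, two graphemes

by_codepoint = first(s, 1)                    # keeps only the bare 'e'
by_grapheme  = truncate_graphemes_copy(s, 1)  # keeps 'e' together with its accent

println(length(by_codepoint), " code point vs ", length(by_grapheme), " code points")
```

Truncating by code point silently drops the accent, while truncating by grapheme keeps the visible character intact.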
These last two implementations try to avoid calculating the length (which means scanning the whole string and can be slow), allocating or copying, or even looping n times when the string is shorter than n.
This answer draws on contributions of @MichaelOhlrogge, @FengyangWang, @Oxinabox and @IsmaelVenegasCastelló