If I have the string "UGGUGUUAUUAAUGGUUU", how do I turn it into a list split every 3 characters, i.e. ["UGG", "UGU", "UAU", "UAA", "UGG", "UUU"]?
If your string contains only ASCII characters and your string's byte_size is a multiple of 3, there's a really elegant solution using a lesser-known Elixir feature, binary comprehensions:
iex(1)> string = "UGGUGUUAUUAAUGGUUU"
"UGGUGUUAUUAAUGGUUU"
iex(2)> for <<x::binary-3 <- string>>, do: x
["UGG", "UGU", "UAU", "UAA", "UGG", "UUU"]
This splits the string into chunks of 3 bytes. This will be much faster than splitting on codepoints or graphemes but will not work correctly if your string contains non-ASCII characters. (In that case I'd go with @michalmuskala's answer.)
Edit: Patrick Oscity's answer reminded me that this can also work for codepoints:
iex(1)> string = "αβγδεζηθικλμνξοπρςστυφχψ"
"αβγδεζηθικλμνξοπρςστυφχψ"
iex(2)> for <<a::utf8, b::utf8, c::utf8 <- string>>, do: <<a::utf8, b::utf8, c::utf8>>
["αβγ", "δεζ", "ηθι", "κλμ", "νξο", "πρς", "στυ", "φχψ"]
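One thing worth knowing about the comprehension above: when the byte size is not a multiple of 3, the generator simply stops once the pattern no longer matches, so the trailing bytes are silently skipped. Here is a minimal sketch of that behaviour and of one way to recover the leftover. The 17-byte input and the variable names are made up for illustration.

```elixir
# 17 bytes, so the last chunk is incomplete
string = "UGGUGUUAUUAAUGGUU"

# The generator stops when fewer than 3 bytes remain
chunks = for <<x::binary-3 <- string>>, do: x
# chunks == ["UGG", "UGU", "UAU", "UAA", "UGG"]

# Recover the unmatched tail with binary_part/3 (byte offsets, so ASCII only)
leftover_size = rem(byte_size(string), 3)
leftover = binary_part(string, byte_size(string) - leftover_size, leftover_size)
# leftover == "UU"
```

If the leftover matters for your use case, you can append it to the list yourself or pick one of the approaches that keep partial chunks.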
"UGGUGUUAUUAAUGGUUU"
|> String.codepoints
|> Enum.chunk_every(3)
|> Enum.map(&Enum.join/1)
I am also wondering if there's a more elegant version
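For reference, the same pipeline can be wrapped in a small function. This is only a sketch: the module name ChunkSketch and the function by_graphemes/2 are hypothetical, and it swaps String.codepoints for String.graphemes, which should keep a base letter together with its combining marks (at some cost in speed).

```elixir
defmodule ChunkSketch do
  # Splits `string` into pieces of `size` graphemes each.
  # Enum.chunk_every/2 keeps a shorter final chunk instead of dropping it.
  def by_graphemes(string, size) do
    string
    |> String.graphemes()
    |> Enum.chunk_every(size)
    |> Enum.map(&Enum.join/1)
  end
end

ChunkSketch.by_graphemes("UGGUGUUAUUAAUGGUUU", 3)
# ["UGG", "UGU", "UAU", "UAA", "UGG", "UUU"]

ChunkSketch.by_graphemes("abcdefg", 3)
# ["abc", "def", "g"]
```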
This can be achieved using the Stream.unfold/2 function. In a way, it's the opposite of reduce: reduce collapses a collection into a single value, while unfold expands a single value into a collection.
As a generator for Stream.unfold/2 we need a function that returns a tuple: the first element is the next member of the generated collection, and the second is the accumulator passed into the next iteration. This describes exactly the function String.split_at/2. Finally, we need a termination condition: String.split_at("", 3) returns {"", ""}. We're not interested in empty strings, so it is enough to consume the generated stream until we encounter the empty string, which can be done with Enum.take_while/2.
string
|> Stream.unfold(&String.split_at(&1, 3))
|> Enum.take_while(&(&1 != ""))
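To make the steps concrete, here is a sketch of the same pipeline run on an input whose length is not a multiple of 3 (the 7-character string is just an illustration). The unfolded stream itself never ends, because String.split_at/2 keeps returning {"", ""} once the string is exhausted; it is Enum.take_while/2 that stops the evaluation.

```elixir
string = "abcdefg"

result =
  string
  # Emits "abc", "def", "g", then "" forever
  |> Stream.unfold(&String.split_at(&1, 3))
  # Stop at the first empty string
  |> Enum.take_while(&(&1 != ""))

# result == ["abc", "def", "g"]
```

Note that this variant keeps the shorter final chunk rather than dropping it.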
Another possibility would be using Regex.scan/2:
iex> string = "abcdef"
iex> Regex.scan(~r/.{3}/, string)
[["abc"], ["def"]]
# In case the number of characters is not evenly divisible by 3
iex> string = "abcdefg"
iex> Regex.scan(~r/.{1,3}/, string)
[["abc"], ["def"], ["g"]]
# If you need to handle unicode characters, you can add the `u` modifier
iex> string = "🙈🙉🙊abc"
iex> Regex.scan(~r/.{1,3}/u, string)
[["🙈🙉🙊"], ["abc"]]
Or using a recursive function, which is a bit verbose but should IMO be the best performing solution using eager evaluation:
defmodule Split do
  def tripels(string), do: do_tripels(string, [])

  defp do_tripels(<<x::utf8, y::utf8, z::utf8, rest::binary>>, acc) do
    do_tripels(rest, [<<x::utf8, y::utf8, z::utf8>> | acc])
  end

  defp do_tripels(_rest, acc) do
    Enum.reverse(acc)
  end
end
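# in case you actually want the rest in the result, change the last clause to
# (the "" clause avoids a trailing empty string when the length is a multiple of 3)
defp do_tripels("", acc), do: Enum.reverse(acc)

defp do_tripels(rest, acc) do
  Enum.reverse([rest | acc])
end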
Please try
List.flatten(Regex.scan(~r/.../, "UGGUGUUAUUAAUGGUUU"))
You will get
["UGG", "UGU", "UAU", "UAA", "UGG", "UUU"]
Source: the documentation for Regex.scan/2 and List.flatten/1.