I have a very long DataArray of strings, and I would like to generate a DataFrame in which one column holds the unique strings and the second holds the number of occurrences of each. Right now I'm doing something like this:
using DataFrames
df = DataFrame()
df[:B]=[ "a", "c", "c", "D", "E"]
uniqueB = unique(df[:B])
println(uniqueB)
howMany = zeros(Int, size(uniqueB))  # integer counts rather than the default Float64
for i=1:size(uniqueB,1)
howMany[i] = count(j->(j==uniqueB[i]), df[:B])
end
answer = DataFrame()
answer[:Letters] = uniqueB
answer[:howMany] = howMany
answer
but it seems like there should be a much easier way to do this, possibly with a single line. (I know I could also make this a bit faster with somewhat more code by searching the result in each iteration rather than the source.) A possibly related question is here but it doesn't look like hist is overloaded for non-numerical bins. Any thoughts?
If you want a full frame, you can group by B and call nrow on each group:
julia> by(df, :B, nrow)
4x2 DataFrames.DataFrame
| Row | B | x1 |
|-----|-----|----|
| 1 | "D" | 1 |
| 2 | "E" | 1 |
| 3 | "a" | 1 |
| 4 | "c" | 2 |
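The `by` function used above was removed in later DataFrames.jl releases (1.0 and newer). As a hedged sketch of the same group-and-count idea on a current version, assuming DataFrames.jl 1.x is installed, `groupby` plus `combine` with `nrow` does the equivalent job. The column name `howMany` is simply taken from the question, and the order of the groups in the result should not be relied on.

```julia
using DataFrames

# Same sample data as in the question, built with the keyword constructor.
df = DataFrame(B = ["a", "c", "c", "D", "E"])

# Group the rows by the values in column B, then count the rows in each group.
# `nrow => :howMany` names the resulting count column.
answer = combine(groupby(df, :B), nrow => :howMany)
```

The result is an ordinary DataFrame with the columns `B` and `howMany`, so it can be sorted or joined like any other frame.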
Even outside the DataFrame context, though, you can always use DataStructures.counter rather than reimplementing it yourself:
julia> using DataStructures
julia> counter(df[:B])
DataStructures.Accumulator{ASCIIString,Int32}(Dict("D"=>1,"a"=>1,"c"=>2,"E"=>1))
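For cases where adding a package is not an option, the same single-pass counting idea can be sketched with a plain `Dict` from Base. This is only an illustration of the technique, not how DataStructures implements `counter`; the helper name `count_occurrences` is hypothetical.

```julia
# Hypothetical helper: count how often each element occurs, in one pass.
function count_occurrences(items)
    counts = Dict{eltype(items),Int}()
    for item in items
        # Start from 0 the first time an item is seen, then add one.
        counts[item] = get(counts, item, 0) + 1
    end
    return counts
end

counts = count_occurrences(["a", "c", "c", "D", "E"])
```

Unlike the loop in the question, this visits each element once instead of rescanning the whole column for every unique value.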