I would like to average across "Rows" in a column. That is rows that have the same value in another column.
For example :
e= {{1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2},
{69, 7, 30, 38, 16, 70, 97, 50, 97, 31, 81, 96, 60, 52, 35, 6,
24, 65, 76, 100}}
I would like to average all the Value in the second column that have the same value in the first one.
So Here : The Average for Col 1 = 1 & Col 1 = 2
And then create a third column with the result of this operation. So the values in that columns should be the same for the first 10 lines an next 10.
Many Thanks for any help you could provide !
LA
Output Ideal Format :
Interesting problem. This is the first thing that came into my mind:
e[[All, {1}]] /. Reap[Sow[#2, #] & @@@ e, _, # -> Mean@#2 &][[2]];
ArrayFlatten[{{e, %}}] // TableForm
To get rounding you may simply add Round@
before Mean
in the code above: Round@Mean@#2
Here is a slightly faster method, but I actually prefer the Sow
/Reap
one above:
#[[1, 1]] -> Round@Mean@#[[All, 2]] & /@ GatherBy[e, First];
ArrayFlatten[{{e, e[[All, {1}]] /. %}}] // TableForm
If you have many different elements in the first column, either of the solutions above can be made faster by applying Dispatch
to the rule list that is produced, before the replacement (/.
) is done. This command tells Mathematica to build and use an optimized internal format for the rules list.
Here is a variant that is slower, but I like it enough to share anyway:
Module[{q},
Reap[{#, Sow[#2,#], q@#} & @@@ e, _, (q@# = Mean@#2) &][[1]]
]
Also, general tips, you can replace:
Table[RandomInteger[{1, 100}], {20}]
with RandomInteger[{1, 100}, 20]
and Join[{c}, {d}] // Transpose
with Transpose[{c, d}]
.
What the heck, I'll join the party. Here is my version:
Flatten/@Flatten[Thread/@Transpose@{#,Mean/@#[[All,All,2]]}&@GatherBy[e,First],1]
Should be fast enough I guess.
EDIT
In response to the critique of @Mr.Wizard (my first solution was reordering the list), and to explore a bit the high-performance corner of the problem, here are 2 alternative solutions:
getMeans[e_] :=
Module[{temp = ConstantArray[0, Max[#[[All, 1, 1]]]]},
temp[[#[[All, 1, 1]]]] = Mean /@ #[[All, All, 2]];
List /@ temp[[e[[All, 1]]]]] &[GatherBy[e, First]];
getMeansSparse[e_] :=
Module[{temp = SparseArray[{Max[#[[All, 1, 1]]] -> 0}]},
temp[[#[[All, 1, 1]]]] = Mean /@ #[[All, All, 2]];
List /@ Normal@temp[[e[[All, 1]]]]] &[GatherBy[e, First]];
The first one is the fastest, trading memory for speed, and can be applied when keys are all integers, and your maximal "key" value (2 in your example) is not too large. The second solution is free from the latter limitation, but is slower. Here is a large list of pairs:
In[303]:=
tst = RandomSample[#, Length[#]] &@
Flatten[Map[Thread[{#, RandomInteger[{1, 100}, 300]}] &,
RandomSample[Range[1000], 500]], 1];
In[310]:= Length[tst]
Out[310]= 150000
In[311]:= tst[[;; 10]]
Out[311]= {{947, 52}, {597, 81}, {508, 20}, {891, 81}, {414, 47},
{849, 45}, {659, 69}, {841, 29}, {700, 98}, {858, 35}}
The keys can be from 1 to 1000 here, 500 of them, and there are 300 random numbers for each key. Now, some benchmarks:
In[314]:= (res0 = getMeans[tst]); // Timing
Out[314]= {0.109, Null}
In[317]:= (res1 = getMeansSparse[tst]); // Timing
Out[317]= {0.219, Null}
In[318]:= (res2 = tst[[All, {1}]] /.
Reap[Sow[#2, #] & @@@ tst, _, # -> Mean@#2 &][[2]]); // Timing
Out[318]= {5.687, Null}
In[319]:= (res3 = tst[[All, {1}]] /.
Dispatch[
Reap[Sow[#2, #] & @@@ tst, _, # -> Mean@#2 &][[2]]]); // Timing
Out[319]= {0.391, Null}
In[320]:= res0 === res1 === res2 === res3
Out[320]= True
We can see that the getMeans
is the fastest here, getMeansSparse
the second fastest, and the solution of @Mr.Wizard is somewhat slower, but only when we use Dispatch
, otherwise it is much slower. Mine and @Mr.Wizard's solutions (with Dispatch) are similar in spirit, the speed difference is due to (sparse) array indexing being more efficient than hash look-up. Of course, all this matters only when your list is really large.
EDIT 2
Here is a version of getMeans
which uses Compile
with a C target and returns numerical values (rather than rationals). It is about twice faster than getMeans
, and the fastest of my solutions.
getMeansComp =
Compile[{{e, _Integer, 2}},
Module[{keys = e[[All, 1]], values = e[[All, 2]], sums = {0.} ,
lengths = {0}, , i = 1, means = {0.} , max = 0, key = -1 ,
len = Length[e]},
max = Max[keys];
sums = Table[0., {max}];
lengths = Table[0, {max}];
means = sums;
Do[key = keys[[i]];
sums[[key]] += values[[i]];
lengths[[key]]++, {i, len}];
means = sums/(lengths + (1 - Unitize[lengths]));
means[[keys]]], CompilationTarget -> "C", RuntimeOptions -> "Speed"]
getMeansC[e_] := List /@ getMeansComp[e];
The code 1 - Unitize[lengths]
protects against division by zero for unused keys. We need every number in a separate sublist, so we should call getMeansC
, not getMeansComp
directly. Here are some measurements:
In[180]:= (res1 = getMeans[tst]); // Timing
Out[180]= {0.11, Null}
In[181]:= (res2 = getMeansC[tst]); // Timing
Out[181]= {0.062, Null}
In[182]:= N@res1 == res2
Out[182]= True
This can probably be considered a heavily optimized numerical solution. The fact that the fully general, brief and beautiful solution of @Mr.Wizard is only about 6-8 times slower speaks very well for the latter general concise solution, so, unless you want to squeeze every microsecond out of it, I'd stick with @Mr.Wizard's one (with Dispatch
). But it's important to know how to optimize code, and also to what degree it can be optimized (what can you expect).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With