 

Mathematica : Conditional Operations on Lists

I would like to average across rows in a column, that is, across rows that have the same value in another column.

For example :

e= {{1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}, 
   {69, 7, 30, 38, 16, 70, 97, 50, 97, 31, 81, 96, 60, 52, 35, 6, 
    24, 65, 76, 100}}


I would like to average all the values in the second column that share the same value in the first one.

So here: one average for the rows where Col 1 = 1, and another for the rows where Col 1 = 2.

And then create a third column with the result of this operation, so the values in that column should be the same for the first 10 rows and for the next 10.

Many thanks for any help you could provide!

LA

Ideal output format:

[Image: the desired output, i.e. the two original columns plus a third column holding the group mean]
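For reference, here is a small sketch of what that third column presumably contains for the data above, treating e as a list of {key, value} pairs (pairs is just an illustrative name):

pairs = Transpose[e];   (* {{1, 69}, {1, 7}, ..., {2, 100}} *)

Mean[pairs[[1 ;; 10, 2]]]    (* 101/2 = 50.5, shared by the first 10 rows *)
Mean[pairs[[11 ;; 20, 2]]]   (* 119/2 = 59.5, shared by the last 10 rows *)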

Asked Dec 07 '22 by 500

2 Answers

Interesting problem. This is the first thing that came into my mind:

e[[All, {1}]] /. Reap[Sow[#2, #] & @@@ e, _, # -> Mean@#2 &][[2]];

ArrayFlatten[{{e, %}}] // TableForm

To get rounding you may simply add Round@ before Mean in the code above: Round@Mean@#2
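To make the intermediate steps a bit more concrete, here is roughly what each piece evaluates to for the example data (assuming e holds the {key, value} pairs; rules and col are just illustrative names):

rules = Reap[Sow[#2, #] & @@@ e, _, # -> Mean@#2 &][[2]]
(* {1 -> 101/2, 2 -> 119/2}, one rule per distinct key *)

col = e[[All, {1}]] /. rules;   (* a 20 x 1 column: each key replaced by its group mean *)

ArrayFlatten[{{e, col}}] // TableForm   (* the 20 x 2 data glued to the 20 x 1 column *)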

Here is a slightly faster method, but I actually prefer the Sow/Reap one above:

#[[1, 1]] -> Round@Mean@#[[All, 2]] & /@ GatherBy[e, First];

ArrayFlatten[{{e, e[[All, {1}]] /. %}}] // TableForm
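For the example data, the pieces of this version presumably look as follows (the Round is left out here for clarity):

GatherBy[e, First]
(* {{{1, 69}, {1, 7}, ..., {1, 31}}, {{2, 81}, ..., {2, 100}}} *)

#[[1, 1]] -> Mean@#[[All, 2]] & /@ GatherBy[e, First]
(* {1 -> 101/2, 2 -> 119/2} *)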

If you have many different elements in the first column, either of the solutions above can be made faster by applying Dispatch to the rule list that is produced, before the replacement (/.) is done. This command tells Mathematica to build and use an optimized internal format for the rules list.
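A sketch of how that would look for the first solution (rules is again just an illustrative name):

rules = Dispatch[Reap[Sow[#2, #] & @@@ e, _, # -> Mean@#2 &][[2]]];

ArrayFlatten[{{e, e[[All, {1}]] /. rules}}] // TableForm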

Here is a variant that is slower, but I like it enough to share anyway:

Module[{q},
  Reap[{#, Sow[#2,#], q@#} & @@@ e, _, (q@# = Mean@#2) &][[1]]
]

Also, as general tips, you can replace:

Table[RandomInteger[{1, 100}], {20}] with RandomInteger[{1, 100}, 20]

and Join[{c}, {d}] // Transpose with Transpose[{c, d}].

Answered Jan 01 '23 by Mr.Wizard


What the heck, I'll join the party. Here is my version:

Flatten /@ Flatten[Thread /@ Transpose@{#, Mean /@ #[[All, All, 2]]} &@GatherBy[e, First], 1]

Should be fast enough I guess.
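Applied to the question's data (as {key, value} pairs), this presumably yields rows of the form {key, value, group mean}; note that, in general, the rows come out grouped by key rather than in the original order (see the EDIT below):

Flatten /@ Flatten[Thread /@ Transpose@{#, Mean /@ #[[All, All, 2]]} &@GatherBy[e, First], 1]
(* {{1, 69, 101/2}, {1, 7, 101/2}, ..., {2, 100, 119/2}} *)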

EDIT

In response to the critique by @Mr.Wizard (my first solution reordered the list), and to explore the high-performance corner of the problem a bit, here are two alternative solutions:

getMeans[e_] := 
Module[{temp = ConstantArray[0, Max[#[[All, 1, 1]]]]},
  temp[[#[[All, 1, 1]]]] = Mean /@ #[[All, All, 2]];
  List /@ temp[[e[[All, 1]]]]] &[GatherBy[e, First]];

getMeansSparse[e_] := 
Module[{temp = SparseArray[{Max[#[[All, 1, 1]]] -> 0}]},
  temp[[#[[All, 1, 1]]]] = Mean /@ #[[All, All, 2]];
  List /@ Normal@temp[[e[[All, 1]]]]] &[GatherBy[e, First]];

The first one is the fastest, trading memory for speed; it can be applied when the keys are all integers and the maximal "key" value (2 in your example) is not too large. The second solution is free from the latter limitation, but is slower. Here is a large list of pairs:

In[303]:= 
tst = RandomSample[#, Length[#]] &@
   Flatten[Map[Thread[{#, RandomInteger[{1, 100}, 300]}] &, 
      RandomSample[Range[1000], 500]], 1];

In[310]:= Length[tst]

Out[310]= 150000

In[311]:= tst[[;; 10]]

Out[311]= {{947, 52}, {597, 81}, {508, 20}, {891, 81}, {414, 47}, 
{849, 45}, {659, 69}, {841, 29}, {700, 98}, {858, 35}}

The keys range from 1 to 1000 here (500 distinct ones are used), and there are 300 random numbers for each key. Now, some benchmarks:

In[314]:= (res0 = getMeans[tst]); // Timing

Out[314]= {0.109, Null}

In[317]:= (res1 = getMeansSparse[tst]); // Timing

Out[317]= {0.219, Null}

In[318]:= (res2 =  tst[[All, {1}]] /. 
 Reap[Sow[#2, #] & @@@ tst, _, # -> Mean@#2 &][[2]]); // Timing

Out[318]= {5.687, Null}

In[319]:= (res3 = tst[[All, {1}]] /. 
 Dispatch[
  Reap[Sow[#2, #] & @@@ tst, _, # -> Mean@#2 &][[2]]]); // Timing

Out[319]= {0.391, Null}

In[320]:= res0 === res1 === res2 === res3

Out[320]= True

We can see that getMeans is the fastest here, getMeansSparse the second fastest, and the solution of @Mr.Wizard is somewhat slower when we use Dispatch, and much slower without it. My solutions and @Mr.Wizard's (with Dispatch) are similar in spirit; the speed difference comes from (sparse) array indexing being more efficient than hash lookup. Of course, all this matters only when your list is really large.

EDIT 2

Here is a version of getMeans which uses Compile with a C target and returns numerical values (rather than rationals). It is about twice as fast as getMeans, and the fastest of my solutions.

getMeansComp = 
 Compile[{{e, _Integer, 2}},
   Module[{keys = e[[All, 1]], values = e[[All, 2]], sums = {0.},
      lengths = {0}, i = 1, means = {0.}, max = 0, key = -1,
      len = Length[e]},
    max = Max[keys];
    sums = Table[0., {max}];
    lengths = Table[0, {max}];
    means = sums;
    Do[key = keys[[i]];
      sums[[key]] += values[[i]];
      lengths[[key]]++, {i, len}];
    means = sums/(lengths + (1 - Unitize[lengths]));
    means[[keys]]], CompilationTarget -> "C", RuntimeOptions -> "Speed"]

getMeansC[e_] := List /@ getMeansComp[e];

The code 1 - Unitize[lengths] protects against division by zero for unused keys; a tiny sketch of the idea follows below. We need every number in a separate sublist, so we should call getMeansC, not getMeansComp directly.
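For instance, on a hypothetical lengths vector with an unused key (the name just mirrors the variable inside getMeansComp):

lengths = {3, 0, 2};
lengths + (1 - Unitize[lengths])   (* {3, 1, 2}: zero counts become 1, so the unused slot yields 0./1 = 0. *)

Here are some measurements: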

In[180]:= (res1 = getMeans[tst]); // Timing

Out[180]= {0.11, Null}

In[181]:= (res2 = getMeansC[tst]); // Timing

Out[181]= {0.062, Null}

In[182]:= N@res1 == res2

Out[182]= True

This can probably be considered a heavily optimized numerical solution. The fact that the fully general, brief and beautiful solution of @Mr.Wizard is only about 6-8 times slower speaks very well for that general, concise approach, so unless you want to squeeze out every microsecond, I'd stick with @Mr.Wizard's solution (with Dispatch). But it is important to know how far code can be optimized, and what you can expect from such optimization.

Answered Jan 01 '23 by Leonid Shifrin