Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conditional Data Manipulation in Mathematica

I am trying to prepare the best tools for efficient Data Analysis in Mathematica. I have a approximately 300 Columns & 100 000 Rows.

What would be the best tricks to :

"Remove", "Extract" or simply "Consider" parts of the data structure, for plotting for e.g.

One of the trickiest examples I could think of is :

Given a data structure,

Extract Column 1 to 3, 6 to 9 as well as the last One for every lines where the value in Column 2 is equal to x and the value in column 8 is different than y

I also welcome any general advice on data manipulation.

like image 671
500 Avatar asked May 25 '11 20:05

500


2 Answers

For a generic manipulation of data in a table with named columns, I refer you to this solution of mine, for a similar question. For any particular case, it might be easier to write a function for Select manually. However, for many columns, and many different queries, chances to mess up indexes are high. Here is the modified solution from the mentioned post, which provides a more friendly syntax:

Clear[getIds];
getIds[table : {colNames_List, rows__List}] := {rows}[[All, 1]];

ClearAll[select, where];
SetAttributes[where, HoldAll];
select[cnames_List, from[table : {colNames_List, rows__List}], where[condition_]] :=
With[{colRules =  Dispatch[ Thread[colNames -> Thread[Slot[Range[Length[colNames]]]]]],
    indexRules  =  Dispatch[Thread[colNames -> Range[Length[colNames]]]]},
     With[{selF = Apply[Function, Hold[condition] /. colRules]},
       Select[{rows}, selF @@ # &][[All, cnames /. indexRules]]]];

What happens here is that the function used in Select gets generated automatically from your specifications. For example (using @Yoda's example):

rows = Array[#1 #2 &, {5, 15}];

We need to define the column names (must be strings or symbols without values):

In[425]:= 
colnames = "c" <> ToString[#] & /@ Range[15]

Out[425]= {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12", 
"c13", "c14", "c15"}

(in practice, usually names are more descriptive, of course). Here is the table then:

table = Prepend[rows, colnames];

Here is the select statement you need (I picked x = 4 and y=2):

select[{"c1", "c2", "c3", "c6", "c7", "c8", "c9", "c15"}, from[table],
    where["c2" == 4 && "c8" != 2]]

{{2, 4, 6, 12, 14, 16, 18, 30}}

Now, for a single query, this may look like a complicated way to do this. But you can do many different queries, such as

In[468]:= select[{"c1", "c2", "c3"}, from[table], where[EvenQ["c2"] && "c10" > 10]]

Out[468]= {{2, 4, 6}, {3, 6, 9}, {4, 8, 12}, {5, 10, 15}}

and similar.

Of course, if there are specific correlations in your data, you might find a particular special-purpose algorithm which will be faster. The function above can be extended in many ways, to simplify common queries (include "all", etc), or to auto-compile the generated pure function (if possible).

EDIT

On a philosophical note, I am sure that many Mathematica users (myself included) found themselves from time to time writing similar code again and again. The fact that Mathematica has a concise syntax makes it often very easy to write for any particular case. However, as long as one works in some specific domain (like, for example, data manipulations in a table), the cost of repeating yourself will be high for many operations. What my example illustrates in a very simple setting is a one possible way out - create a Domain-Specific Language (DSL). For that, one generally needs to define a syntax/grammar for it, and write a compiler from it to Mathematica (to generate Mathematica code automatically). Now, the example above is a very primitive realization of this idea, but my point is that Mathematica is generally very well suited for DSL creation, which I think is a very powerful technique.

like image 188
Leonid Shifrin Avatar answered Oct 13 '22 17:10

Leonid Shifrin


data = RandomInteger[{1, 20}, {40, 20}]

x = 5;
y = 8;
Select[data, (#[[2]] == x && #[[8]] != y &)][[All, {1, 2, 3, 6, 7, 8, 9, -1}]]

==> {{5, 5, 1, 4, 18, 6, 3, 5}, {10, 5, 15, 3, 15, 14, 2, 5}, {18, 5, 6, 7, 7, 19, 14, 6}}

Some useful commands to get pieces of matrices and list are Span (;;), Drop, Take, Select, Cases and more. See tutorial/GettingAndSettingPiecesOfMatrices and guide/PartsOfMatrices,

Part ([[...]]) in combination with ;; can be quite powerful. a[[All, 1;;-1;;2]], for instance, means take all rows and all odd columns (-1 having the usual meaning of counting from the end).

Select can be used to pick elements from a list (and remember a matrix is a list of lists), based on a logical function. It's twin brother is Cases which does selection based on a pattern. The function I used here is a 'pure' function, where # refers to the argument on which this function is applied (the elements of the list in this case). Since the elements are lists themselves (the rows of the matrix) I can refer to the columns by using the Part ([[..]]) function.

like image 45
Sjoerd C. de Vries Avatar answered Oct 13 '22 17:10

Sjoerd C. de Vries