
Working with heterogeneous data in a statically typed language (F#)


One of F#'s claims is that it allows for interactive scripting and data manipulation / exploration. I've been playing around with F#, trying to get a sense of how it compares with Matlab and R for data analysis work. Obviously F# does not have all the practical functionality of these ecosystems, but I am more interested in the general advantages / disadvantages of the underlying language.

For me the biggest change, even more than the functional style, is that F# is statically typed. This has some appeal, but it also often feels like a straitjacket. For instance, I have not found a convenient way to deal with heterogeneous rectangular data -- think of a data frame in R. Assume I'm reading a CSV file with names (string) and weights (float). Typically I load the data, perform some transformations, add variables, etc., and then run the analysis. In R, the first part might look like:

df <- read.csv('weights.csv')
df$logweight <- log(df$weight)

In F#, it's not clear what structure I should use to do this. As far as I can tell I have two options: 1) I can define a class first that is strongly typed (Expert F# 9.10) or 2) I can use a heterogeneous container such as ArrayList. A statically typed class doesn't seem feasible because I need to add variables in an ad-hoc manner (logweight) after loading the data. A heterogeneous container is also inconvenient because every time I access a variable I will need to unbox it. In F#:

let df = readCsv("weights.csv")
df.["logweight"] <- log(double df.["weight"])

If this were once or twice, it might be okay, but specifying a type every time I use a variable doesn't seem reasonable. I often deal with surveys with 100s of variables that are added/dropped, split into new subsets and merged with other dataframes.

Am I missing some obvious third choice? Is there some fun and light way to interact with and manipulate heterogeneous data? If I need to do data analysis on .NET, my current sense is that I should use IronPython for all the data exploration / transformation / interaction work, and only use F#/C# for the numerically intensive parts. Is F# inherently the wrong tool for quick and dirty heterogeneous data work?

Tristan asked Nov 25 '09 18:11



1 Answer

Is F# inherently the wrong tool for quick and dirty heterogeneous data work?

For completely ad hoc, exploratory data mining, I wouldn't recommend F# since the types would get in your way.

However, if your data is very well defined, then you can hold disparate data types in the same container by mapping all of your types to a common F# union:

> #r "FSharp.PowerPack";;

--> Referenced 'C:\Program Files\FSharp-1.9.6.16\bin\FSharp.PowerPack.dll'

> let rawData =
    "Name: Juliet
     Age: 23
     Sex: F
     Awesome: True"

type csv =
    | Name of string
    | Age of int
    | Sex of char
    | Awesome of bool

let parseData data =
    String.split ['\n'] data
    |> Seq.map (fun s ->
        let parts = String.split [':'] s
        match parts.[0].Trim(), parts.[1].Trim() with
        | "Name", x -> Name(x)
        | "Age", x -> Age(int x)
        | "Sex", x -> Sex(x.[0])
        | "Awesome", x -> Awesome(System.Convert.ToBoolean(x))
        | data, _ -> failwithf "Unknown %s" data)
    |> Seq.to_list;;

val rawData : string =
  "Name: Juliet
     Age: 23
     Sex: F
     Awesome: True"
type csv =
  | Name of string
  | Age of int
  | Sex of char
  | Awesome of bool
val parseData : string -> csv list

> parseData rawData;;
val it : csv list = [Name "Juliet"; Age 23; Sex 'F'; Awesome true]

csv list is strongly typed and you can pattern match over it, but you have to define all of your union constructors up front.
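As a minimal sketch of what that pattern matching can look like, assuming the csv union from the session above (redeclared here so the snippet stands alone). The helper names tryAge and describe are hypothetical and not part of the original answer:

```fsharp
// Same union as in the FSI session above, redeclared so this snippet is self-contained.
type csv =
    | Name of string
    | Age of int
    | Sex of char
    | Awesome of bool

// Hypothetical helper: return the first Age field in a record, if there is one.
let tryAge (fields : csv list) : int option =
    fields |> List.tryPick (function Age a -> Some a | _ -> None)

// Hypothetical helper: render any field as text.
// The match covers every case, so the compiler warns if a new case is added and not handled.
let describe (field : csv) : string =
    match field with
    | Name n -> sprintf "name=%s" n
    | Age a -> sprintf "age=%d" a
    | Sex s -> sprintf "sex=%c" s
    | Awesome b -> sprintf "awesome=%b" b

let record = [Name "Juliet"; Age 23; Sex 'F'; Awesome true]
let age = tryAge record
let summary = record |> List.map describe |> String.concat ", "
```

The trade-off is the one stated above: every field kind has to exist as a union case before it can be parsed or matched, so a column invented halfway through an analysis means editing the type.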

I personally prefer this approach, since it is orders of magnitude better than working with an untyped ArrayList. However, I'm not really sure what your requirements are, and I don't know a good way to represent ad-hoc variables (except maybe as a Map<string, obj>), so YMMV.
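For what it's worth, here is a rough sketch of how the Map<string, obj> idea might look for the logweight example from the question. This is only one possible reading of it, not an established API: the Row alias, the get and withColumn helpers, and the sample values are all made up for illustration.

```fsharp
// A row as an immutable map from column name to boxed value.
// Assumption: Row, get and withColumn are hypothetical names, not library functions.
type Row = Map<string, obj>

// Fetch a column and unbox it to the requested type.
// Throws KeyNotFoundException if the column is missing,
// and InvalidCastException if the stored value has a different type.
let get<'T> (name : string) (row : Row) : 'T =
    unbox<'T> (row.[name])

// Return a new row with one more column; the original row is left unchanged.
let withColumn (name : string) (value : 'T) (row : Row) : Row =
    row.Add(name, box value)

// Made-up sample data standing in for the contents of weights.csv.
let rows : Row list =
    [ Map.ofList [ "name", box "Juliet"; "weight", box 60.0 ]
      Map.ofList [ "name", box "Tristan"; "weight", box 80.0 ] ]

// Add an ad-hoc "logweight" column, as in the R example from the question.
let withLog =
    rows |> List.map (fun r -> withColumn "logweight" (log (get<float> "weight" r)) r)
```

Columns can be added freely this way, but each read still names its type (get<float>), and a wrong guess only fails at run time, which is the cost the question is worried about.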

Juliet answered Oct 12 '22 03:10
