I have a CSV file with two columns, text and count. The goal is to transform the file from this:
some text once,1
some text twice,2
some text thrice,3
To this:
some text once,1
some text twice,1
some text twice,1
some text thrice,1
some text thrice,1
some text thrice,1
repeating each line count times and spreading the count over that many lines.
This seems to me like a good candidate for Seq.unfold, generating the additional lines as we read the file. I have the following generator function:
let expandRows (text:string, number:int32) =
    if number = 0
    then None
    else
        let element = text                  // "element" will be in the generated sequence
        let nextState = (element, number-1) // threaded state replacing looping
        Some (element, nextState)
FSI yields the following function signature:
val expandRows : text:string * number:int32 -> (string * (string * int32)) option
Executing the following in FSI:
let expandedRows = Seq.unfold expandRows ("some text thrice", 3)
yields the expected:
val it : seq<string> = seq ["some text thrice"; "some text thrice"; "some text thrice"]
The question is: how do I plug this into the context of a larger ETL pipeline? For example:
File.ReadLines(inFile)
|> Seq.map createTupleWithCount
|> Seq.unfold expandRows // type mismatch here
|> Seq.iter outFile.WriteLine
The error below is on expandRows in the context of the pipeline.
Type mismatch.
Expecting a 'seq<string * int32> -> ('a * seq<string * int32>) option'
but given a 'string * int32 -> (string * (string * int32)) option'
The type 'seq<string * int32>' does not match the type 'string * int32'
I was expecting that expandRows was returning a seq of string, as in my isolated test. As that is neither the "Expecting" nor the "given", I'm confused. Can someone point me in the right direction?
A gist for the code is here: https://gist.github.com/akucheck/e0ff316e516063e6db224ab116501498
Answer:
Seq.map produces a sequence, but Seq.unfold does not take a sequence, it takes a single value. So you can't directly pipe the output of Seq.map into Seq.unfold. You need to do it element by element instead.
But then, for each element your Seq.unfold will produce a sequence, so the ultimate result will be a sequence of sequences. You can collect all those "subsequences" in a single sequence with Seq.collect:
File.ReadLines(inFile)
|> Seq.map createTupleWithCount
|> Seq.collect (Seq.unfold expandRows)
|> Seq.iter outFile.WriteLine
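The difference between mapping and collecting can be checked in memory, without touching any files. The following is a minimal sketch: createTupleWithCount is not shown in the question, so the version here is a hypothetical parser that assumes each line is "text,count" with the count after the last comma.

```fsharp
// Same generator as in the question, written compactly.
let expandRows (text: string, number: int32) =
    if number = 0 then None
    else Some (text, (text, number - 1))

// Hypothetical parser (an assumption, not the question's actual code):
// splits "text,count" at the last comma.
let createTupleWithCount (line: string) =
    let i = line.LastIndexOf(',')
    (line.Substring(0, i), int (line.Substring(i + 1)))

let input = [ "some text once,1"; "some text twice,2" ]

// Seq.map runs the unfold once per element: a sequence of sequences.
let nested =
    input
    |> Seq.map createTupleWithCount
    |> Seq.map (Seq.unfold expandRows)
    |> Seq.map Seq.toList
    |> Seq.toList

// Seq.collect runs the same unfold per element and flattens the result.
let flat =
    input
    |> Seq.map createTupleWithCount
    |> Seq.collect (Seq.unfold expandRows)
    |> Seq.toList
```

Here nested holds one inner list per input line, while flat is a single list of strings, which is the shape Seq.iter outFile.WriteLine needs.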
Seq.collect takes a function and an input sequence. For every element of the input sequence, the function is supposed to produce another sequence, and Seq.collect will concatenate all those sequences in one. You may think of Seq.collect as Seq.map and Seq.concat combined in one function. Also, if you're coming from C#, Seq.collect is called SelectMany over there.
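The map-plus-concat reading can be illustrated with a small sketch. The helper repeat below is a hypothetical name introduced only for this example; it turns a number n into n copies of itself.

```fsharp
// Hypothetical helper for illustration: n copies of n.
let repeat (n: int) = Seq.init n (fun _ -> n)

let counts = [ 1; 2; 3 ]

// One step: apply the function to each element and flatten.
let viaCollect = counts |> Seq.collect repeat |> Seq.toList

// Two steps: map first, then concatenate the inner sequences.
let viaMapConcat = counts |> Seq.map repeat |> Seq.concat |> Seq.toList
```

Both values come out as [1; 2; 2; 3; 3; 3].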
In this case, since you simply want to repeat a value a number of times, there's no reason to use Seq.unfold. You can use Seq.replicate instead:
// 'a * int -> seq<'a>
let expandRows (text, number) = Seq.replicate number text
You can use Seq.collect to compose it:
File.ReadLines(inFile)
|> Seq.map createTupleWithCount
|> Seq.collect expandRows
|> Seq.iter outFile.WriteLine
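As a rough in-memory check of this version, the sketch below starts from already parsed tuples and also re-attaches a count of 1 to each repeated line, as in the target file from the question. That formatting step is an assumption on my part; the pipeline above writes whatever expandRows yields.

```fsharp
// 'a * int -> seq<'a>
let expandRows (text, number) = Seq.replicate number text

// Sample rows standing in for the output of createTupleWithCount.
let rows = [ ("some text once", 1); ("some text thrice", 3) ]

let expanded =
    rows
    |> Seq.collect expandRows
    // Assumption: every output line should carry a count of 1.
    |> Seq.map (fun text -> sprintf "%s,1" text)
    |> Seq.toList
```

With these sample rows, expanded contains "some text once,1" once and "some text thrice,1" three times.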
In fact, the only work performed by this version of expandRows is to 'unpack' a tuple and compose its values into curried form.
While F# doesn't come with such a generic function in its core library, you can easily define it (and other similarly useful functions):
module Tuple2 =
    let curry f x y = f (x, y)
    let uncurry f (x, y) = f x y
    let swap (x, y) = (y, x)
This would enable you to compose your pipeline from well-known functional building blocks:
File.ReadLines(inFile)
|> Seq.map createTupleWithCount
|> Seq.collect (Tuple2.swap >> Tuple2.uncurry Seq.replicate)
|> Seq.iter outFile.WriteLine
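To confirm that the composed form behaves like the named expandRows, here is a small sketch. The binding name composed and its type annotation are additions for this example; the annotation pins the type to strings so the point-free value is not left generic.

```fsharp
module Tuple2 =
    let curry f x y = f (x, y)
    let uncurry f (x, y) = f x y
    let swap (x, y) = (y, x)

// The named version from earlier in the answer.
let expandRows (text, number) = Seq.replicate number text

// swap turns (text, number) into (number, text);
// uncurry then feeds the pair to Seq.replicate as two arguments.
let composed : string * int -> seq<string> =
    Tuple2.swap >> Tuple2.uncurry Seq.replicate

let row = ("some text twice", 2)
let viaNamed = expandRows row |> Seq.toList
let viaComposed = composed row |> Seq.toList
```

Both viaNamed and viaComposed evaluate to ["some text twice"; "some text twice"].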