I'm trying to parse a large-ish CSV file in J, and here's the line-splitting routine that I came up with:
splitlines =: 3 : 0
NB. y is the input string
nl_positions =. (y = (10 { a.)) NB. 1 if the character in that position is a newline, 0 otherwise
nl_idx =. (# i.@#) nl_positions NB. A list of newline indexes in the input string
prev_idx =. (# nl_idx) {. 0 , nl_idx NB. The list above, shifted one position to the right, with 0 as the first element
result =. ''
for_i. nl_idx do. NB. For each newline
to_drop =. i_index { prev_idx NB. The number of characters from the start of the string to skip
to_take =. i - to_drop NB. The number of characters in the current line
result =. result , < (to_take {. to_drop }. y) NB. Take the current line, box it and add to the result
end.
)
It's really slow, though. The performance monitor shows that line 8 takes the longest, likely because of all the memory allocation when dropping and taking elements and extending the result list:
Time (seconds)
┌────────┬────────┬─────┬─────────────────────────────────────────┐
│all     │here    │rep  │splitlines                               │
├────────┼────────┼─────┼─────────────────────────────────────────┤
│0.000011│0.000011│    1│monad                                    │
│0.003776│0.003776│    1│[1] nl_positions=.(y=(10{a.))            │
│0.012429│0.012429│    1│[2] nl_idx=.(#i.@#)nl_positions          │
│0.000144│0.000144│    1│[3] prev_idx =.(#nl_idx){.0,nl_idx       │
│0.000002│0.000002│    1│[4] result=.''                           │
│0.027566│0.027566│    1│[5] for_i. nl_idx do.                    │
│0.940466│0.940466│20641│[6] to_drop=.i_index{prev_idx            │
│0.011238│0.011238│20641│[7] to_take=.i-to_drop                   │
│4.310495│4.310495│20641│[8] result=.result,<(to_take{.to_drop}.y)│
│0.006926│0.006926│20641│[9] end.                                 │
│5.313052│5.313052│    1│total monad                              │
└────────┴────────┴─────┴─────────────────────────────────────────┘
Is there a better way to do this? I'm looking for a way to replace the for loop with one single array instruction.
If I understand correctly, you currently just want to split a string containing multiple lines into separate lines. (I imagine splitting the lines into fields will be the next step at some later stage?)
The key primitive that does the heavy lifting for most of what you are wanting to do is cut (;.). For example:
<;._2 InputString NB. box each segment terminated by the last character in the string
<;._1 InputString NB. box each segment of InputString starting with the first character in the string
cut;._2 InputString NB. for each segment terminated by the last character, box the sub-strings separated by 1 or more spaces
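As a minimal sketch of how cut can stand in for the explicit loop in the question: the sample string text and the name splitlines2 are made up for illustration, and the input is assumed to use LF as its line terminator.

```j
NB. Hypothetical sample input; note that it ends with LF.
text =: 'first line', LF, 'second line', LF, 'third', LF

NB. <;._2 treats the last character (LF here) as the delimiter:
NB. each LF-terminated segment is boxed and the LF itself is dropped.
lines =: <;._2 text

# lines          NB. 3
> 1 { lines      NB. second line

NB. Illustrative replacement for the loop (the name is an assumption).
NB. (, LF -. {:) appends LF only if the input does not already end with one.
splitlines2 =: <;._2 @ (, LF -. {:)
```

With this sketch, splitlines2 text and splitlines2 applied to the same text without its final LF should give the same three boxes, and no per-line append to a growing result is needed.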
Other related resources that you may find useful are splitstring, freads, and the tables/dsv and tables/csv addons. Both freads and splitstring are available in the standard library (post J6).
'b' freads 'myfile.txt' NB. returns contents of myfile.txt boxed by the last character (equivalent to <;._2 freads 'myfile.txt')
'","' splitstring InputString NB. boxed sub-strings of input string delimited by left argument
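To make the splitstring call above concrete, here is a small sketch on a made-up record (the names rec and parts are hypothetical). It assumes every field is quoted and that the three-character delimiter "," never occurs inside a field.

```j
NB. Hypothetical record with quoted, comma-separated fields.
rec =: '"alpha","beta","gamma"'

NB. Split on the three-character delimiter: quote, comma, quote.
parts =: '","' splitstring rec

# parts          NB. 3
> 1 { parts      NB. beta
```

Note that the opening quote of the first field and the closing quote of the last field are not part of any delimiter, so they stay in the result and would need to be trimmed separately.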
The tables/dsv and tables/csv addons can be installed using the Package Manager. Once installed they can be used to split lines and fields within lines as follows:
require 'tables/csv'
readcsv 'myfile.csv'
',' readdsv 'myfile.txt'
TAB readdsv 'myfile.txt'
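For very simple data, lines and fields can also be split with cut alone. The following is only a sketch under strong assumptions: the names csv and table are hypothetical, the text ends with LF, every line has the same number of fields, and no field is quoted or contains a comma. For real CSV files the addons above are likely the safer choice.

```j
NB. Hypothetical two-line CSV text, ending with LF, with no quoted fields.
csv =: 'a,b,c', LF, '1,2,3', LF

NB. Inner verb: prefix a line with a comma so <;._1 can use it as the delimiter.
NB. Outer ;._2 applies that verb to every LF-terminated line.
table =: (<;._1 @ (',' , ])) ;._2 csv

$ table          NB. 2 3
```

The result is a 2 by 3 table of boxed strings, one row per line and one column per field.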