
How to efficiently split a string into lines in J?

Tags:

j

I'm trying to parse a large-ish CSV file in J, and here's the line-splitting routine that I came up with:

splitlines =: 3 : 0
                                     NB. y is the input string
nl_positions =. (y = (10 { a.))      NB. 1 if the character in that position is a newline, 0 otherwise
nl_idx =. (# i.@#) nl_positions      NB. A list of newline indexes in the input string
prev_idx =. (# nl_idx) {. 0 , nl_idx NB. The list above, shifted one position to the right, with 0 as the first element
result =. ''
for_i. nl_idx do.                                  NB. For each newline
    to_drop =. i_index { prev_idx                  NB. The number of characters from the start of the string to skip
    to_take =. i - to_drop                         NB. The number of characters in the current line
    result =. result , < (to_take {. to_drop }. y) NB. Take the current line, box it and add to the result
end.
)

It's really slow, though. The performance monitor shows that line 8 takes the longest, likely because of all the memory allocation when dropping and taking elements and extending the result list:

 Time (seconds)
┌────────┬────────┬─────┬─────────────────────────────────────────┐
│all     │here    │rep  │splitlines                               │
├────────┼────────┼─────┼─────────────────────────────────────────┤
│0.000011│0.000011│    1│monad                                    │
│0.003776│0.003776│    1│[1] nl_positions=.(y=(10{a.))            │
│0.012429│0.012429│    1│[2] nl_idx=.(#i.@#)nl_positions          │
│0.000144│0.000144│    1│[3] prev_idx =.(#nl_idx){.0,nl_idx       │
│0.000002│0.000002│    1│[4] result=.''                           │
│0.027566│0.027566│    1│[5] for_i. nl_idx do.                    │
│0.940466│0.940466│20641│[6] to_drop=.i_index{prev_idx            │
│0.011238│0.011238│20641│[7] to_take=.i-to_drop                   │
│4.310495│4.310495│20641│[8] result=.result,<(to_take{.to_drop}.y)│
│0.006926│0.006926│20641│[9] end.                                 │
│5.313052│5.313052│    1│total monad                              │
└────────┴────────┴─────┴─────────────────────────────────────────┘

Is there a better way to do this? I'm looking for a way to:

  1. Slice a list without memory allocation
  2. Maybe replace the whole for loop with one single array instruction
Mihai asked Apr 27 '26 06:04



1 Answer

If I understand correctly, for now you just want to split a string containing multiple lines into its separate lines. (I imagine splitting the lines into fields will be the next step.)

The key primitive that does the heavy lifting for most of what you are wanting to do is cut (;.). For example:

   <;._2 InputString   NB. box each segment terminated by the last character in the string
   <;._1 InputString   NB. box each segment of InputString starting with the first character in the string
   cut;._2 InputString NB. for each segment terminated by the last character, box the words separated by 1 or more spaces
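
As a minimal sketch of how cut could replace the whole loop in the question, assuming the input uses LF line endings: the sample text below is made up for illustration, and this splitlines is a hypothetical one-line replacement, not the original definition. LF is the standard library's name for 10 { a.

```j
NB. Hypothetical sample input; every line is LF-terminated
text =: 'a,b,c', LF, 'd,e,f', LF, 'g,h,i', LF

NB. One box per line; the LF terminators are dropped
lines =: <;._2 text

NB. Hypothetical replacement for the explicit loop: append an LF
NB. only when the last character is not already one, then cut
splitlines =: 3 : '<;._2 y , LF #~ LF ~: {: y'

NB. Naive field split, assuming no quoted fields contain commas:
NB. prefix each line with a comma and cut on that first character
fields =: (<;._1@(','&,)) each lines
```

Because ;._2 uses the last character as the delimiter, the guard in splitlines matters when the final line has no trailing newline; without it the last character of the data would be treated as the delimiter.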

Other resources that you may find useful are splitstring, freads, and the tables/dsv and tables/csv addons. freads and splitstring are both available in the standard library (post J6).

   'b' freads 'myfile.txt'  NB. returns contents of myfile.txt boxed by the last character (equivalent to <;._2 freads 'myfile.txt')
   '","' splitstring InputString  NB. boxed sub-strings of input string delimited by left argument
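
For a single-character delimiter, splitstring and cut give the same boxes. A small sketch, using a made-up line for illustration:

```j
line =: 'a,b,c'               NB. hypothetical sample line

viaSplit =: ',' splitstring line   NB. three boxes: a, b and c
viaCut   =: <;._1 ',', line        NB. same boxes, via cut on a prefixed comma
```

splitstring is the more convenient choice when the delimiter is longer than one character, such as the '","' in the example above.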

The tables/dsv and tables/csv addons can be installed using the Package Manager. Once installed they can be used to split lines and fields within lines as follows:

   require 'tables/csv'
   readcsv 'myfile.csv'
   ',' readdsv 'myfile.txt'
   TAB readdsv 'myfile.txt'
Tikkanz answered May 06 '26 22:05
