
Converting lines in chunks into tab delimited

Tags: newline, csv, sed, awk

I have the following lines in 2 chunks (in reality there are about 10K of them). In this example each chunk contains 3 lines. The chunks are separated by an empty line, so they are like "paragraphs".

xox
91-233
chicago

koko
121-111
alabama

I want to turn it into tab-delimited lines, like so:

xox  91-233  chicago
koko 121-111 alabama

How can I do that?

I tried tr "\n" "\t", but it doesn't do what I want.

neversaint asked Dec 02 '22 14:12



2 Answers

$ awk -F'\n' '{$1=$1} 1' RS='\n\n' OFS='\t' file
xox     91-233  chicago
koko    121-111 alabama 

How it works

Awk divides input into records and it divides each record into fields.

  • -F'\n'

    This tells awk to use a newline as the field separator.

  • $1=$1

    This tells awk to assign the first field to the first field. While this seemingly does nothing, it causes awk to treat the record as changed and rebuild it. As a consequence, the output is printed using our assigned value for OFS, the output field separator.

  • 1

    This is awk's cryptic shorthand for "print the record": a pattern that is always true, with no action, triggers the default action, which is print.

  • RS='\n\n'

    This tells awk to treat two consecutive newlines as the record separator. (A multi-character RS works in gawk and mawk, but POSIX leaves its behavior unspecified.)

  • OFS='\t'

    This tells awk to use a tab as the field separator on output.
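To see the effect of $1=$1 in isolation, here is a minimal sketch on a single-line record. The sample line and the shell variable names (unchanged, rebuilt) are made up for illustration and use the default whitespace field splitting rather than -F'\n':

```shell
# Without any field assignment, awk prints the record exactly as it was read,
# so OFS is never applied.
unchanged=$(printf 'xox 91-233 chicago\n' | awk -v OFS='\t' '1')

# Assigning to a field forces awk to rebuild the record, joining fields with OFS.
rebuilt=$(printf 'xox 91-233 chicago\n' | awk -v OFS='\t' '{$1=$1} 1')

printf '%s\n' "$unchanged"
printf '%s\n' "$rebuilt"
```

The first line printed still contains spaces; the second has tabs between the three fields.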

John1024 answered Dec 06 '22 11:12



This answer offers the following:
* It works with blocks of nonempty lines of any size, separated by any number of empty lines; John1024's helpful answer (which is similar and came first) works with blocks of lines separated by exactly one empty line.
* It explains the awk command used in detail.

A more idiomatic (POSIX-compliant) awk solution:

awk -v RS= -F '\n' -v OFS='\t' '$1=$1""' file
  • -v RS= tells awk to operate in paragraph mode: consider each run of nonempty lines a single record; RS is the input record separator.

    • Note: The implication is that this solution considers one or more empty lines as separating paragraphs (line blocks); empty means: no line-internal characters at all, not even whitespace.
  • -F '\n' tells awk to consider each line of an input paragraph its own field (breaks the multiline input record into fields by lines); -F sets FS, the input field separator.

  • -v OFS='\t' tells awk to separate fields with \t (a tab character) on output; OFS is the output field separator.

  • $1=$1"" looks like a no-op, but, due to assigning to field variable $1 (the record's first field), tells awk to rebuild the input record, using OFS as the field separator, thereby effectively replacing the \n separators with \t.

    • The trailing "" is to guard against the edge case of the first line in a paragraph evaluating to 0 in a numeric context; appending "" forces treatment as a string, and any nonempty string - even if it contains "0" - is considered true in a Boolean context - see below.
  • Since $1 is by definition nonempty and an assignment in awk evaluates to the assigned value, $1=$1"" yields a nonempty string. The assignment is used as a pattern (a condition) with no associated action block ({ ... }), and a nonempty string is considered true, so awk performs the implied action: it prints the rebuilt input record, which now consists of the input lines separated by tabs and terminated by the default output record separator (ORS), \n.
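The points above can be tried out with a small sketch. The sample input is hypothetical (it is not the asker's data): it uses a varying number of empty lines between blocks and a block whose first line is just 0, and the variable names input, robust, and naive are made up for illustration:

```shell
# Hypothetical sample: three blocks, separated by one empty line and by
# three empty lines; the second block starts with a line that is just "0".
input=$(printf 'xox\n91-233\nchicago\n\n0\n121-111\nalabama\n\n\n\nlast\n1-2\nohio')

# Paragraph mode (RS=) with the string-forcing "" appended to $1.
robust=$(printf '%s\n' "$input" | awk -v RS= -F '\n' -v OFS='\t' '$1=$1""')

# Same command without "": for the second block the pattern's value is a
# numeric-looking 0, so that block is liable to be silently dropped.
naive=$(printf '%s\n' "$input" | awk -v RS= -F '\n' -v OFS='\t' '$1=$1')

printf 'with "":\n%s\n\nwithout "":\n%s\n' "$robust" "$naive"
```

Comparing the two outputs shows whether your awk skips the block that starts with 0 when the "" is left out.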

mklement0 answered Dec 06 '22 11:12
