
Merge two columns of a text file in Linux

Tags: linux, bash

I have a text file with several columns of text and values. This structure:

CAR       38
     DOG  42
CAT       89
CAR       23
     APE  18

If column 1 has a string, column 2 doesn't (or it's actually an empty string). And the other way around: if column 1 is empty, column 2 has a string. In other words, the "object" (CAR, CAT, DOG etc.) occurs in either column 1 or column 2, but never both.

I'm looking for an efficient way to consolidate column 1 and 2 so that the file looks like this instead:

CAR  38
DOG  42
CAT  89
CAR  23
APE  18

I can do this in a Bash script by using while and if, but I'm sure there is a simpler way of doing it. Can someone help?

Cheers! Z

asked Apr 09 '15 19:04 by Zooma


2 Answers

Try this:

column -t file

Output:

CAR  38
DOG  42
CAT  89
CAR  23
APE  18

answered Oct 04 '22 15:10 by Cyrus


Note: If:

  • you're looking for output with auto-sized, left-aligned fixed-width columns (the longest field value determines the width, with shorter values getting right-padded with spaces)
  • and are happy with two spaces as the column separator
  • and are using files small enough to read into memory as a whole,

use Cyrus's simpler, column-based answer.

See the end of this answer for how the column-based approach compares to the awk-based approach in terms of performance and resource consumption.


awk is your friend here:

awk -v OFS='  ' '{ print $1, $2 }' file
  • awk splits lines into fields by runs of whitespace by default, ignoring leading and trailing whitespace, so, with your input, lines such as "CAR       38" and "     DOG  42" are parsed the same way (CAR and DOG become field 1, $1, and 38 and 42 become field 2, $2).
  • -v OFS='  ' sets the output field separator to two spaces (the default is a single space); note that output values are not padded, so the columns will not line up if the values vary in width.
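
To see the field splitting in action, here is a small self-contained run against the sample data from the question. The temporary file and the variable names sample and merged are only there for illustration:

```shell
# Recreate the sample input from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' \
  'CAR       38' \
  '     DOG  42' \
  'CAT       89' \
  'CAR       23' \
  '     APE  18' > "$sample"

# Default field splitting skips leading whitespace, so $1 is always the
# name and $2 is always the number, whichever column the name started in.
merged=$(awk -v OFS='  ' '{ print $1, $2 }' "$sample")
printf '%s\n' "$merged"

rm -f "$sample"
```

With this input every name is 3 characters wide, so the two-space separator happens to produce aligned columns; with names of different lengths it would not.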

To create aligned output with fields of varying width, use Awk's printf statement, which gives you more control over the output; for instance, the following outputs a 10-character-wide, left-aligned 1st column and a 2-character-wide, right-aligned 2nd column:

awk '{ printf "%-10s  %2s\n", $1, $2 }' file
  • Note that the column widths must be known in advance.
  • By contrast, column -t conveniently determines the column widths automatically, by parsing all data first, but that has performance and resource-consumption implications; see below.
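
If the widths are not known in advance, one possible middle ground (not part of the original answer, just a sketch) is to let awk read the file twice: a first pass records the longest value per field, a second pass prints with a format string built from those widths. The function name align_two_pass and the demo data are made up for illustration, and the approach assumes the input is a regular file that can be read twice, not a pipe:

```shell
# align_two_pass FILE: hypothetical helper that pads field 1 (left-aligned)
# and field 2 (right-aligned) to the widest value found in FILE.
align_two_pass() {
  awk '
    NR == FNR {                          # first pass: track maximum widths
      if (length($1) > w1) w1 = length($1)
      if (length($2) > w2) w2 = length($2)
      next
    }
    FNR == 1 {                           # start of second pass: build format
      fmt = "%-" w1 "s  %" w2 "s\n"
    }
    { printf fmt, $1, $2 }               # second pass: print padded fields
  ' "$1" "$1"
}

# Demo input with names and numbers of different lengths (made-up data).
demo=$(mktemp)
printf '%s\n' 'ELEPHANT       7' '          CAT  123' > "$demo"
aligned=$(align_two_pass "$demo")
printf '%s\n' "$aligned"
rm -f "$demo"
```

This keeps memory use constant at the cost of reading the input twice.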

Performance / resource-consumption comparison between the column -t and the Awk approach:

  • column -t needs to analyze all input data up front, in a first pass, so as to be able to determine the maximum input column widths; from what I can tell, it does so by reading the input as a whole into memory first, which can be problematic with large input files.
  • By contrast, the Awk solution reads lines one by one - but relies on knowing the column widths ahead of time.

Thus,

  • column -t will consume memory proportional to the input size, whereas awk will use a constant amount of memory.
  • column -t is typically slower than Awk, though this depends on the Awk implementation used: mawk is much faster than column -t, gawk is a little faster, and BSD awk is actually slower(!). These results are based on a 10-million-line input file, with the commands run on OS X 10.10.2 and Ubuntu 14.04.
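
To repeat a comparison like this on your own machine, something along the following lines could serve as a starting point. It is a rough sketch, not the script used for the results above: the BENCH_LINES variable, the elapsed helper and the generated data are all assumptions, and timing is only to whole seconds. Set BENCH_LINES=10000000 to approximate the input size mentioned above.

```shell
# Number of lines to generate (hypothetical knob; override via BENCH_LINES).
bench_lines=${BENCH_LINES:-200000}
bench_file=$(mktemp)

# Generate input that alternates between the two layouts from the question.
awk -v n="$bench_lines" 'BEGIN {
  for (i = 1; i <= n; i++)
    if (i % 2) printf "CAR       %d\n", i % 100
    else       printf "     DOG  %d\n", i % 100
}' > "$bench_file"

# elapsed CMD...: run CMD with the benchmark file on stdin, discard its
# output, and report the elapsed wall-clock time in whole seconds.
elapsed() {
  start=$(date +%s)
  "$@" < "$bench_file" > /dev/null
  end=$(date +%s)
  printf '%ds: %s\n' "$((end - start))" "$*"
}

elapsed awk '{ printf "%-10s  %2s\n", $1, $2 }'
# column is not installed everywhere, so only time it if it is available.
if command -v column > /dev/null 2>&1; then
  elapsed column -t
fi

# Sanity check: the awk command emits exactly one output line per input line.
awk_line_count=$(awk '{ printf "%-10s  %2s\n", $1, $2 }' "$bench_file" | awk 'END { print NR }')
rm -f "$bench_file"
```

Which awk binary runs depends on what awk resolves to on your system; to compare implementations, call mawk or gawk explicitly in place of awk.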

answered Oct 04 '22 15:10 by mklement0