
Combine several files, with a separator, into one file

I have several (~300,000) files of individual JSON objects that I want to combine into a single file that is a JSON array. How can I do this on linux assuming they are all in the location "~/data_files"?

FileA

{
  name: "Test",
  age: 23
}

FileB

{
  name: "Foo",
  age: 5
}

FileC

{
  name: "Bar",
  age: 5
}

Example output (begins and ends with square brackets, with commas added between objects):

[
    {
      name: "Test",
      age: 23
    },
    {
      name: "Foo",
      age: 5
    },
    {
      name: "Bar",
      age: 5
    }
]

What I've tried:

I know I can use cat to combine several files, but I'm not sure yet how to do it for all files in a directory. I'm also trying to figure out how to put the , between the files I'm concatenating; I haven't found a command for that yet.

Don P asked May 07 '16 22:05

3 Answers

Since you seem a little new to unix I'll try to give you a solution that is simple and doesn't introduce too many new concepts. I'll leave clever and novel to the other posters. This solution will be very efficient since all I'm doing is streaming files into files.

To start with, we create a new file in our home directory containing the opening square bracket.
echo "[" > ~/tmp.json

Now we loop through all the files in your data_files directory and append them to our new file. The >> adds to what's already there; a single > would overwrite the file each time. The echo adds a comma after cat has finished outputting each file.
for i in ~/data_files/*; do cat "$i"; echo ","; done >> ~/tmp.json
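If you want to see the loop in action before running it on your real data, here is a minimal sketch using made-up throwaway files under /tmp (the paths are illustrative, not from the question):

```shell
# Mini-demo of the append loop on two throwaway files
mkdir -p /tmp/demo_files
printf 'one\n' > /tmp/demo_files/a
printf 'two\n' > /tmp/demo_files/b
echo "[" > /tmp/demo.json
# each file's contents is followed by a "," on its own line
for i in /tmp/demo_files/*; do cat "$i"; echo ","; done >> /tmp/demo.json
cat /tmp/demo.json
```

This prints the opening bracket, then each file's contents with a comma after it, including a stray comma after the last file, which is exactly the leftover the next step cleans up.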

So now we have your 300k files in one file called tmp.json, with each entry separated by a comma, but the last line of the file is also a comma and that is not what we want.
The sed command below behaves like cat except that '$d' tells it to omit the last line of the file.
So we create a new file with all but the last line of our temporary file.
sed '$d' ~/tmp.json > ~/finished.json
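A quick sanity check of that sed address, on made-up input:

```shell
# '$' addresses the last line and 'd' deletes it, so only the final line is dropped
printf 'first\nsecond\nlast\n' | sed '$d'
```

This prints first and second but not last, which is why it removes our trailing comma.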

We need to close our square bracket:
echo "]" >> ~/finished.json

And finally we delete our temporary file:
rm ~/tmp.json

And we are done.

[
{
    name: "Test",
    age: 23
}
,
{
    name: "Foo",
    age: 5
}
,
{
    name: "Bar",
    age: 5
}
]

A quick glance at this post about pretty printing json will point you at a command line tool that will take your finished.json file and turn it into exactly the output you asked for.
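One such pretty-printer ships with Python. Note this assumes the combined file is valid JSON, so keys like name in the question's samples would need to be quoted; the input below is a hypothetical valid-JSON version of the sample:

```shell
# python3's built-in pretty-printer, fed a quoted-key version of the sample object
echo '[{"name": "Test", "age": 23}, {"name": "Foo", "age": 5}]' | python3 -m json.tool
```

For a real run you would replace the echo with cat ~/finished.json.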

Niall Cosgrove answered Sep 30 '22 05:09


A simple for loop and a couple of sed commands will do:

$ echo "[" > all; 
  for f in file{A,B,C}; 
  do 
     sed 's/^/\t/;$s/$/,/' "$f" >> all; 
  done; 
  sed -i '$s/,/\n]/' all

$ cat all
[
 {
   name: "Test",
   age: 23
 },
 {
   name: "Foo",
   age: 5
 },
 {
   name: "Bar",
   age: 5
 }
]

Or the same thing written to stdout:

$ echo "["; for f in file{A,B,C}; do sed 's/^/\t/;$s/$/,/' "$f"; done |
sed '$s/,/\n]/'

To run this for all files in the directory, change file{A,B,C} to *
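To see what the two sed expressions do in isolation, here is a sketch on a made-up two-line object:

```shell
# s/^/\t/ prefixes every line with a tab (indentation);
# $s/$/,/ appends a comma on the last line only
printf '{\nx\n}\n' | sed 's/^/\t/;$s/$/,/'
```

Every line comes out tab-indented, and only the closing brace gains a trailing comma, which is what lets the final sed turn the very last comma into the closing ].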

karakfa answered Sep 30 '22 06:09


This script should work even with 300K+ files. It is also faster than the sed solutions, since the input files are read but never modified.

#!/bin/sh
# Write the file list once, then emit every file but the last followed by a
# comma, and the last file without one, so no cleanup pass is needed.
tmp="/dev/shm/${USER}.find.tmp"
out='all.json'
find . -maxdepth 1 -name file\* > "${tmp}"
echo '[' > "${out}"
for f in $(head -n -1 "${tmp}")   # every file except the last
do
  cat "${f}" >> "${out}"
  echo ',' >> "${out}"
done
f=$(tail -n 1 "${tmp}")           # the last file gets no trailing comma
cat "${f}" >> "${out}"
echo ']' >> "${out}"
rm -f -- "${tmp}"
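The head/tail split the script relies on can be checked on its own. Note that head -n -1 (a negative count) is a GNU coreutils extension, so this assumes GNU head; the file list below is made up for illustration:

```shell
# Build a made-up file list, then split it into "all but last" and "last"
printf 'fileA\nfileB\nfileC\n' > /tmp/list.txt
head -n -1 /tmp/list.txt   # all lines except the last: fileA, fileB
tail -n 1 /tmp/list.txt    # only the last line: fileC
```

That split is what lets the loop add a comma after every file except the final one.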
Andrey answered Sep 30 '22 04:09