I have several (~300,000) files of individual JSON objects that I want to combine into a single file that is a JSON array. How can I do this on linux assuming they are all in the location "~/data_files"?
FileA
{
name: "Test",
age: 23
}
FileB
{
name: "Foo",
age: 5
}
FileC
{
name: "Bar",
age: 5
}
Example Output: (begins and ends with brackets, and added commas between objects)
[
{
name: "Test",
age: 23
},
{
name: "Foo",
age: 5
},
{
name: "Bar",
age: 5
}
]
What I've tried:
I know I can use cat to combine a bunch of files, but I'm not sure yet how to do it for all files in a directory; I'm trying to figure that out. I'm also trying to figure out how to put the , between the files I'm concatenating; I haven't seen a command for it yet.
Since you seem a little new to unix I'll try to give you a solution that is simple and doesn't introduce too many new concepts. I'll leave clever and novel to the other posters. This solution will be very efficient since all I'm doing is streaming files into files.
To start with, we create a new file in our home directory with an opening square bracket in it.
echo "[" > ~/tmp.json
Now we loop through all the files in your data_files directory and append them to our new file. The >> adds the loop's output to what's already there; if you used a > instead, the file would be truncated first and the opening bracket would be lost. The echo adds a comma after cat has finished outputting each file.
for i in ~/data_files/*; do cat "$i"; echo ","; done >> ~/tmp.json
So now we have your 300k files in one file called tmp.json, with each entry separated by a comma, but the last line of the file is also a comma, and that is not what we want.
The sed command below behaves like cat, except that '$d' tells it to omit the last line of the file. So we create a new file with all but the last line of our temporary file.
sed '$d' ~/tmp.json > ~/finished.json
We need to close our square bracket.
echo "]" >> ~/finished.json
And finally we delete our temporary file
rm ~/tmp.json
And we are done.
[
{
name: "Test",
age: 23
}
,
{
name: "Foo",
age: 5
}
,
{
name: "Bar",
age: 5
}
]
A quick glance at this post about pretty printing json will point you at a command line tool that will take your finished.json file and turn it into exactly the output you asked for.
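As a variation on the steps above, the trailing comma can be avoided altogether by writing the separator before every file except the first, which removes the need for the temporary file and the sed step. This is only a minimal sketch: combine_json is a made-up helper name, and it assumes every regular file in the directory is one of the JSON objects to include.

```shell
#!/bin/sh
# combine_json DIR: print every regular file in DIR as one JSON array on stdout.
# (combine_json is a hypothetical helper name, not something from the answer above.)
combine_json() (
    first=1
    printf '[\n'
    for f in "$1"/*; do
        [ -f "$f" ] || continue      # skip subdirectories and an unmatched glob
        if [ "$first" -eq 1 ]; then
            first=0
        else
            printf ',\n'             # separator goes before every file but the first
        fi
        cat "$f"
    done
    printf ']\n'
)

# Example: combine_json ~/data_files > ~/finished.json
```

Because the function writes to stdout, the caller decides where the result goes, and the loop never builds a long argument list, so the number of files should not matter.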
A simple for loop and a couple of sed commands will do:
$ echo "[" > all;
for f in file{A,B,C};
do
sed 's/^/\t/;$s/$/,/' "$f" >> all;
done;
sed -i '$s/,/\n]/' all
$ cat all
[
{
name: "Test",
age: 23
},
{
name: "Foo",
age: 5
},
{
name: "Bar",
age: 5
}
]
or the same to stdout
$ echo "["; for f in file{A,B,C}; do sed 's/^/\t/;$s/$/,/' "$f"; done |
sed '$s/,/\n]/'
To run it for all files in the directory, change file{A,B,C} to * (and write the output file somewhere else, so that the glob does not pick it up as well).
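To see what the sed expressions in this answer actually do, they can be tried on a single object in isolation. The sketch below wraps each one in a small function that reads stdin; the function names are invented for illustration, and the \t and \n escapes in the replacement text assume GNU sed, as found on most Linux systems.

```shell
#!/bin/sh
# indent_with_comma: the per-file sed step from the loop, reading stdin.
# (hypothetical wrapper name; \t in the replacement assumes GNU sed)
indent_with_comma() {
    # s/^/\t/  : insert a tab at the start of every line
    # $s/$/,/  : on the last line only ($), append a comma
    sed 's/^/\t/;$s/$/,/'
}

# close_array: the final sed step, reading stdin.
# (hypothetical wrapper name; \n in the replacement assumes GNU sed)
close_array() {
    # $s/,/\n]/ : on the last line only, replace the first comma
    #             with a newline followed by the closing bracket
    sed '$s/,/\n]/'
}

# Example: printf '{\nname: "Test"\n}\n' | indent_with_comma
```

Note that the closing step replaces the first comma on the last line, so it relies on the last line of the last file being just the closing brace. If that line could contain other commas (for example a whole object on one line), anchoring the pattern as ,$ would be the stricter choice.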
This script should work even if the number of files is 300K+. It should also be faster than the sed solution, because the input files are copied as-is with cat rather than passed through sed.
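#!/bin/sh
# Temporary list of input files, kept in RAM-backed /dev/shm (Linux-specific).
tmp="/dev/shm/${USER}.find.tmp"
out='all.json'
# Collect the input files, one path per line.
find . -maxdepth 1 -name 'file*' > "${tmp}"
echo '[' > "${out}"
# Every file except the last is followed by a comma.
# head -n -1 is a GNU extension; file names must not contain whitespace.
for f in $(head -n -1 "${tmp}")
do
    cat "${f}" >> "${out}"
    echo ',' >> "${out}"
done
# The last file gets no trailing comma.
f=$(tail -n 1 "${tmp}")
cat "${f}" >> "${out}"
echo ']' >> "${out}"
rm -f -- "${tmp}"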
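The script above splits the file list on whitespace, so a name containing a space would break it. One possible way around that, sketched here under the assumption that file names contain no newline characters, is to read the list from find line by line instead of going through a temporary file. The function name join_found and its two parameters are made up for this example.

```shell
#!/bin/sh
# join_found DIR PATTERN: stream the files in DIR whose names match PATTERN
# into one JSON array on stdout.
# (join_found is a hypothetical name; assumes no newlines in file names.)
join_found() {
    printf '[\n'
    find "$1" -maxdepth 1 -type f -name "$2" | sort | {
        first=1
        while IFS= read -r f; do
            [ "$first" -eq 1 ] || echo ','   # comma before every file but the first
            first=0
            cat "$f"
        done
    }
    printf ']\n'
}

# Example: join_found . 'file*' > all.json
```

The sort step is optional; it only makes the order of the objects in the array predictable, since find does not guarantee any particular order.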