I have a large data file in text format and I want to convert it to csv by specifying each column length.
number of columns = 5
column length
[4 2 5 1 1]
sample observations:
aasdfh9013512
ajshdj 2445df
Expected Output
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
A flat (or fixed width) file is a plain text file where each field value is the same width and padded with spaces. It is much easier to read than CSV files but takes up more space than CSV. We examine the CSV value and remember the longest width for each field. We then output each field padded to the longest width.
Go to File > Save As. Click Browse. In the Save As dialog box, under Save as type box, choose the text file format for the worksheet; for example, click Text (Tab delimited) or CSV (Comma delimited). Note: The different formats support different feature sets.
To efficiently parse fixed width files with Python, we can use the Pandas' read_fwf method. to define the col_specification list with the column specifications for filename. txt. Then we call read.
GNU awk (gawk) supports this directly with FIELDWIDTHS
, e.g.:
gawk '$1=$1' FIELDWIDTHS='4 2 5 1 1' OFS=, infile
Output:
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
Adding a Generic way of handling this(alternative to FIELDSWIDTH option) in awk
(where we need not to harcode sub string positions, this will work as per position nuber provided by user wherever comma needs to be inserted) could be as follows, written and tested in GNU awk
. To use this, we have to define values(like OP showed in samples), position numbers where we need to insert commas, awk
variable name is colLength
give position numbers with space between them.
awk -v colLengh="4 2 5 1 1" '
BEGIN{
num=split(colLengh,arr,OFS)
}
{
j=sum=0
while(++j<=num){
if(length($0)>sum){
sub("^.{"arr[j]+sum"}","&,")
}
sum+=arr[j]+1
}
}
1
' Input_file
Explanation: Simple explanation would be, creating awk
variable named colLengh
where we need to define position numbers wherever we need to insert commas. Then in BEGIN
section creating array arr
which has value of indexes where we need to insert commas in it.
In main program section first of all nullifying variables j
and sum
here. Then running a while
loop from j=1 to till value of j becomes equal to num. In each run substituting from starting of current line(if length of current line is greater than sum else it doesn't make sense to perform substitution to I have put addiotnal check here) everything with everything + ,
as per need. Eg: sub
function will become .{4}
for first time loop runs then it becomes, .{7}
because its 7th position we need to insert comma and so on. So sub
will substitute those many characters from starting to till generated numbers with matched value + ,
. At last in this program mentioning 1
will print edited/non-edited lines.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With