Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

convert a fixed width file from text to csv

I have a large data file in text format and I want to convert it to csv by specifying each column length.

number of columns = 5

column length

[4 2 5 1 1]

sample observations:

aasdfh9013512
ajshdj 2445df

Expected Output

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
like image 949
Ashish Avatar asked Feb 17 '15 12:02

Ashish


People also ask

What is fixed width CSV?

A flat (or fixed width) file is a plain text file where each field value is the same width and padded with spaces. It is much easier to read than CSV files but takes up more space than CSV. We examine the CSV value and remember the longest width for each field. We then output each field padded to the longest width.

How do I convert TXT to CSV?

Go to File > Save As. Click Browse. In the Save As dialog box, under Save as type box, choose the text file format for the worksheet; for example, click Text (Tab delimited) or CSV (Comma delimited). Note: The different formats support different feature sets.

How do I read a fixed width file in Python?

To efficiently parse fixed width files with Python, we can use the Pandas' read_fwf method. to define the col_specification list with the column specifications for filename. txt. Then we call read.


2 Answers

GNU awk (gawk) supports this directly with FIELDWIDTHS, e.g.:

gawk '$1=$1' FIELDWIDTHS='4 2 5 1 1' OFS=, infile

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
like image 89
Thor Avatar answered Oct 17 '22 12:10

Thor


Adding a Generic way of handling this(alternative to FIELDSWIDTH option) in awk(where we need not to harcode sub string positions, this will work as per position nuber provided by user wherever comma needs to be inserted) could be as follows, written and tested in GNU awk. To use this, we have to define values(like OP showed in samples), position numbers where we need to insert commas, awk variable name is colLength give position numbers with space between them.

awk -v colLengh="4 2 5 1 1" '
BEGIN{
  num=split(colLengh,arr,OFS)
}
{
  j=sum=0
  while(++j<=num){
    if(length($0)>sum){
      sub("^.{"arr[j]+sum"}","&,")
    }
    sum+=arr[j]+1
  }
}
1
' Input_file

Explanation: Simple explanation would be, creating awk variable named colLengh where we need to define position numbers wherever we need to insert commas. Then in BEGIN section creating array arr which has value of indexes where we need to insert commas in it.

In main program section first of all nullifying variables j and sum here. Then running a while loop from j=1 to till value of j becomes equal to num. In each run substituting from starting of current line(if length of current line is greater than sum else it doesn't make sense to perform substitution to I have put addiotnal check here) everything with everything + , as per need. Eg: sub function will become .{4} for first time loop runs then it becomes, .{7} because its 7th position we need to insert comma and so on. So sub will substitute those many characters from starting to till generated numbers with matched value + ,. At last in this program mentioning 1 will print edited/non-edited lines.

like image 24
RavinderSingh13 Avatar answered Oct 17 '22 13:10

RavinderSingh13