I have a text file ( huge ) with all numbers separated with a combination of spaces and tabs between them and with comma for decimal and after decimal separation, while the first column is scientific formatted and the next ones are numbers but with commas. I just put the first row here as numbers :
0,0000000E00 -2,7599284 -1,3676726 -1,7231264 -1,0558825 -1,8871096 -3,0763804 -3,2206187 -3,2308111 -2,3147060 -3,9572818 -4,0232415 -4,2180738
the file is so huge that a notepad++ can't process it to convert the "," to "."
So what I do is :
with open(file) as fp:
line = fp.readline()
cnt = 1
while line:
digits=re.findall(r'([\d.:]+)', line)
s=line
s = s.replace('.','').replace(',','.')
number = float(s)
cnt += 1
I tried even to use digits, but that causes to divide the first column in two numbers :
and eventually the error I get when using .replace command. what I would have prefered was to convert the commas to dots regardless of disturbing formats like scientific. I appreciate your help
ValueError: could not convert string to float: ' 00000000E00
\t-29513521 \t-17002219 \t-22375536 \t-14994097
\t-24163610 \t-34076621 \t-31233623 \t-32341597
\t-24724552 \t-42434935 \t-43454237 \t-44885144
\n'
I also put how the input looks like in txt and how I need it in output ( in csv format )
input seems like this :
first line :
between 1st and 2nd column : 3 spaces + 1 Tab
between rest of columns : 6 spaces + 1 Tab
second line and on :
between 1st and 2nd column : 2 spaces + 1 Tab
between rest of columns : 6 spaces + 1 Tab
this is a screen shot of the txt input file : Attention : there is one space in the beginning of each line
and what I want as output is csv file with separated columns with " ; "
You may try reading the entire file into a Python string, and then doing a global replacement of comma to dot:
data = ""
with open('nums.csv', 'r') as file:
data = file.read().replace(',', '.').replace(' ', ';')
with open("nums_out.csv", "w") as out_file:
out_file.write(data)
For a possibly more robust solution, should there exist the possibility that two columns could be separated by multiple whitespace characters, use re.sub
:
data = ""
with open('nums.csv', 'r') as file:
data = file.read().replace(',', '.')
data = re.sub(r'(?<=\n|^)[^\S\r\n]+', '', data)
data = re.sub('(?<=\S)[^\S\r\n]+', ';', data)
If you're working with tabular data in python, you'll want to use the pandas package. It's a large package, so if this is a one-off, the overhead of installing it might not be worth it.
Pandas has a read_csv function that deals with this easily, and the result can be exported to csv:
import pandas as pd
dataframe = pd.read_csv("input.txt", sep="\s+", decimal=",")
dataframe.to_csv("output.csv", sep=";", header=False, index=False)
Note: if your original file has no header, also pass header=None
to the read_csv
function.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With