I'm trying to process a file and remove extraneous information from it. Specifically, I want to remove the bracketed [] blocks: everything from the first [section] line through the closing [] line, including the brackets themselves and all the text inside and between them, while printing everything outside those blocks.
$ cat smb
Hi this is my config file.
Please dont delete it
[homes]
browseable = No
comment = Your Home
create mode = 0640
csc policy = disable
directory mask = 0750
public = No
writeable = Yes
[proj]
browseable = Yes
comment = Project directories
csc policy = disable
path = /proj
public = No
writeable = Yes
[]
This last second line.
End of the line.
Expected output:
Hi this is my config file.
Please dont delete it
This last second line.
End of the line.
$ cat test.py
with open("smb", "r") as file:
    for line in file:
        start = line.find('[')
        end = line.find(']')
        if start != -1 and end != -1:
            result = line[start+1:end]
            print(result)
Output:
$ ./test.py
homes
proj
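With one regex:
import re
with open("smb", "r") as f:
    txt = f.read()
# replace the matched block with a single '\n' so the lines before and after it stay on separate lines
txt = re.sub(r'(\n\[)(.*?)(\[]\n)', '\n', txt, flags=re.DOTALL)
print(txt)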
Regex explanation:
(\n\[) matches a line break followed by a [
(\[]\n) matches [] followed by a line break
(.*?) lazily matches everything between (\n\[) and (\[]\n), so the whole span, markers included, is matched by re.sub
re.DOTALL makes . match line breaks as well, so the match can span multiple lines
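To see what re.DOTALL contributes, here is a small self-contained sketch. The sample string is a shortened stand-in for the smb file (an assumption for illustration, not the full file), and the pattern is the one above without the capture groups, which the substitution does not need.

```python
import re

# Shortened stand-in for the smb file (illustrative only).
sample = (
    "Hi this is my config file.\n"
    "Please dont delete it\n"
    "[homes]\n"
    "browseable = No\n"
    "[proj]\n"
    "path = /proj\n"
    "[]\n"
    "This last second line.\n"
    "End of the line.\n"
)

pattern = r'\n\[.*?\[]\n'

# Without re.DOTALL, '.' stops at line breaks, so the pattern cannot
# stretch from the first '[' line down to the closing '[]' line.
no_flag_match = re.search(pattern, sample)
print(no_flag_match is None)  # prints True

# With re.DOTALL the whole bracketed region matches and is replaced
# by a single line break.
cleaned = re.sub(pattern, '\n', sample, flags=re.DOTALL)
print(cleaned)
```

Note that this pattern relies on the file containing a closing [] line; if that marker is missing, nothing is removed.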
!!! PANDAS UPDATE !!!
The same logic can be carried out with pandas:
import re
import pandas as pd
# read each line of the file (one row -> one line)
# note: newer pandas versions reject sep='\n' with a ValueError; there, read the file with open() as above
txt = pd.read_csv('smb', sep='\n', header=None)
# join all the lines of the file, separating them with '\n'
txt = '\n'.join(txt[0].to_list())
# apply the regex to clean the text (the same as above)
txt = re.sub(r'(\n\[)(.*?)(\[]\n)', '\n', txt, flags=re.DOTALL)
print(txt)
Read the file into a string,
extract = '''Hi this is my config file.
Please dont delete it
[homes]
browseable = No
comment = Your Home
create mode = 0640
csc policy = disable
directory mask = 0750
public = No
writeable = Yes
[proj]
browseable = Yes
comment = Project directories
csc policy = disable
path = /proj
public = No
writeable = Yes
[]
This last second line.
End of the line.
'''.split('\n[')[0]
will give you,
Hi this is my config file.
Please dont delete it
.split('\n[') splits the string at each occurrence of the characters '\n[', and [0] selects the description lines above the first bracket.
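with open("smb", "r") as f:
    extract = f.read()
# everything after the last ']\n' is the trailing text
tail = extract.split(']\n')
# head before the first '\n[' plus the tail after the last ']\n'
extract = extract.split('\n[')[0] + '\n' + tail[-1]
print(extract)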
will read and output,
Hi this is my config file.
Please dont delete it
This last second line.
End of the line.
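The same head-plus-tail idea can also be sketched with str.find and str.rfind, which avoids building the intermediate lists that split creates. The function name strip_bracket_region is made up for this example, and the sketch assumes the file does not start with a [ line and that no line after the closing [] ends in ].

```python
def strip_bracket_region(text: str) -> str:
    """Remove everything from the first '[' line through the last ']' line.

    Hypothetical helper for illustration; assumes the first '[' is not on
    the very first line of the text.
    """
    start = text.find('\n[')   # line break just before the first '[' line
    end = text.rfind(']\n')    # closing bracket of the last marker line
    if start == -1 or end == -1 or end < start:
        return text            # no bracketed region found: leave text as is
    # head (up to the line break) + one line break + tail after ']\n'
    return text[:start] + '\n' + text[end + 2:]


sample = (
    "Hi this is my config file.\n"
    "Please dont delete it\n"
    "[homes]\n"
    "public = No\n"
    "[]\n"
    "This last second line.\n"
    "End of the line.\n"
)
print(strip_bracket_region(sample))
```

With the real file you would pass in the result of f.read() instead of the shortened sample string.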
Since you tagged pandas, let's try that:
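import numpy as np
import pandas as pd
# sep='----' never occurs in the file, so each line lands in a single column;
# a multi-character separator needs the python engine
df = pd.read_csv('smb', sep='----', header=None, engine='python')
# mark rows that start with `[`
s = df[0].str.startswith('[')
# drop the rows from the first to the last `[` line
df = df.drop(np.arange(s.idxmax(), s[::-1].idxmax()+1))
# write to file if needed
df.to_csv('clean.txt', header=False, index=False)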
Output (df):
0
0 Hi this is my config file.
1 Please dont delete it
18 This last second line.
19 End of the line.
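If pandas is not otherwise needed, the same first-marker/last-marker logic can be sketched on a plain list of lines. The function name drop_between_markers is invented for this example, and the list below is a shortened stand-in for the file contents.

```python
def drop_between_markers(lines):
    """Drop every line from the first to the last one starting with '['.

    Hypothetical helper mirroring the idxmax-based drop in plain Python.
    """
    lines = list(lines)
    # indices of all lines that start with '['
    marked = [i for i, line in enumerate(lines) if line.startswith('[')]
    if not marked:
        return lines  # nothing to drop
    first, last = marked[0], marked[-1]
    return lines[:first] + lines[last + 1:]


lines = [
    "Hi this is my config file.",
    "Please dont delete it",
    "[homes]",
    "public = No",
    "[proj]",
    "path = /proj",
    "[]",
    "This last second line.",
    "End of the line.",
]
kept = drop_between_markers(lines)
print("\n".join(kept))
```

For the real file, something like open("smb").read().splitlines() would supply the list of lines.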