
Python: remove square brackets and the extraneous information between them

I'm processing a file and need to remove extraneous information from it. Specifically, I want to remove the bracketed [] blocks, including the text inside the brackets and everything between the first and the last bracket block, and print only what lies outside them.

Below is a sample of my text file:

$ cat smb
Hi this is my config file.
Please dont delete it

[homes]
  browseable                     = No
  comment                        = Your Home
  create mode                    = 0640
  csc policy                     = disable
  directory mask                 = 0750
  public                         = No
  writeable                      = Yes

[proj]
  browseable                     = Yes
  comment                        = Project directories
  csc policy                     = disable
  path                           = /proj
  public                         = No
  writeable                      = Yes

[]

This last second line.
End of the line.

Desired Output:

Hi this is my config file.
Please dont delete it
This last second line.
End of the line.

What I have tried, based on my understanding and research:

$ cat test.py
with open("smb", "r") as file:
  for line in file:
    start = line.find( '[' )
    end = line.find( ']' )
    if start != -1 and end != -1:
      result = line[start+1:end]
      print(result)

Output:

$ ./test.py
   homes
   proj
kulfi asked May 06 '20 15:05


3 Answers

With one regex:

import re

with open("smb", "r") as f: 
    txt = f.read()
    txt = re.sub(r'(\n\[)(.*?)(\[]\n)', '', txt, flags=re.DOTALL)

print(txt)

Regex explanation:

(\n\[) matches a line break followed by a [

(\[]\n) matches [] followed by a line break

(.*?) lazily matches everything between (\n\[) and (\[]\n), so the whole span, delimiters included, is replaced with an empty string

re.DOTALL makes . match line breaks as well, which is what lets (.*?) span multiple lines
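
To see the pattern at work without touching a file, here is a minimal sketch that applies it to an in-memory string. The shortened sample text and the name strip_sections are made up for illustration; only the pattern, re.sub and re.DOTALL come from the answer itself.

```python
import re

# Hypothetical, shortened stand-in for the contents of the smb file.
SAMPLE = (
    "Hi this is my config file.\n"
    "Please dont delete it\n"
    "\n"
    "[homes]\n"
    "  public = No\n"
    "\n"
    "[proj]\n"
    "  path = /proj\n"
    "\n"
    "[]\n"
    "\n"
    "This last second line.\n"
    "End of the line.\n"
)

PATTERN = r'(\n\[)(.*?)(\[]\n)'


def strip_sections(text):
    """Remove everything from the first '\\n[' through the first '[]\\n'."""
    # re.DOTALL lets '.' match newlines, so '.*?' can span several lines.
    return re.sub(PATTERN, '', text, flags=re.DOTALL)


cleaned = strip_sections(SAMPLE)
print(cleaned)

# Without re.DOTALL, '.' stops at each newline, so nothing matches here.
unchanged = re.sub(PATTERN, '', SAMPLE)
print(unchanged == SAMPLE)
```

Note that a blank line survives between the two kept parts, because the line break before the first [ is consumed while the one after [] is not paired with another removal.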


!!! PANDAS UPDATE !!!

The same logic can be carried out with pandas:

import re
import pandas as pd

# read each line in the file (one row -> one line);
# read_csv skips blank lines by default
txt = pd.read_csv('smb',  sep = '\n', header=None)
# join all the lines in the file, separating them with '\n'
txt = '\n'.join(txt[0].to_list())
# apply the regex to clean the text (same pattern as above, replaced with '\n')
txt = re.sub(r'(\n\[)(.*?)(\[]\n)', '\n', txt, flags=re.DOTALL)

print(txt)
Marco Cerliani answered Oct 20 '22 00:10


Once the file's contents are in a string, this expression:

extract = '''Hi this is my config file.
Please dont delete it

[homes]
  browseable                     = No
  comment                        = Your Home
  create mode                    = 0640
  csc policy                     = disable
  directory mask                 = 0750
  public                         = No
  writeable                      = Yes

[proj]
  browseable                     = Yes
  comment                        = Project directories
  csc policy                     = disable
  path                           = /proj
  public                         = No
  writeable                      = Yes

[]

This last second line.
End of the line.
'''.split('\n[')[0][:-1]

will give you:

Hi this is my config file.
Please dont delete it

.split('\n[') splits the string at every occurrence of the character sequence '\n[', [0] selects the description lines above the first bracketed block, and [:-1] trims the trailing line break.
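
A small sketch of what the split actually returns may help here. The miniature sample string is invented for illustration and is not the real smb file.

```python
# Hypothetical miniature of the file: a header, one bracketed block, a footer.
text = "header line\n\n[homes]\n  public = No\n\n[]\n\nfooter line\n"

# Splitting on '\n[' cuts the text just before every line that starts with '['.
parts = text.split('\n[')
print(parts)

# The first piece is everything above the first bracketed line; it still ends
# with one line break, which [:-1] trims off.
head = parts[0][:-1]
print(repr(head))
```

The delimiter itself is consumed by the split, which is why the later pieces begin with homes] and ] rather than with an opening bracket.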

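with open("smb", "r") as f:
    extract = f.read()

# everything after the last ']\n' is the trailing text
tail = extract.split(']\n')
# header lines (without their trailing line break) + trailing text
extract = extract.split('\n[')[0][:-1] + tail[-1]
print(extract)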

will read the file and output:

Hi this is my config file.
Please dont delete it
This last second line.
End of the line.
Akash Sonthalia answered Oct 20 '22 00:10


Since you tagged pandas, let's try that:

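import numpy as np
import pandas as pd

# a separator that never occurs in the file keeps each line in one column;
# multi-character separators are handled by the python engine
df = pd.read_csv('smb', sep='----', header=None, engine='python')

# mark rows that start with `[`
s = df[0].str.startswith('[')

# drop every row from the first `[` row through the last one
df = df.drop(np.arange(s.idxmax(), s[::-1].idxmax() + 1))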

# write to file if needed
df.to_csv('clean.txt', header=None, index=None)

Output (df):

                             0
0   Hi this is my config file.
1        Please dont delete it
18      This last second line.
19            End of the line.
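
For readers without pandas, the same idea (find the first and the last line starting with [, then drop that whole range) can be sketched with plain lists. The function name drop_bracket_span and the shortened sample lines are assumptions made for this illustration, not part of the answer above.

```python
def drop_bracket_span(lines):
    """Drop every line from the first '['-line through the last '['-line."""
    # Positions of all lines that start with '[' (the role of `s` above).
    marked = [i for i, line in enumerate(lines) if line.startswith('[')]
    if not marked:
        # No bracketed line at all: return a copy of the input unchanged.
        return list(lines)
    first, last = marked[0], marked[-1]
    return lines[:first] + lines[last + 1:]


# Hypothetical sample; blank lines are left out, as read_csv skips them by default.
sample = [
    "Hi this is my config file.",
    "Please dont delete it",
    "[homes]",
    "  public = No",
    "[proj]",
    "  path = /proj",
    "[]",
    "This last second line.",
    "End of the line.",
]

kept = drop_bracket_span(sample)
print("\n".join(kept))
```

The early return is a deliberate choice in this sketch: when no line starts with [, the input comes back untouched instead of being trimmed.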
Quang Hoang answered Oct 19 '22 23:10