I have been having difficulty writing a function that handles strings the way I want. I have looked into a handful of previous questions (1, 2, 3, among others). Here is the setup: I have well-structured but variable data that needs to be split from a string read from a file into an array of strings. The following are some examples of the data I am dealing with:
('Vdfbr76','gsdf','gsfd','',NULL),
('Vkdfb23l','gsfd','gsfg','[email protected]',NULL),
('4asg0124e','Lead Actor/SFX MUA/Prop designer','John Smith','[email protected]',NULL),
('asdguIux','Director, Camera Operator, Editor, VFX','John Smith','',NULL),
...
(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),
I want to split these strings on commas; however, commas occasionally appear inside the quoted strings, which causes problems. In addition, developing an accurate re.split(regex, line) becomes difficult because the number of items in each line changes throughout the read.
Here are some solutions that I have tried up to this point.
import re

def splitLine(text, fields, delimiter):
    return_line = []
    # Build one capture group per field, separated by the delimiter.
    regex_string = "(.*?),"
    for i in range(0, len(fields) - 1):
        regex_string += "(.*)"
        if i < len(fields) - 2:
            regex_string += delimiter
    return_line = re.split(regex_string, text)
    return return_line
This produces the following regex_string and return_line:
(.*?),(.*),(.*),(.*),(.*),(.*)
['', '\t(222', "'Vy1asdfnuJkA','Ndfbyz3_YMD'", "'14541242640005471'", '2', "'Hello World!')", '', '\n']
However, the main problem with this is that it occasionally lumps two fields together; in this case, the third value in the array.
The ideal result would look like:
['', '\t(222', "'Vy1asdfnuJkA'", "'Ndfbyz3_YMD'", "'14541242640005471'", '2', "'Hello World!')", '', '\n']
It is a small change, but it has a huge influence on the result. I tried adjusting the regex string to better suit what I was trying to do, but with each case I solved, another one broke.
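One way to avoid building a pattern for a fixed number of fields is to match the fields themselves rather than the separators. The following is only a minimal sketch, assuming that quoted values never contain an escaped single quote and that unquoted values (numbers, NULL) contain no whitespace; the names `FIELD_PATTERN` and `extract_fields` are hypothetical and not from any of the linked posts.

```python
import re

# Match either a complete single-quoted string or a run of characters
# that are not commas, parentheses, or whitespace (numbers, NULL).
FIELD_PATTERN = re.compile(r"'[^']*'|[^,()\s]+")

def extract_fields(line):
    """Return the fields of one row, whatever their count (hypothetical helper)."""
    return FIELD_PATTERN.findall(line)

row = "(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),"
print(extract_fields(row))
```

Because the quoted-string alternative is tried first, commas inside quotes stay part of their field, and the number of fields per line does not need to be known in advance.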
Another approach I experimented with came from user Aaron Cronin in this post (4), shown below:
def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    result = []
    buff = ""
    level = 0
    is_quoted = False

    for char in text:
        if char in delimiter and level == 0 and not is_quoted:
            result.append(buff)
            buff = ""
        else:
            buff += char

            if char in opens:
                level += 1
            if char in closes:
                level -= 1
            if char in quotes:
                is_quoted = not is_quoted

    if not buff == "":
        result.append(buff)

    return result
The results of this look like so:
["\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n"]
The main problem is that the line comes back as a single, unsplit string, which puts me in a feedback loop.
The ideal result would look like:
[\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n]
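A note on why no field is split here: `(` is in the default `opens`, so every comma between the row's outer parentheses is seen at level 1 and is never treated as a delimiter. A minimal sketch of one workaround is to strip the outer parentheses first and then track only quotes. The helper `split_row` is hypothetical, and it assumes single-quoted values with no escaped quotes inside them.

```python
def split_row(line, delimiter=",", quote="'"):
    """Split one "(a,'b, c',NULL)," row on commas outside quotes (hypothetical helper)."""
    # Drop surrounding whitespace, the trailing comma, then the outer parentheses.
    body = line.strip().rstrip(",")
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1]

    fields = []
    buff = ""
    in_quotes = False
    for char in body:
        if char == quote:
            in_quotes = not in_quotes
        if char == delimiter and not in_quotes:
            # A comma outside quotes ends the current field.
            fields.append(buff)
            buff = ""
        else:
            buff += char
    fields.append(buff)  # the last field has no trailing delimiter
    return fields

print(split_row("\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n"))
```

The fields keep their quotes, matching the "ideal result" format shown earlier in the question; stripping them would be a separate step.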
Any help is appreciated. I am not sure what the best approach is in this scenario, and I am happy to clarify any questions that arise. I tried to be as complete as possible.
Use ast's literal_eval!
from ast import literal_eval

s = """('Vdfbr76','gsdf','gsfd','',NULL),
('Vkdfb23l','gsfd','gsfg','[email protected]',NULL),
('4asg0124e','Lead Actor/SFX MUA/Prop designer','John Smith','[email protected]',NULL),
('asdguIux','Director, Camera Operator, Editor, VFX','John Smith','',NULL),
(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),
"""

for line in s.split("\n"):
    # Drop whitespace and the trailing comma, and turn SQL NULL into Python None.
    line = line.strip().rstrip(",").replace("NULL", "None")
    if line:
        print(list(literal_eval(line)))  # list(..) is just an example
Output:
['Vdfbr76', 'gsdf', 'gsfd', '', None]
['Vkdfb23l', 'gsfd', 'gsfg', '[email protected]', None]
['4asg0124e', 'Lead Actor/SFX MUA/Prop designer', 'John Smith', '[email protected]', None]
['asdguIux', 'Director, Camera Operator, Editor, VFX', 'John Smith', '', None]
[492, 'E1asegaZ1ox', 'Nysdag_5YmD', '145872325372620', 1, 'long, string, with, commas']
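One caveat with `.replace("NULL", "None")` is that it also rewrites the text NULL when it appears inside a quoted value. A minimal sketch of a more conservative variant follows, assuming single-quoted values with no escaped quotes; the names `_TOKEN` and `parse_rows` are hypothetical. It replaces NULL only when it occurs as a bare token.

```python
import re
from ast import literal_eval

# Matches either a whole single-quoted string or a bare NULL token.
_TOKEN = re.compile(r"'[^']*'|\bNULL\b")

def parse_rows(text):
    """Yield each "(...)," row of text as a list of Python values (hypothetical helper)."""
    for line in text.splitlines():
        line = line.strip().rstrip(",")
        if not line:
            continue
        # Quoted strings are matched first and returned unchanged,
        # so only a NULL outside quotes becomes None.
        line = _TOKEN.sub(
            lambda m: "None" if m.group(0) == "NULL" else m.group(0), line
        )
        yield list(literal_eval(line))

sample = """('Vdfbr76','gsdf','gsfd','',NULL),
('x','NULL, kept as text','John Smith','',NULL),
"""
for row in parse_rows(sample):
    print(row)
```

As in the loop above, `literal_eval` only accepts Python literals, so a row containing anything else (for example an unquoted word other than NULL) will raise an error rather than be parsed silently.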