
Split complicated strings in Python dynamically

I have been having difficulty writing a function that handles strings the way I want. I have looked through a handful of previous questions (1, 2, 3, among others). Here is the setup: I have well-structured but variable data that needs to be split from a string read from a file into an array of strings. The following are some examples of the data I am dealing with:

('Vdfbr76','gsdf','gsfd','',NULL),
('Vkdfb23l','gsfd','gsfg','[email protected]',NULL),
('4asg0124e','Lead Actor/SFX MUA/Prop designer','John Smith','[email protected]',NULL),
('asdguIux','Director, Camera Operator, Editor, VFX','John Smith','',NULL),
...
(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),

I want to split these strings on commas; however, commas occasionally appear inside the quoted strings, which causes problems. In addition, writing an accurate re.split(regex, line) is difficult because the number of items in each line changes throughout the file.

Here are some solutions I have tried so far.

import re

def splitLine(text, fields, delimiter):
    # Start with one lazy group, then add a greedy group per remaining field
    regex_string = "(.*?),"

    for i in range(0, len(fields) - 1):
        regex_string += "(.*)"

        # Put the delimiter between greedy groups, but not after the last one
        if i < len(fields) - 2:
            regex_string += delimiter

    return re.split(regex_string, text)

This gives the following regex_string and return_line. The main problem is that it occasionally lumps two fields together; here it is the third value in the array:

(.*?),(.*),(.*),(.*),(.*),(.*)
['', '\t(222', "'Vy1asdfnuJkA','Ndfbyz3_YMD'", "'14541242640005471'", '2', "'Hello World!')", '', '\n']

Where the ideal result would look like:

['', '\t(222', "'Vy1asdfnuJkA'", "'Ndfbyz3_YMD'", "'14541242640005471'", '2', "'Hello World!')", '', '\n']

It is a small change, but it has a huge influence on the result. I tried adjusting the regex string to better suit what I was trying to do, but each time I solved one case, another one broke.
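One way to sidestep the greedy `(.*)` groups is to match the fields themselves instead of the separators. The following is only a sketch: `FIELD_RE` and `find_fields` are hypothetical names, and the pattern assumes that quoted values never contain an embedded single quote. Because it does not depend on a fixed number of groups, it works for rows with any number of fields.

```python
import re

# Hypothetical pattern (an assumption, not from the question):
#   '[^']*'     a single-quoted string, commas allowed inside
#   [^,()\s]+   otherwise, a bare token such as 492 or NULL
FIELD_RE = re.compile(r"'[^']*'|[^,()\s]+")

def find_fields(line):
    """Return the raw fields of one row, quotes still attached."""
    return FIELD_RE.findall(line)

line = "(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),"
fields = find_fields(line)
print(fields)
```

The quoted alternative is listed first, so a comma inside quotes is consumed as part of the string before the bare-token alternative ever sees it.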

Another approach I tried came from user Aaron Cronin in this post (4), which looks like this:

def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    result = []
    buff = ""
    level = 0
    is_quoted = False

    for char in text:
        if char in delimiter and level == 0 and not is_quoted:
            result.append(buff)
            buff = ""
        else:
            buff += char

            if char in opens:
                level += 1
            if char in closes:
                level -= 1
            if char in quotes:
                is_quoted = not is_quoted

    if not buff == "":
        result.append(buff)

    return result

The results of this look like so:

["\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n"]

The main problem is that it comes back as a single, unsplit string, which puts me in a feedback loop.

The ideal result would look like:

[\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n]
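A likely reason the row is not split into fields is that every row is wrapped in parentheses, so `level` stays above 0 for every comma between the fields. A minimal sketch of a workaround is to strip the outer parentheses first and then track only the quote state. `split_row` is a hypothetical helper, and it assumes that values contain no escaped or embedded single quotes.

```python
def split_row(line, delimiter=",", quote="'"):
    """Split one "('a','b, c',NULL)," row into raw field strings."""
    # Drop surrounding whitespace and the trailing row separator
    body = line.strip().rstrip(",")
    # Drop the outer parentheses so they cannot mask the delimiters
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1]

    fields, buff, in_quotes = [], [], False
    for char in body:
        if char == quote:
            # Toggle quote state; keep the quote character in the field
            in_quotes = not in_quotes
            buff.append(char)
        elif char == delimiter and not in_quotes:
            # A delimiter outside quotes ends the current field
            fields.append("".join(buff))
            buff = []
        else:
            buff.append(char)
    fields.append("".join(buff))
    return fields

row = split_row("\t('Vk3NIasef366l','gsdasdf','gsfasfd','',NULL),\n")
print(row)
```

The fields come back with their quotes still attached, so a later step can tell the quoted strings apart from bare tokens such as NULL.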

Any help is appreciated; I am not sure what the best approach is in this scenario. I am happy to clarify anything that is unclear, and I have tried to be as complete as possible.

msleevi asked Jan 05 '23 20:01



1 Answer

Use ast's literal_eval!

from ast import literal_eval

s = """('Vdfbr76','gsdf','gsfd','',NULL),
('Vkdfb23l','gsfd','gsfg','[email protected]',NULL),
('4asg0124e','Lead Actor/SFX MUA/Prop designer','John Smith','[email protected]',NULL),
('asdguIux','Director, Camera Operator, Editor, VFX','John Smith','',NULL),
(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),
"""

for line in s.split("\n"):
    # Drop the trailing comma and turn SQL NULL into Python None
    # (note: this also rewrites the text NULL inside quoted strings)
    line = line.strip().rstrip(",").replace("NULL", "None")
    if line:
        print(list(literal_eval(line)))  # list(..) is just an example

Output:

['Vdfbr76', 'gsdf', 'gsfd', '', None]
['Vkdfb23l', 'gsfd', 'gsfg', '[email protected]', None]
['4asg0124e', 'Lead Actor/SFX MUA/Prop designer', 'John Smith', '[email protected]', None]
['asdguIux', 'Director, Camera Operator, Editor, VFX', 'John Smith', '', None]
[492, 'E1asegaZ1ox', 'Nysdag_5YmD', '145872325372620', 1, 'long, string, with, commas']
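
If a value might itself contain the text NULL, the blanket replace("NULL", "None") above would change it. As an alternative sketch, the standard csv module can split each row with a single quote as the quote character. `parse_row` is a hypothetical helper; it returns every value as a string (so 492 becomes '492'), and it cannot tell a quoted 'NULL' from a bare NULL.

```python
import csv

def parse_row(line):
    """Parse one "('a','b, c',NULL)," row into a list of str or None."""
    # Drop surrounding whitespace, the trailing comma, and the outer parentheses
    body = line.strip().rstrip(",")
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1]

    # csv.reader accepts any iterable of lines; quotechar="'" keeps
    # commas inside single-quoted values from splitting the field
    row = next(csv.reader([body], quotechar="'"))

    # Map the whole-field token NULL to None; other text is left alone
    return [None if field == "NULL" else field for field in row]

parsed = parse_row("(492,'E1asegaZ1ox','Nysdag_5YmD','145872325372620',1,'long, string, with, commas'),")
print(parsed)
```

Only a field that is exactly NULL becomes None, so a value such as 'NULL and void' is preserved as written.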
UltraInstinct answered Jan 15 '23 17:01
