Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Restrict separator to only some tabs when using pandas read_csv

I'm reading some tab-delimited data into a pandas Dataframe using read_csv, but I have tabs occurring within the column data which means I can't just use "\t" as a separator. Specifically, the last entries in each line are a set of tab delimited optional tags which match [A-Za-z][A-Za-z0-9]:[A-Za-z]:.+ There are no guarantees about how many tags there will be or which ones will be present, and different sets of tags may occur on different lines. Example data looks like this (all white spaces are tabs in my data):

C42TMACXX:5:2316:15161:76101    163 1   @<@DFFADDDF:DD  NH:i:1  HI:i:1  AS:i:200    nM:i:0
C42TMACXX:5:2316:15161:76101    83  1   CCCCCACDDDCB@B  NH:i:1  HI:i:1  nM:i:1
C42TMACXX:5:1305:26011:74469    163 1   CCCFFFFFHHHHGJ  NH:i:1  HI:i:1  AS:i:200    nM:i:0

I am proposing to try to read the tags in as a single column, and I thought I could do this by passing in a regular expression for the separator which excludes tabs which occur in the context of the tags.

Following http://www.rexegg.com/regex-best-trick.html I wrote the following regex for this: [A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t). I tested it on an online regular expression tester and it seems to just match the tabs I want as separators.

But when I run

df = pd.read_csv(myfile.txt, sep=r"[A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t)", 
                 header=None, engine="python")
print(df)

I get the following output for this data:

                          0       1    2   3   4   5               6   7   8 \
0  C42TMACXX:5:2316:15161:76101  \t  163  \t   1  \t  @<@DFFADDDF:DD  \t NaN   
1  C42TMACXX:5:2316:15161:76101  \t   83  \t   1  \t  CCCCCACDDDCB@B  \t NaN   
2  C42TMACXX:5:1305:26011:74469  \t  163  \t   1  \t  CCCFFFFFHHHHGJ  \t NaN   

   9    10  11      12  13    14  
0 NaN  i:1  \t     NaN NaN   i:0  
1 NaN  i:1  \t  nM:i:1 NaN  None  
2 NaN  i:1  \t     NaN NaN   i:0  

What I was expecting / want is:

                          0        1  2               3                      4
0  C42TMACXX:5:2316:15161:76101  163  1  @<@DFFADDDF:DD  NH:i:1 HI:i:1 AS:i:200 nM:i:0   
1  C42TMACXX:5:2316:15161:76101  83   1  CCCCCACDDDCB@B  NH:i:1 HI:i:1 nM:i:1   
2  C42TMACXX:5:1305:26011:74469  163  1  CCCFFFFFHHHHGJ  NH:i:1 HI:i:1 AS:i:200 nM:i:0

How do achieve that?

In case it's relevant, I'm using pandas 0.17.1 and my real data files are of the order of 100 million+ lines.

like image 773
kmt Avatar asked Nov 08 '22 22:11

kmt


1 Answers

I took a quick look at pandas docs and it seems the regex used as a separator cannot use groups.

C42TMACXX:5:2316:15161:76101    163 1   @<@DFFADDDF:DD  NH:i:1  HI:i:1  AS:i:200    nM:i:0
C42TMACXX:5:2316:15161:76101    83  1   CCCCCACDDDCB@B  NH:i:1  HI:i:1  nM:i:1
C42TMACXX:5:1305:26011:74469    163 1   CCCFFFFFHHHHGJ  NH:i:1  HI:i:1  AS:i:200    nM:i:0
                              ^    ^  ^                ^           

You need to match only the 4 first tabs but you can't without using groups.

A solution is to isolate the wanted \t by using lookaheads and lookbehinds.

Here is a regex which should work:

(?<=\d)\t(?=\d)|\t(?=[A-Z@<:]{14})|(?<=[A-Z@<:]{14})\t

Explanation

(?<=\d)\t(?=\d) : a tab precedeed by (?<=...) a digit and followed by (?=...) a digit

=> match the 1st and 2nd tabs

| OR

\t(?=[A-Z@<:]{14}) : a tab followed by 14 consecutive characters present in the set LETTER,@,< or :

=> match the 3rd tab

| OR

(?<=[A-Z@<:]{14})\t : a tab precedeed by the same 14 characters set

=> match the 4th tab

Demo

Note

If you need to allow more characters in the 14 consecutive characters pattern, just add them to the set.

like image 152
Stephane Janicaud Avatar answered Nov 15 '22 11:11

Stephane Janicaud