I'm reading some tab-delimited data into a pandas Dataframe using read_csv, but I have tabs occurring within the column data which means I can't just use "\t" as a separator. Specifically, the last entries in each line are a set of tab delimited optional tags which match [A-Za-z][A-Za-z0-9]:[A-Za-z]:.+ There are no guarantees about how many tags there will be or which ones will be present, and different sets of tags may occur on different lines. Example data looks like this (all white spaces are tabs in my data):
C42TMACXX:5:2316:15161:76101 163 1 @<@DFFADDDF:DD NH:i:1 HI:i:1 AS:i:200 nM:i:0
C42TMACXX:5:2316:15161:76101 83 1 CCCCCACDDDCB@B NH:i:1 HI:i:1 nM:i:1
C42TMACXX:5:1305:26011:74469 163 1 CCCFFFFFHHHHGJ NH:i:1 HI:i:1 AS:i:200 nM:i:0
I am proposing to try to read the tags in as a single column, and I thought I could do this by passing in a regular expression for the separator which excludes tabs which occur in the context of the tags.
Following http://www.rexegg.com/regex-best-trick.html I wrote the following regex for this: [A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t). I tested it on an online regular expression tester and it seems to just match the tabs I want as separators.
But when I run
df = pd.read_csv(myfile.txt, sep=r"[A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t)",
header=None, engine="python")
print(df)
I get the following output for this data:
0 1 2 3 4 5 6 7 8 \
0 C42TMACXX:5:2316:15161:76101 \t 163 \t 1 \t @<@DFFADDDF:DD \t NaN
1 C42TMACXX:5:2316:15161:76101 \t 83 \t 1 \t CCCCCACDDDCB@B \t NaN
2 C42TMACXX:5:1305:26011:74469 \t 163 \t 1 \t CCCFFFFFHHHHGJ \t NaN
9 10 11 12 13 14
0 NaN i:1 \t NaN NaN i:0
1 NaN i:1 \t nM:i:1 NaN None
2 NaN i:1 \t NaN NaN i:0
What I was expecting / want is:
0 1 2 3 4
0 C42TMACXX:5:2316:15161:76101 163 1 @<@DFFADDDF:DD NH:i:1 HI:i:1 AS:i:200 nM:i:0
1 C42TMACXX:5:2316:15161:76101 83 1 CCCCCACDDDCB@B NH:i:1 HI:i:1 nM:i:1
2 C42TMACXX:5:1305:26011:74469 163 1 CCCFFFFFHHHHGJ NH:i:1 HI:i:1 AS:i:200 nM:i:0
How do achieve that?
In case it's relevant, I'm using pandas 0.17.1 and my real data files are of the order of 100 million+ lines.
I took a quick look at pandas docs and it seems the regex used as a separator cannot use groups.
C42TMACXX:5:2316:15161:76101 163 1 @<@DFFADDDF:DD NH:i:1 HI:i:1 AS:i:200 nM:i:0
C42TMACXX:5:2316:15161:76101 83 1 CCCCCACDDDCB@B NH:i:1 HI:i:1 nM:i:1
C42TMACXX:5:1305:26011:74469 163 1 CCCFFFFFHHHHGJ NH:i:1 HI:i:1 AS:i:200 nM:i:0
^ ^ ^ ^
You need to match only the 4 first tabs but you can't without using groups.
A solution is to isolate the wanted \t
by using lookaheads and lookbehinds.
Here is a regex which should work:
(?<=\d)\t(?=\d)|\t(?=[A-Z@<:]{14})|(?<=[A-Z@<:]{14})\t
Explanation
(?<=\d)\t(?=\d)
: a tab precedeed by (?<=...)
a digit and followed by (?=...)
a digit
=> match the 1st and 2nd tabs
|
OR
\t(?=[A-Z@<:]{14})
: a tab followed by 14 consecutive characters present in the set LETTER,@,< or :
=> match the 3rd tab
|
OR
(?<=[A-Z@<:]{14})\t
: a tab precedeed by the same 14 characters set
=> match the 4th tab
Demo
Note
If you need to allow more characters in the 14 consecutive characters pattern, just add them to the set.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With