Find Lines with N occurrences of a char

Tags:

regex

I have a txt file that I’m trying to import as flat file into SQL2008 that looks like this:

"123456","some text"
"543210","some more text"
"111223","other text"
etc…

The file has more than 300.000 rows and the text is large (usually 200-500 chars), so scanning the file by hand is very time-consuming and error-prone. Other similar (and even more complex) files were successfully imported.

The problem with this one is that “some lines” contain quotes in the text. This came from an export from an old SuperBase DB that didn’t let you specify a text qualifier, so there’s nothing I can do with the file other than clean it up and try to import it.

So the “offending” lines look like this:

"123456","this text "contains" a quote"
"543210","And the "above" text is bad"
etc…

You can see the problem here.

Now, 300.000 lines would not be too many if I could search with a text editor that supports regex; I’d then manually remove the quotes from each offending line. The problem is not the number of offending lines, but that a simple search cannot find them. I’m sure there are fewer than 500, but spread those across a 300.000-line txt file and you know what I mean.

Based upon that, what would be the best regex I could use to identify these lines?

My first thought is: tell me which lines contain more than 4 quotes (").

But I couldn’t come up with anything (I’m not good at Regex beyond the basics).

Martin Marconcini asked Dec 18 '22 00:12


1 Answer

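This pattern, ^("[^"]+){4,}, will match "lines containing more than 4 quotes". It requires the line to start with a quote and to contain at least four quotes that are each followed by one or more non-quote characters. A well-formed two-field line has only three such groups, provided nothing follows its closing quote.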

You can experiment with replacing 4 with 5 or more, depending on your data.
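
If a regex-capable editor is not at hand, the same check can be scripted. The following is only a rough sketch, not part of the original answer: it assumes the real file uses plain ASCII double quotes, and the names find_suspect_lines, max_quotes and sample are made up for illustration. It flags a line when either the pattern above matches or a plain count finds more than 4 quotes, and it strips line endings first so a trailing CR/LF is not mistaken for text after the closing quote.

```python
import re

# Pattern from the answer: the line starts with a quote and has at least four
# groups of "a quote followed by one or more non-quote characters".
SUSPECT = re.compile(r'^("[^"]+){4,}')


def find_suspect_lines(lines, max_quotes=4):
    """Return 1-based numbers of lines flagged by the regex or by a quote count.

    `max_quotes` is a hypothetical knob: a clean two-field line has exactly 4.
    """
    flagged = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")  # keep line endings out of the match
        if SUSPECT.match(text) or text.count('"') > max_quotes:
            flagged.append(number)
    return flagged


# Sample rows taken from the question (two clean, two offending).
sample = [
    '"123456","some text"',
    '"543210","some more text"',
    '"123456","this text "contains" a quote"',
    '"543210","And the "above" text is bad"',
]

print(find_suspect_lines(sample))  # prints [3, 4]
```

To run it against the real export, pass an open file object instead of the sample list; file objects iterate line by line, so the whole file does not need to fit in memory at once.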

zed_0xff answered Mar 20 '23 03:03