Is there a faster way to clean out control characters in a file?

Previously, I had been cleaning out data using the code snippet below

import unicodedata, re, io

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))
def rm_control_chars(s): # see http://www.unicode.org/reports/tr44/#General_Category_Values
    return cc_re.sub('', s)

cleanfile = []
with io.open('filename.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)

There are newline characters in the file that I want to keep.
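
Note that the Unicode "C" categories include '\n' and '\r' themselves, so a class built from every such character also strips line breaks. A minimal sketch (under the assumption that '\n' and '\r' are the only characters to preserve) that leaves them out of the class:

keep = u'\n\r'
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars
                        if unicodedata.category(c)[0] == 'C' and c not in keep)
cc_re = re.compile(u'[%s]' % re.escape(control_chars))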

The following records the time taken for cc_re.sub('', s) to process each of the first few lines (the first column is the time taken in seconds, the second is len(s); a sketch of how such a log can be produced follows the numbers):

0.275146961212 251
0.672796010971 614
0.178567171097 163
0.200030088425 180
0.236430883408 215
0.343492984772 313
0.317672967911 290
0.160616159439 142
0.0732028484344 65
0.533437013626 468
0.260229110718 236
0.231380939484 204
0.197766065598 181
0.283867120743 258
0.229172945023 208
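
A rough sketch (my reconstruction, not the exact code used) of how a per-line log like this can be produced, timing each substitution and printing the duration next to the line length:

from time import time

with io.open('filename.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        t0 = time()
        cleaned = cc_re.sub('', line)
        print time() - t0, len(line)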

As @ashwinichaudhary suggested, I switched to s.translate(dict.fromkeys(control_chars)); the same per-line timing log then reads (a sketch of that variant follows the numbers):

0.464188098907 252
0.366552114487 615
0.407374858856 164
0.322507858276 181
0.35142993927 216
0.319973945618 314
0.324357032776 291
0.371646165848 143
0.354818105698 66
0.351796150208 469
0.388131856918 237
0.374715805054 205
0.363368988037 182
0.425950050354 259
0.382766962051 209
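
For reference, a minimal sketch of that translate-based variant (my reconstruction, not @ashwinichaudhary's exact code). One caveat worth checking: in Python 2, unicode.translate looks characters up by their ordinals, so the deletion table should be keyed with ord(c); a table keyed by the one-character strings themselves matches nothing and removes nothing.

del_table = dict.fromkeys(ord(c) for c in control_chars)

def rm_control_chars_translate(s):
    # ordinals mapped to None are deleted from the string
    return s.translate(del_table)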

But both approaches are really slow on my 1 GB of text. Is there any other way to clean out control characters?

asked May 11 '15 by alvas

1 Answer

I found a solution that works character by character. I benchmarked it using a 100K file:

import unicodedata, re, io
from time import time

# This randomly generates a file to test the script

from string import lowercase
from random import random

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = [c for c in all_chars if unicodedata.category(c)[0] == 'C']
chars = (list(u'%s' % lowercase) * 115117) + control_chars

fnam = 'filename.txt'

out = io.open(fnam, 'w')

for line in range(1000000):
    out.write(u''.join(chars[int(random()*len(chars))] for _ in range(600)) + u'\n')
out.close()


# version proposed by alvas
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))
def rm_control_chars(s):
    return cc_re.sub('', s)

t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)
out = io.open(fnam + '_out1.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0

# using a set and checking character by character
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = set(c for c in all_chars if unicodedata.category(c)[0] == 'C')
def rm_control_chars_1(s):
    return ''.join(c for c in s if c not in control_chars)

t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars_1(line)
        cleanfile.append(line)
out = io.open(fnam + '_out2.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0

The output (regex version first, then the set-based version) is:

114.625444174
0.0149750709534

I tried it on a 1 GB file (only the second version) and it took 186 s.

I also wrote another version of the same script that is slightly faster (176 s) and more memory efficient (for very large files that don't fit in RAM):

t0 = time()
out = io.open(fnam + '_out5.txt', 'w')
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        out.write(rm_control_chars_1(line))
out.close()
print time() - t0

answered Sep 28 '22 by fransua