
How to remove special characters from txt files using Python

Tags:

python

import os
from glob import glob

pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)

def countwords(fp):
    # Count the whitespace-separated tokens in one file.
    with open(fp) as fh:
        return len(fh.read().split())

total = sum(map(countwords, filelist))
print("There are %d words in the files. From directory %s" % (total, pattern))

uniquewords = set()
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        # Punctuation stays attached to each token, so "code" and "code." differ.
        with open(os.path.join(root, name)) as fh:
            uniquewords.update(fh.read().split())
print("There are %d unique words in the files. From directory %s" % (len(uniquewords), pattern))

This is my code so far. It counts the total number of words and the number of unique words in D:\report\shakeall\*.txt

The problem is that this code treats, for example, code, code. and code! as different words, so it can't give an exact count of unique words.

I'd like to either remove the special characters from the 42 text files using a Windows text editor,

or add an exception rule that solves this problem.

If I use the latter, how should I write my code?

Should it modify the text files directly, or should it ignore special characters while counting?

rocksland asked Dec 20 '22 17:12


2 Answers

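import re
with open('a.txt') as src:
    text = src.read()
# Replace everything except letters, digits, and newlines with a space.
new_str = re.sub(r'[^a-zA-Z0-9\n]', ' ', text)
with open('b.txt', 'w') as dst:
    dst.write(new_str)

This replaces every non-alphanumeric character (other than newlines) with a space and writes the result to b.txt, so code, code. and code! all become the same word once the text is split. Note that the period has to be left out of the character class, otherwise code. would still keep its trailing dot.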
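If you would rather not modify the files at all (the "exception rule" option from the question), the same substitution could be applied in memory while counting. This is only a minimal sketch, assuming Python 3; the helper names normalized_words and count_words are made up for illustration, and the directory path is the one from the question.

```python
import re
from glob import glob

def normalized_words(text):
    """Replace non-alphanumeric characters with spaces, then split on whitespace."""
    return re.sub(r'[^a-zA-Z0-9\s]', ' ', text).split()

def count_words(paths):
    """Return (total word count, unique word count) across the given files."""
    total = 0
    unique = set()
    for path in paths:
        with open(path) as fh:
            words = normalized_words(fh.read())
        total += len(words)
        unique.update(words)
    return total, len(unique)

if __name__ == "__main__":
    # Path taken from the question; adjust it to your own directory.
    total, unique = count_words(glob("D:\\report\\shakeall\\*.txt"))
    print("There are %d words and %d unique words." % (total, unique))
```

This sketch does not fold case, so Code and code would still count as two unique words; lowercasing the text before splitting would merge them, if that is what you want. Also note that an apostrophe becomes a space, so don't is counted as the two tokens don and t.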

NIlesh Sharma answered Feb 15 '23 01:02


I'm pretty new and I doubt this is very elegant, but one option would be to take your string(s) after reading them in and run them through string.translate() to strip out the punctuation. Here is the Python documentation for it for version 2.7 (which I think you're using).

As far as the actual code, it might be something like this (but maybe someone better than me can confirm/improve on it):

import string
fileString = fileString.translate(None, string.punctuation)  # Python 2 str only; returns a new string

where "fileString" is the string that your open(fp) read in. "None" is provided in place of a translation table (which would normally be used to actually change some characters into others), and the second parameter, string.punctuation (a Python string constant containing all the punctuation symbols) is a set of characters that will be deleted from your string.
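In Python 3, str.translate() takes only a table argument, so the two-argument call above would not work there as written. A rough equivalent, assuming Python 3, might look like the sketch below; strip_punctuation is a hypothetical helper name, not something from the question.

```python
import string

# With three arguments, str.maketrans treats the third as characters to delete.
DELETE_PUNCTUATION = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(DELETE_PUNCTUATION)

print(strip_punctuation("code code. and code!"))  # prints: code code and code
```

Because the punctuation is deleted rather than replaced, a word such as don't becomes dont and well-known becomes wellknown, which may or may not be what you want for a unique-word count.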

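In the event that the above doesn't work, you could instead build a translation table that replaces each punctuation character with a space:

import string
inChars = string.punctuation
outChars = ' ' * len(inChars)  # maketrans needs two strings of equal length
translateTable = string.maketrans(inChars, outChars)  # Python 2; Python 3 uses str.maketrans
fileString = fileString.translate(translateTable)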
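A translation table can also map each punctuation character to a space instead of deleting it, which keeps the two halves of a hyphenated word apart. Here is a small sketch of that variant, assuming Python 3; punctuation_to_spaces is a name I made up for illustration.

```python
import string

# Map every punctuation character to a space (both arguments must have equal length).
TO_SPACES = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def punctuation_to_spaces(text):
    """Return text with each ASCII punctuation character replaced by a space."""
    return text.translate(TO_SPACES)

print(punctuation_to_spaces("well-known code!").split())  # prints: ['well', 'known', 'code']
```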

There are a couple of other answers to similar questions that I found via a quick search. I'll link them here, too, in case you can get more from them.

Removing Punctuation From Python List Items

Remove all special characters, punctuation and spaces from string

Strip Specific Punctuation in Python 2.x


Finally, if what I've said is completely wrong, please comment and I'll remove it so that others don't try it and become frustrated.

Weston Odom answered Feb 15 '23 00:02