
Read special characters from a .txt file in Python

The goal of this code is to find the frequency of words used in a book.

I am trying to read in the text of a book, but the following line keeps throwing my code off:

precious protégés. No, gentlemen; he'll always show 'em a clean pair

Specifically, the é character causes the problem.

I have looked at the following documentation, but I don't quite understand it: https://docs.python.org/3.4/howto/unicode.html

Here's my code:

import string
# Create word dictionary from the comprehensive word list 
word_dict = {}
def create_word_dict ():

  # open words.txt and populate dictionary
  word_file = open ("./words.txt", "r")
  for line in word_file:
    line = line.strip()
    word_dict[line] = 1

# Removes punctuation marks from a string
def parseString (st):
  st = st.encode("ascii", "replace")
  new_line = ""
  st = st.strip()
  for ch in st:
    ch = str(ch)
    if (n for n in (1,2,3,4,5,6,7,8,9,0)) in ch or ' ' in ch or ch.isspace() or ch == u'\xe9':

      print (ch)
      new_line += ch
    else:
      new_line += ""
  # now remove all instances of 's or ' at end of line
  new_line = new_line.strip()
  print (new_line)
  if (new_line[-1] == "'"):
    new_line = new_line[:-1]
  new_line.replace("'s", "")
  # Conversion from ASCII codes back to useable text
  message = new_line
  decodedMessage = ""
  for item in message.split():
    decodedMessage += chr(int(item))
  print (decodedMessage)
  return new_line

# Returns a dictionary of words and their frequencies
def getWordFreq (file):

  # Open file for reading the book.txt
  book = open (file, "r")

  # create an empty set for all Capitalized words
  cap_words = set()

  # create a dictionary for words
  book_dict = {}
  total_words = 0

  # remove all punctuation marks other than '[not s]
  for line in book:
    line = line.strip()
    if (len(line) > 0):
      line = parseString (line)

    word_list = line.split()

    # add words to the book dictionary
    for word in word_list:
      total_words += 1
      if (word in book_dict):
        book_dict[word] = book_dict[word] + 1
      else:
        book_dict[word] = 1
  print (book_dict)

  # close the file
  book.close()

def main():
  wordFreq1 = getWordFreq ("./Tale.txt")
  print (wordFreq1)

main()

The error that I received is as follows:

Traceback (most recent call last):
  File "Books.py", line 80, in <module>
    main()
  File "Books.py", line 77, in main
    wordFreq1 = getWordFreq ("./Tale.txt")
  File "Books.py", line 60, in getWordFreq
    line = parseString (line)
  File "Books.py", line 36, in parseString
    decodedMessage += chr(int(item))
OverflowError: Python int too large to convert to C long                      
Daniel Schulze asked Nov 30 '14 23:11



2 Answers

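When you open a text file in Python 3 without an encoding argument, the default encoding is platform-dependent (on Windows it is usually an ANSI code page such as cp1252). If the file was saved as UTF-8, the é character is then decoded incorrectly. Pass the encoding explicitly, and do the same for the open call that reads the book: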

word_file = open ("./words.txt", "r", encoding='utf-8')
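
As a rough illustration of the same idea applied to the book file, here is a minimal sketch. It assumes the text is saved as UTF-8, and the function name get_word_freq, the temporary sample file, and the use of collections.Counter are choices made for this example, not part of the original code:

```python
import collections
import os
import string
import tempfile


def get_word_freq(path, encoding="utf-8"):
    """Count word frequencies in a text file, keeping non-ASCII letters."""
    counts = collections.Counter()
    # An explicit encoding makes characters such as é decode the same way
    # on every platform (assumption: the file really is UTF-8).
    with open(path, "r", encoding=encoding) as book:
        for line in book:
            for word in line.split():
                # Remove leading/trailing ASCII punctuation and lowercase.
                word = word.strip(string.punctuation).lower()
                if word:
                    counts[word] += 1
    return counts


# Hypothetical sample: the line from the question, written to a temp file.
sample = "precious prot\u00e9g\u00e9s. No, gentlemen; he'll always show 'em a clean pair\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

try:
    sample_counts = get_word_freq(tmp_path)
finally:
    os.remove(tmp_path)

print(sample_counts)
```

With the file decoded as UTF-8, the word containing é is counted like any other word, so no conversion to ASCII is needed.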
Vladimir Shevyakov answered Sep 20 '22 06:09



The best way I could think of is to read each character as its numeric code into a list, and then convert each code back to a character. For example, 97 is the ASCII code for "a", and chr(97) returns "a" (ord("a") gives 97). Note that ASCII only covers codes 0 to 127; special characters such as é have higher Unicode code points (é is 233), so look them up in a Unicode table rather than an ASCII one.
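
A small sketch of that round trip, using the built-in ord and chr functions; the variable names are made up for the example:

```python
# The word from the question, with é written as an escape (U+00E9).
text = "prot\u00e9g\u00e9s"

# Character -> numeric code point.
code_points = [ord(ch) for ch in text]

# Numeric code point -> character.
restored = "".join(chr(n) for n in code_points)

# Encoding to ASCII with "replace" turns every non-ASCII character
# into "?", which is why the original text cannot be recovered afterwards.
ascii_only = text.encode("ascii", "replace")

print(code_points)
print(restored == text)
print(ascii_only)
```

This also suggests why the encode("ascii", "replace") call in the question loses the é: the character has no ASCII code, so it is replaced before the rest of the function ever sees it.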

24GHz answered Sep 19 '22 06:09
