How to efficiently use CountVectorizer to get ngram counts for all files in a directory combined?

I have around 10k .bytes files in my directory and I want to use CountVectorizer to get n-gram counts (i.e. fit on the train set and transform the test set). Of those 10k files, 8k are train and 2k are test.

files = 
['bfiles/GhHS0zL9cgNXFK6j1dIJ.bytes',
 'bfiles/8qCPkhNr1KJaGtZ35pBc.bytes',
 'bfiles/bLGq2tnA8CuxsF4Py9RO.bytes',
 'bfiles/C0uidNjwV8lrPgzt1JSG.bytes',
 'bfiles/IHiArX1xcBZgv69o4s0a.bytes',
    ...............................
    ...............................]

print(open(files[0]).read())
    'A4 AC 4A 00 AC 4F 00 00 51 EC 48 00 57 7F 45 00 2D 4B 42 45 E9 77 51 4D 89 1D 19 40 30 01 89 45 E7 D9 F6 47 E7 59 75 49 1F ....'

I can't do something like below and pass everything to CountVectorizer.

file_content = []
for file in files:
    file_content.append(open(file).read())

I can't append each file's text to one big list and then pass it to CountVectorizer, because the combined text of all files exceeds 150 GB. I don't have the resources to do that, since CountVectorizer uses a huge amount of memory.

I need a more efficient way of solving this. Is there some other way I can achieve what I want without loading everything into memory at once? Any help is much appreciated.

All I could achieve was to read one file and then use CountVectorizer, but I don't know how to extend that to what I'm looking for.

cv = CountVectorizer(ngram_range=(1, 4))
temp = cv.fit_transform([open(files[0]).read()])
temp
<1x451500 sparse matrix of type '<class 'numpy.int64'>'
    with 335961 stored elements in Compressed Sparse Row format>
asked Sep 06 '19 by user_12
1 Answer

You can build a solution using the following flow:

1) Loop through your files and create a set of all tokens in your files. In the example below this is done using Counter, but you can use Python sets to achieve the same result (see the short sketch after this list). The bonus here is that Counter will also give you the total number of occurrences of each term.

2) Fit CountVectorizer with the set/list of tokens. You can instantiate CountVectorizer with ngram_range=(1, 4), but below this is avoided in order to limit the number of features in df_new_data.

3) Transform new data as usual.
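
As a rough illustration of step 1 (the tokens here are made up, purely for illustration):

from collections import Counter

counter = Counter()
counter.update(['A4', 'AC', 'A4'])   # tokens from a first chunk
counter.update(['4A'])               # tokens from a second chunk

list(counter.keys())   # ['A4', 'AC', '4A'] -> the vocabulary used in step 2
counter['A4']          # 2 -> total occurrences of 'A4' across all updates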

The example below works on small data. I hope you can adapt the code to suit your needs.

import glob
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# Create a list of file names
pattern = 'C:\\Bytes\\*.csv'
csv_files = glob.glob(pattern)

# Instantiate Counter and loop through the files chunk by chunk
# to create a dictionary of all tokens and their number of occurrences
counter = Counter()
c_size = 1000
for file in csv_files:
    for chunk in pd.read_csv(file, chunksize=c_size, index_col=0, header=None):
        counter.update(chunk[1])

# Fit the CountVectorizer to the counter keys
vectorizer = CountVectorizer(lowercase=False)
vectorizer.fit(list(counter.keys()))

# Loop through your files chunk by chunk and accumulate the counts
counts = np.zeros((1, len(vectorizer.get_feature_names())))
for file in csv_files:
    for chunk in pd.read_csv(file, chunksize=c_size,
                             index_col=0, header=None):
        new_counts = vectorizer.transform(chunk[1])
        counts += new_counts.A.sum(axis=0)

# Generate a data frame with the total counts
df_new_data = pd.DataFrame(counts, columns=vectorizer.get_feature_names())

df_new_data
Out[266]: 
      00     01     0A     0B     10     11     1A     1B     A0     A1  \
0  258.0  228.0  286.0  251.0  235.0  273.0  259.0  249.0  232.0  233.0   

      AA     AB     B0     B1     BA     BB  
0  248.0  227.0  251.0  254.0  255.0  261.0  
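
Since the question asks to fit on the train set and transform the test set, here is a minimal sketch of step 3 for a held-out set of files, reusing the fitted vectorizer from above. The test file location and the use of scipy.sparse.vstack are my own illustration, not part of the original answer, and it assumes the test files have the same CSV layout as the train files.

import scipy.sparse as sp

test_pattern = 'C:\\Bytes\\test\\*.csv'   # hypothetical location of the test files
test_files = glob.glob(test_pattern)

# Transform each chunk with the already-fitted vectorizer and keep the
# per-row sparse matrices instead of summing them
test_parts = []
for file in test_files:
    for chunk in pd.read_csv(file, chunksize=c_size, index_col=0, header=None):
        test_parts.append(vectorizer.transform(chunk[1]))

# One row per line of the test files, one column per feature
X_test = sp.vstack(test_parts)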

Code for the generation of the data:

import numpy as np
import pandas as pd

def gen_data(n): 
    numbers = list('01')
    letters = list('AB')
    numlet = numbers + letters
    x = np.random.choice(numlet, size=n)
    y = np.random.choice(numlet, size=n)
    df = pd.DataFrame({'X': x, 'Y': y})
    return df.sum(axis=1)

n = 2000
df_1 = gen_data(n)
df_2 = gen_data(n)

df_1.to_csv('C:\\Bytes\\df_1.csv')
df_2.to_csv('C:\\Bytes\\df_2.csv')

df_1.head()
Out[218]: 
0    10
1    01
2    A1
3    AB
4    1A
dtype: object
answered Nov 15 '22 by KRKirov