
How to remove numeric tokens from CountVectorizer features?

How can I eliminate the numeric tokens that end up in the CountVectorizer vocabulary? My code:

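from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=50, stop_words='english', max_features=5000, analyzer='word')
cv_fit_addr = cv.fit_transform(data['Adj_Addr'])

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print(cv.get_feature_names())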

['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '100', '1001', '1002', '1003', '1004', '1005', '1008', '101', '1010', '102', '103', '104', '105', '106', '107', '108', '109', '10f', '10th', '11', '1101', '1102', '1103', '1104', '1105', '1106', '1108', '111', '1111', '113', '114', '116', '118', '11f', '11th', '12', '120', '1201', '1202', '1203', '1204', '1206', '1208', '121', '122', '123', '125', '126', '128', '12a', '12f', '12th', '13', '1301', '1302', '1303', '1305', '1308', '132', '133', '135', '138', '139', '13f', '13th', '14', '141', '143', '148', '14f', '14th', '15', '150', '1501', '1502', '1503', '1505', '151', '153', '15f', '15th', '16', '160', '1601', '1602', '1603', '1608', '165', '168', '169', '16f', '16th', '17', '1701', '1702', '1703', '1705', '17f', '17th', '18', '1801', '1803', '181', '182', '183', '188', '18f', '18th', '19', '1901', '1902', '191', '193', '19f', '19th', '1a', '1b', '1f', '1st', '20', '200', '2001', '2003', '201', '202', '203', '204', '205', '206', '208', '20f', '20th', '21', '210', '2101', '2103', '211', '21f', '21st', '22', '220', '2201', '223', '228', '22f', '22nd', '23', '2301', '231', '23f', '23rd', '24', '248', '25', '255', '25f', '25th', '26', '2601', '26f', '26th', '27', '2701', '27f', '28', '28f', '29', '29th', '2a', '2b', '2f', '2g', '2nd', '30', '301', '302', '303', '305', '306', '307', '308', '30f', '31', '311', '32', '33', '338', '34', '35', '36', '37', '370', '38', '388', '39', '392', '3a', '3b', '3f', '3rd', '40', '401', '402', '403', '404', '405', '406', '407', '41', '418', '42', '43', '44', '45', '46', '47', '479', '48', '489', '49', '491', '4a', '4f', '4th', '50', '500', '501', '502', '503', '505', '509', '51', '510', '511', '52', '53', '538', '54', '55', '555', '56', '57', '576', '58', '582', '59', '592', '5a', '5b', '5f', '5th', '60', '601', '602', '603', '605', '607', '608', '609', '61', '610', '611', '62', '625', '63', '64', '65', '66', '67', '68', '681', '69', '6a', '6f', '6th', '70', '701', '702', 
'703', '704', '705', '706', '707', '71', '712', '72', '73', '74', '75', '76', '760', '77', '777', '778', '78', '788', '79', '7a', '7f', '7th', '80', '800', '801', '802', '803', '804', '805', '806', '807', '808', '81', '810', '82', '83', '833', '838', '84', '852', '87', '88', '883', '89', '8a', '8f', '8th', '901', '902', '903', '904', '905', '906', '907', '908', '909', '912', '92', '93', '94', '95', '979', '98', '99', '9a', '9f', '9th', 'a1', 'a2', 'a3', 'a5', 'aberdeen', 'academic', 'accessories', 'ace', 'admiralty', 'advanced', 'ag', 'aia', 'air', 'airport', 'alexandra', 'alliance', 'allied', 'alpha', 'america', 'ap', 'apartment', 'apparel', 'argyle', 'arrow', 'art', 'ashley', 'asia', 'asset', 'associates', 'atl', 'attn', 'au', 'austin', 'avenue', 'aviation', 'axa', 'b1', 'b2', 'ba', 'bank', 'baptist', 'bay', 'bea', 'bear', 'beauty', 'bel', 'berth', 'best', 'beverly', 'billion', 'bio', 'biotech', 'bldg', 'block', 'blue', 'bo', 'bonham', 'br', 'branch', 'bright', 'broadway', 'bu', 'building', 'bun', 'business', 'c1', 'ca', 'cable', 'cambridge', 'cameron', 'canton', 'capital', 'cargo', 'castle', 'causeway', 'cc', 'cct', 'cent', 'centra'
NgBrandon asked Oct 30 '17 07:10



People also ask

What is Ngram_range in CountVectorizer?

CountVectorizer tokenizes the text and groups consecutive tokens into n-grams. The ngram_range argument takes a tuple (min_n, max_n) that sets which n-gram lengths are extracted. For example, (1, 1) gives only unigrams such as "whey" and "protein", (2, 2) gives only bigrams such as "whey protein", and (1, 2) gives both.
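As a rough illustration of the ngram_range values described above, the sketch below fits three vectorizers on a single made-up document (the sample text and variable names are assumptions, not part of the question) and reads the learned terms from vocabulary_:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample document, used only for illustration
docs = ["whey protein powder"]

# (min_n, max_n) controls which n-gram lengths are extracted
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)
both = CountVectorizer(ngram_range=(1, 2)).fit(docs)

unigram_terms = sorted(unigrams.vocabulary_)
bigram_terms = sorted(bigrams.vocabulary_)
both_terms = sorted(both.vocabulary_)

print(unigram_terms)  # ['powder', 'protein', 'whey']
print(bigram_terms)   # ['protein powder', 'whey protein']
print(both_terms)     # unigrams and bigrams together
```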

What does CountVectorizer do?

Scikit-learn's CountVectorizer converts a collection of text documents into a matrix of token counts, with one row per document and one column per vocabulary term. It can also pre-process the text (for example lowercasing and stop-word removal) before counting, which makes it a flexible feature extractor for text.

Is CountVectorizer case sensitive?

The default value of lowercase in CountVectorizer is True, so all document text is lowercased before tokenizing. However, the entries of a vocabulary that you supply yourself through the vocabulary argument are not lowercased. If such a vocabulary contains uppercase characters, those entries will never match the lowercased document text.
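A small sketch of that behaviour, using two made-up documents (the sample strings and variable names are assumptions for illustration only):

```python
import warnings
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents with mixed casing
docs = ["Kowloon Bay", "kowloon BAY tower"]

# Default lowercase=True: casing variants collapse into one term
default_terms = sorted(CountVectorizer().fit(docs).vocabulary_)

# lowercase=False: each casing variant becomes its own term
cased_terms = sorted(CountVectorizer(lowercase=False).fit(docs).vocabulary_)

# A user-supplied vocabulary is used as given, so "Kowloon" never matches
# the lowercased text; newer scikit-learn versions may warn about this
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fixed_cv = CountVectorizer(vocabulary=["Kowloon", "bay"])
    counts = fixed_cv.fit_transform(docs).toarray()

print(default_terms)    # ['bay', 'kowloon', 'tower']
print(cased_terms)      # ['BAY', 'Bay', 'Kowloon', 'kowloon', 'tower']
print(counts.tolist())  # [[0, 1], [0, 1]]
```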


1 Answer

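You could pass a custom text preprocessor that removes the digits to the CountVectorizer object. A custom preprocessor replaces the built-in lowercasing step, so the function lowercases the text itself:

import re
from sklearn.feature_extraction.text import CountVectorizer

def preprocess_text(text):
    text = text.lower()              # the built-in lowercasing is bypassed
    text = re.sub(r'\d+', '', text)  # strip every run of digits
    return text

count_vectorizer = CountVectorizer(preprocessor=preprocess_text)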
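One thing to be aware of: stripping digits turns a token such as '10th' into 'th', so letter fragments can remain in the vocabulary. The sketch below compares the preprocessor approach with an alternative that is not part of the answer above: a letters-only token_pattern, which drops any token containing a digit entirely. The sample addresses and variable names are made up for illustration.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical address strings, standing in for data['Adj_Addr']
docs = ["12F Tower 1, 100 Nathan Road", "Flat 3A, 10th Floor, Nathan Road"]

def preprocess_text(text):
    # Same idea as the answer: lowercase, then delete all digits
    return re.sub(r'\d+', '', text.lower())

# Approach 1: digit-stripping preprocessor ('10th' leaves 'th' behind)
stripped = CountVectorizer(preprocessor=preprocess_text).fit(docs)
stripped_terms = sorted(stripped.vocabulary_)

# Approach 2 (assumed alternative): keep only tokens of 2+ ASCII letters,
# so '12f', '3a' and '10th' are rejected as whole tokens
letters_only = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z]{2,}\b').fit(docs)
letters_terms = sorted(letters_only.vocabulary_)

print(stripped_terms)  # ['flat', 'floor', 'nathan', 'road', 'th', 'tower']
print(letters_terms)   # ['flat', 'floor', 'nathan', 'road', 'tower']
```

Which one fits depends on whether fragments like 'th' or 'f' carry any meaning in your addresses; the letters-only pattern also discards non-ASCII letters, so widen the character class if your data needs them.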
Franco Piccolo answered Nov 04 '22 15:11
