How can I eliminate numeric tokens from the CountVectorizer vocabulary? My code:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=50, stop_words='english', max_features=5000, analyzer='word')
cv_fit_addr = cv.fit_transform(data['Adj_Addr'])
print(cv.get_feature_names())
['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '100', '1001', '1002', '1003', '1004', '1005', '1008', '101', '1010', '102', '103', '104', '105', '106', '107', '108', '109', '10f', '10th', '11', '1101', '1102', '1103', '1104', '1105', '1106', '1108', '111', '1111', '113', '114', '116', '118', '11f', '11th', '12', '120', '1201', '1202', '1203', '1204', '1206', '1208', '121', '122', '123', '125', '126', '128', '12a', '12f', '12th', '13', '1301', '1302', '1303', '1305', '1308', '132', '133', '135', '138', '139', '13f', '13th', '14', '141', '143', '148', '14f', '14th', '15', '150', '1501', '1502', '1503', '1505', '151', '153', '15f', '15th', '16', '160', '1601', '1602', '1603', '1608', '165', '168', '169', '16f', '16th', '17', '1701', '1702', '1703', '1705', '17f', '17th', '18', '1801', '1803', '181', '182', '183', '188', '18f', '18th', '19', '1901', '1902', '191', '193', '19f', '19th', '1a', '1b', '1f', '1st', '20', '200', '2001', '2003', '201', '202', '203', '204', '205', '206', '208', '20f', '20th', '21', '210', '2101', '2103', '211', '21f', '21st', '22', '220', '2201', '223', '228', '22f', '22nd', '23', '2301', '231', '23f', '23rd', '24', '248', '25', '255', '25f', '25th', '26', '2601', '26f', '26th', '27', '2701', '27f', '28', '28f', '29', '29th', '2a', '2b', '2f', '2g', '2nd', '30', '301', '302', '303', '305', '306', '307', '308', '30f', '31', '311', '32', '33', '338', '34', '35', '36', '37', '370', '38', '388', '39', '392', '3a', '3b', '3f', '3rd', '40', '401', '402', '403', '404', '405', '406', '407', '41', '418', '42', '43', '44', '45', '46', '47', '479', '48', '489', '49', '491', '4a', '4f', '4th', '50', '500', '501', '502', '503', '505', '509', '51', '510', '511', '52', '53', '538', '54', '55', '555', '56', '57', '576', '58', '582', '59', '592', '5a', '5b', '5f', '5th', '60', '601', '602', '603', '605', '607', '608', '609', '61', '610', '611', '62', '625', '63', '64', '65', '66', '67', '68', '681', '69', '6a', '6f', '6th', '70', '701', '702', '703', '704', '705', '706', '707', '71', '712', '72', '73', '74', '75', '76', '760', '77', '777', '778', '78', '788', '79', '7a', '7f', '7th', '80', '800', '801', '802', '803', '804', '805', '806', '807', '808', '81', '810', '82', '83', '833', '838', '84', '852', '87', '88', '883', '89', '8a', '8f', '8th', '901', '902', '903', '904', '905', '906', '907', '908', '909', '912', '92', '93', '94', '95', '979', '98', '99', '9a', '9f', '9th', 'a1', 'a2', 'a3', 'a5', 'aberdeen', 'academic', 'accessories', 'ace', 'admiralty', 'advanced', 'ag', 'aia', 'air', 'airport', 'alexandra', 'alliance', 'allied', 'alpha', 'america', 'ap', 'apartment', 'apparel', 'argyle', 'arrow', 'art', 'ashley', 'asia', 'asset', 'associates', 'atl', 'attn', 'au', 'austin', 'avenue', 'aviation', 'axa', 'b1', 'b2', 'ba', 'bank', 'baptist', 'bay', 'bea', 'bear', 'beauty', 'bel', 'berth', 'best', 'beverly', 'billion', 'bio', 'biotech', 'bldg', 'block', 'blue', 'bo', 'bonham', 'br', 'branch', 'bright', 'broadway', 'bu', 'building', 'bun', 'business', 'c1', 'ca', 'cable', 'cambridge', 'cameron', 'canton', 'capital', 'cargo', 'castle', 'causeway', 'cc', 'cct', 'cent', 'centra'
CountVectorizer will tokenize the text and split it into chunks called n-grams, whose length we can set by passing a tuple to the ngram_range argument. For example, (1, 1) gives unigrams (1-grams) such as "whey" and "protein", while (2, 2) gives bigrams (2-grams) such as "whey protein".
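A minimal sketch of the difference (the example documents are invented; get_feature_names_out requires scikit-learn >= 1.0, while older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["whey protein powder", "whey protein isolate"]

# ngram_range=(1, 1): unigrams only
unigram_cv = CountVectorizer(ngram_range=(1, 1))
unigram_cv.fit(docs)
print(unigram_cv.get_feature_names_out())
# ['isolate' 'powder' 'protein' 'whey']

# ngram_range=(2, 2): bigrams only
bigram_cv = CountVectorizer(ngram_range=(2, 2))
bigram_cv.fit(docs)
print(bigram_cv.get_feature_names_out())
# ['protein isolate' 'protein powder' 'whey protein']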
Scikit-learn's CountVectorizer is used to convert a collection of text documents into a matrix of token counts. It also enables pre-processing of the text data prior to generating the vector representation, which makes it a highly flexible feature-representation module for text.
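To make that concrete, a small sketch with made-up address strings, showing how each document becomes a row of token counts. Note that the numeric token '10' survives by default, which is exactly the problem in the question:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["10 Main Street", "Main Street Building 10"]
cv = CountVectorizer()
counts = cv.fit_transform(docs)
print(cv.get_feature_names_out())  # ['10' 'building' 'main' 'street']
print(counts.toarray())
# [[1 0 1 1]
#  [1 1 1 1]]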
The default value for lowercase in CountVectorizer is True, so all document content is lowercased by default. However, entries in a user-supplied vocabulary are not lowercased, so a vocabulary entry that contains uppercase characters will never match the lowercased document content.
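A quick illustration of that pitfall, using a made-up one-document corpus:

from sklearn.feature_extraction.text import CountVectorizer

# The document is lowercased, but the supplied vocabulary is not,
# so the uppercase entry never matches anything.
cv_upper = CountVectorizer(vocabulary=['Street'])
print(cv_upper.fit_transform(["Street street STREET"]).toarray())  # [[0]]

# A lowercase vocabulary entry matches all three occurrences.
cv_lower = CountVectorizer(vocabulary=['street'])
print(cv_lower.fit_transform(["Street street STREET"]).toarray())  # [[3]]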
You could pass a custom text preprocessor that removes the digits to the CountVectorizer object, like:
import re
from sklearn.feature_extraction.text import CountVectorizer

def preprocess_text(text):
    text = text.lower()              # a custom preprocessor replaces the built-in lowercasing
    text = re.sub(r'\d+', '', text)  # strip every run of digits before tokenization
    return text

count_vectorizer = CountVectorizer(preprocessor=preprocess_text)
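Applied to a couple of hypothetical address strings, the digit tokens disappear from the vocabulary (the default token pattern also drops the single letters left behind, such as the 'f' in '12/F'):

docs = ["Unit 1201, 12/F, Tower 2", "88 Queen's Road Central"]
count_vectorizer.fit_transform(docs)
print(count_vectorizer.get_feature_names_out())
# ['central' 'queen' 'road' 'tower' 'unit']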