Delete every non-UTF-8 symbol from a string

Tags: python, encode, mongodb

I have a large number of files and a parser. What I have to do is strip all non-UTF-8 symbols and put the data into MongoDB. Currently I have code like this:

with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('utf-8', 'ignore')
        line = line.encode('utf-8', 'ignore')

Somehow I still get an error:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8:  1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin

I don't get it. Is there some simple way to do it?

UPD: It seems that Python and MongoDB don't agree on the definition of a valid UTF-8 string.

asked Oct 24 '14 05:10

Darth Kotik

2 Answers

Try the line below in place of the last two lines of your loop (this is Python 2, where line is a byte string). Hope it helps:

line=line.decode('utf-8','ignore').encode("utf-8")

answered Oct 13 '22 09:10

Irshad Bhat

For Python 3, as mentioned in a comment in this thread, you can do:

line = bytes(line, 'utf-8').decode('utf-8', 'ignore')

The 'ignore' argument prevents an error from being raised when some bytes cannot be decoded; those bytes are simply dropped from the result.
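As a quick illustration of what 'ignore' does, here is a small Python 3 sketch comparing it with the default strict handler and with 'replace'. The sample bytes are made up for the demo and are not taken from the question's data. Note that \xe2\x86\x90, the sequence shown in the error message, is itself valid UTF-8 (it decodes to U+2190), so decoding leaves it in place.

```python
# Made-up sample: ASCII text, one stray byte (0xff is never valid in UTF-8),
# more ASCII, then the valid three-byte sequence for U+2190.
raw = b"abc\xffdef\xe2\x86\x90"

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD, so the damage stays visible in the output.
replaced = raw.decode("utf-8", "replace")

# The default handler ('strict') raises instead of returning a string.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

Whether to drop or replace bad bytes is a judgment call: 'ignore' gives the cleanest strings, while 'replace' keeps a marker where data was lost.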

If your line is already a bytes object (e.g. b'my string') then you just need to decode it with decode('utf-8', 'ignore').
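Applied to the file-reading loop from the question, that could look roughly like the sketch below for Python 3. It assumes the files are opened in binary mode so each line arrives as a bytes object; clean_lines is a hypothetical helper name, and the temporary file exists only to make the example runnable.

```python
import os
import tempfile


def clean_lines(path):
    """Yield stripped lines from path, dropping bytes that are not valid UTF-8.

    Hypothetical helper for illustration; not part of any library.
    """
    # Binary mode yields bytes objects, so decode() can be called on each line.
    with open(path, "rb") as fp:
        for raw in fp:
            yield raw.decode("utf-8", "ignore").strip()


# Demo with a throwaway file containing one stray 0xff byte.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"good line\nbad \xff byte\n")
    demo_path = tmp.name

cleaned = list(clean_lines(demo_path))
os.remove(demo_path)
print(cleaned)
```

An alternative with the same effect is to let open() do the decoding via open(path, encoding="utf-8", errors="ignore"), which yields str lines directly.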

answered Oct 13 '22 10:10

AlexG
