Assuming I have a column called df.Text which contains text (more than one sentence), and I want to use polyglot's Detector to detect the language and store the value in a new column df['Text-Lang'], how do I ensure I also capture the other details, like the language code and confidence?
from polyglot.detect import Detector

testEng = "This is English"
lang = Detector(testEng)
print(lang.language)

returns

name: English code: en confidence: 94.0 read bytes: 1920
but

df['Text-Lang','Text-LangConfidence'] = df.Text.apply(Detector)

ends with

AttributeError: 'float' object has no attribute 'encode'

and Detector is not able to detect the language reliably.
Am I applying the Detector function incorrectly, storing the output incorrectly, or is it something else?
First, if you only need polyglot for language detection, you'd be better off using pycld2 directly, which is what polyglot uses behind the scenes. It has a much cleaner API.
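For example, a minimal sketch using pycld2 directly (the detect_lang helper and the 'Text-LangCode' column name are just illustrative; the percent field is cld2's estimate of how much of the text is in that language, roughly the confidence value polyglot reports):

import pycld2 as cld2

def detect_lang(text):
    # cld2.detect returns (isReliable, bytesFound, details); details[0] is the
    # best guess as a (languageName, languageCode, percent, score) tuple
    _, _, details = cld2.detect(str(text))
    name, code, percent, _ = details[0]
    return name, code, percent

# Unpack the three values into three new columns
df['Text-Lang'], df['Text-LangCode'], df['Text-LangConfidence'] = zip(*df.Text.map(detect_lang))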
That said, the error you report comes from one of the values in your Text column being a float rather than a string, so you will have to convert such values to strings first. The next problem you will stumble upon is minimal text length: polyglot will throw an exception if the text is too short. You can silence the exception by passing quiet=True.
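For instance, a short sketch of the difference (the exact exception type and the fallback behaviour may vary between polyglot versions):

from polyglot.detect import Detector

# Very short or undetectable text makes Detector raise by default
try:
    Detector("Hi")
except Exception as exc:
    print(exc)

# With quiet=True no exception is raised and detection falls back to an "unknown" result
lang = Detector("Hi", quiet=True)
print(lang.language.code)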
Now, applying Detector returns an object, so you will have to parse it to extract the information you want. To extract language names, you will have to import the icu module (it is a dependency of polyglot, so you already have it installed):
import icu
from polyglot.detect import Detector

# Make sure every value is a string; float values are what caused the AttributeError
df.Text = df.Text.astype(str)

# quiet=True keeps Detector from raising on texts that are too short to detect
df['poly_obj'] = df.Text.apply(lambda x: Detector(x, quiet=True))

# Pull the human-readable language name and the confidence out of the Detector object
df['Text-Lang'] = df['poly_obj'].apply(lambda x: icu.Locale.getDisplayName(x.language.locale))
df['Text-LangConfidence'] = df['poly_obj'].apply(lambda x: x.language.confidence)
After that you can drop the poly_obj column.
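For example:

df = df.drop(columns=['poly_obj'])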