Can I revert StringDocument back into a string? (TextAnalysis.jl)

Question

I'm making a spam classifier using a Naive Bayes Classifier model from the Julia TextAnalysis.jl package.

The text pre-processing functions (like remove_corrupt_utf8!(sd), where sd is a StringDocument) can only be applied to the package's own document types, not to plain strings.

Is there any way to convert this StringDocument back into a string so I can put it back into my dataframe?

Current code:

for row in eachrow(data)
    message = row.v2
    doc = StringDocument(message)
    remove_corrupt_utf8!(doc)  # remove corrupt characters (if any) so the model doesn't fail
    # TODO: convert doc back into a string so the preprocessed text ends up in the dataframe
end

Any help would be appreciated.

David Varela · Accepted Answer

Use the text function to get the processed string back out of the document:

julia> str = StringDocument("here are some punctuations !!!...");

julia> prepare!(str, strip_punctuation)

julia> text(str)
"here are some punctuations "
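
Applied to the loop in the question, this could look roughly like the sketch below. It assumes the dataframe comes from DataFrames.jl and has a string column named v2, as in the question; the sample rows and the helper name clean_message are made up for illustration and are not part of TextAnalysis.jl.

```julia
using DataFrames
using TextAnalysis

# Hypothetical stand-in for the asker's dataframe; v2 holds the raw messages.
data = DataFrame(v2 = ["Free entry!!! Call now...", "See you at 5, ok?"])

# Hypothetical helper: wrap the string in a StringDocument, clean it in place,
# then pull the cleaned string back out with text().
function clean_message(message::AbstractString)
    doc = StringDocument(String(message))
    remove_corrupt_utf8!(doc)
    return text(doc)
end

# Broadcast over the column and overwrite it with the cleaned strings.
data.v2 = clean_message.(data.v2)
```

Replacing the whole column in one assignment avoids mutating rows inside eachrow, and any further prepare! steps can go inside the helper before the call to text.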

Tags:

nlp

julia

PseudoCodeNerd
