processing strings of text for neural network input

Tags: textinput, neural-network, preprocessor, normalize, standardized

I understand that ANN input must be normalized, standardized, etc. Leaving the peculiarities and models of various ANNs aside, how can I preprocess UTF-8 encoded text into the range [0, 1], or alternatively [-1, 1], before it is given as input to a neural network? I have been searching for this on Google but can't find any information (I may be using the wrong term).

1. Does that make sense?
2. Isn't that how text is preprocessed for neural networks?
3. Are there any alternatives?

Update on November 2013

I have long accepted Pete's answer as correct. However, I have serious doubts, mostly due to recent research I've been doing on symbolic knowledge and ANNs.

Dario Floreano and Claudio Mattiussi explain in their book that such processing is indeed possible by using distributed encoding.

Indeed, if you try a Google Scholar search, there is a plethora of neuroscience articles and papers on how distributed encoding is hypothesized to be used by brains to encode symbolic knowledge.

Teuvo Kohonen, in his paper "Self Organizing Maps", explains:

One might think that applying the neural adaptation laws to a symbol set (regarded as a set of vectorial variables) might create a topographic map that displays the "logical distances" between the symbols. However, there occurs a problem which lies in the different nature of symbols as compared with continuous data. For the latter, similarity always shows up in a natural way, as the metric differences between their continuous encodings. This is no longer true for discrete, symbolic items, such as words, for which no metric has been defined. It is in the very nature of a symbol that its meaning is dissociated from its encoding.

However, Kohonen did manage to deal with Symbolic Information in SOMs!

Furthermore, Prof Dr Alfred Ultsch, in his paper "The Integration of Neural Networks with Symbolic Knowledge Processing", deals exactly with how to process symbolic knowledge (such as text) in ANNs. Ultsch offers the following methodologies for processing symbolic knowledge: Neural Approximative Reasoning, Neural Unification, Introspection and Integrated Knowledge Acquisition. However, little information can be found on those in Google Scholar, or anywhere else for that matter.

Pete is right about semantics in his answer: semantics in ANNs are usually disconnected. However, the following references provide insight into how researchers have used RBMs trained to recognize similarity in the semantics of different word inputs. It therefore shouldn't be impossible to capture semantics, but doing so would require a layered approach, or a secondary ANN.

Natural Language Processing With Subsymbolic Neural Networks, Risto Miikkulainen, 1997
Training Restricted Boltzmann Machines on Word Observations, G. E. Dahl, Ryan P. Adams, H. Larochelle, 2012

Update on January 2021

The field of NLP and deep learning has seen a resurgence in research in the past few years, since I asked this question. There are now machine-learning models that address what I was trying to achieve in many different ways.

For anyone arriving at this question wondering how to pre-process text for deep learning or neural networks, here are a few helpful topics. None of them are academic, but they are simple to understand and should get you started on solving similar tasks:

Vector Space Models
Transformers
Recurrent and Convolutional Networks for Text Classification
Word Embedding
Text Pre-processing
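
As a rough illustration of the first topic, here is a minimal vector space model sketch in plain Python. It is only one of many possible schemes, not a recommendation of a specific library: the example documents and the helper names (`tokenize`, `term_vector`, `cosine_similarity`) are made up for this illustration. Each document becomes a vector of term counts, and documents that share words end up closer together.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple pre-processing: lowercase and split on whitespace."""
    return text.lower().split()


def term_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the token list."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical example documents.
documents = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "neural networks process numeric vectors",
]
tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
vectors = [term_vector(tokens, vocabulary) for tokens in tokenized]

# The first two documents share several words, the third shares none.
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

Raw counts like these would usually still be scaled (for example by vector length) before being fed to a network, and learned word embeddings replace the counts with dense vectors.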

At the time I asked this question, RNNs, CNNs and VSMs were only just starting to be used; nowadays most deep learning frameworks offer extensive NLP support. Hope the above helps.

440

asked Feb 09 '13 00:02

Ælex

2 Answers

I'll go ahead and summarize our discussion as the answer here.

Your goal is to be able to incorporate text into your neural network. We have established that traditional ANNs are not really suitable for analyzing text. The underlying reason is that ANNs operate on inputs that generally take a continuous range of values, where the nearness of two input values implies some nearness in their meaning. Words have no such notion of nearness, so there is no real numerical encoding for words that makes sense as input to an ANN.

On the other hand, a solution that might work is to use a more traditional semantic analysis, which could perhaps produce sentiment ranges for a list of topics; those topics and their sentiment values could then be used as input for an ANN.
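
A very rough sketch of that idea, assuming a tiny hand-made sentiment lexicon and topic keyword lists (`SENTIMENT`, `TOPICS` and the function name are all hypothetical, invented for this example; a real semantic analysis would be far more involved). The point is only that the output is a short, fixed-length vector of values in [-1, 1], which is the kind of input an ANN can work with.

```python
# Hypothetical toy lexicon and topic keyword lists, made up for illustration.
SENTIMENT = {"great": 1.0, "good": 0.5, "slow": -0.5, "terrible": -1.0}
TOPICS = {"battery": {"battery", "charge"}, "screen": {"screen", "display"}}


def topic_sentiment_features(text):
    """Return one value in [-1, 1] per topic: the mean sentiment of the
    sentences that mention that topic (0.0 if the topic never appears)."""
    scores = {topic: [] for topic in TOPICS}
    for sentence in text.lower().split("."):
        words = sentence.split()
        values = [SENTIMENT[w] for w in words if w in SENTIMENT]
        if not values:
            continue  # no sentiment words in this sentence
        sentence_score = sum(values) / len(values)
        for topic, keywords in TOPICS.items():
            if keywords & set(words):
                scores[topic].append(sentence_score)
    return [sum(s) / len(s) if s else 0.0 for s in scores.values()]


features = topic_sentiment_features(
    "The battery is great. The screen is terrible and slow."
)
print(features)  # [1.0, -0.75], one entry per topic in TOPICS order
```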

136

answered Sep 21 '22 14:09

Pete

In response to your comments: no, your proposed scheme doesn't quite make sense. An artificial neuron's output by its nature represents a continuous or at least a binary value. It does not make sense to map between a huge discrete enumeration (like UTF-8 characters) and the continuous range represented by a floating point value. The ANN will necessarily act as if 0.1243573 is an extremely good approximation to 0.1243577, when those numbers could easily be mapped to the newline character and the character "a", for example, which would not be good approximations for each other at all.
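
To make the problem concrete, here is a small sketch. It assumes the scalar scheme divides each character's code point by the largest Unicode code point (0x10FFFF); that is just one way to squeeze characters into [0, 1], and the function names are made up for the example. A one-hot encoding over a small known alphabet is shown alongside for contrast.

```python
def scalar_encode(ch):
    """Map a character to [0, 1] by dividing its code point by the largest one."""
    return ord(ch) / 0x10FFFF


def one_hot_encode(ch, alphabet):
    """Give each character in a small, known alphabet its own input unit."""
    return [1.0 if ch == symbol else 0.0 for symbol in alphabet]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


newline, letter_a, letter_b = "\n", "a", "b"

# Scalar encoding: all three characters collapse to nearly identical floats,
# and the tiny differences only reflect code point order, not meaning.
gap_newline_a = abs(scalar_encode(newline) - scalar_encode(letter_a))
gap_a_b = abs(scalar_encode(letter_a) - scalar_encode(letter_b))
print(gap_newline_a, gap_a_b)

# One-hot encoding: every pair of distinct characters is equally far apart,
# so the network is not told that any two symbols are "almost the same".
alphabet = newline + letter_a + letter_b
print(squared_distance(one_hot_encode(newline, alphabet),
                       one_hot_encode(letter_a, alphabet)))
print(squared_distance(one_hot_encode(letter_a, alphabet),
                       one_hot_encode(letter_b, alphabet)))
```

The one-hot version only works because the alphabet here is tiny; it is not a general answer for arbitrary Unicode text.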

Quite frankly, there is no reasonable representation for a "general Unicode string" as input to an ANN. A reasonable representation depends on the specifics of what you're doing. It depends on your answers to the following questions:

Are you expecting words to show up in the input strings as opposed to blocks of characters? What words are you expecting to show up in the strings?
What is the length distribution of the input strings?
What is the expected entropy of the input strings?
Is there any domain specific knowledge you have about what you expect the strings to look like?

and most importantly

What are you trying to do with the ANN? This is not something you can ignore.

It's possible you might have a setup for which there is no translation that will actually allow you to do what you want with the neural network. Until you answer those questions (you skirt around them in your comments above), it's impossible to give a good answer.

I can give an example answer that would work if you happened to give certain answers to the above questions. For example, if you are reading in strings of arbitrary length but composed of a small vocabulary of words separated by spaces, then I would suggest a translation scheme where you make N inputs, one for each word in the vocabulary, and use a recurrent neural network to feed in the words one at a time by setting the corresponding input to 1 and all the others to 0.
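
That scheme could be sketched roughly as follows. The four-word vocabulary is hypothetical, and `TinyRNN` is an untrained Elman-style layer with random weights, written in plain Python only to show how the one-hot vectors are fed in one word per time step; it is not a usable model, and a real setup would use a trained network from a deep learning framework.

```python
import math
import random

# Hypothetical small vocabulary: N = 4 inputs, one per word.
vocabulary = ["open", "close", "door", "window"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return N inputs: 1.0 for this word's unit, 0.0 for all the others."""
    vector = [0.0] * len(vocabulary)
    vector[word_to_index[word]] = 1.0
    return vector


def encode(text):
    """Split on spaces and turn each word into its one-hot input vector."""
    return [one_hot(word) for word in text.split()]


class TinyRNN:
    """Untrained recurrent layer: h_t = tanh(W_in x_t + W_rec h_{t-1})."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                     for _ in range(n_hidden)]
        self.w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                      for _ in range(n_hidden)]

    def run(self, sequence):
        hidden = [0.0] * self.n_hidden
        for x in sequence:  # one word per time step
            hidden = [
                math.tanh(
                    sum(w * xi for w, xi in zip(self.w_in[j], x))
                    + sum(w * hi for w, hi in zip(self.w_rec[j], hidden))
                )
                for j in range(self.n_hidden)
            ]
        return hidden  # fixed-size summary of a variable-length string


rnn = TinyRNN(n_inputs=len(vocabulary), n_hidden=3)
inputs = encode("open door")
final_state = rnn.run(inputs)
print(inputs)  # [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
```

Because the recurrent layer carries its hidden state from word to word, strings of any length map to a state of the same size, which is what makes the arbitrary-length requirement workable.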

answered Sep 19 '22 14:09

Jeremy Salwen
