The below code breaks the sentence into individual tokens and the output is as below
"cloud" "computing" "is" "benefiting" " major" "manufacturing" "companies"
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
print(token.text)
What I would ideally want is, to read 'cloud computing' together as it is technically one word.
Basically I am looking for a bi gram. Is there any feature in Spacy that allows Bi gram or Tri grams ?
1.-First you need to create a list with the text of the documents 2.-Then you join the text lists in just one document 3.-now you use the spacy parser to transform the text document in a Spacy document 4.-You use the Zuzana's answer's to create de bigrams This is the example code:
The bigrams here are: Trigrams: Trigram is 3 consecutive words in a sentence. For the above example trigrams will be: From the above bigrams and trigram, some are relevant while others are discarded which do not contribute value for further processing. Let us say from a document we want to find out the skills required to be a “Data Scientist”.
The sentence_2gram is: "I_like", "like_cloud", "cloud_computing", "computing_because" ... Comparing that your bigram list only "cloud_computing" is recognized as a valid bigram; all other bigrams in the sentence are artificial.
Trigrams: Trigram is 3 consecutive words in a sentence. For the above example trigrams will be: From the above bigrams and trigram, some are relevant while others are discarded which do not contribute value for further processing. Let us say from a document we want to find out the skills required to be a “Data Scientist”.
Spacy allows the detection of noun chunks. So to parse your noun phrases as single entities do this:
Detect the noun chunks https://spacy.io/usage/linguistic-features#noun-chunks
Merge the noun chunks
Do dependency parsing again, it would parse "cloud computing" as single entity now.
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp("Cloud computing is benefiting major manufacturing companies")
>>> list(doc.noun_chunks)
[Cloud computing, major manufacturing companies]
>>> for noun_phrase in list(doc.noun_chunks):
... noun_phrase.merge(noun_phrase.root.tag_, noun_phrase.root.lemma_, noun_phrase.root.ent_type_)
...
Cloud computing
major manufacturing companies
>>> [(token.text,token.pos_) for token in doc]
[('Cloud computing', 'NOUN'), ('is', 'VERB'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]
If you have a spacy doc
, you can pass it to textacy:
ngrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))
Warning: This just and extension of the right answer made by Zuzana.
My reputation does not allow me to comment so I am making this answer just to answer the question of Adit Sanghvi above: How do you do it when you have a list of documents?
1.-First you need to create a list with the text of the documents 2.-Then you join the text lists in just one document 3.-now you use the spacy parser to transform the text document in a Spacy document 4.-You use the Zuzana's answer's to create de bigrams
This is the example code:
Step 1
doc1 = ['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code']
doc2 = ['how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy']
doc3 = ['i love to repeat phrases to make bigrams because i love make bigrams']
listOfDocuments = [doc1,doc2,doc3]
textList = [''.join(textList) for text in listOfDocuments for textList in text]
print(textList)
This will print this text:
['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code', 'how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy', 'i love to repeat phrases to make bigrams because i love make bigrams']
then step 2 and 3:
doc = ' '.join(textList)
spacy_doc = parser(doc)
print(spacy_doc)
and will print this:
all what i want is that you give me back my code because i worked a lot on it. Just give me back my code how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy i love to repeat phrases to make bigrams because i love make bigrams
Finally step 4 (Zuzana's answer)
ngrams = list(textacy.extract.ngrams(spacy_doc, 2, min_freq=2))
print(ngrams)
will print this:
[make bigrams, make bigrams, make bigrams]
I had a similar problem (bigrams, trigrams, like your "cloud computing"). I made a simple list of the n-grams, word_3gram, word_2grams etc., with the gram as basic unit (cloud_computing).
Assume I have the sentence "I like cloud computing because it's cheap". The sentence_2gram is: "I_like", "like_cloud", "cloud_computing", "computing_because" ... Comparing that your bigram list only "cloud_computing" is recognized as a valid bigram; all other bigrams in the sentence are artificial. To recover all other words you just take the first part of the other words,
"I_like".split("_")[0] -> I;
"like_cloud".split("_")[0] -> like
"cloud_computing" -> in bigram list, keep it.
skip next bi-gram "computing_because" ("computing" is already used)
"because_it's".split("_")[0]" -> "because" etc.
To also capture the last word in the sentence ("cheap") I added the token "EOL". I implemented this in python, and the speed was OK (500k words in 3min), i5 processor with 8G. Anyway, you have to do it only once. I find this more intuitive than the official (spacy-style) chunk approach. It also works for non-spacy frameworks.
I do this before the official tokenization/lemmatization, as you would get "cloud compute" as possible bigram. But I'm not certain if this is the best/right approach.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With