
Running nltk.download in an Azure Synapse notebook raises ValueError: I/O operation on closed file

I'm experimenting with NLTK in an Azure Synapse notebook. When I try to run nltk.download('stopwords'), I get the following error:

ValueError: I/O operation on closed file
Traceback (most recent call last):

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 782, in download
    show(msg.message)

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 775, in show
    subsequent_indent=prefix + prefix2 + " " * 4,

  File "/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1616860588116_0001/container_1616860588116_0001_01_000001/tmp/9026485902214290372", line 536, in write
    super(UnicodeDecodingStringIO, self).write(s)

ValueError: I/O operation on closed file

If I instead run nltk.download() with no arguments, I get a different error:

EOFError: EOF when reading a line
Traceback (most recent call last):

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 765, in download
    self._interactive_download()

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 1117, in _interactive_download
    DownloaderShell(self).run()

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 1143, in run
    user_input = input("Downloader> ").strip()

EOFError: EOF when reading a line

I'm hoping someone can help me understand what is causing this and how to get around it. I haven't been able to find much information on where to go from here.

Edit: The code I am using to generate the error is the following:

import nltk
nltk.download('stopwords')

Update: I ended up opening a support request with Microsoft and got this working with sc.addFile. The full steps are in my answer below.

Thanks!

asked Sep 04 '25 at 03:09 by User181


1 Answer

I ended up opening a support request with Microsoft, and this was their response:

  "Synapse does not support arbitrary shell scripts which is where you would download the related model corpus for NLTK"

They recommended sc.addFile, which I got working. In case anyone else runs into this, here's what I did:

  1. Download the NLTK stopwords corpus from http://nltk.org/nltk_data/
  2. Upload the stopwords to the following folder in storage: abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/corpora/stopwords/
  3. Run the code below to load them.

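import os
import nltk
from pyspark import SparkFiles

# Add the nltk_data folder from storage (True = recursive, so subfolders are included).
# sc is the SparkContext already available in the notebook session.
sc.addFile('abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/', True)

# Append the local copy of nltk_data to NLTK's search path
nltk.data.path.append(os.path.join(SparkFiles.getRootDirectory(), 'nltk_data'))

# Load the English stopword list
nltk.corpus.stopwords.words('english')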
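
If the last line raises a LookupError, the files have probably not landed where NLTK looks for them. Here is a minimal sketch of a layout check, assuming the folder structure from step 2 (nltk_data/corpora/stopwords/<language>, one plain file per language). The helper name missing_stopword_files is hypothetical and not part of NLTK or PySpark, and the demo runs against a throwaway local directory rather than real storage:

```python
import os
import tempfile


def missing_stopword_files(nltk_data_dir, languages=("english",)):
    """Return the expected stopword file paths that do not exist under nltk_data_dir."""
    corpus_dir = os.path.join(nltk_data_dir, "corpora", "stopwords")
    expected = [os.path.join(corpus_dir, language) for language in languages]
    return [path for path in expected if not os.path.isfile(path)]


# Demo: build a small stand-in nltk_data tree in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "nltk_data")
    stopwords_dir = os.path.join(data_dir, "corpora", "stopwords")
    os.makedirs(stopwords_dir)
    with open(os.path.join(stopwords_dir, "english"), "w") as handle:
        handle.write("the\nand\n")

    # An empty list means every requested language file was found.
    print(missing_stopword_files(data_dir))
    # Languages that were never uploaded show up as missing paths.
    print(missing_stopword_files(data_dir, ("english", "german")))
```

In the notebook, the directory to check would be os.path.join(SparkFiles.getRootDirectory(), 'nltk_data'), the same path that gets appended to nltk.data.path above.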

Thanks!

answered Sep 07 '25 at 17:09 by User181