I am currently looking at a Keras program that tries to generate text using a CNN. In the code provided by my professor, the corpus is downloaded with:
path = get_file('input.txt', origin='https://www.dropbox.com/s/2z0zdn54cqu3cqj/input.txt?dl=0')
get_file is imported with:
from keras.utils.data_utils import get_file
The original text corpus provided to us worked just fine. However, when I changed the origin URL passed to get_file and the file name to save it under, I started getting HTML code instead of text. Is there a particular reason for this? For example, I get HTML code with both https://github.com/nlp-compromise/nlp-corpus/blob/master/poe/man_of_crowd.txt and https://raw.githubusercontent.com/nlp-compromise/nlp-corpus/master/poe/man_of_crowd.txt (the second link is the raw file).
The first link, https://github.com/nlp-compromise/nlp-corpus/blob/master/poe/man_of_crowd.txt, looks as if it resolves to a text file, but it is actually an HTML page on GitHub that displays the file. That is why you get HTML code when downloading from this link.
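To illustrate the difference between the two URL forms, here is a minimal sketch that rewrites a GitHub "blob" page URL into the corresponding raw-file URL. The helper name github_blob_to_raw is hypothetical (it is not part of Keras or GitHub tooling), and it assumes the usual <user>/<repo>/blob/<branch>/<path> layout seen in the links above.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url):
    """Hypothetical helper: map a github.com 'blob' page URL to its raw-file URL."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Assumed shape: <user>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url  # not a blob page; leave the URL unchanged
    user, repo, _blob, *rest = segments
    raw_path = "/" + "/".join([user, repo] + rest)
    return urlunsplit((parts.scheme, "raw.githubusercontent.com", raw_path, "", ""))


blob_url = "https://github.com/nlp-compromise/nlp-corpus/blob/master/poe/man_of_crowd.txt"
print(github_blob_to_raw(blob_url))
```

The printed URL is the raw link from the question, which is the one to pass as origin to get_file.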
The second, raw link, https://raw.githubusercontent.com/nlp-compromise/nlp-corpus/master/poe/man_of_crowd.txt, does point to the text file itself. Downloading it with get_file works as expected:
>>> from keras.utils.data_utils import get_file
>>> path = get_file('man_of_crowd.txt',
...                 'https://raw.githubusercontent.com/nlp-compromise/nlp-corpus/master/poe/man_of_crowd.txt')
Downloading data from https://raw.githubusercontent.com/nlp-compromise/nlp-corpus/master/poe/man_of_crowd.txt
16384/20391 [=======================>......] - ETA: 0s
The file is saved as plain text; its path is:
>>> print(path)
/home/<username>/.keras/datasets/man_of_crowd.txt
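As a quick sanity check before training, one could verify that the downloaded file does not start like an HTML page. This is only a rough sketch: looks_like_html and check_corpus are hypothetical helper names, and the test is a simple heuristic on the first few characters, not a full HTML detector.

```python
def looks_like_html(text):
    """Hypothetical heuristic: True if the text begins like an HTML document."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def check_corpus(path):
    """Read the start of the file and fail loudly if it looks like HTML."""
    with open(path, encoding="utf-8", errors="replace") as f:
        sample = f.read(2048)
    if looks_like_html(sample):
        raise ValueError("%s looks like an HTML page, not a plain text corpus" % path)
    return sample


# Usage, with `path` as returned by get_file above:
# check_corpus(path)
```

If an earlier attempt already saved an HTML page under the same file name, it may be worth deleting that cached copy first, since get_file reuses an existing file in its cache directory rather than downloading it again.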
Under the hood, the Keras utility function uses the six compatibility wrapper for urllib.request. The code for the get_file function can be found in the Keras GitHub repository.