Python Download file with Pandas / Urllib

I am trying to download a CSV file with Python 3.x. The file is located at: https://www.nseindia.com/content/fo/fo_mktlots.csv

I have found three ways to do it, but only one of them works. I would like to know why the other two fail, or what I am doing wrong.

Method 1: (Unsuccessful)

import pandas as pd

mytable = pd.read_table("https://www.nseindia.com/content/fo/fo_mktlots.csv",sep=",")
print(mytable)

But I get the following error:

- HTTPError: HTTP Error 403: Forbidden

Method 2: (Unsuccessful)

from urllib.request import Request, urlopen

url='https://www.nseindia.com/content/fo/fo_mktlots.csv'

url_request  = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(url_request ).read()

I got the same error as before:

 - HTTPError: HTTP Error 403: Forbidden

Method 3: (Successful)

from io import StringIO

import requests
import pandas as pd

url = 'https://www.nseindia.com/content/fo/fo_mktlots.csv'

r = requests.get(url)
df = pd.read_csv(StringIO(r.text))

I am also able to open the file with Excel VBA as below:

Workbooks.Open Filename:="https://www.nseindia.com/content/fo/fo_mktlots.csv"

Also, is there any other way to do this?

asked Jan 29 '17 08:01

Harsh Goyal

1 Answer

The website tries to prevent content scraping.

The issue is not that you are doing something wrong; it comes from how the web server is configured and how it responds to different kinds of requests.

To get past the scraping protection, send well-formed HTTP request headers. The most reliable approach is to send the same complete set of headers that a real web browser sends.

Here it works with a minimal set:

>>> myHeaders = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36', 'Referer': 'https://www.nseindia.com', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
>>> url_request  = Request(url, headers=myHeaders)
>>> html = urlopen(url_request ).read()
>>> len(html)
42864
>>>
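As a script rather than a REPL session, the header setup above might be wrapped as follows. This is only a sketch: the function name build_browser_request is hypothetical, the header values are copied from the session above, and nothing is sent over the network here. It only builds the request object and lists the headers it would carry.

```python
from urllib.request import Request

# Header values copied from the interactive session above.
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Referer': 'https://www.nseindia.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
              'image/webp,*/*;q=0.8',
}


def build_browser_request(url):
    """Return a Request carrying browser-like headers (hypothetical helper)."""
    return Request(url, headers=BROWSER_HEADERS)


request = build_browser_request('https://www.nseindia.com/content/fo/fo_mktlots.csv')

# urllib stores header names with only the first letter capitalized,
# so 'User-Agent' is kept internally as 'User-agent'.
print(sorted(name for name, _ in request.header_items()))
```

Passing the resulting object to urlopen would then perform the actual download, as in the session above. Whether the server accepts the request still depends on its current configuration.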

You can pass the urllib response object to pandas:

>>> import pandas as pd
...
>>> url_request  = Request(url, headers=myHeaders)
>>> data = urlopen(url_request )
>>> my_table = pd.read_table(data)
>>> len(my_table)
187
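A variation on the same idea, as a rough sketch: read the body first, close the connection, and hand pandas an in-memory text buffer, the way Method 3 in the question does with StringIO. The function name download_text_buffer and its opener parameter are hypothetical (opener exists only so the function can be tried without network access), and the UTF-8 decoding is an assumption about the file.

```python
import io
from urllib.request import Request, urlopen


def download_text_buffer(url, headers, opener=urlopen):
    """Fetch url with the given headers and return the body as a text buffer.

    opener is a hypothetical hook (defaults to urlopen) so that the function
    can be exercised without a network connection.
    """
    request = Request(url, headers=headers)
    # The with-block closes the connection once the body has been read.
    with opener(request) as response:
        raw = response.read()
    # Assumes the file is UTF-8 encoded; adjust if the server uses another charset.
    return io.StringIO(raw.decode('utf-8'))
```

With pandas installed, pd.read_csv(download_text_buffer(url, myHeaders)) should then produce a DataFrame. Note that pd.read_csv splits on commas by default, whereas pd.read_table splits on tabs unless sep="," is passed, as in Method 1 of the question.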

answered Sep 21 '22 10:09

Maurice Meyer
