I have some text files with the following format:
000423|东阿阿胶| 300|1|0.15000| |
000425|徐工机械| 600|1|0.15000| |
000503|海虹控股| 400|1|0.15000| |
000522|白云山A| |2| | 1982.080|
000527|美的电器| 900|1|0.15000| |
000528|柳 工| 300|1|0.15000| |
When I use read_csv to load them into a DataFrame, it doesn't infer the correct dtype for some columns. For example, the first column is parsed as int rather than unicode str, and the third column is parsed as unicode str rather than int because of a single missing value... Is there a way to preset the dtypes of the DataFrame, just like numpy.genfromtxt does?
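To make the first point concrete: once the ticker column is read as int, the leading zeros that distinguish codes such as 000423 are lost (plain-Python illustration, not actual read_csv output):

>>> int('000423')
423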
Update:
I used read_csv like this, which caused the problem:

data = pandas.read_csv(StringIO(etf_info), sep='|', skiprows=14, index_col=0,
                       skip_footer=1, names=['ticker', 'name', 'vol', 'sign',
                                             'ratio', 'cash', 'price'],
                       encoding='gbk')
In order to solve both the dtype and encoding problems, I need to use unicode() and numpy.genfromtxt first:

import urllib2
from StringIO import StringIO
import numpy as np
import pandas

# decode the GBK-encoded page before parsing
etf_info = unicode(urllib2.urlopen(etf_url).read(), 'gbk')
# genfromtxt accepts an explicit structured dtype, so the column types come out right
nd_data = np.genfromtxt(StringIO(etf_info), delimiter='|', skiprows=14,
                        skip_footer=1, dtype=ETF_DTYPE)
data = pandas.DataFrame(nd_data, index=nd_data['ticker'],
                        columns=['name', 'vol', 'sign',
                                 'ratio', 'cash', 'price'])
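(ETF_DTYPE is not shown above; one possible structured dtype matching these columns could look like the sketch below. The field names mirror the column list, but the string widths and the choice of float for the columns that can be blank are my own guesses, not the asker's actual definition.)

import numpy as np

# Hypothetical definition -- adjust the widths and type codes to your data.
ETF_DTYPE = np.dtype([
    ('ticker', 'U6'),    # text, so leading zeros like 000423 survive
    ('name',   'U16'),
    ('vol',    'f8'),    # float, so a blank volume can become nan
    ('sign',   'i4'),
    ('ratio',  'f8'),
    ('cash',   'f8'),
    ('price',  'f8'),
])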
It would be nice if read_csv could add dtype and usecols settings. Sorry for my greed. ^_^
Simply put: no, not yet. More work (read: more active developers) is needed in this particular area. If you could post how you're using read_csv, it might help. I suspect that the whitespace between the bars may be the problem.
EDIT: this is now obsolete. This behavior is now covered natively by read_csv.
You can now use dtype in read_csv.
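For reference, a minimal sketch of what that looks like with a current pandas on Python 3; dtype, na_values and skipinitialspace are real read_csv keywords, but the column names and the NaN handling below are assumptions to adapt, and the header/footer skipping from the original call is omitted:

import io
import pandas as pd

names = ['ticker', 'name', 'vol', 'sign', 'ratio', 'cash', 'price']
data = pd.read_csv(io.StringIO(etf_info), sep='|', names=names,
                   index_col='ticker',
                   dtype={'ticker': str},   # keep the leading zeros
                   na_values=[' '],         # blank fields hold a single space
                   skipinitialspace=True)
print(data.dtypes)                          # check the resulting column types

read_csv also supports usecols nowadays, if you only want a subset of the columns.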
PS: Kudos to Wes McKinney for answering; it feels quite awkward to contradict the "past Wes".