I am scraping table data from google finance through pd.read_html and then saving that data to excel through df.to_excel()
as seen below:
dfs = pd.read_html('https://www.google.com/finance?q=NASDAQ%3AGOOGL&fstype=ii&ei=9YBMWIiaLo29e83Rr9AM', flavor='html5lib')
xlWriter = pd.ExcelWriter(output.xlsx, engine='xlsxwriter')
for i, df in enumerate(dfs):
df.to_excel(xlWriter, sheet_name='Sheet{}'.format(i))
xlWriter.save()
However, the numbers that are saved to excel are stored as text with the little green triangle in the corner of the cell. When moving over this data to excel, how do I store them as actual values and not text?
In Jupyter Notebook, just select the 'Excel Spreadsheet (. xlsx)' option under 'Download As' in the File menu. This should output ExcelTest. xlsx in the same folder as the ipynb file specified.
In addition to the other solutions where the string data is converted to numbers when creating or using the dataframe it is also possible to do it using options to the xlsxwriter
engine:
# Versions of Pandas >= 1.3.0:
writer = pd.ExcelWriter('output.xlsx',
engine='xlsxwriter',
engine_kwargs={'options': {'strings_to_numbers': True}})
# Versions of Pandas < 1.3.0:
writer = pd.ExcelWriter('output.xlsx',
engine='xlsxwriter',
options={'strings_to_numbers': True})
From the docs:
strings_to_numbers
: Enable theworksheet.write()
method to convert strings to numbers, where possible, usingfloat()
in order to avoid an Excel warning about "Numbers Stored as Text".
Consider converting numeric columns to floats since the pd.read_html
reads web data as string types (i.e., objects). But before converting to floats, you need to replace hyphens to NaNs:
import pandas as pd
import numpy as np
dfs = pd.read_html('https://www.google.com/finance?q=NASDAQ%3AGOOGL' +
'&fstype=ii&ei=9YBMWIiaLo29e83Rr9AM', flavor='html5lib')
xlWriter = pd.ExcelWriter('Output.xlsx', engine='xlsxwriter')
workbook = xlWriter.book
for i, df in enumerate(dfs):
for col in df.columns[1:]: # UPDATE ONLY NUMERIC COLS
df.loc[df[col] == '-', col] = np.nan # REPLACE HYPHEN WITH NaNs
df[col] = df[col].astype(float) # CONVERT TO FLOAT
df.to_excel(xlWriter, sheet_name='Sheet{}'.format(i))
xlWriter.save()
That is probably because the Data Types of those columns where the warning is showing are objects
and not Numeric Types, such as int
or float
.
In order to check the Data Types of each column of the DataFrame, use dtypes
, such as
print(df.dtypes)
In my case, the column that was stored as object instead of a numeric value, was PRECO_ES
As, in my particular case, the decimal numbers are relevant, I have converted it, using astype
, to float, as following
df['PRECO_ES'] = df['PRECO_ES'].astype(float)
If we check again the Data Types, we get the following
Then, all you have to do is export the DataFrame to Excel
#Export the DataFRame (df) to XLS
xlsFile = "Preco20102019.xls"
df.to_excel(xlsFile)
#Export the DataFRame (df) to CSV
csvFile = "Preco20102019.csv"
df.to_csv(csvFile)
If I then open the Excel file, I can see that the warning is not showing anymore, as the values are stored as numeric and not as text
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With