I'm working on the following code for performing Random Forest classification on train and test sets:
from sklearn.ensemble import RandomForestClassifier
from numpy import genfromtxt, savetxt

def main():
    dataset = genfromtxt(open('filepath','r'), delimiter=' ', dtype='f8')
    target = [x[0] for x in dataset]
    train = [x[1:] for x in dataset]
    test = genfromtxt(open('filepath','r'), delimiter=' ', dtype='f8')
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(train, target)
    predicted_probs = [[index + 1, x[1]] for index, x in enumerate(rf.predict_proba(test))]
    savetxt('filepath', predicted_probs, delimiter=',', fmt='%d,%f',
            header='Id,PredictedProbability', comments = '')

if __name__=="__main__":
    main()
However, I get the following error on execution:
----> dataset = genfromtxt(open('C:/Users/user/Desktop/pgm/Cora/a_train.csv','r'), delimiter='', dtype='f8')
ValueError: Some errors were detected !
Line #88 (got 1435 columns instead of 1434)
Line #93 (got 1435 columns instead of 1434)
Line #164 (got 1435 columns instead of 1434)
Line #169 (got 1435 columns instead of 1434)
Line #524 (got 1435 columns instead of 1434)
...
...
...
Any suggestions as to how to avoid it? Thanks.
genfromtxt will give this error if the rows do not all have the same number of columns.
I can think of 3 ways around it:
1. Use the usecols parameter:
np.genfromtxt('yourfile.txt',delimiter=',',usecols=np.arange(0,1434))
However, this may mean that you lose some data (where rows are longer than 1434 columns). Whether or not that matters is down to you.
2. Adjust your input data file so that it has an equal number of columns.
3. Use something other than genfromtxt.
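Options 1 and 3 can be sketched roughly as follows. This is only an illustration on a made-up two-row input, not the asker's file. The variable names and the choice to pad short rows with NaN are assumptions, and the standard csv module is just one possible replacement for genfromtxt.

```python
import csv
import io

import numpy as np

# Made-up ragged input: the second row has one field too many.
ragged_text = "1 2 3 4\n1 2 3 4 5\n"

# Option 1: keep only the first four columns of every row.
truncated = np.genfromtxt(io.StringIO(ragged_text), delimiter=" ",
                          usecols=np.arange(4))

# Option 3: parse with the csv module, then pad short rows with NaN
# so that every row has the same width.
rows = [[float(value) for value in row]
        for row in csv.reader(io.StringIO(ragged_text), delimiter=" ")]
width = max(len(row) for row in rows)
padded = np.array([row + [float("nan")] * (width - len(row)) for row in rows])
```

The first approach silently drops the extra field, while the second keeps it and marks the missing values, so which one fits depends on whether the extra column carries real data.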
An exception is raised if an inconsistency is detected in the number of columns. A number of reasons and solutions are possible.
Add invalid_raise=False to skip the offending lines:
dataset = genfromtxt(open('data.csv','r'), delimiter='', invalid_raise = False)
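As a small self-contained check of what invalid_raise=False does, the sketch below feeds genfromtxt a made-up three-row input in which the middle row is too long. The input and variable names are assumptions for illustration only.

```python
import io
import warnings

import numpy as np

# Made-up input: the middle row has one field too many.
sample = "1 2 3 4\n1 2 3 4 5\n5 6 7 8\n"

with warnings.catch_warnings():
    # genfromtxt emits a warning about the lines it drops; silence it here.
    warnings.simplefilter("ignore")
    kept = np.genfromtxt(io.StringIO(sample), delimiter=" ", invalid_raise=False)
```

Note that the offending rows are discarded entirely, so the resulting array has fewer rows than the file.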
If your data contains names, make sure that no field name contains a space or an invalid character, or corresponds to the name of a standard attribute (like size or shape), which would confuse the interpreter.
deletechars: Gives a string combining all the characters that must be deleted from the name. By default, invalid characters are ~!@#$%^&*()-=+~\|]}[{';: /?.>,<.
excludelist: Gives a list of the names to exclude, such as return, file, print… If one of the input names is part of this list, an underscore character ('_') will be appended to it.
case_sensitive: Whether the names should be case-sensitive (case_sensitive=True), converted to upper case (case_sensitive=False or case_sensitive='upper') or to lower case (case_sensitive='lower').
data = np.genfromtxt("data.txt", dtype=None, names=True,
                     deletechars=r"~!@#$%^&*()-=+~\|]}[{';: /?.>,<.", case_sensitive=True)
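To see the name clean-up in action, here is a minimal sketch using a made-up header. The column names are hypothetical and chosen so that one is on the default exclude list, one contains a space, and one contains a hyphen. Printing field_names shows how each was rewritten.

```python
import io

import numpy as np

# Made-up header: "file" is on the default exclude list, and the other two
# names contain a space and a hyphen respectively.
csv_text = "file,my col,a-b\n1,2,3\n4,5,6\n"

table = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                      dtype=None, names=True)

# The validated names are stored on the structured dtype.
field_names = table.dtype.names
```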
Reference: numpy.genfromtxt
You have too many columns in one of your rows. For example:
>>> import numpy as np
>>> from StringIO import StringIO
>>> s = """
... 1 2 3 4
... 1 2 3 4 5
... """
>>> np.genfromtxt(StringIO(s),delimiter=" ")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/site-packages/numpy/lib/npyio.py", line 1654, in genfromtxt
    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #2 (got 5 columns instead of 4)
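The session above is from Python 2. On Python 3 the StringIO class lives in the io module, so an equivalent reproduction might look like the sketch below. The variable names are assumptions, and the input is the same made-up two-row example without the leading blank line.

```python
import io

import numpy as np

# Same ragged input as above: the second row has five fields instead of four.
ragged = "1 2 3 4\n1 2 3 4 5\n"

try:
    np.genfromtxt(io.StringIO(ragged), delimiter=" ")
    message = ""
except ValueError as exc:
    # The message lists every line whose column count is inconsistent.
    message = str(exc)
```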
In my case, the error arose because of a special symbol in a row.
Error cause: special characters in a row, as in this example CSV file:
1,hello,',this',fails
import numpy
data = numpy.genfromtxt(file, delimiter=delimiter)  # raises the error
Environment Note:
OS: Ubuntu
csv editor: LibreOffice
IDE: PyCharm
None of the previous answers worked for me, so for future googlers here is another one:
The error was: "Line #88 (got 1435 columns instead of 1)"
I discovered that my CSV file was a UTF-8 encoded text file with a BOM (a character marking the encoding on the first line of the file; most text editors hide this character).
I simply opened it in Notepad on Windows, chose "Save As" again, and selected "ANSI" at the bottom of the save box.
That fixed it for me.
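If re-saving the file is not convenient, one alternative (not part of the fix described above, so treat it as an assumption) is to let Python strip the BOM while decoding by opening the file with the 'utf-8-sig' codec. The temporary file below is a made-up stand-in for the real CSV.

```python
import os
import tempfile

import numpy as np

# Build a small UTF-8 file that starts with a BOM, standing in for the real CSV.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write(b"\xef\xbb\xbf1,2,3\n4,5,6\n")
    bom_path = handle.name

# The 'utf-8-sig' codec drops a leading BOM while decoding.
with open(bom_path, "r", encoding="utf-8-sig") as source:
    bom_free = np.genfromtxt(source, delimiter=",")

os.remove(bom_path)
```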
I had this error. The cause was a single entry in my data that contained a space, which was read as an extra column. Make sure all spacing is consistent throughout the data.
It seems that the header containing the column names has one more column than the data itself (1435 columns in the header vs. 1434 in the data).
You could either:
1) Eliminate the one column from the header that doesn't correspond to the data
OR
2) Use the skip_header parameter of genfromtxt()
For example: np.genfromtxt('myfile', skip_header=*how many lines to skip*, delimiter=' ')
More information can be found in the documentation.
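A minimal sketch of the skip_header approach, using a made-up input whose header has one name more than each data row has values (the input and variable names are assumptions for illustration):

```python
import io

import numpy as np

# Made-up file: four names in the header, three values per data row.
text_with_header = "a b c d\n1 2 3\n4 5 6\n"

# Skipping the first line means the column count is taken from the data rows.
values = np.genfromtxt(io.StringIO(text_with_header), delimiter=" ",
                       skip_header=1)
```

This discards the column names altogether, so it only suits cases where the data is addressed by position rather than by name.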