PySpark error: dataType <class 'pyspark.sql.types.StringType'> should be an instance of <class 'pyspark.sql.types.DataType'>

I need to extract some data from a PipelinedRDD, but converting it to a DataFrame raises the following error:

Traceback (most recent call last):
  File "/home/karan/Desktop/meds.py", line 42, in <module>
    relevantToSymEntered(newrdd)
  File "/home/karan/Desktop/meds.py", line 26, in relevantToSymEntered
    mat = spark.createDataFrame(self,StructType([StructField("Prescribed medicine",StringType), StructField(["Disease","ID","Symptoms Recorded","Severeness"],ArrayType)]))
  File "/home/karan/Downloads/spark-2.4.2-bin-hadoop2.7/python/pyspark/sql/types.py", line 409, in __init__
    "dataType %s should be an instance of %s" % (dataType, DataType)
AssertionError: dataType <class 'pyspark.sql.types.StringType'> should be an instance of <class 'pyspark.sql.types.DataType'>

Similar questions don't apply here:

1. Their error is a TypeError, while mine is an AssertionError.
2. My problem has nothing to do with casting of data types.

I've already tried using toDF(), but it changes the column names, which is undesirable.

import findspark
findspark.init('/home/karan/Downloads/spark-2.4.2-bin-hadoop2.7')
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StringType, IntegerType, StructField, ArrayType
from pyspark import SparkConf, SparkContext
import pandas as pd

def reduceColoumns(self):
    try:
        filtered=self.rdd.map(lambda x: (x["Prescribed medicine"],list([x["Disease"],x["ID"],x["Symptoms Recorded"],x["Severeness"]])))
    except Exception as e:
        print("Error in CleanData:- ")
        print(e)
    return filtered

def cleanData(self,s):
    try:
        self.zipWithIndex
    except Exception as e:
        print("Error in CleanData:- ")
        print(e)
    return self.filter(lambda x: x[1][0]==s)

def relevantToSymEntered(self):
    mat = spark.createDataFrame(self,StructType([StructField("Prescribed medicine",StringType), StructField(["Disease","ID","Symptoms Recorded","Severeness"],ArrayType)]))
    #mat = mat.rdd.map(lambda x: (x["Prescribed medicine"],list([x["ID"],x["Symptoms Recorded"],x["Severeness"]])))
    print(type(mat))


conf = SparkConf().setMaster("local[*]").setAppName("MovieSimilarities")
sc = SparkContext(conf = conf)
spark=SQLContext(sc)
rdd = spark.read.csv("/home/karan/Desktop/ExportExcel2.csv",header=True,sep=",",multiLine="True")

print(rdd)
newrdd=reduceColoumns(rdd)
x=input("Enter the disease-")
newrdd=cleanData(newrdd,x)
relevantToSymEntered(newrdd)
asked May 07 '19 by Karan Agarwal




1 Answer

Replace

StructType([StructField("Prescribed medicine",StringType), StructField(["Disease","ID","Symptoms Recorded","Severeness"],ArrayType)])

with:

StructType([StructField("Prescribed medicine",StringType()), StructField(["Disease","ID","Symptoms Recorded","Severeness"],ArrayType())])

You need to instantiate the type classes: StringType and ArrayType are classes, and StructField expects an instance of DataType, not the class itself.

answered Nov 03 '22 by Kiel Haymon