In the usual structured_kafka_wordcount.py code, when I split lines into words with a UDF like below,
my_split = udf(lambda x: x.split(' '), ArrayType(StringType()))
words = lines.select(
    explode(
        my_split(lines.value)
    )
)
the following warning keeps showing up:
WARN CachedKafkaConsumer: CachedKafkaConsumer is not running in UninterruptibleThread. It may hang when CachedKafkaConsumer's methods are interrupted because of KAFKA-1894
On the other hand, when I split the lines into words with pyspark.sql.functions.split, everything works well:
words = lines.select(
    explode(
        split(lines.value, ' ')
    )
)
Why does this happen, and how can I fix the warning?
This is the code I am trying to execute in practice:
import re

from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType

pattern = "(.+) message repeated (\\d) times: \\[ (.+)\\]"
prog = re.compile(pattern)

def _unfold(x):
    ret = []
    result = prog.match(x)
    if result:
        log = " ".join((result.group(1), result.group(3)))
        times = result.group(2)
        for _ in range(int(times)):
            ret.append(log)
    else:
        ret.append(x)
    return ret

_udf = udf(lambda x: _unfold(x), ArrayType(StringType()))
lines = lines.withColumn('value', explode(_udf(lines['value'])))
We first create a Spark session. SparkSession provides a single point of entry for interacting with the underlying Spark functionality and allows programming Spark with the DataFrame and Dataset APIs. To read from Kafka for streaming queries, we can use spark.readStream.
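For reference, a minimal sketch of such a streaming read, assuming a local broker and a topic named logs (both placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured_kafka_wordcount").getOrCreate()

# Read the Kafka topic as a streaming DataFrame and keep only the value column as a string.
lines = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "logs")
    .load()
    .selectExpr("CAST(value AS STRING)"))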
Since we already know that the data read from Kafka carries a value attribute, we must provide a same-named column when writing data back to a Kafka topic. If the result consists of multiple columns, condense them into JSON, cast it to a string, and write it to the value column; each column's data should be cast to string.
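As a sketch, assuming a multi-column result DataFrame and placeholder broker and topic names, that could look like:

from pyspark.sql.functions import struct, to_json

# Condense all columns of the (hypothetical) result DataFrame into a single JSON string
# named "value" and stream it back to Kafka.
query = (result
    .select(to_json(struct("*")).cast("string").alias("value"))
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "words")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start())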
Writing batch queries is similar to streaming queries, except that we use the read method instead of readStream and write instead of writeStream.
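A batch counterpart, under the same placeholder broker and topic names, would be:

# Read the topic once as a static DataFrame ...
df = (spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "logs")
    .load())

# ... and write it back to another topic as strings.
(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "logs_copy")
    .save())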
Other than avoiding Python UDFs*, there is nothing you can do about this problem in your code. As you can read in the warning message, UninterruptibleThread is a workaround for a Kafka bug (KAFKA-1894) and is designed to prevent an infinite loop when interrupting the KafkaConsumer. It is not used with PythonUDFRunner (it probably wouldn't make sense to introduce a special case there).
Personally I wouldn't worry about it unless you experience some related issues. Your Python code will never interact directly with the KafkaConsumer. If you do experience any issues, they should be fixed upstream; in that case I recommend creating a JIRA ticket.
* Your _unfold function can be rewritten with SQL functions, though it is something of a hack. First, add the message count as an integer:
from pyspark.sql.functions import concat_ws, col, expr, coalesce, lit, regexp_extract, when

p = "(.+) message repeated (\\d) times: \\[ (.+)\\]"

lines = spark.createDataFrame(
    ["asd message repeated 3 times: [ 12]", "some other message"], "string"
)

lines_with_count = lines.withColumn(
    "message_count",
    coalesce(regexp_extract("value", p, 2).cast("int"), lit(1))
)
Use it to explode:
exploded = lines_with_count.withColumn(
    "i",
    expr("explode(split(repeat('1', message_count - 1), ''))")
).drop("message_count", "i")
and extract:
exploded.withColumn(
    "value",
    when(
        col("value").rlike(p),
        concat_ws(" ", regexp_extract("value", p, 1), regexp_extract("value", p, 3))
    ).otherwise(col("value"))
).show(4, False)
# +------------------+
# |value |
# +------------------+
# |asd 12 |
# |asd 12 |
# |asd 12 |
# |some other message|
# +------------------+