by Guest » 24 Dec 2024, 11:57
I am trying to test the Arrow-optimized Python UDF of Spark 4 as shown below:
[code]from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.master('local[*]').appName('test').getOrCreate()
spark.conf.set('spark.sql.execution.pythonUDF.arrow.enabled', True)
spark.conf.set('spark.sql.execution.arrow.pyspark.fallback.enabled', True)

rows = [{'name': 'joseph', 'age': 35}, {'name': 'jina', 'age': 30}, {'name': 'julian', 'age': 15}]
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)])
df = spark.createDataFrame(rows, schema)

# Arrow-optimized UDF returning a (name, age) struct
@udf(returnType=schema, useArrow=True)
def transform(name: str, age: int):
    return name.upper(), age + 10

# Apply UDF to transform both columns
df_trans = df.withColumn("trans", transform(df["name"], df["age"]))
df_trans.show()
[/code]
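As a sanity check of my own (the useArrow flag on @udf was, as far as I know, only introduced in Spark 3.5), I also printed which version the script actually loads:

[code]# My own sanity check: confirm the Spark build the script picked up.
# The install path suggests 4.0.0-preview2, which should support useArrow.
print(spark.version)
[/code]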
But the useArrow=True option causes fatal errors like the following:
[code]Traceback (most recent call last):
  File "C:\spark-4.0.0-preview2-bin-hadoop3\python\pyspark\sql\pandas\utils.py", line 28, in require_minimum_pandas_version
    import pandas
  File "C:\spark-4.0.0-preview2-bin-hadoop3\python\pyspark\pandas\__init__.py", line 33, in <module>
    require_minimum_pandas_version()
  File "C:\spark-4.0.0-preview2-bin-hadoop3\python\pyspark\sql\pandas\utils.py", line 43, in require_minimum_pandas_version
    raise PySparkImportError(
pyspark.errors.exceptions.base.PySparkImportError: [PACKAGE_NOT_INSTALLED] Pandas >= 2.0.0 must be installed; however, it was not found.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\VSCode_Workspace\pyspark-test\com\aaa\spark\arrow_spark.py", line 19, in <module>
    @udf(returnType=schema, useArrow=True)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\spark-4.0.0-preview2-bin-hadoop3\python\pyspark\sql\udf.py", line 142, in _create_py_udf
    require_minimum_pandas_version()
  File "C:\spark-4.0.0-preview2-bin-hadoop3\python\pyspark\sql\pandas\utils.py", line 43, in require_minimum_pandas_version
    raise PySparkImportError(
pyspark.errors.exceptions.base.PySparkImportError: [PACKAGE_NOT_INSTALLED] Pandas >= 2.0.0 must be installed; however, it was not found.
[/code]
When I set the Arrow option to False, this Python code runs without errors. Please let me know how I can fix these Spark 4 errors; I want to verify PySpark's Arrow-enabled UDF.
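Based on the [PACKAGE_NOT_INSTALLED] message, my working assumption is that pandas (and possibly pyarrow, which Arrow-optimized UDFs also rely on) is simply not installed in the interpreter that runs the driver script. Here is a minimal check I intend to run in that same interpreter, assuming pip-managed packages:

[code]# My assumption: the driver interpreter lacks pandas/pyarrow; both are
# needed before @udf(useArrow=True) can be created. This only reports
# what is visible to this exact interpreter.
import importlib.metadata as md

for pkg in ("pandas", "pyarrow"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "is NOT installed in this interpreter")
[/code]

If this reports pandas as missing, then installing pandas >= 2.0.0 (plus pyarrow) into that exact interpreter should be what require_minimum_pandas_version is checking for, but I would appreciate confirmation that nothing else is required in Spark 4.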