pyspark.testing.assertSchemaEqual

pyspark.testing.assertSchemaEqual(actual, expected, ignoreNullable=True, ignoreColumnOrder=False, ignoreColumnName=False)

A utility function that asserts equality between two DataFrame schemas, actual and expected.

New in version 3.5.0.

Parameters

actual : StructType
    The DataFrame schema that is being compared or tested.
expected : StructType
    The expected schema, for comparison with the actual schema.
ignoreNullable : bool, default True
    Whether to ignore each column’s nullable property when checking for schema equality. When True (default), nullability is not compared, so columns are considered equal even if their nullable settings differ. When False, columns are considered equal only if they have the same nullable setting. New in version 4.0.0.
ignoreColumnOrder : bool, default False
    Whether to match columns by name rather than by position. When False (default), columns are compared in the order they appear in the DataFrames. When True, each column in the expected schema is compared to the column with the same name in the actual schema. New in version 4.0.0.
ignoreColumnName : bool, default False
    Whether to ignore column names. When False (default), column names are checked and the function fails if they differ. When True, the function succeeds even if column names differ; column data types are then compared in the order the columns appear in the DataFrames. New in version 4.0.0.
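
To make the interaction of the three flags concrete, the following is a toy model in plain Python, not PySpark's implementation. The `Field` tuple and the `schemas_match` function are hypothetical names introduced only for illustration, and the model assumes flat schemas (no nested structs, arrays, or maps).

```python
from typing import List, NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for a StructField: name, type name, nullable flag."""
    name: str
    dtype: str
    nullable: bool


def schemas_match(
    actual: List[Field],
    expected: List[Field],
    ignore_nullable: bool = True,
    ignore_column_order: bool = False,
    ignore_column_name: bool = False,
) -> bool:
    """Toy model of the three flags for flat schemas only."""
    if len(actual) != len(expected):
        return False
    if ignore_column_order:
        # Pair columns by name instead of by position.
        actual = sorted(actual, key=lambda f: f.name)
        expected = sorted(expected, key=lambda f: f.name)
    for a, e in zip(actual, expected):
        if not ignore_column_name and a.name != e.name:
            return False
        if a.dtype != e.dtype:
            return False
        if not ignore_nullable and a.nullable != e.nullable:
            return False
    return True


a = [Field("a", "int", True), Field("b", "double", True)]
b = [Field("b", "double", False), Field("a", "int", True)]

print(schemas_match(a, b))                            # False: names differ by position
print(schemas_match(a, b, ignore_column_order=True))  # True: paired by name, nullability ignored
print(schemas_match(a, b, ignore_column_order=True, ignore_nullable=False))  # False: "b" nullability differs
```

In this sketch the defaults mirror the documented ones: nullability is ignored unless requested, while column order and column names are checked.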

Notes

When assertSchemaEqual fails, the error message uses the Python difflib library to display a diff log of the actual and expected schemas.
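
The layout of such a diff log can be reproduced with the standard library alone. The sketch below assumes the message is built with something like `difflib.ndiff` over the string forms of the two schemas; the exact call PySpark makes is not stated on this page, and the variable names are illustrative.

```python
import difflib

# String forms of two schemas, copied from the failing example on this page.
actual_repr = "StructType([StructField('names', ArrayType(DoubleType(), True), True)])"
expected_repr = "StructType([StructField('names', ArrayType(DoubleType(), True), False)])"

# ndiff compares sequences of lines; each schema here is a single line.
diff_lines = list(difflib.ndiff([actual_repr], [expected_repr]))

# Lines starting with "- " come from the first input, "+ " from the second,
# and "? " lines mark the characters that differ within the line above them.
print("\n".join(line.rstrip("\n") for line in diff_lines))
```

Reading the output the same way helps with real failures: the `-` line is the actual schema, the `+` line is the expected schema, and the `?` lines point at the differing characters.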

Examples

>>> from pyspark.testing import assertSchemaEqual
>>> from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, DoubleType
>>> s1 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
>>> s2 = StructType([StructField("names", ArrayType(DoubleType(), True), True)])
>>> assertSchemaEqual(s1, s2)  # pass, schemas are identical

Schemas that differ only in a column's nullable setting fail when ignoreNullable=False.

>>> s3 = StructType([StructField("names", ArrayType(DoubleType(), True), False)])
>>> assertSchemaEqual(s1, s3, ignoreNullable=False)  
Traceback (most recent call last):
...
PySparkAssertionError: [DIFFERENT_SCHEMA] Schemas do not match.
--- actual
+++ expected
- StructType([StructField('names', ArrayType(DoubleType(), True), True)])
?                                                                 ^^^
+ StructType([StructField('names', ArrayType(DoubleType(), True), False)])
?                                                                 ^^^^

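Compare the schemas of two DataFrames. The assertion fails because the type of the first column and the name of the second column differ.

>>> df1 = spark.createDataFrame(data=[(1, 1000), (2, 3000)], schema=["id", "number"])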
>>> df2 = spark.createDataFrame(data=[("1", 1000), ("2", 5000)], schema=["id", "amount"])
>>> assertSchemaEqual(df1.schema, df2.schema)  
Traceback (most recent call last):
...
PySparkAssertionError: [DIFFERENT_SCHEMA] Schemas do not match.
--- actual
+++ expected
- StructType([StructField('id', LongType(), True), StructField('number', LongType(), True)])
?                               ^^                               ^^^^^
+ StructType([StructField('id', StringType(), True), StructField('amount', LongType(), True)])
?                               ^^^^                              ++++ ^

Compare two schemas ignoring the column order.

>>> s1 = StructType(
...     [StructField("a", IntegerType(), True), StructField("b", DoubleType(), True)]
... )
>>> s2 = StructType(
...     [StructField("b", DoubleType(), True), StructField("a", IntegerType(), True)]
... )
>>> assertSchemaEqual(s1, s2, ignoreColumnOrder=True)

Compare two schemas ignoring the column names.

>>> s1 = StructType(
...     [StructField("a", IntegerType(), True), StructField("c", DoubleType(), True)]
... )
>>> s2 = StructType(
...     [StructField("b", IntegerType(), True), StructField("d", DoubleType(), True)]
... )
>>> assertSchemaEqual(s1, s2, ignoreColumnName=True)