pyspark.sql.functions.hll_sketch_agg

pyspark.sql.functions.hll_sketch_agg(col, lgConfigK=None)

Aggregate function: returns the updatable binary representation of the DataSketches HllSketch built from the values in col and configured with the lgConfigK argument.

New in version 3.5.0.

Parameters
col : Column or str

The column (or column name) whose values are aggregated into the sketch.

lgConfigK : Column or int, optional

The log-base-2 of K, where K is the number of buckets or slots for the HllSketch.
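As a rough illustration of the relationship between lgConfigK and K described above, the plain-Python sketch below computes K = 2 ** lgConfigK for a few values. The helper name `hll_bucket_count` is hypothetical and is not part of PySpark; it assumes only the definition given for this parameter. In general, a larger K makes the sketch bigger and the estimate more precise.

```python
def hll_bucket_count(lg_config_k: int) -> int:
    """Return K, the number of buckets, for a given log-base-2 value.

    Hypothetical helper for illustration only; not a PySpark function.
    """
    if lg_config_k < 0:
        raise ValueError("lg_config_k must be non-negative")
    # K = 2 ** lgConfigK, computed with a bit shift.
    return 1 << lg_config_k


# Print K for a few illustrative lgConfigK values.
for lg in (4, 8, 12):
    print(f"lgConfigK={lg} -> K={hll_bucket_count(lg)}")
```

For instance, passing `lit(12)` as lgConfigK, as in the examples on this page, corresponds to K = 4096 buckets.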

Returns
Column

The binary representation of the HllSketch.

Examples

>>> from pyspark.sql.functions import col, hll_sketch_agg, hll_sketch_estimate, lit
>>> df = spark.createDataFrame([1, 2, 2, 3], "INT")
>>> df1 = df.agg(hll_sketch_estimate(hll_sketch_agg("value")).alias("distinct_cnt"))
>>> df1.show()
+------------+
|distinct_cnt|
+------------+
|           3|
+------------+
>>> df2 = df.agg(hll_sketch_estimate(
...     hll_sketch_agg("value", lit(12))
... ).alias("distinct_cnt"))
>>> df2.show()
+------------+
|distinct_cnt|
+------------+
|           3|
+------------+
>>> df3 = df.agg(hll_sketch_estimate(
...     hll_sketch_agg(col("value"), lit(12))).alias("distinct_cnt"))
>>> df3.show()
+------------+
|distinct_cnt|
+------------+
|           3|
+------------+