
Count Values In Multiple Columns That Contain A Substring Based On Strings Of Lists Pyspark

I have a data frame in PySpark like the one below. I want to count the values in two columns that match entries in some lists, and populate a new column for each list. df.show() +---+-------------+----------

Solution 1:

Here is a working solution. I have used a udf function to check the strings and calculate the sum. You can use built-in functions instead where possible, since they usually perform better than Python UDFs. (Comments are provided in the code as explanation.)

#creating dictionary for the lists with names for columns
columnLists = {'phone':phone_list, 'pc':pc_list, 'security':security_list}

#udf function for checking the strings and summing them
from pyspark.sql import functions as F
from pyspark.sql import types as t

def checkDevices(device, deviceModel, name):
    #count how many entries of the named list occur in either string
    count = 0
    for x in columnLists[name]:
        if x in device:
            count += 1
        if x in deviceModel:
            count += 1
    return count

checkDevicesAndSum = F.udf(checkDevices, t.IntegerType())

#populating the sum returned from udf function to respective columns
for x in columnLists:
    df = df.withColumn(x, checkDevicesAndSum(F.col('device'), F.col('device_model'), F.lit(x)))

#finally grouping by id and summing the counts
df.groupBy('id').agg(F.sum('phone').alias('phone'), F.sum('pc').alias('pc'), F.sum('security').alias('security')).show()

which should give you

+---+-----+---+--------+
| id|phone| pc|security|
+---+-----+---+--------+
|  3|    0|  2|       3|
|  1|    4|  2|       2|
|  2|    2|  0|       1|
+---+-----+---+--------+
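
The counting logic inside checkDevices is plain Python, so it can be checked without a Spark session before wrapping it in a udf. The sketch below repeats that logic as a standalone function. The keyword lists and sample strings are made up for illustration (the real phone_list, pc_list and security_list are not shown here), and the names sample_lists and count_matches are hypothetical.

```python
# Hypothetical keyword lists, standing in for phone_list, pc_list and
# security_list, whose real contents are not shown in this answer.
sample_lists = {
    "phone": ["cell", "mobile"],
    "pc": ["desktop", "laptop"],
    "security": ["lock", "camera"],
}


def count_matches(device, device_model, name):
    """Count how many keywords of sample_lists[name] occur in either string."""
    count = 0
    for keyword in sample_lists[name]:
        if keyword in device:        # substring test on the first column
            count += 1
        if keyword in device_model:  # substring test on the second column
            count += 1
    return count


print(count_matches("mobile phone", "cell-x1", "phone"))     # 2
print(count_matches("laptop", "gaming laptop", "pc"))        # 2
print(count_matches("laptop", "gaming laptop", "security"))  # 0
```

Note that a null value in either column reaches a Python udf as None, and a substring test against None raises a TypeError, so a guard may be needed if the columns can contain nulls.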

The aggregation part can be generalized in the same way as the rest of the code. Improvements and modifications are up to you. :)
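
One possible way to generalize the aggregation is to derive it from the keys of columnLists instead of naming each column by hand. The sketch below is only one option: it uses the dictionary form of agg, which names its outputs like sum(phone), and then renames them back. The helper names build_sum_spec and sum_per_id are hypothetical, and df is assumed to already hold the per-row count columns.

```python
def build_sum_spec(names):
    """Map each count column to the 'sum' aggregate, in the dict form agg() accepts."""
    return {name: "sum" for name in names}


def sum_per_id(df, names, key="id"):
    """Group df by `key`, sum every column in `names`, and keep the plain names."""
    names = list(names)
    result = df.groupBy(key).agg(build_sum_spec(names))
    for name in names:
        # The dict form of agg() names its outputs like "sum(phone)".
        result = result.withColumnRenamed(f"sum({name})", name)
    # Select explicitly so the column order follows `names`.
    return result.select(key, *names)


# Usage, assuming df already carries the phone/pc/security count columns:
# sum_per_id(df, columnLists).show()
```

Passing columnLists works because iterating a dictionary yields its keys, so adding a new list only requires a new dictionary entry.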
