Count Values In Multiple Columns That Contain A Substring Based On Strings Of Lists Pyspark
I have a data frame in PySpark like the one below. I want to count the values in two columns that contain a substring from some lists, and populate a new column for each list. df.show() +---+-------------+----------
Solution 1:
Here is a working solution. I have used a udf function to check the strings and calculate the sum. You can use built-in functions instead if possible. (Comments are provided by way of explanation.)
# creating a dictionary that maps each new column name to its list
columnLists = {'phone': phone_list, 'pc': pc_list, 'security': security_list}

# udf function for checking the strings and summing them
from pyspark.sql import functions as F
from pyspark.sql import types as t

def checkDevices(device, deviceModel, name):
    total = 0  # named total so the built-in sum() is not shadowed
    for x in columnLists[name]:
        if x in device:
            total += 1
        if x in deviceModel:
            total += 1
    return total

checkDevicesAndSum = F.udf(checkDevices, t.IntegerType())

# populating the sum returned from the udf function into the respective columns
for x in columnLists:
    df = df.withColumn(x, checkDevicesAndSum(F.col('device'), F.col('device_model'), F.lit(x)))

# finally grouping by id and summing each new column
df.groupBy('id').agg(F.sum('phone').alias('phone'), F.sum('pc').alias('pc'), F.sum('security').alias('security')).show()
This should give you:
+---+-----+---+--------+
| id|phone| pc|security|
+---+-----+---+--------+
|  3|    0|  2|       3|
|  1|    4|  2|       2|
|  2|    2|  0|       1|
+---+-----+---+--------+
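The answer mentions that built-in functions could replace the udf. The following is only a rough sketch of how that might look, not the original author's code. The helper names `count_matches` and `add_count_columns` and the sample strings are made up for illustration. `count_matches` restates the counting rule in plain Python, and `add_count_columns` expresses the same rule with `Column.contains`, assuming the columns are named `device` and `device_model` as in the answer.

```python
from functools import reduce
from operator import add


def count_matches(device, device_model, substrings):
    """Plain-Python statement of the rule the udf applies to one row."""
    # Each substring adds 1 for every column it occurs in (0, 1 or 2 per substring).
    return sum((s in device) + (s in device_model) for s in substrings)


def add_count_columns(df, column_lists, source_cols=('device', 'device_model')):
    """Same rule using built-in column functions instead of a udf (requires PySpark)."""
    from pyspark.sql import functions as F  # imported lazily; only this function needs Spark
    for name, substrings in column_lists.items():
        hits = [
            # contains() yields a boolean column; cast it to int and treat null as 0
            F.coalesce(F.col(c).contains(s).cast('int'), F.lit(0))
            for s in substrings
            for c in source_cols
        ]
        # Add up all the 0/1 columns; F.lit(0) covers an empty substring list
        df = df.withColumn(name, reduce(add, hits, F.lit(0)))
    return df


# Hypothetical values, not taken from the question's data
print(count_matches('galaxy phone', 'phone x', ['phone', 'tablet']))  # 2
```

With the answer's dictionary this would be called as `df = add_count_columns(df, columnLists)`. One difference to keep in mind: in this sketch a null in either column counts as zero matches, whereas the udf version would raise an error on a null value.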
The aggregation part can be generalized in the same way as the rest. Improvements and modifications are up to you. :)
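As one possible way to generalize the aggregation, the sum expressions can be built from the column names instead of being written out by hand. This is a minimal sketch under the assumption that the grouping column is `id`, as in the answer. `build_sum_exprs` and `sum_by_id` are hypothetical helper names, not part of PySpark.

```python
def build_sum_exprs(column_names):
    """Return one SQL aggregate expression per column, e.g. 'sum(`pc`) AS `pc`'."""
    return ["sum(`{0}`) AS `{0}`".format(name) for name in column_names]


def sum_by_id(df, column_names):
    """Group by 'id' and sum every column in column_names (requires PySpark)."""
    from pyspark.sql import functions as F  # imported lazily; the helper above is Spark-free
    return df.groupBy('id').agg(*[F.expr(e) for e in build_sum_exprs(column_names)])


# Usage, assuming df and columnLists from the answer above:
# sum_by_id(df, list(columnLists)).show()
print(build_sum_exprs(['phone', 'pc', 'security']))
```

The same list could also be built directly as `[F.sum(name).alias(name) for name in columnLists]` and unpacked into `agg`. The string form is used here only so that the expression-building step can be checked without a Spark session.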