
How To Get 1000 Records From Dataframe And Write Into A File Using Pyspark?

I have 100,000+ records in a dataframe. I want to create files dynamically and push 1000 records into each file. Can anyone help me solve this? Thanks in advance.

Solution 1:

You can use the maxRecordsPerFile option while writing the dataframe.

  • To write 1000 records per file across the whole dataframe, first collapse it to a single partition with .repartition(1) or .coalesce(1); maxRecordsPerFile then splits that partition's output into files of at most 1000 records each, so 100,000 records yield 100 files.

Example:

# coalesce to one partition, then write at most 1000 records per file
df.coalesce(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)

# repartition(1) works the same way; 100,000 records produce 100 files
df.repartition(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)

# or set it as a config on the Spark session
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000)
# or via SQL
spark.sql("set spark.sql.files.maxRecordsPerFile=1000").show()

df.coalesce(1).write.mode("overwrite").parquet(<path>)
df.repartition(1).write.mode("overwrite").parquet(<path>)
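
To sanity-check the result, you can count the part files after the write. A minimal sketch, assuming the output went to a local directory (the path /tmp/out below is hypothetical):

import glob

df.coalesce(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet("/tmp/out")

# each part-*.parquet file holds at most 1000 records
print(len(glob.glob("/tmp/out/part-*.parquet")))  # ~100 files for 100,000 records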

Method 2:

Calculate the number of partitions, then repartition the dataframe:

import math
from pyspark.sql.functions import spark_partition_id, col, count

df = spark.range(10000)

# calculate the number of partitions (repartition needs an int)
no_partitions = math.ceil(df.count() / 1000)

#repartition and check number of records on each partition
df.repartition(no_partitions).\
withColumn("partition_id",spark_partition_id()).\
groupBy(col("partition_id")).\
agg(count("*")).\
show()

#+------------+--------+
#|partition_id|count(1)|
#+------------+--------+
#|           1|    1001|
#|           6|    1000|
#|           3|     999|
#|           5|    1000|
#|           9|    1000|
#|           4|     999|
#|           8|    1000|
#|           7|    1000|
#|           2|    1001|
#|           0|    1000|
#+------------+--------+

df.repartition(no_partitions).write.mode("overwrite").parquet(<path>)
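
Note that repartition spreads rows round-robin, so the per-partition counts land near 1000 rather than exactly on it (999 and 1001 in the output above). If every file must hold exactly 1000 records, one option is to derive a chunk id from a row number and partition the output by it. A sketch under that assumption (the row_num and chunk column names are made up here, and ordering by the id column from spark.range above):

from pyspark.sql import functions as F, Window

# assign a 0-based chunk id: rows 1-1000 -> chunk 0, rows 1001-2000 -> chunk 1, ...
chunked = df.withColumn("row_num", F.row_number().over(Window.orderBy("id"))) \
            .withColumn("chunk", ((F.col("row_num") - 1) / 1000).cast("int"))

# one subdirectory per chunk, each holding exactly 1000 records (the last may hold fewer)
chunked.drop("row_num").repartition("chunk").write.partitionBy("chunk").mode("overwrite").parquet(<path>)

Keep in mind that Window.orderBy without a partitionBy funnels all rows through a single partition, so this is only reasonable for moderately sized data.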

Solution 2:

First, create a row-number column:

from pyspark.sql import functions as F, Window

df = df.withColumn('row_num', F.row_number().over(Window.orderBy('any_column')))

Now loop over the dataframe in 1000-row chunks and save each one (note that row_number starts at 1):

for i in range(1, df.count() + 1, 1000):
    records = df.where(F.col("row_num").between(i, i + 999))
    records.toPandas().to_csv("file-{}.csv".format(i))
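
toPandas() collects each chunk onto the driver, which is fine for 1000-row slices but avoidable. A sketch of the same loop using Spark's own CSV writer instead (each call creates a directory holding a single part file; the file-{} naming just mirrors the original):

for i in range(1, df.count() + 1, 1000):
    chunk = df.where(F.col("row_num").between(i, i + 999))
    chunk.coalesce(1).write.mode("overwrite").csv("file-{}".format(i), header=True)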
