Iterate over the files in a folder and process them in Scala

Question 1

I have a couple of files in a folder for different countries, like the ones below:

Casedata_GBR_202110_timestamp.csv

Casedata_ARG_202110_timestamp.csv

Now I have to process these files country by country and copy them to their respective folders. My destination folder structure will look like this:

2021-->11-->GBR

2021-->11-->ARG

Please help me write Spark Scala (or plain Scala) code that processes each file by country and moves it to the respective country folder.

Jarrod Baker · Answer 1 · 2021-11-24T08:25:52

It sounds like you're looking for partitionBy defined on DataFrameWriter. From the scaladoc:

def partitionBy(colNames: String*): DataFrameWriter[T]

Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:

year=2016/month=01/
year=2016/month=02/

Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.

This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
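
As a rough illustration of how partitionBy could be applied to the files in the question, here is a minimal sketch. It assumes the country code and the year/month can be read from the file name (Casedata_<COUNTRY>_<yyyyMM>_...), that the CSV files have a header row, and that "input/" and "output/" are placeholder paths; none of these details are confirmed by the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, input_file_name, regexp_extract}

object PartitionByCountry {
  // Assumed file name layout: Casedata_<COUNTRY>_<yyyyMM>_<rest>.csv
  // group 1 = country, group 2 = year, group 3 = month
  val NamePattern = "Casedata_([A-Z]{3})_(\\d{4})(\\d{2})_"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-by-country").getOrCreate()

    // Read all matching files at once and derive the partition columns
    // from the path of the file each row came from.
    val df = spark.read
      .option("header", "true") // assumption: files have a header row
      .csv("input/Casedata_*.csv") // placeholder input path
      .withColumn("src", input_file_name())
      .withColumn("year", regexp_extract(col("src"), NamePattern, 2))
      .withColumn("month", regexp_extract(col("src"), NamePattern, 3))
      .withColumn("country", regexp_extract(col("src"), NamePattern, 1))
      .drop("src")

    // Writes directories such as output/year=2021/month=10/country=GBR/
    df.write
      .mode("overwrite")
      .option("header", "true")
      .partitionBy("year", "month", "country")
      .csv("output/") // placeholder output path

    spark.stop()
  }
}
```

With this approach Spark reads every file in one job, so no explicit loop over the files is needed. The directory names keep the column=value form described in the scaladoc above.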

Is there any way to get a layout like 2016/01 instead? And will foreach help to iterate over the files one by one?
I have to load each file into a DataFrame, one at a time, and copy it to blob storage.
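
On the follow-up: partitionBy always writes directory names in column=value form, so a bare 2016/01-style layout has to be built by hand. The sketch below is one possible way to do that in plain Scala, iterating over the files one at a time. It assumes a local file system and the name layout Casedata_<COUNTRY>_<yyyyMM>_<rest>.csv; RouteByCountry, destinationDir and copyAll are hypothetical names chosen for this example. For blob storage the copy step would need to go through the Hadoop FileSystem API or the storage SDK instead of java.nio.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object RouteByCountry {
  // Assumed file name layout: Casedata_<COUNTRY>_<yyyyMM>_<rest>.csv
  private val Name = """Casedata_([A-Z]{3})_(\d{4})(\d{2})_.*\.csv""".r

  /** Destination directory <root>/<year>/<month>/<country>, or None if the name does not match. */
  def destinationDir(root: Path, fileName: String): Option[Path] = fileName match {
    case Name(country, year, month) =>
      Some(root.resolve(year).resolve(month).resolve(country))
    case _ => None
  }

  /** Copies each matching file in `source` below `root`; returns the target paths. */
  def copyAll(source: Path, root: Path): List[Path] = {
    // listFiles returns null when `source` is not a readable directory
    val files = Option(source.toFile.listFiles()).map(_.toList).getOrElse(Nil)
    files.filter(_.isFile).flatMap { file =>
      destinationDir(root, file.getName).map { dir =>
        Files.createDirectories(dir)
        // Per-file processing (for example loading into a DataFrame) would go here.
        Files.copy(file.toPath, dir.resolve(file.getName), StandardCopyOption.REPLACE_EXISTING)
      }
    }
  }
}
```

Files whose names do not match the assumed pattern are skipped rather than treated as errors. The year and month are taken from the file name as written, so a different target month would need to be supplied separately.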
