Copy HDFS files to Azure Synapse ADLS2 using NiFi


The NiFi pipeline for copying HDFS files to an Azure Synapse ADLS2 folder is as follows. This was tested with Cloudera CDP 7.1.7 and Azure Synapse ADLS2.
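The flow chains four processors, each described below:

GetHDFSFileInfo -> RouteOnAttribute -> FetchHDFS -> PutAzureDataLakeStorage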

GetHDFSFileInfo will generate a listing of both directories and files. Keep in mind that the GetHDFSFileInfo processor does not maintain any state, so every time it executes it lists all files/directories from the target path regardless of whether they were listed before. If you need stateful listing, use the ListHDFS processor instead, which tracks what it has already listed. Configure GetHDFSFileInfo as follows:

Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml

Kerberos Principal / Kerberos Password: set these to your Kerberos credentials for the cluster

Full path: the path of the HDFS directory to list

Group Results: set this to “None”, since we want a single FlowFile for each object listed from HDFS; a separate FlowFile is then produced for each object found. If Group Results is set to “All”, all directories/files are combined into a single FlowFile, which is difficult to write out to Azure.

Destination: Attributes

The RouteOnAttribute processor is needed because the FlowFiles will describe both directories and files. We want to send only the files on to FetchHDFS and drop the directory entries. Simply add a dynamic property that routes any FlowFile produced by GetHDFSFileInfo whose “hdfs.type” attribute is set to “file” on to FetchHDFS, and send all other FlowFiles to the unmatched relationship, which you can auto-terminate.
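For example, a dynamic property like the one below (the property name “file” is arbitrary) uses the NiFi Expression Language to match only FlowFiles that describe files:

file: ${hdfs.type:equals('file')}

With Routing Strategy left at “Route to Property name”, connect the resulting “file” relationship to FetchHDFS and auto-terminate “unmatched”.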

The FetchHDFS processor needs its HDFS Filename property set to:

HDFS Filename: ${hdfs.path}/${hdfs.objectName}
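As an illustration, if GetHDFSFileInfo set hdfs.path to /data/landing and hdfs.objectName to part-00000.csv (hypothetical values), the property would resolve to:

HDFS Filename: /data/landing/part-00000.csv

FetchHDFS then reads that file’s content into the FlowFile and passes it downstream.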

The PutAzureDataLakeStorage settings are:
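A minimal sketch of the configuration; the filesystem (container) and directory names shown are placeholders for your own values:

ADLS Credentials: ADLSCredentialsControllerService
Filesystem Name: mycontainer
Directory Name: landing/hdfs-copy
File Name: ${filename}
Conflict Resolution Strategy: replace

Setting File Name to ${filename} keeps the name each object had in HDFS, and a Conflict Resolution Strategy of “replace” overwrites a file if it already exists in the target directory.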

The ADLSCredentialsControllerService uses a SAS token to connect to Azure Synapse ADLS2:
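A sketch of the controller service settings, assuming SAS authentication; the account name and token are placeholders:

Storage Account Name: mystorageaccount
Endpoint Suffix: dfs.core.windows.net
SAS Token: <your SAS token>

When authenticating with a SAS token, leave the Account Key and service-principal properties empty.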

Reference:

https://community.cloudera.com/t5/Support-Questions/Nifi-How-to-use-getHDFSFileInfo-for-the-next-step/td-p/299291


Originally published at http://plenium.wordpress.com on August 7, 2023.
