normalian blog

Let's talk about Microsoft Azure, ASP.NET and Java!

Data copy from FTP server to SQL Data Lake using Azure Data Factory

You can achieve to setup data copy scenarios with Azure Data Factory from your FTP server to your SQL Data Lake by following this article. I believe this scenario is quite simple, but you can avoid to be confused by utilizing this article.

How to setup FTP server on Microsoft Azure

Create a Linux, CentOS7, virtual machine at first. After that, connect the VM with ssh, and run below commands.

[root@ftpsourcevm ~]# sudo su -
[root@ftpsourcevm ~]# yum -y update && yum -y install vsftpd

Please setup this vsftp server as passive mode with below sample. As far as I have confirmed, Azure Data Factory supports only passive mode ftp servers.

[root@ftpsourcevm ~]# vi /etc/vsftpd/vsftpd.conf

# When "listen" directive is enabled, vsftpd runs in standalone mode and
# listens on IPv4 sockets. This directive cannot be used in conjunction
# with the listen_ipv6 directive.
listen=YES
#
# This directive enables listening on IPv6 sockets. By default, listening
# on the IPv6 "any" address (::) will accept connections from both IPv6
# and IPv4 clients. It is not necessary to listen on *both* IPv4 and IPv6
# sockets. If you want that (perhaps because you want to listen on specific
# addresses) then you must run two copies of vsftpd with two configuration
# files.
# Make sure, that one of the listen options is commented !!
listen_ipv6=NO

pam_service_name=vsftpd
userlist_enable=YES
tcp_wrappers=YES

pasv_enable=YES
pasv_addr_resolve=YES
pasv_min_port=60001 ( you need add this port to this VM NSG setup
pasv_max_port=60010 ( you need add this port to this VM NSG setup
pasv_address=(update global ip address of your ftp server vm e.g. 52.1xx.47.xx)

Run below commands to reflect your config change.

[root@ftpsourcevm ~]# systemctl restart vsftpd
[root@ftpsourcevm ~]# systemctl enable vsftpd

Finally, you need to add allow port configuration between pasv_min_port and pasv_max_port into NSG. Please refer below image.
f:id:waritohutsu:20170829212808p:plain

How to setup Azure Data Lake for Azure Data Factory

Just create your Azure Data Lake instance, and add a directory for Azure Data Factory like below.
f:id:waritohutsu:20170829212827p:plain

How to setup Azure Data Factory to copy from your FTP server to your Azure Data Lake

After creating your Azure Data Factory instance, choose "Copy data (PREVIEW)" to setup this.
f:id:waritohutsu:20170829214430p:plain

Change this schedule period if it's needed.
f:id:waritohutsu:20170829212934p:plain

Choose "FTP" as "CONNECT TO A DATA SOURCE", but you can also choose other data sources such like S3 and other cloud data sources.
f:id:waritohutsu:20170829212957p:plain

Change to "Disable SSL" at "Secure Transmission" in this sample, and please setup SSL when you will deploy this pipeline in your production environments. Input a global in address of your ftp server and credential account info of your ftp server. You will get a connection error if you setup active mode FTP servers.
f:id:waritohutsu:20170829213020p:plain

Choose a folder for data source of Azure Data Factory. In this sample, we setup as binary copy mode. But you can setup other data copy types such like cvs and others.
f:id:waritohutsu:20170829213121p:plain

Choose "Azure Data Lake Store" as "CONNECT TO A DATA STORE" in this article.
f:id:waritohutsu:20170829213141p:plain

Choose your Azure Data Lake Store instance for storing data like below.
f:id:waritohutsu:20170829213202p:plain

Choose a folder for data storing destination.
f:id:waritohutsu:20170829213224p:plain

Confirm your setup info, and submit to deploy this pipeline.
f:id:waritohutsu:20170829213246p:plain

Confirm your setup

You can view your data copy pipeline in your Azure Data Factory like below. Azure Data Factory will copy your data on your FTP server into your Azure Data Lake following your schedule.
f:id:waritohutsu:20170829215545p:plain