# Data Management on UL HPC Facility

 Copyright (c) 2020 UL HPC Team <hpc-sysadmins@uni.lu>


Author: Sarah Peter

Note: To make it clear where you should execute a certain command, the prompt is prefixed with the location, i.e.

• (access)$> for commands on the cluster access/login nodes
• (node)$> for commands on a cluster node inside a job
• (laptop)$> for commands locally on your machine

The actual command comes only after this prefix.

## Overview

### Requirements

• Access to the UL HPC clusters.
• Basic knowledge of the Linux command line.

### Questions

• How can I check my quotas on file sizes and number of files?
• What do "soft quota", "hard quota" and "grace period" mean?
• How can I see how much space is used on a specific file system?
• How can I compute checksums?
• How can I verify checksums?
• How can I encrypt a file?
• How can I decrypt a file?
• How can I encrypt a directory of files?
• How can I read the files in an encrypted directory?

### Objectives

• Explain the df and df-ulhpc commands to check disk usage and quota status.
• Compute MD5 and SHA-256 checksums and understand the difference between them.
• Verify MD5 and SHA-256 checksums.
• Encrypt a single file with GPG.
• Decrypt a GPG-encrypted file.
• Encrypt a whole folder with gocryptfs.
• Mount a gocryptfs-encrypted folder to read and write files in it.
• Unmount a gocryptfs-encrypted folder to secure it from unauthorized access.

## Preliminaries

Connect to the cluster.

## Quotas

We provide the df-ulhpc command on the cluster login nodes, which displays current usage, soft quota, hard quota and grace period. Any directories that have exceeded the quota will be highlighted in red.

Check your file size quota with:

(access)$> df-ulhpc


You will see a list of directories on which quotas are applied, how much space you are currently using, your soft quota, hard quota and the grace period.

Once your usage reaches the soft quota you can still write data until the grace period expires (7 days) or you reach the hard quota. After you reach the end of the grace period or the hard quota, you have to reduce your usage to below the soft quota to be able to write data again.
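The interplay of soft quota, hard quota and grace period can be sketched as a small shell function. This is only an illustration of the rule described above: `quota_state` is a hypothetical helper, not part of `df-ulhpc`, and it assumes that "reaching" a quota means usage greater than or equal to it.

```shell
# quota_state USED SOFT HARD DAYS_OVER
# Hypothetical helper, not part of df-ulhpc: says whether writing is still
# possible. USED, SOFT and HARD are integers in the same unit; DAYS_OVER is
# the number of whole days since usage reached the soft quota.
quota_state() (
    used=$1 soft=$2 hard=$3 days_over=$4
    grace_days=7   # grace period quoted in the text above
    if [ "$used" -ge "$hard" ]; then
        echo "blocked: hard quota reached"
    elif [ "$used" -ge "$soft" ] && [ "$days_over" -ge "$grace_days" ]; then
        echo "blocked: grace period expired"
    elif [ "$used" -ge "$soft" ]; then
        echo "writable: $((grace_days - days_over)) day(s) of grace left"
    else
        echo "writable: below soft quota"
    fi
)

quota_state 400 500 550 0   # below the soft quota
quota_state 520 500 550 2   # over the soft quota, inside the grace period
quota_state 520 500 550 7   # grace period used up
quota_state 550 500 550 0   # hard quota reached
```

In both "blocked" cases, writing becomes possible again only once usage drops below the soft quota.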

Check your quota on the number of files (inodes) with:

(access)$> df-ulhpc -i

Check the free space on all file systems with:

(access)$> df -h


Check the free space on the current file system with:

(access)$> cd
(access)$> df -h .
(access)$> cd /mnt/isilon/projects
(access)$> df -h .


To see what directories are using your disk space and quota:

(access)$> cd
(access)$> ncdu
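When an interactive browser is not convenient (for example inside a script), a similar overview can be obtained with standard tools. The sketch below is a non-authoritative alternative; `largest_entries` and `count_files` are hypothetical helper names, and the `-d` option of `du` is assumed to be available (it is in GNU coreutils).

```shell
# largest_entries [DIR]: ten largest entries directly under DIR (default:
# home directory), sizes in KiB, largest first. The total for DIR itself
# is normally the first line.
largest_entries() (
    du -k -d 1 "${1:-$HOME}" 2>/dev/null | sort -rn | head -n 10
)

# count_files [DIR]: number of regular files below DIR, as a rough
# indication of what counts towards the quota on the number of files.
count_files() (
    find "${1:-$HOME}" -type f | wc -l | tr -d ' '
)
```

For instance, `largest_entries "$SCRATCH"` followed by `count_files "$SCRATCH"` gives a quick picture of both size and file count for the scratch directory.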


## Checksums

Integrity of data files is critical for the verifiability of computational and lab-based analyses. The way to seal a data file's content at a point in time is to generate a checksum. A checksum is a small datum generated by running an algorithm, called a cryptographic hash function, on a file. As long as a data file does not change, the calculation of the checksum will always result in the same datum. If you recalculate the checksum and it differs from a past calculation, you know the file has been altered or corrupted in some way.

Below are typical situations that call for checksum generation:

• You have copied data files to a new storage location, for instance you moved data from your local computer to the HPC to start an analysis.
• You want to create a snapshot of your data, for instance when you’re creating a supplementary material folder for a paper/report.

### MD5

The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value. Although MD5 was initially designed to be used as a cryptographic hash function, it has been found to suffer from extensive vulnerabilities. It can still be used as a checksum to verify data integrity, but only against unintentional corruption.

(...) it should not be relied on if there is a chance that files have been purposefully and maliciously tampered. In the latter case, the use of a newer hashing tool such as sha256sum is recommended.

First, we need to start an interactive job and prepare some test data:

(access)$> si
(node)$> mkdir -p $SCRATCH/data_management
(node)$> cd $SCRATCH/data_management
(node)$> echo 'Happy secure computing!' > message.txt


We can create the MD5 checksum with the following command:

(node)$> md5sum message.txt
05babfa77d503a60561ceab31d767b34 message.txt

Since we want to store the checksum, we should save it to a file:

(node)$> md5sum message.txt > message.md5


Given a data file and its checksum, you can verify the file against the checksum with the following command:

(node)$> md5sum -c message.md5
message.txt: OK

Let us change the file and see what happens to the checksum:

(node)$> echo "This file has been changed." >> message.txt
(node)$> md5sum message.txt
ebaf49a0b2482da084bca535db0423ba message.txt
(node)$> md5sum -c message.md5
message.txt: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
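In a script, the verification result is usually taken from the exit status rather than from the printed text. The following is a minimal sketch assuming GNU `md5sum`, whose `--status` option suppresses the output; `check_md5` is a hypothetical helper name.

```shell
# check_md5 CHECKSUM_FILE
# Returns 0 if every file listed in CHECKSUM_FILE matches its checksum,
# non-zero otherwise. --status silences md5sum so only the exit status counts.
check_md5() {
    if md5sum --status -c "$1"; then
        echo "OK: all files match $1"
    else
        echo "FAILED: at least one file differs from $1" >&2
        return 1
    fi
}
```

Running `check_md5 message.md5` in the test directory would then allow a job script to stop early when its input data has changed.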


### SHA

SHA is short for Secure Hash Algorithm. There are several versions of SHA that have been developed over time: SHA-1, SHA-2 and SHA-3. SHA-2 is a whole family of algorithms that create hash values of different lengths.

The most commonly used version and the one recommended by the National Institute of Standards and Technology (NIST) is SHA-256, which creates a hash value of 256 bits.
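The hash lengths quoted for MD5 (128 bits) and SHA-256 (256 bits) can be checked from the hexadecimal digests, since each hexadecimal character encodes 4 bits. The sketch below assumes the GNU `md5sum` and `sha256sum` tools; `digest_bits` is a hypothetical helper name.

```shell
# digest_bits TOOL FILE: print the length in bits of the digest that TOOL
# (e.g. md5sum or sha256sum) computes for FILE.
digest_bits() (
    hex=$("$1" "$2" | cut -d ' ' -f 1)   # first field is the hex digest
    echo $(( ${#hex} * 4 ))              # 4 bits per hex character
)

tmp=$(mktemp)
echo 'Happy secure computing!' > "$tmp"
digest_bits md5sum "$tmp"      # prints 128
digest_bits sha256sum "$tmp"   # prints 256
rm -f "$tmp"
```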

We can create the SHA-256 checksum with the following command:

(node)$> sha256sum message.txt > message.sha256

We can verify the file against the checksum in the same way as above:

(node)$> sha256sum -c message.sha256
message.txt: OK
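For the two situations mentioned at the start of this section (copying data to a new location, or taking a snapshot of a folder), one checksum file can cover a whole directory tree. The following is a sketch only, assuming GNU `find`, `sort`, `xargs` and `sha256sum`; `abs_path`, `make_manifest` and `verify_manifest` are hypothetical helper names.

```shell
# abs_path PATH: print PATH as an absolute path (no symlink resolution).
abs_path() {
    case $1 in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s\n' "$PWD/$1" ;;
    esac
}

# make_manifest DIR MANIFEST: write SHA-256 checksums of all regular files
# below DIR to MANIFEST, with paths relative to DIR. Keep MANIFEST outside
# DIR so that it does not end up listing itself.
make_manifest() (
    manifest=$(abs_path "$2")
    cd "$1" || exit 1
    find . -type f -print0 | sort -z | xargs -0 -r sha256sum > "$manifest"
)

# verify_manifest DIR MANIFEST: re-check every file listed in MANIFEST
# against the copy found under DIR; only mismatches are printed.
verify_manifest() (
    manifest=$(abs_path "$2")
    cd "$1" || exit 1
    sha256sum -c --quiet "$manifest"
)
```

A possible workflow is to run `make_manifest` on the original folder, transfer the folder together with the manifest, and run `verify_manifest` on the copy. Because the paths are stored relative to the folder, the manifest stays valid when the folder is moved.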


## Encryption

Encryption is an effective measure to protect sensitive data.

IMPORTANT NOTICE:

• Encryption keys and passphrases need to be kept safe and protected from unauthorised access.
• Ensure you have an off-site backup of critical data stored on the platform under encryption.
• (Disaster) recovery of encrypted data is not guaranteed to be viable, depending on internal consistency when the recovery snapshot is taken.

Note that any use of user-level encryption remains the responsibility of the user, who accepts any inherent risks, such as:

• data corruption due to encryption store corruption, or improper use of the encryption tools

### GPG

We can encrypt our test file using GPG with the following command:

(node)$> gpg -c message.txt

Since you did not specify with what to encrypt your file, GPG will ask for a passphrase. Enter the same passphrase twice and make sure you remember it (in this case at least until the next step). This command will also prompt GPG to generate a keyring, if you do not have one yet. The passphrase will be cached for the current SSH session.

This will create the encrypted file message.txt.gpg next to the unencrypted file, so let us delete the unencrypted file:

(node)$> rm message.txt


You can decrypt the file with the following command:

(node)$> gpg message.txt.gpg
gpg: CAST5 encrypted data
gpg: encrypted with 1 passphrase
gpg: WARNING: message was not integrity protected

You will see some output telling you that the file was encrypted with CAST5 and a warning that it was not integrity protected. To avoid that warning and since CAST5 is an older algorithm, let us encrypt the file with a newer and stronger algorithm:

(node)$> rm message.txt.gpg
(node)$> gpg --cipher-algo AES256 -c message.txt
(node)$> rm message.txt
(node)$> gpg message.txt.gpg
gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase

#### Optional

Instead of using a passphrase, you can also encrypt files using an encryption key. You create an encryption key with GPG using the following command:

(node)$> gpg --gen-key


Keep in mind that DSA keys are deprecated, so you should generate an RSA key, ideally with 4096 bits. You will have to enter a couple of other details and then the key will be stored in the keyring.

You can list the contents of your keyring with:

(node)$> gpg --list-keys

If you are going to transfer the data to someone else, however, you should encrypt it with the recipient's public key and not your own. You can upload (public) keys to public repositories for easier sharing. You can export your key to a file with:

(node)$> gpg --output jane-doe.key --armor --export jane.doe@uni.lu


Make sure to replace jane-doe with your name and use the email address that you specified when generating the key.

You can import other keys to your keyring using the following command:

(node)$> gpg --import jane-doe.key

To encrypt using a key, you must specify as recipient the email address associated with the key you want to use:

(node)$> gpg --encrypt --sign --recipient jane.doe@uni.lu message.txt


### Gocryptfs

Gocryptfs is a modern implementation of an encryption overlay filesystem.

For more details on its inner workings, see the following:

To use gocryptfs on the HPC platform, follow these steps:

#### 1. Load the gocryptfs module

(node)$> module load tools/gocryptfs

#### 2. Create two folders

• dir.crypt, which will act as the storage for the encrypted files (let’s call it crypt)
• dir, which will present (on demand) the unencrypted view (let’s call it view)

(node)$> cd $SCRATCH/data_management
(node)$> mkdir dir.crypt dir


#### 3. Initialize the crypt folder with a password

(node)$> gocryptfs -init dir.crypt
Choose a password for protecting your files.
Password:
Repeat:
Your master key is:
e1ecdcf1-6bcebaa0-cbe6cfb8-8e27d4ad-
acefb9d4-bd98de59-311d1898-31d7e4e4
If the gocryptfs.conf file becomes corrupted or you ever forget your password,
there is only one hope for recovery: The master key. Print it to a piece of
paper and store it in a drawer. This message is only printed once.
The gocryptfs filesystem has been created successfully.
You can now mount it using: gocryptfs dir.crypt MOUNTPOINT
(node)$> ls dir.crypt/
gocryptfs.conf  gocryptfs.diriv
(node)$> ls dir

On crypt store initialization, gocryptfs provides us with the master key, which can be used to restore access to the data files and is especially useful if the password is lost. Keep the master key safe and never store it unencrypted on the platform itself!

After initialization, the crypt store contains two internal configuration files:

• gocryptfs.conf is the global configuration for the crypt store, while
• gocryptfs.diriv is created per-directory for encryption of file names

Note that you should never modify (any) files within the crypt store.

#### 4. Mount the crypt folder into the view folder

To be able to access/store data, the crypt store needs to be mounted in the view folder:

• this can be done by supplying the initially set password, either on the command line or from a file with the -passfile option
• … or with the generated master key, with the -masterkey option
• using the -passfile option means that you have stored your password unencrypted on the filesystem, which is a security risk!
• when using the master key mode, you should be in a full-node or exclusive job reservation such that no other users are able to see the master key in the system

(node)$> gocryptfs dir.crypt dir
Decrypting master key
(node)$> ls dir

#### 5. Add files to the view folder

All your processing (new file/folder creation, modification and transfers) will happen in the view folder. Once the crypt store is mounted in the view directory, we can create files in the latter:

• any folder/file created in the unencrypted view will have a 1:1 correspondent in the crypt store
• the plain text message.txt file is stored in encrypted format as 5o_WSYN-Tn59W3vrPiHXEA in the underlying crypt store (file name metadata is encrypted as well)
• the same permissions applied on message.txt are also set for its encrypted correspondent file

(node)$> echo "Happy secure computing" > dir/message.txt
(node)$> ls dir
message.txt
(node)$> ls dir.crypt/
5o_WSYN-Tn59W3vrPiHXEA  gocryptfs.conf  gocryptfs.diriv


#### 6. Unmount the view folder

At the end of our processing, we use fusermount explicitly to unmount the encrypted overlay, so that the unencrypted view of the data is closed and the data is flushed to the regular filesystem.

Note that you should always ensure that this happens before your job reservation expires.

(node)$> fusermount -u dir
(node)$> ls dir
(node)$> ls dir.crypt/
5o_WSYN-Tn59W3vrPiHXEA  gocryptfs.conf  gocryptfs.diriv

#### Other important details

• Data stored in a crypt store should not be used concurrently (e.g. by multiple users at the same time). The special option -sharedstorage exists for this use-case, but is not guaranteed to work for all applications.
• (Parallel) applications run through srun on the Iris cluster cannot "see" the unencrypted view folder as they are run in a different context. This is also the case if you use sjoin or srun --jobid to attach your terminal to a running job.

#### Gocryptfs store password management

You can change the password of an existing crypt store with the -passwd option:

(node)$> gocryptfs -passwd dir.crypt/
Decrypting master key


Note that the master key does not change.

For running batch processing on a gocryptfs-based folder, you can provide the decryption password through an external application with the -extpass option:

(node)$> gocryptfs -extpass "echo foobar" dir.crypt dir
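
For batch processing it can also help to tie mounting, processing and unmounting together, so that the view folder is closed even when the processing step fails. The wrapper below is a sketch under assumptions: `with_view` is a hypothetical helper name, the gocryptfs module is assumed to be loaded, and the command is run in the same shell context as the mount (as noted above, processes started through srun cannot see the view folder).

```shell
# with_view CRYPT VIEW COMMAND [ARGS...]
# Hypothetical wrapper: mount CRYPT on VIEW, run COMMAND, then always
# unmount VIEW and return COMMAND's exit status.
with_view() (
    crypt=$1
    view=$2
    shift 2
    # Add -extpass (or another password option) here for unattended jobs.
    gocryptfs "$crypt" "$view" || exit 1
    status=0
    "$@" || status=$?          # remember the result, but keep going
    fusermount -u "$view"      # close the unencrypted view in all cases
    exit "$status"
)
```

For example, `with_view dir.crypt dir cp results.tar dir/` would copy a file (here the hypothetical `results.tar`) into the encrypted store and unmount afterwards. The wrapper cannot unmount if the job itself is killed, so the job should still finish well before its reservation expires.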