Working with data

From MetaCentrum

Metacentrum wiki is deprecated after March 2023
Dear users, due to the integration of MetaCentrum into the e-INFRA CZ service, the user documentation is changing its format and site.
The current wiki pages won't be updated after the end of March 2023. They will, however, be kept for a few months for backwards reference.
The new documentation resides at

This topical guide provides the most important information about data manipulation, storage and archiving in MetaCentrum.

Data storage: managing large numbers of files

Keeping a large number of files (a million or more) in user home directories is problematic, since it significantly increases the time needed to back up the home directories, as well as to manipulate them for any other purpose. At such quantities, the number of files is the limiting factor, not their size. To keep service operations sustainable, there is a quota on the number of files. We encourage users who exceed the quota either to remove the data or to pack them into suitably sized chunks. In our experience, directories with millions of files often result from a job run amok and as such present "dead weight". Users who really need to store large numbers of files can contact user support and ask for an exception.

Check your quota of number of files

You can see the state of your quotas at
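If the quota overview is not at hand, you can also get a rough count yourself from the command line (a minimal sketch using only standard tools; on homes with millions of entries it may run for a long time):

```shell
# count regular files under the current directory tree
find . -type f | wc -l

# show which top-level subdirectories hold the most files,
# to locate where the bulk of them lives
find . -mindepth 2 -type f | cut -d/ -f2 | sort | uniq -c | sort -rn | head
```

Run it from your home directory on a frontend; the second command is handy for finding the one runaway job directory that holds most of the files.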

Remove, tar, or tar and archive the excess files

If the data is junk, remove it with a single command, e.g.

(BUSTER)melounova@tarkil:~$ ssh rm -rf ~/junk_dir

or wrap the command into a batch job to avoid waiting for the command to end:

(BUSTER)melounova@tarkil:~$ qsub -l walltime=24:00:00

If the data is not junk, pack it into larger chunks using the tar command, either from the command line or from within a batch job:

(BUSTER)melounova@tarkil:~$ ssh tar -cf not_junk_dir.tar ~/not_junk_dir
(BUSTER)melounova@tarkil:~$ qsub -l walltime=24:00:00

If you have enough space in your storage directories, you can keep the packed data there. However, we encourage users to archive data that is of permanent value, large, and not accessed frequently.

If you really need to keep large numbers of files in your home directory, contact us at the user support mail

Basic commands for data transfer

Basic commands for data transfer are scp, wget and sftp.

Windows alert: Windows users will need an application that emulates the Linux commands for data transfer. See How to ssh from Windows.


scp works in pretty much the same way as the normal cp command, except that it allows you to copy files between different machines.

marenka@home_PC:~$ scp my_file.txt # copy file "my_file.txt" to a home folder of user "jenicek" on a frontend ""
marenka@home_PC:~$ scp -r my_dir # as above; copy directory "my_dir" together with all subdirectories
marenka@home_PC:~$ scp -r . # from jenicek's home on skirit, copy to marenka's local PC folder "results" 
marenka@home_PC:~$ scp -r . # copy jenicek's folder "results" directly from /storage/brno2 (see section below for explanation of the path and server address)


An alternative way to download data is the wget command. wget works only if the file is available via the ftp or http(s) network protocols, typically when it is a downloadable file on some server. wget is faster and less safe than scp, so it may be the method of choice if you need to download a larger amount of data from the Internet and privacy is not an issue.

ssh                  # login to a frontend; replace "jenicek" by your real username
mkdir data; cd data  # create and enter directory "data" where the data will be downloaded to
wget                 # download the file from a server (= webpage)

With wget you can only transfer data to MetaCentrum machines; it is of no use if you want to transfer data from MetaCentrum.


sftp is just another protocol for transferring data. Unlike scp, it is interactive and slower, but apart from copying it also enables the user to manipulate files and directories on the remote side. We recommend using scp if you only need to copy data.

Related topics
Usage of WinSCP
About SFTP protocol

Windows users need an SFTP client; we recommend the WinSCP application. Keep in mind that in Step 1 you have to fill in the chosen NFS4 server as the target instead of a frontend. Make sure you have selected the SFTP file protocol, too.

Linux users just open a terminal and use the sftp command as shown below. More about the sftp command can be found in this external link.

sftp 'META username'@target_NFS4_server # Login
help # Shows available commands
get target_file # Downloads target file to your local system
get -r target_directory # Downloads target directory to your local system
put target_file # Uploads target file to server
put -r target_directory # Uploads target directory to server

There is a bug affecting Ubuntu 14.04+ concerning the recursive copy command put -r. If put -r fails, create the target directory on the server first to work around this issue.



The rsync command is a more advanced and versatile copying tool. It enables the user to synchronize the content of two directories in a more efficient way than scp, because rsync copies only the differences between the directories. Therefore it is often used as a tool for regular backups.

Copy directory data to archive:

$ rsync -zvh /storage/brno2/home/melounova/data /storage/du-cesnet/home/melounova/VO_metacentrum-tape_tape

Copy only the content of directory data to archive:

$ rsync -zvh /storage/brno2/home/melounova/data/ /storage/du-cesnet/home/melounova/VO_metacentrum-tape_tape
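The trailing slash is what makes the difference: data means "the directory itself", while data/ means "the content of the directory". A local sketch you can try safely (hypothetical demo paths under /tmp):

```shell
# set up a small demo directory
mkdir -p /tmp/rsync-demo/data /tmp/rsync-demo/dst1 /tmp/rsync-demo/dst2
touch /tmp/rsync-demo/data/file.txt

# no trailing slash: the directory itself is created inside the target
rsync -a /tmp/rsync-demo/data /tmp/rsync-demo/dst1   # -> dst1/data/file.txt

# trailing slash: only the content of the directory is copied
rsync -a /tmp/rsync-demo/data/ /tmp/rsync-demo/dst2  # -> dst2/file.txt
```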

Data transfer: moderate amount of data (<=1000 files, <= 100 GB)

A moderate amount of data (hundreds of individual files and/or less than 100 GB) can be transferred to/from MetaCentrum machines in a straightforward way.

The server you log in to is one of the frontends.

The path to locate your destination/source data is the same path you see when you are logged in on a frontend.


melounova@home_PC:~$ scp . # copy file "foo" from "brno2" storage through frontend skirit to my local PC

Data transfer: large amounts of data (>1000 files, >100 GB)

Data transfer between storages and PC, principles

When transferring large amounts of data, we ask users to avoid the frontends. A transfer of large data can overload the frontend and cause a slowdown, which is inconvenient for other users.

For example, the command

melounova@home_PC:~$ scp -r .

does this:

(Figure: an scp transfer routed through a frontend)

The data are not stored on the frontend, but the transfer loads its CPUs and RAM. Therefore, for large data it is better to access the data storages (NFS4 servers) directly.

The direct-access-equivalent to the command above is

melounova@home_PC:~$ scp -r . 

and it can be visualised as:

(Figure: an scp transfer going directly to the NFS4 server)

Why do I log in to brno6, if I want to access brno2?

As hardware changes, user data are moved to new disk fields with new names. For convenience, the old names, such as "brno2", are kept as symlinks, so that users do not need to revise all their scripts and aliases every time there is a change. When working from a frontend, everything remains the same regardless of the changes in the background. For example, brno2 still exists as a symlink, although the original brno2 hardware was replaced and the data now reside physically on brno6.

Since direct access avoids the frontend, you cannot use the symbolic links; you need to use the real server names and correct paths. Although these can be figured out from the directory tree, for convenience the following table lists the storages, server names and corresponding paths.

Which /storage server name path to user homes example: how to copy from this storage to local PC
/storage/brno2 ~/../fsbrno2/home/USERNAME scp -r .
/storage/brno1-cerit/ ~/ scp -r .
/storage/brno3-cerit/ ~/ scp -r .
/storage/brno6/ ~/
scp -r .
scp -r .
/storage/brno8/ ~/
scp -r .
scp -r .
/storage/budejovice1/ ~/
scp -r .
scp -r .
/storage/du-cesnet/ ~/VO_metacentrum-tape_tape
scp -r .
scp -r .
/storage/liberec3-tul/ ~/
scp -r .
scp -r .
/storage/plzen1/ ~/
scp -r .
scp -r .
/storage/praha1/ ~/
scp -r .
scp -r .

In case something does not work as expected, or some storage is missing from this table, write to us at

Data transfer between storages using scp

If you want to move a large amount of data between storages, the setup is similar to copying data between your PC and a storage. The only difference is that you cannot access the storages interactively (see Working with data), and therefore the scp command has to be passed as an argument to the ssh command.

For example, copy file foo from plzen1 to your home at brno2:

ssh "scp foo"

If you are already logged on a frontend, you can simplify the command to:

ssh storage-plzen1 "scp foo storage-brno6:~/../fsbrno2/home/USERNAME/"

The scp-command examples shown above will run only until you either disconnect or the validity of your Kerberos ticket expires. For longer-lasting copy operations, it is a good idea to submit the scp command within a job. Prepare a trivial batch script, e.g.

 #PBS -N copy_files
 #PBS -l select=1:ncpus=1:scratch_local=1gb
 #PBS -l walltime=15:00:00

 ssh storage-plzen1 "scp foo storage-brno6:~/../fsbrno2/home/USERNAME/"

and submit it with qsub

Data transfer between storages using rsync

Another way to pass data between storages is to use the rsync command.

For example, to move all your data from plzen1 to brno12-cerit:

(BUSTER)USERNAME@skirit:~$ ssh storage-plzen1 "rsync -avh ~ storage-brno12-cerit:~/home_from_plzen1/"

To move only a selected directory:

(BUSTER)USERNAME@skirit:~$ ssh storage-plzen1 "rsync -avh ~/my_dir storage-brno12-cerit:~/my_dir_from_plzen1/"

You can wrap the rsync command into a job, too.

 #PBS -N copy_files
 #PBS -l select=1:ncpus=1:scratch_local=1gb
 #PBS -l walltime=15:00:00

 ssh storage-plzen1 "rsync -avh ~ storage-brno12-cerit:~/home_from_plzen1/"

If you then look at the output of the running job, you can check how the data transfer proceeds.

USERNAME@NODE:~$ tail -f /var/spool/pbs/spool/

Other ways to access /storage directly

ssh protocol

Selected programs for data manipulation directly at the NFSv4 storage server can be run through SSH. On the other hand, these operations can easily overload the NFSv4 server. If you plan massive file moves, please contact us in advance.


  • Apart from the cerit NFS4 servers, there is no shell available on the storage servers, so simply typing ssh will not work (you can log in, but you will be immediately logged out). Instead, use the construction ssh SERVER COMMAND.
  • It is not possible to run programs on a storage volume; no computation should be run on the NFSv4 server.
  • When copying files with dd, set the block size (bs parameter) to at least 1 MB (compared with the default of 512 bytes); operations will be faster.
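The block-size effect can be illustrated locally (a sketch with hypothetical file names under /tmp; over NFS the difference is far more pronounced):

```shell
# create a 16 MB test file
dd if=/dev/zero of=/tmp/dd-src.bin bs=1M count=16 2>/dev/null

# default-sized blocks (512 B) mean many more read/write calls
dd if=/tmp/dd-src.bin of=/tmp/dd-slow.bin bs=512 2>/dev/null

# 1 MB blocks do the same work in far fewer calls
dd if=/tmp/dd-src.bin of=/tmp/dd-fast.bin bs=1M 2>/dev/null

# both copies are identical; only the speed differs
cmp /tmp/dd-slow.bin /tmp/dd-fast.bin && echo "copies identical"
```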

On storage servers, only the following commands are available:

  • /usr/bin/scp,
  • /usr/lib/sftp-server,
  • /bin/cp,
  • /bin/ls,
  • /bin/tar,
  • /bin/cat,
  • /bin/dd,
  • /bin/rm,
  • /bin/mkdir,
  • /bin/mv,
  • /bin/rmdir,
  • /bin/chmod,
  • /usr/bin/gzip,
  • /usr/bin/gunzip


List the content of the home directory on a remote machine with the following command:

ssh ls -l

A full path can be used as well:

ssh ls -l /home/USERNAME

Mount storage on local station

For more advanced users, there is also the possibility to mount the data storages locally. The NFS4 servers can then be accessed in the same way as a local disk. Follow the tutorial in Mounting_data_storages_on_local_station to learn how to mount the storages locally.

Data storage

WARNING: The data backup and archiving policy differs between NFS servers (see the table below). Please keep in mind that the data in your home directory are backed up only in the form of daily snapshots stored on the same disk array, and are thus prone to loss in case of HW failure or natural disaster. For data of permanent value, consider keeping your own copy or using a safer backup and/or archiving strategy - find more info at Working with data.

Related topics
Types of scratch storage
CESNET data care

There are three types of data storage offered by MetaCentrum:

Storage type | Basic description | Typical usage
Scratch storages | Fast storages with small capacity | Working with data during computations
Disk arrays | /storage volumes in MetaCentrum | Storing data between computations
Hierarchical storages | Storages with massive capacity | Data archiving

Scratch storages

Scratch storages are accessible via the scratch directory on computational nodes. Use these storages during computations only. The batch script should clean up the scratch after the job is done; if data are left in the scratch (as happens when the job fails or is killed), the cleanup should be done manually – see Beginners guide. Data on scratch storages are automatically deleted after 14 days.
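One common way to make the cleanup happen even when the script fails midway is a shell trap at the top of the batch script. The following is a sketch, not an official template; on MetaCentrum nodes the SCRATCHDIR variable is set by the batch system, and this demo falls back to a temporary directory so it can be run anywhere:

```shell
#!/bin/bash
# On a compute node, SCRATCHDIR is set by the batch system;
# for a local dry run we fall back to a fresh temporary directory.
: "${SCRATCHDIR:=$(mktemp -d)}"
export SCRATCHDIR

# Remove everything in the scratch on any exit, including errors.
# The ${SCRATCHDIR:?} guard aborts instead of expanding to "/*"
# if the variable were ever empty.
trap 'rm -rf "${SCRATCHDIR:?}"/*' EXIT

# ... copy input data to $SCRATCHDIR, run the computation there,
# and copy the results back to /storage before the script ends ...
touch "$SCRATCHDIR/intermediate.dat"   # stand-in for computation output
```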

Disk arrays

Disk arrays are several connected hard drives, accessible via the /storage directories. Files are stored on multiple drives, which guarantees higher I/O speed and reliability. Use disk arrays for preparing data and for storing data between jobs.

NFS4 server directory | capacity | back-up class | alternative server names in Perun / note
/storage/brno1-cerit/ | 1.8 PB | 2 |
/storage/brno2/ | 306 TB | 2 |
/storage/brno3-cerit/ | WILL BE decommissioned | 2 | data moved to /storage/brno12-cerit/
/storage/brno4-cerit-hsm/ | decommissioned | | data archived in /storage/brno1-cerit/
/storage/brno5-archive/ | decommissioned | 3 |
/storage/brno6/ | decommissioned | 2 |
/storage/brno7-cerit/ | decommissioned | 2 | data archived in /storage/brno1-cerit/
/storage/brno8/ | decommissioned | 3 | formerly /storage/ostrava1/; data moved to /storage/brno2/home/USERNAME/brno8
/storage/brno9-ceitec/ | decommissioned | 3 | dedicated to NCBR CEITEC
/storage/brno10-ceitec-hsm/ | decommissioned | 3 | dedicated to NCBR CEITEC
/storage/brno11-elixir/ | 313 TB | 2 | dedicated to ELIXIR-CZ
/storage/brno12-cerit/ | 3.4 PB | 2 | home directory in nfs4/home/$USER
/storage/budejovice1/ | 44 TB | 3 | (storage-cb1|storage-cb2)
/storage/jihlava1-cerit/ | decommissioned | | data archived to /storage/brno4-cerit-hsm/fineus; symlink /storage/jihlava1-cerit/ kept
/storage/jihlava2-archive/ | decommissioned | |
/storage/du-cesnet/ | | 3 | optimal archive storage for all MetaCentrum users
/storage/liberec3-tul/ | 30 TiB | |
/storage/plzen1/ | 352 TB | 2 |
/storage/plzen2-archive/ | decommissioned | |
/storage/plzen3-kky/ | decommissioned | | replaced by plzen4-ntis
/storage/plzen4-ntis/ | 200 TiB | 3 | for members of the iti/kky group
/storage/praha1/ | decommissioned | 3 | storage-praha1(a|b)
/storage/praha2-natur/ | 88 TB | |
/storage/praha4-fzu/ | decommissioned | | 15 TB
/storage/praha6-fzu/ | 76 TB | |
/storage/praha5-elixir/ | 157 TB | 3 |
/storage/pruhonice1-ibot/ | 179 TB | 3 |
/storage/vestec1-elixir/ | | 2 | /storage/praha1/
The back-up classes are described in the Back-up policy (Politika zálohování). Summary:
  • class 2 - backup (only) in the form of time slices
  • class 3 - data with a backup copy

There are several /storage volumes. Their names (e.g. /storage/brno2 or /storage/plzen1) reflect their physical placement in the Czech Republic (in this example, the cities of Brno and Plzen, respectively). The home directory on a frontend (/home) is mapped to one of the storages (/storage/CITY_XY/home) in the same location. E.g. the home of the skirit frontend, which is physically located in Brno, is mapped to the /storage/brno2/home directory. Use the ls -l command to find out:

(STRETCH)melounova@skirit:~$ ls -l /home
lrwxrwxrwx 1 root root 19 zář 17  2018 /home -> /storage/brno2/home

It is reasonable to use one of the frontends/storages closest to you, as it is usually faster. However, this does not mean you cannot use any other storage volume you are allowed to access! The only thing to keep in mind is that in such a case you have to use an explicit path:

(STRETCH)melounova@skirit:~$ pwd # pwd - print working (=current) directory
/storage/brno2/home/melounova # I am in Brno 2 storage now
(STRETCH)melounova@skirit:~$ cd /storage/plzen1/home/melounova/
(STRETCH)melounova@skirit:/storage/plzen1/home/melounova$ pwd
/storage/plzen1/home/melounova # I am in Plzen 1 storage now

Disk arrays have a backup policy of daily snapshots of user data (the backup usually runs during the night). The snapshots are kept at least 14 days back. This offers some protection in case a user unintentionally deletes files: generally, data that existed the day before the accident can be recovered. However, the snapshots are stored on the same disk arrays as the data, so in case of, e.g., a hardware failure of the disks, these backups will be lost. We therefore strongly recommend backing up any important data elsewhere. For archiving purposes, MetaCentrum offers dedicated storage servers.

Disk arrays with hierarchical storage

Disk arrays with hierarchical storage have a more robust backup policy and should be used primarily for archiving purposes. To increase redundancy of data, they contain several layers of storage media. The first layer is a disk array, lower layers are made of MAIDs (Massive Array of Idle Drives) or magnetic tape libraries. Lower layers have bigger capacity but slower access times. Data are moved automatically among these layers based on their last usage. The most important consequence from the user's point of view is that the access to long-unused data may be slower than to the recently-used ones.

Use hierarchical storages for storing data which you do not currently use, but which are important and not easily reproducible in the future.

Data archiving and backup

Note: data storage is a service provided by CESNET, contrary to grid computation, which is provided and supported by MetaCentrum. The following is a rough overview of the data services. In case of problems or questions, we recommend consulting the CESNET storage department homepage or contacting (see the CESNET storage department FAQs).

Since the data in "normal" home directories are backed up only in the form of snapshots, they are not protected against data loss due to hardware failure. Data of permanent value that would be hard to recreate should be backed up on dedicated servers with a hierarchical storage policy. Among other NFS4 servers, they can be identified by their names, which contain hsm (hierarchical storage machine) or archive.

NFS4 server directory | status | alias in Perun / note
/storage/du-cesnet/ | active | primary space for MetaCentrum users
/storage/brno14-ceitec/ | active | visible only to NCBR/CEITEC users
/storage/ostrava2-archive/ | decommissioned | exists formally as a symlink to /storage/du-cesnet/
/storage/brno5-archive/ | decommissioned | data will be archived in /storage/du-cesnet
/storage/jihlava2-archive | decommissioned | data archived in /storage/du-cesnet
/storage/plzen2-archive/ | decommissioned | data archived in /storage/du-cesnet
/storage/brno4-cerit-hsm/ | decommissioned | data archived in /storage/brno1-cerit/

Users are free to access any active server in the table above directly; however, we recommend using the directory /storage/du-cesnet/home/META_username/VO_metacentrum-tape_tape-archive/ (for long-term archiving) or /storage/du-cesnet/home/META_username/VO_metacentrum-tape_tape/ (for the backup service).

ssh # log in to any frontend, replace "jenicek" by your real username
cd /storage/du-cesnet/home/jenicek/VO_metacentrum-tape_tape-archive/ # enter the directory for archival data
cd /storage/du-cesnet/home/jenicek/VO_metacentrum-tape_tape/ # enter the directory for backup data

Both of these directories are so-called symlinks (symbolic links), which only point to an actual HSM server. Contrary to a "normal" path, a symlink is a more abstract construction and does not depend on the actual HSM server currently in use. For example, if an old HSM server (in some location) is replaced by a new one (possibly in a different location), the link will remain valid and the user does not need to rewrite the path in all their backup and archiving scripts.

Never leave data directly in the home, i.e. in /storage/du-cesnet/home/META_username/. The home directory should serve only for keeping SSH keys, links to directories with the actual data, and other configuration files. To enforce this, there is a tiny quota set on the home directory; see

Backup or archive?

On technical level, there is no difference between the VO_metacentrum-tape_tape-archive and VO_metacentrum-tape_tape directories. What differs is the policy applied to the data stored in either of them.

Permanent data archives are normally limited in size (typically results of some research, not raw data), and the user wants to keep them "forever". Therefore, VO_metacentrum-tape_tape-archive has a user quota set on the volume of data and/or the number of files. On the other hand, the data are not removed after a time (they do not "expire"). Use this link if you want to stash away data of permanent value.

Backed-up data serve to protect against data loss in case the primary data are lost. Typically, these data need not be kept for a very long time. Therefore, in VO_metacentrum-tape_tape files older than 12 months are automatically removed (they are considered "expired"). Use this link if you want to protect your current data, e.g. from a HW failure of the server where the primary data are stored.

A few notes

Transferring files to/from the archive

In general: the smaller the number of files in the archive, the better (it speeds operations up and generates a lower load on the storage subsystems; on the other hand, packing the files makes searching less comfortable). In case you need to archive a large number of small files, we strongly recommend packing them first, as read/write operations are slower with many small files. Often there is a quota set not only on the volume but also on the number of files; having hundreds of thousands of small files can hit this quota.

  • if most of your files are large (hundreds of MBs, GBs, ...), don't bother with packing them and make a one-to-one copy to the archive,
  • if your files are smaller and you don't plan to search individual files, pack them into tar or zip files,
  • from the technical point of view, optimal "chunk" of packed data is 500 MB or bigger,
  • don't use the front-end servers for anything else than moving several small files! Submit a regular job and/or take an interactive job instead to handle the archival data, e.g. qsub -I -l select=1:ncpus=1:mem=2gb:scratch_local=2gb -l walltime=48:00:00,
  • keep in mind that the master HOME directory of each HSM storage is dedicated just for initialization scripts, and thus has a limited quota of just 50 MB.
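When even a packed archive would be very large, the tar stream can be cut into fixed-size chunks with split (a sketch with demo-sized files and hypothetical names under /tmp; in practice choose chunks of 500 MB or more):

```shell
# set up a small demo directory with incompressible data
mkdir -p /tmp/chunk-demo/my-dir
dd if=/dev/urandom of=/tmp/chunk-demo/my-dir/data.bin bs=1k count=64 2>/dev/null
cd /tmp/chunk-demo

# pack and split the stream into 16 kB pieces:
# my-archive.tgz.part-aa, my-archive.tgz.part-ab, ...
tar czf - my-dir | split -b 16k - my-archive.tgz.part-

# later: reassemble the pieces and unpack
cat my-archive.tgz.part-* | tar xzf -
```

The pieces can then be copied to the archive one by one, which keeps individual transfers restartable.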

tar command

tar (tape archiver) is a Linux command that packs files and directories into one file, a packed archive. tar by itself does not compress the files, so the resulting volume of the packed archive is (roughly) the same as the sum of the volumes of the individual files. tar can cooperate with commands for file compression such as gzip.

In all examples, the option v in tar command options means "verbose", giving a more detailed output about how the archiving progresses.

  • In /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive, create (tar c) an uncompressed archive file (its name given by tar f) of the directory ~/my-archive and its content:

tar cvf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tar ~/my-archive

  • In /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive, create an archive of the directory ~/my-archive and compress it with gzip (tar z):

tar czvf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tgz ~/my-archive

  • List (tar t) the content of the existing archive:

tar tzf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tgz

  • Unpack the WHOLE archive my-archive.tgz residing in /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/ into the current directory:

tar xzvf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tgz

  • Unpack PART of the archive:
tar tzf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tgz # list the content of the archive
# unpack only file PATH1/file1 and directory PATH2/dir2 into the current directory
tar xzvf /storage/du-cesnet/home/USER/VO_metacentrum-tape_tape-archive/my-archive.tgz "PATH1/file1" "PATH2/dir2"

There are many other options to customize the tar command. For the full description, read manual pages (man tar).

Sharing data in group

If you want to share your data within a group, follow the instructions at Sharing data in group.