This should be an integrated process in HopsFS that takes as input a batch of 3-way replicated files and outputs a set of files that are now (1) stored on archive storage and (2) erasure-coded.
This leads to a number of challenges:
- we need an external system (like Hopsworks, which discovers stale files using Elasticsearch/ePipe) to identify cold files to archive.
- how do we erasure code small files (with fewer than 6+3 or 10+4 blocks)?
In this Jira, we are only interested in the latter problem. I propose we do the following:
- For files with at least 9 or 14 blocks, we erasure code them in place, as they are. Here we can reduce write amplification if the system has balanced amounts of archive and triple-replication volumes: our block placement policy would place 2 replicas of each block on standard storage and the 3rd replica on archive storage (ZFS-RAID5). When we generate the EC blocks and reduce the replication degree to '1', the replica that is kept will be the one already on archive storage, so we won't need to copy the block from standard storage to archive storage. A nice side effect is that archive storage could also be used by 'hot' data.
- For files with fewer than 9 or 14 blocks, we will make a new copy of the file so that it has at least 9/14 blocks. The new file will have a much smaller block size (which we calculate) and its blocks will be stored on archive storage. We then generate the parity blocks. When we are finished, we will perform an atomic rename that replaces the old file with the new file. All of this could be done in MapReduce jobs or in the NameNode.
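To make the two cases concrete, here is a minimal sketch of how the decision and the smaller block size could be calculated. It is not HopsFS code: the names `plan_archive`, `blocks_in` and `CHECKSUM_CHUNK` are hypothetical, and the 512-byte alignment is an assumption based on the stock HDFS rule that a block size must be a multiple of the checksum chunk size. The `min_blocks` threshold is the 9 or 14 from the proposal above.

```python
# Assumed alignment (hypothetical constant): stock HDFS requires the block
# size to be a multiple of the checksum chunk size, 512 bytes by default.
CHECKSUM_CHUNK = 512


def blocks_in(file_size: int, block_size: int) -> int:
    """Number of blocks a file of file_size bytes occupies (integer ceiling)."""
    return -(-file_size // block_size)


def plan_archive(file_size: int, block_size: int, min_blocks: int,
                 align: int = CHECKSUM_CHUNK):
    """Decide how to erasure code one file.

    Returns ("in-place", block_size) if the file already has at least
    min_blocks blocks, otherwise ("rewrite", new_block_size), where
    new_block_size is the largest aligned size that still yields at
    least min_blocks blocks.
    """
    if blocks_in(file_size, block_size) >= min_blocks:
        return ("in-place", block_size)

    # Round file_size / min_blocks down to a multiple of align. Rounding
    # down keeps new_size <= file_size / min_blocks, which guarantees
    # at least min_blocks blocks after the copy.
    new_size = (file_size // min_blocks) // align * align
    if new_size == 0:
        raise ValueError("file too small to split into min_blocks aligned blocks")
    return ("rewrite", new_size)


MIB = 1024 * 1024
print(plan_archive(1024 * MIB, 64 * MIB, min_blocks=9))  # large file
print(plan_archive(300 * MIB, 64 * MIB, min_blocks=9))   # small file
```

Rounding down can produce more than `min_blocks` blocks, with a short final block. A real implementation would also have to respect any minimum block size enforced by the NameNode (in stock HDFS, `dfs.namenode.fs-limits.min-block-size`). Whether HopsFS applies the same limit is not checked here.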