Technical Overview
Technology for High-Speed Bundling of Numerous "Small" Files
In large-scale distributed file systems such as Gfarm, efficient data access is best achieved by increasing the size of individual files and accessing multiple files in parallel. In reality, however, there is a significant need to handle "large numbers of small files" efficiently, such as log files and simulation intermediate data.
Gfptar is a parallel archive command developed to solve this challenge. It automatically bundles a large number of input entries (files) into multiple archive files in the output directory while transferring them in parallel.
Features & Strengths
Speed & Flexibility to Handle Hundreds of Millions of Files
1. Overwhelming File Handling Capacity
Gfptar enables efficient access and processing even for hundreds of millions of files. By default, it groups the input into archive files of approximately 200 MB each, maximizing parallelism.
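As a rough sketch of what archive creation can look like in practice, the snippet below drives gfptar from Python to bundle one directory into an output archive directory. All paths are placeholders, and the tar-like -c OUTDIR -C BASEDIR MEMBER argument layout is an assumption (the text here only documents -t, -x, -u, and -r); consult gfptar --help on your installation for the exact form.

```python
import subprocess

# Hypothetical example: bundle the directory ./results (relative to
# /data/run1) into many archive files under gfarm:/home/user1/results.gfptar.
# By default gfptar groups the input into archives of roughly 200 MB each
# and transfers them in parallel (see the overview above).
# The -c OUTDIR -C BASEDIR MEMBER argument layout is an assumption.
subprocess.run(
    ["gfptar", "-c", "gfarm:/home/user1/results.gfptar",
     "-C", "/data/run1", "./results"],
    check=True,
)
```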
2. Fast List Retrieval & Restoration
The contents of created archives can be verified quickly (list retrieval: -t option) even when they contain large numbers of files, and archives can likewise be restored at high speed (extraction: -x option).
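For illustration, here is a minimal sketch of listing and restoring an archive directory, again invoked from Python; the -x OUTDIR INDIR argument order and the paths are assumptions to verify against the gfptar manual.

```python
import subprocess

# List the entries stored in the archive directory (-t).
subprocess.run(["gfptar", "-t", "gfarm:/home/user1/results.gfptar"], check=True)

# Restore everything into a local directory (-x). The "-x OUTDIR INDIR"
# order and the paths are assumptions for illustration only.
subprocess.run(
    ["gfptar", "-x", "/data/restore", "gfarm:/home/user1/results.gfptar"],
    check=True,
)
```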
3. Flexible Archive Updates & Fault Tolerance
- Update (-u option)
Only new files are added to an already created archive. If archive creation is interrupted by a network failure or other issue, this option can also be used to resume the archiving process (see the sketch after this list).
- Append (-r option)
When files only need to be added, the -r option, which executes faster than the -u option, can be used.
- Customization
While gzip compression is used by default, flexible settings are possible, such as specifying a different compression method or excluding specific filenames.
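Below is a minimal sketch of updating and appending an existing archive, assuming that -u and -r take the same output-directory, base-directory, and member arguments as creation; all paths are placeholders.

```python
import subprocess

ARCHIVE = "gfarm:/home/user1/results.gfptar"  # placeholder archive directory

# Add only new files to the existing archive, or resume an interrupted
# creation (-u). The argument layout is assumed to mirror creation.
subprocess.run(["gfptar", "-u", ARCHIVE, "-C", "/data/run1", "./results"],
               check=True)

# Append files without comparing against existing entries (-r), which the
# overview above describes as faster than -u when files are only added.
subprocess.run(["gfptar", "-r", ARCHIVE, "-C", "/data/run1", "./new_results"],
               check=True)
```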
Usage Examples
Big Data Backup & Efficiency
- Backup of Large-Scale Simulation Results
Bundle hundreds of millions of small intermediate files efficiently and in parallel into a single archive directory, completing stable backups quickly.
- Recovery from Interruptions
Even if a system failure occurs while creating a large archive, the -u option can be used to resume processing from the point of interruption, significantly reducing work time.
- Extracting Only Required Data
Instead of the entire archive, specific entries (files) can be specified and restored (-x option), making it easier to access only the data that is needed (see the sketch after this list).
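As a sketch of selective restoration, assuming -x accepts member names after the archive directory in the usual tar-like fashion; the member path and other paths are hypothetical.

```python
import subprocess

# Restore a single entry instead of the whole archive. The trailing member
# argument and all paths are hypothetical; confirm the exact -x syntax
# with `gfptar --help`.
subprocess.run(
    ["gfptar", "-x", "/data/restore", "gfarm:/home/user1/results.gfptar",
     "results/step_0001.dat"],
    check=True,
)
```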
Target Needs & Users
- HPC (High-Performance Computing) Users
Those who generate large numbers of small files as simulation or calculation results and need to manage, move, and store them efficiently.
- Data Managers / IT Personnel
Those seeking fast and robust archive/backup methods for systems handling hundreds of millions of files.
- Data Scientists
Those who want to manage dataset versions, or bundle and move/share specific groups of intermediate files.
Other Technology & Development
- OAuth/OIDC Authentication
This mechanism enables more secure access to the Gfarm file system using a new login method.
- Nextcloud Support
By accessing shared storage directly, users can more easily utilize data on Gfarm.
- Gfarm HTTP Gateway
A mechanism enabling access through the HTTP protocol, which is widely used in web applications.