Use Case Overview
HPCI Shared Storage combines the wide-area distributed file system Gfarm with SINET's high-speed network to achieve the highest level of "service continuity" and "data security," the lifelines of research infrastructure, through thorough data duplication and redundancy across east and west sites. It is a future-oriented data sharing platform where users can reliably access the data they need, when they need it, without being aware of geographical constraints.
This case study introduces an initiative that achieved extremely high reliability and availability (a system that never stops) for "HPCI Shared Storage," which connects supercomputer resources nationwide.
Challenge
Challenges of Massive Data Movement and Service Continuity
In research using high-performance computing (supercomputers), a single simulation generates massive data on the order of several terabytes (TB). Traditionally, sharing such research results among multiple institutions required moving this massive data, which hindered efficient research. In addition, service continuity is the lifeline of research infrastructure: the greatest challenge was ensuring availability robust enough that service would not be interrupted, and data processing and research would not stop midway, even during disasters or maintenance.
Solution
Achieving an "Uninterrupted" Infrastructure and Seamless Data Sharing
HPCI Shared Storage solved this challenge by utilizing the wide-area distributed file system "Gfarm."
Inter-Site Collaboration and Data Protection
1. High-Speed, Large-Capacity Single File System
High-speed, large-capacity storage resources (100 PB logical, 200 PB physical) became accessible and shareable as a single file system from HPCI computational resources nationwide. Users at any supercomputer center can work with the data without needing to know where it is stored or how many replicas exist.
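The gap between the physical (200 PB) and logical (100 PB) capacity figures is consistent with keeping two copies of every file, one per site. A minimal sketch of that relationship (an illustrative calculation, not a description of Gfarm internals):

```python
# Sketch: usable (logical) capacity when every file is stored `replicas` times.
# The 200 PB physical / 100 PB logical figures in this case study are
# consistent with a replica count of 2 (one copy per site).
def logical_capacity(physical_pb: float, replicas: int) -> float:
    return physical_pb / replicas

print(logical_capacity(200, 2))  # -> 100.0
```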
2. Robust Redundancy Through East and West Sites
Storage is deployed at two locations, R-CCS (west site) and the University of Tokyo Kashiwa Campus (east site), with data constantly duplicated (replicated) between them. The entire system configuration (network, servers, storage) is also redundant, so that service continues even if part of the system fails.
3. Achieving Overwhelmingly High Reliability
This geographic distribution and redundancy allows service to be provided from a single site alone (standalone single-site operation). As a result, the system ran without a single unplanned outage throughout FY2019, achieving a full year of continuous, uninterrupted operation with 100% uptime.
4. High-Speed Data Synchronization
By utilizing SINET's broadband, highly reliable network (400G lines), re-replication (resynchronization) of the large volumes of data that accumulate during a failure or maintenance completes in a short time.
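A rough back-of-envelope calculation shows why line speed dominates resynchronization time. The numbers below are illustrative assumptions (100 TB of accumulated data, the full 400 Gbps sustained with no protocol overhead; real transfers will be slower):

```python
# Rough estimate: time to re-replicate accumulated data over a fast link.
# Assumes the full line rate is sustained with no protocol overhead.
def transfer_hours(data_tb: float, line_gbps: float) -> float:
    bits = data_tb * 1e12 * 8           # terabytes -> bits
    seconds = bits / (line_gbps * 1e9)  # bits / (bits per second)
    return seconds / 3600

# Re-replicating a hypothetical 100 TB backlog over a 400 Gbps line:
print(round(transfer_hours(100, 400), 2))  # -> 0.56 (about 33 minutes)
```

The same backlog over a 10 Gbps line would take roughly 40 times longer, which is why a 400G network makes post-maintenance resynchronization practical at this scale.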
5. Guaranteeing Data Integrity
Checksums are automatically verified when data is written, and verified again automatically when file data is replicated, providing multiple layers of integrity checking that robustly protect valuable research results.
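The two layers of checking described above, one at write time and one at replication time, can be sketched as follows. All helper names, paths, and the choice of SHA-256 are illustrative assumptions; Gfarm's actual implementation and digest configuration differ:

```python
import hashlib

def checksum(data: bytes) -> str:
    # SHA-256 chosen here for illustration only.
    return hashlib.sha256(data).hexdigest()

def write_with_checksum(site: dict, path: str, data: bytes) -> None:
    # Layer 1: compute and record a checksum when the file is written.
    site[path] = (data, checksum(data))

def replicate(src: dict, dst: dict, path: str) -> None:
    # Layer 2: re-verify the checksum before the copy lands at the other site.
    data, expected = src[path]
    if checksum(data) != expected:
        raise IOError(f"corruption detected while replicating {path}")
    dst[path] = (data, expected)

west, east = {}, {}  # toy stand-ins for the west and east sites
write_with_checksum(west, "/home/user/result.dat", b"simulation output")
replicate(west, east, "/home/user/result.dat")
```

Because the replica only lands after the digest matches the one recorded at write time, silent corruption introduced in transit or at rest is caught before it can propagate to the second site.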
Future Development
We aim to use this highly reliable data sharing platform not only to support domestic research but also to accelerate international collaboration and data sharing with overseas research institutions. Furthermore, as usage expands through collaboration between the supercomputer "Fugaku" and the cloud, traffic will inevitably grow. We expect the next-generation network to further develop and expand to accommodate this enormous traffic and support international research.
Other Use Cases
- JLDG (Japan Lattice Data Grid): Supporting the forefront of physics with an international data grid realized with Gfarm
- Subaru Telescope Data Analysis: Significantly improved processing speed by leveraging Gfarm and Pwrake
- NICT Science Cloud: Real-time processing, high-speed data visualization, and instant analysis of big data