Storage
Storage refers to the methods and systems used to save, manage, and retrieve data. It is essential because data science workflows involve large volumes of data, often ranging from gigabytes to petabytes, that need efficient and reliable storage to support analysis and modeling. Four kinds of storage can be distinguished: a) local storage, which keeps data on physical devices such as hard drives or SSDs in personal computers or local servers; b) cloud storage, which uses internet-based services (such as Google Cloud Storage) to store and manage data; c) distributed storage, which divides data across multiple machines, allowing for parallel processing and fault tolerance; and d) database storage, which organizes data in structured systems such as relational (SQL) databases (e.g., MySQL, PostgreSQL) or NoSQL databases. Other important topics are storage speed and data retrieval times, backup solutions, and data formats that balance file size against read and write speed.
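How distributed storage combines data division with fault tolerance can be illustrated with a minimal sketch. It assumes a simple hash-based placement with two copies per object; the node names, file names, and the functions place_replicas and readable_after_failure are hypothetical, and production systems use more elaborate placement algorithms.

```python
import hashlib

def place_replicas(key: str, nodes: list[str], replicas: int = 2) -> list[str]:
    """Pick `replicas` distinct nodes for `key` using a stable hash."""
    if not 1 <= replicas <= len(nodes):
        raise ValueError("replicas must be between 1 and the number of nodes")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Take consecutive nodes (wrapping around) so the copies land on different machines.
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

def readable_after_failure(key: str, nodes: list[str], failed: set[str],
                           replicas: int = 2) -> bool:
    """Return True if at least one copy of `key` sits on a node that has not failed."""
    return any(node not in failed for node in place_replicas(key, nodes, replicas))

# Hypothetical cluster and data set partitions, for illustration only.
NODES = ["node-a", "node-b", "node-c", "node-d"]
KEYS = ["part-000.csv", "part-001.csv", "part-002.csv"]

placement = {key: place_replicas(key, NODES) for key in KEYS}
for key, holders in placement.items():
    print(key, "->", holders)
```

Because every object is stored on two different nodes, the loss of any single node leaves at least one copy readable, and different partitions can be read from different machines in parallel.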
Example Implementations:
- Ceph (distributed storage system)
- GWDG (Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen)
- Data Science Storage at LRZ
Standards:
- S3 object storage
- ZFS (Zettabyte File System, as a de facto standard)
- GPFS (General Parallel File System, a de facto standard)
- RAID (Redundant Array of Independent Disks)
- ext4 (fourth extended filesystem, a widely used Linux file system)
- Btrfs (B-Tree File System, a copy-on-write file system for Linux)
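
As an illustration of the redundancy idea behind RAID, the following minimal sketch shows XOR parity as used in parity-based levels such as RAID 5: the parity block is the byte-wise XOR of the data blocks, so any single lost block can be rebuilt from the remaining ones. The block contents and the helper xor_blocks are assumptions made for this example; real arrays operate on disk stripes, not short byte strings.

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equally sized blocks byte by byte."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# Hypothetical contents of three data disks.
data_blocks = [b"ABCD", b"EFGH", b"IJKL"]

# The parity block would be stored on an additional disk.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk and rebuild it from the survivors plus parity.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

The rebuild works because XOR-ing a value with itself cancels it out, which leaves only the missing block.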