25 Facts about Amazon S3

Amazon S3 may stand for Simple Storage Service, but it is anything but simple. Last week I read about a data breach in which some S3 buckets containing sensitive company information were found to be exposed to the public! This shows that although S3 is an easy way to store data, you need to pay attention to the rules applied around that data to keep it secure and durable.

What is S3?

According to AWS, S3 is highly durable and highly scalable object storage that is optimized for reads and is built with an intentionally minimalistic feature set. It provides a simple and robust abstraction for file storage that frees you from many of the underlying details you normally have to deal with in traditional storage.

All objects in S3 are stored in containers called buckets. Objects can be any type of data: structured data such as exports from your database, unstructured data such as text files, photos and videos, or semi-structured data such as JSON and XML files.

Here are 25 facts about S3 you need to know before you start using the service.

#1 A newly created bucket is PRIVATE by default. Access to its objects has to be granted explicitly, for example through a bucket policy or an IAM policy.
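To make fact #1 concrete, here is a minimal sketch of a bucket policy that opens read access to objects under a single prefix. The bucket name `example-bucket`, the prefix `public/` and the helper name `public_read_policy` are made up for illustration; the JSON layout follows the standard IAM policy document format.

```python
import json

def public_read_policy(bucket: str, prefix: str = "") -> str:
    """Return bucket policy JSON that lets anyone GET objects under `prefix`."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadForPrefix",
                "Effect": "Allow",
                "Principal": "*",           # anyone, including anonymous users
                "Action": "s3:GetObject",   # read-only: no listing, no writes
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }
    return json.dumps(policy, indent=2)

# Hypothetical bucket and prefix, for illustration only.
print(public_read_policy("example-bucket", "public/"))
```

Keep in mind that Block Public Access settings on the account or the bucket can still override a policy like this, so a policy alone may not be enough to expose anything.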

#2 S3 is natively a REST API. Standard HTTP or HTTPS (recommended) requests can be used to perform CRUD operations on S3.
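As a rough sketch of what "REST API" means here, the snippet below maps CRUD operations to the HTTP verbs used for objects and builds a virtual-hosted-style object URL. The helper names, the bucket `example-bucket` and the default region are assumptions for illustration. Real requests to a private bucket also need to be signed, which this sketch leaves out.

```python
from urllib.parse import quote

# CRUD operations mapped to the HTTP verbs S3 uses for objects.
# An "update" is simply a PUT that overwrites the existing key.
CRUD_TO_HTTP = {"create": "PUT", "read": "GET", "update": "PUT", "delete": "DELETE"}

def object_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Virtual-hosted-style HTTPS URL; the key is percent-encoded, '/' is kept."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key, safe='/')}"

def describe_request(operation: str, bucket: str, key: str) -> str:
    """Return the request line a client would send for the given operation."""
    return f"{CRUD_TO_HTTP[operation]} {object_url(bucket, key)}"

print(describe_request("read", "example-bucket", "2019/05/report 1.csv"))
```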

#3 Each object in S3 is uniquely identified by the combination of bucket name + key (the object name) + version ID. The version ID applies only when versioning is enabled.

#4 Each object in S3 contains data and metadata. The data is the contents of the file; the metadata is details about the file, such as its size and creation date.

#5 Even though an S3 bucket has a flat structure, a logical hierarchy can be created with key prefixes to represent sub-folders within a bucket. Example: mybucket/year/month/filename
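The "folders" in fact #5 are just a way of reading keys. The sketch below works on a plain Python list of keys, with no AWS calls, and shows how a prefix and a delimiter turn flat keys into a folder-style listing. This loosely mirrors the prefix and delimiter options of the S3 list operation; the function name and the sample keys are made up.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Return (sub_folders, objects) found directly under `prefix`."""
    folders, objects = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Something deeper exists: report only the next "folder" level.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(folders), sorted(objects)

# Sample keys, all living side by side in one flat bucket.
keys = ["2019/01/a.csv", "2019/02/b.csv", "2020/01/c.csv", "readme.txt"]

print(list_level(keys))            # top level of the bucket
print(list_level(keys, "2019/"))   # what sits "inside" 2019/
```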

#6 Two commonly used storage classes, Standard and Standard-Infrequent Access (Standard-IA), offer the same durability, high throughput and low latency. Standard-IA costs less to store but charges for retrieval, so you can choose a class based on how frequently your data is accessed.

#7 S3 allows each object to be up to 5 TB. To upload large files faster, the multipart upload option can be used.

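#8 A single PUT request can upload at most 5 GB, so larger objects have to use multipart upload. The high-level AWS CLI commands (such as aws s3 cp) switch to multipart upload automatically for large files; the default threshold is 8 MB.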

#9 Multipart upload is a three-step process: initiate the upload, upload each part, and explicitly complete the upload so that the entire object appears on S3.
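The three steps in fact #9 can be sketched as follows. The function takes any client object exposing the same method names as a boto3 S3 client (`create_multipart_upload`, `upload_part`, `complete_multipart_upload`, `abort_multipart_upload`); the function name and the 8 MiB default part size are my own choices. Treat it as a sketch rather than production code: there are no retries or parallel uploads, and an empty stream is not handled.

```python
PART_SIZE = 8 * 1024 * 1024  # assumed default; S3 needs >= 5 MiB per part, except the last

def multipart_upload(s3, bucket, key, stream, part_size=PART_SIZE):
    """Upload `stream` (a binary file-like object) in parts; return the part list."""
    # Step 1: initiate the upload and remember its ID.
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    try:
        # Step 2: upload each part, keeping the ETag S3 returns for it.
        part_number = 1
        while True:
            chunk = stream.read(part_size)
            if not chunk:
                break
            response = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number, Body=chunk,
            )
            parts.append({"PartNumber": part_number, "ETag": response["ETag"]})
            part_number += 1
        # Step 3: explicitly complete; only now does the object become visible.
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the already-uploaded parts do not linger in the bucket.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
    return parts
```

With boto3 installed and credentials configured, a call would look roughly like `multipart_upload(boto3.client("s3"), "example-bucket", "big.bin", open("big.bin", "rb"))`, where the bucket and file names are placeholders.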

#10 When objects stored in one region need to be accessed from other regions, cross-region replication can be applied to reduce the latency of accessing them.

#11 Two prerequisites for cross-region replication are: an IAM role that gives S3 permission to replicate objects, and versioning turned on for both the source and destination buckets.

#12 When cross-region replication is turned on for a bucket that already contains objects, only objects created after that point are replicated automatically.
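The prerequisites in facts #10 to #12 translate into three API calls, made in order. The sketch below only builds the arguments for those calls as plain dictionaries, shaped the way I understand boto3's `put_bucket_versioning` and `put_bucket_replication` expect them. The function name, rule ID, bucket names and role ARN are placeholders, and the exact set of required rule fields is worth checking against the current API reference.

```python
def replication_setup(source_bucket, dest_bucket, role_arn, prefix=""):
    """Return the ordered (api_call_name, kwargs) steps to enable replication."""
    versioning_on = {"Status": "Enabled"}
    rule = {
        "ID": "replicate-" + (prefix or "all"),   # hypothetical rule name
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {"Prefix": prefix},             # "" matches every key
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
    }
    return [
        # Prerequisite: versioning on BOTH buckets.
        ("put_bucket_versioning",
         {"Bucket": source_bucket, "VersioningConfiguration": versioning_on}),
        ("put_bucket_versioning",
         {"Bucket": dest_bucket, "VersioningConfiguration": versioning_on}),
        # Prerequisite: an IAM role that S3 can assume to copy objects.
        ("put_bucket_replication",
         {"Bucket": source_bucket,
          "ReplicationConfiguration": {"Role": role_arn, "Rules": [rule]}}),
    ]

steps = replication_setup(
    "example-source", "example-destination",
    "arn:aws:iam::123456789012:role/example-replication-role",
)
for name, kwargs in steps:
    print(name, kwargs["Bucket"])
```

Each step could then be applied with something like `getattr(client, name)(**kwargs)` on a boto3 S3 client.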

#13 Versioning, once activated on a bucket, cannot be turned off; it can only be suspended.

#14 An additional layer of protection for versioned data can be added with MFA Delete. With it enabled, permanently deleting an object version or changing the versioning state of the bucket requires a multi-factor authentication code, and only the root account can turn MFA Delete on.

#15 The object owner can make a private object accessible for a limited time to anyone who holds the link, by generating a presigned URL.
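To show the idea behind the presigned URL in fact #15, here is a toy version: the URL carries an expiry time plus a signature over the URL and that expiry, so it cannot be altered or used late. This is NOT the real AWS signing scheme (Signature Version 4); the function names, the secret and the query parameter names are all invented for the illustration.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

def _signature(base_url: str, expires_at: int, secret: bytes) -> str:
    """HMAC over the URL and its expiry; changing either breaks the match."""
    message = f"{base_url}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def sign_url(base_url: str, secret: bytes, expires_at: int) -> str:
    """Append the expiry (Unix seconds) and a signature to the URL."""
    query = urlencode({"expires": expires_at,
                       "signature": _signature(base_url, expires_at, secret)})
    return f"{base_url}?{query}"

def is_valid(url: str, secret: bytes, now: int) -> bool:
    """Server-side check: signature must match and the link must not be expired."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    try:
        expires_at = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(base_url, expires_at, secret)
    return hmac.compare_digest(signature, expected) and now < expires_at

# Hypothetical object URL and secret; times are plain integers for clarity.
url = sign_url("https://example-bucket.s3.amazonaws.com/report.csv", b"secret", 1000)
print(is_valid(url, b"secret", now=999))    # still inside the time window
print(is_valid(url, b"secret", now=1000))   # window has closed
```

In practice you would not sign by hand. With boto3, for example, the call is `client.generate_presigned_url("get_object", Params={"Bucket": ..., "Key": ...}, ExpiresIn=seconds)`.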

#16 Data moving in and out of S3 can be encrypted in transit with SSL/TLS by using the HTTPS endpoints.

#17 Data at rest can be encrypted on the server side, or on the client side before it is sent to S3. The three types of server-side encryption are described below.

#18 SSE-S3 (server-side encryption with S3-managed keys): each object is encrypted with its own key, and that key is itself encrypted with a master key that is rotated regularly. All keys are handled by AWS.

#19 SSE-KMS (server-side encryption with AWS Key Management Service) adds two advantages: separate permissions are required to use the envelope key that protects the data encryption key, and an audit trail shows who used the key and which object was accessed.

#20 SSE-C (server-side encryption with customer-provided keys): the customer manages the encryption key and AWS performs the encryption.
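From the caller's side, the three server-side options in facts #18 to #20 mostly differ in which extra parameters go along with the upload. The sketch below builds those extra keyword arguments, using the parameter names of boto3's `put_object` as I know them. The helper name and the mode labels are my own, and the KMS key ID shown is a placeholder.

```python
def sse_put_kwargs(mode, kms_key_id=None, customer_key=None):
    """Extra put_object keyword arguments for the chosen server-side encryption."""
    if mode == "SSE-S3":
        # AWS creates, stores and rotates every key.
        return {"ServerSideEncryption": "AES256"}
    if mode == "SSE-KMS":
        kwargs = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:
            # Without an explicit key ID, the account's default KMS key is used.
            kwargs["SSEKMSKeyId"] = kms_key_id
        return kwargs
    if mode == "SSE-C":
        # The caller supplies the key with every request; S3 does not store it.
        if customer_key is None or len(customer_key) != 32:
            raise ValueError("SSE-C needs a 32-byte (256-bit) key")
        return {"SSECustomerAlgorithm": "AES256", "SSECustomerKey": customer_key}
    raise ValueError(f"unknown mode: {mode}")

print(sse_put_kwargs("SSE-S3"))
print(sse_put_kwargs("SSE-KMS", kms_key_id="alias/example-key"))  # placeholder key
```

The result would be merged into an upload call, roughly `client.put_object(Bucket=..., Key=..., Body=..., **sse_put_kwargs("SSE-S3"))`.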

#21 For data archival, the Glacier service can be used for long-term storage at a much lower cost. An archive can be a single file or a set of files, and can be up to 40 TB in size.

#22 Archives are stored in vaults on Glacier. A Vault Lock policy can be applied to enforce data retention and prevent the data from being edited.

#23 There are different kinds of retrieval jobs for Glacier: Standard retrieval takes 3–5 hours, Bulk retrieval takes 5–12 hours, and Expedited retrieval takes 1–5 minutes. A ranged retrieval fetches only part of an archive by specifying a byte range.
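A small sketch of how the retrieval times in fact #23 might drive a choice of tier. It picks the slowest tier whose worst-case wait still meets a deadline, on the assumption that slower tiers cost less. It also formats a byte range for a ranged retrieval in whole MiB, since Glacier expects ranges aligned to megabyte boundaries. Function names are invented; the hour figures are the ones quoted above.

```python
# Worst-case wait per tier, in hours, using the upper bounds quoted in fact #23.
WORST_CASE_HOURS = {"Expedited": 5 / 60, "Standard": 5, "Bulk": 12}

def pick_tier(max_wait_hours: float) -> str:
    """Slowest tier that still meets the deadline (assumed to be the cheapest)."""
    fitting = [tier for tier, hours in WORST_CASE_HOURS.items()
               if hours <= max_wait_hours]
    if not fitting:
        raise ValueError("no retrieval tier is fast enough for this deadline")
    return max(fitting, key=WORST_CASE_HOURS.get)

def byte_range(start_mib: int, length_mib: int) -> str:
    """Inclusive 'start-end' byte range covering whole MiB blocks."""
    mib = 1024 * 1024
    start = start_mib * mib
    end = start + length_mib * mib - 1
    return f"{start}-{end}"

print(pick_tier(24), pick_tier(6), pick_tier(0.5))
print(byte_range(0, 1))
```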

#24 When data is retrieved from Glacier, a temporary copy is made available on S3; the actual data still resides in Glacier.

#25 A lifecycle policy moves data from one storage class to another after a period of time. It can be applied to all objects in a bucket or to a subset of objects by specifying a prefix.
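As an example of fact #25, the sketch below builds a lifecycle configuration that moves objects under a prefix to Standard-IA and later to Glacier. The dictionary layout follows what boto3's `put_bucket_lifecycle_configuration` accepts as far as I know. The function name, rule ID, prefix and the 30 and 90 day defaults are illustrative choices.

```python
def lifecycle_config(prefix="", ia_after_days=30, glacier_after_days=90):
    """Lifecycle rules: Standard -> Standard-IA -> Glacier for keys under `prefix`."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-down-{prefix or 'all'}",   # hypothetical rule name
                "Filter": {"Prefix": prefix},          # "" applies to every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }

# Only objects whose keys start with "logs/" would be affected.
print(lifecycle_config("logs/"))
```

The returned dictionary would be passed as `LifecycleConfiguration` alongside the bucket name.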
