Google storage is a file storage service available from Google Cloud. Quite similar to Amazon S3 it offers interesting functionalities such as signed-urls, bucket synchronization, collaboration bucket settings, parallel uploads and is S3 compatible. Gsutil
, the associated command line tool is part of the gcloud
command line interface.
After a brief presentation of the Google Cloud Storage service, I will list the most important and useful gsutil
command lines and address a few of the service particularities.
The google storage platform is Google’s Entreprise storage solution. Google Storage offers a classic bucket based file structure similarly to AWS S3 and Azure Storage. Google Storage was introduced in may 2010 as Google Storage for Developers, a RESTful cloud service limited at the time to a few hundreds developers. gsutil
the command line tool associated with Google Storage was released at the same time.
The google storage platform is Google’s Entreprise storage solution. Google Storage offers a classic bucket based file structure similarly to AWS S3 and Azure Storage. Google Storage was introduced in may 2010 as Google Storage for Developers, a RESTful cloud service limited at the time to a few hundreds developers. gsutil
the command line tool associated with Google Storage was released at the same time.
Fast forward to 2018, Google Storage now offers 3 levels of storage with different accessibility and pricing.
- standard storage is for fast access to large amounts of data. It offers high speed of response to requests.
- DRA is for long-term data storage and infrequent access and is priced lower than standard storage.
- Nearline storage is for even less frequent access and offers longer response times. It is the cheapest option.
Google Storage price structure depends on location and storage class and evolves frequently. At time of writing prices are $0.026 per Gb-month for Standard , $0.01 for Nearline and as low as $0.007 for Coldline storage with a Multi-regional location. See the pricing page for uptodate prices. See also Google Cloud Storage on a shoestring budget
for an interesting cost breakdown.
A distinct trait of Google Storage structure is that folders and subfolders within a bucket are not associated with a “physical” structure as they would be on your local machine. On Google Storage, buckets have virtual folders. The full path to a file is interpreted as being the entire filename.
Consider for instance, the file hello_world.txt
located in mybucket/myfolder/
. The file’s URL is: gs://mybucket/myfolder/hello_world.txt
. Google Storage interprets that file has having the filename myfolder/hello_world.txt
.
The slash /
character is part of the object filename instead of being an indication of an existing folder. As Google calls it, this object naming scheme creates ” the illusion of a hierarchical file tree atop the “flat” name space”.
Although this is transparent most of the time, virtual paths may results in misplaced files when uploading a folder with multiple subfolders. If the upload fails and needs to be restarted, the copy command will have unexpected results since the folder did not exist in the first upload but does with the second try.
In order to avoid these weird cases, the best practice, is to make sure to start by creating the expected folder structure and only then upload the files to their target folders.
gsutil
Gsutil
is the command line tool used to manage buckets and objects on Google Storage. It is part of the gcloud
shell scripts. Gsutil
is fully open sourced on github, and under active development.
Gsutil goes well beyond simple file transfers with an impressive lists of advanced gsutil features, including:
- ACLs: setting access control via Access Control Lists
- rsync: synchronizing folders and buckets
- lifeline: defining lifecycle rules
- signed urls: setting time limited online access ()
- perfdiag for troubleshooting
- logging, notifications and versioning
Before diving in these powerful functionalities, let’s walk through a simple case of file transfer.
If you don’t have gsutil
installed on your local machine or cloud instance, follow the Google Cloud SDK install instructions for your OS in order to get started. You may need to sign up for a free trial account.
Getting around with gsutil
In the following examples, I create a bucket, upload some files, get information on these files, move them around and change the bucket storage class.
- First things first. In order to get
help
ongsutil
or anygsutil
sub commands:
$ gsutil help
$ gsutil
- Now create a bucket named
<bucketname>
All buckets names share a single global Google namespace and must not be already taken.
$ gsutil mb gs://<bucketname>
Note that there are certain restrictions on bucket naming and creation beyond the uniqueness condition. For instance you cannot change the name of an existing bucket, and a bucket name cannot include the word google.
- Upload and download a file with
cp
$ gsutil cp
- And transfer a file between buckets:
$ gsutil cp gs://<bucket_A>/<remote_file> gs://<bucket_B>/
- Create a folder in a bucket with
cp
$ gsutil cp <new_folder> gs://<bucketname>/
- Upload a file to a
<new_folder>
withcp
$ gsutil cp <local_file> gs://<bucketname>/<new_folder>/
This will create the folder <new_folder>
and at the same time upload the file <local_file>
to that folder. Note the trailing /
that tells gsutil
to actually interpret <new_folder>
as a new folder and not as the target filename. If you omit the trailing /
gsutil will rename the file with the filename <new_folder>
once uploaded and the new folder will not be created.
- List the folder with
ls
$ gsutil ls gs://<bucketname>/
- Check storage space with
du
$ gsutil du -h gs://<bucketname>/
where the -h
flag makes it human readable
The -r
and -m
flags
- Copy a local folder and its content to a bucket with
cp -r
$ gsutil cp -r ./<local_folder> gs://<bucketname>/
Consider for instance a local ./img
directory that contain several image files. We can copy that entire local directory and create the remote folder at the same time with the following command:
$ gsutil cp -r ./img gs://<bucketname>/
The bucket now has the virtual folder /img
.
- Improve performance with the
-m
flag
When moving large number of files, adding the -m
flag to cp
will run the transfers in parallel and significantly improve performance provided you are using a reasonably fast network connection.
gsutil supports *
and ?
wildcards only for files. To include folders in the wildcard target you need to double the *
or ?
sign. For instance, gsutil ls gs://<bucketname>/**.txt
will list all the text files in all subdirectories. The wildcard page offers more details.
Gsutil full configuration
Gsutil full configuration is available in the ~/.boto
file. You can edit that file directly or via the gsutil config
command. Some interesting parameters are:
parallel_composite_upload_threshold
: to specify the maximum size of a file to be uploaded in a single stream. Files larger than this threshold will be uploaded in parallel. The parallel_composite_upload_threshold
parameter is disabled by default.
check_hashes
: to enforce integrity checks when downloading data, always, never or conditionally.prefer_api
: to specify the API to use when interacting with cloud storage providers (S3, GCS, …)- and
aws_access_key_id
andaws_secret_access_key
for interoperability with S3.
Cloud storage compatibility is powerful. Not only can you migrate easily from AWS S3 to GCP or vice versa but you can also sync S3 buckets and GCP buckets with the rsync command. ACL As stated in the documentation, Access Control Lists (ACLs) allow you to control who can read and write your data, and who can read and write the ACLs themselves. ACL are assigned to objects (files) or buckets. By default all files in a bucket have the same ACL as the bucket they’re in. ACL has 3 commands
- GET: lists the permissions on a given object. For instance
gsutil acl get gs://<bucketname>/
outputs the access settings for the<bucketname>
bucket. - SET: sets the permissions on a given object. The best way to set the permissions and avoid mistakes is by first exporting them to a file with
gsutil acl get gs://<bucketname>/<filename> act.txt
, modify the acl.txt file and then set the new permissions withgsutil acl set acl.txt gs://bucket/<filename>
- CH: for change, modifies the current permissions on a given object. For instance to grant WRITE access to a user
gsutil acl ch -u [email protected]:WRITE gs://<bucketname>/
The default settings for buckets are defined with the defacl
command which also responds to get
, set
and ch
subcommands. The command gsutil defacl get gs://<bucketname>/
will return the default settings for the bucket <bucketname>
.
Several pre defined setings are available:
- project-private: is the default setting for new objects. It gives permission to the project team based on their roles. All team members have READ permission while editors and owners have OWNER permission.
- private: Gives the requester OWNER permission for a bucket or object
- public-read: Opens the objects to the whole internet as it gives all users read permission.
- public-read-write: The dangerous setting that allows anyone on the internet to upload files to your bucket.
Further ACL details are available in the ACL page
rsync
The gsutil rsync
makes the content of a target folder identical to the content of a source folder by copying, updating or deleting any file in the target folder that has changed in the source folder. This synchronization works across local and GCP folders as well as other gsutil cloud compatible storage solutions such as AWS S3. With the gsutil rsync
command you have everything you need to create an automatic backup of your data in the cloud. The rsync command follows:
$ gsutil rsync <source folder> <target folder>
Consider a local folder ./myfolder
and the <bucketname>
bucket, the following command synchronizes the content of the local folder with the storage bucket:
$ gsutil -m rsync -r -d ./myfolder gs://<bucketname>
The content of gs://<bucketname>
will match the content of your local ./myfolder
directory, effectively backing up the local documents.
- Note the presence of the
-r
flag which ensures that all subfolders are matched. - The
-d
flag is to be used with caution as it will delete the content in the target when deleted from the source. If you inadvertently make a mistake in your command, for instance inverting the source and target folders, you may end up deleting your content. A good way to ensure that does not happen is to enable bucket versioning.
If you don’t want to have to run the gsutil
command every time you make a change in the source folder, you can set up a cron job on your local with crontab -e
or the equivalent for windows machines. For instance the following cron job will backup your local folder to Google Cloud every 15mn.
\*/15 * * * * gsutil -m rsync -r -d < full path to myfolder> gs://mybucket >> <full path to log file>
Bucket versioning
Bucket versioning is a powerful feature that prevents any file deletion by mistake. Enabling and disabling versioning is done at the bucket level with the command:
$ gsutil versioning set on gs://
Google Cloud Storage is a fully featured enterprise level service which offers a viable alternative to AWS S3. Prices, scalability, and reliability are key features of the service. I’ve been using Google Storage for awhile across different projects and find it very user friendly. Definitely worth testing if you need to store significant amount of data.
Google Storage price structure depends on location and storage class and evolves frequently. At time of writing prices are $0.026 per Gb-month for Standard , $0.01 for Nearline and as low as $0.007 for Coldline storage with a Multi-regional location. See the pricing page for uptodate prices. See also Google Cloud Storage on a shoestring budget for an interesting cost breakdown.
The slash /
character is part of the object filename instead of being an indication of an existing folder. As Google calls it, this object naming scheme creates ” the illusion of a hierarchical file tree atop the “flat” name space”.
In order to avoid these weird cases, the best practice, is to make sure to start by creating the expected folder structure and only then upload the files to their target folders.
Gsutil goes well beyond simple file transfers with an impressive lists of advanced gsutil features, including:
-
ACLs: setting access control via Access Control Lists
-
rsync: synchronizing folders and buckets
-
lifeline: defining lifecycle rules
-
signed urls: setting time limited online access ()
-
perfdiag for troubleshooting
-
logging, notifications and versioning
Getting around with gsutil
In the following examples, I create a bucket, upload some files, get information on these files, move them around and change the bucket storage class.
- Now create a bucket named
<bucketname>
$ gsutil mb gs://
1 |
|
- Upload a file to a
<new_folder>
withcp
$ gsutil cp
1 |
|
where the -h
flag makes it human readable
$ gsutil cp -r ./img gs://
1 |
|
The content of gs://<bucketname>
will match the content of your local ./myfolder
directory, effectively backing up the local documents.
-
Note the presence of the
-r
flag which ensures that all subfolders are matched. -
The
-d
flag is to be used with caution as it will delete the content in the target when deleted from the source. If you inadvertently make a mistake in your command, for instance inverting the source and target folders, you may end up deleting your content. A good way to ensure that does not happen is to enable bucket versioning.
Bucket versioning is a powerful feature that prevents any file deletion by mistake. Enabling and disabling versioning is done at the bucket level with the command:
1 |
|
To retrieve the correct version, simply append the version number to the object name in the cp command.
Signed URLs
Signed URLs is a mechanism for query string authentication for buckets and objects. In other words, Signed urls provide a way to give time-limited read or write access to anyone in possession of the URL, regardless of whether they have a Google account.
To create a signed url you first need to generate a generate a private key following these instructions. Click on Create a service account key
, select your project, and download the JSON file that contains your private key.
$ gsutil signurl -d 10m -m GET <path/private_key.json> gs://
1 |
|
For instance, in order to delete the content of the bucket after 30 days, the config file would be: Example: delete after 10 days
1 |
|
Check the lifecycle configurations page for more info.
If you liked this post, please share it on twitter And leave me your feedback, questions, comments, suggestions below. Much appreciated :)