there are times when we tar and gzip a directory and the final tarball is just too damn big. maybe it doesn’t fit on any of those old thumbdrives we got at the 7-eleven, or maybe we’re trying to upload it to s3 and aws is complaining it’s too large.
let’s take a look at a quick and dirty way to split our tarballs into a bunch of smaller files of a set size. we’ll go over creating the segmented tarballs first and then cover how to put them back together into something usable.
first, we can create our tarball parts like so:
$ tar czpf - /path/to/directory | split -d -b 10M - tarballname.tgz-part
this little pipeline has two components: a call to tar to pack up our directory and a call to split to slice it all up into ten-meg chunks.
the tar part
we’re calling tar here with the switches czpf. let’s look at what those actually mean:
-c: this stands for ‘create’, as opposed to ‘extract’.
-z: the ‘z’ is for ‘zip’. in this case, gzip.
-p: this is for ‘preserve permissions’.
-f: this means ‘file name’. if we don’t use this, tar will choose a filename for us, and we won’t like it.
the next interesting bit is the lone dash as the first path argument. normally, when using tar with -f, we provide an input directory and an output file name as arguments. in this example, we don’t want to do that: split will be creating our name for us. by providing that single - character, we’re telling tar to take all of our output and send it to STDOUT instead of to a file. this allows us to use our pipe operator.
generally speaking, when we see that single dash, we are using STDIN if reading or STDOUT if writing, instead of a file.
add all this together, and we have a tar call that is tarring and zipping our directory and dumping the output into the pipe operator for split to pick up.
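we can see that STDOUT behaviour for ourselves before split even enters the picture. this sketch uses a throwaway directory under /tmp (the paths are made up for the demo, and -C is used so we archive a relative path instead of an absolute one):

```shell
# build a tiny directory to archive (paths here are just for the demo)
mkdir -p /tmp/tardemo/data
echo 'hello' > /tmp/tardemo/data/file.txt

# with 'f -', tar writes the archive to STDOUT, so we can pipe it anywhere.
# here we just count the compressed bytes instead of writing a file:
tar czpf - -C /tmp/tardemo data | wc -c
```

if that prints a number instead of creating a file, we know the archive went out through the pipe.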
the split part
the split command here is reading its input from STDIN through that pipe. there’s that single dash argument, again, telling split that we’re using the piped-in contents instead of a file.
the -b 10M argument here is where we set the maximum size of our parts, in this case ten megs. although -b stands for ‘bytes’, we can use more convenient measurement units here: M for megabytes and G for gigs.
lastly, we have the -d switch to tell split to add some numbers to the end of our output file names. the ‘d’ stands for ‘digits’. if we don’t use this switch, split will suffix our files with increments like ‘aa’, ‘ab’, and so on. hideous stuff. never send a letter to do a number’s job.
the tarballname.tgz-part argument is just the prefix for the names of all our tarball parts. we can call it whatever we want.
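to get a feel for split on its own, we can slice up a dummy file. the file names and chunk size below are made up for the demo:

```shell
# make a 1 MiB file of zeros to practise on
mkdir -p /tmp/splitdemo && cd /tmp/splitdemo
dd if=/dev/zero of=big.bin bs=1024 count=1024 2>/dev/null

# slice it into four numbered 256K parts: big.bin-part00 .. big.bin-part03
split -d -b 256K big.bin big.bin-part
ls -1 big.bin-part*
```

the parts always add back up to the original: cat them together and the byte count matches.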
after running this pipeline on a suitably large directory, we’ll get a list of files that looks something like:
$ ls -1
thename.tgz-part00
thename.tgz-part01
thename.tgz-part02
thename.tgz-part03
great stuff.
putting it all back together
of course, splitting up a tarball is no good if we can’t re-assemble and extract it. let’s do that.
$ cat thename.tgz-part* | tar xzpf -
most people use cat as a quick way to dump a file to screen but, fun fact, ‘cat’ stands for ‘concatenate’. it can take a list of files and stick them all together in order. here, we’re catting all the parts of our tarball together and dumping the output through a pipe.
on the other side of the pipe, we’re calling tar with xzpf to extract our archive. note, again, that single dash that tells tar to accept input from STDIN and not a file.
when we’re done, our directory is on disk, untarred. victory.
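if we’re feeling paranoid, we can rehearse the whole round trip on a scratch directory before trusting it with real data. everything below lives under a hypothetical /tmp/roundtrip:

```shell
set -e
# make some 'precious' data in a scratch area
mkdir -p /tmp/roundtrip/src
echo 'important data' > /tmp/roundtrip/src/notes.txt
cd /tmp/roundtrip

# tar, zip, and split it (1M parts; our tiny archive will fit in one part)
tar czpf - src | split -d -b 1M - backup.tgz-part

# reassemble and extract into a separate directory, then compare
mkdir -p restore
cat backup.tgz-part* | tar xzpf - -C restore
diff src/notes.txt restore/src/notes.txt && echo 'round trip OK'
```

if diff stays quiet and we see ‘round trip OK’, the split-and-reassemble dance didn’t lose a byte.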
co-founder of fruitbat studios. cli-first linux snob, metric evangelist, unrepentant longhair. all the music i like is objectively horrible. he/him.