bash: splitting tarballs the ‘easy’ way

there are times when we tar and gzip a directory and the final tarball is just too damn big. maybe it doesn’t fit on any of those old thumbdrives we got at the 7-eleven, or maybe we’re trying to upload it to s3 and aws is complaining it’s too large.

let’s take a look at a quick and dirty way to split our tarballs into a bunch of smaller files of a set size.

segments of a 32gb tarball everywhere

we’ll go over creating the segmented tarballs first and then cover how to put them back together into something usable.

first, we can create our tarball parts like so:

$ tar czpf - /path/to/directory | split -d -b 10M - tarballname.tgz-part

this little pipeline has two components: a call to tar to pack up our directory and a call to split to slice it all up into ten meg chunks.

the tar part

we’re calling tar here with the switches czpf. let’s look at what those actually mean:

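  • -c this stands for ‘create’, as opposed to ‘extract’.
  • -z the ‘z’ is for ‘zip’. in this case, gzip.
  • -p this is for ‘preserve permissions’. tar records permissions in the archive either way; the switch matters mostly when extracting.
  • -f this means ‘file name’. if we don’t use this, tar will choose a destination for us (the TAPE environment variable or a compiled-in default, often a tape device), and we won’t like it.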
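
if the bundled switches are hard to read, here is a rough sketch of the same call written with long options. the spellings are GNU tar’s, the scratch directory and file names are made up for the demo, and -C (change into a directory before packing) is an extra switch that isn’t part of the original command.

```shell
# hypothetical scratch space, invented for this sketch
work=$(mktemp -d)
mkdir -p "$work/demo"
echo "hello" > "$work/demo/a.txt"

# short switches, bundled the way the post writes them
tar czpf "$work/short.tgz" -C "$work" demo

# the same call spelled out with long options (GNU tar spellings)
tar --create --gzip --preserve-permissions --file="$work/long.tgz" -C "$work" demo

# list both archives and compare what went into them
short_list=$(tar tzf "$work/short.tgz")
long_list=$(tar tzf "$work/long.tgz")
[ "$short_list" = "$long_list" ] && echo "same contents"
```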

the next interesting bit is the lone dash right after the switches. normally, when using tar with -f we provide an output file name followed by the directory we want to pack. in this example, we don’t want to name a file: split will be creating our file names for us. by providing that single - character where the file name would go, we’re telling tar to take all of our output and send it to STDOUT instead of to a file. this allows us to use our pipe operator.

generally speaking, when we see that single dash, we are using STDIN if reading or STDOUT if writing, instead of a file.

add all this together, and we have a tar call that is tarring and zipping our directory and dumping the output into the pipe operator for split to pick up.
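
a quick way to convince ourselves the dash does what we think is to pipe tar’s output somewhere other than split. this is only a sketch: the scratch directory and file names are invented, and -C is an extra switch used to keep the paths short.

```shell
# hypothetical scratch space, invented for this sketch
work=$(mktemp -d)
mkdir -p "$work/demo"
echo "hello" > "$work/demo/a.txt"
echo "world" > "$work/demo/b.txt"

# writing: -f - sends the gzipped archive to STDOUT, and wc counts the bytes in the pipe
bytes=$(tar czpf - -C "$work" demo | wc -c)
echo "archive size in bytes: $bytes"

# reading: a second tar takes the same stream from STDIN and lists what is inside
listing=$(tar czpf - -C "$work" demo | tar tzf -)
echo "$listing"
```

no archive file ever touches the disk here; everything travels through the pipe.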

the split part

the split command here is reading its input from STDIN through that pipe. there’s that single dash argument, again, telling split that we’re using the piped-in contents instead of a file.

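the -b 10M argument here is where we set the maximum size of our parts, in this case ten megs. although -b stands for ‘bytes’, we can use more convenient measurement units here: M for megabytes and G for gigs. in GNU split these are powers of 1024, so 10M is 10 * 1024 * 1024 bytes; MB and GB are the powers-of-1000 versions.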
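
to see the size limit in action without waiting on a huge tarball, here is a small sketch that assumes GNU split. the 2500 bytes of zeros are a stand-in for a real archive, and the fake.tgz-part prefix is made up.

```shell
# hypothetical scratch space, invented for this sketch
work=$(mktemp -d)

# 2500 bytes from /dev/zero stand in for a tarball arriving on STDIN
head -c 2500 /dev/zero | split -d -b 1K - "$work/fake.tgz-part"

# 1K is 1024 bytes, so we expect two full parts and a 452 byte leftover
size0=$(wc -c < "$work/fake.tgz-part00")
size1=$(wc -c < "$work/fake.tgz-part01")
size2=$(wc -c < "$work/fake.tgz-part02")
echo "$size0 $size1 $size2"
```

the last part is whatever is left over, so it is usually smaller than the rest.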

lastly, we have the -d switch to tell split to add some numbers to the end of our output file names. the ‘d’ stands for ‘digits’. if we don’t use this switch, split will suffix our files with increments like ‘aa’, ‘ab’, and so on. hideous stuff. never send a letter to do a number’s job.
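
for a side-by-side look at the two suffix styles, here is a sketch that splits the same file twice, once without -d and once with it. it assumes GNU split; the file and the two prefixes are invented for the demo.

```shell
# hypothetical scratch space, invented for this sketch
work=$(mktemp -d)

# 3000 bytes of zeros stand in for a tarball
head -c 3000 /dev/zero > "$work/fake.tgz"

# without -d the parts come out as letters-partaa, letters-partab, letters-partac
split -b 1K "$work/fake.tgz" "$work/letters-part"

# with -d the parts come out as digits-part00, digits-part01, digits-part02
split -d -b 1K "$work/fake.tgz" "$work/digits-part"

ls -1 "$work"
```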

the tarballname.tgz-part argument is just the prefix for the names of all our tarball parts; split sticks its suffix directly onto the end. we can call it whatever we want.

after running this pipeline on a suitably large directory, with thename.tgz-part as our prefix, we’ll get a list of files that looks something like:

$ ls -1
thename.tgz-part00
thename.tgz-part01
thename.tgz-part02
thename.tgz-part03

great stuff.

putting it all back together

of course, splitting up a tarball is no good if we can’t re-assemble and extract it. let’s do that.

$ cat thename.tgz-part* | tar xzpf -

most people use cat as a quick way to dump a file to screen but, fun fact, ‘cat’ stands for ‘concatenate’. it can take a list of files and stick them all together in order.

here, we’re catting all the parts of our tarball together and dumping the output through a pipe.
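
if we’re nervous about whether the glued-together parts really equal the original tarball, a checksum comparison is one way to check. this sketch uses cksum, which is only a CRC but is fine for catching accidental damage. the directory and file names are made up, and it relies on the shell expanding the * glob in sorted order, which is what keeps part00, part01 and friends in sequence.

```shell
# hypothetical scratch space, invented for this sketch
work=$(mktemp -d)
mkdir -p "$work/src"
head -c 5000 /dev/urandom > "$work/src/data.bin"

# make one whole tarball, then slice it into 1K parts
tar czpf "$work/whole.tgz" -C "$work" src
split -d -b 1K "$work/whole.tgz" "$work/whole.tgz-part"

# checksum the original and the concatenated parts, both read from STDIN
before=$(cksum < "$work/whole.tgz")
after=$(cat "$work"/whole.tgz-part* | cksum)

[ "$before" = "$after" ] && echo "parts match the original"
```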

on the other side of the pipe, we’re calling tar with xzpf to extract our archive. note, again, that single dash that tells tar to accept input from STDIN and not a file.

when we’re done, our directory is on disk, untarred. victory.
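
to tie it all together, here is a sketch of the whole round trip on a throwaway directory: pack, split, glue, extract, then compare against the source. every path and name here is invented, the parts are 1K so the demo stays tiny, and -C is an extra switch used so the extraction lands in its own directory.

```shell
# hypothetical scratch space, invented for this sketch
work=$(mktemp -d)
mkdir -p "$work/src/project" "$work/parts" "$work/restore"
head -c 4000 /dev/urandom > "$work/src/project/data.bin"
echo "notes" > "$work/src/project/notes.txt"

# pack the directory and slice the stream into 1K parts
tar czpf - -C "$work/src" project | split -d -b 1K - "$work/parts/project.tgz-part"

# glue the parts back together and extract into a separate directory
cat "$work/parts"/project.tgz-part* | tar xzpf - -C "$work/restore"

# diff -r prints nothing and exits 0 when the two trees are identical
diff -r "$work/src/project" "$work/restore/project" && echo "round trip ok"
```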

Posted by grant horwood

co-founder of fruitbat studios. cli-first linux snob, metric evangelist, unrepentant longhair. all the music i like is objectively horrible. he/him.

         
