Multithreaded Unzip?
Published by Matt Hicks under java, programming on Sunday, January 13, 2008
A co-worker and I were discussing the possibilities of performance gain doing extraction of ZIP files using multiple threads rather than the typically single-threading extraction that the majority (if not all) mainstream archive extraction utilities use and given Java's great ZIP file support built-in it seemed rather trivial to give this a shot, thus, UberZip was born.
UberZip is my simple little sample Java command-line application to extract files where you can specify the number of threads to utilize during extraction. My #1 biggest headache is extracting Eclipse from the ZIP file you get, so I decided that would be an ideal test of my little program. I downloaded the Eclipse J2EE bundle (3.1.1), which is a happy 132 meg with thousands of files.
I need to test further with a machine that actually can better utilize multiple threads, but here are the stats for my AMD Athlon 64 3200+ (Hyperthreaded, so I actually get a very slight benefit) running on Windows XP:
14.7 seconds with 1 thread
13.4 seconds with 5 threads
11.6 seconds with 30 threads
Anything above 30 threads seems to actually create more overhead than gain.
Now, to be fair I matched this up against the fastest unzipper I am aware of, 7zip (http://www.7-zip.org). It took right at about 14 seconds to unzip the file inside 7zip, which seems to fall nearly perfectly in-line with the single-threaded execution of UberZip.
For those of you that would be interested in taking a look at this very simple example, I have committed the source to my public repository:
svn://captiveimagination.com/public/uberzip/trunk
Further, if you'd simply rather download the JAR or EXE I have uploaded copies of it for those of you that would like to unzip files "uber" fast. :)
uberzip.exe
uberzip.jar
After taking this to try out on my work machine (Dual Xeon dual-core 3.2 GHz processors + 4 gig of memory = 8 processors in Windows) I got the following stats:
7zip - 25 seconds
UberZip (1 thread) - 22.17 seconds
UberZip (5 threads) - 19.5 seconds
UberZip (30 threads) - 21.6 seconds
Oddly it would seem about 5 threads is the "sweet-spot" for this machine. Though some of the results are a bit strange and it would seem on this machine the hard drives don't perform quite as well as my home machine it is obvious that at some level of configuration multithreading you can gain some good performance on the single-threaded applications out there.
Perhaps someone else will find this information useful and turn UberZip into a product people can use. ;)
UberZip is my simple little sample Java command-line application to extract files where you can specify the number of threads to utilize during extraction. My #1 biggest headache is extracting Eclipse from the ZIP file you get, so I decided that would be an ideal test of my little program. I downloaded the Eclipse J2EE bundle (3.1.1), which is a happy 132 meg with thousands of files.
I need to test further with a machine that actually can better utilize multiple threads, but here are the stats for my AMD Athlon 64 3200+ (Hyperthreaded, so I actually get a very slight benefit) running on Windows XP:
14.7 seconds with 1 thread
13.4 seconds with 5 threads
11.6 seconds with 30 threads
Anything above 30 threads seems to actually create more overhead than gain.
Now, to be fair I matched this up against the fastest unzipper I am aware of, 7zip (http://www.7-zip.org). It took right at about 14 seconds to unzip the file inside 7zip, which seems to fall nearly perfectly in-line with the single-threaded execution of UberZip.
For those of you that would be interested in taking a look at this very simple example, I have committed the source to my public repository:
svn://captiveimagination.com/public/uberzip/trunk
Further, if you'd simply rather download the JAR or EXE I have uploaded copies of it for those of you that would like to unzip files "uber" fast. :)
uberzip.exe
uberzip.jar
After taking this to try out on my work machine (Dual Xeon dual-core 3.2 GHz processors + 4 gig of memory = 8 processors in Windows) I got the following stats:
7zip - 25 seconds
UberZip (1 thread) - 22.17 seconds
UberZip (5 threads) - 19.5 seconds
UberZip (30 threads) - 21.6 seconds
Oddly it would seem about 5 threads is the "sweet-spot" for this machine. Though some of the results are a bit strange and it would seem on this machine the hard drives don't perform quite as well as my home machine it is obvious that at some level of configuration multithreading you can gain some good performance on the single-threaded applications out there.
Perhaps someone else will find this information useful and turn UberZip into a product people can use. ;)
15 comments:
The SVN seems to be down. I get a 404. I'd love to comment on the code. Can you email me a zip (haha) of the code to drig@noses.org? At your convenience, of course!
Have you benchmarked the code when the zip file is hosted on a shared drive? Off a local disk, the bottleneck will largely be disk I/O speed, which really doesn't benefit much from threading. But, in theory, thread-aware network file sharing, like NFS and SMB, should perform *much* better when threaded. In theory, of course.
I just wanted to let your know that UberZip is still impressive almost 5 years later. I just tried it on an i7-2720QM with a Samsung 830 SSD (using PGP full-disk encryption).
7zip: ~25 seconds
Windows 7: ~57 seconds
UberZip(1 thread): 22.4 seconds
UberZip(8 threads): 6.7 seconds
UberZip(16 threads): 5.9 seconds
Source file was a 400 MB zip of RTC client.
> here are the stats for my AMD Athlon 64 3200+ (Hyperthreaded [...])
I have no idea what were you on when you wrote this.
>Dual Xeon dual-core 3.2 GHz processors + 4 gig of memory = 8 processors in Windows [...]
> Oddly it would seem about 5 threads is the "sweet-spot" for this machine
It's not odd. It's natural. Server boards have slower memory but as you mentioned it's probably your hard drive that's the bottle neck there.
The logic behind n+1 threads (number of cores+1) is to better utilize the CPU by always having something for it to do. Depending on the CPU it might also work better with just n or n+2.
Jacob, good to know that even though this is code that's a few years old it still performs well. :)
Petteri, thanks for the feedback regarding CPUs.
Does uberzip also support password-protected zip files?
Duane
Duane, no, unfortunately not. UberZip was meant as a prototype and I never expected it to be as fast as it is.
I'm playing around with UberZip on an c3.8xlarge Windows 2008R2 instance on Amazon Web Services (32 vCPU's Intel Xeon E5-2680).
Unzipping a single file containing 136,012 unique files (20.3Gb uncompressed; 1.7Gb zipped).
7Zip: 10 mins 8.2 secs.
UberZip (16 threads): 2 mins 22.5 secs.
UberZip (32 threads): 2 mins 7.9 secs.
UberZip (64 threads): 2 mins 19.9 secs.
I also verified the unzipped content to check for corruption. It was perfect!
Well done.
Nick, That's some pretty impressive results. I've been thinking a lot recently of resurrecting this project. I wrote this so long ago I'm sure there are some significant optimizations I could make to make it run even faster. I just recently created a project for multi-threaded backup (https://github.com/outr/outrbackup). It is significantly faster than any other backup solution I've seen to-date if you're interested in such a thing.
Hi Matt,
I'm sure there is a real opportunity for a small command-line high-performance unzip utility. It would be good to remove the requirement to have a JRE installed too.
-Nick
Thanks -- much faster than 7-Zip for a second drive on my machine. Do you still have the svn posted? I'd like to add messages to the output so I can see the progress, and the time steps take.
D L, just for you I dusted off the archives and found this project, converted the source code to Scala (since Java is just so darn ugly), made a couple of minor optimizations, and created a GitHub project: https://github.com/outr/uberzip
It should be a little faster than the pre-built version referenced in this blog post. It wouldn't take much effort to optimize it further. I originally considered using Scala's Parallel Collections support, but figured that was premature optimization and honestly wasn't sure if it would help or hurt.
Let me know if you have any trouble with this. I've often thought about resurrecting this project into a graphical zip / unzip tool, I just haven't had the time.
Thank you, Mr Hicks sir, I greatly appreciate this. Before finding this on your site, I tried to write a multi-thread extract myself and couldn't get it to work, I think it always had concurrency issues. This is great -- and only 69 lines !! I wonder why more tools don't use the multi-threaded approach, if they find if doesn't work for any reason. But for me, this tool has cut my zip extraction time in half. Thank you again !!
UberZip tool code uploaded on github has a bug!!! The main thread never exists as executor doesn't decrements the counter in its thread after finishing the task.
Possible fix:
override def run(): Unit = {
UberZip.unzip(zip, entry, directory)
counter.decrementAndGet()
}
Note: In java context, counter will need to be marked final with this change, however, I believe Scala won't require that
I added a bug to the GitHub repo, but UberZip is currently a lot less efficient than it could be. You're currently creating only one ZipFile instance, and sharing that with all worker tasks. There's a lock around all ZipFile methods (check out the source code) that prevents more than one thread from doing work at any given time. This is why in your blog post you weren't getting the performance you expected.
If you want multiple threads to read from the same zipfile in a scalable way, you must open one ZipFile instance per thread. That way, the per-thread lock in the ZipFile methods does not block all but one thread from reading from the zipfile at one time. (It also means that when each thread closes the ZipFile after they're done reading, they close their own instance, not the shared instance, so you don't get an exception on the second and subsequent close.)
This post inspired me to write my own parallel unzipper, which uses one ZipFile instance per thread (this is the only way to actually unzip in parallel using the standard ZipFile API, as I described in my previous comment -- any other scalability wins you're seeing is from copying from an InputStream's buffer into a file in parallel).
I released a version of my parallel unzip code here for comparison. Feel free to benchmark it on different zipfiles and different machines. It's twice as fast as InfoZip on my machine when unzipping the Eclipse distribution zipfile.
I didn't benchmark against your Scala version, but this should be faster, because of using one ZipFile instance per thread.
https://github.com/lukehutch/quickunzip
Post a Comment