7zip vs Gzip … compression and speed
Since we work with Twitter, this comparison uses Twitter as its data source. We randomly selected 5,000 users and used their public tweets for the test.
Setup
The setup is not especially important, since our conclusions rest on ratios rather than absolute numbers. For the sake of completeness: the tests were run in Python 2.7.3 (GCC 4.7.2) with the latest version of pylzma from PyPI (0.4.4) and gzip from the Python standard library (the gzip module is built on zlib, which is how the results below are labeled). The CPU was an E5-2620 @ 2.00GHz, and the software was allowed to use the full capacity of this hardware.
Data
As mentioned, a random sample of 5,000 users was selected for this experiment. The following are some statistics on the data.
Users: 5,000
Avg # of tweets: 1,433
Std dev (tweets): 1,018
Avg uncompressed size: 1,228,518 bytes (1.2 MB)
Std dev (size): 874,973 bytes (0.8 MB)
Compression
There is no doubt that lzma produces smaller output than zlib. In these tests we want to see exactly how much better the compression ratio is.
lzma avg size: 100,286 bytes (8.2%, 97 KB)
zlib avg size: 142,456 bytes (11.6%, 139 KB)
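As a quick check on the percentages, the ratios can be recomputed from the averages reported above. This is only an arithmetic sketch, written for Python 3 (the original tests ran on Python 2.7.3); the constant and function names are made up for illustration.

```python
# Average sizes in bytes, copied from the figures reported in this post.
AVG_UNCOMPRESSED = 1228518
AVG_LZMA = 100286
AVG_ZLIB = 142456


def percent_of_original(compressed, original):
    """Return the compressed size as a percentage of the original size."""
    return 100.0 * compressed / original


lzma_pct = percent_of_original(AVG_LZMA, AVG_UNCOMPRESSED)
zlib_pct = percent_of_original(AVG_ZLIB, AVG_UNCOMPRESSED)

# How much smaller the lzma output is relative to the zlib output.
lzma_vs_zlib_pct = 100.0 * (1 - AVG_LZMA / AVG_ZLIB)

print("lzma: %.1f%% of original" % lzma_pct)
print("zlib: %.1f%% of original" % zlib_pct)
print("lzma output is %.1f%% smaller than zlib output" % lzma_vs_zlib_pct)
```

The last figure is the one to look at when comparing the two compressors directly: the gap between 8.2% and 11.6% of the original means the lzma output is a bit under a third smaller than the zlib output.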
Speed
lzma compresses better and faster than bzip2, but zlib is well known to be faster still. Here is the compression speed difference on our dataset.
lzma: 3884.27s (776.8ms / user)
zlib: 184.40s (36.9ms / user)
Note: the lzma test used 2x as much memory as the zlib test.
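For anyone who wants to try a similar comparison, here is a minimal timing sketch. It is not the script used for this post: it assumes Python 3 and the standard library `lzma` module instead of pylzma 0.4.4, uses default compression settings (the settings used in our tests are not recorded here), and runs on a synthetic payload rather than real tweets. The helper name `time_compress` is hypothetical.

```python
import lzma
import time
import zlib


def time_compress(compress, payload, repeats=3):
    """Return (best_seconds, compressed_size) for one compressor on one payload."""
    best = float("inf")
    size = 0
    for _ in range(repeats):
        start = time.perf_counter()
        size = len(compress(payload))
        best = min(best, time.perf_counter() - start)
    return best, size


# Synthetic stand-in for one user's tweets; NOT the dataset used in this post.
payload = b"".join(
    ("tweet %d: just setting up my account #example\n" % i).encode("utf-8")
    for i in range(1433)
)

results = {}
for name, compress in (("zlib", zlib.compress), ("lzma", lzma.compress)):
    seconds, size = time_compress(compress, payload)
    results[name] = (seconds, size)
    print("%s: %d -> %d bytes in %.2f ms" % (name, len(payload), size, seconds * 1000))
```

Taking the best of a few repeats reduces noise from other processes. Absolute timings will differ from the numbers above, since both the library and the data are different.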
Conclusion
With our dataset, lzma compressed down to an average of 8% of the original size, while zlib compressed to 12%. In concrete numbers, for 5,000 users' tweets lzma would save about 200 MB, an average saving of 41 KB per user. Regarding compression speed, lzma (7zip) took about 1 hour longer, an average of 0.7s more per user.
To put things into perspective, if we were processing 1 million users, gzip would finish compressing about 9 days sooner but would need roughly 40 GB more space.
Neither the compression speed nor the size is really negligible in this case, so depending on your specific needs you may pick one over the other. Generally though, for many people and use cases, disk space is less of a concern than speed.
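The extrapolation can be reproduced from the totals reported above. The sketch below assumes that time and size scale linearly with the number of users and that sizes use binary units (1 GB = 1024^3 bytes); it is written for Python 3 and the names are made up for illustration.

```python
# Totals and averages copied from the results reported in this post.
USERS_TESTED = 5000
LZMA_SECONDS, ZLIB_SECONDS = 3884.27, 184.40
AVG_LZMA_BYTES, AVG_ZLIB_BYTES = 100286, 142456

extra_seconds_per_user = (LZMA_SECONDS - ZLIB_SECONDS) / USERS_TESTED
saved_bytes_per_user = AVG_ZLIB_BYTES - AVG_LZMA_BYTES


def extrapolate(users):
    """Return (extra days spent by lzma, GB saved by lzma) for a user count.

    Assumes linear scaling and binary units (1 GB = 1024**3 bytes).
    """
    extra_days = extra_seconds_per_user * users / 86400.0
    saved_gb = saved_bytes_per_user * users / 1024.0 ** 3
    return extra_days, saved_gb


days, gigabytes = extrapolate(1000000)
print("per user: %.2fs extra, %d bytes saved" % (extra_seconds_per_user, saved_bytes_per_user))
print("1M users: lzma takes %.1f more days, saves %.1f GB" % (days, gigabytes))
```

With decimal units (1 GB = 10^9 bytes) the saving comes out slightly higher, at around 42 GB.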