compression
Reducing data size by eliminating redundancy. Encoding information in fewer bits than the original representation.
lossless
Every bit of original data is perfectly recoverable.
- Huffman coding: variable-length codes based on symbol frequency (see the sketch after this list)
- Lempel-Ziv (LZ77, LZ78, LZW): dictionary-based, replacing repeated patterns with references. Used in gzip, PNG
- arithmetic coding: encoding entire messages as single numbers in [0,1)
- run-length encoding: replacing consecutive identical symbols with count + symbol
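A minimal Huffman coding sketch in Python to make the "variable-length codes by frequency" idea concrete; the helper name huffman_codes and the sample string are made up for illustration, and this is not a production codec.

```python
# Build a prefix code from symbol frequencies: repeatedly merge the two
# least-frequent subtrees, then read codes off the final tree.
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    freq = Counter(text)
    # heap entries: (frequency, unique tiebreaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal node)
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate single-symbol input
        return {heap[0][2]: "0"}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        tiebreak += 1
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
    codes: dict[str, str] = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):          # internal node: descend both sides
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: record the accumulated bits
            codes[node] = prefix
    walk(heap[0][2])
    return codes

codes = huffman_codes("abracadabra")
encoded = "".join(codes[c] for c in "abracadabra")
print(codes)                                 # 'a' (most frequent) gets the shortest code
print(len(encoded), "bits vs", 8 * len("abracadabra"), "bits uncoded")
```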
lossy
Discards information deemed less perceptible. Smaller files, irreversible.
- JPEG: discrete cosine transform, quantization of high-frequency components in images (see the sketch after this list)
- MP3/AAC: psychoacoustic models, discarding sounds humans perceive poorly
- H.264/H.265: video compression, motion estimation, inter-frame prediction
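A rough numpy sketch of JPEG's core step, for intuition only: a 2-D DCT of one 8x8 block followed by coarse uniform quantization, which zeroes most high-frequency coefficients. The gradient test block, the step size of 40, and the helper name dct2 are all made up; real JPEG adds per-frequency quantization tables, zig-zag ordering, and entropy coding on top.

```python
import numpy as np

def dct2(block: np.ndarray) -> np.ndarray:
    """Naive orthonormal 2-D DCT-II of a square block."""
    n = block.shape[0]
    k = np.arange(n)
    # basis[u, x] = alpha(u) * cos(pi * (2x + 1) * u / (2n))
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    alpha = np.full(n, np.sqrt(2 / n))
    alpha[0] = np.sqrt(1 / n)
    basis *= alpha[:, None]
    return basis @ block @ basis.T

rng = np.random.default_rng(0)
# a smooth gradient plus mild noise, roughly how a natural-image block behaves
block = np.add.outer(8.0 * np.arange(8), 4.0 * np.arange(8)) + rng.normal(0, 2, (8, 8))

coeffs = dct2(block)
quantized = np.round(coeffs / 40)            # coarse uniform quantization
print(np.count_nonzero(quantized), "of 64 coefficients survive quantization")
```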
theoretical limit
Shannon entropy defines the minimum average number of bits per symbol achievable by any lossless code. No algorithm can compress below this bound on average; data that is already at maximum entropy, such as random noise, cannot be compressed further.
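A small worked example of the bound, using only the Python standard library; the function name entropy_bits is an arbitrary choice.

```python
from collections import Counter
from math import log2

def entropy_bits(data: str) -> float:
    """Shannon entropy H = -sum(p * log2 p) in bits per symbol."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * log2(c / n) for c in counts.values())

msg = "abracadabra"
h = entropy_bits(msg)
print(f"{h:.3f} bits/symbol")                        # about 2.04 for this message
print(f"lossless floor ~{h * len(msg):.1f} bits vs {8 * len(msg)} bits as ASCII")
```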
applications
- storage: fitting more data on disk, databases use compression internally
- transmission: faster transfer over networks, reduced bandwidth cost
- IPFS: content-addressed storage benefits from deduplication, a form of compression (see the toy sketch after this list)
- cyber: compressed representations in the knowledge graph
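A toy sketch of content-addressed deduplication, to show why it behaves like compression at the storage layer; the in-memory store, the chunk size, and the put helper are invented for illustration, and IPFS itself uses multihash content IDs and Merkle-DAG chunking rather than this code.

```python
import hashlib

store: dict[str, bytes] = {}            # content address -> chunk

def put(chunk: bytes) -> str:
    """Store a chunk under the hash of its contents; identical chunks are stored once."""
    addr = hashlib.sha256(chunk).hexdigest()
    store.setdefault(addr, chunk)
    return addr

data = b"hello world" * 1000
# chunk size chosen as a multiple of the repeat length so duplicates line up
addrs = [put(data[i:i + 220]) for i in range(0, len(data), 220)]
print(len(addrs), "chunks referenced,", len(store), "chunks actually stored")
```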
connections
Rooted in information theory and Shannon entropy. Encryption and compression interact: well-encrypted data looks like maximum-entropy noise and so does not compress, which is why data is compressed before encryption rather than after (see the sketch below). The compressibility of data is also studied formally in complexity theory, notably Kolmogorov complexity.
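A quick standard-library check of the encryption point: random bytes stand in for ciphertext, since a good cipher's output is indistinguishable from random, and DEFLATE gains nothing on them.

```python
import os
import zlib

repetitive = b"the quick brown fox " * 500     # low entropy: highly redundant
random_like = os.urandom(len(repetitive))      # high entropy: stands in for ciphertext

print("repetitive :", len(repetitive), "->", len(zlib.compress(repetitive)), "bytes")
print("random-like:", len(random_like), "->", len(zlib.compress(random_like)), "bytes")
```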