Despite how common JPEG is, the specification leaves a lot to be discovered. The biggest hurdle is knowing which parts of the feature-rich specification are relevant and which are not. The standard describes a lot of features, some of questionable usefulness for today's applications.
There are two different ways to organize image data in a JPEG stream. Sequential JPEGs are simple, efficient and by far the most common. In progressive JPEGs, an approximation of the image can be obtained before the entire image is decoded. The feature was designed for environments with less bandwidth than is common today, but it is not extinct. Some old progressive JPEGs are still around, and some new files are encoded as progressive as well, since it typically results in slightly better compression. Even though only a small portion of all JPEGs are progressive, progressive JPEG takes significantly more effort to implement and decode than sequential JPEG.
Arithmetic coding is a large part of the standard, yet I have still not encountered it in a single JPEG file.
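Which of these variants a given file uses can be read from its start-of-frame (SOF) marker, so a test set can be surveyed without decoding any image data. The following is a minimal sketch, not the decoder described in this post: the names `SOF_KINDS` and `classify_jpeg` are made up for illustration, the marker codes are taken from the JPEG standard (ITU-T T.81, Table B.1), and the differential (hierarchical) frame types are deliberately left out.

```python
import struct

# Start-of-frame marker codes mapped to (process, entropy coder).
# The differential variants 0xC5-0xC7 and 0xCD-0xCF are omitted here.
SOF_KINDS = {
    0xC0: ("baseline sequential", "huffman"),
    0xC1: ("extended sequential", "huffman"),
    0xC2: ("progressive", "huffman"),
    0xC3: ("lossless", "huffman"),
    0xC9: ("extended sequential", "arithmetic"),
    0xCA: ("progressive", "arithmetic"),
    0xCB: ("lossless", "arithmetic"),
}

def classify_jpeg(data: bytes):
    """Return (process, entropy coder) for the first frame header, or None."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:
            # Fill byte: markers may be preceded by extra 0xFF bytes.
            pos += 1
            continue
        if marker in SOF_KINDS:
            return SOF_KINDS[marker]
        if marker in (0xDA, 0xD9):
            # Start of scan or end of image reached without a known frame header.
            return None
        # All other segments before the frame header carry a 2-byte
        # big-endian length that includes the length field itself.
        (length,) = struct.unpack_from(">H", data, pos + 2)
        pos += 2 + length
    return None
```

Running `classify_jpeg(open(path, "rb").read())` over a directory and counting the results gives a quick picture of how rare progressive and arithmetic-coded files are in a particular data set.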
To figure out what is relevant in the specification and to test my decoder, I needed a test set of valid JPEGs. Fortunately, Wikimedia Commons hosts a lot of pictures, the majority of which are JPEGs, that are mostly valid and easily accessible through the MediaWiki API. I fetched 1000 JPEGs and started decoding. This data set contained a diverse set of relevant use cases produced by a multitude of different encoders:
- big and small images
- grayscale and color
- heavily compressed and high quality
- old and new
as well as a lot of low-level JPEG features such as:
- sequential and progressive scans
- chroma subsampling ratios
- restart intervals and padding
- zero-skipping corner cases
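Several of the low-level features listed above are visible in the header segments alone. As a rough sketch (the function name `frame_features` and the shape of the returned dictionary are assumptions made for this example, not part of my decoder), the sampling factors can be read from the frame header and the restart interval from the DRI segment, following the segment layouts in ITU-T T.81:

```python
import struct

def frame_features(data: bytes) -> dict:
    """Collect a few header-level features from a JPEG byte string."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    features = {"restart_interval": 0}  # 0 means restart markers are disabled
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        (length,) = struct.unpack_from(">H", data, pos + 2)
        body = data[pos + 4 : pos + 2 + length]
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            # Frame header: precision, height, width, component count,
            # then 3 bytes per component (id, H/V nibbles, quant table).
            _precision, height, width, ncomp = struct.unpack_from(">BHHB", body, 0)
            features["size"] = (width, height)
            features["progressive"] = marker in (0xC2, 0xC6, 0xCA, 0xCE)
            features["sampling"] = [
                (body[7 + 3 * i] >> 4, body[7 + 3 * i] & 0x0F)
                for i in range(ncomp)
            ]
        elif marker == 0xDD:
            # DRI: number of MCUs between restart markers.
            (features["restart_interval"],) = struct.unpack_from(">H", body, 0)
        elif marker == 0xDA:
            break  # entropy-coded data starts after the scan header
        pos += 2 + length
    return features
```

A 4:2:0 image shows up as `(2, 2)` for the luma component and `(1, 1)` for both chroma components; a grayscale image has a single entry. Only headers before the first scan are inspected, which is enough for a survey but not for a decoder, since tables and restart intervals may be redefined between scans.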
Having a data set of relevant pictures also allowed benchmarking and eventually profiling of difficult pictures.
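For anyone who wants to assemble a similar test set, here is one possible way to ask Wikimedia Commons for random files and keep the JPEGs. This is a sketch under assumptions and not necessarily how my data set was fetched: it uses the MediaWiki Action API's `random` generator together with `prop=imageinfo`, and the helper names `build_query_url`, `jpeg_urls` and `fetch_batch`, as well as the User-Agent string, are placeholders.

```python
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"

def build_query_url(limit: int = 50) -> str:
    """Build a query for random pages in the File: namespace."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "random",
        "grnnamespace": 6,        # namespace 6 is File:
        "grnlimit": limit,
        "prop": "imageinfo",
        "iiprop": "url|mime",     # ask for the download URL and MIME type
    }
    return API + "?" + urllib.parse.urlencode(params)

def jpeg_urls(response: dict) -> list:
    """Pick the download URLs of all JPEG files out of an API response."""
    pages = response.get("query", {}).get("pages", {})
    urls = []
    for page in pages.values():
        for info in page.get("imageinfo", []):
            if info.get("mime") == "image/jpeg":
                urls.append(info["url"])
    return urls

def fetch_batch(limit: int = 50) -> list:
    """Perform one API request and return the JPEG URLs it yields."""
    request = urllib.request.Request(
        build_query_url(limit),
        headers={"User-Agent": "jpeg-test-set-sketch/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request) as resp:
        return jpeg_urls(json.load(resp))
```

Calling `fetch_batch()` repeatedly until enough distinct URLs have been collected, and then downloading each one, would produce a set of the kind described here. Random sampling is only one option; it has the advantage of not favoring any particular encoder or era.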
There are several existing JPEG data sets, but they mostly contain fuzzed images for detecting crashes in already working decoders. Most pictures in those sets are invalid JPEGs, which are not that helpful when developing a decoder and figuring out which parts of the specification are important.