So here's an interesting problem I ran into today. You have metadata in an XML file. You want to make the file self-describing and self-correcting, so you want to embed its checksum inside it. The problem is that you can't add the checksum to the XML file without changing the file's checksum!
Is there an XML verification tool not subject to this particular tail-chase? I don't know of one offhand.
A sufficiently clever CRC implementation could do the job.
You don't need the last 32 bits; *any* 32 linearly independent bits can be used to make a valid CRC. So just say 0x00000000, and then solve for the values that make the CRC work.
That would be easier if each hex character gave you 4 independent bits to fiddle with, rather than the jump between '9' and 'A', but it might still be solvable.
It's just a system of linear equations to be solved so the overall file checksum comes out to the desired value.
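To make that concrete, here's a sketch in Python of the solve-for-the-bits approach. The document layout, slot, and function names here are made up for illustration. It reserves a 32-character run of '0's as the checksum slot, measures the CRC-32 delta caused by flipping the low bit of each slot character (which toggles '0' to '1', keeping the text printable), and then solves the resulting GF(2) linear system so the whole file's CRC comes out to a chosen value:

```python
import zlib

def solve_crc_slot(data: bytes, bit_positions, target: int) -> bytes:
    """Flip a subset of the candidate bits so zlib.crc32(result) == target.

    CRC-32 is affine over GF(2), so flipping a given bit always XORs the
    same 32-bit delta into the final CRC, independent of every other bit.
    Collect one delta per candidate bit, then Gaussian-eliminate to express
    the needed correction as an XOR of some of those deltas.
    """
    base = zlib.crc32(data)
    need = base ^ target                        # correction we must produce
    deltas = []
    for pos in bit_positions:                   # effect of flipping each bit
        flipped = bytearray(data)
        flipped[pos // 8] ^= 1 << (pos % 8)
        deltas.append(zlib.crc32(bytes(flipped)) ^ base)

    basis = {}                                  # pivot bit -> (vector, mask)
    for i, d in enumerate(deltas):
        mask = 1 << i
        while d:
            pivot = d.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = (d, mask)
                break
            bd, bmask = basis[pivot]
            d, mask = d ^ bd, mask ^ bmask

    mask = 0                                    # which candidate bits to flip
    while need:
        pivot = need.bit_length() - 1
        if pivot not in basis:
            raise ValueError("candidate bits are not linearly independent")
        bd, bmask = basis[pivot]
        need, mask = need ^ bd, mask ^ bmask

    out = bytearray(data)
    for i, pos in enumerate(bit_positions):
        if mask >> i & 1:
            out[pos // 8] ^= 1 << (pos % 8)
    return bytes(out)

# Hypothetical layout: a 32-character all-'0' slot inside the document.
doc = b'<meta crc="' + b'0' * 32 + b'">important data</meta>'
slot = doc.index(b'0' * 32)
bits = [(slot + i) * 8 for i in range(32)]      # low bit of each slot byte

patched = solve_crc_slot(doc, bits, 0xDEADBEEF)
assert zlib.crc32(patched) == 0xDEADBEEF
```

Whether the 32 deltas really are independent depends on the polynomial and where the slot sits; for the standard CRC-32 polynomial and the low bits of 32 consecutive bytes the system works out, so the ValueError branch doesn't fire here.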
Put the checksum in the name of the file?
FITS files (an astronomy image/data format) can contain their own checksum. Check out http://heasarc.gsfc.nasa.gov/docs/heasarc/fits/checksum23may02/
It all depends on exactly how strong the checks need to be and what they're for. If you really want self-correcting, then a mere crc32, or even a 2048-bit RSA signature, is not going to do the trick. I forget the exact relationship, but you will need roughly 3 or 4 check bits paralleling every single character in the file to get enough redundancy to correct single-bit errors, plus longitudinal crc32 protection. OTOH, if you only need to verify that the document has not been changed, then just a crc32 could fit the bill. Maybe a crc16 would too, but a simple additive checksum will not - not enough redundancy, too simple a computation.
In the crc world, the typical algorithm is to run the original document, without the final CRC field, through the crc generator and then append the result. Checking means running the whole thing, including the appended CRC, through the same engine and expecting a particular constant at the end. I would have to do the research to determine the numbers, but this was the method for the IBM standard used in SNA, SDLC, HDLC, and X.25 - crc16(16,15,2,1) (the parens indicate the XOR feedback taps in the CRC generation shift register). 16-bit crc was generally considered OK only up to a few KB of message. Today's documents would probably need at least crc32. Encoding is an issue - the crc value needs to be in raw binary, not hex or ASCII encoding.
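The append-and-check scheme is easy to demonstrate. Here's a sketch using the reflected form of that same IBM polynomial (the CRC-16/ARC variant: poly 0x8005, which is 0xA001 bit-reversed): append the CRC in raw binary, low byte first, and re-running the generator over message-plus-CRC comes out to zero.

```python
def crc16_arc(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-16 with the IBM polynomial x^16 + x^15 + x^2 + 1,
    reflected form, so the feedback constant is 0xA001."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc

message = b'Some document worth protecting'
crc = crc16_arc(message)

# Append the CRC in raw binary, low byte first, to match the reflected engine.
framed = message + crc.to_bytes(2, 'little')

# Running the whole frame through the same engine must yield zero.
assert crc16_arc(framed) == 0
```

The "particular number at the end" for this variant is simply 0; other CRC variants with a final XOR expect a different fixed residue.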
Today the preferred method is to use RSA signing, typically via gpg (or pgp), for change detection and origin verification. 2048 bits vs 32 makes for a pretty reliable test. Unlike crc engines, which are low-level data comms tools and usually implemented in IO hardware, gpg signatures are available as command-line tools, which also makes them scriptable.
gpg --sign document
gpg --verify document.gpg
A checksum won't help in correcting errors. For that you need a dedicated error-correcting code. The simplest is Hamming. The problem is that the application that uses the file must be aware of the code.
If you only need to detect errors, you still need to define if the threat is accidental or intentional modifications. CRC can detect accidental errors, but it is easy to circumvent. To protect against forgeries you need a cryptographic hash, e.g. SHA.
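As a tiny sketch of the distinction (document contents made up for illustration): a CRC can be made to match by construction, as shown earlier in the thread, while any tamper moves a cryptographic digest and no practical forgery can put it back.

```python
import hashlib

doc = b'<payment amount="100" payee="alice"/>'
digest = hashlib.sha256(doc).hexdigest()   # published alongside the file

# Verification: recompute and compare.
assert hashlib.sha256(doc).hexdigest() == digest

# Any modification, however small, changes the digest.
tampered = doc.replace(b'alice', b'mallory')
assert hashlib.sha256(tampered).hexdigest() != digest
```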
You need to embed both the checksum and the negation of the checksum in the data. These two then sum to zero, and consequently embedding them does not affect the (additive) checksum of the file!
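That cancellation trick really does work for additive checksums; it's essentially what the FITS scheme linked above relies on, though FITS proper uses ones'-complement arithmetic and an ASCII-encoded field. Here's a sketch with a plain mod-2^32 word sum and a made-up two-slot layout:

```python
import struct

MASK = 0xFFFFFFFF

def sum32(data: bytes) -> int:
    """Sum of big-endian 32-bit words, mod 2**32 (zero-padded to a word)."""
    data += b'\0' * (-len(data) % 4)
    return sum(struct.unpack('>%dI' % (len(data) // 4), data)) & MASK

payload = b'<metadata>whatever</metadata>'   # hypothetical file body
body = b'\0' * 8 + payload                   # two zeroed, word-aligned 4-byte slots
checksum = sum32(body)                       # checksum with the slots zeroed

# Slot 1: the checksum.  Slot 2: its negation.  They cancel mod 2**32.
stamped = struct.pack('>II', checksum, -checksum & MASK) + payload

# So the whole-file sum equals the embedded checksum.
assert sum32(stamped) == checksum
```

Note the slots must be word-aligned for the cancellation to survive the word-wise sum; this trick does not carry over to CRCs, where you'd solve the linear system described earlier instead.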
Thanks for the comments and suggestions! Just to clarify, I'm aware that a checksum without code to actually check it is pointless -- what I'm after here is a way to embed a file's digital fingerprint inside the file, so that anyone who looks at the file can do a basic integrity check.
I'm definitely taking a look at FITS. Thanks for the tip!