Here is a simple syntactic scheme for encoding external object identifiers. An external object identifier is an internal (or nominal) object identifier that has been packaged so as to achieve some additional properties, namely:
An external object identifier has the general syntax
domain-syntax-internal-checksum
where
domain::=string-safe-charset
syntax::=string-safe-charset
internal::=string-safe-charset
checksum::=string-safe-charset-no-dot
string-safe-charset::= [A-Za-z0-9_.]+
string-safe-charset-no-dot::= [A-Za-z0-9_]+
The above is a general syntactic scheme and should be taken as a recommendation. Variations of the scheme are allowed, subject to the rules described next.
domain identifies the domain or namespace of the identifier. It also indicates the syntax and interpretation of the remainder of the identifier.
syntax indicates the syntax of the identifier among the syntactic forms supported by domain. domain and syntax together indicate the syntax and interpretation of the remainder of the identifier.
internal is the internal (or nominal) identifier.
checksum is a checksum of the "domain
-syntax-internal" portion of the identifier. The checksum algorithm is determined by domain and syntax.
The absolute minimal requirement of this scheme is that external
identifiers be prefixed with a domain name (a string of one or more
printable characters, not including dashes) followed by a dash
('-').
The following external identifier scheme is proposed for the ADL
Gazetteer. The domain is "adlgaz", the syntax is
"1", an internal identifier is a positive integer written
in decimal notation, and the checksum is computed on the
"adlgaz-1-internal" portion of the identifier
using the Modulus 131 algorithm (described below) and written as two hexadecimal digits.
For example:
internal external 123adlgaz-1-123-2412adlgaz-1-12-4f132adlgaz-1-132-23999adlgaz-1-999-03
Try out other identifiers with this ID wrapper.
The Modulus 131 checksum algorithm computes a single check byte from a sequence of zero or more bytes. To compute it, simply multiply each byte by its ordinal position in the sequence (1, 2, 3, etc.) and sum modulo 131. If the checksum is to be inserted into a printable string, it is recommended that the checksum be written as two hexadecimal digits. For example, in C++:
#include <stdio.h>
#include <string.h>
int modulus_131_sum (const char* id) {
const char* c = id;
int i = 1;
int sum = 0;
while (*c != '\0') sum += *c++ * i++;
return sum%131;
}
char* append_checksum (const char* id) {
char* s = new char[strlen(id)+4];
sprintf(s, "%s-%02x", id, modulus_131_sum(id));
return s;
}
If some reasonable assumptions are made about the domain of sequences (specifically, that sequences consist of fewer than 131 printable ASCII characters), then the algorithm detects all single-character and transposition errors.
Source notes: this algorithm is a variation on the Modulus 11 algorithm used in ISBNs. For more information and background theory, see: Joseph A. Gallian, Error Detection Methods, ACM Computing Surveys, vol. 28, no. 3 (September 1996), pp. 504-517.
Another checksum algorithm is the Luhn checksum algorithm, which computes a single check digit from a sequence of zero or more decimal digits. The algorithm is used by all major credit cards. It detects most (but not all) common data entry errors.
The algorithm for a sequence with an odd number of digits is as follows. Double every odd-numbered digit (1st digit, 3rd digit, etc.) and subtract 9 if the product is greater than 9. Sum all the even-numbered digits (2nd digit, 4th digit, etc.) as well as the doubled odd-numbered digits. The checksum is the single digit that must be added to the sum to make the sum a multiple of 10. For a sequence with an even number of digits, perform the same summation but double the even-numbered digits instead. (If you think of the checksum as being appended to the sequence, the set of digits to double is the set that does not include the checksum digit.)
Here are some C++ functions that compute and verify Luhn checksums:
#include <string.h>
int luhn_sum (const char* id) {
bool toggle = (strlen(id)%2 == 1);
int sum = 0;
for (const char* c = id; *c != '\0'; ++c) {
int digit = *c - '0';
if (toggle) {
digit *= 2;
if (digit > 9) digit -= 9;
}
sum += digit;
toggle = !toggle;
}
return sum;
}
char compute_checksum (const char* id) {
return '0' + (10 - luhn_sum(id)%10)%10;
}
bool verify_checksum (const char* id, char checksum) {
return ((luhn_sum(id) + (checksum-'0'))%10 == 0);
}
Source notes: this algorithm is described (not surprisingly, given its relation to credit cards) in several hacker publications, e.g., Phrack, vol. 6, no. 47 (April 15, 1995), file 8, FAQ question F02. The original source appears to be U.S. Patent 2,950,048, "Computer for Verifying Numbers," granted to H.P. Luhn. Having been filed in 1954, it contains the usual hilarious mechanical embodiment of the algorithm ("...a spring pawl disposed adjacent each of said plurality of pulleys has its free end arranged to engage in the transverse notches in..."). The algorithm is also described in ANSI X4.13, "American National Standard for Financial Services - Financial Transaction Cards," and in ISO/IEC 7812-1:2000, "Identification Cards -- Identification of issuers -- Part 1: Numbering system."
created 2002-12-09; last modified 2009-01-13 08:45