Monday, June 18, 2007

Data Improvement - Addresses

I titled this blog specifically as 'Data Improvement' instead of 'Data Assurance' or 'Data Quality'. The reason is quite simply because unless you have deterministic data coming in, you can not be assured what may be passed as data. Deterministic = there is a fixed number of values that will be accepted.

Addresses data

deterministic
A deterministic field from an address is the US State 2-digit code field. There are only 50 deterministic values acceptable, all others are rejected. These values can be cross-checked with the 5-digit zipcode (do not need a full 9-digit for State crosschecks) to ensure both the zipcode and the State code are in-sync. I like deterministic, easy to work with.

non-deterministic
A non-deterministic field is the actual address line. Attempts to improve the data on the address line include seperating the STREET physical address line and the postal MAILING address line. But, just because there are two seperate fields doesn't mean the data will be in the right place...usually when you are asking for address information, it is from a human being and human nature will kick in.

improve non-deterministic data - standards/specifications
So what can you do about these address lines? For the most part, nothing - what you get passed as data is what you have to work with. However, if you have a specific intent where you need address information to be relatively accurate, you can do something. First, determine your intent:
  • Accurate Mailing Address
  • Bulk Mailing discounts with POSTNET/barcode/zipcode sorts.
  • Seperation between Street address for carrier shipment vs passing a mailing address.
  • individual person identification from different data sources (i.e. john smith at 1 west rd vs 34 baltic ave).
USPS Publication 28/CASS software
If you are 99% working with United States addresses and are concerned with address accuracy for actual mailings/shipments, look at some type of official CASS software. http://www.usps.com/ncsc/addressservices/certprograms/cass.htm

However, if you are trying to improve the data for the last option - individuality - and can not afford utilizing CASS software for this feature (which, btw, I highly recommend you do get CASS anyway because you can also enhance it with Address Change information), you can follow what is called 'USPS Publication 28' to standardize how the addresses look. This will not make your data foolproof by any means, but should greatly assist. Example is better:
  • 1 West Road vs 1 WEST RD; 1 West River Road vs 1 W RIVER RD
  • P.O. Box vs PO BOX vs POBOX vs P.O.BOX
  • APARTMENT # 4, APT #4, APT 4, APARTMENT 4
My first attempt at following USPS Publication 28 in java has met with some success. I happen to code this originally as its own class, but adapted it to process the specificaiton rules through the Pentaho Data Integration (Kettle) product as a static method call in their javascript step; pushed over 3000 records/sec on my desktop which is sufficient for my intial needs.
  1. 400k distinct raw address lines.
  2. Java-based convertor for USPS Pub 28 specification.
  3. 345k processed distinct address lines.
> 13% data improvement
By simply modifying the data to follow the specification, essentially 'corrected' more than 50k entries in the sample (13.75%). Now that is savings!

1 comment:

Anonymous said...

I hate to see this recommendation of trying to roll your own address quality solution. With address validation rates around 50 cents per thousand including NCOALink processing, it seems silly to spend so much development time and still get the wrong answer. If you have a charity, startup with potential future revenues, or other good reason, DuoShare will provide free processing in exchange for web links and blog kudos. Then you can have USPS certified addressing and fully secured SOAP-based validation at very affordable prices, or free if you qualify with no upfront investment....pay as you go.