Addresses data
deterministic
A deterministic field from an address is the US State 2-digit code field. There are only 50 deterministic values acceptable, all others are rejected. These values can be cross-checked with the 5-digit zipcode (do not need a full 9-digit for State crosschecks) to ensure both the zipcode and the State code are in-sync. I like deterministic, easy to work with.
non-deterministic
A non-deterministic field is the actual address line. Attempts to improve the data on the address line include seperating the STREET physical address line and the postal MAILING address line. But, just because there are two seperate fields doesn't mean the data will be in the right place...usually when you are asking for address information, it is from a human being and human nature will kick in.
improve non-deterministic data - standards/specifications
So what can you do about these address lines? For the most part, nothing - what you get passed as data is what you have to work with. However, if you have a specific intent where you need address information to be relatively accurate, you can do something. First, determine your intent:
- Accurate Mailing Address
- Bulk Mailing discounts with POSTNET/barcode/zipcode sorts.
- Seperation between Street address for carrier shipment vs passing a mailing address.
- individual person identification from different data sources (i.e. john smith at 1 west rd vs 34 baltic ave).
If you are 99% working with United States addresses and are concerned with address accuracy for actual mailings/shipments, look at some type of official CASS software. http://www.usps.com/ncsc/addressservices/certprograms/cass.htm
However, if you are trying to improve the data for the last option - individuality - and can not afford utilizing CASS software for this feature (which, btw, I highly recommend you do get CASS anyway because you can also enhance it with Address Change information), you can follow what is called 'USPS Publication 28' to standardize how the addresses look. This will not make your data foolproof by any means, but should greatly assist. Example is better:
- 1 West Road vs 1 WEST RD; 1 West River Road vs 1 W RIVER RD
- P.O. Box vs PO BOX vs POBOX vs P.O.BOX
- APARTMENT # 4, APT #4, APT 4, APARTMENT 4
- 400k distinct raw address lines.
- Java-based convertor for USPS Pub 28 specification.
- 345k processed distinct address lines.
By simply modifying the data to follow the specification, essentially 'corrected' more than 50k entries in the sample (13.75%). Now that is savings!
1 comment:
I hate to see this recommendation of trying to roll your own address quality solution. With address validation rates around 50 cents per thousand including NCOALink processing, it seems silly to spend so much development time and still get the wrong answer. If you have a charity, startup with potential future revenues, or other good reason, DuoShare will provide free processing in exchange for web links and blog kudos. Then you can have USPS certified addressing and fully secured SOAP-based validation at very affordable prices, or free if you qualify with no upfront investment....pay as you go.
Post a Comment