A deterministic field from an address is the US state two-letter code field. There are only 50 acceptable values (a few more if you also accept DC, territories, and military codes); all others are rejected. These values can be cross-checked against the 5-digit ZIP code (the full 9-digit ZIP+4 is not needed for state cross-checks) to ensure the ZIP code and the state code are in sync. I like deterministic: easy to work with.
A non-deterministic field is the actual address line. Attempts to improve the data on the address line include separating the STREET physical address line from the postal MAILING address line. But just because there are two separate fields doesn't mean the data will be in the right place... address information usually comes from a human being, and human nature will kick in.
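To make the state/ZIP cross-check concrete, here is a minimal sketch. The names `US_STATES`, `ZIP3_RANGES`, `is_valid_state` and `state_matches_zip` are my own, and the ZIP prefix table is deliberately tiny and approximate: it is an assumption for illustration, and a real system would load the full table from an authoritative source.

```python
import re

# The 50 two-letter state codes. Whether to also accept DC, territories
# or military codes is a business decision, not shown here.
US_STATES = frozenset("""
AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD
MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC
SD TN TX UT VT VA WA WV WI WY
""".split())

# Illustrative, incomplete map of state -> 3-digit ZIP prefix ranges.
# Assumption: only the main ranges for two states are listed here.
ZIP3_RANGES = {
    "CA": [(900, 961)],
    "NY": [(100, 149)],
}

# Accepts 5-digit ZIP or ZIP+4; only the first five digits are captured.
ZIP_RE = re.compile(r"(\d{5})(?:-\d{4})?")


def is_valid_state(code: str) -> bool:
    """True if code is one of the 50 state codes (case-insensitive)."""
    return code.strip().upper() in US_STATES


def state_matches_zip(state: str, zipcode: str):
    """Return True/False when the check can be made, None when the
    sample table has no ranges for an otherwise valid state."""
    state = state.strip().upper()
    match = ZIP_RE.fullmatch(zipcode.strip())
    if not is_valid_state(state) or match is None:
        return False
    ranges = ZIP3_RANGES.get(state)
    if ranges is None:
        return None
    prefix = int(match.group(1)[:3])
    return any(lo <= prefix <= hi for lo, hi in ranges)
```

For example, `state_matches_zip("CA", "90210")` passes while `state_matches_zip("NY", "90210")` fails, and only the first three digits of the ZIP are consulted.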
Improving non-deterministic data: standards/specifications
So what can you do about these address lines? For the most part, nothing: the data you are passed is the data you have to work with. However, if you have a specific intent that needs address information to be relatively accurate, you can do something. First, determine your intent:
- Accurate Mailing Address
- Bulk Mailing discounts with POSTNET/barcode/zipcode sorts.
- Separation between a Street address for carrier shipment vs a mailing address.
- Individual person identification across different data sources (e.g. john smith at 1 west rd vs 34 baltic ave).
If you work almost exclusively (say 99%) with United States addresses and care about address accuracy for actual mailings/shipments, look at some type of official CASS software. http://www.usps.com/ncsc/addressservices/certprograms/cass.htm
However, if you are trying to improve the data for the last option - individuality - and cannot afford CASS software for this feature (which, btw, I highly recommend you get anyway, because you can also enhance it with Address Change information), you can follow what is called 'USPS Publication 28' to standardize how the addresses look. This will not make your data foolproof by any means, but it should greatly assist. An example says it better:
- 1 West Road vs 1 WEST RD; 1 West River Road vs 1 W RIVER RD
- P.O. Box vs PO BOX vs POBOX vs P.O.BOX
- APARTMENT # 4, APT #4, APT 4, APARTMENT 4
And here is what that looked like on a sample:
- 400k distinct raw address lines.
- Java-based converter for the USPS Pub 28 specification.
- 345k distinct address lines after processing.
Simply modifying the data to follow the specification essentially 'corrected' more than 50k entries in the sample (55k of 400k, or 13.75%). Now that is savings!
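The kind of standardization described above can be sketched in a few lines. This is not the Java converter mentioned in the sample, and it is nowhere near a full Pub 28 implementation: the abbreviation tables are small illustrative subsets, the function name `normalize_address_line` is hypothetical, and the rule for when to abbreviate a directional is my own simplification (a direction is left alone when it appears to be the street name itself, as in "1 West Road").

```python
import re

# Small illustrative subsets; Pub 28 lists far more entries than these.
SUFFIXES = {"ROAD": "RD", "STREET": "ST", "AVENUE": "AVE", "DRIVE": "DR", "LANE": "LN"}
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}
UNITS = {"APARTMENT": "APT", "SUITE": "STE"}

# Matches PO BOX, POBOX, P O BOX (periods are already removed by then).
PO_BOX_RE = re.compile(r"^P\s*O\s*BOX\s*(\w+)$")


def _is_suffix(token: str) -> bool:
    """True for a street suffix in either long or abbreviated form."""
    return token in SUFFIXES or token in SUFFIXES.values()


def normalize_address_line(line: str) -> str:
    # Uppercase, drop periods, turn other punctuation (including '#') into spaces.
    text = line.upper().replace(".", "")
    text = re.sub(r"[^A-Z0-9 ]", " ", text)
    tokens = text.split()

    po_box = PO_BOX_RE.match(" ".join(tokens))
    if po_box:
        return "PO BOX " + po_box.group(1)

    out = []
    for i, tok in enumerate(tokens):
        prev_tok = tokens[i - 1] if i > 0 else None
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in DIRECTIONS:
            # Abbreviate a leading direction only when a street name follows,
            # or a trailing direction when it comes right after the suffix.
            pre = next_tok is not None and not _is_suffix(next_tok)
            post = prev_tok is not None and _is_suffix(prev_tok)
            out.append(DIRECTIONS[tok] if (pre or post) else tok)
        elif tok in SUFFIXES:
            out.append(SUFFIXES[tok])
        elif tok in UNITS:
            out.append(UNITS[tok])
        else:
            out.append(tok)
    return " ".join(out)


# Made-up sample lines (not the 400k data set) to show the distinct count dropping.
raw_lines = [
    "1 West Road", "1 WEST RD",
    "1 West River Road", "1 W RIVER RD",
    "P.O. Box 12", "POBOX 12",
    "APT #4", "APARTMENT 4",
]
distinct_raw = len(set(raw_lines))
distinct_normalized = len({normalize_address_line(a) for a in raw_lines})
print(distinct_raw, distinct_normalized)
```

The point of comparing distinct counts before and after is the same as in the sample: every pair of raw lines that collapses to one normalized line is a duplicate you would otherwise have treated as two different addresses.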