Wednesday, June 20, 2007

Carnal Knowledge API

Quite the title, eh?

This post is about APIs, services, or interfaces that are obscure and require 'internal' knowledge to use successfully. What do I mean?

Object result = doIt(object1);

There are two specific scenarios that I think about for obscure APIs/services:
*Carnal Modification
*Carnal Returns

Carnal Modification
This happens only in APIs where the language allows passing references and the objects passed are mutable.

System.out.println(bean1.getValue()); //prints "default"
modifyJavabeanValue(bean1);
System.out.println(bean1.getValue()); //prints "modified"

By simply calling a method, the objects you passed to it have changed. This may not be an expected result, and you have to know that it is the intent of the API... i.e., you have to have carnal knowledge about it. And do not be fooled if the method has a return type: it can still modify the object behind the reference!
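A minimal sketch of the difference, assuming a hypothetical ValueBean class (the bean, the demo class, and the withModifiedValue alternative are made up for illustration and are not from any real API). The first method changes the caller's object in place; the second leaves the argument alone and returns a new object, which makes the change visible at the call site.

```java
// Hypothetical bean used only for illustration.
final class ValueBean {
    private String value = "default";
    String getValue() { return value; }
    void setValue(String value) { this.value = value; }
}

final class CarnalModificationDemo {
    // Hidden side effect: the caller's object is changed in place.
    static void modifyJavabeanValue(ValueBean bean) {
        bean.setValue("modified");
    }

    // More explicit alternative: the argument is untouched, a new bean is returned.
    static ValueBean withModifiedValue(ValueBean bean) {
        ValueBean copy = new ValueBean();
        copy.setValue("modified");
        return copy;
    }

    public static void main(String[] args) {
        ValueBean bean1 = new ValueBean();
        ValueBean bean2 = withModifiedValue(bean1);
        System.out.println(bean1.getValue()); // prints "default"
        System.out.println(bean2.getValue()); // prints "modified"
        modifyJavabeanValue(bean1);
        System.out.println(bean1.getValue()); // prints "modified"
    }
}
```

If in-place modification really is the intent, saying so in the method name and documentation at least removes the surprise.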

Carnal Returns
Carnal returns require significant prior knowledge of how to handle the return value.

Object o = getMyStuff();

In the above example, you have no idea what is supposed to be returned, and even worse, it may return one of, say, five different types of objects that do not have common interfaces. You can check/reflect (depending on the language) what the actual object type is, but you still have to know in advance which types to expect. Horrible!!!

String result = changeThis(rawdata);

This example is almost as bad - the returned String content may be something unexpected: i.e., it could be XML, could be a comma-delimited string, could be raw java/perl/php code that you are expected to run. This can be alleviated easily with documentation AND by specifying the expected result in the method signature:

String result = changeThisToXML(rawdata); //returns XML
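One way to go a step further, sketched here under assumptions: put the expected format in the return type as well as in the method name. XmlText and CarnalReturnDemo are hypothetical names invented for this example, and the "conversion" is only a stand-in that wraps escaped text in a single element.

```java
// Hypothetical wrapper type: its only job is to say "this String is XML".
final class XmlText {
    private final String content;
    XmlText(String content) { this.content = content; }
    String asString() { return content; }
}

final class CarnalReturnDemo {
    // The return type, not just the name, tells the caller what comes back.
    static XmlText changeThisToXml(String rawdata) {
        // Escape '&' first so the entities added afterwards are not escaped twice.
        String escaped = rawdata
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
        return new XmlText("<data>" + escaped + "</data>");
    }

    public static void main(String[] args) {
        XmlText result = changeThisToXml("a < b");
        System.out.println(result.asString()); // prints <data>a &lt; b</data>
    }
}
```

The caller can no longer mistake the result for a comma-delimited string or for code to run, because the compiler carries that knowledge instead of the caller's memory.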

I'm just trying to share some awareness: just because you found a neat/cool way to pull something off, other people (or you, using someone else's API) may run into obscure or unexpected results related to Carnal Knowledge requirements. There are indeed times when you can only do it a certain way; just remember to document and to name your method signatures as clearly as possible -- you never know, 5 years later you might have to use your own API/Service!

NEW: I recently learned that, surprisingly, there is functionality when writing Stored Procedures to *change* the fields in the resultset based on parameters passed in...and that people do this!! Exact same problem.

Monday, June 18, 2007

Data Improvement - Addresses

I titled this blog specifically as 'Data Improvement' instead of 'Data Assurance' or 'Data Quality'. The reason is quite simple: unless you have deterministic data coming in, you cannot be assured of what may be passed as data. Deterministic = there is a fixed number of values that will be accepted.

Addresses data

A deterministic field from an address is the US State 2-letter code field. There are only 50 acceptable values (a few more if you also accept DC, territories, and military codes); all others are rejected. These values can be cross-checked with the 5-digit zipcode (you do not need the full 9 digits for State crosschecks) to ensure the zipcode and the State code are in sync. I like deterministic: easy to work with.
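A rough sketch of what such a check could look like, under assumptions: the class name, the Result enum, and the ZIP lookup are all made up for illustration. In particular, the ZIP table holds only three sample prefixes (first three digits); a real check would load a complete, current table from postal data.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class StateZipCheck {
    enum Result { MATCH, MISMATCH, UNKNOWN, INVALID }

    // The 50 state codes: a closed, deterministic set.
    private static final Set<String> STATES = Set.of(
            "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
            "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
            "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
            "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
            "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY");

    // ASSUMPTION: tiny illustrative sample keyed by the first three ZIP digits.
    private static final Map<String, String> ZIP3_TO_STATE =
            Map.of("100", "NY", "606", "IL", "902", "CA");

    static boolean isValidState(String code) {
        return code != null && STATES.contains(code.trim().toUpperCase(Locale.ROOT));
    }

    static Result crossCheck(String zip5, String state) {
        if (zip5 == null || !zip5.matches("\\d{5}") || !isValidState(state)) {
            return Result.INVALID;
        }
        String expected = ZIP3_TO_STATE.get(zip5.substring(0, 3));
        if (expected == null) {
            return Result.UNKNOWN; // prefix not in the sample table
        }
        return expected.equals(state.trim().toUpperCase(Locale.ROOT))
                ? Result.MATCH : Result.MISMATCH;
    }

    public static void main(String[] args) {
        System.out.println(crossCheck("10001", "NY")); // prints MATCH
        System.out.println(crossCheck("90210", "NY")); // prints MISMATCH
        System.out.println(crossCheck("10001", "XX")); // prints INVALID
    }
}
```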

A non-deterministic field is the actual address line. Attempts to improve the data on the address line include separating the STREET physical address line and the postal MAILING address line. But just because there are two separate fields doesn't mean the data will be in the right place... usually when you are asking for address information, it is from a human being, and human nature will kick in.

Improve non-deterministic data - standards/specifications
So what can you do about these address lines? For the most part, nothing - what you get passed as data is what you have to work with. However, if you have a specific intent where you need address information to be relatively accurate, you can do something. First, determine your intent:
  • Accurate Mailing Address
  • Bulk Mailing discounts with POSTNET/barcode/zipcode sorts.
  • Separation between Street address for carrier shipment vs passing a mailing address.
  • Individual person identification from different data sources (i.e. john smith at 1 west rd vs 34 baltic ave).
USPS Publication 28/CASS software
If you are 99% working with United States addresses and are concerned with address accuracy for actual mailings/shipments, look at some type of official CASS software.

However, if you are trying to improve the data for the last option - individuality - and cannot afford CASS software for this feature (which, btw, I highly recommend you get anyway, because you can also enhance it with Address Change information), you can follow what is called 'USPS Publication 28' to standardize how the addresses look. This will not make your data foolproof by any means, but it should greatly assist. An example is better:
  • 1 West Road vs 1 WEST RD; 1 West River Road vs 1 W RIVER RD
  • P.O. Box vs PO BOX vs POBOX vs P.O.BOX
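To give a feel for this kind of standardization, here is a minimal sketch that handles only the cases in the examples above. It is not the converter described in this post and it covers a tiny fraction of Publication 28: the class name, the short abbreviation tables, and the "abbreviate a leading directional only when a street name follows it" rule are simplifying assumptions.

```java
import java.util.Locale;
import java.util.Map;

final class AddressLineNormalizer {
    // ASSUMPTION: small sample tables; the real specification lists many more entries.
    private static final Map<String, String> SUFFIXES = Map.of(
            "ROAD", "RD", "STREET", "ST", "AVENUE", "AVE", "DRIVE", "DR", "LANE", "LN");
    private static final Map<String, String> DIRECTIONALS = Map.of(
            "NORTH", "N", "SOUTH", "S", "EAST", "E", "WEST", "W");

    static String normalize(String raw) {
        // Uppercase, drop periods, turn commas into spaces, collapse whitespace.
        String s = raw.toUpperCase(Locale.ROOT)
                .replace(".", "")
                .replace(",", " ")
                .trim()
                .replaceAll("\\s+", " ");

        // Fold "POBOX", "P O BOX" and similar into "PO BOX" when a number (or nothing) follows.
        String box = s.replaceFirst("^P ?O ?BOX ?(?=\\d|$)", "PO BOX ").trim();
        if (box.startsWith("PO BOX")) {
            return box;
        }

        String[] t = s.split(" ");
        int last = t.length - 1;
        t[last] = SUFFIXES.getOrDefault(t[last], t[last]); // ROAD -> RD
        // "1 WEST RD" keeps WEST as the street name; "1 WEST RIVER RD" becomes "1 W RIVER RD".
        if (t.length >= 4) {
            t[1] = DIRECTIONALS.getOrDefault(t[1], t[1]);
        }
        return String.join(" ", t);
    }

    public static void main(String[] args) {
        System.out.println(normalize("1 West Road"));       // prints 1 WEST RD
        System.out.println(normalize("1 West River Road")); // prints 1 W RIVER RD
        System.out.println(normalize("P.O.BOX 12"));        // prints PO BOX 12
    }
}
```

Once every variant collapses to the same string, a plain distinct count over the normalized lines shows how many raw entries were really the same address.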
My first attempt at following USPS Publication 28 in Java has met with some success. I originally coded this as its own class, but adapted it to process the specification rules through the Pentaho Data Integration (Kettle) product as a static method call in their JavaScript step; it pushed over 3000 records/sec on my desktop, which is sufficient for my initial needs.
  1. 400k distinct raw address lines.
  2. Java-based converter for USPS Pub 28 specification.
  3. 345k processed distinct address lines.
> 13% data improvement
By simply modifying the data to follow the specification, I essentially 'corrected' more than 50k entries in the sample (55k of 400k, or 13.75%). Now that is savings!