Wednesday, November 28, 2007

EJB3 Seam GWT

I just finished a prototype to test EJB3, Seam 2.0.0.GA with GWT-Remoting, and GWT 1.4.60.

It was, for the most part, a success!

First, before people get excited, what does NOT work:
What has been proven to work:
  • Stateless Session Beans simply exposed for GWT web application consumption, with regular types (String, Integer, Long, Float) as both return values and parameters.
  • SLSBs with DTOs exposed for GWT web application consumption.
  • A GWT sample web application successfully presenting the results of service calls to the associated SLSBs exposed through Seam-GWT remoting.
Caveat - I am no GWT expert by any means. This project is simply a prototype to prove the integration points between existing EJB3 applications and changing the UI presentation layer to GWT in the least-difficult manner.
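For the curious, the wiring follows the stock Seam-GWT remoting pattern: a GWT service interface on the client side, and a Seam component whose component name matches the interface's fully-qualified name, with the callable methods marked @WebRemote. A minimal sketch with illustrative package/class names (the stock Seam example uses a POJO component; in my prototype the same shape sat behind a stateless session bean, with the EJB wiring elided here):

// --- client side: GWT service interface (its fully-qualified name matters below) ---
package org.example.client;

import com.google.gwt.user.client.rpc.RemoteService;

public interface MyService extends RemoteService {
    String askIt(String question);
}

// --- server side: Seam component; the component name must match the
// fully-qualified name of the GWT service interface so Seam can route the call ---
package org.example.server;

import org.jboss.seam.annotations.Name;
import org.jboss.seam.annotations.remoting.WebRemote;

@Name("org.example.client.MyService")
public class MyServiceBean implements org.example.client.MyService {

    @WebRemote // exposes this method to the GWT client via Seam remoting
    public String askIt(String question) {
        return "You asked: " + question;
    }
}

Any DTOs passed across this boundary must implement GWT's IsSerializable - which is exactly the readme note below about modifying the Model.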

HELP: How do I share my project through blogger.com? For now, you can click on the JIRA link and get the uploaded file there - read the readme.txt and modify the Model for IsSerializable.

Thursday, August 23, 2007

Digital Preservation - PST outlook files

I will probably have a big rant about Digital Preservation some day, but today just about personal files - the Outlook PST files.

First - why Outlook PST files, why not KMail/Thunderbird/Netscape/Sun/Whatever mail files? Well, quite frankly, you will be hard-pressed to find a business in the United States that does not use Outlook. It is one of those necessary evils, as there is still no good open source PIM (i.e. email, contacts, AND calendar) desktop tool that is DEPLOYABLE in a corporate environment.

Second - this is more a memo to myself for when I have time; I have not fully gone down this path yet.

The primary concern I had with PST files was, well, they are proprietary. I want to fix that, and would prefer to be able to re-organize the many, many PST files and related e-mail entries I have (including, I'm sure, many duplicate email entries in different PST files).

-- Change from PST to something not PST.
* http://alioth.debian.org/projects/libpst/ - GPL, in C
* http://xena.sourceforge.net/index.html - GPL, in Java (still active, ODF conversions)

[XML output] As you can see by the sidenote, I'm leaning towards the ill-named Xena project, as it is 1) still active, 2) in Java, and 3) *may* be able to export to ODF. I say may, because the site doesn't say so specifically regarding e-mail.

[mbox] The other, libpst, will convert the PST into a unix-style mbox format.

[maildir] Maildir would have been my preference, with ODF a very close second. However, the only open source maildir export I could find was http://www.howtoforge.com/converting_outlook_pst_to_maildir, and it requires PUTTING THE MAIL BACK ON THE SERVER TO RE-READ IT BACK THROUGH IMAP. Very cool for going-forward projects, not so much as a library to do a simple conversion. And any sysadmin would have a fit if I put 5 GB of PST files back onto an Exchange server.

Managing -
Absolutely nothing yet. Ideas: ordering, removing duplicates, partitioning for CD/DVD media, a simple stand-alone client that can be put on a CD/DVD to read the archived e-mails, etc. (a de-duplication sketch follows below).
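On the duplicates point: once everything is in mbox format (via libpst), de-duplication is mostly a matter of keying on the Message-ID header. A naive sketch, assuming one well-formed mbox file per run (it ignores ">From " escaping and folded headers, and all names here are mine):

import java.io.*;
import java.util.*;

// Naive mbox de-duplicator: copies the input mbox to the output, keeping only
// the first occurrence of each Message-ID. Messages without one are kept.
public class MboxDedup {

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(args[1])));
        Set<String> seen = new HashSet<String>();
        List<String> message = new ArrayList<String>();
        String line;
        while ((line = in.readLine()) != null) {
            // In mbox format, a line starting with "From " begins a new message.
            if (line.startsWith("From ") && !message.isEmpty()) {
                writeIfNew(message, seen, out);
                message.clear();
            }
            message.add(line);
        }
        writeIfNew(message, seen, out);
        in.close();
        out.close();
    }

    // Find the Message-ID in the header block (everything up to the first blank line).
    private static void writeIfNew(List<String> msg, Set<String> seen, PrintWriter out) {
        if (msg.isEmpty()) return;
        String id = null;
        for (String l : msg) {
            if (l.length() == 0) break; // end of headers
            if (l.toLowerCase().startsWith("message-id:")) {
                id = l.substring("message-id:".length()).trim();
                break;
            }
        }
        if (id == null || seen.add(id)) {
            for (String l : msg) out.println(l);
        }
    }
}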

Please comment if you have found an open-source approach to solving PST archival and management.

Monday, August 13, 2007

Humans and Content - Information

This post is based on reading an article on the increase in time people spend reading content. The article link is here: http://news.yahoo.com/s/nm/20070813/tc_nm/internet_study_dc_1

Now, my own personal opinion is that video content is very time-consuming. Although it may be the easiest to digest, it is the most inefficient at getting information. Most of this blog is about getting information, not about entertainment.

If you are reading this blog -- you are probably speed-reading, skipping most of the fluff to get the key pieces of information. That is good -- that is how it is supposed to be.

With video/audio content, however, you do not have that option. Written works have no time component, only an 'order of content' component that you can filter through quickly. Video/audio content DOES have a time component, meaning your efficiency in digesting/absorbing the information is limited by the pacing established by whoever created the video/audio.

I personally think podcasts/webcasts are one of the worst ways to send out information (good for advertising/entertainment, but poor for sharing information). I pretty much never watch a podcast/webcast for information -- instead I'll Google for a text version and read through that.

Now, that is just me - there are people/audience that do prefer video/audio content for getting information. And, I might be one of those people...if I didn't have to worry about time ;-)

Wednesday, June 20, 2007

Carnal Knowledge API

Quite the title, eh?

This post is about API, services, or interfaces that are obscure and require 'internal' knowledge to use successfully. What do I mean?

Object result = doIt(object1);

There are two specific scenarios that I think about for obscure APIs/services:
*Carnal Modification
*Carnal Returns

Carnal Modification
This happens only in APIs where the language allows passing references and the objects passed are mutable.

System.out.println(bean1.getValue()); // prints "default"
modifyJavabeanValue(bean1); // no return value, yet bean1 is changed inside
System.out.println(bean1.getValue()); // prints "modified"

By simply calling a method, the object you passed to it has changed. This may not be an expected result, and you have to know that is the intent of the API...i.e., you have to have carnal knowledge about it. And do not be fooled if it has a return type - it can still modify the references you passed in!
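Here is the same trap as a self-contained demo, including the return-type decoy (all names here are purely illustrative):

// Illustration only: a method that looks like a pure computation but mutates its argument.
public class CarnalModificationDemo {

    static class Bean {
        private String value = "default";
        public String getValue() { return value; }
        public void setValue(String value) { this.value = value; }
    }

    // The signature suggests "compute something from the bean"...
    static int computeScore(Bean bean) {
        bean.setValue("modified"); // ...but it quietly rewrites the caller's object.
        return bean.getValue().length();
    }

    public static void main(String[] args) {
        Bean bean1 = new Bean();
        System.out.println(bean1.getValue()); // prints "default"
        int score = computeScore(bean1);      // the return value is the decoy
        System.out.println(bean1.getValue()); // prints "modified"
    }
}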

Carnal Returns
Carnal returns require significant pre-knowledge of how to handle the return.

Object o = getMyStuff();

In the above example, you have no idea what is supposed to be returned; even worse, it may return one of, say, five different types of objects that share no common interface. You can check/reflect (depending on the language) to find out what the actual object type is, but every caller has to know to do that. Horrible!!!
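This is what every caller of such an API ends up writing (a hypothetical, self-contained demo; the types are stand-ins):

import java.util.*;

// Hypothetical demo of the burden an Object-returning API puts on every caller.
public class CarnalReturnsDemo {

    // Stand-in for the badly designed API: returns one of several unrelated types.
    static Object getMyStuff() {
        return Math.random() < 0.5 ? (Object) "raw text" : (Object) Arrays.asList(1, 2, 3);
    }

    public static void main(String[] args) {
        Object o = getMyStuff();
        // Every caller must carry this carnal knowledge around:
        if (o instanceof String) {
            System.out.println("Got text: " + o);
        } else if (o instanceof List) {
            System.out.println("Got a list of " + ((List<?>) o).size() + " items");
        } else {
            throw new IllegalStateException("Unexpected return type: " + o.getClass());
        }
    }
}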

String result = changeThis(rawdata);

This example is almost as bad - the returned String content may be something unexpected: it could be XML, could be a comma-delimited string, could be raw java/perl/php code that you are expected to run. This can be alleviated easily with documentation AND by specifying the expected result in the method signature:

String result = changeThisToXML(rawdata); // returns XML

Awareness
Just trying to share some awareness: just because you found a neat/cool way to pull something off, other people (or you, using someone else's work) may run into obscure or unexpected results related to Carnal Knowledge requirements. There are indeed times when you can only do it a certain way; just remember to document it and shape your method signatures to make the intent as clear as possible -- you never know, 5 years later you might have to use your own API/service!


NEW: I recently learned that, surprisingly, there is functionality when writing stored procedures to *change* the fields in the result set based on the parameters passed in...and that people do this!! Exact same problem.

Monday, June 18, 2007

Data Improvement - Addresses

I titled this post specifically as 'Data Improvement' instead of 'Data Assurance' or 'Data Quality'. The reason is quite simply that unless you have deterministic data coming in, you cannot be assured of what may be passed as data. Deterministic = there is a fixed set of values that will be accepted.

Addresses data

deterministic
A deterministic field in an address is the US State code field: a two-letter code with only 50 acceptable values (a few more if you accept DC and the territories); all others are rejected. These values can be cross-checked against the 5-digit zipcode (you do not need the full 9 digits for a State cross-check) to ensure both the zipcode and the State code are in sync. I like deterministic - easy to work with. A sketch of such a cross-check follows.
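A minimal sketch of that cross-check (the prefix table here is a tiny illustrative subset; a real table maps every 3-digit ZIP prefix to its state):

import java.util.*;

// Sketch: cross-check a two-letter state code against a 5-digit zipcode.
// The prefix table is an illustrative subset; real tables cover every prefix.
public class StateZipCheck {

    private static final Map<String, String> ZIP_PREFIX_TO_STATE = new HashMap<String, String>();
    static {
        ZIP_PREFIX_TO_STATE.put("100", "NY"); // 100xx = New York City
        ZIP_PREFIX_TO_STATE.put("606", "IL"); // 606xx = Chicago
        ZIP_PREFIX_TO_STATE.put("900", "CA"); // 900xx = Los Angeles
    }

    public static boolean inSync(String state, String zip) {
        if (state == null || zip == null || zip.length() < 5) return false;
        String expected = ZIP_PREFIX_TO_STATE.get(zip.substring(0, 3));
        return state.equalsIgnoreCase(expected); // null expected -> false -> reject
    }

    public static void main(String[] args) {
        System.out.println(inSync("NY", "10001")); // true
        System.out.println(inSync("CA", "10001")); // false - flag for review
    }
}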

non-deterministic
A non-deterministic field is the actual address line. Attempts to improve the data on the address line include separating the physical STREET address line from the postal MAILING address line. But just because there are two separate fields doesn't mean the data will be in the right place...usually when you ask for address information, you are asking a human being, and human nature will kick in.

improve non-deterministic data - standards/specifications
So what can you do about these address lines? For the most part, nothing - what you get passed as data is what you have to work with. However, if you have a specific intent where you need address information to be relatively accurate, you can do something. First, determine your intent:
  • Accurate Mailing Address
  • Bulk Mailing discounts with POSTNET/barcode/zipcode sorts.
  • Separation between Street address for carrier shipment vs passing a mailing address.
  • Individual person identification from different data sources (e.g. John Smith at 1 West Rd vs 34 Baltic Ave).
USPS Publication 28/CASS software
If your data is 99% United States addresses and you are concerned with address accuracy for actual mailings/shipments, look at some type of official CASS software: http://www.usps.com/ncsc/addressservices/certprograms/cass.htm

However, if you are trying to improve the data for the last option - individuality - and cannot afford CASS software for this feature (which, btw, I highly recommend you get anyway, because you can also enhance it with Address Change information), you can follow what is called 'USPS Publication 28' to standardize how the addresses look. This will not make your data foolproof by any means, but it should help greatly. Examples are better (a normalization sketch follows the list):
  • 1 West Road vs 1 WEST RD; 1 West River Road vs 1 W RIVER RD
  • P.O. Box vs PO BOX vs POBOX vs P.O.BOX
  • APARTMENT # 4, APT #4, APT 4, APARTMENT 4
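A minimal sketch of the kind of token-level normalization Publication 28 calls for, assuming a tiny abbreviation table. (The real publication has hundreds of abbreviations plus context rules - for example, it keeps 'WEST' spelled out when it is the street name itself, as in '1 WEST RD', which this naive per-token pass gets wrong.)

import java.util.*;

// Sketch: a tiny subset of USPS Publication 28 address-line standardization.
// The abbreviation table and punctuation handling are deliberately minimal.
public class Pub28Normalizer {

    private static final Map<String, String> ABBREV = new HashMap<String, String>();
    static {
        ABBREV.put("ROAD", "RD");
        ABBREV.put("STREET", "ST");
        ABBREV.put("AVENUE", "AVE");
        ABBREV.put("WEST", "W");
        ABBREV.put("APARTMENT", "APT");
    }

    public static String normalize(String line) {
        String s = line.toUpperCase(Locale.US)
                .replaceAll("P\\.?\\s*O\\.?\\s*BOX", "PO BOX") // P.O. Box / POBOX / P.O.BOX -> PO BOX
                .replaceAll("[.,#]", " ")                      // strip punctuation: "APT #4" -> "APT  4"
                .trim()
                .replaceAll("\\s+", " ");                      // collapse whitespace
        StringBuilder out = new StringBuilder();
        for (String token : s.split(" ")) {
            String abbrev = ABBREV.get(token);
            out.append(abbrev != null ? abbrev : token).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("1 West River Road")); // 1 W RIVER RD
        System.out.println(normalize("Apartment # 4"));     // APT 4
        System.out.println(normalize("P.O.Box 12"));        // PO BOX 12
    }
}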
My first attempt at following USPS Publication 28 in Java has met with some success. I happened to code this originally as its own class, but adapted it to run the specification rules through the Pentaho Data Integration (Kettle) product as a static method call in their JavaScript step; it pushed over 3000 records/sec on my desktop, which is sufficient for my initial needs.
  1. 400k distinct raw address lines.
  2. Java-based converter for the USPS Pub 28 specification.
  3. 345k processed distinct address lines.
> 13% data improvement
By simply modifying the data to follow the specification, I essentially 'corrected' more than 50k entries in the sample (55k of 400k, or 13.75%). Now that is savings!

Thursday, February 22, 2007

Useless reports and GIGO

I got pulled into a couple of meetings today. Both meetings involved doing some analysis of data submitted to me. Both results/reports had problems - hence the meetings. I will talk about one of them, as it is more specific to the point.

GIGO

A person (the manager) entered information through Excel. I sorted and categorized the information based on unique identifiers in one of the columns. The meeting started off with: why does this report have 4 different numbers and identifiers for 'XYZ'? I should have merged the 'X YZ', 'XZY' (typo), and 'X Y Z' identifiers together to get one correct number; this report is of no use; and where did I get this information?

GIGO

I tried to explain that this was the information that person had entered in Excel. I then went on to offer a couple of options for enumerated/listed identifiers so that they would be consistent. "But I do it this way because I can [enter uber entry technique process] it very fast." That is good...but why are we having this meeting? "Because the report is useless."

GIGO

The sooner you can control the input process for data, the better for everyone. I am looking for comments and suggestions to help control scenarios like this that do not involve nerf bats with a big 'GIGO' label on them.

Monday, February 12, 2007

Pentaho BI

It has been several months since my last blog, and I feel like I should be struck with rosemary beads and dunked in holy water for waiting so long. But, without further ado -

Pentaho - http://www.pentaho.org - is a 'conglomerate' open source project that has put together several related projects under one umbrella. BI, or Business Intelligence, is the entire process of obtaining, scrubbing, analyzing, reporting, and then re-analyzing/re-reporting on business data.

Pentaho has combined many of the elements to handle most of the BI stack under a friendly LGPL/no-cost license, allowing you to start using Business Intelligence in even the smallest of projects. That is huge...usually a project had to push over a quarter of a million dollars (give or take, depending on your resources) before you could really engage Business Intelligence. Now, that is no longer the case. :-)

I had the wonderful opportunity to go to the Advanced Implementation Workshop in Orlando, FL from Jan 29 to Feb 1, 2007. Having used a portion of Pentaho on previous projects, I had sufficient knowledge to get the most out of the Workshop.

In addition, I was able to meet Matt Casters of the Kettle (Pentaho Data Integration) project, Julian Hyde of the Mondrian (Pentaho Data Analysis Services) project, and Thomas Morgner of the JFreeReport (Pentaho Reporting) project. This added confidence: talking with them, I got a strong impression that they thoroughly understand their individual domains.

Pentaho is still 'in the rough': there are occasional user-facing design and presentation items to clear up, as well as enterprise functionality from a developer standpoint that needs to be resolved or added. Most of these issues are related to the full BI Suite (which itself is only a couple of years old) and are usually minor, while the individual projects have been around for quite some time.

Overall, I am very impressed and will start working more heavily with Pentaho, regardless of whether it is (initially) just reporting or a more full-fledged BI solution. Now, if only they could replace JPivot...