Tuesday, January 15, 2013

SharePoint 2013 w/ Apache Chemistry CMIS

Sharing my experience in trying to use the Apache Chemistry CMIS library to work with SharePoint 2013.  As a preface, I have existing code integrated into a CEVA (content-enabled vertical application, which seems to be the buzzword) that uses Alfresco 4.2 CE as a backend, and I am evaluating the compatibility of that system with a SharePoint 2013 backend (I'm not swapping to SharePoint, just cross-checking).

I use 'sp2013' as the server name; replace as appropriate.

  1. Work with CMIS Workbench as your go-to tool for confirmation before working with your code.  This is like your SoapUI when working with web services, or your database editor tool when trying to write queries for your application.  Work through everything you want to do with CMIS Workbench *first* before you write code.
  2. SharePoint 2013 setup notes:
    1. SharePoint Central Admin (http://sp2013:90/):
      1. Security, under 'General Security' section, 'Specify Authentication Providers'.
      2. Pick the default zone or, if you know SharePoint, the appropriate zone.
      3. Under Windows Auth, I had to enable 'basic auth'. I also disabled integrated auth, as my intent was to use SharePoint solely as a repository, so there was no need to get the system confused between integrated and basic auth (obviously, if using this route in production, you need to set up SSL).
    2. Site Settings (http://sp2013): 
      1. Pick the site you want to access through CMIS (for example, 'Documents').
      2. In the upper right, beside the login name, is a 'gear' icon for settings - click that, go to 'Site Settings'.
      3. Under the 'Site Actions' header is a 'Manage Site Features' link; click that.
      4. Activate 'Content Management Interoperability Services (CMIS) Producer'
      5. Repeat for each site you want to access. Each site will appear as a unique Repository from the CMIS point of view.
  3. Again, use CMIS Workbench for all your confirmation/testing.  Add some files/folders to the above Site(s)/Repo(s) you shared for CMIS.
    1. Connect to the URL http://sp2013/cmis/rest/?getRepositories through CMIS Workbench. You will likely use this one for your Apache Chemistry code as well.
    2. For Apache Chemistry, lessons learned:
      1. DO NOT try to create a session by re-using your parameter Map and adding the repository ID to it. Instead, get the Repository object directly and call Repository.createSession().
      2. For an example of best-practice usage of the Apache Chemistry CMIS library, look at the CMIS Workbench source code (that is how I learned the above error/correction).
  4. There are some CMIS functions that DO NOT work with SharePoint 2013.  I ran into only one and have not done a thorough review, but it already delayed me significantly:
    1. The SCORE() query function does not work in SharePoint 2013.
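
The connection details and the SCORE() limitation above can be sketched in a few lines. This is only an illustration using Python's standard library, not the Apache Chemistry API: the function names `build_service_request` and `build_fulltext_query`, the `relevance` alias, and the sample credentials are all assumptions made for the example. It builds (without sending) a Basic auth request for the getRepositories URL, and a full-text query that only includes SCORE() when asked, so the same code can target both Alfresco and SharePoint 2013.

```python
import base64
import urllib.request

# Endpoint from the notes above; 'sp2013' is the placeholder server name.
SERVICE_URL = "http://sp2013/cmis/rest/?getRepositories"


def build_service_request(user, password, url=SERVICE_URL):
    """Build (but do not send) a GET request carrying a Basic auth header."""
    raw = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token)
    return request


def build_fulltext_query(term, use_score=False):
    """Return a CMIS full-text query; SCORE() is only added on request."""
    # Assumption: keep the sketch simple by refusing terms that need escaping.
    if "'" in term or "\\" in term:
        raise ValueError("term must not contain quotes or backslashes")
    if use_score:
        # Standard CMIS query syntax, but the part that failed for me on SharePoint 2013.
        return ("SELECT cmis:name, SCORE() AS relevance "
                f"FROM cmis:document WHERE CONTAINS('{term}')")
    return f"SELECT cmis:name FROM cmis:document WHERE CONTAINS('{term}')"
```

For example, `build_fulltext_query("invoice")` gives a query without SCORE() that is a reasonable candidate to try in CMIS Workbench against the SharePoint repository first.
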
A HUGE kudos goes to http://gauravmahajan.net/2013/01/06/sharepoint-2013-rtm-on-win-server-2012-virtual-machine-download/ for the virtual machine download. I already spent enough time just dealing with SharePoint and CMIS, never mind getting all the infrastructure up and running. Big thank you!

-Darren

Monday, December 03, 2012

Document Management - CMIS 1.1 protocol approved

With apparently very little fanfare, CMIS 1.1 passed the final votes to become an approved specification.

https://www.oasis-open.org/committees/download.php/47441/ballot_2311.html

Now, some people may come to this blogpost and ask the question: "What is CMIS and what is great about 1.1 being approved?"

CMIS is an attempt to standardize the protocol used to communicate with document/content management systems (EDMS/ECM). These systems have been around for ages (>15 years?), but they are ruled by large, proprietary giants who protect their investments by making sure that once you are integrated with them, you are locked in unless you make another large investment to redesign and redevelop all those integrations for another proprietary system.  These may not have been malicious decisions, just attempts to provide added value, but the end result is the same: you get locked in.

History:
WebDAV - a protocol that some of them started to follow.  A good protocol, but it lacked standard query support and repository/administration support.

JCR - Java Content Repository (two different revisions over time).  Java-specific; an attempt to define *how* to build a content repository, the underlying piece of an EDMS/ECM, but it didn't exactly define a good integration/interaction protocol for clients or other tools.  However, it did plant the seed for various open source alternative EDMS/ECMs, so thank you (although it is Java-specific, at least someone started something!)

CMIS 1.0 - after JCR, CMIS came into play as a language-agnostic way to search and retrieve documents (AtomPub and Web Services bindings).

So...what is so great about CMIS 1.1?  It brings:

*  A standard way to create custom object types (content models, document types, etc.) through a common/standard protocol instead of relying on each vendor to provide their own mechanism (2.1.10 Object-Type Creation, Modification and Deletion)

*  A standard way to support 'mixin', or reuse, of properties using 'secondary types' (2.1.9 Secondary Object-Types)

With these two very important features, you can now create, search, retrieve, and (partially) maintain your content completely through a standard protocol, allowing creation of tools and interfaces against the protocol instead of vendor-specific implementations.
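
To make the 'mixin' idea a little more concrete, here is a rough sketch in plain Python (not a CMIS client). It assumes a toy type catalogue: every `acme:*` type id and property id is made up for illustration, and only `cmis:document`, `cmis:name`, and `cmis:objectTypeId` come from the standard. An object has exactly one primary type, and any number of secondary types can be attached to contribute extra properties; on a real CMIS 1.1 repository the attached ones are listed in the `cmis:secondaryObjectTypeIds` property.

```python
# Toy catalogue: type id -> property ids that the type itself defines.
TYPE_PROPERTIES = {
    "cmis:document": ["cmis:name", "cmis:objectTypeId"],
    "acme:invoice": ["acme:invoiceNumber"],    # hypothetical primary type
    "acme:retention": ["acme:retainUntil"],    # hypothetical secondary type
    "acme:clientTag": ["acme:clientId"],       # hypothetical secondary type
}

# Primary types inherit from a parent; secondary types are attached per object.
PARENT_TYPE = {"acme:invoice": "cmis:document"}


def effective_properties(primary_type, secondary_type_ids=()):
    """List the property ids an object ends up with, base type first."""
    chain = []
    current = primary_type
    while current is not None:
        chain.append(current)
        current = PARENT_TYPE.get(current)

    ordered = []
    for type_id in list(reversed(chain)) + list(secondary_type_ids):
        for prop in TYPE_PROPERTIES[type_id]:
            if prop not in ordered:
                ordered.append(prop)
    return ordered
```

The same `acme:retention` mixin could be attached to an invoice, a contract, or a plain `cmis:document` without creating a new primary type for each combination, which is the reuse the bullet above refers to.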

Do not get me wrong, the innovators in this space (Alfresco, for example) provided vendor-specific value-adds before the industry caught up, but some people, like myself, were resistant to those vendor-specific value-adds until they could interoperate with other solutions.  To say you picked a solution *only* because of a value-add feels like lock-in.  To say you picked a solution above all others that offer the same features (CMIS 1.1 protocol) says a LOT more :-)

CMIS standard: http://docs.oasis-open.org/cmis/CMIS/v1.1/CMIS-v1.1.html

Monday, July 02, 2012

Laptop build

Although I was looking for a pre-installed Linux laptop, I found a too-good-to-pass-up deal on a ThinkPad X230t with sufficient capabilities for my Xen server needs in a compact 12" form factor.

 Step 1: Shrink the volume on the default Windows 7 install. Windows provides better support nowadays for volume resizing. If you go to the Control Panel and search for 'partition' as a keyword, you will see the disk management tools. These give you the ability to shrink the volume....sort of. It isn't an exact tool, and you will need to shrink, reboot, defrag, reboot, then shrink some more, repeating until you get to the target size. I was aiming for 120GB, and it took 4 tries to get there.

 Step 2: Back up the Windows image. Although there is the familiar Ghost imaging software if you have the money, I wanted to look at alternatives. Lenovo provides its own backup/restore software that looks like it would work well. I got a backup USB hard drive (not USB flash), and the Lenovo ThinkVantage backup/restore handled the MBR and backup images flawlessly. However, being someone who wants to avoid lock-in and move towards automated/repeatable provisioning, I kept looking.

Cobbler is a tool I keep falling back to for image-based provisioning (versus kickstart-based installs), and it has support for provisioning images from Clonezilla (http://clonezilla.org). Clonezilla supports Windows imaging, so I am sharing my findings:

 1) Reformat your external device (USB hard drive in my case) so that its first partition is small, such as 250MB, with a FAT32 filesystem. This is important to avoid a lot of trial/error: other versions of FAT will not work, and a too-large volume causes problems. Don't worry, you still want a second partition that is much larger (at least 32GB) to store the actual images.

 2) Use tuxboot.exe. Clonezilla highly recommends it, and they are right to do so. Once you get the partitions straightened out, everything else is a cakewalk. And, yes, you can use it directly from Windows without needing a Linux install.

 3) Plug your device into the Windows machine you want to image. If it is only for one machine, you are good to go. If you are trying to create a 'gold image' for distributing to multiple machines at once, look into 'sysprep' and other tools to prepare the Windows install.

 4) Reboot your machine, and use your BIOS to choose the alternate boot device. If you do not see your device, note that some USB 3.0 ports/devices are not recognized as bootable locations, so plug into a USB 2.0 port to be sure.

 5) The directions provided with Clonezilla were excellent! If you want to review them beforehand, you can check their site. Create the image and store it in the large partition; it takes about 40 minutes (for a minimal Windows install) to create the image and then verify it.

That is all for tonight, more updates later.....

Sunday, May 27, 2012

Pre-installed linux laptops

Looking around for pre-installed Linux laptops. Although one can install it oneself, there are some time savings around dealing with laptop components/Linux driver support. My particular need is for a Xen/virtual-style environment for many 'servers' for development/research. With that in mind, here is what I have been looking for:

*  11"-14" primarily, 15" if they have a 9-cell battery for longer battery life, but I've only seen 9-cell on 17" thus far (and do not want that big a laptop).

*  i7 CPU (or similar AMD, I just have not seen many in laptops nowadays).

*  16GB RAM; if there is a 32GB RAM option via 4 SODIMM slots, great!

*  750GB/7200 rpm hard drive. No SSD. No hybrid. No 5400 rpm drives. If larger, great.

*  LAN port + wireless N built in.

*  VGA/HDMI/similar video output for demos/etc.

And, based on current pricing, ~$1000. And, preferably, a Fedora or CentOS Dom0/host OS that is a full OS (X Windows, Eclipse IDE, etc. support), with Xen VM support for guest OSes. Ubuntu/others if that is the only option, but I would prefer Fedora/CentOS.

So far, I have found only a handful of companies that seem reasonably able to handle these kinds of requirements:

*  http://zareason.com/ - has Fedora support, and laptops within the above configuration range.

*  https://www.system76.com - best 'known' Linux laptop, only Ubuntu.

Still reviewing my options!

Monday, February 27, 2012

nosql/mongodb and experienced developers

Below is a great comedy sketch with good technical and farming references, which I appreciate since I grew up on a farm.

I'm not biased against any NoSQL databases, but they are not a silver bullet either, and jumping straight to one would be futile without good experience with what you are doing....pretty much exactly what the other person is talking about :-)

Thanks to my close friend at http://www.rentageekme.com for sharing!

http://www.xtranormal.com/watch/6995033/mongo-db-is-web-scale

Thursday, December 08, 2011

Document Capture and Tables/Tabular/Invoices (ocr)

One of the roles I fulfill is working heavily in Data, or Document, Capture.

This covers a wide range:

Document Capture (or Document Content Management/Records Management as the modern term) - Index a couple of fields to be able to search/retrieve the image/document later. The second part is where you store, search, retrieve after the indexes have been captured, but that's for another time and not the focus here.

Data Capture - Collect information from paperwork for use by systems. The original image/document is not relevant after capture except as a reference. Usually unstructured documents or low volume documents.

Forms Processing - Collect information from paperwork in a fast, repeatable process. The original image/document is not relevant after capture except as a reference. Forms processing is an advanced form of Data Capture for consistent forms (structured documents), where the data elements are always in the same location on the form and there is (practically) no variance in the forms/data locations.

Back to the topic at hand - Tabular Capture, or being able to OCR and key information that is in table format from images that may have come from output systems, scanning, faxing, or other means, and trying to turn it BACK into data.

How do we obtain information from tables on paper?

Forms Processing - one answer, zones. Forms Processing is designed to collect information from data points on the image/document where the data element is always in the same position. If the first column/first row of a table is always 5" from the top, 1.35" from the left side, has a width of 2" and a height of 1", you zone that area. By zoning, OCR knows exactly where to go for the information, and can be tuned in how it reads the elements (I only expect numeric values here, so there will be no lowercase L's, Oh's, or uppercase I's or Z's). Also, by zoning, manual entry becomes easier, since the operator can look directly at the location. And then exporting: you already know the context of the data element because it was in a specific location, so you already know it is row 1/column 1 and can put it in the right location for your export.

Phew.....lots of good stuff with Zones, or sometimes called 'Zonal OCR'. And you don't even need OCR to use zones. Downside? Lot of time in setup and tuning. Lots of time. And you need the right tools in your capture suite to support it. And again, it doesn't even have to use OCR, just setting up zones for manual capture and your export is a gain.
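
To show what a zone amounts to in practice, here is a small sketch in plain Python, using the measurements from the example above. Everything named here (`Zone`, `to_pixel_box`, `clean_numeric`, the 300 DPI scan resolution, and the specific character substitutions) is an assumption for illustration and is not tied to any particular capture suite. It converts the inch-based zone into a pixel box for a scanned page and applies the "numeric only" expectation to a raw OCR result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A fixed rectangle on the form, measured in inches from the page edges."""
    name: str
    top_in: float
    left_in: float
    width_in: float
    height_in: float

    def to_pixel_box(self, dpi):
        """Return (left, top, right, bottom) in pixels for a scan at `dpi`."""
        left = round(self.left_in * dpi)
        top = round(self.top_in * dpi)
        right = round((self.left_in + self.width_in) * dpi)
        bottom = round((self.top_in + self.height_in) * dpi)
        return (left, top, right, bottom)


# Assumed substitutions for a numeric-only zone: letters that OCR commonly
# confuses with digits are mapped back to the digit they resemble.
NUMERIC_FIXES = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0", "Z": "2"})


def clean_numeric(raw):
    """Strip whitespace and apply the numeric-only substitutions."""
    return raw.strip().translate(NUMERIC_FIXES)


# Row 1 / column 1 of the table, as described in the text.
row1_col1 = Zone("row1_col1", top_in=5.0, left_in=1.35, width_in=2.0, height_in=1.0)
```

At an assumed 300 DPI, `row1_col1.to_pixel_box(300)` is the box you would crop before handing the image to OCR or showing it to a keying operator, and the zone name already tells the export which row/column the value belongs to.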

So...what happens when the paperwork has tables but is sporadic, inconsistent, unstructured, and may have a high rate of change that you not only have no control over, but get no upfront notification of? Examples, you ask? Invoices are the biggest culprit, but there are many others out there.

Answer? Well....this is where some companies have innovative approaches to the problem, but from my point of view nothing has been great yet. The column locations are likely different between tables (i.e. the first column on one invoice is the product ID, on another it is the description, on yet another it is the quantity). Some have tried using regular expressions (regex for short) to detect the context of the data, but a unit price, calculated price, discount price, and total price all look the same and again could be shuffled around column-wise depending on the invoice. Others have made some basic attempts at image analysis to do table detection, and try to OCR the headers for the context of the columns (but they run into the problem that invoices have different column header names for the same semantic meaning, while in others the headers have inverse coloring (white text on black background)). Of all of these, this is probably the best automation approach, but it is very immature at the moment.
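
The two approaches just described can be illustrated with a short sketch in plain Python. It is a toy under stated assumptions: the synonym table, the canonical column names, and the helper names (`map_headers`, `looks_like_money`) are all invented for the example, and real invoices will have header names that are not in any such table. It shows why a regex on the cell values alone cannot tell a unit price from a total, and how OCR'd header text could supply the missing context when the header happens to be recognized.

```python
import re

# Hypothetical synonym table: canonical column -> header texts seen on invoices.
HEADER_SYNONYMS = {
    "product_id": {"item", "item no", "product id", "sku", "part no"},
    "description": {"description", "desc", "details"},
    "quantity": {"qty", "quantity", "units"},
    "unit_price": {"unit price", "price each", "rate"},
    "total_price": {"amount", "total", "line total", "extended price"},
}

# Flattened lookup: normalized header text -> canonical column name.
LOOKUP = {syn: canon for canon, syns in HEADER_SYNONYMS.items() for syn in syns}

# A money-shaped value: optional dollar sign, digits, optional thousands groups, cents.
MONEY = re.compile(r"^\$?\d+(?:,\d{3})*\.\d{2}$")


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so 'Qty.' matches 'qty'."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def map_headers(ocr_headers):
    """Map each OCR'd header to a canonical column name, or None if unknown."""
    return [LOOKUP.get(normalize(header)) for header in ocr_headers]


def looks_like_money(cell):
    """True for any price-shaped cell; says nothing about WHICH price it is."""
    return bool(MONEY.match(cell.strip()))
```

Both `looks_like_money("12.50")` and `looks_like_money("1,250.00")` are true, whether the cell is a unit price or a line total, so the value pattern gives no column context. `map_headers(["Qty.", "Item #", "Amount", "Notes"])` resolves the first three and returns `None` for the last, which is the point where a human still has to step in.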

All good attempts to automate the unstructured tabular capture problem, and maybe in controlled scenarios they work great. But in the real world, let's face it: a human being will need to help figure out how the table is structured and the context of the data elements so they can be captured appropriately (whether by OCR or manually again doesn't matter), and this has to be done in a way that is efficient and productive.

Posting here in case anyone has found anything. If not, and you stumbled on this blog hoping to solve this specific problem, at least you are not alone!

Sunday, December 04, 2011

JavaEE 6 app servers compared

Thank you Antonio!

Baseline/platform sizing of different JavaEE containers (disk, ram, startup).

http://agoncal.wordpress.com/2011/10/20/o-java-ee-6-application-servers-where-art-thou/

The more complex metrics of scalability (CPU/memory increase as you add more load), performance (first call as well as high concurrency), and cluster/HA require holding the OS/hardware/VM/JVM constant, which takes quite a bit more setup and time. At least the metrics above are relatively constant.