<b>Darren Hartford's Developer Blog</b> - Software Engineering, Development, Open Source, Java, and solutions to everyday problems.<br />
<br />
<b>S3-dist-cp and recursive subdirectory groupings</b> (2016-02-23)<br />
When working with AWS (specifically AWS EMR Hadoop), you can use s3-dist-cp to concatenate files together with the --groupBy option. What is really cool is that this works even on already-compressed (gzip) files!<br />
<br />
However, recursive sub-directories are not natively supported by s3-dist-cp, so the files need to be staged first. For staging, we will use the plain hadoop distcp that s3-dist-cp originated from, since it has some useful features (such as recursive glob patterns) that s3-dist-cp lacks. <br />
<br />
Using AWS EMR you can create a Custom JAR step and either use /usr/lib/hadoop/hadoop-distcp.jar or upload your own version of hadoop-distcp.jar to S3 and reference that version. For the step arguments, copy the contents with --update to a destination staging area so that the individual files end up in a flattened directory structure. In this example, I'll filter to just csv.gz files.<br />
<br />
ARGS:<br />
--update s3://test/raw/**/*.csv.gz s3://test/staging<br />
<br />
After that, you can use command-runner.jar to concatenate in any grouping defined by the regular expression. The example below groups by a 4-digit sequence (years, for example) in the filenames, so that all the daily/monthly files are put together into a single per-year file. The --outputCodec gzip option ensures that the resulting file is also compressed.<br />
<br />
ARGS:<br />
s3-dist-cp --src=s3://test/staging --dest=s3://test/grouped/ --groupBy .*([0-9][0-9][0-9][0-9]).* --outputCodec gzip<br />
<br />
<br />
If you get errors like "ERROR: Skipping key XYZ because it ends with '/'", it is usually because either there are no source files, or the regex in your --groupBy is not quite right and matches no files.<br />
<br />
<br />
<br />
<b>Gamer Post (non development): Star Citizen coupon credits</b> (2015-10-10)<br />
<div>
For any gamers interested in the up-and-coming space epic <a href="https://www.facebook.com/RobertsSpaceIndustries">Star Citizen</a> from Roberts Space Industries / Cloud Imperium (and their pretty impressive Star Marine FPS for planet and capital-ship capture integration), my referral code can get you 5,000 in-game credits that you wouldn't get otherwise! (p.s. you can register an account now, before you forget, without obligation - the 5,000 will be waiting for you)</div>
<div>
STAR-D9GQ-3T44</div>
<div>
enlist link: <a href="https://robertsspaceindustries.com/enlist?referral=STAR-D9GQ-3T44" rel="nofollow" target="_blank">https://robertsspaceindustries.com/enlist?referral=STAR-D9GQ-3T44</a></div>
<div>
<br /></div>
<div>
Trying to spread the word and give people a little boost to get started!</div>
<b>New development (visual) perspective</b> (2015-05-10)<br />
It has been a while since my last post. I've been quite overwhelmed with additional challenges, which have been overcome one at a time.<br />
<br />
However, being a passionate technologist, I am always looking for ways to make myself and my team more ready to take on the next challenge.<br />
<br />
Toward that goal, for the last couple of months I've been evaluating AR, VR, and 'holo' alternatives to provide a different 'perspective' on the heads-down development ecosystem/environment.<br />
<br />
<br />
Key findings (end of Q1/2015):<br />
<br />
<ul>
<li>AR, Augmented Reality / an overlay over the real world, such as the offerings from Google Glass, Meta (Pro), Atheena, and Vuzix, are all trying to take on many things at once. I, for one, do not need a head-mounted camera, especially when you will always have a phone with you that has a better camera anyway -- and rather than using the camera for full AR, I would rather use the 'smartglass' as a portable monitor.</li>
<ul>
<li>Extra features not needed (camera)</li>
<li>Needs to be 'socially acceptable', particularly in meetings. Privacy concerns also exist around the camera, so drop the camera, or provide an (obviously marked?) option that does not include the camera.</li>
<li>Intent One for supportive information, lookup/confirmation in meetings.</li>
<li>Intent Two for development environment where not limited to single-pane monitor. Need higher resolution (i.e. 1600x1200) to be useful.</li>
</ul>
<li>HoloLens (Microsoft). They do not provide any actual specifications, so...vaporware for the time being. Interesting concept, but not likely to impress compared to VR.</li>
<li>VR, Virtual Reality, where there is no attempt to overlay the real world. The goal is 'high quality media' (i.e. a gaming experience), so no camera; the focus is on video resolution, sound, and hopefully addressing eye-fatigue challenges.</li>
<ul>
<li>Project Morpheus: Sony/PS4 specific, not relevant since I need a PC platform.</li>
<li>Oculus Rift: very popular, but the newer consumer version is targeted for 2016.</li>
<li>Valve/Steam HTC Vive: likely the best candidate - high resolution display, with SteamOS intended for multi-OS environments (Windows, Linux, SteamOS), very good for both traditional business and gaming development shops. Targeted for the end of 2015.</li>
<ul>
<li>Bonus: Valve/Steam is also a nice distribution platform in itself. One of my personal goals was to help provide 'optimized' development environments per project. By integrating into Steam as a distribution platform and providing 'modules' or 'add-ons' that deliver the optimized environment/experience, you get a faster bootstrap process for new developers.</li>
</ul>
</ul>
</ul>
<div>
HTC Vive is not selling development kits the way Oculus does, an interesting move. I've submitted an application for my team (as an independent) to try to provide another 'perspective' on development.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
http://steamcommunity.com/games/250820/announcements/detail/154588645663414820</div>
<div>
<br /></div>
<div>
<br /></div>
<b>Alfresco Startup time</b> (2014-10-09)<br />
As some of you may know, Alfresco is an EDMS that runs on Tomcat (or another Java container) on Linux or Windows. <br />
<br />
<br />
I've been working with Alfresco for several years in various capacities, and anyone who works with Alfresco Community Edition (CE) knows that configuration changes require a restart, and the restart is painfully slow....like 5 minutes slow.<br />
<br />
<br />
Although one of the 'obvious' fixes is to move alfresco.war and share.war to independent Tomcats and only restart the one you need, that is still another configuration/integration point and a possible source of new issues (particularly if you are running low-volume sites and want to use only one Tomcat/server).<br />
<br />
<br />
<br />
<br />
The 'performance' is always relative to the hardware you are using, and the time will vary depending on big CPU/slow disk, low-end CPU/high-end disk, etc. As such, I will simply provide relative numbers against a baseline, which is Alfresco 4.2.f installed with the Bitnami Linux installer on Fedora.<br />
<br />
<u><i>All of these tests are with a baseline alfresco install, no content, no indexing.</i></u><br />
<br />
<br />
<table border="2" bordercolor="#0033FF" cellpadding="3" cellspacing="3" style="background-color: #99ffff; width: 100%px;">
<tbody>
<tr>
<th>Relative startup time</th>
<th>Configuration</th>
<th>Notes</th>
</tr>
<tr>
<td>0% (baseline)</td>
<td>-XX:+UseG1GC -XX:MaxPermSize=256M -Xms1024M -Xmx1024M</td>
<td>Bitnami base setup, 1024M heap.</td>
</tr>
<tr>
<td>-59% (slower, minutes dep. on hardware)</td>
<td>-XX:+UseG1GC -XX:MaxPermSize=256M -Xms256M -Xmx256M</td>
<td>Comparison when trying low memory (256M), how much of a difference it makes.</td>
</tr>
<tr>
<td>+0-5% (trivial)</td>
<td>-XX:+UseG1GC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M
</td>
<td>Excess ram (for startup) does not impact startup time. Note however you usually should have 2G-8G for normal, real-world, usage</td>
</tr>
<tr>
<td>+12% faster (20 sec dep. on hardware)</td>
<td>-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M
</td>
<td>Changing to a Concurrent GC (because we know CPU is maxed) actually made a good difference (assuming you are not heavily disk IO bound).</td>
</tr>
<tr>
<td>+14% faster </td>
<td>-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M
<br />
put tomcat/alfresco/openoffice in a ramdrive to remove disk IO concerns (/dev/shm for example)
</td>
<td>Keeping the best-performing setup, try a ramdrive to rule out disk IO issues... which really aren't an issue at startup. However, your alf_data location absolutely has an impact on real-world content operations.</td>
</tr>
<tr>
<td>+8% faster (slower than ConGC with defaults)</td>
<td>-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=31
</td>
<td>Additional GC tuning (quick and admittedly uninformed; I didn't want to spend a lot of time on this). Just used the settings from http://www.oracle.com/technetwork/java/tuning-139912.html#section4.2.6</td>
</tr>
<tr style="background-color: #ff99ff;">
<td>+18% faster</td>
<td>-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M
<br />
Modify /alfresco/WEB-INF/web.xml and /share/WEB-INF/web.xml with metadata-complete=true and absolute-ordering
</td>
<td>The most performant memory setup, plus this find: http://wiki.apache.org/tomcat/HowTo/FasterStartUp . Obviously not a solution if you are using the stock or as-is installer approach, but if you can customize your WAR (not just the exploded directory), there are surprising gains.</td>
</tr>
<tr>
<td>+20% faster (memory change, small app conf change, ramdrive)</td>
<td>-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M
<br />
Modify /alfresco/WEB-INF/web.xml and /share/WEB-INF/web.xml with metadata-complete=true and absolute-ordering, ramdrive tomcat/alfresco/openoffice
</td>
<td>Putting it all together: the quickest startup option if you have the RAM for it, though disk IO isn't really an issue at startup. </td>
</tr>
</tbody></table>
So, there are some quick wins with decent return depending on how much you want to modify your install. However, at the end of the day the startup is CPU (single-thread) bound. Now, some of you may remember the older JBoss application servers (4-6 series) and how they were getting slower and slower on startup, until they re-engineered their startup process for *controlled* parallel startup of services. That may be what is really needed to get the Alfresco CE startup time down.<br />
<br />
I recommend the second-to-last option - setting up a ramdrive for that little of a gain is not worth it, and if you are really trying to push startup times, get your alfresco and share WARs onto different Tomcat instances first before moving to a ramdrive.<br />
<br />
<b>Sharepoint 2013 w/ Apache Chemistry CMIS</b> (2013-01-15)<br />
Sharing my experience in trying to use the CMIS library to work with Sharepoint 2013. As a preface, I have existing code integrated into a CEVA (content-enabled vertical application, that seems to be the buzzword) using Alfresco 4.2 CE as a backend, and I am evaluating compatibility of the system with a Sharepoint 2013 backend (I'm not swapping to Sharepoint, just cross-checking).<br />
<br />
I use 'sp2013' for the server name, replace as appropriate.<br />
<br />
<ol>
<li>Work with CMIS-Workbench as your go-to tool for confirmation before working with your code. This is like your SoapUI when working with Webservices, or your Database Editor tool when trying to write queries for your application. Work through everything you want to do with CMIS Workbench *first* before you write code.</li>
<li> Sharepoint 2013 setup notes:</li>
<ol>
<li>Sharepoint Central Admin (http://sp2013:90/): </li>
<ol>
<li>Security, under 'General Security' section, 'Specify Authentication Providers'. </li>
<li>Pick the default zone, or if you know Sharepoint the appropriate zone. </li>
<li>Under Windows Auth, I had to enable 'basic auth'...I also disabled integrated auth, as my intent was to use Sharepoint solely as a repository, so there was no need to get the system confused between integrated and basic auth (obviously, if using this route in production, you need to set up SSL).</li>
</ol>
<li>Site Settings (http://sp2013): </li>
<ol>
<li>Pick the site you want to access through CMIS (for example, 'Documents').</li>
<li>In the upper right, beside the login name, is a 'gear' icon for settings - click that, go to 'Site Settings'.</li>
<li>Under 'Site Action' header is a 'Manage Site Features' link, click that.</li>
<li>Activate 'Content Management Interoperability Services (CMIS) Producer'</li>
<li>Repeat for each site you want to access. Each site will appear as a unique Repository from the CMIS point of view.</li>
</ol>
</ol>
<li>Again, use CMIS Workbench for all your confirmation/testing. Add some files/folders to the above Site(s)/Repo(s) you shared for CMIS.</li>
<ol>
<li>Connect to the URL http://sp2013/cmis/rest/?getRepositories through CMIS Workbench. You will likely use this one for your apache chemistry code as well.</li>
<li>For Apache Chemistry, Lesson learned --</li>
<ol>
<li>DO NOT try to create a session by re-using your Map&lt;String, String&gt; parameter map and adding in the repository ID...instead, get the Repository object directly and use Repository.createSession() (see the connection sketch after this list).</li>
<li>For example of best-approach/usage of the Apache Chemistry CMIS library, look at the CMIS Workbench source code (that is how I learned the above error/correction).</li>
</ol>
</ol>
<li>There are some CMIS functions that DO NOT work with Sharepoint 2013. I ran into only one and have not done a thorough review, but this already delayed me significantly:</li>
<ol>
<li>SCORE() does not work in Sharepoint 2013 </li>
</ol>
</ol>
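Here is a minimal sketch of that lesson - getting the Repository objects and creating the session from one of them rather than re-using the parameter map. The sp2013 host and the getRepositories URL come from the notes above, but the credentials are placeholders and you need the OpenCMIS client libraries on the classpath; treat this as a starting point, not the exact integration code.<br />
<pre>
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.chemistry.opencmis.client.api.Repository;
import org.apache.chemistry.opencmis.client.api.Session;
import org.apache.chemistry.opencmis.client.api.SessionFactory;
import org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl;
import org.apache.chemistry.opencmis.commons.SessionParameter;
import org.apache.chemistry.opencmis.commons.enums.BindingType;

public class Sp2013CmisConnect {
    public static void main(String[] args) {
        SessionFactory factory = SessionFactoryImpl.newInstance();

        Map<String, String> params = new HashMap<String, String>();
        params.put(SessionParameter.USER, "spuser");      // placeholder credentials
        params.put(SessionParameter.PASSWORD, "secret");
        params.put(SessionParameter.ATOMPUB_URL, "http://sp2013/cmis/rest/?getRepositories");
        params.put(SessionParameter.BINDING_TYPE, BindingType.ATOMPUB.value());

        // Each CMIS-enabled SharePoint site shows up as its own repository.
        List<Repository> repositories = factory.getRepositories(params);
        for (Repository repo : repositories) {
            System.out.println(repo.getId() + " : " + repo.getName());
        }

        // Create the session from the Repository object itself - do not
        // re-use the parameter map with a repository id added to it.
        Session session = repositories.get(0).createSession();
        System.out.println("Root folder: " + session.getRootFolder().getName());
    }
}
</pre>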
A HUGE kudos goes to <a class="urlextern" href="http://gauravmahajan.net/2013/01/06/sharepoint-2013-rtm-on-win-server-2012-virtual-machine-download/" rel="nofollow" title="http://gauravmahajan.net/2013/01/06/sharepoint-2013-rtm-on-win-server-2012-virtual-machine-download/">http://gauravmahajan.net/2013/01/06/sharepoint-2013-rtm-on-win-server-2012-virtual-machine-download/</a>, I already spent enough time just dealing with Sharepoint and CMIS, much less getting all the infrastructure up and running - big thank you!<br />
<br />
-Darren<br />
<br />
<b>Document Management - CMIS 1.1 protocol approved</b> (2012-12-03)<br />
With apparently very little fanfare, CMIS 1.1 passed the final votes to become an approved specification.<br />
<br />
https://www.oasis-open.org/committees/download.php/47441/ballot_2311.html<br />
<br />
Now, some people may come to this blogpost and ask the question: "What is CMIS and what is great about 1.1 being approved?"<br />
<br />
CMIS is an attempt to standardize the protocol to communicate with document/content management systems (EDMS/ECM). These systems have been around for ages (>15 years?). But they are ruled by large, proprietary giants who protect their investments by making sure that once you are integrated with them, you are locked into them without another large investment to re-develop/design all those integrations into another proprietary system. These may not have been malicious decisions, but attempts to provide value add, but the end result is the same --- you get locked in.<br />
<br />
History:<br />
WebDAV - a protocol that some of them started to follow. A good protocol. But lacked standard query support and repository/administration support.<br />
<br />
JCR - Java Content Repository (two different revisions over time). Java-specific, attempt to define *how* to build a content repository, the underlying piece of an EDMS/ECM, but didn't exactly define a good integration/interaction protocol for clients or other tools. However, this did plant the seed to create various open source alternative EDMS/ECMs, so thank you (although it is java-specific, at least someone started something!)<br />
<br />
CMIS 1.0 - after JCR, CMIS came into play as a language-agnostic way to search and retrieve documents (atompub & webservice versions).<br />
<br />
So...what is so great about CMIS 1.1? It brings:<br />
<br />
* a standard way to create custom object types (content models, document types, etc.) through a common/standard protocol, instead of relying on each vendor to provide their own mechanism (2.1.10 Object-Type Creation, Modification and Deletion)<br />
<br />
* a standard way to support 'mixin', or reuse, of properties using 'secondary types' (2.1.9 Secondary Object-Types)<br />
<br />
With these two very important features, you can now create, search, retrieve, and (partially) maintain your content completely through a standard protocol, allowing creation of tools and interfaces against the protocol instead of vendor-specific implementations.<br />
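To make the secondary-types idea concrete, here is a small sketch using the Apache Chemistry OpenCMIS client (a version with CMIS 1.1 support) against a repository that implements 1.1. The 'acme:invoiceData' secondary type and its property are made up for illustration - they would have to be defined in the repository first (which, per 2.1.10, can itself now be done over the protocol).<br />
<pre>
import java.io.ByteArrayInputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.chemistry.opencmis.client.api.Document;
import org.apache.chemistry.opencmis.client.api.Folder;
import org.apache.chemistry.opencmis.client.api.Session;
import org.apache.chemistry.opencmis.commons.PropertyIds;
import org.apache.chemistry.opencmis.commons.data.ContentStream;
import org.apache.chemistry.opencmis.commons.enums.VersioningState;

public class SecondaryTypeExample {

    // Creates a document and attaches a (hypothetical) secondary type that
    // contributes extra, reusable properties - the CMIS 1.1 'mixin' idea.
    public static Document createInvoice(Session session, Folder folder) {
        byte[] bytes = "fake invoice content".getBytes();
        ContentStream content = session.getObjectFactory().createContentStream(
                "invoice-0042.txt", bytes.length, "text/plain",
                new ByteArrayInputStream(bytes));

        Map<String, Object> props = new HashMap<String, Object>();
        props.put(PropertyIds.OBJECT_TYPE_ID, "cmis:document");
        props.put(PropertyIds.NAME, "invoice-0042.txt");
        // 'acme:invoiceData' is a made-up secondary type assumed to exist in the repository.
        props.put(PropertyIds.SECONDARY_OBJECT_TYPE_IDS, Arrays.asList("acme:invoiceData"));
        props.put("acme:invoiceNumber", "0042"); // property supplied by that secondary type

        return folder.createDocument(props, content, VersioningState.MAJOR);
    }
}
</pre>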
<br />
Do not get me wrong, the innovators in this space (Alfresco for example) provided vendor-specific value-adds before the industry caught up, but some people, like myself, were resistant to those vendor-specific value-adds until they could interoperate with other solutions. To say you picked a solution *only* because of a value-add feels like lock-in. To say you picked a solution above all others offering the same features (the CMIS 1.1 protocol) says a LOT more :-)<br />
<br />
CMIS standard: http://docs.oasis-open.org/cmis/CMIS/v1.1/CMIS-v1.1.html<br />
<br />
<b>Laptop build</b> (2012-07-02)<br />
Although I was looking for a pre-installed Linux laptop, I found a too-good deal on a ThinkPad X230T with sufficient capabilities for my Xen server needs in a compact 12" form factor.<br />
<br />
<b> Step 1: Shrink the volume on the default Windows 7 install.
</b>
Windows provides better support nowadays for volume resizing. If you go to the control panel and search for 'partition' as a keyword, you will see the disk management tools. This gives you the ability to shrink the volume....sort of. It isn't an exact tool, and you will need to shrink, reboot, defrag, reboot, then shrink some more...repeating....until you get to the target size desired. I was aiming for 120GB, and it took 4 tries to get there.<br />
<br />
<b>Step 2: Backup the Windows image</b>
Although there is the familiar Ghost imaging software if you have the money, I wanted to look at alternatives. Lenovo provides its own backup/restore software that looks like it would work well. I got a backup USB harddrive (not USB flash), and the Lenovo ThinkVantage backup/restore did the MBR and backup images flawlessly. However, being an individual who wanted to avoid lock-in and to move towards automated/repeatable provisioning, I kept looking.<br />
<br />
Cobbler is a tool I keep falling back to for image-based provisioning (versus kickstart-based installs), and it has support for provisioning images from Clonezilla (<a href="http://clonezilla.org/">http://clonezilla.org</a>). Clonezilla has support for Windows imaging; sharing my findings:<br />
<br />
1) Reformat your external device (usb harddrive in my case) to have a smaller partition, such as 250MB, with a FAT32 filesystem as the first partition on that device. This is important to avoid a lot of trial/error - other versions of FAT will not work, and too-large volume causes problems. Don't worry, you still want a second partition that is much larger (at least 32GB) to store the actual images.<br />
<br />
2) Use tuxboot.exe. Clonezilla highly recommends it, and they are right to do so. Once you get the partition straightened out, everything else is cakewalk. And, yes, you can use it directly from Windows without requiring to have a linux install.<br />
<br />
3) plug your device into the Windows machine you want to image. If it is only for one machine, good-to-go. If you are trying to create a 'gold image' for distributing to multiple machines at once, look into 'sysprep' and other tools to prepare the windows install.<br />
<br />
4) reboot your machine, and use your bios to choose the alternate start location. If you do not see your device, some of the USB3.0 ports/devices are not recognizable as bootable locations, so plug into a usb2.0 port to be sure.<br />
<br />
5) the provided directions with clonezilla were excellent! If you want to review beforehand, you can check their site. Create the image, store it in the large partition, takes about 40min (minimal windows install) to create image then do a double check.<br />
<br />
That is all for tonight, more updates later.....<br />
<br />
<b>Pre-installed linux laptops</b> (2012-05-27)<br />
Looking around for pre-installed Linux laptops. Although one can install Linux themselves, there are some time savings in not having to deal with laptop components and Linux driver support.<br />
My particular need is for a Xen/virtual-style environment hosting many 'servers' for development/research. With that in mind, here is what I have been looking for:<br />
<ul>
<li>11"-14" primarily, 15" if they have a longer-life 9-cell battery, but I've only seen 9-cell on 17" thus far (and I do not want that big a laptop).</li>
<li>i7 CPU (or similar AMD, I just have not seen many in laptops nowadays).</li>
<li>16GB RAM; if there is a 32GB RAM option via 4 SODIMM slots, great!</li>
<li>750GB/7200 RPM harddrive. No SSD. No hybrid. No 5400 RPM. If larger, great.</li>
<li>LAN port + wireless N built in.</li>
<li>VGA/HDMI/similar video output for demos/etc.</li>
<li>And, based on current pricing, ~$1000.</li>
<li>And, preferably a Fedora or CentOS Dom0/host OS that is a full OS (X windows, Eclipse IDE, etc.), with Xen VM support for guest OSes. Ubuntu/others if that is the only option, but I would prefer Fedora/CentOS.</li>
</ul>
So far, I have found only a handful of companies that seem reasonably able to handle these kinds of requirements:<br />
<br />
http://zareason.com/ - has Fedora support, and laptops within the above configuration range.<br />
https://www.system76.com - the best-'known' Linux laptop vendor, but Ubuntu only.<br />
Still reviewing my options!<br />
<br />
<b>nosql/mongodb and experienced developers</b> (2012-02-27)<br />
Below is a great comedy with good technical and farmer references, since I grew up on a farm.<br /><br />I'm not biased against any nosql DBs, but nosql also isn't the silver bullet for everything, and jumping straight to it would be futile without good experience with what you are doing....pretty much exactly what the other person in the video is talking about :-)<br /><br />Thanks to my close friend at <a href="http://www.rentageekme.com">http://www.rentageekme.com</a> for sharing!<br /><br /><a href="http://www.xtranormal.com/watch/6995033/mongo-db-is-web-scale">http://www.xtranormal.com/watch/6995033/mongo-db-is-web-scale</a><br />
<br />
<b>Document Capture and Tables/Tabular/Invoices (ocr)</b> (2011-12-08)<br />
One of the roles I fulfill is working heavily in Data, or Document, Capture. <br />
<br />
This covers a wide range:<br />
<br />
Document Capture (or Document Content Management/Records Management as the modern term) - Index a couple of fields to be able to search/retrieve the image/document later. The second part is where you store, search, retrieve after the indexes have been captured, but that's for another time and not the focus here.<br />
<br />
Data Capture - Collect information from paperwork for use by systems. The original image/document is not relevant after capture except as a reference. Usually unstructured documents or low volume documents.<br />
<br />
Forms Processing - Collect information from paperwork in a fast, repeatable process. The original image/document is not relevant after capture except as a reference. Forms processing is an advanced form of Data Capture where if you have consistent forms (structured documents) where the data elements are always in the same location on the form and there is (practically) no variance in the forms/data locations.<br />
<br />
Back to the topic at hand - Tabular Capture, or being able to OCR and Key information that is in table format from images that may have come from output systems, scanning, faxing, or other means and trying to turn it BACK into data.<br />
<br />
<span style="font-style: italic;"><span style="font-weight: bold;">How do we obtain information from tables on paper?</span></span><br />
<br />
Forms Processing - one answer, zones. Form Processing is designed to collect information from data points on the image/document where the data element is always in the same position. If the first column/first row of a table is always 5" from the top, 1.35" from the left side, has a width of 2" and a height of 1", you <span style="font-weight: bold;"><span style="font-style: italic;">zone</span></span> that area. By zoning, OCR knows where to go exactly for the information, and can be tuned in how it reads the elements (I only expect numeric values here, so there will be no lowercase-L or Oh's or upper case I's or Z's). Also, by zoning, manual entry becomes easy as well as they can look directly at the location. And then exporting, hey, you already know the context of the data element because it was in a specific location, so you already know it is row 1/column 1 to put it in the right location for your export. <br />
<br />
Phew.....lots of good stuff with Zones, or sometimes called 'Zonal OCR'. And you don't even need OCR to use zones. Downside? Lot of time in setup and tuning. Lots of time. And you need the right tools in your capture suite to support it. And again, it doesn't even have to use OCR, just setting up zones for manual capture and your export is a gain.<br />
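To make the 'zone' idea concrete, here is a tiny standalone sketch (not from any capture product) that crops the example zone above out of a scanned page with plain Java - the kind of crop you would hand to an OCR engine or a keying screen. The 300 DPI value and file names are assumptions; multipage TIFF input would additionally need an ImageIO TIFF reader plugin.<br />
<pre>
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class ZoneCrop {
    // A "zone" defined in inches on the scanned page, as in the example above:
    // 1.35" from the left, 5" from the top, 2" wide, 1" tall.
    static final double LEFT_IN = 1.35, TOP_IN = 5.0, WIDTH_IN = 2.0, HEIGHT_IN = 1.0;

    public static void main(String[] args) throws Exception {
        int dpi = 300; // assumed scan resolution; real code should read it from the image metadata
        BufferedImage page = ImageIO.read(new File("invoice-page1.png")); // hypothetical input file

        // Convert the inch-based zone into pixel coordinates and crop it out.
        BufferedImage zone = page.getSubimage(
                (int) Math.round(LEFT_IN * dpi),
                (int) Math.round(TOP_IN * dpi),
                (int) Math.round(WIDTH_IN * dpi),
                (int) Math.round(HEIGHT_IN * dpi));

        // Hand just this crop to the OCR engine (or show it to a keying operator),
        // tuned for the expected content, e.g. numeric-only.
        ImageIO.write(zone, "png", new File("zone-row1-col1.png"));
    }
}
</pre>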
<br />
So...what happens when the paperwork has tables but the paperwork is sporadic, non-consistent, unstructured, and may have a high rate of change you not only have no control over, but no upfront notification of the changes? Examples you ask -- Invoices are the biggest culprit, but there are many others out there.<br />
<br />
Answer? Well....this is where some companies have innovative approaches to the problem, but from my point of view nothing has been great yet. The column locations are likely different between tables (i.e. the first column on one invoice is the product ID, on another it is the description, and on yet another invoice it is the quantity). Some approaches using regular expressions (regex for short) to detect the context of the data have been tried, but a unit price, calculated price, discount price, and total price all look the same, and again could be shuffled around column-wise depending on the invoice. Others make some basic attempts at image analysis to do table detection and try to OCR the headers for the context of the columns (but this runs into the problem that invoices have different column header names for the same semantic meaning, while on others the headers have inverse coloring - white text on a black background)...of all of these, this is probably the best automation approach, but it is very immature at the moment. <br />
<br />
All good attempts to automate the unstructured tabular capture problem, and maybe in controlled scenarios they work great. But in the real world, let's face it - a human being will need to help figure out how the table is structured and the context of the data elements so they can be captured appropriately (whether by OCR or manually again doesn't matter), but done in such a way as to be efficient and productive.<br />
<br />
Posting here in case anyone has found anything; if not, and you stumbled on this blog hoping to solve this specific problem -- at least you are not alone!<br />
<br />
<b>JavaEE 6 app servers compared</b> (2011-12-04)<br />
Thank you Antonio!<br />
<br />
Baseline/platform sizing of different JavaEE containers (disk, ram, startup). <br />
<br />
http://agoncal.wordpress.com/2011/10/20/o-java-ee-6-application-servers-where-art-thou/<br />
<br />
The more complex metrics of scalability (CPU/memory increase as you add more load), performance (first call as well as high concurrency), and cluster/HA require holding the OS/hardware/VM/JVM constant, which takes quite a bit more setup and time. At least the above are relatively constant.<br />
<br />
<b>hypervisor (vm) and jvm (java) and SLA and costs</b> (2011-06-03)<br />
I've been testing several approaches to optimize the platform that the applications run on. This blog post is just a brain dump without any clear direction other than current thoughts.<br />
<br />
Most of the applications I work on would fall under the equivalent of the JavaEE 6 web profile (jpa/web or jpa/ejb/web), with a couple that have messaging which, in reality, could be modified to work with other async-style approaches (while messaging also supports distributed work efforts, most of the applications aren't reaching a critical mass where they need to distribute that work).<br />
<br />
So, what are we talking about platform wise?<br />
<br />
*jboss or tomcat (or, more appropriately, the new TomEE as an option)<br />
<br />
*jvm<br />
<br />
*OS to run it on (preferably with iSCSI and similar large-disk-space mounting support).<br />
<br />
*hypervisor to run multiple guest OS/vm/appcontainers.<br />
<br />
Some of the general goals are to reduce diskspace/memory and maximize the number of applications that can run on a piece of hardware, while still protecting or segregating applications from each other, so that in our haste to get to market an application will only hurt itself and not any others. Failover/disaster recovery is also a consideration, with a minor emphasis on time-to-increase-capacity and associated downtime, but that is not as critical.<br />
<br />
<span style="font-weight: bold;">App Container</span><br />
<br />
Jboss has been doing some wonderful things with the new jboss7 AS stack. I haven't finished my memory review, but I hope they got the 'memory bloat' under control. Jboss 4.0.x series with one application can run in under 128MB in most cases, while Jboss 5.x and 6.x series for the SAME app need to double-to-triple to 256MB/364MB.<br />
<br />
-jboss deployment bonus: The ability to deploy an application's 'configuration' beside it as a SAR in the same deployment directory as the application WITHOUT needing to modify the server itself is HUGE. I do not understand why people do not take more advantage of the SAR benefits. You create your application binary once, then vet/test with one SAR configuration, take the SAME binary to your staging/pre-deploy/uat/stress-testing/etc environments with different SAR configurations, then again move the SAME binary to production with a different SAR configuration. What you tested is what went live.<br />
<br />
-And, once you setup the SAR configuration for the environment...leave it there and update the application binary with changes (assuming no additional configurations). The least variables to mess around with the better!<br />
<br />
TomEE is a new player and haven't reviewed it yet. <br />
<br />
JOnAS unfortunately has never given me a reason to pique my interest.<br />
<br />
Geronimo & Glassfish are additional options, but also do not provide any significant reason to change from Jboss (which I have the most experience/skill in).<br />
<br />
Tomcat/Jetty are decent web-only platforms, but would not be considered as part of the strategy because of their inability to support the full necessary stacks.<br />
<br />
Conclusion: JBoss is still the winner, but if memory is a constraint be wary of JBoss 5/6 versus the older JBoss 4.0.x series. The new JBoss 7 AS is a significant rewrite and will hopefully address this, as well as additional scenarios.<br />
<br />
<br />
<br />
<span style="font-weight: bold;">jvm/os</span><br />
<br />
This is where it gets interesting....<br />
<br />
*jboss again comes out with the Boxgrinder project so that you can have predictable/repeatable platforms. This is kind of an outsider as it doesn't directly relate to any of the above areas, but is a way towards combining and using them in a cool (or more predictable...less variables) fashion.<br />
<br />
*Azul has their new Zing JVM/OS combo-solution that will run on hypervisors (and is optimized for them). But it comes at a price of $5k-$6k per 'server', and I haven't touched/tested/discussed whether a 'server' represents a single JVM that can run multiple app containers or not.<br />
<br />
*Oracle has a not-very-discussed JVM/OS combo-solution that will also run on hypervisors called Maxine Virtual Edition: http://labs.oracle.com/projects/guestvm/ <br />
-GPL licensed/forever open sourced.<br />
-takes cues from OpenJDK, so it will continue to keep updated with recent JDK updates.<br />
-not 'production' ready...if this can get some more steam, this is definitely a good place to go.<br />
<br />
Away from the cool stuff, and back to reality --<br />
<br />
Just Enough Operating System (JEOS) continues to be a buzzword, but with no real meat or applied solutions. The BoxGrinder project above does try to help with some pre-defined approaches to a JEOS for the different Linux OS distributions. CentOS is still a popular choice for low-cost options, and the guys there are trying their best to get CentOS 6 out the door even while RHEL 6.1 gets released -- if you want the faster turnaround, pay for it and get the benefit of testing and security announcements; otherwise, free CentOS is free, but help them out.<br />
<br />
<br />
<span style="font-weight: bold;">hypervisor (virtualization)</span><br />
<br />
Hypervisor battle is pretty hot right now, with no real clear winner yet. <br />
<br />
With Xen and KVM as the current front-runners on the open-source server hypervisor segment (with others close behind), it's not really black and white which one to pick although Xen has a little bit of an edge with Citrix backing and Paravirtualization support.<br />
<br />
VMWare, Hyper-V (which announced CentOS support?!), and other commercial offerings also have some competitive advantages over the open source alternatives (for a price).<br />
<br />
<b>Alfresco as an Image Archive Server (TIFF/fax/scan images)</b> (2011-05-04)<br />
Currently evaluating Alfresco CE 3.4.d for use as an Image Archive/Records Content Management server. The requirement is to store multi-page TIFF images that have 2-6 custom attributes which must be searchable to retrieve the associated images.<br />
<br />
The most common use case that doesn't involve company-specific attributes is storing incoming fax images, where you want to store attributes such as the number dialed to come in (for an enterprise with DID or a similar fax setup), the date it came in, and the number it came from (if available). Instead of the number dialed in, you could substitute 'Department'.<br />
<br />
Anyway, this post isn't about the custom attributes piece, this is for the image piece.<br />
<br />
<span style="font-weight: bold;">Req 1, allow to store and view multipage TIFF images (preferably without requiring a TIFF plugin that will likely change on Office upgrades).</span><br />
Alfresco by default does not handle multipage TIFF. In fact, in 3.4.d the supplied ImageMagick doesn't even support TIFF (see /alfresco/common/bin, 'convert -list configure', DELEGATES line - you should see tiff there and it isn't). 3.4.e DOES support TIFF, but only for Windows and 64-bit Linux, and only the *first* page of the TIFF.<br />
<br />
Luckily, this wonderful community member of the open source product Alfresco already had a solution: <a href="http://fabiostrozzi.eu/2010/10/27/improving-tiff-preview-in-alfresco-share/">http://fabiostrozzi.eu/2010/10/27/improving-tiff-preview-in-alfresco-share/</a><br />
<br />
With additional modifications to remove ImageMagick, OpenOffice, and other ancillary services that were not needed for something intended solely to be a TIFF-based image server, this is a rather slim solution that, with the default 'Share' interface, works well. I do have 3.4.d working with this approach, and will be doing a more enterprise-oriented Tomcat deploy as opposed to the installer approach; I feel quite confident in how the Alfresco team architected the product to support each company's unique needs.<br />
<br />
Current problem: The FLASH previewer is good, but the challenge with multi-page TIFF is that the tiff2pdf conversion isn't that bad....it's the pdf2swf that is taking 1/4 to 1/2 a second per page.<br />
<br />
<br />
<hr />
<br />
Research notes for TIFF 2 PDF conversion those interested:<br />
<br />
<br />
<br />
ImageMagick 6.5.4 seems to work, but has huge/escalating memory requirements as TIFF's grow for tiff2pdf:<br />
<br />
Memory requirements of 600MB-3GB of system ram (non jvm heap) per image conversion (but fast, 1-4 seconds).<br />
The 3GB case is related to a 7MB test file that seems to have some bad TIFF encoding; however,<br />
the 3GB figure is only because it moved into swap space - it may actually be more.<br />
<br />
instead, use a newer version:<br />
<br />
wget ftp://ftp.imagemagick.org/pub/ImageMagick/linux/SRPMS/ImageMagick-6.6.9-7.src.rpm<br />
<br />
<br />
sudo yum groupinstall "Development Tools"<br />
sudo yum install rpmdevtool libtool-ltdl-devel<br />
<br />
sudo yum install djvulibre-devel tcl-devel freetype-devel ghostscript-devel libwmf-devel jasper-devel lcms-devel bzip2-devel librsvg2 librsvg2-devel liblpr-1 liblqr-1-devel libtool-ltdl-devel autotrace-devel<br />
<br />
rpmbuild --nodeps --rebuild ImageMagick-6.6.9-7.src.rpm<br />
<br />
cd /home/dhartford/build/RPMS/i686<br />
sudo rpm -ihv --force --nodeps ImageMagick-6.6.9-7.i686.rpm<br />
<br />
In the end, same memory requirements (600MB-3GB).<br />
<br />
<br />
Alternatives reviewed:<br />
A separate medium has been suggested, such as TIFF to GIF, then GIF to PDF:<br />
<value>${img.exe} ${source} gif:- | convert gif:- ${target}</value><br />
slightly better, but the edge case of 3GB ram still occurs. Also increases diskspace with additional medium.<br />
<br />
Switches to work around potential problem areas do not seem to matter:<br />
<value>${img.exe} -monochrome -compress Fax ${source} ${target}</value><br />
No difference.<br />
<br />
TIFF to PNG, may get more performance from GraphicsMagick:<br />
http://superuser.com/questions/233441/use-imagemagick-to-convert-tiff-to-pngs-how-to-improve-the-speed<br />
--not tested<br />
<br />
libtiff has a direct **tiff2pdf** that simply 'wraps' the image with PDF headers without<br />
doing dpi/sizing/re-rendering like the ImageMagick/GraphicsMagic approach (which,<br />
under the covers, uses libtiff to read the tiff then sends the resulting image<br />
through image processing for dpi/resolution modifications and then sends it <br />
to Ghostscript to generate the resulting PDF). Note that imagemagick and<br />
graphicsmagick under the covers also uses libtiff anyway for TIFF decoding.<br />
<br />
BEST OPTION from testing, tiff2pdf modification testing seems to be around:<br />
Memory requirements of 10MB-80MB of system ram (non jvm heap) per image conversion, ~1 second fast.<br />
--some issues around if bad TIF encoding sending to stdout/stderror, creates an exit status preventing completion in Alfresco transformer.<br />
Asking mailing list if there is a quiet/silent mode so tries best-attempt at conversion without<br />
causing the exit status.<br />
There is no 3GB ram issue (instead 80MB over ~10 sec for the 7MB tiff/99 pages). <br />
*NOTE: The 7MB example came back as 99 pages in SWF previewer. Using separate system TIFF and PDF viewers, also 99 pages, so consistent.<br />
<br />
<br />
<hr />
<br />
Research notes on the PDF viewer(s) when used with TIFF 2 pdf conversion:<br />
http://wiki.alfresco.com/wiki/Installing_Alfresco_components#Linux_and_Unix_Installation<br />
<br />
version 0.8.1 does not paginate tiff2pdf conversions, causing repeating cycle in the flash previewer.<br />
<br />
<br />
<br />
NOTE: alternate viewer: http://swfviewer.blogspot.com/<br />
<br />
<br />
<br />
REVIEWED: http://packages.sw.be/swftools/, only has rpms up to 0.8.1, and there have been several releases since then.<br />
<br />
TODO: 64-bit centos binary: http://wiki.alfresco.com/w/images/1/1d/Swftools-centos54-x86_64.tar.gz<br />
<br />
<br />
mkdir /opt/swftools<br />
cd /opt/swftools<br />
wget http://www.swftools.org/swftools-0.9.1.tar.gz<br />
<br />
tar -xzvf swftools-0.9.1.tar.gz<br />
<br />
yum install zlib-devel libjpeg-devel giflib-devel freetype-devel gcc gcc-c++ make<br />
<br />
cd swftools-0.9.1<br />
./configure --disable-lame --prefix=/opt/swftools/swftools-0.9.1-bin/<br />
make<br />
make install <br />
<br />
<br />
Diskspace footprint for /opt/swftools including source code, configure, make, and binary:<br />
46MB<br />
<br />
<br />
<br />
<b>Javamelody performance & usage statistics</b> (2011-02-15)<br />
One of the hidden gems in the open source world is a project called Javamelody.<br />
<br />
I've been using this since late 2009 to help refactor/modify design and code based on usage-based findings. It is not a profiler, not a click-n-fix, not a quickly-fix-your-problems tool. It is a tool to get you the information, over time, that you need to make Strategic decisions about design/code.<br />
<br />
http://code.google.com/p/javamelody/<br />
<br />
It gets all tiers of statistics within a single application -> the application's UI calls, business (ejb/facade/spring) calls, and sql calls.<br />
<br />
Recently I finally submitted a patch for GWT-RPC detailed statistics I've been using for a while to help, again from a strategic point of view, refine some products.<br />
<br />
Enjoy!<br />
<br />
<b>Web UI upgradability</b> (2011-01-24)<br />
One of the areas that has been an issue over time is taking an application, say deployed to JBoss 3.0 or 3.2.3, and trying to upgrade it to JBoss 4.0.5. Or Tomcat 4 to Tomcat 5. Or any upgrade at all.<br />
<br />
Real-world experience with Struts (1.0/1.1)/JSP sites shows several challenges in upgrading. Whether the issues are container-based or implementation-based, I have never had success with 'easy' upgrades.<br />
<br />
JSF appears to have similar issues. I myself kept running into performance issues every time I've attempted a JSF implementation, so I am relying on this post to confirm similar issues: http://jsfunit.blogspot.com/2010/12/jsf-on-jboss-as6-final.html<br />
<br />
Now, onto one known savior - GWT is upgrade compatible. I have successfully upgraded 1.3 to 1.5, 1.5 to 2.0, 2.0 to 2.1, and 1.3 to 2.0 (I haven't tried going direct to 2.1). The only upgrade issues were from 1.3 to higher versions, dealing with RPC changes.<br />
<br />
<b>Testing GWT-RPC, and why to be careful about jumping to RequestFactory</b> (2011-01-17)<br />
Just a copy of what I posted on StackOverflow: http://stackoverflow.com/questions/4119867/when-should-i-use-requestfactory-vs-gwt-rpc/4714437#4714437<br />
<br />
<br />
The only caveat I would put in is that RequestFactory uses the binary data transport (deRPC maybe?) and not the normal GWT-RPC.<br />
<br />
This only matters if you are doing heavy testing with SyncProxy, JMeter, Fiddler, or any similar tool that can read/evaluate the contents of the HTTP request/response (as with GWT-RPC); that would be more challenging with deRPC or RequestFactory.<br />
<br />
<b>openEJB unit testing for jboss deploys</b> (2010-08-23)<br />
Some notes using a mavenized /src/main/resources/META-INF/openejb-jar.xml:<br />
<br />
<openejb-jar><br />
<!-- make backward compatible with jboss style deployments. For EAR deploys prefix the .format = EARname/{deploymentId} --><br />
<properties><br />
openejb.deploymentId.format = {ejbName}<br />
openejb.jndiname.format = {deploymentId}/{interfaceType.annotationNameLC}<br />
</properties><br />
</openejb-jar><br />
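And a quick sketch of why those formats matter - an embedded OpenEJB unit test looking the bean up under the same jboss-style JNDI name the application already uses. 'WidgetBean' is a hypothetical session bean with a @Local business interface, and you need openejb-core on the test classpath.<br />
<pre>
import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;

import org.junit.Test;
import static org.junit.Assert.assertNotNull;

public class WidgetBeanJndiTest {

    @Test
    public void lookupUsesJbossStyleName() throws Exception {
        Properties props = new Properties();
        // Boot the embedded OpenEJB container for the test.
        props.put(Context.INITIAL_CONTEXT_FACTORY,
                  "org.apache.openejb.client.LocalInitialContextFactory");
        Context ctx = new InitialContext(props);

        // With the openejb-jar.xml formats above, a session bean named WidgetBean
        // with a @Local business interface is bound as "WidgetBean/local",
        // matching the jboss-style name the application already expects.
        Object service = ctx.lookup("WidgetBean/local");
        assertNotNull(service);
    }
}
</pre>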
<br />
<b>Eclipse JPA tooling, Hibernate (jboss) tooling</b> (2010-08-09)<br />
Working on ways to improve the tooling/work environment when in a JPA project.<br />
<br />
In the past, I pretty much hand-coded everything and relied on maven/unit tests to catch errors.<br />
<br />
Quicknote experiences:<br />
<br />
* To get the JPA tooling working, you need to map the JDBC driver manually/directly to the filesystem jar location through the Eclipse Data Management features.<br />
<br />
* More on JPA tooling, particularly with maven layout, here: <a href="http://www.eclipse.org/forums/index.php?t=msg&goto=508143">http://www.eclipse.org/forums/index.php?t=msg&goto=508143</a><br />
<br />
* To get the Hibernate tooling working, you need to add the JDBC driver to the classpath, EVEN IF you are using the 'Database Connection: JPA project configured' option (i.e. the direct jar filesystem mapping above does not carry over to the Hibernate tooling).<br />
<br />
* In the persistence.xml, to avoid dealing with a lot of issues, remove the JTA requirements. This works for me as the Entity classes/domain are in a project separate from the Session Beans (the entity managers), so the Entity class/domain project has a non-JTA persistence.xml, while the Session Bean (entity manager) project has a JTA persistence.xml. I hate inconsistencies, but this is the only way it seems to work.<br />
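For what it's worth, this is the kind of plain-JSE smoke test the non-JTA persistence.xml makes possible for the entity project. The 'entities-test' unit name and the Widget entity are placeholders, and the unit would point at a local test database:<br />
<pre>
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class EntityModelSmokeTest {
    public static void main(String[] args) {
        // "entities-test" is a hypothetical RESOURCE_LOCAL (non-JTA) persistence
        // unit defined in the entity project's own persistence.xml.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("entities-test");
        EntityManager em = emf.createEntityManager();
        try {
            // A trivial JPQL query exercises the mappings (column names,
            // case sensitivity, etc.) outside the container - the same checks
            // the JPA/Hibernate tooling performs inside Eclipse.
            em.createQuery("select count(w) from Widget w").getSingleResult();
        } finally {
            em.close();
            emf.close();
        }
    }
}
</pre>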
<br />
<br />
Gains:<br />
<br />
* The JPA tooling immediately checked the model against the database structure and identified a couple of case-sensitivity issues between field names and column names that were easy to fix.<br />
* In the Hibernate tooling, you can test-run JPA-QL queries to see if they work as expected, check timing, and review results. You can also look at the Dynamic SQL Preview to see the actual SQL used, for future index optimizations.<br />
<br />
<b>(CI) Building Eclipse PDE plugins from Maven</b> (2010-07-26)<br />
...is a pain in the arse.<br />
<div>
<br /></div>
<div>
After evaluating the maven-pde-plugin, which one would think would make this easy, it turns out not so much.</div>
<div>
<br /></div>
<div>
I've swapped over to using Tycho (because it appears to better support multiple build options, like update sites and RCP apps directly, instead of just plugins and features), but even that isn't proving trivial in the most basic sense.</div>
<div>
<br /></div>
<div>
But, using Tycho 0.9.0 from the ibiblio org.sonatype.tycho groupid (not to be confused with org.codehaus.tycho...or several other groupId's I've run into) you still have issues:</div>
<div>
<br /></div>
<div>
Errors like "Cannot find lifecycle mapping for packaging: 'eclipse-plugin'" come up a lot. On the off-chance this is a dependency issue: you are required to use an unstable release version of Maven 3 (as of 7/26/2010 at any rate), and using Maven 3.0-beta-1 you now get "Unknown packaging: eclipse-plugin"...so not much help there either.</div>
<div>
<br /></div>
<div>
Searching for help on either of these issues gets posts like 'fixed in Tycho 0.5.0', or 'you need to modify how you build from source'...which, if you get the binary from a public maven repository, one would hope wouldn't be necessary (that is why most people want to use maven - so you DON'T run into these issues).</div>
<div>
<br /></div>
<div>
Other people mention 'update m2eclipse'...except I'm running this from the command line for the purpose of eventually moving to Hudson/Continuous Integration. Maybe I misunderstand the purpose of this maven plugin and it must be used in Eclipse with m2eclipse?</div>
<div>
<br /></div>
<div>
Please help if you read this!</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
EDIT: the reasons for chasing down how to automate Eclipse PDE builds are:</div>
<div>
1) I have an RCP app I would like to migrate over (from Eclipse 3.0 unfortunately)</div>
<div>
2) the primary reason was to pre-load company JDBC drivers for use in Eclipse (<a href="http://www.eclipse.org/forums/index.php?t=msg&goto=549384">http://www.eclipse.org/forums/index.php?t=msg&goto=549384</a>)</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
ANSWER: do not assume the 'convention':</div>
<div>
<br /></div>
<br />
<div>
<br />
<code><br />WRONG plugin artifactId: maven-tycho-plugin<br /><br />CORRECT plugin artifactId: tycho-maven-plugin</code></div>
<div>
<br /></div>
<b>Web Browser plugins, how I loathe thee, let me count the ways.....</b> (2010-07-21)<br />
I have had a passionate dislike for web browser plugins. Yes, they add new exciting features...that you may or may not be able to control, or that may not have predictable behavior across the world-wide-web.<br />
<div>
<br /></div>
<div>
Take for example two very common plugins that I usually have to deal with for reporting, document management/archive, etc.</div>
<div>
<br /></div>
<div>
PDF plugins (Adobe)</div>
<div>
TIFF plugins (variety)</div>
<div>
<br /></div>
<div>
Adobe PDF plugins - </div>
<div>
<ul>
<li>Versions/upgrades come regularly; users have to regularly update 'the site', even though it's not the site - it is the plugin asking for upgrades.</li>
<li>To embed, not embed, dealing with pop-ups allowed.</li>
<li>And...here is a good one....the web-embed adobe plugin making *multiple* http requests for the same content, and if your logging didn't account for that - multiple logs (see http 206, byte serving/byte range requests).</li>
</ul>
<div>
TIFF plugins -</div>
</div>
<div>
<ul>
<li>Variety of plugins with different options/features/control (and even something 'simple' like if the plugin allows multiple page viewing....apparently not standard?!)</li>
<li>You, your client/customer, or someone, has Outlook/Office installed and it has an update, a critical update, a security update, whatever -- and reverts to using the MS Tiff viewer by default despite your best effort to use a different TIFF plugin.</li>
<li>TIFF encoding/compression formats (i.e. g3/fax compression that has an X-Y ratio difference, that some plugins understand and show 'correctly', and others that show without the appropriate ratio and have 'crunched' images).</li>
<li>And the occasional TIFF that has a byte that isn't understood by Plugin XYZ, or other plugin, but yes on this plugin....search for it, they happen.</li>
</ul>
<div>
Then you add in other plugins like flash/shockwave, java applets, activex/silverlight, codec/encoding video players (whether to use quicktime, realplayer, windows media player, divx, ......), and developers just can not wait until HTML 5 becomes a real-world/real-usage deal.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
</div>
<b>BigDecimal v Float/float or Double/double for java transport</b> (2010-04-23)<br />
As I have posted previously, quite often I get involved in some type of financial portion of a solution, or the entirety of the solution is financial.<br />
<div>
<br /></div>
<div>
In java, BigDecimal is where you go for computational accuracy -- but what about if you just need to transport the data?</div>
<div>
<br /></div>
<div>
So, I reviewed information in the Sun/Oracle JDK site, and if you go search and read it, it isn't overly definitive (from a 'do I want to or not use') on float/doubles.</div>
<div>
<br /></div>
<div>
After going through many other posts, mailing list searches, and reviews, I broke down and posted a question here:</div>
<div>
<br /></div>
<div>
<a href="http://forums.sun.com/thread.jspa?messageID=10977271&#10977271">http://forums.sun.com/thread.jspa?messageID=10977271&#10977271</a></div>
<div>
<br /></div>
<div>
I also started doing some manual tests myself, and finally got the 'answer' I was looking for:</div>
<div>
<br /></div>
<div>
float: 9 'locations'</div>
<div>
double: 15 'locations'</div>
<div>
<br /></div>
<div>
What are locations? In my testing, I found that float can accurately store and retrieve 6 digits before the decimal and 3 after....or 3 before/6 after, or any variation of that theme. Similar for double - 9 before/6 after, and other variations.</div>
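<div>
<br /></div>
<div>
To make that concrete, here is a minimal sketch of the kind of round-trip check I mean (the class name and sample value are illustrative, not the original test code):</div>
<pre><code>import java.math.BigDecimal;

// Store one decimal value in each type and see what comes back out.
public class PrecisionCheck {
    public static void main(String[] args) {
        String value = "1234567.891";   // 7 digits before the decimal, 3 after

        float asFloat = Float.parseFloat(value);
        double asDouble = Double.parseDouble(value);
        BigDecimal asBigDecimal = new BigDecimal(value);

        System.out.println("original:   " + value);
        System.out.println("float:      " + asFloat);       // typically loses trailing digits at this size
        System.out.println("double:     " + asDouble);      // survives here, fails once you add more digits
        System.out.println("BigDecimal: " + asBigDecimal);  // always exactly what was stored
    }
}</code></pre>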
<div>
<br /></div>
<div>
Needless to say, that's why the documentation is vague: how much you can accurately store before the decimal depends on the scale you are storing after it.</div>
<div>
<br /></div>
<div>
So, unless you can get a definitive max value and precision rule for a financial application, you might want to stick with the heavyweight of BigDecimal just to be sure.</div>
<div>
<br /></div>
<div>
----</div>
<div>
<br /></div>
<div>
Edit: I forgot to post *why* I was even looking at this!!</div>
<div>
<br /></div>
<div>
We were having some memory issues with an outsourced application (that lacked pagination) that had a DTO with 12 monetary value fields...12 BigDecimals per DTO. The List sizes ranged from 300->2000->40k. The 40k (most extreme) was taking up 45MB of memory! Changing BigDecimal to the float primitive for the 12 fields dropped the same List size down to 15MB (1/3!!!!!). </div>
<div>
<br /></div>
<div>
However, the accuracy needed for this application was not satisfied by float, so although I'm evaluating Double I may opt to play it safe and keep accuracy as more important than saving memory (and, instead, actually paginate the results!).</div>dhartfordhttp://www.blogger.com/profile/17083942553852687561noreply@blogger.com0tag:blogger.com,1999:blog-31129136.post-63817357790330374042010-01-14T07:43:00.000-08:002010-01-14T07:52:19.478-08:00Embedded DB - Sort Stability, PaginationWe use <span style="font-weight: bold;">application-level pagination</span>. I won't go into the reasons, but several of them are business reasons.<br /><br />What is application-level pagination? Someone wants to view 50000 records through a web screen (just stay with me...business reasons). <br /><ul><li>Make the query, default/starting sorting order.</li><li>Cache the current result set locally on the application layer (in our case, cache into a hypersonic, h2, derby database that writes to file as too much to fit in memory).</li><li>Return first <page> results back to the web screen (say 50 records per page).</li></ul>--person goes to 'next page', get next 50 records from local db result set.<br /><br />--person re-sorts the existing resultset, re-sort from local db result set (instead of re-querying the origin db), return first <page>.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Problems we ran into:</span><br /></span><br />Certain embedded databases, we found, did not work out well for this challenge. Hypersonic and H2 both didn't seem to handle (at least with default settings) the multi-user/asynchronous (web/ajax) request nature of the sorts and were causing the result sets to not be accurate when 'pushed too hard' (a user requests a sort, then changes their mind in the middle of a sort and changes the sort again).<br /><br />Derby, however, did seem to resolve this issue for us. Yes, there are different ways to handle pagination, however we needed to meet the business request for how the behaviour was expected to act.<br /><br />If someone has some similar experiences with application-level caching of large result sets, re-orders, pagination, please share!dhartfordhttp://www.blogger.com/profile/17083942553852687561noreply@blogger.com0tag:blogger.com,1999:blog-31129136.post-37783286152544878582009-08-06T06:20:00.001-07:002012-07-30T07:28:48.620-07:00least-invasive development improvement<span style="font-size: 130%;"><span style="font-weight: bold;">LIDI - Least Invasive Development Improvement</span></span> (team-oriented)<br />
<br />
Attempting to coin a term for what I have been trying to do for the last 8 years in maturing a very small development team that supports many projects.<br />
<br />
<span style="font-style: italic; font-weight: bold;">Definitions</span><br />
Small development team defined as under 10 people, including UI, Server, DB, and internal dev QA.<br />
<br />
Many projects defined as 10-25 active, supported solutions, with >50% of them being unique solutions, while the rest may be re-tooled/variations of existing solutions.<br />
<br />
Small to medium projects defined (from a LOC standpoint) as between 10k and 500k lines. Most are web-based, but some are thick client. Most are 3-tier/n-tier, some are 2-tier thickclient-DB type solutions.<br />
<br />
<span style="font-style: italic; font-weight: bold;">Key Words</span><br />
*logging<br />
*unit testing<br />
*performance testing and review<br />
*scalability testing and review<br />
*security testing and review<br />
*configuration management<br />
*runtime management<br />
*runtime dependency checking/management (i.e. notification of issues)<br />
****Business problem solved<br />
****Expectations met<br />
<br />
<span style="font-style: italic; font-weight: bold;">Prefix</span><br />
I put the last two, <span style="font-style: italic; font-weight: bold;">business problem solved</span> and <span style="font-style: italic; font-weight: bold;">expectations met</span> with many asterisks because, like many developers have experienced, doing all the performance/scalability/security testing in the world won't help you if you have to recode/redesign it again and then have to re-do all the performance/scalability/security testing and review again.<br />
<br />
<span style="font-style: italic; font-weight: bold;">Discussion</span><br />
When working with a small development team that is already fighting with project priority conflicts, short deadlines, short requirements, and constant support and change-requests, the last thing on your or their mind is ADDING more work.<br />
<br />
The above term, least-invasive, is on purpose -- there is no free lunch, there is no silver bullet. There will be compromises, but if you maintain a goal of making each improvement as least invasive as possible, and are able to show reasons and results that are tangible and that matter, you will mature and progress!<br />
<br />
<span style="font-style: italic; font-weight: bold;">Experiences</span><br />
<br />
<span style="font-size: 180%;"><span style="font-style: italic;">Step 1: Baseline</span></span><br />
<br />
I know your first thought - I don't have time to come up with baseline metrics, we are already going nuts! Guess what, I'm *not* talking about baseline metrics! I am talking about getting the development *process* repeatable and stable -- that is your base for everything you do.<br />
<br />
*Baseline: Convention<br />
<br />
Yes, I borrowed this term from the Maven team. Make sure all your projects follow a similar folder layout, for example all java code is in /src/main/java, all html/jsp is in /src/main/webapp, etc. Get the team to the point where someone can check out a project they never touched before and know where to go/what to do.<br />
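For reference, a typical layout following that convention (directory names are the standard Maven defaults; adjust to your own projects):<br />
<pre>my-project/
  pom.xml
  src/main/java        (java source)
  src/main/resources   (config/properties on the classpath)
  src/main/webapp      (html/jsp/WEB-INF)
  src/test/java        (unit tests)
</pre>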
<br />
*Baseline: Independent builder<br />
<br />
Either find a person who will always be an independent builder, or set up some type of continuous integration system. Having someone/something ELSE other than the developers do the builds will greatly stabilize the process and document/flush out any outstanding issues you have in the build process. This is a painful lesson I learned from my VB6 days, and now that I'm in java I've chosen http://hudson.net as an independent build tool, while continuum/cruisecontrol are other alternatives.<br />
<br />
*Baseline: Build system/dependency management/versioning<br />
<br />
I'm sure you just ran into a snag -- with an independent builder, you are learning there are different ways people are building, or worse, they are relying entirely on an IDE for the builds. Moving to an ANT or Maven2 build system in java, in my case Maven2, helped to ensure that the builds are *consistent* and any gotchas can actually get caught EARLIER rather than later. Let me say that again with an example - "This maven2 project is not building on my desktop, what a piece of crap." actually translates to "the project doesn't build on JDK1.4/JDK5/Windows/Linux/needs a library I forgot to add, let's fix it now while we're actively on the project instead of when we check it out 6 months later to fix a different issue.".<br />
<br />
Maven2 also helps with the dependency management problem, and the versioning problem. If you are always renaming your jars to be mylibrary.jar to include in your application, and you aren't sure which version that library is after-the-fact and trying to identify an issue, you know the problem.<br />
<br />
*Baseline: Promotion process<br />
<br />
This will be the most difficult baseline to adjust - a promotion process. What I mean by this, based on my experience with what seems to be working, is that you develop and deploy to a DEV environment. Work out the kinks as you know them. If you are lucky enough to have an internal QA, have them review it on DEV. Then, when things look o.k., *promote* to a STAGING environment (including different DB, server, everything). NEVER make custom tweaks on Staging; instead always modify your promotion, or migration, scripts/process, as those migration scripts/processes are exactly what you are also testing and exactly what will be used when you promote to Live. On Staging, you do UAT/Customer Acceptance, have them push it back if needed, make fixes on DEV, then promote back up to Staging for another review. THEN promote to Live.<br />
<br />
<span style="font-size: 180%;"><span style="font-style: italic;">Step 2: Improve the ability to identify and fix basic stuff<br /></span></span>What I mean by this is: let the developers use the tools they are already comfortable with. Unit Testing. Diagnostic Logging (or normal logging if you aren't familiar with the different logging types).<br />
<br />
Unit Tests: JUnit is great. NUnit exists for the other side. Having a way to test that the code is doing what you want, AND BEING ABLE TO REPEATABLY AND AUTOMATICALLY run those tests, is the goal. This is not integration testing, just basic module/unit testing that the code is behaving as expected for whatever business expectations can be resolved in the code.<br />
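As a minimal sketch of what I mean (the class under test and its behaviour are hypothetical examples, not from any of our projects):<br />
<pre><code>import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class ShippingEstimatorTest {

    // Hypothetical class under test, included inline so the example stands alone.
    static class ShippingEstimator {
        int businessDaysToShip(boolean expedited) {
            return expedited ? 2 : 5;
        }
    }

    // Business expectation resolved in code: expedited orders ship within 2 business days.
    @Test
    public void expeditedOrdersShipWithinTwoBusinessDays() {
        ShippingEstimator estimator = new ShippingEstimator();
        assertEquals(2, estimator.businessDaysToShip(true));
    }
}</code></pre>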
<br />
Diagnostic Logging: Making sure your code logs somewhere you can retrieve, and provides useful information to make a correction, is the goal. "It broke!" -- well, you need to know what caused it to break, and the *quicker* you can do that, the more time you'll have for other things. Rather than re-testing manually with system outs, get your logging taken care of. This will not fix all your issues, but if you can get the easy 80% out of the way, that's huge. In my experience we are still having some challenges, as there are some custom ways that are already in place, and people have a hard time breaking out of the sysout approach. I think I'm satisfied with using SLF4J, and then letting the Log4j implementation handle the logging and control the log verbosity (and the formatting of the logs....nothing worse than custom logging that has many different outputs, get it consistent!).<br />
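For anyone still on the sysout approach, the SLF4J side of that looks roughly like this (OrderService and the order id are made-up examples):<br />
<pre><code>import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical service class, shown only to illustrate the logging calls.
public class OrderService {
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public void process(String orderId) {
        // Parameterized message: no string building happens unless DEBUG is enabled.
        log.debug("Processing order {}", orderId);
        try {
            // ... business logic ...
        } catch (RuntimeException e) {
            // Log the exception itself so the stack trace ends up in the log, not just the message.
            log.error("Failed to process order " + orderId, e);
            throw e;
        }
    }
}</code></pre>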
<br />
<span style="font-size: 180%;"><span style="font-style: italic;">Step 3: You're walking, you're walking, let's try jogging.<br /></span></span>By this point, you should be o.k. and taking care of business. Now you should be able to look at some more technical things to improve the development process.<br />
<br />
Codability: This is where the static code analysis tools come in; they are, again, not that invasive to use. Whether you do it from a report-generation standpoint, or integrated into the IDE, tools like PMD, Checkstyle, and FindBugs have the potential to ferret out potentially poor code. This is no replacement for peer review in any fashion at all, this is just a convenient way to identify common issues (note: these tools are not stone-cold rules, there are times things have to be coded a certain way).<br />
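To make that concrete, this is the kind of common issue these tools flag automatically (a made-up snippet of the sort of thing that slips past casual review):<br />
<pre><code>// Hypothetical snippet: comparing strings with == instead of equals().
// FindBugs and PMD both have rules that warn about this.
public class RoleCheck {
    public boolean isAdmin(String role) {
        return role == "ADMIN";   // should be "ADMIN".equals(role)
    }
}</code></pre>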
<br />
Testability: Coverage tools like Clover, Emma, Jcoverage can help on your unit testing side to see if you can increase the amount of testing, and catch certain flow-changes (if/else/case/etc) in the code that aren't tested as well right now.<br />
<br />
<span style="font-size: 180%;"><span style="font-style: italic;">Step 4: The in-deep stuff</span></span><br />
Once you've reached step 4, you can look at the items I listed back at the top of this page. Notice that I really didn't hit certain items --<br />
<br />
Profiling is actually an invasive process most of the time. I haven't found a tool that can easily identify memory issues or performance bottlenecks *for* you; instead they, rightfully, require you to review the information and come to your own conclusions. Tools like TPTP, JProbe, JProfiler, etc. aren't quick-fix tools -- you need to learn and understand them, and they are useful in different scenarios.<br />
<br />
Multi-tier profiling/review: Tracking from the Web tier, through the server tier (rules/workflow/business logic), to the database tier (sql/db, or sproc) to help identify in which tier a particular slowdown or issue is occurring. Not something easy to do - some tools, like the now-deprecated InfraRED and Glassbox, attempted to make this easier for us, but they don't seem to be active.<br />
<br />
Integration testing: Actually being able to do business testing across an entire integrated system programmatically, performance testing SOA, or testing the full cycle of pressing a button in the UI are all desirable goals, but not easy to set up and do.<br />
<br />
Automated UAT/User Interface testing: Some neat tools, like Selenium, can help step through testing a website and ensure things continue to work as expected. It's a great tool, but if you are constantly making changes, keeping those Selenium tests up to date can get time consuming. Also, you need to know that they do NOT identify blemishes or a non-intuitive interface, only that the interface is continuing to work as expected.<br />
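As a rough sketch of what such a step-through looks like (this uses the WebDriver flavor of the Selenium Java API; the URL and element ids are hypothetical):<br />
<pre><code>import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

// Walks a hypothetical login page and checks that we land on the dashboard.
public class LoginSmokeTest {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://localhost:8080/myapp/login");
            driver.findElement(By.id("username")).sendKeys("testuser");
            driver.findElement(By.id("password")).sendKeys("secret");
            driver.findElement(By.id("loginButton")).click();

            if (!driver.getTitle().contains("Dashboard")) {
                throw new AssertionError("Login did not reach the dashboard page");
            }
        } finally {
            driver.quit();
        }
    }
}</code></pre>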
<br />
Scalability testing: testing the equivalent of 10 years' worth of data, testing 5, 10, 50, 1000 concurrent users, evaluating estimated load capacity per setup (proxy/multi-app servers/single db, db clustering, etc). You can also throw in disaster/recovery as part of the scalability testing as well. These are all very manual, very conscious development efforts and definitely invasive and time consuming.<br />
<br />
<br />
<span style="font-size: 180%;"><span style="font-style: italic;">Conclusion</span></span><br />
Well, this looks more like a brain-dump than an organized blog, but sometimes just dropping information can be helpful to other people, and could solicit useful feedback!<br />
<br />
Post-Edits:<br />
A good article on related subject: http://www.ddj.com/architect/184415470dhartfordhttp://www.blogger.com/profile/17083942553852687561noreply@blogger.com0tag:blogger.com,1999:blog-31129136.post-2675788632431895022009-07-23T11:51:00.000-07:002012-07-30T07:28:37.848-07:00GWT 1.6 jvm crashes (ParameterizedMethodBinding)Very quick blog. GWT 1.6 causing JVM crashes.<br />
<br />
Sources:<br />
http://grack.com/blog/2009/04/14/gwt-16-crashes-and-a-fix/<br />
<br />
http://www.mail-archive.com/google-web-toolkit-contributors@googlegroups.com/msg04852.html<br />
<br />
http://osdir.com/ml/GoogleWebToolkitContributors/2009-04/msg00044.html<br />
<br />
We were experiencing similar issues, but the issues were difficult to sort out. It appeared Sun 1.5 JVM's did not have this issue, most of the Sun 1.6 JVM's did (but a newer one did not), and the OpenJDK 1.6.0-b09 also had the issue (issues spanning Windows and Linux boxes).<br />
<br />
In the end, for our shop, I found the true resolution to the problem to be:<br />
<br />
Remove the '-server' option when running the GWT compiler.dhartfordhttp://www.blogger.com/profile/17083942553852687561noreply@blogger.com0tag:blogger.com,1999:blog-31129136.post-44033888989391539292009-07-01T06:27:00.000-07:002009-07-01T07:02:28.955-07:00Microsoft, Open Source, Barriers to Entry/Barriers to Deploy<span style="font-weight: bold;">Microsoft, Open Source</span><br /><br />I had to make a conscious decision to have the above title have a comma between Microsoft and Open Source. Putting something like 'and' or 'versus' or 'with' may set the wrong stage for the intent of this blog.<br /><br />links:<br />http://www.microsoft.com/opensource/<br />http://www.codeplex.com/<br /><br /><br /><span style="font-weight: bold;">Focus:</span><br /><ul><li><span style="font-weight: bold;">Barriers to Entry for a development environment</span></li><li><span style="font-weight: bold;">Barriers to Deploy a solution</span></li></ul><br />Traditionally, 'open source' is associated directly with programming languages, with the two more prominent (not only, just prominent) ones being Perl and Java. <br /><br />So what makes them successful? A number of things do, but I wanted to mention two key drivers from a business and adoption standpoint -- Barriers to Entry and Barriers to Deploy.<br /><br />Rather than have a lengthy paragraph, just going to bullet/summarize (BE = Barrier to Entry removed, BD = Barrier to Deploy removed):<br /><br />Perl<br /><ul><li>BE: You can get the Perl language for free.</li><li> BE: You can get various IDEs to use perl for free...or use notepad/vi.</li><li>BE: Large number of commercial books out there for Perl.</li><li>BE: Large number of free articles out there for Perl.</li><li>BE, BD: CPAN, a large centralized repository of code you can use, learn from, and deploy at will for free. A lot of problems, both obvious and obscure, have already been solved and are free for you to use and/or modify.</li><li>BE, BD: Interpreted, can make changes on-the-fly and immediately see the results (good for learning and prototyping and fast support, questionable for enterprise apps).<br /></li><li>BD: Multi-OS environment support (with availability of free OS as deployment environment).<br /></li></ul> Java<br /><ul><li>BE: You can get a java compiler and java VM without cost. 
Also, several options of compilers and VMs.</li><li>BE: Eclipse/Netbeans IDEs are free.</li><li>BE: Large number of commercial books out there for Java.</li><li>BE: Large number of college courses and training classes for Java (varying level of quality however).</li><li>BE: Large number of free articles out there for Java, with code examples.</li><li>BE, BD: Several tested/documented solution paths and design patterns for more complex solutions (OSGi, Spring, JavaEE).</li><li>BE, BD: Many repositories of code and binaries available, free to use and modify -- sourceforge, codehaus, java.net, as well as Maven library repositories.</li><li>BE, BD: Free to use and deploy build systems (ant, Maven) that are not tied to an IDE, and that allow anyone to 'check out or download' code and just start working with it.<br /></li><li>BD: Multi-OS environment support (with availability of free OS as deployment environment).</li><li>BD: Java has several servers (JavaEE container servers - Tomcat, Jetty, Jboss, Jonas, Glassfish, Geronimo, etc) that are also free to develop and deploy on.</li><li>BE,BD: The JCP and/or common solutions usually have competition that continues innovation, and gives developers choices depending on the scenarios presented to them.</li></ul><br />Microsoft<br /><ul><li>BE: Large number of commercial books on Microsoft .NET programming language platforms.</li><li>BE: Large number of college and training courses (relatively stable quality).</li><li>BE: Various programming language options for the .NET platform.</li><li>BE: Commercial MSDN access as a repository of solutions, code examples, etc.</li><li>BE,BD: Graphical/UI builds through the singular, commercial (which is both good and bad, as it's a constant) IDE - Visual Studio.</li><li>BD: You know exactly where it is going to deploy - commercial MS Server OS on MS IIS/Biztalk/etc licensed servers.</li><li>BE: A lot of packaged solutions, some integration solutions, all commercial, are available. However, they require licensing for deployments.<br /></li></ul>If Microsoft is going to try to adopt an open source community, they need to take a look at the Barriers to Entry and the Barriers to Deploy, particularly from the commercial standpoint -- the companies that can spend the money aren't going to give their code back for free, while companies/developers that have low costs for the development and deployment environment are less in a pinch and like having their source code out there to help improve its quality, particularly when the Barrier to Entry for someone else to look at published code is low.dhartfordhttp://www.blogger.com/profile/17083942553852687561noreply@blogger.com0