Tuesday, February 23, 2016

S3-dist-cp and recursive subdirectory groupings

When working with AWS (specifically Hadoop on AWS EMR), you can use S3DistCp to concatenate files together with the --groupBy option.  What is really cool is that this works even on already-compressed (gzip) files!

However, recursive sub-directories are not natively supported by S3DistCp, so instead you need to stage the files first. To stage, we are going to use the DistCp that S3DistCp originated from, as it has some other useful features not in S3DistCp.

Using AWS EMR, you can create a Custom JAR step and either use /usr/lib/hadoop/hadoop-distcp.jar or upload your own version of hadoop-distcp.jar to S3 and reference that version. Then, for the args, copy the contents with --update to a destination staging area where the individual files are stored in a flattened directory structure. In this example, I'll filter to just csv.gz files.

  --update s3://test/raw/**/*.csv.gz s3://test/staging
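Because the staging area is flat, two source files that share a file name in different sub-directories would land on the same staging key. A quick pre-check can be sketched in Python as below. This is only an illustration: the function name `find_basename_collisions` and the sample keys are hypothetical, and how distcp itself reacts to such a clash is not covered here.

```python
import posixpath
from collections import defaultdict


def find_basename_collisions(keys):
    """Return {file name: [source keys]} for names that occur more than once.

    `keys` is any iterable of S3 keys or URIs (hypothetical input, e.g. a
    listing of the raw area that you collected yourself).
    """
    by_name = defaultdict(list)
    for key in keys:
        # Flattening keeps only the last path segment of each key.
        by_name[posixpath.basename(key)].append(key)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


if __name__ == "__main__":
    sample = [
        "s3://test/raw/2014/jan.csv.gz",
        "s3://test/raw/2015/jan.csv.gz",
        "s3://test/raw/2015/feb.csv.gz",
    ]
    for name, paths in find_basename_collisions(sample).items():
        print(name, "appears in", len(paths), "sub-directories")
```

If the check reports anything, renaming the source files so the name carries the sub-directory information (the year, for example) avoids the clash.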

After that, you can use command-runner.jar to run s3-dist-cp and concatenate in any grouping defined by the regular expression.  The example groups by 4 digits in the filenames (years, for example), such that all the daily/monthly files are put together into a single year file.  The --outputCodec gzip option ensures that the resulting file is also compressed.

s3-dist-cp --src=s3://test/staging --dest=s3://test/grouped/ --groupBy .*([0-9][0-9][0-9][0-9]).* --outputCodec gzip
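To preview which files a groupBy expression would put together, the grouping can be imitated locally. The Python sketch below is not how s3-dist-cp is implemented; it assumes the pattern has to match the whole key and that the captured group names the output group, and the helper name `group_keys` and the sample keys are made up for illustration.

```python
import re
from collections import defaultdict

# Same expression as passed to --groupBy in the command above.
GROUP_BY = re.compile(r".*([0-9][0-9][0-9][0-9]).*")


def group_keys(keys):
    """Group keys by the text captured in the first parenthesised group.

    Keys that do not match the expression are left out, which mirrors the
    'filters out' behaviour described for a wrong regex.
    """
    groups = defaultdict(list)
    for key in keys:
        match = GROUP_BY.fullmatch(key)
        if match:
            groups[match.group(1)].append(key)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "staging/sales-2014-01-15.csv.gz",
        "staging/sales-2014-02-15.csv.gz",
        "staging/sales-2015-01-15.csv.gz",
        "staging/readme.txt",
    ]
    for group, members in sorted(group_keys(sample).items()):
        print(group, len(members))
```

One thing worth checking with a preview like this: the leading `.*` is greedy, so the group captures the last run of four digits in the name. A file named like `report-20150301.csv.gz` would be grouped under `0301`, not `2015`, so a more specific expression may be needed when dates are written without separators.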

If you get errors like "ERROR: Skipping key XYZ because it ends with '/'", this is usually because either there are no source files, or the regex in your groupBy is not quite correct and filters out every file.

Saturday, October 10, 2015

Gamer Post (non development) Star Citizen coupon credits

For any gamers interested in the upcoming space epic Star Citizen from Roberts Space Industries / Cloud Imperium (and their pretty impressive FPS module, Star Marine, for planet and capital ship capture integration), my referral code can now get you 5,000 in-game credits that you wouldn't otherwise have! (P.S. You can register an account now, without obligation, before you forget; the 5,000 credits will be waiting for you.)

Trying to spread the word and give people a little boost to get started!

Sunday, May 10, 2015

New development (visual) perspective

It has been a while since my last post.  I've been quite overwhelmed with additional challenges, which I have overcome one at a time.

However, being a passionate technologist, I am always looking for ways to make myself and my team more ready to take on the next challenge.

Toward that goal, for the last couple of months I've been evaluating AR, VR, and 'Holo' alternatives to provide a different 'perspective' on the heads-down development ecosystem/environment.

Key findings (end of Q1/2015):

  • AR, Augmented Reality (an overlay over the real world), such as the devices from Google Glass, Meta (Pro), Atheena, and Vuzix, which are all trying to take on many things at once.  I, for one, do not need a head-mounted camera, especially when you will always have a phone with you that has a better camera anyway. As for using the camera itself for AR, I would rather use the 'smartglass' as a portable monitor than as full AR.
    • Extra features not needed (camera)
    • needs to be 'socially acceptable', particularly in meetings.  Privacy concerns exist also around the camera, so drop camera, or provide (obvious marked?) option that does not include the camera.
    • Intent One for supportive information, lookup/confirmation in meetings.
    • Intent Two for development environment where not limited to single-pane monitor.  Need higher resolution (i.e. 1600x1200) to be useful.
  • HoloLens (Microsoft).  They do not provide any actual specifications, so it is vaporware for the time being. Interesting concept, but not likely to impress compared to VR.
  • VR, Virtual Reality, where there is no attempt to overlay the real world.  The goal is 'high quality media' (i.e. gaming experience), so no camera; the focus is on video resolution, sound, and hopefully eye-fatigue challenges.
    • Project Morpheus, Sony/PS4 specific; not relevant, as I need a PC platform.
    • Oculus Rift. Very popular, but the newer consumer version is targeted for 2016.
    • Valve/Steam HTC Vive.  Likely the best candidate: high-resolution display, with SteamOS intent for multi-OS environments (Windows, Linux, SteamOS), very good for both traditional business and gaming development shops. Targeted for the end of 2015.
      • Bonus: Valve/Steam is also a nice distribution platform in itself.  One of my personal goals was to help provide 'optimized' development environments per project.  Integrating with Steam as a distribution platform and providing 'modules' or 'add-ons' that deliver the optimized environment/experience would give new developers a faster bootstrap process.
Unlike Oculus, HTC is not selling its Vive development kits, an interesting move.  I've submitted an application for my team (as an independent) to try to provide another 'perspective' to development.


Thursday, October 09, 2014

Alfresco Startup time

As some of you may know, Alfresco is an EDMS that runs on Tomcat (or another Java container) on Linux or Windows.

I've been working with Alfresco for several years in various capacities, and anyone who works with Alfresco Community Edition (CE) knows that configuration changes require a restart, and the restart is painfully slow....like 5 minutes slow.

Although one of the 'obvious' fixes is to move alfresco.war and share.war to independent Tomcats and only restart the one that you need, that is still another configuration/integration point and a new possible source of issues (particularly if you are running low-volume sites and want to use only one Tomcat/server).

The 'performance' is always relative to the hardware you are using, and the time will vary depending on big CPU/slow disk, low-end CPU/high-end disk, etc.  As such, I will simply provide a baseline, which is Alfresco 4.2.f from the Bitnami Linux installer on Fedora.

All of these tests are with a baseline alfresco install, no content, no indexing.

| Relative startup time | Configuration | Notes |
|---|---|---|
| 0% (baseline) | -XX:+UseG1GC -XX:MaxPermSize=256M -Xms1024M -Xmx1024M | Bitnami base setup, 1024M heap. |
| -59% (slower, minutes dep. on hardware) | -XX:+UseG1GC -XX:MaxPermSize=256M -Xms256M -Xmx256M | Comparison when trying low memory (256M), to see how much of a difference it makes. |
| +0-5% (trivial) | -XX:+UseG1GC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M | Excess RAM (for startup) does not impact startup time. Note, however, that you usually should have 2G-8G for normal, real-world usage. |
| +12% faster (20 sec dep. on hardware) | -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M | Changing to a concurrent GC (because we know CPU is maxed) actually made a good difference (assuming you are not heavily disk IO bound). |
| +14% faster | -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M; put tomcat/alfresco/openoffice in a ramdrive to remove disk IO concerns (/dev/shm, for example) | Keeping the best-performing setup, try a ramdrive to test for disk IO issues...they really aren't an issue in startup. However...your alf_data location absolutely has an impact on real-world content. |
| +8% faster (slower than concurrent GC with defaults) | -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=31 | Additional GC tuning (quick and uninformed; I didn't want to spend a lot of time on this). Just used the settings from http://www.oracle.com/technetwork/java/tuning-139912.html#section4.2.6 |
| +18% faster | -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M; modify /alfresco/WEB-INF/web.xml and /share/WEB-INF/web.xml with metadata-complete=true and absolute-ordering | Best-performing memory setup, then added this find: http://wiki.apache.org/tomcat/HowTo/FasterStartUp Obviously not a solution if using a stock or as-is installer approach, but if you can customize your WAR (not just the exploded directory), the gains are surprising. |
| +20% faster (memory change, small app conf change, ramdrive) | -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MaxPermSize=256M -Xms2048M -Xmx2048M; modify /alfresco/WEB-INF/web.xml and /share/WEB-INF/web.xml with metadata-complete=true and absolute-ordering; ramdrive for tomcat/alfresco/openoffice | Putting it all together: the 'quickest' startup option if you have the RAM for it, but disk IO isn't an issue with startup. |
So, there are some quick wins for a decent return, depending on how much you want to modify your install. However, at the end of the day the startup is CPU (single-thread) bound. Now, some of you may remember the older JBoss application servers (4-6 series) and how they were getting slower and slower on startup, until they re-engineered their startup process for *controlled* parallel startup of services. That may be what is really needed to get the Alfresco CE startup time down.

I recommend the second-to-last option (+18%). Setting up a ramdrive for that little of a gain is not worth it, and if you are really trying to push startup times, get your alfresco and share on different Tomcat instances first before moving to a ramdrive.
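
The web.xml change from the recommended option could be scripted roughly as below. This is a minimal Python sketch, not part of Alfresco or Tomcat tooling: the function name `speed_up_web_xml` is hypothetical, and it assumes the descriptor uses the Servlet 3.0 `http://java.sun.com/xml/ns/javaee` namespace, so check the namespace and version declared in your own web.xml first.

```python
import xml.etree.ElementTree as ET

# Assumption: the descriptor is a Servlet 3.0 web.xml in this namespace.
JAVAEE_NS = "http://java.sun.com/xml/ns/javaee"


def speed_up_web_xml(xml_text):
    """Return web.xml text with metadata-complete="true" and an empty
    <absolute-ordering/> element, as described on the Tomcat FasterStartUp page.
    """
    # Keep the default namespace unprefixed when serialising.
    ET.register_namespace("", JAVAEE_NS)
    root = ET.fromstring(xml_text)
    root.set("metadata-complete", "true")
    tag = "{%s}absolute-ordering" % JAVAEE_NS
    if root.find(tag) is None:
        # An empty element means no web fragments are scanned or ordered.
        root.append(ET.Element(tag))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<web-app xmlns="http://java.sun.com/xml/ns/javaee" version="3.0">'
        "<display-name>demo</display-name>"
        "</web-app>"
    )
    print(speed_up_web_xml(sample))
```

Note that ElementTree drops comments and the original formatting when it rewrites the file, so for a one-off change editing the two web.xml files by hand may be just as practical.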

Tuesday, January 15, 2013

Sharepoint 2013 w/ Apache Chemistry CMIS

Sharing my experience in trying to use the CMIS library to work with Sharepoint 2013.  As a preface, I have existing code integrated into a CEVA (content-enabled vertical application; that seems to be the buzzword) using Alfresco 4.2 CE as a backend, and I am evaluating compatibility of the system with a Sharepoint 2013 backend (I'm not swapping to Sharepoint, just cross-checking).

I use 'sp2013' for the server name, replace as appropriate.

  1. Work with CMIS-Workbench as your go-to tool for confirmation before working with your code.  This is like your SoapUI when working with Webservices, or your Database Editor tool when trying to write queries for your application.  Work through everything you want to do with CMIS Workbench *first* before you write code.
  2.  Sharepoint 2013 setup notes:
    1. Sharepoint Central Admin (http://sp2013:90/): 
      1. Security, under 'General Security' section, 'Specify Authentication Providers'. 
      2. Pick the default zone, or if you know Sharepoint the appropriate zone. 
      3. Under Windows Auth, I had to enable 'basic auth'...I also disabled integrated auth, as my intent was to use Sharepoint solely as a repository, so there was no need to get the system confused between integrated and basic auth (obviously, if using this route in production, you need to set up SSL).
    2. Site Settings (http://sp2013): 
      1. Pick the site you want to access through CMIS (for example, 'Documents').
      2. In the upper right, beside the login name, is a 'gear' icon for settings - click that, go to 'Site Settings'.
      3. Under 'Site Action' header is a 'Manage Site Features' link, click that.
      4. Activate 'Content Management Interoperability Services (CMIS) Producer'
      5. Repeat for each site you want to access. Each site will appear as a unique Repository from the CMIS point of view.
  3. Again, use CMIS Workbench for all your confirmation/testing.  Add some files/folders to the above Site(s)/Repo(s) you shared for CMIS.
    1. Connect to the URL http://sp2013/cmis/rest/?getRepositories through CMIS Workbench. You will likely use this one for your Apache Chemistry code as well.
    2. For Apache Chemistry, lesson learned:
      1. DO NOT try to create a session by re-using your Map of parameters and adding in the repo ID. Instead, get the Repository object directly and use Repository.createSession().
      2. For an example of best-approach usage of the Apache Chemistry CMIS library, look at the CMIS Workbench source code (that is how I learned the above error/correction).
  4. There are some CMIS functions that DO NOT work with Sharepoint 2013.  I ran into only one and have not done a thorough review, but this already delayed me significantly:
    1. SCORE() does not work in Sharepoint 2013
A HUGE kudos goes to http://gauravmahajan.net/2013/01/06/sharepoint-2013-rtm-on-win-server-2012-virtual-machine-download/. I had already spent enough time just dealing with Sharepoint and CMIS, much less getting all the infrastructure up and running - big thank you!


Monday, December 03, 2012

Document Management - CMIS 1.1 protocol approved

With apparently very little fanfare, CMIS 1.1 passed the final votes to become an approved specification.


Now, some people may come to this blogpost and ask the question: "What is CMIS and what is great about 1.1 being approved?"

CMIS is an attempt to standardize the protocol to communicate with document/content management systems (EDMS/ECM). These systems have been around for ages (>15 years?).  But they are ruled by large, proprietary giants who protect their investments by making sure that once you are integrated with them, you are locked into them without another large investment to re-develop/design all those integrations into another proprietary system.  These may not have been malicious decisions, but attempts to provide value add, but the end result is the same --- you get locked in.

WebDAV - a protocol that some of them started to follow.  A good protocol, but it lacked standard query support and repository/administration support.

JCR - Java Content Repository (two different revisions over time).  Java-specific, an attempt to define *how* to build a content repository, the underlying piece of an EDMS/ECM, but it didn't exactly define a good integration/interaction protocol for clients or other tools.  However, this did plant the seed to create various open source alternative EDMS/ECMs, so thank you (although it is Java-specific, at least someone started something!)

CMIS 1.0 - after JCR, CMIS came into play as a language-agnostic way to search and retrieve documents (atompub & webservice versions).

So...what is so great about CMIS 1.1?  It brings:

*  standard way to create custom object types (content models, document types, etc.) through a common/standard protocol instead of relying on each vendor to provide their own mechanism (2.1.10 Object-Type Creation, Modification and Deletion)

*  standard way to support 'mixin', or reuse, of properties using 'secondary types' (2.1.9 Secondary Object-Types)

With these two very important features, you can now create, search, retrieve, and (partially) maintain your content completely through a standard protocol, allowing creation of tools and interfaces against the protocol instead of vendor-specific implementations.

Do not get me wrong, the innovators in this space (Alfresco, for example) provided vendor-specific value-adds before the industry caught up, but some people, like myself, were resistant to those vendor-specific value-adds until they could interoperate with other solutions.  To say you picked a solution *only* because of a value-add feels like lock-in.  To say you picked a solution above all others that offer the same features (CMIS 1.1 protocol) says a LOT more :-)

CMIS standard: http://docs.oasis-open.org/cmis/CMIS/v1.1/CMIS-v1.1.html

Monday, July 02, 2012

Laptop build

Although I was looking for a pre-installed Linux laptop, I found a too-good deal on a ThinkPad X230t with sufficient capabilities for Xen server needs in a compact 12" form factor.

 Step 1: Shrink the volume on the default Windows 7 install. Windows provides better support nowadays for volume resizing. If you go to the Control Panel and search for 'partition' as a keyword, you will see the disk management tools. This gives you the ability to shrink the volume....sort of. It appears it isn't an exact tool, and you will need to shrink, reboot, defrag, reboot, then shrink some more...repeating....until you get to the desired target size. I was aiming for 120GB, and it took 4 tries to get there.

 Step 2: Back up the Windows image. Although there is the familiar Ghost imaging software if you have the money, I wanted to look at alternatives. Lenovo provides its own backup/restore software that looks like it would work well. I got a backup USB hard drive (not USB flash), and the Lenovo ThinkVantage backup/restore did the MBR and backup images flawlessly. However, being an individual who wants to avoid lock-in and move towards automated/repeatable provisioning, I kept looking.

Cobbler is a tool I keep falling back to for image-based provisioning (versus kickstart-based installs), and it has support for provisioning images from Clonezilla (http://clonezilla.org). Clonezilla has support for Windows imaging, so I am sharing my findings:

 1) Reformat your external device (USB hard drive in my case) to have a small first partition, such as 250MB, with a FAT32 filesystem. This is important to avoid a lot of trial/error: other versions of FAT will not work, and a too-large volume causes problems. Don't worry, you still want a second partition that is much larger (at least 32GB) to store the actual images.

 2) Use tuxboot.exe. Clonezilla highly recommends it, and they are right to do so. Once you get the partition straightened out, everything else is a cakewalk. And, yes, you can use it directly from Windows without needing a Linux install.

 3) Plug your device into the Windows machine you want to image. If it is only for one machine, you are good to go. If you are trying to create a 'gold image' for distributing to multiple machines at once, look into 'sysprep' and other tools to prepare the Windows install.

 4) Reboot your machine, and use your BIOS to choose the alternate boot location. If you do not see your device, note that some USB 3.0 ports/devices are not recognized as bootable locations, so plug into a USB 2.0 port to be sure.

 5) The directions provided with Clonezilla were excellent! If you want to review them beforehand, you can check their site. Create the image and store it in the large partition; it takes about 40 minutes (minimal Windows install) to create the image and then do a double check.

That is all for tonight, more updates later.....