Our grid provides a substantial amount of computing power, and overall it works well. But the actual compute nodes are desktops in K-12 schools scattered across the state, often in rural areas, so individual processes ( Jobs ) get cycled fairly often. There are lots of reasons: a machine crashes, gets rebooted, drops off the network, runs out of space, etc. Not a big deal: the system will detect that the process has been lost and restart the Job somewhere else. The only catch is that a Job can run 12+ hours, and it might take an hour or two for the loss to be detected and the Job rescheduled. That’s also not really a big deal when you’ve got tens of thousands of them to run. But if that one Job is the last of a 50,000 Job set ( Run ), and the researchers need those results to get to their next step, the wait can be annoying. A 50K Job Run might be 99.5% complete in 48 hours, but it might take another 72 for those last 250 Jobs to complete.
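To put numbers on that tail, here is a quick back-of-the-envelope calculation in Python. It only uses the example figures above ( 50,000 Jobs, 99.5% done at 48 hours, another 72 hours for the rest ); the function name and its parameters are made up for illustration.

```python
def tail_stats(total_jobs, bulk_fraction, bulk_hours, tail_hours):
    """Summarize how much of a Run's wall-clock time goes to its last few Jobs."""
    bulk_jobs = round(total_jobs * bulk_fraction)  # Jobs finished in the fast phase
    tail_jobs = total_jobs - bulk_jobs             # stragglers left over
    total_hours = bulk_hours + tail_hours
    return {
        "tail_jobs": tail_jobs,
        "total_hours": total_hours,
        "tail_share_of_time": tail_hours / total_hours,
    }


# Example figures from the paragraph above.
stats = tail_stats(total_jobs=50_000, bulk_fraction=0.995, bulk_hours=48, tail_hours=72)
print(stats)
```

With those figures, the last 250 Jobs ( 0.5% of the Run ) account for 72 of the 120 total hours, or 60% of the wall-clock time.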
I considered things like double or triple running the last X Jobs. That could work, but it would get complicated really quickly on the software and scheduling side. Plus it might not always help unless you were sure each copy of a Job went to a geographically different area; otherwise a single network or power cut could take out every copy. Again, more complexity in the software. And worse, it would waste computing power by doing double or triple work for some percentage of each Run.
So what if we could assign the last X Jobs to a set of agents that don’t often get cycled? Software-wise that’s still not trivial, but it’s an order of magnitude less complex. And it might not take a huge number of cores. I priced out some medium density, couple year old ‘off-lease’ Xeons, and we even had a tentative place to put them. But after a loss in funding ( Thanx, Gov Bevin, Lobbyists, et al. ), that was no longer practical.
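The idea of steering the last X Jobs toward agents that rarely get cycled can be sketched in a few lines of Python. This is not the grid's actual scheduler; `Agent`, its `stable` flag, `pick_agent`, and the `tail_threshold` default of 250 are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    stable: bool  # hypothetical flag: True for dedicated machines that rarely get cycled


def pick_agent(jobs_remaining, idle_agents, tail_threshold=250):
    """Return the idle agent that should take the next Job, or None if none is idle.

    While more than tail_threshold Jobs remain, ordinary agents are preferred so
    the stable ones stay free. Once the Run is in its tail, stable agents are
    preferred, with ordinary agents used only as a fallback.
    """
    in_tail = jobs_remaining <= tail_threshold
    # Agents whose stability matches the current phase sort first (key is False).
    ranked = sorted(idle_agents, key=lambda agent: agent.stable != in_tail)
    return ranked[0] if ranked else None


agents = [Agent("school-desktop", stable=False), Agent("dedicated-box", stable=True)]
print(pick_agent(10_000, agents).name)  # bulk of the Run
print(pick_agent(100, agents).name)     # tail of the Run
```

Falling back to an ordinary agent when no stable one is idle is a design choice in this sketch. A stricter version could make tail Jobs wait for a stable agent instead.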
Then I came across the Intel Compute Sticks. I had seen them a year or two ago, cropping up in some home-automation / IoT discussion, but I didn’t give them much consideration. However, now the original Atom versions are available at greatly reduced prices, and many of the driver and other issues appear to have been solved.
I went out on a limb and picked one up on sale for $45. The puny 1G RAM / 8G flash Ubuntu unit. I figured it would at least do as a proof of concept. I set it up, upgraded the supplied Ubuntu 14.02 ( to 14.04 ) and added the minimally required packages to actually execute a Job. I didn’t port all the agent code, as that will be a whole undertaking, but I moved over just enough to yank some Jobs out of an active Run and manually run them on the machine.
Somewhat surprisingly, it ran just fine and finished quicker than expected ( about 80% of the average run time ). Even more interesting was that, after I killed off all the GUI bits, there was plenty of RAM available. The Atom in the Stick has 4 cores. So I fired up 4 Jobs ( all are single threaded ), and while it heated the Stick up a bit, and the run time of each Job extended a bit ( still just under 100% of the average run time ), it all ran fine and didn’t run out of RAM.
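Those two measurements give a rough feel for throughput. The Python snippet below is only arithmetic on the figures above; it rounds "just under 100%" up to 1.00 as a simplifying assumption, and the function name is made up for illustration.

```python
def jobs_per_avg_runtime(concurrent_jobs, relative_runtime):
    """Jobs finished per 'average grid run time'.

    relative_runtime is each Job's run time as a fraction of the grid average.
    """
    return concurrent_jobs / relative_runtime


single = jobs_per_avg_runtime(1, 0.80)  # one Job at a time, about 80% of average
quad = jobs_per_avg_runtime(4, 1.00)    # four at once, assumed to take 100% of average

print(f"1 Job at a time:  {single:.2f} Jobs per average run time")
print(f"4 Jobs at a time: {quad:.2f} Jobs per average run time")
print(f"Gain from using all 4 cores: {quad / single:.1f}x")
```

Under that assumption, running 4 Jobs at once gets roughly 3.2 times the work out of the Stick compared with running one at a time, and the Stick keeps pace with about four average grid slots.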
Oddly, the output of the Job was not an exact match to the output when run on my iMac. However, the researchers assured me that this was normal and expected across operating systems.
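If the outputs are numeric, one common way to compare results from two platforms is with a tolerance rather than an exact match. The Python sketch below assumes the Job output can be read as a list of floats, which is a guess on my part; the function name, the sample values, and the tolerances are all hypothetical.

```python
import math


def outputs_match(a, b, rel_tol=1e-9, abs_tol=0.0):
    """Return True if two sequences of numbers agree within the given tolerances."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


# Hypothetical values standing in for the same Job run on two machines.
stick_output = [1.0000000000001, 2.5, 3.75]
imac_output = [1.0, 2.5, 3.75]

print(stick_output == imac_output)               # exact comparison
print(outputs_match(stick_output, imac_output))  # tolerant comparison
```

What counts as "close enough" is a question for the researchers, so the tolerances here are placeholders rather than recommendations.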
So, time to budget and scale up.