Saturday, March 1, 2014

OSX analog of the *nix 'watch' utility

OSX does not ship with a watch utility. It can be easily simulated with a while loop:

while true; do ls -l logfile.txt; sleep 5; done

I found that having a function for that purpose in the .bash_profile is more convenient. It also timestamps new output and suppresses repeated output. Adjust to your taste:

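function lg(){
  local prev=""
  while true; do
    curr="$($* 2>&1)"             # run the command, capture stdout and stderr
    if [ "${prev}" != "${curr}" ]; then
      echo "$(date +"%H:%M:%S"):" # timestamp only new output
      echo "${curr}"
      prev="${curr}"
    fi
    sleep 5
  done
}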

Here is an example use with output:

$ lg "ls -AFGhlO log_*"
-rw-r-----  1 vlad  staff  -    0B Mar  1 20:00
-rw-r-----  1 vlad  staff  -    0B Mar  1 20:00
-rw-r-----  1 vlad  staff  -    0B Mar  1 20:00
ls: log_*: No such file or directory
-rw-r-----  1 vlad  staff  -    0B Mar  1 20:06

Note that the command parameter needs to be quoted or escaped properly to execute correctly.


Monday, November 4, 2013

Version info and enforced commits

One of the prerequisites for successfully operating multiple software servers is a good inventory of production code. Read an excellent example of the impact of a failed software inventory if you have any doubts about it. Below is some guidance for building software with inventory data compiled into the code. It provides running systems with the data needed for a factual inventory. It also has the benefit of enforcing committed code.

In a rush to patch a hot issue, developers may deploy a build of a dirty codebase. Dirty in the sense that the build happens while there are changes to the code which are not committed to the repository, even a local one. If that dirty build goes to production while the source code gets more changes before committing, one will never know with certainty why that production build behaves a certain way should a problem arise.

A way to combat it is to put version and repository status information into the build executable. The example below is a generic guideline for a Microsoft Visual Studio 2012 C++ (VC++ in this post) solution built from a Git repository. It also shows a possible integration with the GitHub service, if needed. This post gives you a workflow outline which can be applied to most environments I know of.

The general steps from build to execution are:

  1. The build process has a pre-build script, which gathers version and repository status information. In our case, VC++ uses the Windows PowerShell version_info.ps1 script.

  2. The pre-build script generates a source code file, which is expected by the rest of the codebase. The generated file has everything needed to allow or deny running the version. In our case it is a C++ header file version.h.

  3. The code contains a function which checks whether the build is legitimate to run, and logs and stops the process as needed. That example code is in the versionLogAndVet function of the version.cpp example file.

  4. At run time, the versionLogAndVet function allows only permitted combinations of repository status / build configuration to run. It also logs version information.

In this specific environment there are three build configurations. DEBUG and RELEASE configurations are standard, while the LOGGED configuration turns on additional logging without debug code overhead. Importantly, the code in version.cpp should only be invoked from server start-up routines, so that tests can still run on uncommitted code in production build configurations.

Here is the source code and notes:

It makes sense in most VC++ solutions to configure these checks on a per-project basis. Specifically, configure them for projects which generate executable files. In the line below you likely need to customize two things: the path to the version_info.ps1 script and the word "server", which will become the namespace for storing constant values describing the repository state and information.

powershell.exe -ExecutionPolicy RemoteSigned -File "$(SolutionDir)Util\Build\version_info.ps1" server "$(ProjectDir)\" "$(SolutionDir)\"

A word of caution: the interpreter expands special character sequences on that command line. It took me a lot of time to realize that the $(...) variables filled by Visual Studio have backslashes as path separators and also end with a backslash. This means that PowerShell takes the last backslash as an escape character for the following '"' and breaks the command line. For that reason there is an extra backslash before the '"', which turns the last backslash of the path into itself after expansion. The unsolved problem, though, is paths with directories starting with escapable characters. In that case, directories may start with a bad character and break things. Microsoft did not provide any help with this issue (see here and the footnote).

VC++ does not allow specifying the PowerShell script directly, so we need to specify the powershell interpreter and the desired execution policy. Also, this step should be embedded in an XML project properties file so that it can be assigned to multiple projects and build configurations:
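To see why the extra backslash helps, here is a minimal C++ sketch of the backslash-and-quote rules documented for the Microsoft C/C++ runtime command-line parser. I am assuming that powershell.exe splits its arguments the same way, which matches the behavior described above. The function name splitCommandLine is made up for this illustration, and the sketch ignores the special handling of doubled quotes inside a quoted argument.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified model (an assumption, not Microsoft's code) of the rules:
//   2n backslashes + quote   -> n backslashes, the quote toggles quoting
//   2n+1 backslashes + quote -> n backslashes and a literal quote
//   backslashes not followed by a quote are kept literally
std::vector<std::string> splitCommandLine(const std::string& line) {
  std::vector<std::string> args;
  std::string cur;
  bool inQuotes = false;
  bool haveArg = false;
  std::size_t i = 0;
  while (i < line.size()) {
    const char c = line[i];
    if (c == '\\') {
      // Count the run of backslashes, then look at what follows it
      std::size_t n = 0;
      while (i < line.size() && line[i] == '\\') { ++n; ++i; }
      if (i < line.size() && line[i] == '"') {
        cur.append(n / 2, '\\');
        if (n % 2 == 1) { cur.push_back('"'); ++i; }  // escaped quote
        // even count: the quote is handled as a delimiter on the next pass
      } else {
        cur.append(n, '\\');
      }
      haveArg = true;
    } else if (c == '"') {
      inQuotes = !inQuotes;
      haveArg = true;
      ++i;
    } else if ((c == ' ' || c == '\t') && !inQuotes) {
      if (haveArg) { args.push_back(cur); cur.clear(); haveArg = false; }
      ++i;
    } else {
      cur.push_back(c);
      haveArg = true;
      ++i;
    }
  }
  if (haveArg) args.push_back(cur);
  return args;
}
```

Under this model, the line server "C:\proj\" "C:\sln\" splits into just two arguments, the second being C:\proj" C:\sln" with the quotes swallowed into it. With the extra backslashes, server "C:\proj\\" "C:\sln\\" splits into the expected three: server, C:\proj\ and C:\sln\ (the paths here are made-up examples).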

<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" xmlns="">
  <ImportGroup Label="PropertySheets" />
  <PropertyGroup Label="UserMacros" />
  <PropertyGroup />
  <ItemDefinitionGroup>
    <PreBuildEvent>
      <Command>powershell.exe -ExecutionPolicy RemoteSigned -File "$(SolutionDir)Util\Build\version_info.ps1" server "$(ProjectDir)\" "$(SolutionDir)\"</Command>
    </PreBuildEvent>
  </ItemDefinitionGroup>
  <ItemGroup />
</Project>

Here is how the step will look in the project properties (note that the value is not bold, meaning it is inherited from the .props file).

The pre-build step calls the version_info.ps1 Microsoft PowerShell script. The version_info.ps1 script runs a few git commands which query current repository information and generate a header file to be used later in the build. This is the script which you want to edit to provide more or different repository information to your software:

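Param (
  [String]$Namespace,
  [String]$Project,
  [String]$GitRoot,
  [String]$HeaderFile="version.h",
  [String]$VerPrefix="<USER>/<REPO>/commit/"
)
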
Push-Location -LiteralPath $GitRoot

$VerFileHead = "`#pragma once`n`#include <string>`n`nnamespace $Namespace {`n"
$VerFileTail = "}`n"

$VerBy   = (git log -n 1 --format=format:"  const std::string VerAuthor=`\`"%an `<%ae`>`\`";%n") | Out-String
$VerUrl  = (git log -n 1 --format=format:"  const std::string VerUrl=`\`"$VerPrefix%H`\`";%n") | Out-String
$VerDate = (git log -n 1 --format=format:"  const std::string VerDate=`\`"%ai`\`";%n") | Out-String
$VerSubj = (git log -n 1 --format=format:"  const std::string VerSubj=`\`"%f`\`";%n") | Out-String

$VerChgs = ((git ls-files --exclude-standard -d -m -o -k) | Measure-Object -Line).Lines

if ($VerChgs -gt 0) {
  $VerDirty = "  const bool VerDirty=true;`n"
} else {
  $VerDirty = "  const bool VerDirty=false;`n"
}

Pop-Location

"Written $Project\" + (
  New-Item -Force -Path "$Project" -Name "$HeaderFile" -ItemType "file" -Value "$VerFileHead$VerUrl$VerDate$VerSubj$VerBy$VerDirty$VerFileTail"
).Name + " as:"
Get-Content "$Project\$HeaderFile"


A few notes regarding this script. Its placement is not very important, but it is important to include it in the repository. A git executable should be on the PATH. The way this example is organized makes it possible for each project to independently choose whether to create the version header file. That version header file, mentioned on the fifth line, can be named differently, either via an extra parameter when calling PowerShell or by changing the default value on that fifth line. It is important to include the header file name (version.h in this case) in the .gitignore file so that it is ignored by git. Finally, I did not come up with any automation to generate an HTML link to the repository on GitHub. You will need to manually edit the <USER> and <REPO> placeholders to point to your repository. I am not going over the script in greater detail for a simple reason: if you cannot make sense of it, you should not be using it, even if it is not destructive.

As an example, the version_info.ps1 script will generate a version.h header file in a project root directory, which may look something like this:

#pragma once
#include <string>

namespace server {
  const std::string VerUrl="<USER>/<REPO>/commit/<SHA_WILL_BE_HERE>";
  const std::string VerDate="2013-10-22 15:00:00 -0700";
  const std::string VerSubj="properly-tested-commit-enforcement";
  const std::string VerAuthor="The Developer <[email protected]>";
  const bool VerDirty=true;
}

Since each project generates version information by running its designated pre-build step, and the PowerShell script is set up to write data into the project's directory, I found no issues with parallel builds. I tried to force a problem by running the pre-build script on about twenty simple projects set to compile in parallel, and did not find any indication of one. That said, parallel execution is often wacky and you should check it as a culprit if you run into weird or inconsistent behaviors.

Finally, let's pull this into the code base and have it log version information and enforce that production builds run only out of clean repositories. Here is an example of what I am doing. I trust you will know how to adjust the namespace and variables to match your customization of the above steps:

#include <cstdlib>
#include <iostream>
#include "version.h"

void Server::versionLogAndVet() {

  Log::Info("Version Date: ",       server::VerDate);
  Log::Info("Version URL: ",        server::VerUrl);
  Log::Info("Version Info: ",       server::VerSubj);
  Log::Info("Version Author: ",     server::VerAuthor);
  Log::Info("Version Repo Clean: ", (server::VerDirty? "NO, it is DIRTY" : "yes" ));

#if defined(_DEBUG)
  Log::Info("Build configuration: DEBUG");
#elif defined(LOGGED)
  Log::Info("Build configuration: LOGGED");
#else
  Log::Info("Build configuration: RELEASE");
#endif

#if ! defined(_DEBUG)
  // Production configurations must not run from a dirty repository
  if (server::VerDirty) {
    Log::Error("Must NOT run production code build from a dirty repository. Server process STOPPED.");

    std::cerr << std::endl;
    std::cerr << "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" << std::endl;
    std::cerr << "Must NOT run production code build from a dirty repository." << std::endl;
    std::cerr << "STOPPED.    Press  Enter key  to exit  or close the window." << std::endl;
    std::cerr << "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" << std::endl;
    std::cerr << std::endl;

    // Assumed from the message above: wait for Enter, then exit with a failure code
    std::cin.get();
    std::exit(EXIT_FAILURE);
  }
#endif
}

In that code, all build configurations except those defining the _DEBUG macro are considered production builds and are not allowed to run a server process built from a dirty repo. Note that this is a safety guard, NOT a security feature.

You can download all code examples from this post as an archive from the gist page.

Microsoft apparently does not consider the undefined parameter expansion behavior a bug. They closed the corresponding ticket with the "by design" reasoning, meaning the undefined behavior is OK with them and they will not fix it. This means that any developers who try this solution and run into the escaping issue will need to move or potentially restructure the affected repository.
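If you want to unit test the rule itself, one option is to factor it out into a pure function. This is only a sketch under my own naming: the BuildConfig enum and isRunPermitted are hypothetical and not part of the example version.cpp, which makes the decision with preprocessor macros instead.

```cpp
// Hypothetical names for illustration; the example version.cpp uses
// the _DEBUG and LOGGED macros directly instead of an enum.
enum class BuildConfig { Debug, Logged, Release };

// A clean repository may always run. A dirty one may run only in DEBUG,
// since every other configuration counts as a production build.
bool isRunPermitted(BuildConfig config, bool repoDirty) {
  if (!repoDirty) {
    return true;
  }
  return config == BuildConfig::Debug;
}
```

With this in place, the start-up routine would only need to map the active macros to a BuildConfig value and stop the process when isRunPermitted returns false.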


Monday, August 19, 2013

On justice in Bhutan

While in Bhutan, we naturally happened to compare crime in Bhutan and Western societies when talking to locals, and then switched to how communities deal with criminals. Apparently, courts in Bhutan encourage local, "communal" conflict resolution for non-violent issues. This is more than a nuance of judicial attitude.

As far as I understand, any society has some people demonstrating deviant behaviors. I suspect it is the context which a society creates around deviant behaviors that determines the scope of their cumulative impact.

In Western societies, especially in the US, there is little personal touch in crime prevention. The punishment is often disproportionate. And community re-integration is hard and rare.

In Bhutan, with its 700K population and about 100K in the capital, most people live in small communities (by Western measures). Crime prevention is naturally personal. People know who may be up to no good, observe what's going on, and catch many troubles before they happen. Of course I am not quoting from a statistical survey but from a single witness, a.k.a. anecdotal evidence. Yet it fits well into the rest of the picture we saw in Bhutan.

If, as I was told, the judicial system depends heavily on the community, it makes all the difference in judging. No matter how empathetic a judge is, he or she judges an unrelated person. And while that is the whole point of blind justice, it has the clear downside of formal application of the law. The problem is worsened by commercial interests lobbying to fill prisons. When villagers judge, they are considering somebody who is not foreign to them, and with whom they have lived and will live for a while. They are also likely neighbors of the offender's family. It is a very different conflict resolution process, with natural re-integration done by the community.

I am not trying to romanticize Bhutan's court of community. Its potential dangers should be recognized. It depends on a particular Buddhist culture. It is not uniform: outcomes are unpredictable case by case. It will not scale to larger, weakly connected communities. Heck, I know very little about it. Nevertheless, I cannot help but observe some of its positive properties in the specific environment, and note it as a good example of a government being mindful of the nation's culture instead of adopting an "all-out" Western approach.

(Also, a somewhat related observation published in my other Google+ post on Bhutan)


Monday, March 4, 2013

Patent-less information technology / what is your job

It seems that patent systems frustrate more and more people in information technology these days. From what I hear so far, the blame for the trend falls mostly on patent holders, enforcers, their more aggressive practices, and abuse of the intent of the law.

I think there is another side of the story. Here is the foundation of my reasoning:

An invention patent is an exchange at its core. A society promises an inventor the right to exclude others from using the invention for a period of time in exchange for the public disclosure of the invention.

For a patent holder this exchange is valuable as long as they are able to exploit the exclusivity period. The cheaper the enforcement and the longer the exclusivity period, the better the business opportunity it presents.

For the society the exchange is valuable for two reasons. First, it guarantees the availability of a non-trivial invention to the general public after a fixed period of time. Second, if done properly, it creates a double network effect, encouraging and enabling further inventions.

There are two concerns: (1) how the parties (en masse) took care of their perceived obligations under the exchange and (2) whether the original premises are still as valuable and relevant as before.

The first concern is much talked about. The combination of a patent being a sellable economic right, society's failure to guarantee exclusivity at a reasonable cost, and the patenting of trivial inventions all contributed to a madhouse full of patent trolls, gorging lawyers, and companies clawing at each other's throats, with disenfranchised individual inventors in the midst of it.

The second concern does not get enough attention - yet I believe it is directly responsible for the recent uproar against the patent system.

Up until recently (on a historical time scale) inventors were few and far between. There simply were not that many inventions around, and businesses had a survival need to acquire inventions that were worth executing. It was also in society's interest to keep businesses from hoarding inventions. Inventions themselves had value past the patent exclusivity period, meaning that it was more likely for the patent to expire without being re-invented than for someone else to come up with the same invention before the expiration. Think about it: if another inventor is very likely to solve the same problem before the exclusivity period is over, why is it in a society's interest to protect the original inventor? The exclusivity timeframe is a bet, where a society bets that it is more efficient to protect one inventor and give public access to the invention after a period of time than to have the problem solved ad hoc, individually, by those who need it solved.

Nowadays the percentage of creative workers in the workforce bears no comparison to a hundred years ago. In information technology, companies actually do foster conveyor-style innovation and blur the borderline between innovation and invention. When paired with proper prioritization, innovation happens at a fast enough pace to be considered invention after a short period of time. Invention is no longer driven by the lure of exclusively seizing an opportunity. It is driven by the basic modern business need to «innovate or die». Invention is no longer a differentiator. It is on par with the rest of the business survival prerequisites: skills, priorities, motivation, and execution. Inventions no longer drive businesses. Businesses drive inventions.

What does this mean for patents?

For one thing, it means that a society's incentives to grant exclusivity are much reduced. The society gets no upside in exchange for the right to exclude. The ability to execute is now a scarcer resource than innovation. The society benefits most from incentivizing not those who know how, but those who can prioritize, motivate, and deliver. And the economy is doing it anyway. A business which gets priorities, motivation, and delivery right will find people to innovate and invent for it just fine, at the very least in information technology. Maybe in other industries as well, but I did not observe them enough in my life.

There is also no dire need for public disclosure of an innovation, should the innovator decide to keep it secret. It is very likely that someone will reinvent it before the patent exclusivity period would have expired.

It looks like there is no benefit for the society in granting patents. Worse, patents get in the way and create friction. Should information technology patents be just... abolished?

No need to scream - “But who will innovate?!!!”

Employees and freelancers will. And academics at universities will - many tax-sponsored. Innovation is a part of a job for hire from now on.

No magic.

No exclusivity.

And no Bosch-hell-like patent underworld.

This "Patent-less IT / what is your job" by Vlad Didenko is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.


Monday, February 25, 2013

Data Science as a science

For some time I felt unsatisfied every time I went to the Wikipedia article on Data Science, a canonical Radar article, and some other sources about Data Science. It does not feel that there are any actual definitions of the term "Data Science". We see more descriptions of what Data Science includes, requires, or produces. Most of those seem to have a healthy dose of marketing or job security built in. Kind of like the "Big Data" feeling: a term which may have some meaning in a narrow circle of academics, yet completely lost it when abused by marketing teams.

If you look at Wikipedia's Data Science illustration, you will find that at least three leaves "feeding" the Data Science concept are not themselves defined on Wikipedia as of January 22nd, 2013, half a year after the file was uploaded. Specifically, "Domain Expertise", "Hacker Mindset", and "Advanced Computing" are intuitively familiar concepts, but not defined by themselves. Strictly speaking they are not fit to be the basis of another definition, being complex concepts themselves.

I think the reason for that is simple. The definition is overly complex. We should make it simple to have a chance to solve problems. My suggestion is to apply the Occam's Razor principle. Cut off everything but the essentials and see if that is enough.

A simple definition

Data Science is a data science.

In other words, consider defining the term Data Science through the combination of notions of data and science.

More practically, we may put it more verbosely like this:

Data Science is accumulated knowledge about transforming data streams using the scientific method for the purpose of ongoing value extraction from similar data streams.

Data sets in this context need to have recurring commonalities to be valuable for consideration. With enough commonalities they can effectively be considered discrete forms of data streams.

An (incomplete) breakdown of Scientific Method

A common (among others) breakdown of the scientific method is mentioned in the Overview of Scientific Method on Wikipedia as follows:

  • Formulate a question
  • Hypothesis
  • Prediction
  • Test (Experiment)
  • Analysis

(I was writing this when the excellent +Daniel Tunkelang post came along)

Later, the Wikipedia article and other resources (including Daniel Tunkelang) add extra steps to the process, most importantly:

  • Replication
  • Application

Application being a practical use of the obtained knowledge to create value.

Applying Occam's Razor tests

We now need to walk through the common definition (rather, description) of Data Science and see if this short definition of "Data Science" as "data science" implies what is expressed in the wider common description from Wikipedia. I have selected what I feel are the most representative claims about Data Science from the Wikipedia article. Clearly, this is my biased selection, and I do not address repeated concerns.

Data science seeks to use all available and relevant data to effectively tell a story that can be easily understood by non-practitioners.

A result of the scientific method is what is considered a "proven fact", meaning that a consumer of that fact can use it without possessing the special skills needed for the proof itself. It is not very clear what is understood as "telling a story" in the article, but I count that as a check.

Data scientists solve complex data problems through employing deep expertise in some scientific discipline.

Complex data problems - check, deep expertise in some scientific discipline - check (considering that Data Science is a scientific discipline by definition).

It is generally expected that data scientists are able to work with various elements of mathematics, statistics and computer science… Data science must be practiced as a team, where across the membership of the team there is expertise and proficiency across all the disciplines

All mentioned areas of knowledge are necessary in the curriculum of anyone calling themselves a Data Scientist under this definition - check. Must the scientific research in this context be a team effort? Yes (see below on hardware knowledge and data size) - check.

Good data scientists are able to apply their skills to achieve a broad spectrum of end results [followed by examples of that spectrum]

Given a data taxonomy or lingo (see below) at the right level of abstraction, the results of data science research can apply uniformly to data sets from diverse fields of knowledge - check.

What is the next step?

To live up to being a science, Data Science needs to be able to pose questions in a way which allows a hypothesis to be tested on multiple data streams. As it stands right now, we cannot do it in the general case. Very rarely can we predict that "a data stream with qualities X, Y, and Z satisfies hypothesis H with α confidence." When we can make such a statement, it is great. However, there are no commonly accepted ways to describe a data stream. Without being able to describe the subject of our experiments precisely in its nature and scope, we will not enable others to reasonably replicate or analyze any experiment we conduct.

With the need to describe our subject of study, data streams, it seems to me that the most dire need of Data Science is for a modern Carl Linnaeus to come on the scene and create some sort of taxonomy of data and data streams. Although the foundations of such a taxonomy may already be present as branches of semiotics (syntax, semantics, and pragmatics), there is no unifying data taxonomy effort I am aware of which would enable posing a meaningful hypothesis. Not to say that I am right on this one :-) .

It does not have to be a taxonomy. It may happen to be a common lingo, specific to each sub-field, as happened in mathematics. The point is that to meaningfully follow the scientific method when exploring data streams we need the ability to describe the nature and scope of data elements in streams being tested for a hypothesis.

Does size matter?

With all the disrespect to marketing uses of the "Big Data" label, it is still important to understand whether data size matters for Data Science.

As in every other science, it does. A sample may be too small to prove anything, or too big and waste valuable resources. Think of this: you do not need a bowl of soup to decide if it is overly salted. If it is reasonably mixed, a tablespoon will be enough. The same goes for data samples in any science: they need to be above the threshold of statistical significance for us to be confident in the results.

Consider the confidence illustration in the article on statistical significance, also shown above. If the signal-to-noise ratio is determined by data quality and the quality of our data taxonomy or lingo, then the rest of the confidence comes from a less-than-linear correlation with sample size. Bigger samples will improve the confidence, but processing more data once a comfortable confidence level is achieved is just a waste of compute resources.

Given the poor signal-to-noise ratio in some internet-generated streams, as well as a potentially very selective hypothesis (one which makes a statement about a small subset of data), there is and will be a need to process massive amounts of data for the sake of a proof, especially on the intake end. That pulls in hardware, data layout, and architecture expertise as a prerequisite for projects under those conditions.

Finally, consider the application step of the scientific process loop. As the results of the Data Science process are applied in practice, the value of the process increases with each chunk of data processed. By the nature of the way Data Science produces value, it encourages more data being processed. Even if a hypothesis did not take much data to get to a proof, its application, the engineering chunk of the process, may end up dealing with cumulatively large data sets.
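As a rough illustration of that less-than-linear relationship, here is a small C++ sketch using the textbook margin of error for a sample mean, which shrinks with the square root of the sample size. The function names, the default z value of 1.96 (roughly 95% confidence under a normal approximation), and the assumption of independent samples are mine, not something taken from the illustration.

```cpp
#include <cassert>
#include <cmath>

// Margin of error for a sample mean under a normal approximation.
// stddev is the standard deviation of a single observation.
double marginOfError(double stddev, double sampleSize, double z = 1.96) {
  assert(sampleSize > 0);
  return z * stddev / std::sqrt(sampleSize);
}

// Smallest sample size that keeps the margin of error at or below target.
double requiredSampleSize(double stddev, double target, double z = 1.96) {
  assert(target > 0);
  const double root = z * stddev / target;
  return std::ceil(root * root);
}
```

Under these assumptions, quadrupling the sample from 100 to 400 observations only halves the margin of error, so each extra increment of confidence costs disproportionately more data.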

A contentious issue of tools

It should matter whether a prediction is tested in Hadoop or MongoDB only in one sense: that results are replicable using the technology of any capable vendor. Likewise, when chemists test a spectral analysis prediction, it is not ultimately important which brand of spectrometer is used in the experiment, but it is important that the prediction is confirmed outside of a single tool vendor's ecosystem.

Multi-vendor considerations may put extra constraints on how data stream parameters are specified in a hypothesis.

Is it going to happen?

Science is traditionally carried out in academia and proprietary research labs. However, corporate research labs most often focus on engineering innovations, not theoretical science progress, so the likes of LinkedIn, Facebook, or Google are unlikely to pick up this fairly abstract topic.

Some colleges offer what they consider Data Science training. Judging by course descriptions, though, those lean towards practical number-crunching skills, and not the application of the scientific method to data. It is yet to be seen if any of them tackle the generic abilities of Data Science, starting with precise definitions of data streams.

I will cite my lack of understanding of how academia works and withhold a prediction on whether a widespread scientific approach in Data Science (in the common sense) is a possibility. Given the current commercial focus of universities, my bets are low.

And that is a pity, because we do not know a more effective methodology than the scientific method to achieve reliable knowledge.

This work is licensed to general public under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit


Thursday, January 31, 2013

Compensation components and other terminology

The way I think of financial interactions between an enterprise and an employee is relatively simple. It has six components, which come together for a total package, if we put other benefits, like pensions, insurance, tuition, and similar aside for now. But before defining them, let me define three time horizons for the purpose of this discussion:

  • short-term - time comparable with the enterprise's financial planning period. From a quarter to a couple of years.
  • mid-term - time comparable with an average employment duration at the enterprise. From two to a dozen years.
  • long-term - anything beyond mid-term. Often tied to expected company longevity or major shareholders' interest span.

The resulting thought I want to leave behind is that different compensation tools have specific purposes. I will also try to show how dangerous it is to attempt to achieve multiple effects with a single instrument, seemingly tuned for wider application. Finally, it should show the importance of clear communication of compensation terms and the intents behind them.


Salary

A guaranteed income for a defined set of functions with clear performance and quality expectations. The purpose of the salary is to foster a sense of stability and to balance the urgency of "need it done tomorrow" with "will have to deal with my outcomes later". An undersized salary creates a negative attitude and hostile interactions between individuals and the enterprise. What I have seen is that oversized salaries drive anxiety up and encourage political posturing and other bad things. Yearly salary changes wider than 5% without changes in underlying expectations destroy the sense of stability and put people in the "underpaid" or "overpaid" buckets, with the detrimental effects mentioned above. This is the only compensation component I know of which may and should send the stability message.


Commission

A portion of revenue promised to an individual or a group based on their specific and ongoing performance, regardless of the enterprise's bottom line. My experience in commission-driven sales teams brings the observation that commission drives very short-term-minded behaviors, usually tied to the reporting period, with a planning horizon no further than 2-4 such periods. An interesting thing to observe in such teams is how different their definitions of time horizons are. Something farther than four sales compensation cycles is beyond their view of long-term and is not worth more than a lounge chit-chat mention.


Bonus

A non-guaranteed, non-recurring achievement recognition. A tool to drive or encourage a short-term behavior. To be effective, a bonus can be either a surprise "thank you" payment or tied to an extremely thoroughly defined set of objectives. Things I observed as problematic when using bonuses: (a) creating an expectation of a bonus; (b) having vague bonus criteria; (c) refusing, for whatever reason, to pay the bonus when the criteria are met; (d) announcing a bonus pool, or making one person's bonus inversely related to that of another person or group.

Specifically on each point:

  1. When a bonus becomes routine, there is a conflict of expectations. The enterprise still believes it can freely change the bonus amount to drive short-term behaviors. The employee, meanwhile, sees the bonus as a part of the salary and its stability message. When the enterprise actually varies the bonus, employees experience negative effects similar to those of salary fluctuation.
  2. Poorly defined bonus criteria drive manipulative behaviors, especially for hard-to-achieve tasks. People do "just enough" to be able to argue that the criteria were met, or come up with reasons why the effort should be compensated as well as the result would have been. Vaguely defined criteria are often a sign of lazy or incompetent management, one which feels it will not be held accountable, or one which is not motivated to even care about the company's mid-term success.
  3. Companies and departments operate in dynamic contexts. Available funds may grow or shrink. It is totally wrong to expose that uncertainty to people at the bonus level. As a tool to drive a specific objective, a bonus deteriorates very fast if promises are broken. Employees start to see it as a carrot on a stick tied to their head. They will not run faster the next time management wants them to. Because of that I think it is a very good idea to put promised bonus money aside in accounting procedures, almost in an escrow account.
  4. It so obviously creates inner tension in an enterprise, which takes the form of inner politics, manipulation, and sabotage, that it is mind-boggling how companies refer to it as "constructive inner competition". I have only seen it hurt companies in the mid-term and beyond.

Profit sharing

A portion of an enterprise or a division profit promised to an individual on an ongoing basis. It is good to use as a mid-term priorities driver. To be effective for the mid-term horizon it needs to be relatively stable and well defined. It addresses people who are expected to see and impact a bigger picture both horizontally (multiple lines of business) as well as vertically (both top and bottom lines, or income and expenses).

Given that profit sharing cannot exceed a certain portion of a balance sheet, needs to be relatively stable to have the desired mid-term impact, and should accommodate profit-sharing flexibility for future employees, it is important to avoid over-allocating it, or to build in time-guards and absolute caps. In at least one case I heard of, profit sharing declined linearly in percentage over the course of 10 years down to a third of the original formula, with a review possibility.

In terms of what can be done wrong: profit sharing does not belong at lower-impact levels in the organization. Without a material ability to impact the mid-term balance sheet, most people I saw considered it a part of salary, quite often an insignificant corporate gimmick.

A frequently (yearly or similar) changing formula of profit sharing (which includes shifting percentages between departments) creates two problems.

  1. First, its impact becomes short-term. The benefit effectively plays the role of a bonus - and people start to treat it as such.
  2. Second, profit sharing is by its nature a piece of a pie, which here works in a very damaging way. It becomes a bonus where one person's income is in inverse proportion to another person's income. A company with such an arrangement will quickly find the bonus problem (d) on its hands - and the culture will be destroyed in no time.

I have seen companies trying to combine bonus and profit sharing programs. In the specific case I am thinking about, quarterly bonus was tied to some very vague objectives and depended on a shared profit formula. This is close to an ultimate wrong - achieving none of the benefits, yet bringing all the damages.

Equity (affiliates)

I can think of two objectives when giving up equity to people working at the enterprise.

One is to motivate long-term positive impact from people who are in a position to make such an impact. It often takes the form of option grants, to leverage the motivational impact of the equity share. The pitfalls have to do with sizing and pricing the grants, as well as assessing the cultural impact on the enterprise when the information becomes public - if the public's perception matters.

Another objective is to motivate early contributors in a situation when there is not enough money to immediately compensate them otherwise. What they really get is a recognition of their investment of time and effort (or other resources) in the form of a portion of the company's future value. By the nature of a startup effort, it is already a high-risk venture for those involved. Adding uncertainty to the mix greatly reduces the perceived value of the reward. Reducing the uncertainty increases the perceived value.

Two major components of the uncertainty are control over a liquidity event and the pricing formula. Even though the specific time or price is not known upfront, a clear definition of who can "press the button", and when, is a must in all situations I have seen. From all I have heard about startups, equity vests after the first year - on average about when business viability becomes clear. The pricing formula is often not addressed for this early equity allocation in companies which drive towards a capital event - with a common expectation of a cash-out to pay for the invested time and effort. I have seen startups which do not plan a capital event in the foreseeable future spell out reasonable repurchase conditions, maybe as a multiplier of rolling revenue or similar.

The biggest problem I saw in both scenarios (startup and long-term motivation) was when the status of the equity at the time of separation was not clear - and there was no public market for that equity. When a company leaves itself an option to buy or not to buy the equity out, people stay longer than they should, without much interest in their work. They hope that something will become clear "soon" - and the rumor mill always finds something to feed the anxiety in that regard. All in all, it is more beneficial to have a well-defined separation process - and a valuation as soon as possible. That helps to move those who are non-productive out of the way faster, before the atmosphere becomes toxic.

The second issue creeps up when shares are diluted faster than they appreciate in absolute value. Equity owners obviously do not like to see the asset value go down, especially when they see it happening not because of market forces, but because they think management needs to please other people by issuing extra shares.

Equity (third parties)

As this write-up talks about relationships between an enterprise and an employee, the third parties with equity are most likely past employees or heirs of past employees. Depending on the equity transfer rules, equity may also be gifted, split at the time of a divorce, etc. For a publicly traded company this is likely not an issue. For a privately held company, however, it may become one. So transferability of the units should not be a part of the equity benefit, except for survivorship, inheritance, and where required by law.

I do not see any benefit or legitimate use for a company to encourage equity transfer away from current or past employees. I do see benefits of better focus when such equity is routinely bought out and kept to an unavoidable minimum.

The effects of mixing things

As I have mentioned at the very beginning, I saw much damage done when tools with different expected impact horizons - and purposes - were stirred together. There is a reason the options are presented above in the order they are. A common management pattern is to take a concept made to benefit a farther horizon and modify it so that it effectively plays the role of a closer-horizon tool. Above I mentioned bonuses becoming part of a salary and profit sharing treated as a bonus. In many trading companies people talk about bonuses when really those are commission arrangements. Even equity may be arranged in an operating agreement in a way that mimics profit sharing features.

A valid question is: why does it matter? Supposedly we can address the bad scenarios mentioned above, watch out for new problems, and not worry about the intricacies of compensation terminology.

It needs to be addressed because the quality of the terms used affects the quality of communication. In a good case it causes people to spend extra time getting on the same page with each new person involved. In a bad case it causes lasting misunderstanding and wrong business and personal choices. In the worst case it enables someone to use the ambiguity for manipulation, creating an arbitrage on different understandings of the concepts used. There is no positive outcome from ignoring the terminology.

I see three reasons for a discrepancy between the intent and the description to happen - besides a low-functioning plot to confuse employees.

The one reason I am on board with is potential tax benefits. If it is good for one or both of the sides, everyone clearly understands intentions and how they are arranged in compensation package, and everything is legal - nothing to worry about, the discrepancy is worth the nuisance.

Another observed reason for such a discrepancy is when the compensation structure is initially created with very clear intentions and proper verbiage. However, it later shifts to accommodate new scale, tax, or legal realities. Sometimes in this process the original intentions are lost, and new concepts serve current perceptions of the intentions, or new intentions altogether. This is not a good scenario, as it goes hand-in-hand with poor communication to employees. Really, if the management cannot maintain continuity of intentions, what would it communicate to employees?

Finally, it is possible that the management acts with the best intentions, yet does not clearly map those intentions into compensation, simply because of a lack of expertise or perceived priority. This is theoretically easier to fix than the previous case. It will take time and effort, which are unavoidable. If not done carefully, it may also leave a cultural scar behind. This is a scenario which is easier to avoid than to fix.

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit


Thursday, October 25, 2012

IT Architects in bonus-driven organisations

Over the years I have observed people being excited and disappointed when approaching IT architecture efforts in mid-size (under 500-600 people) bonus-driven organizations. I cannot recall one instance where a forethought architecture allowed a continuous effort to survive - and continue to extract value from past efforts - for more than three years. Even when the business environment was relatively stable, individuals' need to justify a bonus drove unnecessary changes into the system. With bonus periods ranging from 3 months to 1 year, there seems to be a misalignment with the timeline of benefits from a thoughtful architecture, which are realized over 1 to 5 years.

I have put together some thoughts on risks IT Architects may encounter in mid-size bonus-driven environments. From this unscientific elaboration it seems to boil down to power balance.

There are organizations which set architects on the business and technology sidelines, with no organizational or financial authority over the development process. The common rhetoric in this case is: "If your architecture suggestions are so good, people will take them to better their own future." There are a few problems with that:

  • Like almost anyone else, architects are too busy to address improvement of a working system and mostly address broken stuff. This means that in 99% of cases they are messengers of trouble.
  • More often than not, teams and individuals have vested interests in maintaining the status quo - pride, complacency, inertia, misplaced priorities, and lack of resources all contribute to it. That makes the first point really negative in their eyes, instead of giving it a positive "let's improve" twist.
  • No matter what "leadership" BS floats around about persuasive communication skills, no sweet or persuasive talk will override people's financial and career incentives in the long run. As architectural changes are mostly about the long run, misaligned incentives either require an architect to possess organizational power or make the architect's job impossible.
  • No organization I know of can achieve reasonable alignment of an individual's incentives and long-term organizational incentives in the context of a bonus system, so prominent in financial companies. That amplifies the adverse effects of the previous points.

So, which environments seem to be receptive and able to extract long-term value of architect's skills?

Obviously there are environments, which explicitly delegate power to architects. Usually those are larger organizations. The upside is that architectural effort and continuity is possible. The downside is that those organizations are often perceived as too bureaucratic and mind-numbing environments.

Secondly, there are situations, when technology managers are architects by nature, experience, or training. That is where architects by role get the necessary powers implicitly. The upside is that that leader's specific group will benefit from architectural effort. The downside is that it will not necessarily be aligned with organization as a whole, but rather with that manager's incentives.

Another scenario is when individual technologists possess the talents of an architect. There are tangible benefits for an organization in that. Those developers are usually worth their weight in gold every quarter - you get the idea. They are, however, subject to the same downside as a manager-architect - that is, their incentives impact the architecture too much.

The next scenario is when an executive in a mid-size organization has a strong conviction of the necessity of architectural work and heavy-handedly enforces the power of an architect's word. The downside is that the architect's position is politically very unstable. The position is also hard to fill with an external architect (vs. a home-grown one). The problem is that an experienced architect will have a hard time trusting a promise of organizational support. Only a few mid-size organizations are willing to build an architect's powers into an org chart with real executive weight behind them - exactly because they are so bureaucracy-averse, and that seems to them a bureaucratic thing to do.

There is one other scenario which I can speculate about. Consider an architect who is an agent of business operations, paid by business operations, who interfaces with technologists and, as a customer representative, can demand a certain architecture. That may only work if technologists are not defensive about someone telling them what to do in such detail. There are two other problems with this scenario. I have never seen it done (or talked to a witness). This scenario also risks turning into the "throw-over-the-fence" development model - the one I do not like.

In a flat-compensation organization, where bonuses play a barely noticeable role, some of these concerns are minimal or not applicable.

If I missed a scenario - let me know. I am not saying that an architect's position is a no-go in a bonus-driven organization. I am merely raising awareness of the risks present. It helps to understand how (and if) the organizational architecture supports the role of an architect.


Sunday, September 30, 2012

Cross your "tee"

In dealing with bash scripts we often need to troubleshoot a data pipeline. Many times I have seen the tee utility injected into a pipeline as a means to copy data into a log file. Often those tee calls are left behind - as they seem innocuous. It is a pass-through, changes nothing, right?

Not necessarily so. Look at the way we write pipelines:

producer | tee file.log | consumer

The fact that the producer sends data to tee through a pipe means that there is a buffer allocated for the data transfer, which is most commonly a mere 64 KB (4 x 16 KB buffers) in size. If the receiving process is not fast enough to take data off that buffer and the buffer is full, then the producer process will block on the next write. It is a totally reasonable architecture in many cases. However, if you have a fast producer-consumer pair and inject a slow tee between them, you will pay a performance penalty. Look at this very simple scenario.

Here is what is in the directory:

$ l
total 19531296
-rwxr-x---  1 vlad  staff  -   38B Jun 21 16:30 cat2null*
-rwxr-x---  1 vlad  staff  -   44B Jun 21 16:22 catcat2null*
-rwxr-x---  1 vlad  staff  -   50B Jun 21 16:24 catcatcat2null*
-rwxr-x---  1 vlad  staff  -   60B Jun 21 16:23 catntee*
-rw-r-----  1 vlad  staff  -  9.3G Jun 21 16:21 large_file
-rwxr-x---  1 vlad  staff  -  343B Jun 21 16:21 mk10g*

That is a 10 gigabyte file and a few scripts. The large_file consists of 99-char-long lines.

The cat* scripts have a relatively slow producer, reading data from local disk (SSD in my case). We have some very fast consumers, which either promptly discard the data using /dev/null, or put it through other supposedly fast interim pipes.

Note how scripts with supposedly fast pass-through stages take significantly longer with each extra buffered StdIO pipe involved (cache is primed to have comparable conditions):

$ cat large_file >/dev/null; for i in ./cat*; do echo ""; echo "----- $i -----"; cat $i; echo -n "-----"; time $i; done; echo ""

----- ./cat2null -----
cat large_file >/dev/null
real 0m7.929s
user 0m0.027s
sys 0m3.130s

----- ./catcat2null -----
cat large_file | cat >/dev/null
real 0m8.688s
user 0m0.242s
sys 0m6.435s

----- ./catcatcat2null -----
cat large_file | cat | cat >/dev/null
real 0m8.668s
user 0m0.393s
sys 0m9.941s

----- ./catntee -----
cat large_file | tee /dev/null | cat >/dev/null
real 0m9.399s
user 0m0.639s
sys 0m12.258s

It is very telling. Interim cat filters and a tee to /dev/null are not perceived as expensive, yet they turn out to consume significantly more system time. I am running on a single-CPU, multi-core MacBook Pro, and the scripts do not exhaust the available cores. So the wall (real) time does not grow as fast as the system (sys) time does. That is, if a jump of about 9% for the first pipe introduced is not a big thing for you. Keep in mind that on a busy system, or with highly concurrent software, where the existing cores are tightly scheduled, that sys time will swiftly spill over into real time. And you will not like the slow-down.

I have seen a production process on a twelve-core server speed up eightfold when a logging tee which pointed to an NFS mount was removed. The point is that mindless cruft has a good chance of hurting your system.

Beware. Keep your scripts clean. Remove unnecessary pipe components, especially if they do IO.
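
The blocking behaviour of a full pipe can be observed directly. Below is a minimal sketch, not part of the original test: the helper name producer_seconds, the sizes, and the 2-second delay are assumptions chosen for illustration, and dd writing zeros stands in for the producer. It reports how many seconds the producer needs to finish writing into a pipe whose next stage stays idle before it starts reading.

```shell
# Hypothetical helper: print (on stderr) how many seconds a producer needs to
# finish writing KB kilobytes into a pipe whose reader sleeps DELAY seconds
# before it starts draining the pipe.
producer_seconds() (
  kb=$1
  delay=$2
  start=$(date +%s)
  {
    dd if=/dev/zero bs=1024 count="$kb" 2>/dev/null   # the producer
    echo $(( $(date +%s) - start )) >&2               # producer is done here
  } | {
    sleep "$delay"                                    # the slow stage
    cat >/dev/null
  }
)

small=$(producer_seconds 1 2 2>&1)      # 1 KB: expected to fit into the buffer
large=$(producer_seconds 1024 2 2>&1)   # 1 MB: expected to overflow the buffer
echo "1 KB producer finished after ${small}s, 1 MB producer after ${large}s"
```

On a typical system the small producer should finish almost immediately, because its data fits into the pipe buffer, while the large one should finish only after the slow stage starts reading. The exact threshold depends on the pipe buffer size of the platform.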

P.S. If you would like to reproduce the test on your system, here is the mk10g script to generate that 10 gigabyte test file:


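#!/bin/bash
# Assumption: the original definition of k was lost; 1000 bytes made of ten
# 99-character lines (each ending in \n) match the file described above.
k=""
for (( i=0 ; i<10 ; i=i+1 )); do k="${k}$(printf '%099d' 0)\n"; done

rm -f 1m 1g large_file

for (( i=0 ; i<1000 ; i=i+1 )); do   # 1000 x 1000 bytes = 1 MB
  echo -en "${k}" >> 1m
done

for (( i=0 ; i<1000 ; i=i+1 )); do   # 1000 x 1 MB = 1 GB
  cat 1m >> 1g
done

for (( i=0 ; i<10 ; i=i+1 )); do     # 10 x 1 GB = 10 GB
  cat 1g >> large_file
done

rm -f 1m 1g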


Monday, April 2, 2012

Should consumer privacy be a regulated issue?

It feels like the recent criticism of Google's unified privacy policy has a misplaced focus. The controversy centers on the company, while it should focus on the practice. There are three new problems with large media companies:

Ecosystems are too big to ignore

Commercial ecosystems on the internet become big and unavoidable by definition of their success. This is important, as there are more and more professions in which professional success requires participation in certain ecosystems. In my mind, LinkedIn and Facebook are front runners by that criterion. Making people aware of privacy policies and turning them away if they do not agree is a strong-arming policy which serves only the ecosystem operator, not the participants or society at large. It is a pretend choice, not a real one.
An example is LinkedIn for some types of businesses. Can a technology recruiter survive these days without a LinkedIn contract? I do not think so.

Sender forces you to subscribe for a policy

People often operate as guests of ecosystems. They may have financial or personal needs to attend to content offered by ecosystem participants. There are other drivers which deprive visitors of real choice.

All or nothing approach - especially in hidden contracts

When an application asks a user to accept its privacy policy, there are privacy-related functions which the user has to take or leave as a package. Even though a privacy setting may affect a specific function, not the whole application, the user will need to abandon the application.
As a matter of fact, Google shows an example of the opposite - one can browse Google Maps on Android with or without GPS turned on.
This is especially a problem, when one buys a phone at a carrier shop, which is broadly advertised having an application ecosystem and specific applications in it.
A user, who buys into the contract after they liked the advertised application set, is unpleasantly surprised with the terms and conditions (in addition to the privacy policy) of, for example, Google Play. Talk about feeling strong-armed.

So, if all this coercion by companies is so prominent and still is unchecked at large, is it time to ask for regulation?

Monday, February 20, 2012

Parameters for setTimeout payload (JavaScript)

A step in building an implementation of Conway's Game of Life with my son: making the game turns run on the web page. What is not obvious to him yet is why it is important to avoid global objects, whether fed as parameters to the setTimeout function or just kept around. But it can be done - easily:

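var game = function(times, interval){
    game.times = times;
    game.interval = interval;
    // times and interval are captured by the closure, so the turn
    // callback needs no global state to keep counting
    game.turn = function(){
        if (times > 0) {
            console.log("You have " + times + " turns left");
            times = times - 1;
            setTimeout(game.turn, interval, game);
        } else {
            console.log("Game over!");
        }
    };
    game.turn(); // start the first turn
};

game( 10, 1000 );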

Tuesday, January 31, 2012

NodeJS: install oddball tarballs with NPM

It is useful at times to install a head version of a module, or a branch which is not in the npm.js repository. No big deal if the module is hosted on GitHub: npm can install from tar files available over HTTP, and GitHub allows simple URL access to tarballs.

The UI process: go to the repository of your choice, click on Code and select a branch. Otherwise go to Tags or Downloads on the right side of the page. Right-click on a tarball link of your liking and copy it to the clipboard. Then simply run npm install <paste_here> to install the module.
To directly enter the URL run

npm install

Keep in mind, that you may royally mess up versioning using this method. Be mindful.

Monday, December 26, 2011

Calculate created files' size

Quick note to self on calculating the cumulative size of regular files created in a certain timeframe:

find . -type f -a \( ! -ctime +6 \) -a -ctime +5 \
       -a -xdev -a -exec du -b \{\} \; |
cut -f1 -s |
( sz=0; while read fsz; do sz=$((${sz}+${fsz})); done; echo ${sz} )

In the example above it adds up the sizes of files changed between 5 and 6 days ago (-ctime tests the inode change time, not the creation time). Mac OS X users: remove the -b option, and note that du then reports sizes in blocks rather than bytes.
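
Where du -b is not available, the byte total can also be obtained without du. The following is a minimal sketch under assumptions: sum_file_sizes is a hypothetical helper name, file names are assumed not to contain newlines, and wc -c stands in for du as the source of the size in bytes. The demo directory and its two files exist only to make the example self-contained.

```shell
# Hypothetical helper: read file names from stdin, one per line,
# and print the sum of their sizes in bytes.
sum_file_sizes() (
  total=0
  while IFS= read -r f; do
    size=$(wc -c < "$f")          # byte count; may carry leading spaces
    total=$(( total + size ))     # arithmetic expansion ignores the spaces
  done
  echo "$total"
)

# Self-contained demo on a throw-away directory
demo_dir=$(mktemp -d)
printf 'abc' > "$demo_dir/a"      # 3 bytes
printf 'hello' > "$demo_dir/b"    # 5 bytes
find "$demo_dir" -type f | sum_file_sizes   # prints 8
rm -rf "$demo_dir"
```

The same helper can take the place of the du, cut and while stages above: feed it the output of find with the -ctime tests and no -exec. The trade-off is one wc process per file.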

Wednesday, June 15, 2011

To ask or not to ask

While Marina waits for eel to be skinned in a Chinatown shop, Dennis contemplates by the store's frog tank:

- I wonder, if Marina may cook that frog...

- (me) Ask her.

- Oh, no, she will cook it!

Friday, April 1, 2011

Questions about nuclear power industry impact

It would be interesting to learn about studies answering some questions I did not find answers to:
  1. What is the environmental and human impact of the whole supply chain for nuclear energy, not just the power plants? That would include warehousing, transportation, waste disposal.

  2. Most statistics about industrial influence on human life focus on immediate mortality and length of life. It would be interesting to have some sort of measurement of life quality for accident victims. That may include their disease history and community metrics. For example, my current perception is that victims of nuclear accidents will receive more per-person support from the US government than those suffering from the coal industry's impact. True or not?

  3. How many of the impacted people lost a bodily function - sight, hearing, olfaction, limbs' functions?

  4. How many people had to take medicine for life or generally be under medical supervision for life?

  5. What is a percentage of people who gained full employment after an incident?

  6. For how long they were able to sustain the employment?

  7. How about their average income? Is there statistically significant change?

  8. Did their living space square footage change?

  9. Were they required to relocate and how far?

Any references to materials suitable for the general public are appreciated. Also, other questions measuring life quality would be interesting to hear.

Seth's Blog post: The triumph of coal marketing

Seth's Blog post, The triumph of coal marketing, raises interesting questions. Mr. Godin positions the questions as marketing ones. Overall I understood the general thought flow of his post as:

(a) The graphic lets one experience the perception bias against the nuclear power industry. (b) It is nothing but the coal industry's marketing which moved public perception against the nuclear power industry. (c) Marketing is powerful.

The (a) and (c) statements are quite agreeable to me, albeit a bit trivial. The (b) statement seems to be a careless public disservice - and here is why I think so.

After asking myself, which issues would make me emotionally side with coal or nuclear energy, I found the following (completely unscientifically, just sifting through my memory):

  • Nuclear energy installations' ability to produce less frequent but large events, while there seem to be more smaller events, or continuous damage, from the coal industry. I believe it is a natural property of our mind to emphasize large infrequent events more than an almost routine stream of small events.

  • Highly emotional attachment of nuclear technology to its first use as a weapon in Japan. Quite the reverse (even though centuries ago), coal was initially used to keep houses warm. No wonder coal has had a more positive clout from the beginning, vs. the negative one for nuclear power. Modern nuclear arms issues and fears do not help the positive image either.

  • Traditional distrust of many government-provided reports. Such distrust may have diverse roots. For example, for the population of the former Soviet Union, it may stem from a long history of fabrications, half-truths and secrecy throughout the government. For the American population it may come from the overall secrecy of the original use of nuclear technology by the military. The distrust is also fueled by questionable reporting practices. For example, "The triumph of coal marketing" blog post is based on an original article at the Next Big Future blog. That article in turn bases its coal and nuclear industry numbers on a 10-year-old ExternE publication, a 12-year-old IAEA publication, an unspecified source for WHO data on yearly deaths, a broken link to Metal/Nonmetal fatalities, a non-specific (not separating coal vs. other) document at the US Department of Labor, a WHO Chernobyl study (single event, not industry-wide), and other resources not related to coal or nuclear power. So, there are about 10 years of missing data about nuclear energy related deaths and 10 years of non-applicable data about mining. Even though the data may be right in the ballpark and fits the purpose of the original article just fine, its extraction for Mr. Godin's post is bad enough to make this reporting either sensational, fuel for that same distrust, or both (no wonder that chart unsettles a lot of people). Such quality of "reporting" comes from all sides of the debate, which does not help either.

  • Perceived choice of living proximity to a source of trouble seems to be different for the nuclear and coal industries. The coal industry's dangerous locations are perceived to be mostly coal mines at known locations. Many people would not think twice about living near those. The nuclear industry's main perceived troublemakers are energy plants - right in everyone's backyard, or at least in a metropolitan area. There also seems to be a perception that a nuclear plant may come to a neighborhood and there is nothing people can do about it. Clearly, it is not the first thought to associate coal industry dangers with coal distribution (transportation) or power plants - people do not consider those a close-proximity danger as much as they do coal mines. Although I know nobody who would be thrilled to live by coal power plants, either. The point is that there seem to be fewer of them built, and with less buzz surrounding them.

  • Perception that long-term deaths (a.k.a. shortened life) and the overall impact caused by the nuclear industry are more plentiful and more impactful, whatever that means. I notice little awareness of specific long-term effects on life expectancy for both coal and nuclear energy. There is enough awareness of the generic long-term effects of both. So most people I know will claim that negative side-effects or events of either technology potentially shorten an individual's lifespan in general. However, I can only think of a handful of people I know who would contemplate that the averaged lifespan change is about the same for people impacted by accidents in both industries. Most people will perceive radioactive incidents to have a much bigger impact on life expectancy and quality - both stronger and longer. From what I understand, that comes from a few sources:

    1. The medical fallout from the Hiroshima and Nagasaki bombings (as in the last sentence of the RERF FAQ).
    2. General high-school physics knowledge that some isotopes have long half-lives.
    3. Mass evacuations and relocations (proper or not) which happened throughout the nuclear industry history.
    4. Government and journalistic emphasis on nuclear energy and radioactivity events in general. That seems to be coming from both of those groups' attention seeking behavior, which goes after the novelty of a subject and the subject's large events.

I did not see enough coal industry marketing to attribute my mindset to it. Unless it is an "invisible marketing".

Based on the thinking above I can only see the "marketing theory" as a conspiracy theory. Mr. Godin says "it was advertising, or perhaps deliberate story telling". For completeness, he brings up "the stories we tell", yet still converges on marketing. To me it sounds like a conspiracy theory. Here is why:

  • The "marketing theory" does not seem to explain more of the evidence than a mainstream story.
  • It employs a fallacy to jump to a conclusion.
  • There is no logic offered whatsoever to support the main thesis: "any time reality doesn't match your expectations, it means that marketing was involved".
  • No credible whistle-blowers or examples named at all, less so to support the scale of the claim.
  • The "marketing theory" is hard to falsify in this application.

Here are some reasons I can think of for people to promote conspiracy theories:

  1. if people have no personal interest in promoting a conspiracy theory:
    1. they may be dumb - clearly not Mr. Godin's case
    2. they may be bored - also an unlikely case for the busy businessman
    3. they act mindlessly
  2. if people do have a personal interest in promoting a conspiracy theory:
    1. they may represent special interest groups. Mr. Godin gives neither representations nor disclaimers; however, I dismiss that thought as improbable (*).
    2. they seek cheap publicity, often riding on a wave of some public concern.

As one may guess, my bets are on scenarios 1.3 and 2.2 for Mr. Godin's post. Both are unfortunate, as the resulting post creates negative societal value, i.e. a disservice. It is quite disheartening to see it from such a prominent public figure. Especially assuming that many readers will not care about the intricacies of the intentions.

Some needed closing comments and disclaimers:

I do not belong to any special interests party in the debate other than being a concerned citizen. I do believe that data used is correct in general and nuclear energy should have it's place in the future. I do agree that marketing is a powerful tool, which may be used for good and bad. I do not like when people exploit a community trouble without aiding the troubled community in that process.

My contributions in this post are completely unscientific. Rather they are personal observations and speculations, much like the original post of Mr.Godin.

* Also, without the relevant disclaimer a post like that would likely violate FTC rules if the author received compensation for it.

Monday, February 7, 2011

About notes on document-oriented NoSQL by Curt Monash

I feel that a couple of observations need to be added to the Notes on document-oriented NoSQL by Curt Monash.

MongoDB does not store JSON documents, but rather JSON-style documents - specifically BSON. It has important performance benefits for mostly numeric flexible-schema stores (read - health and social statistics, finance). Effectively the data does not need to be serialized in and out of a character stream between application objects and the store.

That also allows MongoDB to manage storage as a set of memory-mapped files, so that the DB server has little need for, and little overhead from, managing data persistence on disk. A side effect of the efficiency of memory-mapped files is that objects are capped in size. I believe the current limit is 8MB, but do not quote me on that.

Many RDBMS implementations can have explicit (foreign keys) and implicit (joins) references between data items. That allows one to build an arbitrary, albeit complex, data graph and have it persisted in the data's meta-data or at least somewhere between an application and the DB, for example in queries, views, and stored procedures.

BSON, like JSON, represents inherently acyclic data graphs - effectively directed trees. It has no built-in mechanism to keep a record of any relationships in the data except for containment below the object level. That seems to be consistent with MongoDB's philosophy of disclaiming any significant responsibility for meta-data. If schema management is not in the DB engine, then why should meta-schema be in the DB engine? That is a blessing and a curse, as one needs to use a proprietary format if they want to persist the structural information in the data store.

In XML-based stores one has the option of using the family of XML-related standards to record and query edges of a data graph. One can check the schema validity of an XML document. I doubt MarkLogic does it natively today; however, it is very conceivable to have XPath references from inside one document into the innards of another and have the relationship followed as a part of a query. This is a bit more than a "true document" notion, as it is a cross-document relationship.

Same thing - it is a blessing and a curse, as it brings to the table a character serialization penalty and an easy way to make a very convoluted data design. And I do not even want to go into the performance issues of random graph traversal.

Saturday, December 25, 2010

Question to Google

Dear Madam or Sir,

As of today, Dec. 25, 2010, the Google Terms of Service in paragraph 8.3 seems to allow Google to censor the content provided by users. It does not have a reference to overriding documents.

My concern is about email usage. Also as of today, Gmail's terms do not touch on the issue and in fact refer back to that same document which seems to allow censorship.

Does that mean that Google acquired the legal capacity to censor any materials it wants in emails? Please note that I am not asking about intent, but about the current legal capacity granted by users who accepted the above-mentioned terms and conditions.

Vlad Didenko

Sunday, September 12, 2010

Meta-creation, anyone?

Why are some religions so anti-evolution? Would not a God who created (programmed) such simple and cool evolution rules be cool? Kinda, meta-creation-cool? I mean, that may move dogma a bit, but it may actually win some extra friends, would it not?

Friday, September 10, 2010

Pain Ray from Raytheon

Today's NPR story on the Raytheon device followed the ongoing ACLU reporting and the specific coverage of the terrible invention.

Very, horribly bad development. Patching the result, not the root cause of the problem - and deteriorating society in the process. The device's safety is absolutely unclear. We are still convinced that microwaving the eye's cornea to 2/3 of its depth is a bad thing. Witness accounts tell that it feels like scalding their skin. Would anyone consider scalding one's eyes healthy? Other health concerns are posted in the same ACLU article.

But the worst effect is the moral one, mentioned by some commenters on the NPR story. Using the device is like a video game for prison staff - they are removed from the action. It is an ultimate humiliation and indignity for inmates, being tortured by a human "above them". The device application is not improving inmates' social outlook; it makes them feel like whipped animals and gives them a moral "permit" to be anti-social. All because bureaucrats mis-balanced prison occupancy and laws and are now quick-fixing their mistakes at the expense of human lives.

Zimbardo's study is directly applicable here with its effects and discoveries.

The other issue is that the NPR story is very unbalanced, showing lots of "pro" material and no elaboration on "cons" considerations. We hope NPR sticks to better reporting.

Tuesday, August 31, 2010

Locking daemons in bash

NOTE: Scripts in this post were tested on CentOS release 5.3 and Ubuntu 14.04 and may not work in other Unix dialects. For example, the flock(1) utility is absent from OS X 10.6.


Shell scripts designed to have only one copy running often require manual clean-up after an ungraceful shutdown or process crash. This solution provides a "self-cleaning" alternative. The full example script is posted on the GitHub page. Run two instances of the script to see it work. As a script writer you are still responsible for cleaning up any other resources to achieve a proper startup.


When writing a daemon shell script we traditionally use *.pid files to store the process ID for later use and to avoid multiple copies running simultaneously and potentially colliding. The common pattern of such use is:

self=$(basename "$0")
lock=/tmp/${self}.lock                 # assumed lock file location, adjust to taste
if [ -f "${lock}" ]
then
    echo "Another copy of ${self} potentially running." >&2
    echo "Check the ${lock} lock and remove if necessary." >&2
    exit 1
else
    echo $$ >"${lock}"
    trap "rm -f ${lock}" EXIT
fi

The problem with this approach is that after a process or system crash the lock is stale and the new copy of the daemon will not start without manual cleanup.

Another approach, documented in the flock(1) man page and used by system programmers all along, is to use a process file descriptor locked exclusively on a “pid” file. In this case, the pid file becomes a bearer of the lock instead of being a lock itself. A file descriptor lock on a file exists only as long as the file descriptor is alive. When the processes owning the file descriptor die, the file descriptor is destroyed together with the file locks associated with it. In essence, the process holds the lock in its memory, not in the file, and the lock is cleaned up by the operating system when the process quits or dies. That solves the crash problem of the first approach.

Let’s look at the code:

1:   pidf=/tmp/${self}.pid                 # Define lock file name
2:   exec 221>${pidf}                      # Open the file with descriptor 221
3:   flock --exclusive --nonblock 221 ||   # Attempt to acquire the lock
4:   {
5:       # Lock acquisition failure code   # Your custom error handler here
6:       exit 1                            # optionally exit the script
7:   }
8:   echo $$ >&221                         # Put the PID in lock file

After this block of code, the .pid file has a lock from the current process and its new children. Children receive the lock with the file descriptors, which they inherit from the parent.
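The self-cleaning behaviour can be checked with a small experiment. The sketch below is only an illustration under assumptions of mine: the temporary file from mktemp(1), the `holder` background job, the `lock_is_free` helper and the probing descriptor 232 are not part of the full example script. A background job takes the lock and is then killed with KILL, so no trap gets a chance to run.

```shell
#!/bin/bash
# Sketch: a lock held through a file descriptor disappears with its owner.
pidf=$(mktemp)                      # hypothetical stand-in for the pid file

# Background "daemon": takes the lock, then turns into a long sleep
# which keeps descriptor 221 (and the lock) until it is killed.
(
    exec 221>"${pidf}"
    flock --exclusive --nonblock 221 || exit 1
    exec sleep 60
) &
holder=$!
sleep 1                             # give the holder time to take the lock

# Probe through a separate descriptor: succeeds only if nobody holds the lock.
lock_is_free() {
    flock --exclusive --nonblock 232 232<"${pidf}"
}

lock_is_free && before="free" || before="held"
kill -KILL "${holder}"              # simulate a crash
wait "${holder}" 2>/dev/null        # make sure the process is really gone
lock_is_free && after="free" || after="held"

echo "before kill: ${before}, after kill: ${after}"
rm -f "${pidf}"
```

The pid file itself is still on disk right after the kill, yet the probe succeeds, because the operating system dropped the lock together with the process.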

To release the lock, one can either close all file descriptors holding the lock, or use the /sbin/flock --unlock ... command to explicitly release the lock.
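Both ways of releasing the lock can be tried in one short script. This is a sketch under my own assumptions: the temporary file, the `probe` helper and descriptor 232 are made up for the illustration, and flock is called without the /sbin path so that it is found wherever the system installed it.

```shell
#!/bin/bash
# Sketch: releasing the lock explicitly versus closing the descriptor.
pidf=$(mktemp)                       # hypothetical stand-in for the pid file

# Prints "free" if the lock can be taken through a separate descriptor.
probe() {
    if flock --exclusive --nonblock 232 232<"${pidf}"
    then
        echo "free"
    else
        echo "held"
    fi
}

exec 221>"${pidf}"                   # open the file with descriptor 221
flock --exclusive --nonblock 221     # acquire the lock
locked=$(probe)

flock --unlock 221                   # way 1: explicit release, 221 stays open
unlocked=$(probe)

flock --exclusive --nonblock 221     # take the lock again
exec 221>&-                          # way 2: close the descriptor
closed=$(probe)

echo "locked: ${locked}, unlocked: ${unlocked}, closed: ${closed}"
rm -f "${pidf}"
```

The explicit release keeps the descriptor open, so the same process can take the lock again later without reopening the file.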

A user may check the presence of a lock on a file using the fuser(1) utility:

$ /sbin/fuser -v ${pidf}

                     USER       PID ACCESS COMMAND
                     auser    28576 F....  monitord
                     auser    28579 F....  ping

Note that line 8 of the locking code, which stores the PID of the locking process in the .pid file, is redundant, as the information can be retrieved from the lock itself. It is only stored in the file for convenience later in the code, when we need the PID to query or manage the process. It is very important to note that the mere presence of the .pid file in this arrangement does not mean that the lock is active. The file only records the PID of the process which currently holds the lock or was the last one to hold it.

To programmatically test for the presence of the lock we need to attempt to grab the lock. If we fail, then there was another lock already on the file. If we succeed, then there was no lock on the file. In any case we need to close the file descriptor to avoid inadvertently holding the lock ourselves.

if flock --exclusive --nonblock 232 232<${pidf}
then
    echo "Open"
else
    echo "Locked"
fi
exec 232<&-                               # Make sure descriptor 232 is closed
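If the test is needed in more than one place, it may be convenient to wrap it in a function. The sketch below is one possible shape, not the code from the full example script: the `lock_status` name, its return codes and the treatment of a missing file as "Open" are my assumptions.

```shell
#!/bin/bash
# lock_status FILE
# Prints "Open" and returns 0 if nobody holds the lock on FILE,
# prints "Locked" and returns 1 otherwise. Hypothetical helper name.
lock_status() {
    local file=$1
    if [ ! -e "${file}" ]
    then
        echo "Open"                 # no pid file yet, a new daemon could start
        return 0
    fi
    # Descriptor 232 exists only for the duration of the flock command,
    # so a successful probe does not leave a lock behind.
    if flock --exclusive --nonblock 232 232<"${file}"
    then
        echo "Open"
    else
        echo "Locked"
        return 1
    fi
}
```

A script could then call something like `lock_status "${pidf}" >/dev/null || exit 1` early on to refuse to continue while the daemon is running.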

This testing technique is only good to test the presence of the lock. It is more convenient to use the fuser(1) utility to send KILL or other signals to locking processes, like this:

/sbin/fuser -k ${pidf}

The fuser utility will send the KILL signal to all processes that have the file open, and therefore to all holders of the lock, which should take care of the parent and child processes, if any.

There is more to graceful daemon writing in bash. Other topics are logging with rotation, sleeping, and configuration. Be mindful that these issues do not magically disappear when using bash - you need to address them just like you would when writing any other program.