Monday, May 11, 2015

What the tagline from Highlander can teach us about Exclusion Constraints

If you are trying to get your head wrapped around exclusion constraints via a Google search, you may get the impression that the only thing they're good for is solving the double booking problem common to reservation systems. The PostgreSQL documentation expands the possibilities a teeny tiny bit by mentioning preventing overlapping circles. But that's it! So here is this truly phenomenal feature that, thanks to an absence of varied examples, you may not think you have a use for.

I'm here to expand the possibilities by introducing what I call the Highlander use case. But before I do, I want to explain why all the examples you find on Google are about the double booking problem. The fact is, prior to the existence of exclusion constraints, the double booking problem was a really, really, really, really hard problem to solve in the relational database world. And if you managed to solve it, chances are you really didn't, and it was slow at scale because of all the locking required. So when Jeff Davis brought exclusion constraints to PostgreSQL, all the write-ups naturally focused on the fact that there was finally a safe and scalable solution to the double booking problem.
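
For context, the double booking solution those write-ups describe looks roughly like the sketch below. This is my own hedged illustration, not part of this post: the table and columns are made up, and the btree_gist extension is what lets the constraint mix equality on room with range overlap.
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE reservations (
  room   INT NOT NULL,
  during TSTZRANGE NOT NULL,
  -- Reject any two rows with the same room and overlapping time ranges.
  EXCLUDE USING gist (room WITH =, during WITH &&)
);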

What are exclusion constraints? Before we answer that, let's start by explaining what a constraint, in general, is. A constraint is a rule expressed as an expression that evaluates to a Boolean value (i.e., true or false). In essence a constraint is simply a Boolean expression. Constraints are normally checked when you try to INSERT a new row into a table or try to UPDATE an existing row. If the expression evaluates to true the INSERT or UPDATE is allowed to proceed, otherwise it fails and an error is raised.
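
To make "a constraint is simply a Boolean expression" concrete, here is a small hedged illustration; the products table is made up and has nothing to do with the survey schema below. A CHECK constraint is literally nothing more than a Boolean expression attached to a table.
CREATE TABLE products (
  id    SERIAL PRIMARY KEY,
  price NUMERIC NOT NULL CHECK (price > 0)  -- the rule is the Boolean expression "price > 0"
);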

The 4 constraints that just about everyone who has created a relational database is familiar with are:
  1. NOT NULL - A column cannot accept NULL as a value.
  2. UNIQUE - A column or combination of columns cannot have duplicates.
  3. PRIMARY KEY - Combination of #1 & #2.
  4. FOREIGN KEY - A column (or columns) in Table1 must have a matching entry in Table2.
An exclusion constraint is just a souped-up unique constraint in that it has a lot more expressive power to define exactly what unique means. Now on w/ the use case.

Let's say you sell a system where owners create events and their customers attend the events. One of the features you probably want in this system is the ability for customers to provide feedback on their experience. So basically you want surveys.

Let's assume that there is more than one type of event; therefore you may want to have more than one survey and would like to associate specific surveys w/ specific events. Let's visualize that a tiny bit:
Event_1 -> Survey_1
Event_2 -> Survey_2
Event_N -> Default Survey.

Note Event_N and Default Survey. Event_N is any event that has no explicit survey associated w/ it. And when the system sees one of these it needs to send the default survey to the customers that attended those events. So that means we need a way to mark a survey as the default survey. (Hold on to your hats folks, the Highlander reference is almost here!) And since, logically speaking, "there can be only one" default survey, it would be great if there was a way to express that at the database level, because we can't depend on the application or on people not to violate the rule that there should be only one default survey.

So let's quickly summarize what the Highlander use case is by comparing it to the standard unique constraint. With a standard unique constraint there can be no duplicates, period. But in the Highlander use case, you take one value and make it special by saying you can duplicate anything else, but not this; for this, "there can be only one"!

Exclusion constraints make addressing the Highlander use case trivial. Let's look at some SQL.
CREATE TABLE surveys (
  id SERIAL PRIMARY KEY,
  name VARCHAR(64) NOT NULL UNIQUE,
  survey_questions JSONB NOT NULL,
  is_default BOOLEAN NOT NULL DEFAULT FALSE

The is_default column solves the problem of identifying the default survey. Next we'll finish the SQL off w/ our exclusion constraint:
...,
  EXCLUDE (is_default WITH =) WHERE (is_default)
);
EXCLUDE (is_default WITH =) is equivalent to having a unique constraint on is_default. But that's not enough, for the same reason that a unique constraint doesn't solve the problem. So we need more expressive power. Exclusion constraints give us that expressive power by allowing us to specify multiple comma-separated criteria in the EXCLUDE block, and we can specialize even more by adding a WHERE clause. So that's what we've done here. In layman's terms our exclusion constraint says, "the value of the is_default column of the row being inserted or updated must not equal any other row's is_default column value, but only when is_default IS TRUE". This means we can have a gazillion rows where is_default IS FALSE but only one row where is_default IS TRUE.

If you have a good grasp of SQL's syntax in general and PostgreSQL in particular, then the exclusion constraint syntax should be obvious. But if you are thrown by the double use of is_default let me explain.

The first one is a reference to a column's name. In this case the column that we want to reference is named "is_default". The second one needs to be understood in the context of how WHERE clauses work. All WHERE clauses must be Boolean expressions, and since is_default is defined as a Boolean it's by definition a Boolean expression too. So we take advantage of that fact and make it the entirety of the WHERE clause. But if we wanted to be verbose we could have written the WHERE clause as:
WHERE (is_default IS TRUE)
or as
WHERE (is_default=true)

[UPDATE:] Thanks to an anonymous commenter I've learned that PostgreSQL is even cooler than I gave it credit for, because the Highlander use case is better solved w/ a pgified unique constraint. Basically, PostgreSQL allows us to apply a WHERE clause to your basic unique constraint (a partial unique index), giving us the same expressive power that I was using an exclusion constraint for, but w/ better performance. The lesson here is simple: if you are coming to PostgreSQL from another database or have been living the lie of using only spec-compliant syntax, read the PostgreSQL docs carefully, because chances are PostgreSQL improves upon the standard in extremely powerful ways (a.k.a., pgified).

For completeness let's include the new version of the DDL:
CREATE TABLE surveys (
  id SERIAL PRIMARY KEY,
  name VARCHAR(64) NOT NULL UNIQUE,
  survey_questions JSONB NOT NULL,
  is_default BOOLEAN NOT NULL DEFAULT FALSE
);
CREATE UNIQUE INDEX ON surveys (is_default) WHERE (is_default);
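
Either version enforces the Highlander rule the same way. Here is a quick hedged demo of my own (the survey names and the empty '{}' survey_questions documents are placeholders):
INSERT INTO surveys (name, survey_questions, is_default)
VALUES ('Default Survey', '{}', TRUE);   -- succeeds: the first (and only allowed) default

INSERT INTO surveys (name, survey_questions)
VALUES ('Event 1 Survey', '{}');         -- succeeds: is_default is FALSE, duplicates of FALSE are fine

INSERT INTO surveys (name, survey_questions, is_default)
VALUES ('Another Default', '{}', TRUE);  -- fails: "there can be only one"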

Saturday, October 11, 2014

JavaScript and the DOM, The Rent is Too Damn High!

Let x be an HTMLCollection of elements w/ a particular class name. Let's call that class name foo.

Let y be an Array of elements which need to have foo applied to them.

The algorithm is roughly:
  • For all elements in x remove the class name via Element.classList.remove().
  • For all elements in y add the class name via Element.classList.add().

Now let us assume that there is some overlap between the elements in x and y. It turns out that for some x and y it can be more than twice as fast to compute the intersection of x and y and to use that set to avoid adding and removing foo.

What makes this so weird and interesting is the fact that computing the intersection is O(x.length * y.length), while the algorithm w/o the intersection is O(x.length + y.length). In other words, the version of the algorithm where we don't compute the intersection should be faster than the version where we do. But that's not what actually happens. It's slower, 2x slower in my tests.
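
Here is a minimal sketch of the two versions I'm comparing (assuming x comes from getElementsByClassName('foo') and y is a plain Array of elements; the function names are made up):
// Version 1: remove everywhere, then add everywhere. O(x.length + y.length),
// but every single iteration touches the DOM.
function applyNaive(x, y) {
  var xs = Array.prototype.slice.call(x); // snapshot, since x is a live HTMLCollection
  for (var i = 0; i < xs.length; i++) xs[i].classList.remove('foo');
  for (var j = 0; j < y.length; j++) y[j].classList.add('foo');
}

// Version 2: compute the intersection. O(x.length * y.length) in pure JavaScript,
// but elements present in both sets never touch the DOM at all.
function applyWithIntersection(x, y) {
  var xs = Array.prototype.slice.call(x);
  for (var i = 0; i < xs.length; i++)
    if (y.indexOf(xs[i]) === -1) xs[i].classList.remove('foo');
  for (var j = 0; j < y.length; j++)
    if (xs.indexOf(y[j]) === -1) y[j].classList.add('foo');
}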

The only explanation I can come up w/ is that crossing the JavaScript/DOM divide remains extremely expensive compared to staying on the JavaScript side of the divide. Thus, the rent is too damn high!

Thursday, July 28, 2011

Warped Logic & the Open Web

I really want to shout loudly about the most annoying thing to me in Google Chrome, something Firefox gets right. But I won't, because they might fix it, and then Firefox would [currently] have no usability advantage over Chrome, and I want openness/Firefox to win. BTW, winning doesn't mean, let there be one ring to rule them all. Winning does not mean domination by a single corporate interest over the browser and thus access to the web. This must NEVER happen again (I'm talking to you Microsoft/IE6). There MUST always be an open and free (as in speech) kick-ass alternative or two available. The "kick-ass" bit is really important because people behave more pragmatically than idealistically, and if the free (as in speech) & open offering(s) is/are not compelling, freedom dies and bondage resumes.

Monday, May 16, 2011

Liar! A.K.A This one is for my PGEast 2011 peeps


If you attended PGEast 2011 then you know that a certain company had tons of coffee mugs everywhere. They couldn't give them away even though they were giving them away. It got so bad that the conference organizer thought that maybe the attendees didn't realize that the mugs were for them to take, so he made an announcement that the mugs were for us to take.

So let's recap. PGEast 2011 was almost 2 months ago & there was an ungodly number of unwanted coffee mugs from a NoSQL vendor who will not be named.

So today I start reading this blog post. Then I get to the 15th paragraph and PGEast comes crashing back because I know EXACTLY who he is calling out!

I just loved the way the author called them out w/o naming them, yet absolutely naming them.

Finally, I strongly encourage you not to skip ahead to the 15th paragraph because you need the first 14 to get the full effect of the non-name-dropping of names in 15.

Monday, June 29, 2009

Breach

I am an IT Consultant. Exactly what I do depends on who I'm working for because I'm capable of doing many things. The reason I'm being explicit about being a consultant is because I want you to understand that I don't have a single employer. I have clients, and depending on the contract sometimes their problems become my problem. This weekend was one of those times.

A system was compromised and for a brief period of time bad things may have happened in a client's name. Though the process of identifying, diagnosing, and fixing the problem took me less than an hour the ramifications were more far reaching than I could have imagined. It could have been a Titanic moment.

I really want to write about the who, what, where, when, and why of this experience but I'm not at liberty to do so. But one day I will be, so I'm putting this out there now as a reminder to myself to finish this conversation.

Monday, June 22, 2009

Eli's Dirty Jokes

The Eli's Dirty Jokes videos on YouTube are absolutely hilarious.

Monday, June 15, 2009

Stuck!

Even though the bulk of my professional writing is technical, meaning the data is mostly available and I just need to put it into a format understandable by my audience (aka, the people writing the check), I still sometimes encounter writer's block. Like right now. What the hell does that mean exactly? Because here I am writing this blog entry and I'm not having any trouble finding the words to express my frustration. Maybe I'm just naive in my understanding of what technical writing demands of an author. Maybe it requires the same level of access to the creative mind as a non-technical work? Who knows? Today I'm not offering answers. I'm just putting pen to paper in an attempt to find my way.

Thursday, June 04, 2009

Quote of the Day, 4 June 2009

Me? I'm going back to running the browser on my UNIX box. It's way too frustrating trying to be a UNIX Engineer via the Windows platform.

Tuesday, June 02, 2009

PGCon2009 Summary

I'm back from PGCon2009. No, I didn't just get back. I've been back for a smidgen over a week now. When I first got there I decided I would blog daily about it, but time didn't permit me to write in any detail. So I decided I would make notes and summarize it all when I got back. The notes thing didn't pan out thanks to twitter. It was simply easier to tweet my thoughts as I thought them than collecting them in little text files and revisiting them later. The summary idea didn't work out either because (a) I didn't have any notes, and (b) all the good stuff was already said on Planet PostgreSQL. But it's been more than a week since the conference and people are still posting summaries so now I feel like I have to say something. Here goes!

Great conference, awesome people, awesome community, and PostgreSQL is really cool technology that I'm confident in trusting my [clients'] data with.

My only gripe with the conference was the keynote address. It's nobody's fault really. The original speaker couldn't make it so they had to slap something together at the last minute. It showed. At the time, my thought was, if the rest of the conference is like this it's going to suck!

...

As I write this I've just realized that the keynote incident is a metaphor for the larger PostgreSQL project. If you attended or watched the "How to Get Your Patch Accepted" talk, it is apparent that quickly slapping something together is not how the PostgreSQL code base is developed or maintained. So I shouldn't have been surprised that the keynote was not a valid indicator about the rest of the conference. I hope I haven't disparaged any of the speakers. They did their best and everyone else in the auditorium enjoyed their keynote.

Wednesday, May 27, 2009

My Apologies to opensolaris.org Blogs

For the last three years all my blog posts have been showing up at http://opensolaris.org/os/blogs/, even though a big chunk of them had nothing to do with Solaris. At the time, I had no idea how to filter just the Solaris specific entries. Now that I think about it, I don't think Blogger even supported tagging back then, and without tags, filtering would have been impossible. Nevertheless, I apologize.

Yesterday I learned how to filter blog posts and today I've finally rectified the problem. There will be no more posts about how to make a great hot dog, book reviews, cloud computing notes, a rant about twitter, and other silliness.

Tuesday, May 26, 2009

PGCon2009 Postscript: Unit Test Your Database!

I started watching some of the PGCon2009 videos for the talks that I didn't attend while at PGCon. Last night I watched, Unit Test Your Database!, by David Wheeler. I had my come-to-Jesus moment on unit testing years ago, so I'm really happy that there is a solution for testing strictly at the database level without depending on the application layer.

Some developers make the mistake of treating the database as a glorified file system and therefore assume it doesn't need any testing. They are wrong! From the application's point of view, the database should be a black box, and application-level testing of the database should be limited to the interaction between the application and the interfaces the database exposes, like stored procedures and views. In this development paradigm the database is an independent entity and therefore needs its own set of tests to ensure that it's self-consistent. This is where the speaker's own testing framework, pgTAP, excels. It's not limited to just testing the public interfaces the database exposes. It allows you to validate the database itself, i.e., verify the structure of the tables in the database and their relationships, verify triggers, verify the existence of indexes, verify the existence of functions, verify the behavior of functions, etc.
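
To give a flavor of what that looks like, here is a tiny hedged sketch using a few of pgTAP's documented assertion functions; the users table and email column are made-up examples of my own.
-- Run inside a transaction so the test leaves no trace behind.
BEGIN;
SELECT plan(3);
SELECT has_table('users');
SELECT has_column('users', 'email');
SELECT col_type_is('users', 'email', 'text');
SELECT * FROM finish();
ROLLBACK;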

If you are still on the fence about whether unit testing your database is a worthwhile endeavor then watch the video, it may convince you. But if you've already got unit testing religion and are looking for a tool for testing your PostgreSQL database then pgTAP is going to be hard to beat.

Tuesday, May 19, 2009

The Nail in the Coffin

My experimentation with Solaris/OpenSolaris is over. Amid the uncertainty that the Oracle purchase of Sun Microsystems has introduced I got some advice at PGCon2009 today that put the nail in the coffin. I asked someone, what OS was the best for running PostgreSQL? His response was, "The OS you are most familiar with". That OS is GNU/Linux.

In recent years Sun has tried really hard to change the image of Solaris from old Unix to Linux killer (specifically RedHat Linux) to Linux-like. To drive the point home, the newer releases of OpenSolaris use bash as their default shell. But it's not enough. The biggest problem Sun had with shaking its legacy image is that it's still legacy. For example, there are about 5 different flavors of the ps command in a default Solaris install. Sun was fanatical about maintaining backwards compatibility. The problem with that kind of religious fervor is that all the mistakes of the past become a permanent part of the system and haunt it in the present. So if you are a newcomer to Solaris and there is no one there to hold your hand, it is difficult to figure out what is the best way to accomplish a task or the best flavor of a particular tool to use. In essence Solaris newcomers are acutely susceptible to The Paradox of Choice.

So for all the great technology that is in Solaris, the investment in trying to learn it just isn't worth the return right now because I can do everything I need to do in Linux in a fraction of the time and with less frustration (i.e., Solaris still doesn't have a decent package manager). And in all the cases that are important to me, Linux and its applications run faster than the Solaris equivalents.

But although the nail is in the coffin I'm not going to say goodbye. Who knows, one day I may have to call upon the Cruel Tutelage of Pai Mei and once again embrace the way of the Sun (err ... Oracle).

Choking on Birdseeds

Five days ago I joined twitter. Today I deleted my account. So called micro blogging is just not for me. It just seems totally pointless and I'm way too old to care about being hip. As far as I'm concerned if a technology doesn't make you more productive, entertain you, or help you express yourself, it's a waste of time. Twitter may be great for everybody else, it's just not for me. I'll stick to plain old blogging. It may not be sexy anymore, but it feels just right.

Friday, May 15, 2009

Going To PGCon2009

It's official. I'm going to PGCon2009. I just registered and booked my flight. It will be my first conference in quite some time. About 3 years ago a buddy and I attended a Sun Developer Day event in Atlanta. Loved the city and its people. The event was okay. The key insight I walked away with was that I should give OpenSolaris a serious look.

The fog of uncertainty that has enveloped the MySQL community and its code is the perfect opportunity to look at other solutions. So I'm going to check out what the PostgreSQL community has to offer. Does that make me some sort of database slut?

Thursday, April 02, 2009

Aspect Ratio & You

There is no shortage of libraries and toolkits available to programmers for scaling images. But if you ever find yourself in a position, as I recently have, where you need to roll your own (or maybe you are just curious), I'll explain everything you need to know about maintaining the aspect ratio of a scaled image.

When the issue of scaling images landed on me, the first thing I did was to google it. The search results were not very satisfactory, thus this blog entry.

So what are we talking about when we use the term "aspect ratio"? It's the relationship between an image's height and its width. From a programmatic point of view, the aspect ratio can tell us how much the width of an image should change if the height changes and vice versa. Aspect ratios are normally expressed in the form H:W (e.g., 1:1, 7:3, 4:5, etc.). They can also be expressed as a fraction (e.g., 1/1, 7/3, 4/5, etc.) and finally, for programmatic purposes, as a decimal. The formula for the aspect ratio is:

A = H/W
where A is the aspect ratio, H is the height of the image, and W is the width of the image. Using a bit of algebra we can rewrite the formula to solve for any of the variables. So given that A = H/W then H = A*W and W = H/A.

Let's assume we have a 485px x 1024px image that we need to generate a thumbnail for. The first thing we need to do is determine the aspect ratio of the image:

A = H/W => A = 485/1024 => A = 0.4736328125
Let's also assume that we have a rule that says a thumbnail image must be no more than 140 pixels high. We now have enough information to figure out what the width must be in order to maintain the image's aspect ratio:
W = H/A => W = 140/0.4736328125 => W = 295.587628866
We know the new width maintains the aspect ratio because 140/295.587628866 = 0.4736328125. Now let's look at some code:
// Imports needed by the methods below.
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.Transparency;
import java.awt.image.BufferedImage;

/** 
 * Scale <tt>src</tt>'s dimensions to <tt>max</tt> pixels starting w/ the largest side. 
 * 
 * @param image      The source image. 
 * @param max        The maximum number of pixels in each dimension(HxW). 
 * @param heightOnly Indicates that only the image's height should be scaled. 
 * 
 * @return The scaled image. 
 */ 
public static BufferedImage scale(BufferedImage image, final int max, boolean heightOnly) 
{ 
    if (heightOnly) 
        image = scaleByHeight(image, max); 
    else if (image.getHeight() > image.getWidth()) 
    { 
        image = scaleByHeight(image, max); 
        image = scaleByWidth(image, max); 
    } 
    else 
    { 
        image = scaleByWidth(image, max); 
        image = scaleByHeight(image, max); 
    } 
    return image; 
} 
 
/** 
 * Scale <tt>src</tt> by <tt>height</tt>. 
 * 
 * @param image The source image. 
 * @param max   The value to scale the image down to. If the current height of the image is less than <tt>max</tt> then this 
 *              method does nothing. 
 * 
 * @return A (possibly) scaled image. 
 */ 
public static BufferedImage scaleByHeight(BufferedImage image, final int max) 
{ 
    int height = image.getHeight(); 
    if (height > max) 
    { 
        int width = image.getWidth(); 
        final float aspectRatio = height / (float)width; 
        do 
        { 
            height >>= 1; 
            if (height < max) 
                height = max; 
            int k = (int)(height / aspectRatio); 
            if (k > 0) 
                width = k; 
            image = scale(image, height, width); 
        } 
        while (height > max); 
    } 
    return image; 
} 
 
/** 
 * Scale <tt>src</tt> by <tt>width</tt>. 
 * 
 * @param image The source image. 
 * @param max   The value to scale the image down to. If the current width of the image is less than <tt>max</tt> then this 
 *              method does nothing. 
 * 
 * @return A (possibly) scaled image. 
 */ 
private static BufferedImage scaleByWidth(BufferedImage image, final int max) 
{ 
    int width = image.getWidth(); 
    if (width > max) 
    { 
        int height = image.getHeight(); 
        final float aspectRatio = height / (float)width; 
        do 
        { 
            width >>= 1; 
            if (width < max) 
                width = max; 
            int k = (int)(width * aspectRatio); 
            if (k > 0) 
                height = k; 
            image = scale(image, height, width); 
        } 
        while (width > max); 
    } 
    return image; 
} 
 
/** 
 * Scale <tt>src</tt> down to height x width pixels. 
 * 
 * @param src    The source image. 
 * @param height The scaled height. 
 * @param width  The scaled width. 
 * 
 * @return The scaled image. 
 */ 
private static BufferedImage scale(BufferedImage src, final int height, final int width) 
{ 
    int type = src.getType(); 
    if (BufferedImage.TYPE_CUSTOM == type) 
        type = src.getTransparency() == Transparency.OPAQUE ? BufferedImage.TYPE_INT_RGB : BufferedImage.TYPE_INT_ARGB; 
    BufferedImage img = new BufferedImage(width, height, type); 
    Graphics2D gscale = img.createGraphics(); 
    gscale.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BILINEAR); 
    gscale.drawImage(src, 0, 0, width, height, null); 
    gscale.dispose(); 
    return img; 
}

Wednesday, February 11, 2009

Amazon's Cloud: What is EC2?

I'm doing research on Amazon's cloud computing platform. So I'll be jotting down notes here in case this information turns out to be useful to someone other than myself.

EC2 is part of Amazon's cloud computing platform. It enables one to run a full stack (operating system, applications, scripts, etc) on one or more compute nodes. There are different types of compute nodes, differentiated by CPU capacity (1-8 CPUs), RAM (up to 15GB), and local storage. The user can programmatically start and stop compute node instances to deal w/ increasing or decreasing demand. So in the case where a single compute node may not be sufficient to satisfy demand, it is the responsibility of the user to deploy application(s) that are cluster aware.

Data stored via local storage does not persist across restarts. For persistent storage one must use Elastic Block Storage (EBS) or Amazon's S3 storage service.


Thursday, March 13, 2008

Central New York

I'm in central New York (Syracuse) and I'm having a heck of a time writing anything. My imagination is as frozen as the ground outside. I'm going to the city this week to visit places I haven't been in over 17 years. Maybe I'll be inspired by the trip to jot down a few words.

Monday, February 04, 2008

Finished Reading: The End Of The Alphabet

I contracted the flu/cold during the Christmas holiday and it lasted at least three weeks. And though I'm not a big fan of Christmas [consumerism] it felt ruined. I know what you are thinking but you are wrong. It had absolutely nothing to do with the fact that Santa dissed me again. We've had beef since I was twelve so getting dissed was expected. What I didn't expect, was that his punk ass would try to assassinate me with the flu.

I'M STILL STANDING BITCH!

There will be next year and this time I'll be ready for his punk ass.

BTW, New Years sucked too.

I'm still kinda sick. I have this lingering cough and sometimes I go into these coughing fits that leave me breathless and light headed. My doctor just (Thursday) put me on some killer antibiotics (Avelox) and I need to go get a chest x-ray to see if I have pneumonia. Who knows, maybe Santa will have the last laugh yet.

So for the last three weeks I've been trying to catch up on the work that I didn't do while I was sick, and I've spent the last week and a half working 18+ hour days. The code I worked on was an addendum to an existing system in some places and a bug fix release in others. After four weeks of delays it went live on Friday. Yeah! But what I'm proudest of is during all the chaos I managed to read an entire novel, The End Of The Alphabet, by CS Richardson. How, you say? It's only 113 pages. That's right. He squeezed an entire novel into 113 pages and it took me three whole weeks to read it.

It's a love story, so if you are into that sort of thing, you may enjoy it. Be warned, it's not the "dude crosses desert and slays dragon" type of love story. It's simple, efficient, and satisfying. Given that it was a mere 113 pages, I was surprised I found it satisfying. Kudos Mr. Richardson.

My rating for this one is, sweet (sensitive meaning of the word).

Saturday, January 26, 2008

Totally Naked and Loving It!

I just got naked and saved 30+ bucks a month. Yeah me!

Maybe you should get naked too!

Thursday, January 24, 2008

My New Favorite Method

Some days I manage to really amuse myself with my work. Today I've added some flare to a method that may otherwise be really boring. So let me introduce you to my new favorite method:

  /** 
     * Flattens the sub directories of <tt>roots</tt> into a single array. 
     * 
     * @param roots The root directories. 
     * 
     * @return The [sorted] subdirectories of <tt>roots</tt>. 
     */ 
    private static File[] merge(File ... roots) 
    { 
        int x = 0; 
        int k = 0; 
        File[][] forest = new File[roots.length][]; 
        for (File sapling : roots) 
        { 
            File[] tree = sapling.listFiles(); 
            if (null != tree) 
            { 
                int leaves = tree.length; 
                if (leaves > 0) 
                { 
                    forest[x++] = tree; 
                    k += leaves; 
                } 
            } 
        } 
        File[] woods = new File[k]; 
        for (k = 0; --x >= 0;) 
        { 
            File[] tree = forest[x]; 
            int leaves = tree.length; 
            System.arraycopy(tree, 0, woods, k, leaves); 
            k += leaves; 
        } 
        Arrays.sort(woods, chipper); 
        return woods; 
    }

It cracks me up sometimes that I get paid to have this much fun.

Wednesday, January 02, 2008

Two Reasons To Use LineNumberReader Instead Of BufferedReader

Preamble

LineNumberReader extends BufferedReader, so LineNumberReader "is a" BufferedReader. This "is a" distinction is important in this discussion, because I spend a bit of time talking about BufferedReader. But because of the "is a" relationship (a.k.a. inheritance), anything said about BufferedReader is also true of LineNumberReader.

Reason Number One

LineNumberReader keeps track of line numbers. Duh!! right? But I had to say it. There would be no punchline if I didn't.

Reason Number Two

LineNumberReader compresses line terminators to a single newline character ('\n'). Now, "whoopty friggin doo!" is certainly a fair characterization of Reason Number Two, but bear with me.

I've observed over the years that many programmers use BufferedReader exclusively for its readLine method, because it allows the programmer to work with lines at a time instead of individual characters. But sometimes you need to work a character at a time. BufferedReader is great for this. It is built for efficient reading of text data from non-memory sources, like the filesystem or the network. The efficiency comes from the fact that individual calls to read do not map one-to-one with individual calls to the source(s). This results in less physical I/O, which makes things run much faster, which is always a plus.

Now, it's important to realize that you don't actually need BufferedReader to efficiently read text. You can create your own buffer using a char[] and read directly to/from your buffer. This cuts out the middle man, thus eliminating some object allocation, a bit of garbage collection, method calls and the overhead that goes with them. But what BufferedReader does for you that you would have to do for yourself, is detect the line terminators in the text. This detection is what makes readLine possible. LineNumberReader kicks it up a notch in that the "end of line" detection doesn't just affect readLine, it also affects read. LineNumberReader's read method simplifies newline processing by returning a single newline character ('\n'), regardless of the type and quantity of line terminators. So let's say you are working with text files on/from a system that uses CR+LF ("\r\n") as its line terminator. Without LineNumberReader you would have to manually process the '\r' separately from the '\n'. With LineNumberReader you will only ever see the '\n', which simplifies the logic required to process the text being read.
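
Here is a tiny hedged demo of that behavior (the class name and input string are mine): three different line terminators go in, but read() only ever hands back '\n'.

import java.io.IOException; 
import java.io.LineNumberReader; 
import java.io.StringReader; 
 
public class NewlineDemo 
{ 
    public static void main(String[] args) throws IOException 
    { 
        // "\r\n", "\r", and "\n" all come out of read() as a single '\n'. 
        LineNumberReader reader = new LineNumberReader(new StringReader("one\r\ntwo\rthree\n")); 
        for (int eof; -1 != (eof = reader.read());) 
            if ('\n' == (char)eof) 
                System.out.println("end of line " + reader.getLineNumber()); 
    } 
}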

Wrap Up

The impetus for this entry started two nights ago when I needed to whip up a little template processing system. I was working with an existing code base and I noticed that there were four types of notification files/messages being used. At least 60% of the text between the four messages was the same and about 90% of the text between [specific] pairs of messages was the same. So I set out to unify the four files into a single file, a template, from which the original four messages can be derived. I'm posting the code that parses the master template because it showcases the benefit of having LineNumberReader handle "end of line" detection. If you are not used to doing this sort of text processing it may not be obvious how the code benefits from using LineNumberReader, so I'll spell it out for you. (1) LineNumberReader tracks line numbers. This time I'm not trying to be funny. The code explicitly throws one Exception and it uses the line number to make the error message more useful. (2) There is no newline "look ahead" nor "look behind" code. Without LineNumberReader the code would need to explicitly handle CR+LF, which requires "look ahead" or "look behind" semantics.

Code

/**
 * Creates a new "payment processor" specific template from the master template. 
 * 
 * @param pp The payment processor. 
 * 
 * @return A payment processor specific template. 
 * 
 * @throws IOException If there is a problem reading the master template. 
 */ 
private String getNotificationTemplate(PaymentProcessor pp) throws IOException 
{ 
    int token = UNDEF; 
    int state = UNDEF; 
    String tagname = null; 
    String ppname = pp.name(); 
    boolean matched = false; 
    StringBuilder tnb = new StringBuilder(); 
    StringBuilder sink = new StringBuilder(); 
    InputStream stream = getServletContext().getResourceAsStream("/WEB-INF/notification_template"); 
    try 
    { 
        LineNumberReader reader = new LineNumberReader(new InputStreamReader(stream, "UTF-8")); 
        for (int eof; -1 != (eof = reader.read());) 
        { 
            char c = (char)eof; 
            switch (c) 
            { 
                case '$': 
                    switch (token) 
                    { 
                        case UNDEF: 
                        case CONTENT: 
                            token = DOLLAR_SIGN; 
                            break; 
 
                        case START_TAG: 
                            token = END_TAG; 
                            break; 
                    } 
                    break; 
 
                case '{': 
                    switch (token) 
                    { 
                        case DOLLAR_SIGN: 
                            token = OPEN_BRACKET1; 
                            break; 
 
                        case OPEN_BRACKET1: 
                            token = OPEN_BRACKET2; 
                            break; 
 
                        default: 
                            token = UNDEF; 
                            break; 
                    } 
                    break; 
 
                case '}': 
                    switch (token) 
                    { 
                        case TAG_NAME: 
                            token = CLOSE_BRACKET1; 
                            break; 
 
                        case CLOSE_BRACKET1: 
                            token = START_TAG; 
                            break; 
 
                        default: 
                            token = UNDEF; 
                            break; 
                    } 
                    break; 
 
                case '\n':// No test for '\r' cause LineNumberReader compresses line terminators to a single '\n'. 
                    switch (token) 
                    { 
                        case START_TAG: 
                            token = state = CONTENT; 
                            matched = Arrays.asList(PIPE_REGX.split(tagname = tnb.toString())).contains(ppname); 
                            sink.setLength(sink.length() - tnb.length() - 5); 
                            continue; 
 
                        case END_TAG: 
                            if (tnb.toString().equals(tagname)) 
                            { 
                                if (matched) 
                                { 
                                    sink.setLength(sink.length() - tnb.length() - 6); 
                                    matched = false; 
                                } 
                                token = state = UNDEF; 
                            } 
                            else 
                            { 
                                String message = 
                                    "Illegal closing tag at line " + reader.getLineNumber() + ". Expected " + tagname + 
                                    " but found " + tnb + " instead."; 
                                throw new RuntimeException(message); 
                            } 
                            continue; 
                    } 
                    break; 
 
                default: 
                    if (OPEN_BRACKET2 == token) 
                    { 
                        token = TAG_NAME; 
                        tnb.setLength(0); 
                    } 
 
                    if (TAG_NAME == token) 
                        tnb.append(c); 
                    else if (CONTENT != token && CONTENT != state) 
                        token = UNDEF; 
                    break; 
            } 
 
            if (CONTENT != state || (CONTENT == state && matched)) 
                sink.append(c); 
        } 
    } 
    finally 
    { 
        stream.close(); 
    } 
    String template = sink.toString(); 
    switch (pp) 
    { 
        case PAYPAL: 
            paypalNotificationTemplate = template; 
            break; 
 
        case CREDITCARD: 
            ccNotificationTemplate = template; 
            break; 
    } 
    return template; 
}

Monday, December 24, 2007

2007 Recreational Reading Summary

As a younger man, I was passionate about recreational reading. But once I started college I found I did less and less reading outside of the curriculum. Things got worse once I started my professional life in earnest. Keeping up with IT requires volumes of technical material. The explosion of self-published technical content in the form of blogs, white papers, and online documentation has reduced the number of books that I buy but has increased the amount of content that I read. Trying to assimilate all this technical content has led me to a very bad habit of scanning. I no longer enjoy reading because I'm not really reading. I'm scanning for information to solve a particular problem or trying to get to the root of how something is done. The last technical book that I can remember reading without scanning was, Mastering Regular Expressions, 1st Edition, by Jeffrey Friedl. That was years ago. So it was within this context that at the start of 2007 I decided I needed to return to recreational reading.

So far I've only managed to complete three non-tech books: (1) Who Moved My Cheese; (2) Rich Dad Poor Dad; and (3) Perfectly Reasonable Deviations From The Beaten Track: The Letters Of Richard P. Feynman. The first two were not on my official reading list. My brother handed them to me after he finished reading them because he wanted to hear what I thought. They aren't the types of books I would select for myself because they fall under the category of "self help". I'm not a fan of "self help" books because they always tell the reader things the reader already knows. What's the fun in that? Nevertheless, I read them. They aren't bad books, and if you don't already know what they have to teach they are worth checking out. They are small enough (especially Who Moved My Cheese) that if you don't want to buy them you can read them in one or two sittings at your local library or bookstore.

I finished reading, Perfectly Reasonable Deviations From The Beaten Track: The Letters Of Richard P. Feynman, moments before starting this blog entry. I'd been reading it since March but couldn't finish it because of work and all the technical content related to work. I finally finished it because I'm sick. I have a vicious cold that has had me bedridden since yesterday. During patches of clarity I read. It is the first non-fiction work I've read for recreational purposes in at least a decade. It has no traditional narrative. It is a collection of letters sent and received by a man named Richard P. Feynman. Feynman was a renowned Nobel Prize-winning physicist who died in 1988. When I first started reading it I was creeped out because I felt like a third party reading this guy's personal mail. But by the end of the 3rd or 4th chapter I had settled into a first-person point-of-view and was, on occasion, surprised by the letters I wrote and received. This book is not an all-around crowd pleaser. If you are the type of person who would be interested in the life of one of the giants of physics, it won't disappoint. Otherwise, your mileage may vary.

Thursday, December 06, 2007

Finished Reading: Beautiful Code

Disclaimer: I'm not a professional book reviewer so my rating system is succinct and should be taken with a ginormous grain of salt.

I recently finished reading, Beautiful Code, from O'Reilly. It sucked! Avoid it if you can.

IDEA-7+Glassfish, First Impression

It's been years since I've worked with Java EE in any meaningful way. Back then it was called J2EE and my application server of choice was Resin because it was blazingly fast and allowed me to bypass all the J2EE crud --like EJBs, deployment descriptors, war files, ear files, etc-- and just get stuff done. Truth be told, I've always disliked J2EE. It just always smacked of self-important (read as, Sun/Scott McNealy) bullshit to me. Sun and their "partners" (BEA, IBM, and others) managed to turn something as simple as serving dynamic content over HTTP into a multi-billion-dollar application server industry that hoodwinked a lot of people.

I made a conscious decision to avoid J2EE like raw broccoli when Caucho started transitioning from Resin 2 to 3. The transition was an abomination of galactic proportions. They completely redid the configuration system and did not provide any tools to move from the old version to the new one. And to guarantee that migration was a herculean effort, they provided documentation that was grossly incomplete and mostly inaccurate. But they didn't stop there. They were just warming up. The 3.0 release was beta quality software at best but was billed as production ready. The word went out that development on the Resin 2 branch had stopped and all development effort was going to be on 3. So any existing open issues (bug reports) against 2 were null and void and would be addressed in the 3 release. That may sound reasonable but it presupposes that 3 is usable in a production capacity. It wasn't. I spent weeks chasing deadlocks and other concurrency issues in the Resin code. So if you were a user that was affected by Resin 2 bugs you were asked to move to Resin 3, and since Resin 3 had even more bugs, you were just fuc*ed. The final insult was that while Resin 2 made the J2EE stuff optional, Resin 3 made it mandatory. I gave up on Resin 3, went live with 2.1.x, and never looked back. That was over 3 years ago.

I recently started dabbling w/ J2EE again in a limited capacity. I needed to provide an HTTP interface to a server application I am working on and there was no way in hell I was going to try to climb the whole J2EE mountain just to provide HTTP access to the app. The simplest way I found to provide HTTP support was to embed Jetty and the simplest way to hook into Jetty's HTTP engine is via the Servlet interface. So though I'm using a servlet I really don't consider it J2EE.

As I said before, it's been approximately 3 years since I deployed my last J2EE app on Resin 2.1.x and those deployments have just come due for upgrading. So the quest has been on to find an application server to replace Resin 2.1.x. Don't worry. This isn't going to turn into some long-winded diatribe about all the application servers on the market and their pros and cons and yada yada yada. The truth of the matter is I've only tried one: Glassfish. I downloaded it yesterday minutes after 8PM and installed it at around the same time. My first impression can be summed up in one word, Wow!

The installation and startup was an absolute breeze. No config files in sight. The web based admin interface is simple and intuitive. I have yet to click on the help button for clarification on anything. There is also a command line interface that does everything the web based interface does. This is extremely cool because it means configuration becomes scriptable and thus can be completely automated.

There were only 3 pain points in my journey from getting the software to deploying a test servlet (that tests database connectivity). The first pain point was setting up the connection pool for the database. The JDBC driver for PostgreSQL isn't bundled with Glassfish. The reason it's a pain point is that the configuration screen listed PostgreSQL as one of its supported databases. It even prepopulates the Datasource Classname field with the correct PostgreSQL-specific classname. So as a n00b, seeing all this, my default assumption was that Glassfish comes with everything needed to communicate with the database. After a bit of head banging I finally turned to Google, where I learned that I simply needed to copy the PostgreSQL JDBC driver to ${INSTALL_DIR}/domains/domain1/lib and restart the server.

The second pain point is not a Glassfish pain point but an IDE one. I've been using IntelliJ IDEA for years --I'm using IDEA 7-- and this was the first time I've tried to use its J2EE facilities. Pretty slick! Granted, my previous experience with this kind of thing is limited to emacs and nano on a Resin config file, so IDEA 7, to me, represents a major leap forward. I mention my bias because if this sort of thing is your day to day shtick, I hear Netbeans 6 provides an even better experience than IDEA 7 for Glassfish integration.

The first problem I ran into was understanding the difference between the local and remote configurations. It turns out local means IDEA manages the actual running of the application server. So if you already have it running, like I did, then local can't work with it. If you don't want IDEA to manage the running of the application server you have to use the remote configuration option. Remote seems limited to deployment only.

The third pain point is also an IDE one. This time I couldn't get deployment to work. This one actually pissed me off because the failure was silent on the part of IDEA. Without error messages how the hell am I supposed to know what's wrong? Fortunately, Glassfish is not the silent type. I nuked the Glassfish logs and restarted everything. After the deployment failed yet again I checked the log. This time there was a line in the log about a failed login attempt by the admin user. Finally, a clue. When I configured IDEA to work with Glassfish it prepopulated the username and password fields. My assumption was it was grabbing the data from the same place that Glassfish stores it. Wrong! The other clue was that the number of asterisks in the password field was greater than the length of the new password --I changed the password from the default during the initial configuration of Glassfish--. Obviously, the problem was that the password IDEA was using was wrong. I manually entered the correct password and that solved the problem.

Other than those 3 minor issues IDEA+Glassfish has been a pleasure to use. I just hope Glassfish's performance lives up to the hype.

Tuesday, November 13, 2007

Of Blogs and Google Docs

This is my first post from Google Docs. I've been searching for a decent blog editor for years and have not been able to find anything I really like. Most of the problem stems from the fact that the pickings are pretty slim if you are a GNU/Linux user. But things may be turning around. While surfing doggdot this morning I came across this link. It lists five blog editors for GNU/Linux with Google Docs being the fifth. So I thought I would take it out for a spin and this entry is the test drive.

Thursday, November 08, 2007

QOTD, 8 Nov 2007

As the amount of RAM installed in systems grows, it would seem that memory pressure should reduce, but, much like salaries or hard disk space, usage grows to fill (or overflow) the available capacity.

--Jake Edge, November 7, 2007

Google, I'm Still in Love

God I love Google!

I've been trying to get to the Gentoo wiki since last night. But the forces that be have conspired against me. The evil Internet gremlins doth deny me [and everybody else, for that matter] access. 'Tis hopeless it seems. Or, it would have been hopeless if not for Google. I just searched for "gentoo wiki paludis", and Google has the Gentoo wiki page cached. Friggin' brilliant. Now I know this feature has been around forever but I just wanted to remind you that Google popularized it. If Google hadn't come along and shaken up search, down to its core, I would remain at the mercy of the evil gremlins.

Friday, October 26, 2007

KernelTrap.org

One of my favorite sites on the net is KernelTrap. Though KernelTrap describes itself as "... a web community devoted to sharing the latest in kernel development news," all of the heavy lifting is done by one person, Jeremy Andrews. So I would like to take this opportunity to say thank you to Jeremy for his tireless efforts at making KernelTrap a great site and one of my favorite destinations on the net.

The feature I use the most is on the home page and it's basically Jeremy summarizing and distilling the [essence of the] conversations that happen on many of the kernel development mailing lists. Anyone who is or has ever been a member of an extremely voluminous mailing list knows how noisy it can be, where the worst-case scenario is an abysmally low signal-to-noise ratio. Plus it's no fun exploring the list after the fact because it becomes very tedious very fast, pointing and clicking your way through messages, trying to find something interesting. KernelTrap eliminates the noise and makes pointing and clicking fun again [or at least more productive]. It does this by organizing the different conversations from the different kernel development mailing lists into atoms.

An atom is simply a title and a summary of what the original thread/conversation was about, which includes quotes from the source. If the subject matter piques your interest and you are not satisfied by the summary, you can click on the title or the "read more" link to, (wait for it ...) read more! Reading more takes you to a single page that contains the individual messages that make up the original conversation, no pointing or clicking required, all you have to do is scroll and enjoy. There is even a comments section at the bottom of each entry. The comments don't actually link back to the original mailing list so you can't really use them as a mechanism for joining the conversation. The purpose they do serve [to me] is comic relief. Probably 99% of the comments posted are from people who have never written a lick of kernel code in their life and probably wouldn't know a pointer if it jumped up and poked them in the eye. Yet that doesn't stop them from complaining and passing judgment on the people who are actually involved in the conversation. I can't help but laugh.

Jokes aside, the reason I love KernelTrap is that it focuses on kernel development. And though I'm not a kernel developer, nor aspire to be one, the information provided is useful nonetheless. You see, the kernel is the most important piece of software that runs on your computer, because it is responsible for managing the resources that make up the computer (CPU, memory, disk, etc). So whether your computer is running 1 or 1,000 processes, or your network application is handling 1 or 1,000 connections, it's the kernel that is responsible for keeping things running smoothly, or at least running. The consequence of being responsible for the computer is that the kernel ends up being the most scalable piece of software on the computer. It is this feature of kernels that interests me, because the lessons of scalable design and implementation, inherent in [good] kernels, aren't limited to kernel software. A lot of the lessons can be applied to user land software (my domain). So though the conversations may not tell you how things are implemented (the exception is the Linux Kernel Mailing List, because patches [code] are included directly in the messages themselves), they can tell you why and who is doing it.

The newest KernelTrap feature is quotes. A quote is another type of atom that is simply a quote lifted from a larger conversation that is either insightful, funny, or both. My favorite for this week comes from Theo de Raadt of OpenBSD fame:

"You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can't write operating systems or applications without security holes, can then turn around and suddenly write virtualization layers without security holes."
— Theo de Raadt in an October 24th, 2007 message on the OpenBSD -misc mailing list.

So if you have never visited KernelTrap I highly recommend you take a look, and if you are looking for a more Linux-centric world view, LWN can't be beat.

Tuesday, October 23, 2007

7 REPSLM-C, Expanded

This post is a follow up to 7 Reasons Every Programmer Should Love Multi-Core and a direct response to this comment.

Maybe I should have put 6 before 4 because 6 makes the point that most of today's programs aren't written to take advantage of multi-core. So what exactly do I mean by take advantage? It seems you think I'm saying, it means simply running today's GUI, client/server, and P2P apps as is, on multi-core machines and expecting magic to happen. But that is not what I'm talking about.

Aside:

With some existing apps like Postfix, WebSphere, SJSDS, IntelliJ IDEA 7.0, PVF, and most Java bit torrent trackers/clients [just to name a few], magic can happen, while others require tuning (e.g., Apache, PostgreSQL, and many others). Most applications, especially desktop GUI apps, will require a major rewrite to take full advantage of multi-core machines.

What I'm talking about is programmers finding opportunities to exploit parallelism at every turn, which is what items 1-4 are about. Let's take something as mundane as sorting (e.g., merge sort, quicksort) as an example. Merge sort and quicksort are excellent use cases for applying a divide and conquer strategy. They consist of a partitioning step, a sorting step, and a combining step. Once partitioned, the partitions can be distributed across multiple threads [and thus multiple processors/cores/hardware-threads] and sorted in parallel. Some of you may say, "that's only 1 out of 3 steps, big deal." Others may take it even further and say, "1 out of 3. That means 2/3rds of the algorithm is sequential. Amdahl's Law at work buddy!". But what you would be overlooking is, [in the serial version] for a large enough dataset, the sorting step would dominate the runtime. So even though we have managed to only parallelize a single step we can still realize substantial runtime performance gains. This behavior is expressed quite eloquently by John L. Gustafson in his [short] essay, Reevaluating Amdahl's Law.
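
To make the sorting example concrete, here is a hedged sketch (the class and method names are mine; a real implementation would recurse and tune partition sizes): partition once, sort the two halves on separate threads, then merge sequentially.

import java.util.Arrays; 
import java.util.concurrent.ExecutorService; 
import java.util.concurrent.Executors; 
import java.util.concurrent.Future; 
 
public final class ParallelSort 
{ 
    /** Partition once, sort the halves in parallel, merge on the calling thread. */ 
    public static int[] sort(final int[] data) throws Exception 
    { 
        final int mid = data.length / 2; 
        final int[] left = Arrays.copyOfRange(data, 0, mid); 
        final int[] right = Arrays.copyOfRange(data, mid, data.length); 
 
        // Only the sorting step is parallelized; partitioning and merging stay sequential, 
        // yet for large inputs the sort dominates the runtime, so the win is still real. 
        ExecutorService pool = Executors.newFixedThreadPool(2); 
        Future<?> l = pool.submit(new Runnable() { public void run() { Arrays.sort(left); } }); 
        Future<?> r = pool.submit(new Runnable() { public void run() { Arrays.sort(right); } }); 
        l.get(); 
        r.get(); 
        pool.shutdown(); 
 
        // Standard merge of the two sorted halves. 
        int[] merged = new int[data.length]; 
        for (int i = 0, j = 0, k = 0; k < merged.length; k++) 
            merged[k] = (j >= right.length || (i < left.length && left[i] <= right[j])) ? left[i++] : right[j++]; 
        return merged; 
    } 
}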

So what does all of this have to do w/ your comment? Let's start with what you said about GUI applications taking advantage of multi-core, and I'll use boring old sorting to make my point.

It is sometime in the future and there is a guy named Bob. Bob's current computer just died (CPU burned out) and he goes out and buys a new one. Bob doesn't know or care about multi-core [or whatever the future marketing term is for [S]MP]. He just wants something affordable that will run all his applications. Nevertheless, his new machine is a 128-way box (it is the future after all), with tons of RAM. Bob takes his new machine home and fires it up. Bob keeps all his digital photographs and video on a 4 terabyte external storage array. He bought the original unit years ago before 32 terabyte hard drives came standard with PCs. You see, Bob's daughter is pregnant and is in her final trimester and her birthday is just around the corner. Bob wants to make her a Blue-HDD-DVDDD-X2 disk containing stills and video footage of her life, starting before she was even born, and up to her current pregnancy. It begins with the ultrasound video of her in her mother's womb and ends with the ultrasound of his grandchild in his daughter's womb. So Bob fires up his [hypothetical] image manager and tells it to create a workspace containing all the images and videos on the storage array, sorted by date. It's almost 30 years' worth of data. And though the image manager software is old, some programmer, long ago, wrote a sorting algorithm that would scale with the number of processors available to it. So Bob clicks a button and in less than 5 minutes 3.5 terabytes of data has been sorted and is ready to be manipulated. So what's the point? The point is it doesn't matter that "99%" of the CPU time was spent "waiting for some event", because when it mattered (when Bob clicked the button), all the available resources were employed to solve the user's problem efficiently, resulting in a great user experience. Now I know the example is contrived but the premise upon which it is based is real. If you look at most GUI applications of today, very few of them can handle multiple simultaneous events or even rapid-fire sequential events. In large part that's because most of the work (the action to be performed) happens on the same thread that is supposed to be listening for new events, which is why the user interface freezes when the action to be performed requires disk or network access or is CPU bound. The classic example is loading a huge file into RAM from disk. Most GUI apps provide a progress meter and a cancel button but once the I/O starts, clicking cancel doesn't actually do anything because the thread that's supposed to be processing mouse events is busy reading the file in from disk. So yes, GUI application programmers should Love Multi-Core!

Client/Server and P2P are in the same boat in that they are both network applications. But they, like GUI and every other problem domain, can benefit from data-decomposition-driven parallelism (divide and conquer). I'm not going into great detail about how network applications benefit from multi-core because that subject has been beaten to death. I'll just say a couple things. The consensus is more processors equal more concurrent connections and/or reduced latency (users aren't waiting around as long for a free thread to become available to process their requests). Finally, multi-core affects vertical and horizontal scaling. Let's say you work at a web company and the majority of your web traffic is to static content on your web server (minimal contention between requests). Let us also assume that you have unlimited bandwidth. The web server machine is a 2-socket box and quad-core capable but you only bought one 1P processor. A month passes and you got dugg and the blogosphere is abuzz about what you are selling. Customers are browsing and signing up in droves. Latency is climbing and connections are timing out in droves. You overnight 2 quad-core CPUs and additional RAM. Latency drops to a respectable level and you just avoided buying, powering, and cooling a brand new machine that would have cost you 3x as much as you just spent for the CPUs and RAM. That is scaling vertically. If you were building a cluster (horizontal scaling), multi-core means you need fewer physical machines for the same amount of processing power. In other words, multi-core reduces the cost of horizontal scaling both in terms of dollars and latency. Access to RAM will always be faster than the network. So there is a lot less latency with performing the work locally --pushing it across the FSB, HyperTransport, etc, to multiple cores-- than pushing it out over the network and [eventually] pulling the results back. So yes, if you are coding or deploying network applications, P2P, client/server, or otherwise, you should Love Multi-Core!

Saturday, October 20, 2007

7 Reasons Every Programmer Should Love Multi-Core

  1. The technology is not new, it's old. It's just really cheap SMP, and the SMP domain (shared memory model) is a well-understood domain. So there are tons of resources (books, white papers, essays, blogs, etc) available to get you up to speed.

  2. Shared memory concurrency is challenging. It's guaranteed to grow your brain.

  3. Most programming languages [already] have language and/or API level support (threads) for multi-core, so you can get started right now.

  4. There are a plethora of computing domains that benefit from increased parallelism. The following are just a few off the top of my head: GUI applications, client/server, p2p, games, search, simulations, AI. In other words, there won't be a shortage of interesting work to do in this space.

  5. Most programmers [and their managers] don't have a clue about concurrency so you can easily impress them with your skills/CNM (Concurrency Ninja Moves).

  6. The majority of today's programs aren't written with multi-core in mind so mastering concurrency programming means you won't be out of a job any time soon. Somebody has to write the multi-core aware version of all those apps.

  7. Since most programmers are clueless about concurrency, mastering it means you'll be smarter than millions of people. Being smarter than your [so-called] peers is really satisfying.