Saturday, April 29, 2006

Amen!

Read the title.

Click on the link.

Read the post.

"That's all folks!"

Monday, April 24, 2006

Sublime

An interesting read, to say the least.

Monday, April 10, 2006

New Template

I've just switched my Blogger template, so if you are a regular visitor, don't freak out. The new template makes better use of screen real estate and is more code-friendly.

Let me know what you think.

Saturday, April 08, 2006

regex vs. XML

I needed to screen-scrape Wikipedia for the list of top-level domains (TLDs) for my email app so I could build an index of suffixes that would reduce the number of martian addresses the app has to process. So I surfed over to the TLD page to have a look-see. Lo and behold, those beautiful Wikipedians were nice enough to produce well-formed XML (actually, XHTML 1.0 Transitional) content. My heart leaped with joy because no screen-scraping was needed. With a little bit of XPath I would have my index in no time.

WRONG!!!!!!!!!!!

I spent the next hour-plus trying to get XOM to spit out my list via XPath, because I sure as hell wasn't going to walk the tree manually extracting the TLDs. Do you have any idea how much code that would take? I had a feeling the failure had something to do with how XOM handles namespaces: when I access the tree the hard way, it demands that I specify the document's namespace URI to look up any element, even though the document declares one and only one namespace. Now this isn't a knock against XOM, though I'm sure that if I had used dom4j it would have worked, because dom4j is more concerned with being useful than being correct. Each has its place. By this time I was so aggravated I just didn't feel like downloading dom4j, unpacking it, and setting my classpath, yada, yada, yada.
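For what it's worth, the namespace hunch lines up with how XPath 1.0 works: an unprefixed name like td only matches elements in no namespace, so a document that puts everything in the XHTML default namespace needs a prefix bound to that URI before any query hits. Here is a minimal sketch of that idea using the JDK's built-in javax.xml.xpath API rather than XOM (so this is not the code I was fighting with). The class name, the "x" prefix, and the sample markup are all made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: JDK XPath against a namespaced XHTML fragment.
class XhtmlXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Evaluates expr against the given markup and returns the selected node values.
    static List<String> select(String xhtml, String expr) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // without this the namespace is ignored entirely
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the made-up prefix "x" to the XHTML namespace URI.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "x".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "x" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                String p = getPrefix(uri);
                return (p == null ? Collections.<String>emptyList()
                                  : Collections.singletonList(p)).iterator();
            }
        });

        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++)
            values.add(nodes.item(i).getNodeValue());
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for the real page; no DOCTYPE, so nothing is fetched.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><table>"
                + "<tr><td><a href=\"/wiki/.com\">.com</a></td></tr>"
                + "<tr><td><a href=\"/wiki/.org\">.org</a></td></tr>"
                + "</table></body></html>";
        System.out.println(select(sample, "//td/a/text()"));     // unprefixed: no matches
        System.out.println(select(sample, "//x:td/x:a/text()")); // prefixed: the two TLDs
    }
}
```

The unprefixed query comes back empty even though the document is full of td elements, which is the same kind of wall I hit. Whether XOM needs exactly this sort of prefix binding is my assumption, not something I verified.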

Regex to the rescue

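This is how I got what I needed the old-fashioned screen-scraping way. Five minutes tops from concept to results:
import java.io.*;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The wrapper class and its name are only here so the snippet compiles.
class TldScraper {
    public static void main( String[] args ) throws IOException {
        String tld = "http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains";
        // Possessive quantifiers (++) never backtrack; group 1 is the TLD minus its dot.
        Matcher regex =
            Pattern.compile( "<td><a[^>]++>\\.(\\w++)</a></td>" ).matcher( "" );
        BufferedReader reader =
            new BufferedReader(
                new InputStreamReader(
                    new BufferedInputStream( new URL( tld ).openStream() ), "utf-8" ) );
        PrintWriter writer =
            new PrintWriter(
                new BufferedWriter(
                    new FileWriter( new File( "/home/HashiDiKo/temp/TLD" ) ) ) );

        // matches() only accepts a line that is exactly one table cell.
        for( String line; null != (line = reader.readLine()); )
            if( regex.reset( line ).matches() )
                writer.println( regex.group( 1 ) );

        reader.close();
        writer.close();
    }
}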
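
One thing worth knowing about the loop above: matches() insists that the entire line is the table cell, so a line with leading whitespace or extra markup is silently skipped. A small sketch of the difference between matches() and find(), run against made-up sample lines instead of the live page (the class and method names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the same pattern, checked against sample lines.
class TldRegexSketch {
    static final Pattern CELL =
        Pattern.compile("<td><a[^>]++>\\.(\\w++)</a></td>");

    // Returns the captured TLD, or null if the line does not qualify.
    // wholeLine = true uses matches(); false uses find().
    static String extract(String line, boolean wholeLine) {
        Matcher m = CELL.matcher(line);
        boolean hit = wholeLine ? m.matches() : m.find();
        return hit ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String exact = "<td><a href=\"/wiki/.com\" title=\".com\">.com</a></td>";
        String indented = "  <td><a href=\"/wiki/.org\">.org</a></td>";
        System.out.println(extract(exact, true));     // com
        System.out.println(extract(indented, true));  // null
        System.out.println(extract(indented, false)); // org
    }
}
```

The page happened to put each cell on its own line with nothing around it, which is why matches() was enough for me. If the markup were laid out differently, find() would be the more forgiving choice.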

Saturday, April 01, 2006