Design Decisions: April 2006

I needed to screen-scrape Wikipedia for the list of top-level domains (TLDs) for my email app so I could build an index of suffixes that would reduce the number of martian addresses the app would have to process. So I surfed over to the TLD page to have a look-see. Lo and behold, those beautiful Wikipedians were nice enough to produce well-formed XML (actually, XHTML 1.0 Transitional) content. My heart leaped with joy because no screen-scraping was needed. With a little bit of XPath I would have my index in no time.

WRONG!!!!!!!!!!!

I spent the next hour+ trying to get XOM to work and spit out my list via XPath, because I sure as hell wasn't going to manually walk the tree extracting the TLDs. Do you have any idea how much code that would take? I had a feeling the reason it didn't work had something to do with how XOM handles namespaces, because when I access the tree the hard way it demands that I specify the namespace URI of the document to look up any element in the tree, even though the document declares one and only one namespace. Now this isn't a knock against XOM, though I'm sure if I had used dom4j it would have worked, because dom4j is more concerned with being useful than being correct. But each has its place. By this time I was so aggravated I just didn't feel like downloading dom4j and unpacking it and setting my classpath, yada, yada, yada.
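For what it's worth, the namespace hunch is consistent with how XPath 1.0 works in general: an unprefixed name in an expression only matches elements in no namespace, so elements under a default XHTML namespace need a prefix bound to that URI. Here is a minimal sketch of that behavior. Note the assumptions: it uses the JDK's javax.xml.xpath rather than XOM, the class name, the prefix "x", and the tiny SAMPLE document are all made up for illustration, and it is not a diagnosis of what XOM actually did in my case.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for the real page: one default namespace, two TLD cells.
    static final String SAMPLE =
        "<html xmlns=\"" + XHTML_NS + "\"><body><table>"
        + "<tr><td><a href=\"/wiki/.com\">.com</a></td></tr>"
        + "<tr><td><a href=\"/wiki/.org\">.org</a></td></tr>"
        + "</table></body></html>";

    // Counts the nodes selected by an XPath expression against SAMPLE.
    public static int count( String expression ) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware( true ); // off by default in the JDK parser
        Document doc = factory.newDocumentBuilder()
            .parse( new InputSource( new StringReader( SAMPLE ) ) );

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the (arbitrary) prefix "x" to the XHTML namespace URI.
        xpath.setNamespaceContext( new NamespaceContext() {
            public String getNamespaceURI( String prefix ) {
                return "x".equals( prefix ) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix( String uri ) { return null; }
            public Iterator<String> getPrefixes( String uri ) { return null; }
        } );

        NodeList nodes =
            (NodeList) xpath.evaluate( expression, doc, XPathConstants.NODESET );
        return nodes.getLength();
    }

    public static void main( String[] args ) throws Exception {
        System.out.println( count( "//td/a" ) );     // unprefixed names
        System.out.println( count( "//x:td/x:a" ) ); // names bound to the XHTML namespace
    }
}
```

The unprefixed query selects nothing, while the prefixed one finds both links. If XOM follows the same rule, binding a prefix to the XHTML namespace URI in the query would be the first thing to try.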

Regex to the rescue

This is how I got what I needed the old-fashioned screen-scraping way. 5 minutes tops from concept to results:

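import java.io.*;
import java.net.URL;
import java.util.regex.*;

public class ScrapeTLD {
    public static void main( String[] args ) throws IOException {
        String tld = "http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains";
        // Matches a table cell whose link text is ".xyz" and captures "xyz".
        Matcher regex = 
            Pattern.compile( "<td><a[^>]++>\\.(\\w++)</a></td>" ).matcher( "" );
        BufferedReader reader =
            new BufferedReader(
                new InputStreamReader(
                    new BufferedInputStream( new URL( tld ).openStream() ), "utf-8" ) );
        PrintWriter writer = 
            new PrintWriter( 
                new BufferedWriter( 
                    new FileWriter( new File( "/home/HashiDiKo/temp/TLD" ) ) ) );

        // Write one TLD per line; matches() requires the whole line to fit the pattern.
        for( String line; null != (line = reader.readLine()); )
            if( regex.reset( line ).matches() )
                writer.println( regex.group( 1 ) );

        reader.close();
        writer.close();
    }
}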
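To see what the pattern picks up without hitting the network, here is a small sketch that runs it against a couple of made-up lines. The sample markup and the class and method names are assumptions for illustration only; the real page's markup may differ. It also shows one design choice worth knowing about: matches() only succeeds when the entire line is the table cell, whereas find() would also accept a cell with other text around it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TldPatternSketch {
    // Same pattern as in the scraper: possessive quantifiers (++) never backtrack.
    static final Pattern TLD_CELL =
        Pattern.compile( "<td><a[^>]++>\\.(\\w++)</a></td>" );

    // Whole-line match, as in the scraper; returns null when the line doesn't fit.
    public static String extract( String line ) {
        Matcher m = TLD_CELL.matcher( line );
        return m.matches() ? m.group( 1 ) : null;
    }

    // Looser variant: accepts the cell anywhere in the line.
    public static String extractAnywhere( String line ) {
        Matcher m = TLD_CELL.matcher( line );
        return m.find() ? m.group( 1 ) : null;
    }

    public static void main( String[] args ) {
        // Hypothetical sample lines, not copied from the real page.
        String cell = "<td><a href=\"/wiki/.com\" title=\".com\">.com</a></td>";
        String indented = "  " + cell;

        System.out.println( extract( cell ) );             // com
        System.out.println( extract( indented ) );         // null
        System.out.println( extractAnywhere( indented ) ); // com
    }
}
```

Sticking with matches() keeps the scrape strict, which worked for this page; find() is the more forgiving option if the markup ever picks up indentation or extra cells on the same line.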

Saturday, April 08, 2006

regex vs. XML

Regex to the rescue