Saturday, April 08, 2006

regex vs. XML

I needed to screen-scrape Wikipedia for the list of Top Level Domains for my email app so I could build an index of suffixes that would reduce the number of martian addresses the app would have to process. So I surfed over to the TLD page to have a look-see. Lo and behold, those beautiful Wikipedians were nice enough to produce well-formed XML (actually, XHTML 1.0 Transitional) content. My heart leaped with joy because no screen-scraping was needed. With a little bit of XPath I would have my index in no time.

WRONG!!!!!!!!!!!

I spent the next hour+ trying to get XOM to spit out my list via XPath, because I sure as hell wasn't going to manually walk the tree extracting the TLDs. Do you have any idea how much code that would take? I had a feeling the failure had something to do with how XOM handles namespaces: when I access the tree the hard way, it demands that I specify the namespace URI of the document to look up any element, even though the document declares one and only one namespace. Now this isn't a knock against XOM, though I'm sure if I had used dom4j it would have worked, because dom4j is more concerned with being useful than being correct. Each has its place. By this time I was so aggravated I just didn't feel like downloading dom4j, unpacking it, and setting my classpath, yada, yada, yada.
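For what it's worth, the namespace hunch fits how XPath 1.0 treats names: an unprefixed name like td only matches elements in no namespace, so a document with a default XHTML namespace needs a prefix bound to that URI inside the expression. Here is a minimal sketch of that behavior using the JDK's own javax.xml.xpath rather than XOM. The class name, the "x" prefix, and the two-row sample document are all made up for illustration; this is not the code I was fighting with.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical two-row stand-in for the Wikipedia table, not the real page.
    static final String SAMPLE =
        "<html xmlns=\"" + XHTML_NS + "\"><body><table>"
        + "<tr><td><a href=\"/wiki/.com\">.com</a></td></tr>"
        + "<tr><td><a href=\"/wiki/.org\">.org</a></td></tr>"
        + "</table></body></html>";

    // Binds the made-up prefix "x" to the XHTML namespace for XPath lookups.
    static final NamespaceContext XHTML_CONTEXT = new NamespaceContext() {
        public String getNamespaceURI(String prefix) {
            return "x".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "x" : null;
        }
        public Iterator<String> getPrefixes(String uri) {
            return XHTML_NS.equals(uri)
                ? Collections.singletonList("x").iterator()
                : Collections.<String>emptyList().iterator();
        }
    };

    // Counts the nodes an XPath expression selects in SAMPLE.
    static int count(String expression, boolean bindPrefix) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep namespace information in the DOM
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            if (bindPrefix) {
                xpath.setNamespaceContext(XHTML_CONTEXT);
            }
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//td/a", false));    // unprefixed names
        System.out.println(count("//x:td/x:a", true)); // names bound to the XHTML namespace
    }
}
```

On the sample, the unprefixed expression selects nothing and the prefixed one selects both links. XOM's query method accepts an XPathContext that plays the same prefix-binding role, so the same idea should carry over there.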

Regex to the rescue

This is how I got what I needed the old-fashioned screen-scraping way. 5 minutes tops from concept to results:
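import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String tld = "http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains";
// Matches a table cell holding a single link whose text is ".something";
// group 1 captures the TLD without the leading dot.
Matcher regex = 
    Pattern.compile( "<td><a[^>]++>\\.(\\w++)</a></td>" ).matcher( "" );
BufferedReader reader =
    new BufferedReader(
        new InputStreamReader(
            new BufferedInputStream( new URL( tld ).openStream() ), "utf-8" ) );
PrintWriter writer = 
    new PrintWriter( 
        new BufferedWriter( 
            new FileWriter( new File( "/home/HashiDiKo/temp/TLD" ) ) ) );

// matches() requires the whole line to be the cell, so other markup is skipped.
for( String line; null != (line = reader.readLine()); )
    if( regex.reset( line ).matches() )
        writer.println( regex.group( 1 ) );

reader.close();
writer.close();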
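One thing worth knowing about the loop above: matches() only succeeds when the entire line is the table cell, so it quietly depends on the page putting each cell on a line of its own. A small sketch of the difference between matches() and find(), using the same pattern on invented sample lines (the class name, method names, and sample markup are hypothetical, not copied from the real page):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TldRegexSketch {
    // Same pattern as in the scraper above.
    static final Pattern TLD_CELL =
        Pattern.compile("<td><a[^>]++>\\.(\\w++)</a></td>");

    // Returns the TLD (no leading dot) only if the whole line is one cell, else null.
    static String extractWithMatches(String line) {
        Matcher m = TLD_CELL.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    // Returns the first TLD found anywhere in the line, else null.
    static String extractWithFind(String line) {
        Matcher m = TLD_CELL.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String bare = "<td><a href=\"/wiki/.com\" title=\".com\">.com</a></td>";
        String indented = "    " + bare + "<td>commercial</td>";

        System.out.println(extractWithMatches(bare));     // whole line is the cell
        System.out.println(extractWithMatches(indented)); // extra text around the cell
        System.out.println(extractWithFind(indented));    // find() tolerates the extra text
    }
}
```

If the markup ever changes so that cells share a line or gain indentation, swapping matches() for find() would be the smallest fix, at the cost of possibly picking up cells you did not mean to match.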

1 comment:

  1. Anonymous, 6:20 AM

    just because you suck at using XPath doesn't mean it's bad