I needed to screen-scrape Wikipedia for the list of Top Level Domains for my email app so I could build an index of suffixes that would reduce the number of martian addresses the app would have to process. So I surfed over to the TLD page to have a look-see. Lo and behold, those beautiful Wikipedians were nice enough to produce well-formed XML (actually, XHTML 1.0 Transitional) content. My heart leaped with joy because no screen-scraping was needed. With a little bit of XPath I would have my index in no time.
WRONG!!!!!!!!!!!
I spent the next hour+ trying to get XOM to work and spit out my list via XPath because I sure as hell wasn't going to try to manually walk the tree extracting the TLDs. Do you have any idea how much code that would take? I had a feeling the reason it didn't work had something to do with how XOM handles namespaces, because when I accessed the tree the hard way it demanded that I specify the namespace URI of the document to look up any element in the tree, even though the document declares one and only one namespace. Now this isn't a knock against XOM, though I'm sure if I had used dom4j it would have worked, because dom4j is more concerned with being useful than being correct. But each has its place. By this time I was so aggravated I just didn't feel like downloading dom4j and unpacking it and setting my classpath, yada, yada, yada.
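The namespace hunch is consistent with how XPath 1.0 works: an unprefixed name like `td` in an expression only matches elements in no namespace, so against an XHTML document it selects nothing, even when the document declares a single default namespace. The query has to use a prefix bound to the XHTML namespace URI. Here is a minimal sketch of that, using the JDK's built-in `javax.xml.xpath` rather than XOM (XOM's `query` method takes an `XPathContext` for the same purpose). The class name `TldXPath`, the prefix `x`, and the two-row `SAMPLE` table are made up for illustration and only stand in for the real Wikipedia markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TldXPath {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for the Wikipedia page: one default namespace, two TLD rows.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><table>"
      + "<tr><td><a href=\"/wiki/.com\">.com</a></td></tr>"
      + "<tr><td><a href=\"/wiki/.org\">.org</a></td></tr>"
      + "</table></body></html>";

    // Evaluates the expression and returns the text of each hit, minus a leading dot.
    static List<String> query(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // without this the parser ignores namespaces
        Document doc = factory.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the arbitrary prefix "x" to the XHTML namespace URI.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "x".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "x" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                    ? Collections.singletonList("x").iterator()
                    : Collections.<String>emptyList().iterator();
            }
        });

        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getTextContent();
            result.add(text.startsWith(".") ? text.substring(1) : text);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(query(SAMPLE, "//td/a"));     // prints []
        System.out.println(query(SAMPLE, "//x:td/x:a")); // prints [com, org]
    }
}
```

The unprefixed query comes back empty, which looks exactly like "XPath doesn't work"; the prefixed one returns the list. Whether this was the actual cause of the XOM trouble above is a guess, but it is the usual suspect with XHTML.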
Regex to the rescue
This is how I got what I needed the old-fashioned screen-scraping way. 5 minutes tops from concept to results:

    import java.io.*;
    import java.net.URL;
    import java.util.regex.*;

    String tld = "http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains";
    // A table cell holding a single link whose text is a dotted TLD, e.g. ".com"
    Matcher regex = Pattern.compile( "<td><a[^>]++>\\.(\\w++)</a></td>" ).matcher( "" );
    BufferedReader reader = new BufferedReader( new InputStreamReader( new BufferedInputStream( new URL( tld ).openStream() ), "utf-8" ) );
    PrintWriter writer = new PrintWriter( new BufferedWriter( new FileWriter( new File( "/home/HashiDiKo/temp/TLD" ) ) ) );
    // matches() only succeeds when the whole line is exactly one such cell
    for( String line; null != (line = reader.readLine()); )
        if( regex.reset( line ).matches() )
            writer.println( regex.group( 1 ) );
    reader.close();
    writer.close();
just because you suck at using XPath doesn't mean it's bad