2007-02-25 13:45

Atomise

As I mentioned, my desire to be above reproach when criticising MPlayer led me to produce a PHP script that converts their website into an Atom feed. I have now emailed that script to their webmaster and await a reply. In the meantime, I hope they will not see it as a lessening of their security that I publish the source code here, especially since the project itself has benefited from the openness of their code.

Also, in case we forget, I should point out that contributing to an open source project is not just a form of political activism (and even if it were “just” that, it would be worthwhile in itself). It also has short-term benefits (the only type which some people appreciate): it has increased my understanding of the Atom format, so I am now confident I could use the format in my own projects, whether at home or at work.

As for the benefit to people reading this, they may be looking for a script that converts a site to an Atom feed (as I was before I endeavoured to make my own), they may be interested in screen scraping in general (which is the method I am using), or they may just want to check someone else’s PHP code for style tips (or possibly examples of mistakes to avoid). As this is GPL code, you may do all those things, as long as you respect the requirements of the licence.

<?php
/**
 * atomise.php
 *
 * Copyright 2006 Hagfish
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License,
 * version 2, as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 */

// RewriteRule ^atomise.xml$ atomise.php [L]

$url = 'http://www.mplayerhq.hu/design7/news.html';
parse_html($url);

function htmlfix($string) {
    return htmlentities($string, ENT_NOQUOTES, 'UTF-8');
}

function parse_html($url) {
    //$itemregexp = "%rss:item *\" *>(.+?)</span>%is";
    $itemregexp = "%newsentry *\" *>(.+?)</div>%is";
    $allowable_tags = "<A><B><BR><BLOCKQUOTE><CENTER><DD><DL><DT><HR><I><IMG><LI><OL><P><PRE><U><UL><H1><H2><H3>";
    $urlparts = parse_url($url);
    // A URL with no path component gets a trailing slash
    if (empty($urlparts['path'])) $url .= "/";
    // Fetch the page, converting it from ISO-8859-1 to UTF-8 as we go
    $data = "";
    if ($fp = @fopen($url, "r")) {
        while (!feof($fp)) {
            $data .= utf8_encode(fgets($fp, 128));
        }
        fclose($fp);
    }
    // print "<pre>";
    // print htmlfix($data);
    eregi("<title>(.*)</title>", $data, $title);
    $channel_title = $title[1];
    // Split the data into items using the regexp which represents the start of each item
    $match_count = preg_match_all($itemregexp, $data, $items);
    // Limit the feed to the first ten entries
    $match_count = ($match_count > 10) ? 10 : $match_count;
    header("Content-Type: application/atom+xml");
    $output = "";
    $output .= "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
    $output .= "<feed xmlns=\"http://www.w3.org/2005/Atom\">\n";
    $output .= " <id>http://". $_SERVER['SERVER_NAME'] . $_SERVER['REQUEST_URI'] ."</id>\n";
    $output .= " <title>". htmlfix(strip_tags($channel_title)) ."</title>\n";
    $output .= " <subtitle>MPlayer News</subtitle>\n";
    $output .= " <link rel=\"self\" type=\"application/atom+xml\" href=\"http://". $_SERVER['SERVER_NAME'] . $_SERVER['REQUEST_URI'] ."\"/>\n";
    $output .= " <link href=\"". htmlfix($url) ."\" />\n";
    $output .= " <author><email>". htmlfix("webmaster@mplayerhq.hu") ."</email><name>Webmaster</name></author>\n";
    $output .= " <updated>".
        get_date($items[1][0]) ."</updated>\n"; // feed date comes from the first entry matched
    for ($i=0; $i< $match_count; $i++) {
        $desc = $items[1][$i];
        $title = get_title($desc);
        $item_url = get_link($desc, $url);
        $desc = strip_tags($desc, $allowable_tags);
        $output .= " <entry>\n";
        $output .= " <id>http://". $_SERVER['SERVER_NAME'] . $_SERVER['REQUEST_URI'] . "/" . get_date($items[1][$i]) ."</id>\n";
        $output .= " <title>". htmlfix($title) ."</title>\n";
        $output .= " <updated>". get_date($items[1][$i]) ."</updated>\n";
        $output .= " <link rel=\"alternate\" type=\"text/html\" href=\"". htmlfix($item_url) ."\" />\n";
        $output .= " <author><name>". get_author($items[1][$i]) ."</name></author>\n";
        $output .= " <content type=\"xhtml\" xml:lang=\"en\">\n";
        $output .= " <div xmlns=\"http://www.w3.org/1999/xhtml\">". get_description($desc) ."</div>\n";
        $output .= " </content>\n";
        $output .= " </entry>\n";
    }
    $output .= "</feed>\n";
    print $output;
    // print htmlfix($output);
    // print "</pre>";
}

// The poster's name: the text after "posted by " up to the next tag
function get_author($desc) {
    $posterWithTail = ereg_replace(".*<span class=\"poster\">posted by ", "", $desc);
    $poster = ereg_replace("<.*", "", $posterWithTail);
    return $poster;
}

// The entry date: the dotted date before "::", rewritten as an Atom timestamp
function get_date($desc) {
    $upToDate = ereg_replace("::.*", "", $desc);
    $date = ereg_replace(".*>", "", $upToDate);
    $numericalDate = ereg_replace(",.*", "", $date);
    $isoDate = str_replace(".", "-", $numericalDate);
    return $isoDate .
"T00:00:00Z"; } function xmlFix($desc) { $desc = ereg_replace("(<img[^>]*)>", "\1 />", $desc); $desc = ereg_replace("&", "&", $desc); $desc = str_replace("<br>", "<br />", $desc); return $desc; } function get_description($desc) { //$desc = htmlfix($desc); $desc = str_replace("\r", " ", $desc); $desc = str_replace("\n", " ", $desc); //$desc = ereg_replace("^[^>]*>[^>]*>[^>]*>[^<]*<", "<", $desc); $desc = ereg_replace(".*</h2>", "", $desc); $desc = xmlFix($desc); return $desc; } function get_title($desc) { $upToTitle = ereg_replace("</a>.*", "", $desc); $title = ereg_replace(".*:: ", "", $upToTitle); return $title; } function get_link($desc, $url) { $nameWithTail = ereg_replace(".*a name="", "", $desc); $name = ereg_replace("".*", "", $nameWithTail); $link = "$url#$name"; return $link; } ?> Has the software name “Atomise” been used before? Is the spelling too en-GB and not en-GB-oed? Trackbacks