Jan 2, 2010

Parsing HTML with YQL and PHP without regular expressions

Much of the data on the Web is published as Hypertext Markup Language (HTML), so there are many times you might want to parse HTML in your application. However, most programming languages do not make this easy out of the box.
The numerous questions posted by programmers looking for an easy way to parse HTML are evidence of this.

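I am here with an easy way to parse data from any web page.

You just have to pass the URL of the page and an XPath expression to YQL (Yahoo! Query Language), which fetches the page and returns the matching elements as JSON. (XPath is used to navigate through elements and attributes in an XML document.)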

The PHP Code


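<?php
$url   = "http://motyar.blogspot.com";
// Double quotes inside the XPath avoid clashing with the single quotes of the YQL string.
$xpath = 'div[@id="actions"]/a';

// Build the YQL statement in readable form, then URL-encode it as a whole.
$yql      = "select * from html where url='" . $url . "' and xpath='//" . $xpath . "'";
$queryUrl = "http://query.yahooapis.com/v1/public/yql?q=" . rawurlencode($yql) . "&format=json";

// Fetch the JSON response with cURL.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $queryUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);
if ($output === false) { exit("Request failed\n"); }

// The matched nodes, if any, are under query->results.
$data    = json_decode($output);
$results = isset($data->query->results) ? $data->query->results : null;

print_r($results);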

The result
This script prints a stdClass object that looks like this:


stdClass Object
(
    [a] => Array
        (
            [0] => stdClass Object
                (
                    [href] => http://twitter.com/dharmmotyar
                    [id] => twitter
                    [target] => _blank
                    [title] => Follow me on Twitter.
                    [span] => stdClass Object
                        (
                            [content] => Twitter
                        )

                )

            [1] => stdClass Object
                (
                    [href] => http://motyar.blogspot.com/rss.xml
                    [id] => rss
                    [title] => RSS feed of this site.
                    [span] => stdClass Object
                        (
                            [class] => hidden
                            [content] => RSS
                        )

                )

            [2] => stdClass Object
                (
                    [href] => http://motyar.blogspot.com
                    [id] => home_link
                    [title] => My homepage.
                    [span] => stdClass Object
                        (
                            [class] => hidden
                            [content] => Home
                        )

                )

        )

)
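
Once you have the result object, you will usually want to loop over it rather than just print it. Here is a minimal sketch of how that might look. It uses a hard-coded JSON sample trimmed from the output above instead of a live request, and extractLinks() is a hypothetical helper name of my own, not part of YQL or PHP. It also assumes that a query matching only one element may come back as a single object instead of an array, so it normalizes both cases.

```php
<?php
// Sample response mirroring the result shown above (not a live YQL call).
$json = '{"query":{"results":{"a":[
  {"href":"http://twitter.com/dharmmotyar","id":"twitter","target":"_blank","title":"Follow me on Twitter.","span":{"content":"Twitter"}},
  {"href":"http://motyar.blogspot.com/rss.xml","id":"rss","title":"RSS feed of this site.","span":{"class":"hidden","content":"RSS"}},
  {"href":"http://motyar.blogspot.com","id":"home_link","title":"My homepage.","span":{"class":"hidden","content":"Home"}}
]}}}';

// Hypothetical helper: turns the "results" object into a flat list of links.
function extractLinks($results)
{
    // No matches: "results" is null or has no "a" property.
    if (!isset($results->a)) {
        return array();
    }

    // Assumption: a single match may arrive as one object, not an array.
    $anchors = is_array($results->a) ? $results->a : array($results->a);

    $links = array();
    foreach ($anchors as $a) {
        $links[] = array(
            'text' => isset($a->span->content) ? $a->span->content : '',
            'href' => $a->href,
        );
    }
    return $links;
}

$data  = json_decode($json);
$links = extractLinks($data->query->results);

// Print one "text => href" line per link.
foreach ($links as $link) {
    echo $link['text'] . ' => ' . $link['href'] . "\n";
}
```

In the real script you would pass $results from the code above to extractLinks() instead of decoding the sample string.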

Feel free to share any queries.

Labels: Yql

By : Motyar+ @motyar