PHP loadHTMLFile and a html file without DOCTYPE

Just noticed that when you parse an html file with DOMDocument’s method loadHTMLFile and there's no DOCTYPE defined in your html, PHP will silently load an empty DOM document.

Just try saving the following in a test.html file:

<html><body><div id="toc">wtf</div></body></html>

And then run the following php code:

$doc = new DOMDocument();
if ($doc->loadHTMLFile('test.html')) {
    echo 'loadHTMLFile was successfully executed<br>';
    $toc = $doc->getElementById('toc');
    echo 'now trying to var_dump the $toc:<br>';
    var_dump($toc);
}

You’ll get NULL as a result of the var_dump call. As if getElementById couldn’t find the node.

Interesting?

Citing php.net,

The function parses the HTML document in the file named filename. Unlike loading XML, HTML does not have to be well-formed to load.

Does this imply that DOCTYPE may be omitted? I think so. But then the abovementioned code wouldn’t show NULL as a dump of $toc. Unfortunately, experiment shows that DOCTYPE is required, even a HTML5-ish <!DOCTYPE html> will do the job.

But why on earth doesn’t loadHTMLFile throw a warning or at least return false as it should according to the documentation? Nobody knows.

So if you notice that your DOM-based php script acts in a weird way, check if you have a DOCTYPE defined on the HTML document you’re trying to parse.

Hope this saves someone some time.

P.S. More bugs to come — if you have a HTML file saved in utf-8 codepage with BOM, loadHTMLFile will throw the following E_WARNING:

**Warning**: DOMDocument::loadHTMLFile() [function.DOMDocument-loadHTMLFile]:
Misplaced DOCTYPE declaration in test-BOM.html, line: 1 in /home/test/www/test-DOMDocument.php on line 3

Remove the BOM and everything works fine. Apparently, loadHTMLFile doesn’t know that BOM usually indicates that the document is saved in UTF8/16/32. Weird.

P.P.S. Another issue. Try pointing loadHTMLFile to an HTML-document saved in UTF-8 with some international characters (Russian words, in my case). Then get a node with international characters and do echo $node->nodeValue. Are you getting corrupted symbols? I was. The whole project is in UTF-8, every single file is saved in UTF-8.

I added

<meta http-equiv="Content-type" content="text/html;charset=utf-8"/>

to the head section — characters started showing in a correct encoding, but the following WARNING appeared:

**Warning**: DOMDocument::loadHTMLFile() [function.DOMDocument-loadHTMLFile]: Input is
not proper UTF-8, indicate encoding ! in /home/test/www/test-russian.html, line: 65 in
/home/test/www/test-DOMDocument.php on line 29

And the only way to properly get rid of this warning is to add

<?xml version="1.0" encoding="UTF-8"?>

to the first line of your html document and it finally worked without any warnings or issues. Awesome. XML header must be used for load**HTML**File to run properly. Way too buggy to use.

comments powered by Disqus