I just noticed that when you parse an HTML file with DOMDocument’s loadHTMLFile() method and the HTML has no DOCTYPE, PHP loads it without complaint but then behaves as if the DOM document were empty.
Just try saving the following in a test.html file:
<html><body><div id="toc">wtf</div></body></html>
Then run the following PHP code:
$doc = new DOMDocument();
if ($doc->loadHTMLFile('test.html')) {
    echo 'loadHTMLFile was successfully executed<br>';
    // Look up the div by its id attribute
    $toc = $doc->getElementById('toc');
    echo 'now trying to var_dump the $toc:<br>';
    var_dump($toc);
}
You’ll get NULL as the result of the var_dump() call, as if getElementById() couldn’t find the node.
Interesting?
Citing php.net,
The function parses the HTML document in the file named filename. Unlike loading XML, HTML does not have to be well-formed to load.
Does this imply that the DOCTYPE may be omitted? I think so. But then the code above wouldn’t show NULL as the dump of $toc. Unfortunately, experiment shows that a DOCTYPE is required; even an HTML5-style <!DOCTYPE html> will do the job.
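If you can’t edit the HTML files themselves, one possible workaround is to add the DOCTYPE in memory before parsing. This is only a minimal sketch under that assumption: loadHtmlWithDoctype() is a made-up helper name, not a PHP function, and it uses loadHTML() instead of loadHTMLFile() so the markup can be patched first.

```php
<?php
// Hypothetical helper (not part of PHP): loads an HTML file and prepends
// an HTML5 DOCTYPE when the markup does not start with one.
function loadHtmlWithDoctype(string $path): DOMDocument
{
    $html = file_get_contents($path);
    if ($html === false) {
        throw new RuntimeException("Cannot read $path");
    }

    // Case-insensitive check for a DOCTYPE at the very start of the markup.
    if (!preg_match('/^\s*<!DOCTYPE\b/i', $html)) {
        $html = "<!DOCTYPE html>\n" . $html;
    }

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    return $doc;
}
```

With the test.html from above, loadHtmlWithDoctype('test.html')->getElementById('toc') should then have a DOCTYPE to work with. The check is deliberately naive: it does not account for a BOM or a comment in front of the DOCTYPE.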
But why on earth doesn’t loadHTMLFile() throw a warning or at least return false, as it should according to the documentation? Nobody knows.
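One way to at least see what the parser has to say is to collect libxml’s errors yourself. The sketch below uses the standard libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() functions; the helper name loadHtmlFileWithDiagnostics() and the shape of the returned array are my own invention. I can’t promise libxml reports anything at all for the missing-DOCTYPE case, so treat it as a debugging aid, not a fix.

```php
<?php
// Hypothetical helper: parse a file and return whatever libxml reported,
// instead of relying on PHP warnings being displayed.
function loadHtmlFileWithDiagnostics(string $path): array
{
    // Collect parser errors internally; remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadHTMLFile($path);

    $errors = libxml_get_errors();          // array of LibXMLError objects
    libxml_clear_errors();                  // empty the error buffer
    libxml_use_internal_errors($previous);  // restore the old setting

    return ['ok' => $ok, 'doc' => $doc, 'errors' => $errors];
}
```

Each entry in the 'errors' array is a LibXMLError object with public line and message properties, so you can loop over the array and print those to see what the parser complained about.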
So if you notice that your DOM-based PHP script acts in a weird way, check whether the HTML document you’re trying to parse has a DOCTYPE.
Hope this saves someone some time.
P.S. More bugs to come: if you have an HTML file saved as UTF-8 with a BOM, loadHTMLFile() will throw the following E_WARNING:
**Warning**: DOMDocument::loadHTMLFile() [function.DOMDocument-loadHTMLFile]:
Misplaced DOCTYPE declaration in test-BOM.html, line: 1 in /home/test/www/test-DOMDocument.php on line 3
Remove the BOM and everything works fine. Apparently, loadHTMLFile() doesn’t know that a BOM usually indicates that the document is saved in UTF-8/16/32. Weird.
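If re-saving the files is not an option, the BOM can also be stripped in code before parsing. A minimal sketch, assuming the file is UTF-8 (it only handles the three-byte UTF-8 BOM, not UTF-16/32); stripUtf8Bom() and loadHtmlFileWithoutBom() are hypothetical helper names, not PHP functions:

```php
<?php
// Hypothetical helper: drop a leading UTF-8 BOM (bytes EF BB BF), if present.
function stripUtf8Bom(string $html): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($html, $bom, 3) === 0 ? substr($html, 3) : $html;
}

// Hypothetical wrapper: read the file, strip the BOM, then parse the string
// with loadHTML() instead of loadHTMLFile().
function loadHtmlFileWithoutBom(string $path): DOMDocument
{
    $html = file_get_contents($path);
    if ($html === false) {
        throw new RuntimeException("Cannot read $path");
    }

    $doc = new DOMDocument();
    $doc->loadHTML(stripUtf8Bom($html));
    return $doc;
}
```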
P.P.S. Another issue. Try pointing loadHTMLFile() at an HTML document saved in UTF-8 with some international characters (Russian words, in my case). Then get a node with international characters and do echo $node->nodeValue. Are you getting corrupted symbols? I was. The whole project is in UTF-8; every single file is saved in UTF-8.
I added
<meta http-equiv="Content-type" content="text/html;charset=utf-8"/>
to the head section. The characters started showing in the correct encoding, but the following warning appeared:
**Warning**: DOMDocument::loadHTMLFile() [function.DOMDocument-loadHTMLFile]: Input is
not proper UTF-8, indicate encoding ! in /home/test/www/test-russian.html, line: 65 in
/home/test/www/test-DOMDocument.php on line 29
And the only way I found to properly get rid of this warning was to add
<?xml version="1.0" encoding="UTF-8"?>
as the first line of the HTML document. With that, it finally worked without any warnings or issues. Awesome. An XML header must be used for load**HTML**File to run properly. Way too buggy to use.
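The same fix can be applied in memory instead of editing every file. The sketch below prepends the XML declaration to the markup and parses it with loadHTML(); loadUtf8Html() is a hypothetical helper name, and how the declaration is treated may differ between libxml versions, so check the result on your own setup.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML string by putting an XML
// declaration in front of it, mirroring the manual fix described above.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $html);

    // The declaration may end up in the DOM as a processing instruction;
    // remove it so it does not leak into saved output.
    foreach ($doc->childNodes as $node) {
        if ($node->nodeType === XML_PI_NODE) {
            $doc->removeChild($node);
            break;
        }
    }

    return $doc;
}
```

Usage would be something like loadUtf8Html(file_get_contents('test-russian.html')), followed by the usual node lookups.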