How To Parse HTML/XML In PHP?

Published October 6, 2024

Problem: Extracting Data from HTML/XML in PHP

Parsing HTML or XML in PHP means reading structured markup into a form your code can navigate and query. It lets you extract specific information from web pages or XML documents and work with that content programmatically.

Solutions for Parsing HTML/XML in PHP

PHP offers several options for parsing HTML and XML documents. They fall into two groups, native extensions and external libraries, each with its own strengths.

Native XML Extensions

PHP bundles four XML extensions with the language:

  1. DOM (Document Object Model): This extension lets you work with XML documents through the DOM API. It's based on libxml and can handle broken HTML. DOM can parse, modify, and perform XPath queries on documents.

  2. XMLReader: This extension is an XML pull parser that acts as a cursor, moving forward through the document stream and stopping at each node. It's also based on libxml, but it is aimed at XML rather than HTML, so it is likely to be less robust than DOM when given broken HTML.

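  3. XML Parser: This extension lets you create XML parsers and define handlers for XML events such as opening tags, closing tags, and character data. It is a SAX-style push parser: because it hands you events instead of building a full document tree, it usually needs less memory than DOM or SimpleXML, at the cost of more bookkeeping in your own code.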

  4. SimpleXML: This extension provides a tool to convert XML to an object that you can process with property selectors and array iterators. It works well with valid XHTML but can't parse broken HTML.
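To make the pull and push styles concrete, here is a minimal sketch that reads the same document with XMLReader and with the XML Parser extension. The sample XML, the `title` element name, and the variable names are assumptions made up for illustration.

```php
<?php
// Hypothetical sample document; the element names are illustrative only.
$xml = '<?xml version="1.0"?>'
     . '<books><book><title>First</title></book>'
     . '<book><title>Second</title></book></books>';

// Pull parsing with XMLReader: our code advances the cursor node by node.
$reader = new XMLReader();
$reader->XML($xml);
$pulled = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        $pulled[] = $reader->readString(); // text content of the current element
    }
}
$reader->close();

// Push parsing with the XML Parser extension: the parser calls our handlers.
$pushed  = [];
$inTitle = false;
$buffer  = '';

$parser = xml_parser_create();
// Element names are upper-cased by default; 0 keeps them as written.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, string $name, array $attrs) use (&$inTitle, &$buffer) {
        if ($name === 'title') {
            $inTitle = true;
            $buffer  = '';
        }
    },
    function ($parser, string $name) use (&$inTitle, &$buffer, &$pushed) {
        if ($name === 'title') {
            $pushed[] = $buffer;
            $inTitle  = false;
        }
    }
);

// Character data can arrive in several chunks, so it is accumulated in a buffer.
xml_set_character_data_handler(
    $parser,
    function ($parser, string $data) use (&$inTitle, &$buffer) {
        if ($inTitle) {
            $buffer .= $data;
        }
    }
);

xml_parse($parser, $xml, true); // true marks this as the final chunk
unset($parser);

echo implode(', ', $pulled), PHP_EOL; // First, Second
echo implode(', ', $pushed), PHP_EOL; // First, Second
```

Both approaches collect the same titles. The pull version keeps control flow in a plain loop, while the push version has to track state (`$inTitle`, `$buffer`) across callbacks.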

Tip: Choose the Right Extension

When selecting a native XML extension, consider your specific needs. If you're working with well-formed XML and need simple access to data, SimpleXML might be your best choice. For more complex operations or when dealing with potentially broken HTML, the DOM extension offers more flexibility and robustness.
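The trade-off described in this tip can be sketched as follows. The markup strings and variable names are hypothetical, and the snippet is only meant to show the difference in behavior, not a recommended production pattern.

```php
<?php
// Well-formed XML: SimpleXML exposes child elements as properties
// and attributes through array syntax.
$xml     = '<catalog><item id="42"><name>Widget</name></item></catalog>';
$catalog = simplexml_load_string($xml);
$name    = (string) $catalog->item->name;
$id      = (string) $catalog->item['id'];
echo $name, ' (id ', $id, ')', PHP_EOL; // Widget (id 42)

// Broken HTML: <p> and <b> are never closed.
$broken = '<html><body><p>Hello <b>world</body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// SimpleXML expects well-formed XML, so loading fails and returns false.
$simple = simplexml_load_string($broken);
var_dump($simple === false); // bool(true)

// DOM's HTML loader repairs the markup and still builds a tree.
$dom = new DOMDocument();
$dom->loadHTML($broken);
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;

libxml_clear_errors();
```

If SimpleXML's syntax is still preferred after loading with DOM, `simplexml_import_dom()` can convert a DOM node into a SimpleXML element.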

Third-Party Libraries Based on libxml

Some third-party libraries build on the native libxml extensions:

  1. FluentDom: This library provides a jQuery-like XML interface for the DOMDocument in PHP. It supports XPath or CSS selectors and can load formats like JSON, CSV, and JsonML.

  2. HtmlPageDom: This library extends DomCrawler from Symfony2 components, adding methods to manipulate the DOM tree of HTML documents.

  3. phpQuery: This library provides a server-side, chainable, CSS3 selector-driven DOM API based on the jQuery JavaScript Library.

  4. laminas-dom: This package offers tools for working with DOM documents and structures, including an interface for querying DOM documents using XPath and CSS selectors.

  5. fDOMDocument: This library extends the standard DOM to use exceptions instead of PHP warnings or notices. It also adds methods and shortcuts for convenience.

  6. sabre/xml: This library wraps and extends the XMLReader and XMLWriter classes to create a simple XML to object/array mapping system.

  7. FluidXML: This library provides an API for manipulating XML, using XPath and the fluent programming pattern.

Third-Party Libraries Not Based on libxml

Some third-party libraries don't rely on libxml:

  1. PHP Simple HTML DOM Parser: This parser allows you to manipulate HTML using methods similar to jQuery. However, it's not recommended due to performance issues and limited selector support.

  2. PHP Html Parser: This library aims to provide a way to scrape HTML, but it's not recommended due to performance issues and lack of memory management features.

When choosing a parsing solution, consider performance, memory usage, and your project's needs. Native XML extensions often perform better, while third-party libraries may offer more convenient APIs or extra features.

Example: Using DOM to Parse HTML

Here's a simple example of using the DOM extension to parse an HTML document:

<?php
// Create a new DOM Document
$dom = new DOMDocument();

// Load HTML
$html = '<html><body><h1>Hello, World!</h1></body></html>';
$dom->loadHTML($html);

// Find all h1 elements
$h1Elements = $dom->getElementsByTagName('h1');

// Output the text content of the first h1 element
echo $h1Elements->item(0)->textContent;  // Outputs: Hello, World!
?>

This example demonstrates how to load an HTML string, parse it using the DOM extension, and extract specific information from the parsed document.
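For queries more selective than a tag name, the DOM extension can be combined with `DOMXPath`. The following is a small sketch under assumed markup: the `post` and `ad` class names and the link URLs are invented for illustration.

```php
<?php
// Hypothetical markup; the class names and URLs are illustrative assumptions.
$html = '<html><body>'
      . '<div class="post"><a href="/first">First post</a></div>'
      . '<div class="post"><a href="/second">Second post</a></div>'
      . '<div class="ad"><a href="/buy">Buy now</a></div>'
      . '</body></html>';

// Real-world HTML often triggers parser warnings; collect them internally
// while loading, then restore the previous setting.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select only the links inside <div class="post"> elements.
$xpath = new DOMXPath($dom);
$links = [];
foreach ($xpath->query('//div[@class="post"]/a') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

print_r($links); // two entries: /first => First post, /second => Second post
```

The XPath expression matches the `class` attribute exactly, so an element with several classes (for example `class="post featured"`) would need a different expression such as one using `contains()`.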