在PHP中parsing巨大的XML文件

我试图将DMOZ内容/结构XML文件parsing到MySQL中,但是所有现有的脚本都是非常旧的,并且不能很好地工作。 我怎样才能在PHP中打开一个大的(+ 1GB)XML文件进行parsing?

只有两个PHP API非常适合处理大文件。 第一个是旧的expat api,第二个是更新的XMLreader函数。 这些apis读取连续的stream,而不是将整个树加载到内存中(这是simplexml和DOM的作用)。

举个例子,你可能想看看DMOZ目录的这个部分parsing器:

<?php class SimpleDMOZParser { protected $_stack = array(); protected $_file = ""; protected $_parser = null; protected $_currentId = ""; protected $_current = ""; public function __construct($file) { $this->_file = $file; $this->_parser = xml_parser_create("UTF-8"); xml_set_object($this->_parser, $this); xml_set_element_handler($this->_parser, "startTag", "endTag"); } public function startTag($parser, $name, $attribs) { array_push($this->_stack, $this->_current); if ($name == "TOPIC" && count($attribs)) { $this->_currentId = $attribs["R:ID"]; } if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) { echo $attribs["R:RESOURCE"] . "\n"; } $this->_current = $name; } public function endTag($parser, $name) { $this->_current = array_pop($this->_stack); } public function parse() { $fh = fopen($this->_file, "r"); if (!$fh) { die("Epic fail!\n"); } while (!feof($fh)) { $data = fread($fh, 4096); xml_parse($this->_parser, $data, feof($fh)); } } } $parser = new SimpleDMOZParser("content.rdf.u8"); $parser->parse(); 

这是一个非常类似的问题,以最好的方式来处理PHP中的大型XML,但有一个很好的具体答案upvoted解决DMOZ目录parsing的具体问题。 但是,由于这对于大型XML来说是一个很好的攻击,所以我会从另外一个问题转发我的答案:

我承担:

https://github.com/prewk/XmlStreamer

一个简单的类,将stream式传输文件时将所有的孩子提取到XML根元素。 testing来自pubmed.com的108 MB XML文件。

 class SimpleXmlStreamer extends XmlStreamer { public function processNode($xmlString, $elementName, $nodeIndex) { $xml = simplexml_load_string($xmlString); // Do something with your SimpleXML object return true; } } $streamer = new SimpleXmlStreamer("myLargeXmlFile.xml"); $streamer->parse(); 

我最近不得不parsing一些非常大的XML文档,并且需要一次读取一个元素的方法。

如果您有以下文件complex-test.xml

 <?xml version="1.0" encoding="UTF-8"?> <Complex> <Object> <Title>Title 1</Title> <Name>It's name goes here</Name> <ObjectData> <Info1></Info1> <Info2></Info2> <Info3></Info3> <Info4></Info4> </ObjectData> <Date></Date> </Object> <Object></Object> <Object> <AnotherObject></AnotherObject> <Data></Data> </Object> <Object></Object> <Object></Object> </Complex> 

并且想要返回<Object/>

PHP:

 require_once('class.chunk.php'); $file = new Chunk('complex-test.xml', array('element' => 'Object')); while ($xml = $file->read()) { $obj = simplexml_load_string($xml); // do some parsing, insert to DB whatever } ########### Class File ########### <?php /** * Chunk * * Reads a large file in as chunks for easier parsing. * * The chunks returned are whole <$this->options['element']/>s found within file. * * Each call to read() returns the whole element including start and end tags. * * Tested with a 1.8MB file, extracted 500 elements in 0.11s * (with no work done, just extracting the elements) * * Usage: * <code> * // initialize the object * $file = new Chunk('chunk-test.xml', array('element' => 'Chunk')); * * // loop through the file until all lines are read * while ($xml = $file->read()) { * // do whatever you want with the string * $o = simplexml_load_string($xml); * } * </code> * * @package default * @author Dom Hastings */ class Chunk { /** * options * * @var array Contains all major options * @access public */ public $options = array( 'path' => './', // string The path to check for $file in 'element' => '', // string The XML element to return 'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk ); /** * file * * @var string The filename being read * @access public */ public $file = ''; /** * pointer * * @var integer The current position the file is being read from * @access public */ public $pointer = 0; /** * handle * * @var resource The fopen() resource * @access private */ private $handle = null; /** * reading * * @var boolean Whether the script is currently reading the file * @access private */ private $reading = false; /** * readBuffer * * @var string Used to make sure start tags aren't missed * @access private */ private $readBuffer = ''; /** * __construct * * Builds the Chunk object * * @param string $file The filename to work with * @param array $options The options with which to parse the file * @author Dom Hastings * @access public */ public function __construct($file, $options = array()) { // merge the options together $this->options = array_merge($this->options, (is_array($options) ? $options : array())); // check that the path ends with a / if (substr($this->options['path'], -1) != '/') { $this->options['path'] .= '/'; } // normalize the filename $file = basename($file); // make sure chunkSize is an int $this->options['chunkSize'] = intval($this->options['chunkSize']); // check it's valid if ($this->options['chunkSize'] < 64) { $this->options['chunkSize'] = 512; } // set the filename $this->file = realpath($this->options['path'].$file); // check the file exists if (!file_exists($this->file)) { throw new Exception('Cannot load file: '.$this->file); } // open the file $this->handle = fopen($this->file, 'r'); // check the file opened successfully if (!$this->handle) { throw new Exception('Error opening file for reading'); } } /** * __destruct * * Cleans up * * @return void * @author Dom Hastings * @access public */ public function __destruct() { // close the file resource fclose($this->handle); } /** * read * * Reads the first available occurence of the XML element $this->options['element'] * * @return string The XML string from $this->file * @author Dom Hastings * @access public */ public function read() { // check we have an element specified if (!empty($this->options['element'])) { // trim it $element = trim($this->options['element']); } else { $element = ''; } // initialize the buffer $buffer = false; // if the element is empty if (empty($element)) { // let the script know we're reading $this->reading = true; // read in the whole doc, cos we don't know what's wanted while ($this->reading) { $buffer .= fread($this->handle, $this->options['chunkSize']); $this->reading = (!feof($this->handle)); } // return it all return $buffer; // we must be looking for a specific element } else { // set up the strings to find $open = '<'.$element.'>'; $close = '</'.$element.'>'; // let the script know we're reading $this->reading = true; // reset the global buffer $this->readBuffer = ''; // this is used to ensure all data is read, and to make sure we don't send the start data again by mistake $store = false; // seek to the position we need in the file fseek($this->handle, $this->pointer); // start reading while ($this->reading && !feof($this->handle)) { // store the chunk in a temporary variable $tmp = fread($this->handle, $this->options['chunkSize']); // update the global buffer $this->readBuffer .= $tmp; // check for the open string $checkOpen = strpos($tmp, $open); // if it wasn't in the new buffer if (!$checkOpen && !($store)) { // check the full buffer (in case it was only half in this buffer) $checkOpen = strpos($this->readBuffer, $open); // if it was in there if ($checkOpen) { // set it to the remainder $checkOpen = $checkOpen % $this->options['chunkSize']; } } // check for the close string $checkClose = strpos($tmp, $close); // if it wasn't in the new buffer if (!$checkClose && ($store)) { // check the full buffer (in case it was only half in this buffer) $checkClose = strpos($this->readBuffer, $close); // if it was in there if ($checkClose) { // set it to the remainder plus the length of the close string itself $checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize']; } // if it was } elseif ($checkClose) { // add the length of the close string itself $checkClose += strlen($close); } // if we've found the opening string and we're not already reading another element if ($checkOpen !== false && !($store)) { // if we're found the end element too if ($checkClose !== false) { // append the string only between the start and end element $buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen)); // update the pointer $this->pointer += $checkClose; // let the script know we're done $this->reading = false; } else { // append the data we know to be part of this element $buffer .= substr($tmp, $checkOpen); // update the pointer $this->pointer += $this->options['chunkSize']; // let the script know we're gonna be storing all the data until we find the close element $store = true; } // if we've found the closing element } elseif ($checkClose !== false) { // update the buffer with the data upto and including the close tag $buffer .= substr($tmp, 0, $checkClose); // update the pointer $this->pointer += $checkClose; // let the script know we're done $this->reading = false; // if we've found the closing element, but half in the previous chunk } elseif ($store) { // update the buffer $buffer .= $tmp; // and the pointer $this->pointer += $this->options['chunkSize']; } } } // return the element (or the whole file if we're not looking for elements) return $buffer; } } 

我会build议使用基于SAX的parsing器,而不是基于DOM的parsing。

关于在PHP中使用SAX的信息: http : //www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm

这不是一个很好的解决scheme,但只是抛出另一个select:

你可以把许多大的XML文件分解成多个块,特别是那些实际上只是类似元素列表(因为我怀疑你正在使用的文件)。

例如,如果您的文档看起来像:

 <dmoz> <listing>....</listing> <listing>....</listing> <listing>....</listing> <listing>....</listing> <listing>....</listing> <listing>....</listing> ... </dmoz> 

您可以一次读取一两个字节,人为地将您加载到根级别标记中的几个完整的<listing>标记包装起来,然后通过simplexml / domxml(我使用domxml,在采取这种方法时)加载它们。

坦率地说,如果你使用PHP <5.1.2,我更喜欢这种方法。 对于5.1.2或更高版本,XMLReader是可用的,这可能是最好的select,但在此之前,您坚持使用上述分块策略或旧的SAX / expat lib。 我不知道其他人,但我恨写/维护SAX / expatparsing器。

但是,请注意,当文档包含许多相同的底层元素时,这种方法并不实际(例如,它适用于任何types的文件列表或URL等,但不会使感觉是parsing一个大的HTML文档)