bupt1987 / Html Parser

License: MIT
A PHP HTML parser, similar to PHP Simple HTML DOM Parser but several times faster.

Projects that are alternatives of or similar to Html Parser

Sax Wasm
The first streamable, fixed memory XML, HTML, and JSX parser for WebAssembly.
Stars: ✭ 89 (-82.55%)
Mutual labels:  parser, html-parser
Fuzi
A fast & lightweight XML & HTML parser in Swift with XPath & CSS support
Stars: ✭ 894 (+75.29%)
Mutual labels:  parser, html-parser
Oga
Read-only mirror of https://gitlab.com/yorickpeterse/oga
Stars: ✭ 1,147 (+124.9%)
Mutual labels:  parser, html-parser
Save For Offline
Android app for saving webpages for offline reading.
Stars: ✭ 114 (-77.65%)
Mutual labels:  parser, html-parser
Lua Gumbo
Moved to https://gitlab.com/craigbarnes/lua-gumbo
Stars: ✭ 116 (-77.25%)
Mutual labels:  parser, html-parser
Posthtml
PostHTML is a tool to transform HTML/XML with JS plugins
Stars: ✭ 2,737 (+436.67%)
Mutual labels:  parser, html-parser
Harser
Easy way for HTML parsing and building XPath
Stars: ✭ 135 (-73.53%)
Mutual labels:  parser, html-parser
Hquery.php
An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
Stars: ✭ 295 (-42.16%)
Mutual labels:  parser, html-parser
Fasthan
fastHan is a Chinese natural language processing toolkit built on fastNLP and PyTorch, as easy to call as spaCy.
Stars: ✭ 449 (-11.96%)
Mutual labels:  parser
Deta parser
Fast Chinese word segmentation and analysis
Stars: ✭ 476 (-6.67%)
Mutual labels:  parser
Anystyle
Fast and smart citation reference parsing
Stars: ✭ 438 (-14.12%)
Mutual labels:  parser
Exifr
📷 The fastest and most versatile JS EXIF reading library.
Stars: ✭ 448 (-12.16%)
Mutual labels:  parser
Html5 Dom Document Php
A better HTML5 parser for PHP.
Stars: ✭ 477 (-6.47%)
Mutual labels:  parser
Mwparserfromhell
A Python parser for MediaWiki wikicode
Stars: ✭ 440 (-13.73%)
Mutual labels:  parser
Tenko
An 100% spec compliant ES2021 JavaScript parser written in JS
Stars: ✭ 490 (-3.92%)
Mutual labels:  parser
Picofeed
PHP library to parse and write RSS/Atom feeds
Stars: ✭ 439 (-13.92%)
Mutual labels:  parser
Tiny Compiler
A tiny compiler for a language featuring LL(2) with Lexer, Parser, ASM-like codegen and VM. Complex enough to give you a flavour of how the "real" thing works whilst not being a mere toy example
Stars: ✭ 425 (-16.67%)
Mutual labels:  parser
Textx
Domain-Specific Languages and parsers in Python made easy http://textx.github.io/textX/
Stars: ✭ 496 (-2.75%)
Mutual labels:  parser
Kong
Kong is a command-line parser for Go
Stars: ✭ 481 (-5.69%)
Mutual labels:  parser
Stream Json
The micro-library of Node.js stream components for creating custom JSON processing pipelines with a minimal memory footprint. It can parse JSON files far exceeding available memory streaming individual primitives using a SAX-inspired API.
Stars: ✭ 462 (-9.41%)
Mutual labels:  parser

HtmlParser

A PHP HTML parsing tool, similar to PHP Simple HTML DOM Parser. Because it is built on PHP's DOM extension, it parses HTML several times faster than PHP Simple HTML DOM Parser.

Note: the HTML must be UTF-8 encoded; if it is not, convert it to UTF-8 first.
If you run into garbled characters, see: http://www.fwolf.com/blog/post/314
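
As a minimal sketch of that conversion step, assuming the mbstring extension is available and the source encoding is already known (GB18030 is used here only for illustration), a page could be converted before it is handed to the parser. The helper name `to_utf8` is hypothetical and not part of HtmlParser.

```php
<?php
// Hypothetical helper (not part of HtmlParser): convert HTML from a known
// source encoding to UTF-8 using the mbstring extension.
function to_utf8(string $html, string $sourceEncoding): string
{
    // Nothing to do when the caller already has UTF-8.
    if (strcasecmp($sourceEncoding, 'UTF-8') === 0) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
}

// Build a GB18030-encoded sample, then convert it back to UTF-8.
$legacyHtml = mb_convert_encoding('<div id="test1">测试1</div>', 'GB18030', 'UTF-8');
$utf8Html   = to_utf8($legacyHtml, 'GB18030');
```

Detecting an unknown encoding automatically is a harder problem and is deliberately left out of this sketch.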

Composer is now supported:

"require": {"bupt1987/html-parser": "dev-master"}

Load the Composer autoloader:
require 'vendor/autoload.php';

================================================================================

Example
<?php
require 'vendor/autoload.php';

$html = '<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>test</title>
  </head>
  <body>
    <p class="test_class test_class1">p1</p>
    <p class="test_class test_class2">p2</p>
    <p class="test_class test_class3">p3</p>
    <div id="test1">测试1</div>
  </body>
</html>';
$html_dom = new \HtmlParser\ParserDom($html);
$p_array = $html_dom->find('p.test_class');
$p1 = $html_dom->find('p.test_class1',0);
$div = $html_dom->find('div#test1',0);
foreach ($p_array as $p){
	echo $p->getPlainText() . "\n";
}
echo $div->getPlainText() . "\n";
echo $p1->getPlainText() . "\n";
echo $p1->getAttr('class') . "\n";
echo "show html:\n";
echo $div->innerHtml() . "\n";
echo $div->outerHtml() . "\n";
?>

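Basic usage

// Here $html is a \HtmlParser\ParserDom instance (named $html_dom in the example above).

// Find all <a> tags
$ret = $html->find('a');

// Find the first <a> element
$ret = $html->find('a', 0);

// Find the last <a> element
$ret = $html->find('a', -1);

// Find all <div> tags that have an id attribute
$ret = $html->find('div[id]');

// Find all <div> tags whose id attribute is foo
$ret = $html->find('div[id=foo]');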
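
The optional second argument in the calls above selects a single match by position. A minimal sketch of that indexing rule, assuming negative values count back from the end of the match list (the source only shows 0 and -1) and that an out-of-range index yields null; `pick_by_index` is a hypothetical helper, not a function of the library:

```php
<?php
// Hypothetical helper (not part of HtmlParser): pick one item by position,
// or return the whole list when no index is given.
function pick_by_index(array $items, ?int $idx = null)
{
    if ($idx === null) {
        return $items;
    }
    // Assumption: negative indexes count from the end, so -1 is the last item.
    if ($idx < 0) {
        $idx += count($items);
    }
    return $items[$idx] ?? null;
}

$matches = ['first', 'middle', 'last'];
$first = pick_by_index($matches, 0);
$last  = pick_by_index($matches, -1);
```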

Advanced usage

// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

// Find all anchors and images
$ret = $html->find('a, img');

// Find all anchors and images that have a title attribute
$ret = $html->find('a[title], img[title]');

Descendant selectors

// Find all <li> in <ul>
$es = $html->find('ul li');

// Find nested <div> tags
$es = $html->find('div div div');

// Find all <td> in <table> elements with class=hello
$es = $html->find('table.hello td');

// Find all <td> tags with attribute align=center in <table> tags
$es = $html->find('table td[align=center]');
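
Because the library is built on PHP's DOM extension, selectors like the ones above can be approximated directly with DOMXPath. The sketch below shows one well-known way to express a class match plus a descendant step in XPath 1.0. It is not taken from HtmlParser's source, and the sample markup is made up for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body>'
    . '<table class="big hello"><tr><td align="center">a</td><td>b</td></tr></table>'
    . '<table class="helloworld"><tr><td>c</td></tr></table>'
    . '</body></html>'
);
$xpath = new DOMXPath($doc);

// Rough equivalent of 'table.hello td': pad @class with spaces so that
// "hello" matches as a whole word and "helloworld" does not.
$classHello = "contains(concat(' ', normalize-space(@class), ' '), ' hello ')";
$cells = $xpath->query("//table[$classHello]//td");

// Rough equivalent of 'table td[align=center]'.
$centered = $xpath->query("//table//td[@align='center']");

foreach ($cells as $td) {
    echo $td->textContent, "\n";
}
```

The `//` step matches descendants at any depth, which is why rows or other wrappers between the table and its cells do not affect the result.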

Nested selectors

// Find all <li> in <ul>
foreach ($html->find('ul') as $ul)
{
    foreach ($ul->find('li') as $li)
    {
        // do something...
    }
}

// Find first <li> in first <ul>
$e = $html->find('ul', 0)->find('li', 0);

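Attribute filters

The following attribute selector operators are supported:

Filter	Description
[attribute]	Matches elements that have the specified attribute.
[!attribute]	Matches elements that do not have the specified attribute.
[attribute=value]	Matches elements whose attribute equals the specified value.
[attribute!=value]	Matches elements whose attribute does not equal the specified value.
[attribute^=value]	Matches elements whose attribute value starts with the specified value.
[attribute$=value]	Matches elements whose attribute value ends with the specified value.
[attribute*=value]	Matches elements whose attribute value contains the specified value.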
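
To make the semantics of the table above concrete, here is a minimal sketch that maps each filter to an XPath 1.0 predicate and runs it through DOMXPath. It is an illustration under stated assumptions, not the library's actual implementation: `attribute_filter_to_xpath` is a hypothetical helper, `!=` is assumed to require that the attribute exists, and values containing a single quote are rejected.

```php
<?php
// Hypothetical helper (not part of HtmlParser): translate one attribute
// filter from the table above into an XPath 1.0 predicate.
function attribute_filter_to_xpath(string $filter): string
{
    $pattern = '/^\[(!?)([\w-]+)(?:(!=|\^=|\$=|\*=|=)(.*))?\]$/';
    if (!preg_match($pattern, $filter, $m)) {
        throw new InvalidArgumentException("Unsupported filter: $filter");
    }
    $negated = $m[1] === '!';
    $attr    = '@' . $m[2];
    $op      = $m[3] ?? '';
    $value   = $m[4] ?? '';

    if ($op === '') {
        return $negated ? "[not($attr)]" : "[$attr]";
    }
    // Assumption: "!" is only meaningful without an operator, and values
    // never contain a single quote (XPath 1.0 has no escape for it).
    if ($negated || strpos($value, "'") !== false) {
        throw new InvalidArgumentException("Unsupported filter: $filter");
    }
    $v = "'$value'";
    switch ($op) {
        case '=':  return "[$attr=$v]";
        case '!=': return "[$attr!=$v]";
        case '^=': return "[starts-with($attr,$v)]";
        case '*=': return "[contains($attr,$v)]";
        default:   // '$=': XPath 1.0 lacks ends-with(), so compare the tail.
            return "[substring($attr, string-length($attr) - string-length($v) + 1)=$v]";
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><a href="http://a.example/x.png" title="logo">1</a>'
    . '<a href="/local">2</a><a>3</a></body></html>');
$xpath = new DOMXPath($doc);
foreach (['[href]', '[!href]', '[href^=http]', '[href$=.png]'] as $filter) {
    $n = $xpath->query('//a' . attribute_filter_to_xpath($filter))->length;
    echo "a$filter matches $n element(s)\n";
}
```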

DOM extension usage

Access the underlying DOM node to do more through PHP's DOM extension; see: http://php.net/manual/zh/book.dom.php

/**
 * @var \DOMNode
 */
$oHtml->node

$oHtml->node->childNodes
$oHtml->node->parentNode
$oHtml->node->firstChild
$oHtml->node->lastChild
// and so on...
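
To illustrate the kind of properties that become reachable through `node`, the sketch below uses a plain DOMDocument from PHP's DOM extension rather than HtmlParser itself, so it runs without the library. The sample markup is invented for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul><li>one</li><li>two</li></ul></body></html>');

// A DOMNode, comparable to what the node property above exposes.
$ul = $doc->getElementsByTagName('ul')->item(0);

echo $ul->parentNode->nodeName, "\n";     // name of the enclosing element
echo $ul->childNodes->length, "\n";       // number of direct children
echo $ul->firstChild->textContent, "\n";  // text of the first child
echo $ul->lastChild->textContent, "\n";   // text of the last child
```

Note that `childNodes` also counts text nodes, so markup with whitespace between tags would report more children than elements.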
