php解析html类库simple_html_dom(爬虫相关)

下载地址：https://github.com/samacs/simple_html_dom

解析器不仅仅只是帮助我们验证html文档；更能解析不符合W3C标准的html文档。它使用了类似jQuery的元素选择器，通过元素的id，class，tag等等来查找定位；同时还提供添加、删除、修改文档树的功能。当然，这样一款强大的html Dom解析器也不是尽善尽美；在使用的过程中需要十分小心内存消耗的情况。不过，不要担心；本文中，笔者在最后会为各位介绍如何避免消耗过多的内存。

开始使用

上传类文件以后，有三种方式调用这个类：

从url中加载html文档

从字符串中加载html文档

从文件中加载html文档

<?php

// 新建一个Dom实例

$html =newsimple_html_dom();

// 从url中加载

$html->load_file(‘http://www.jb51.net’);

// 从字符串中加载

$html->load(‘

从字符串中加载html文档演示’);

//从文件中加载

$html->load_file(‘path/file/test.html’);

如果从字符串加载html文档，需要先从网络上下载。建议使用CURL来抓取html文档并加载DOM中。

PHP Simple HTML DOM Parser提供了3种方式来创建DOM对象 :

// Create a DOM object from a string

$html = str_get_html(‘Hello!’);

// Create a DOM object from a URL

$html = file_get_html(‘http://www.google.com/’);

// Create a DOM object from a HTML file

$html = file_get_html(‘test.htm’);

查找html元素

可以使用find函数来查找html文档中的元素。返回的结果是一个包含了对象的数组。我们使用HTML DOM解析类中的函数来访问这些对象，下面给出几个示例：

<?php

//查找html文档中的超链接元素

$a = $html->find(‘a’);

//查找文档中第(N)个超链接，如果没有找到则返回空数组.

$a = $html->find(‘a’,0);

// 查找id为main的div元素

$main = $html->find(‘div[id=main]’,0);

// 查找所有包含有id属性的div元素

$divs = $html->find(‘div[id]’);

// 查找所有包含有id属性的元素

$divs = $html->find(‘[id]’);

还可以使用类似jQuery的选择器来查找定位元素：

<?php

// 查找id=’#container’的元素

$ret = $html->find(‘#container’);

// 找到所有class=foo的元素

$ret = $html->find(‘.foo’);

// 查找多个html标签

$ret = $html->find(‘a, img’);

// 还可以这样用

$ret = $html->find(‘a[title], img[title]’);

解析器支持对子元素的查找

<?php

// 查找 ul列表中所有的li项

$ret = $html->find(‘ul li’);

//查找 ul 列表指定class=selected的li项

$ret = $html->find(‘ul li.selected’);

如果你觉得这样用起来麻烦，使用内置函数可以轻松定位元素的父元素、子元素与相邻元素

<?php

// 返回父元素

$e->parent;

// 返回子元素数组

$e->children;

// 通过索引号返回指定子元素

$e->children(0);

// 返回第一个资源速

$e->first_child ();

// 返回最后一个子元素

$e->last _child ();

// 返回上一个相邻元素

$e->prev_sibling ();

//返回下一个相邻元素

$e->next_sibling ();

元素属性操作

使用简单的正则表达式来操作属性选择器。

[attribute] – 选择包含某属性的html元素

[attribute=value] – 选择所有指定值属性的html元素

[attribute!=value]- 选择所有非指定值属性的html元素

[attribute^=value] -选择所有指定值开头属性的html元素

[attribute$=value] 选择所有指定值结尾属性的html元素

[attribute*=value] -选择所有包含指定值属性的html元素

在解析器中调用元素属性

在DOM中元素属性也是对象：

<?php

// 本例中将$a的锚链接值赋给$link变量

$link = $a->href;

或者：

<?php

$link = $html->find(‘a’,0)->href;

每个对象都有4个基本对象属性:

tag – 返回html标签名

innertext – 返回innerHTML

outertext – 返回outerHTML

plaintext – 返回html标签中的文本

在解析器中编辑元素

编辑元素属性的用法和调用它们是类似的：

<?php

//给$a的锚链接赋新值

$a->href =’http://www.jb51.net’;

// 删除锚链接

$a->href =null;

// 检测是否存在锚链接

if(isset($a->href)) {

//代码

}

解析器中没有专门的方法来添加、删除元素，不过可以变通一下使用：

<?php

// 封装元素

$e->outertext =’

‘. $e->outertext .’

‘;

// 删除元素

$e->outertext =”;

// 添加元素

$e->outertext = $e->outertext .’

foo

‘;

// 插入元素

$e->outertext =’

foo

‘. $e->outertext;

保存修改后的html DOM文档也非常简单：

<?php

$doc = $html;

// 输出

echo$doc;

如何避免解析器消耗过多内存

在本文的开篇中，笔者就提到了Simple HTML DOM解析器消耗内存过多的问题。如果php脚本占用内存太多，会导致网站停止响应等一系列严重的问题。解决的方法也很简单，在解析器加载html文档并使用完成后，记得清理掉这个对象就可以了。当然，也不要把问题看得太严重了。如果只是加载了2、3个文档，清理或不清理是没有多大区别的。当你加载了5个10个甚至更多的文档的时候，用完一个就清理一下内存

<?php

$html->clear();

一个实例:

简单范例

<?PHP

include”simple_html_dom.php”;//加载simple_html_dom.php文件

$html = file_get_html(‘http://www.google.com/’);//获取html

$dom =newsimple_html_dom();//new simple_html_dom对象

$dom->load($html)//加载html

// Find all images

foreach($dom->find(‘img’)as$element) {

//获取img标签数组

echo$element->src .’
‘;//获取每个img标签中的src

}

// Find all links

foreach($dom->find(‘a’)as$element){//获取a标签的数组

echo$element->href .’
‘;//获取每个a标签中的href

}

$html = file_get_html(‘http://slashdot.org/’);//获取html

$dom =newsimple_html_dom();//new simple_html_dom对象

$dom->load($html);//加载html

// Find all article blocks

foreach($dom->find(‘div.article’)as$article) {

$item[‘title’] = $article->find(‘div.title’,0)->plaintext;//plaintext 获取纯文本

$item[‘intro’] = $article->find(‘div.intro’,0)->plaintext;

$item[‘details’] = $article->find(‘div.details’,0)->plaintext;

$articles[] = $item;

}

print_r($articles);

// Create DOM from string

$html = str_get_html(‘

Hello

World

‘);

$dom =newsimple_html_dom();//new simple_html_dom对象

$dom->load($html);//加载html

$dom->find(‘div’,1)->class =’bar’;//class = 赋值给第二个div的class赋值

$dom->find(‘div[id=hello]’,0)->innertext =’foo’;//innertext内部文本

echo$dom;

//Output:

fooWorld

DOM methods & properties

Name Description

void __construct ( [string $filename] ) 构造函数,将文件名参数将自动加载内容,无论是文本或文件/ url。

string plaintext 纯文本

void clear () 清理内存

void load ( string $content ) 加载内容

string save ( [string $filename] ) Dumps the internal DOM tree back into a string. If the $filename is set, result string will save to file.

void load_file ( string $filename ) Load contents from a from a file or a URL.

void set_callback ( string $function_name ) 设置一个回调函数。

mixed find ( string $selector [, int $index] ) 找到元素的CSS选择器。返回第n个元素对象如果索引设置,否则返回一个数组对象。

4.find 方法详细介绍

find ( string$selector [, int $index] )

// Find all anchors, returns a array of element objects a标签数组

$ret = $html->find(‘a’);

// Find (N)th anchor, returns element object or null if not found (zero based)第一个a标签

$ret = $html->find(‘a’, 0);

// Find lastest anchor, returns element object or null if not found (zero based)最后一个a标签

$ret = $html->find(‘a’, -1);

// Find all

with the id attribute

$ret = $html->find(‘div[id]’);

// Find all

which attribute id=foo

$ret = $html->find(‘div[id=foo]’);

// Find all element which id=foo

$ret = $html->find(‘#foo’);

// Find all element which class=foo

$ret = $html->find(‘.foo’);

// Find all element has attribute id

$ret = $html->find(‘*[id]’);

// Find all anchors and images a标签与img标签数组

$ret = $html->find(‘a, img’);

// Find all anchors and images with the “title” attribute

$ret = $html->find(‘a[title], img[title]’);

// Find all

$es = $html->find(‘ul li’); ul标签下的li标签数组

// Find Nested

推荐阅读更多精彩内容

电子商务网站开发与建设

概要 64学时 3.5学分章节安排电子商务网站概况 HTML5+CSS3 JavaScript Node 电子…

阿啊阿吖丁阅读 7,297评论 0赞 3
前端面试题总结

HTML HTML5标签媒体查询head部分写法 Doctype作用? 严格模式与混杂模式如何区分？它们有何意义…

Mayo_阅读 471评论 0赞 8
Jquery知识点总结

一：认识jquery jquery是javascript的类库，具有轻量级，完善的文档，丰富的插件支持，完善的Aj…

xuguibin阅读 1,364评论 1赞 7
前端面试题整理

一、理论基础知识部分 1.1、讲讲输入完网址按下回车，到看到网页这个过程中发生了什么 a. 域名解析 b. 发起T…

我家媳妇蠢蠢哒阅读 2,955评论 2赞 106
白雪公主新编

从前，有个王后有一面魔镜，它能回答别人提出的问题。她每天问镜子，“魔镜，魔镜，谁是这世上最美的女人？” 镜子一直回…

胡子长阅读 495评论 0赞 0

51赞52赞

赞赏

更多好文

{“dataManager”:”[]”,”props”:{“isServer”:true,”initialState”:{“global”:{“done”:false,”artFromType”:null,”fontType”:”black”,”$modal”:{“ContributeModal”:false,”RewardListModal”:false,”PayModal”:false,”CollectionModal”:false,”DownloadAppQRModal”:false,”LikeListModal”:false,”ReportModal”:false,”QRCodeShareModal”:false,”BookCatalogModal”:false,”RewardModal”:false},”$ua”:{“value”:”Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36″,”isIE11″:false,”earlyIE”:null,”chrome”:”58.0″,”firefox”:null,”safari”:null,”isMac”:false},”$diamondRate”:{“displayable”:false,”rate”:0},”readMode”:”day”,”locale”:”zh-CN”,”seoList”:[{“comments_count”:0,”public_abbr”:”概要 64学时 3.5学分章节安排电子商务网站概况 HTML5+CSS3 JavaScript Node 电子…”,”share_image_url”:”http://upload-images.jianshu.io/upload_images/1783296-5906996156ee9659.png”,”slug”:”4168ba8ac876″,”user”:{“id”:1783296,”nickname”:”阿啊阿吖丁”,”slug”:”6deb978686b8″,”avatar”:”https://upload.jianshu.io/users/upload_avatars/1783296/ba1eeead-3337-4a81-a436-7f68333782a8.jpg”},”likes_count”:3,”title”:”电子商务网站开发与建设”,”id”:24561113,”views_count”:7297},{“comments_count”:0,”public_abbr”:”HTML HTML5标签媒体查询head部分写法 Doctype作用? 严格模式与混杂模式如何区分？它们有何意义…”,”share_image_url”:””,”slug”:”0c336298d217″,”user”:{“id”:2406704,”nickname”:”Mayo_”,”slug”:”d886c454b73c”,”avatar”:”https://cdn2.jianshu.io/assets/default_avatar/9-cceda3cf5072bcdd77e8ca4f21c40998.jpg”},”likes_count”:8,”title”:”前端面试题总结”,”id”:4973479,”views_count”:471},{“comments_count”:1,”public_abbr”:”一：认识jquery jquery是javascript的类库，具有轻量级，完善的文档，丰富的插件支持，完善的Aj…”,”share_image_url”:””,”slug”:”5e78b39c27aa”,”user”:{“id”:10190477,”nickname”:”xuguibin”,”slug”:”1fc8338c6094″,”avatar”:”https://cdn2.jianshu.io/assets/default_avatar/9-cceda3cf5072bcdd77e8ca4f21c40998.jpg”},”likes_count”:7,”title”:”Jquery知识点总结”,”id”:28573207,”views_count”:1364},{“comments_count”:2,”public_abbr”:”一、理论基础知识部分 1.1、讲讲输入完网址按下回车，到看到网页这个过程中发生了什么 a. 域名解析 b. 发起T…”,”share_image_url”:””,”slug”:”4de489a24438″,”user”:{“id”:9635602,”nickname”:”我家媳妇蠢蠢哒”,”slug”:”fd376b245c44″,”avatar”:”https://cdn2.jianshu.io/assets/default_avatar/2-9636b13945b9ccf345bc98d0d81074eb.jpg”},”likes_count”:106,”title”:”前端面试题整理”,”id”:25986160,”views_count”:2955},{“comments_count”:0,”public_abbr”:”从前，有个王后有一面魔镜，它能回答别人提出的问题。她每天问镜子，“魔镜，魔镜，谁是这世上最美的女人？” 镜子一直回…”,”share_image_url”:””,”slug”:”6b444d2b75b7″,”user”:{“id”:53964,”nickname”:”胡子长”,”slug”:”98be4471ca96″,”avatar”:”https://upload.jianshu.io/users/upload_avatars/53964/2bab407a8d9c”},”likes_count”:0,”title”:”白雪公主新编”,”id”:20311837,”views_count”:495}]},”note”:{“data”:{“is_author”:false,”last_updated_at”:1546828691,”public_title”:”php解析html类库simple_html_dom(爬虫相关)”,”purchased”:false,”liked_note”:false,”comments_count”:0,”free_content”:”u003cpu003e下载地址：u003ca href=”https://github.com/samacs/simple_html_dom” target=”_blank” rel=”nofollow”u003ehttps://github.com/samacs/simple_html_domu003c/au003eu003c/pu003eu003cpu003e解析器不仅仅只是帮助我们验证html文档；更能解析不符合W3C标准的html文档。它使用了类似jQuery的元素选择器，通过元素的id，class，tag等等来查找定位；同时还提供添加、删除、修改文档树的功能。当然，这样一款强大的html Dom解析器也不是尽善尽美；在使用的过程中需要十分小心内存消耗的情况。不过，不要担心；本文中，笔者在最后会为各位介绍如何避免消耗过多的内存。u003c/pu003eu003cpu003e开始使用u003c/pu003eu003cpu003eu003cbu003e上传类文件以后，有三种方式调用这个类：u003c/bu003eu003c/pu003eu003cpu003e从url中加载html文档u003c/pu003eu003cpu003e从字符串中加载html文档u003c/pu003eu003cpu003e从文件中加载html文档u003c/pu003eu003cpu003eu0026lt;?phpu003c/pu003eu003cpu003e// 新建一个Dom实例u003c/pu003eu003cpu003e$html =newsimple_html_dom();u003c/pu003eu003cpu003e// 从url中加载u003c/pu003eu003cpu003e$html-u0026gt;load_file(‘http://www.jb51.net’);u003c/pu003eu003cpu003e// 从字符串中加载u003c/pu003eu003cpu003e$html-u0026gt;load(‘u0026lt;htmlu0026gt;u0026lt;bodyu0026gt;从字符串中加载html文档演示u0026lt;/bodyu0026gt;u0026lt;/htmlu0026gt;’);u003c/pu003eu003cpu003e//从文件中加载u003c/pu003eu003cpu003e$html-u0026gt;load_file(‘path/file/test.html’);u003c/pu003eu003cpu003e?u0026gt;u003c/pu003eu003cpu003e如果从字符串加载html文档，需要先从网络上下载。建议使用CURL来抓取html文档并加载DOM中。u003c/pu003eu003cpu003ePHP Simple HTML DOM Parser提供了3种方式来创建DOM对象 :u003c/pu003eu003cpu003e// Create a DOM object from a stringu003c/pu003eu003cpu003e$html = str_get_html(‘Hello!’);u003c/pu003eu003cpu003e// Create a DOM object from a URLu003c/pu003eu003cpu003e$html = file_get_html(‘http://www.google.com/’);u003c/pu003eu003cpu003e// Create a DOM object from a HTML fileu003c/pu003eu003cpu003e$html = file_get_html(‘test.htm’);u003c/pu003eu003cpu003eu003cbu003e查找html元素u003c/bu003eu003c/pu003eu003cpu003e可以使用find函数来查找html文档中的元素。返回的结果是一个包含了对象的数组。我们使用HTML DOM解析类中的函数来访问这些对象，下面给出几个示例：u003c/pu003eu003cpu003eu0026lt;?phpu003c/pu003eu003cpu003e//查找html文档中的超链接元素u003c/pu003eu003cpu003e$a = $html-u0026gt;find(‘a’);u003c/pu003eu003cpu003e//查找文档中第(N)个超链接，如果没有找到则返回空数组.u003c/pu003eu003cpu003e$a = $html-u0026gt;find(‘a’,0);u003c/pu003eu003cpu003e// 查找id为main的div元素u003c/pu003eu003cpu003e$main = $html-u0026gt;find(‘div[id=main]’,0);u003c/pu003eu003cpu003e// 查找所有包含有id属性的div元素u003c/pu003eu003cpu003e$divs = $html-u0026gt;find(‘div[id]’);u003c/pu003eu003cpu003e// 查找所有包含有id属性的元素u003c/pu003eu003cpu003e$divs = $html-u0026gt;find(‘[id]’);u003c/pu003eu003cpu003e?u0026gt;u003c/pu003eu003cpu003e还可以使用类似jQuery的选择器来查找定位元素：u003c/pu003eu003cpu003eu0026lt;?phpu003c/pu003eu003cpu003e// 查找id=’#container’的元素u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘#container’);u003c/pu003eu003cpu003e// 找到所有class=foo的元素u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘.foo’);u003c/pu003eu003cpu003e// 查找多个html标签u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘a, img’);u003c/pu003eu003cpu003e// 还可以这样用u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘a[title], img[title]’);u003c/pu003eu003cpu003e?u0026gt;u003c/pu003eu003cpu003e解析器支持对子元素的查找u003c/pu003eu003cpu003eu0026lt;?phpu003c/pu003eu003cpu003e// 查找 ul列表中所有的li项u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘ul li’);u003c/pu003eu003cpu003e//查找 ul 列表指定class=selected的li项u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘ul li.selected’);u003c/pu003eu003cpu003e?u0026gt;u003c/pu003eu003cpu003e如果你觉得这样用起来麻烦，使用内置函数可以轻松定位元素的父元素、子元素与相邻元素u003c/pu003eu003cpu003eu0026lt;?phpu003c/pu003eu003cpu003e// 返回父元素u003c/pu003eu003cpu003e$e-u0026gt;parent;u003c/pu003eu003cpu003e// 返回子元素数组u003c/pu003eu003cpu003e$e-u0026gt;children;u003c/pu003eu003cpu003e// 通过索引号返回指定子元素u003c/pu003eu003cpu003e$e-u0026gt;children(0);u003c/pu003eu003cpu003e// 返回第一个资源速u003c/pu003eu003cpu003e$e-u0026gt;first_child ();u003c/pu003eu003cpu003e// 返回最后一个子元素u003c/pu003eu003cpu003e$e-u0026gt;last _child ();u003c/pu003eu003cpu003e// 返回上一个相邻元素u003c/pu003eu003cpu003e$e-u0026gt;prev_sibling ();u003c/pu003eu003cpu003e//返回下一个相邻元素u003c/pu003eu003cpu003e$e-u0026gt;next_sibling ();u003c/pu003eu003cpu003e?u0026gt;u003c/pu003eu003cpu003eu003cbu003e元素属性操作u003c/bu003eu003c/pu003eu003cpu003e使用简单的正则表达式来操作属性选择器。u003c/pu003eu003cpu003e[attribute] – 选择包含某属性的html元素u003c/pu003eu003cpu003e[attribute=value] – 选择所有指定值属性的html元素u003c/pu003eu003cpu003e[attribute!=value]- 选择所有非指定值属性的html元素u003c/pu003eu003cpu003e[attribute^=value] -选择所有指定值开头属性的html元素u003c/pu003eu003cpu003e[attribute$=value] 选择所有指定值结尾属性的html元素u003c/pu003eu003cpu003e[attribute*=value] -选择所有包含指定值属性的html元素u003c/pu003eu003cpu003e在解析器中调用元素属性u003c/pu003eu003cpu003e在DOM中元素属性也是对象：u003c/pu003eu003cpu003eu0026lt;?phpu003c/pu003eu003cpu003e// 本例中将$a的锚链接值赋给$link变量u003c/pu003eu003cpu003e$link = $a-u0026gt;href;u003c/pu003eu003cpu003e?u0026gt;u003c/pu003eu003cpu003e或者：u003c/pu003eu003cpu003eu0026lt;?phpu003c/pu003eu003cpu003e$link = $html-u0026gt;find(‘a’,0)-u0026gt;href;u003c/pu003eu003cpu003e?u003c/pu003eu003cpu003e每个对象都有4个基本对象属性:u003c/pu003eu003cpu003etag – 返回html标签名u003c/pu003eu003cpu003einnertext – 返回innerHTMLu003c/pu003eu003cpu003eoutertext – 返回outerHTMLu003c/pu003eu003cpu003eplaintext – 返回html标签中的文本u003c/pu003eu003cpu003e在解析器中编辑元素u003c/pu003eu003cpu003e编辑元素属性的用法和调用它们是类似的：u003c/pu003eu003cpu003eu0026lt;?phpu003c/pu003eu003cpu003e//给$a的锚链接赋新值u003c/pu003eu003cpu003e$a-u0026gt;href =’http://www.jb51.net’;u003c/pu003eu003cpu003e// 删除锚链接u003c/pu003eu003cpu003e$a-u0026gt;href =null;u003c/pu003eu003cpu003e// 检测是否存在锚链接u003c/pu003eu003cpu003eif(isset($a-u0026gt;href)) {u003c/pu003eu003cpu003e//代码u003c/pu003eu003cpu003e}u003c/pu003eu003cpu003e?u0026gt;u003c/pu003eu003cpu003e解析器中没有专门的方法来添加、删除元素，不过可以变通一下使用：u003c/pu003eu003cpu003eu0026lt;?phpu003c/pu003eu003cpu003e// 封装元素u003c/pu003eu003cpu003e$e-u0026gt;outertext =’u0026lt;div class=”wrap”u0026gt;’. $e-u0026gt;outertext .’u0026lt;divu0026gt;’;u003c/pu003eu003cpu003e// 删除元素u003c/pu003eu003cpu003e$e-u0026gt;outertext =”;u003c/pu003eu003cpu003e// 添加元素u003c/pu003eu003cpu003e$e-u0026gt;outertext = $e-u0026gt;outertext .’u0026lt;divu0026gt;foou0026lt;divu0026gt;’;u003c/pu003eu003cpu003e// 插入元素u003c/pu003eu003cpu003e$e-u0026gt;outertext =’u0026lt;divu0026gt;foou0026lt;divu0026gt;’. $e-u0026gt;outertext;u003c/pu003eu003cpu003e?u0026gt;u003c/pu003eu003cpu003e保存修改后的html DOM文档也非常简单：u003c/pu003eu003cpu003eu0026lt;?phpu003c/pu003eu003cpu003e$doc = $html;u003c/pu003eu003cpu003e// 输出u003c/pu003eu003cpu003eecho$doc;u003c/pu003eu003cpu003e?u0026gt;u003c/pu003eu003cpu003eu003cbu003e如何避免解析器消耗过多内存u003c/bu003eu003c/pu003eu003cpu003e在本文的开篇中，笔者就提到了Simple HTML DOM解析器消耗内存过多的问题。如果php脚本占用内存太多，会导致网站停止响应等一系列严重的问题。解决的方法也很简单，在解析器加载html文档并使用完成后，记得清理掉这个对象就可以了。当然，也不要把问题看得太严重了。如果只是加载了2、3个文档，清理或不清理是没有多大区别的。当你加载了5个10个甚至更多的文档的时候，用完一个就清理一下内存u003c/pu003eu003cpu003eu0026lt;?phpu003c/pu003eu003cpu003e$html-u0026gt;clear();u003c/pu003eu003cpu003e?u0026gt;u003c/pu003eu003cpu003e一个实例:u003c/pu003eu003cpu003eu003c/pu003eu003cpu003e简单范例u003c/pu003eu003cpu003eu003c/pu003eu003cpu003eu0026lt;?PHPu003c/pu003eu003cpu003einclude”simple_html_dom.php”;//加载simple_html_dom.php文件 u003c/pu003eu003cpu003e$html = file_get_html(‘http://www.google.com/’);//获取html u003c/pu003eu003cpu003e$dom =newsimple_html_dom();//new simple_html_dom对象 u003c/pu003eu003cpu003e$dom-u0026gt;load($html)//加载html u003c/pu003eu003cpu003e// Find all images u003c/pu003eu003cpu003eforeach($dom-u0026gt;find(‘img’)as$element) {u003c/pu003eu003cpu003e//获取img标签数组 u003c/pu003eu003cpu003eecho$element-u0026gt;src .’u0026lt;bru0026gt;’;//获取每个img标签中的srcu003c/pu003eu003cpu003et} u003c/pu003eu003cpu003e// Find all links u003c/pu003eu003cpu003eforeach($dom-u0026gt;find(‘a’)as$element){//获取a标签的数组 u003c/pu003eu003cpu003eecho$element-u0026gt;href .’u0026lt;bru0026gt;’;//获取每个a标签中的href u003c/pu003eu003cpu003et}u003c/pu003eu003cpu003e$html = file_get_html(‘http://slashdot.org/’);//获取html u003c/pu003eu003cpu003e$dom =newsimple_html_dom();//new simple_html_dom对象 u003c/pu003eu003cpu003e$dom-u0026gt;load($html);//加载html u003c/pu003eu003cpu003e// Find all article blocks u003c/pu003eu003cpu003eforeach($dom-u0026gt;find(‘div.article’)as$article) {u003c/pu003eu003cpu003e$item[‘title’] = $article-u0026gt;find(‘div.title’,0)-u0026gt;plaintext;//plaintext 获取纯文本u003c/pu003eu003cpu003e$item[‘intro’] = $article-u0026gt;find(‘div.intro’,0)-u0026gt;plaintext;u003c/pu003eu003cpu003e$item[‘details’] = $article-u0026gt;find(‘div.details’,0)-u0026gt;plaintext;u003c/pu003eu003cpu003e $articles[] = $item;u003c/pu003eu003cpu003et}u003c/pu003eu003cpu003eprint_r($articles);u003c/pu003eu003cpu003e// Create DOM from string u003c/pu003eu003cpu003e$html = str_get_html(‘u0026lt;div id=”hello”u0026gt;Hellou0026lt;/divu0026gt;u0026lt;div id=”world”u0026gt;Worldu0026lt;/divu0026gt;’);u003c/pu003eu003cpu003e$dom =newsimple_html_dom();//new simple_html_dom对象u0026lt;/pu0026gt;u0026lt;pu0026gt; u003c/pu003eu003cpu003e$dom-u0026gt;load($html);//加载html u003c/pu003eu003cpu003e$dom-u0026gt;find(‘div’,1)-u0026gt;class =’bar’;//class = 赋值给第二个div的class赋值u0026lt;/pu0026gt;u0026lt;pu0026gt; u003c/pu003eu003cpu003e$dom-u0026gt;find(‘div[id=hello]’,0)-u0026gt;innertext =’foo’;//innertext内部文本u0026lt;/pu0026gt;u0026lt;pu0026gt; u003c/pu003eu003cpu003eecho$dom;u003c/pu003eu003cpu003eu003cbru003eu003c/pu003eu003cpu003e //Output:u003c/pu003eu003cpu003efooWorldu003c/pu003eu003cpu003eu003c/pu003eu003cpu003eDOM methods u0026amp; propertiesu003c/pu003eu003cpu003eu003c/pu003eu003cpu003e tName Description u003c/pu003eu003cpu003e tvoid __construct ( [string $filename] ) 构造函数,将文件名参数将自动加载内容,无论是文本或文件/ url。 u003c/pu003eu003cpu003e tstring plaintext 纯文本 u003c/pu003eu003cpu003e tvoid clear () 清理内存 u003c/pu003eu003cpu003e tvoid load ( string $content ) 加载内容 u003c/pu003eu003cpu003e tstring save ( [string $filename] ) Dumps the internal DOM tree back into a string. If the $filename is set, result string will save to file. u003c/pu003eu003cpu003e tvoid load_file ( string $filename ) Load contents from a from a file or a URL. u003c/pu003eu003cpu003e tvoid set_callback ( string $function_name ) 设置一个回调函数。 u003c/pu003eu003cpu003emixed find ( string $selector [, int $index] ) 找到元素的CSS选择器。返回第n个元素对象如果索引设置,否则返回一个数组对象。u003c/pu003eu003cpu003e 4.find 方法详细介绍u0026lt;/pu0026gt;u0026lt;pu0026gt; u003c/pu003eu003cpu003efind ( string$selector [, int $index] ) u003c/pu003eu003cpu003e// Find all anchors, returns a array of element objects a标签数组 u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘a’);u0026lt;/pu0026gt;u0026lt;pu0026gt; // Find (N)th anchor, returns element object or null if not found (zero based)第一个a标签 u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘a’, 0);u0026lt;/pu0026gt;u0026lt;pu0026gt; // Find lastest anchor, returns element object or null if not found (zero based)最后一个a标签 u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘a’, -1); u0026lt;/pu0026gt;u0026lt;pu0026gt; // Find all u0026lt;divu0026gt; with the id attribute u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘div[id]’);u0026lt;/pu0026gt;u0026lt;pu0026gt; // Find all u0026lt;divu0026gt; which attribute id=foo u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘div[id=foo]’); u0026lt;/pu0026gt;u0026lt;pu0026gt; u003c/pu003eu003cpu003e// Find all element which id=foo u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘#foo’);u0026lt;/pu0026gt;u0026lt;pu0026gt; // Find all element which class=foo u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘.foo’);u0026lt;/pu0026gt;u0026lt;pu0026gt; // Find all element has attribute id u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘*[id]’); u0026lt;/pu0026gt;u0026lt;pu0026gt; // Find all anchors and images a标签与img标签数组 u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘a, img’); u0026lt;/pu0026gt;u0026lt;pu0026gt; // Find all anchors and images with the “title” attribute u003c/pu003eu003cpu003e$ret = $html-u0026gt;find(‘a[title], img[title]’);u0026lt;/pu0026gt;u0026lt;pu0026gt; u003c/pu003eu003cpu003e// Find all u0026lt;liu0026gt; in u0026lt;ulu0026gt; u003c/pu003eu003cpu003e$es = $html-u0026gt;find(‘ul li’); ul标签下的li标签数组u0026lt;/pu0026gt;u0026lt;pu0026gt; // Find Nested u0026lt;divu0026gt; tags u003c/pu003eu003cpu003e$es = $html-u0026gt;find(‘div div div’); div标签下div标签下div标签数组u0026lt;/pu0026gt;u0026lt;pu0026gt; // Find all u0026lt;tdu0026gt; in u0026lt;tableu0026gt; which class=hello u003c/pu003eu003cpu003e$es = $html-u0026gt;find(‘table.hello td’); table标签下td标签数组u0026lt;/pu0026gt;u0026lt;pu0026gt; // Find all td tags with attribite align=center in table tags u003c/pu003eu003cpu003e$es = $html-u0026gt;find(”table td[align=center]’); u0026lt;/pu0026gt;u0026lt;pu0026gt; u003c/pu003eu003cpu003e 5.Element 的方法 u003c/pu003eu003cpu003e$e = $html-u0026gt;find(“div”, 0); //$e 所拥有的方法如下表所示 u003c/pu003eu003cpu003e Attribute Name Usage u003c/pu003eu003cpu003e$e-u0026gt;tag 标签 u003c/pu003eu003cpu003e$e-u0026gt;outertext 外文本 u003c/pu003eu003cpu003e$e-u0026gt;innertext 内文本 u003c/pu003eu003cpu003e$e-u0026gt;plaintext 纯文本 u0026lt;/pu0026gt;u0026lt;pu0026gt; u0026lt;/pu0026gt;u0026lt;pu0026gt; // Example u003c/pu003eu003cpu003e$html = str_get_html(“u0026lt;divu0026gt;foo u0026lt;bu0026gt;baru0026lt;/bu0026gt;u0026lt;/divu0026gt;”); u003c/pu003eu003cpu003eecho $e-u0026gt;tag; // Returns: ” div” u003c/pu003eu003cpu003eecho $e-u0026gt;outertext; // Returns: ” u0026lt;divu0026gt;foo u0026lt;bu0026gt;baru0026lt;/bu0026gt;u0026lt;/divu0026gt;” u003c/pu003eu003cpu003eecho $e-u0026gt;innertext; // Returns: ” foo u0026lt;bu0026gt;baru0026lt;/bu0026gt;” u003c/pu003eu003cpu003eecho $e-u0026gt;plaintext; // Returns: ” foo bar”u0026lt;/pu0026gt;u0026lt;pu0026gt; u003c/pu003eu003cpu003e 6.DOM traversing 方法 u003c/pu003eu003cpu003e Method Description u003c/pu003eu003cpu003emixed$e-u0026gt;children ( [int $index] ) 子元素 u003c/pu003eu003cpu003eelement$e-u0026gt;parent () 父元素 u003c/pu003eu003cpu003eelement$e-u0026gt;first_child () 第一个子元素 u003c/pu003eu003cpu003eelement$e-u0026gt;last_child () 最后一个子元素 u003c/pu003eu003cpu003eelement$e-u0026gt;next_sibling () 后一个兄弟元素 u003c/pu003eu003cpu003eelement$e-u0026gt;prev_sibling () 前一个兄弟元素 u0026lt;/pu0026gt;u0026lt;pu0026gt; u003c/pu003eu003cpu003e// Example u003c/pu003eu003cpu003eecho $html-u0026gt;find(“#div1”, 0)-u0026gt;children(1)-u0026gt;children(1)-u0026gt;children(2)-u0026gt;id; u003c/pu003eu003cpu003e// or u003c/pu003eu003cpu003eecho $html-u0026gt;getElementById(“div1″)-u0026gt;childNodes(1)-u0026gt;childNodes(1)-u0026gt;childNodes(2)-u0026gt;getAttribute(‘id’); u003c/pu003eu003cpu003eu0026lt;/pu0026gt; u003c/pu003e”,”voted_down”:false,”rewardable”:true,”show_paid_comment_tips”:false,”share_image_url”:”https://cdn2.jianshu.io/assets/default_avatar/13-394c31a9cb492fcb39c27422ca7d2815.jpg”,”slug”:”d35ddd13cd1b”,”user”:{“liked_by_user”:false,”following_count”:93,”gender”:1,”avatar_widget”:null,”slug”:”88e45a0050f0″,”intro”:”我是七彩，一个喜欢胡思乱想的人！”,”likes_count”:512,”nickname”:”七彩邪云”,”badges”:[],”total_fp_amount”:”182835236693230376327″,”wordage”:26340,”user_ip_addr”:”广东”,”avatar”:”https://cdn2.jianshu.io/assets/default_avatar/13-394c31a9cb492fcb39c27422ca7d2815.jpg”,”id”:4105217,”liked_user”:false},”likes_count”:51,”paid_type”:”free”,”show_ads”:true,”paid_content_accessible”:false,”hide_search_input”:false,”total_fp_amount”:”1602000000000000000″,”trial_open”:false,”reprintable”:true,”vip_note”:false,”bookmarked”:false,”wordage”:2223,”featured_comments_count”:0,”more_notes”:0,”downvotes_count”:0,”wangxin_trial_open”:null,”guideShow”:{“audit_user_nickname_spliter”:0,”pc_note_bottom_btn”:1,”pc_like_author_guidance”:1,”show_more_notes”:0,”h5_real_name_auth_link”:1,”audit_user_background_image_spliter”:0,”audit_note_spliter”:0,”new_user_no_ads”:1,”audit_post_spliter”:0,”include_post”:0,”pc_login_guidance”:1,”audit_comment_spliter”:0,”pc_note_bottom_qrcode”:1,”audit_user_avatar_spliter”:0,”audit_collection_spliter”:0,”show_vips_link”:1,”pc_top_lottery_guidance”:1,”subscription_guide_entry”:1,”creation_muti_function_on”:1,”pdd_ad_percent”:0,”explore_score_searcher”:1,”matched_split_checker”:0,”audit_user_spliter”:0,”h5_ab_test”:1,”split_checker_percent”:0,”home_index_ad”:0,”pc_note_popup”:2},”commentable”:true,”total_rewards_count”:0,”id”:39542187,”notebook”:{“name”:””},”activity_collection_slug”:null,”description”:”下载地址：https://github.com/samacs/simple_html_dom 解析器不仅仅只是帮助我们验证html文档；更能解析不符合W3C标准的html文档…”,”first_shared_at”:1546828691,”views_count”:6145,”notebook_id”:33092598},”baseList”:{“likeList”:[],”rewardList”:[]},”status”:”success”,”statusCode”:0},”user”:{“isLogin”:false,”userInfo”:{}},”comments”:{“list”:[],”featuredList”:[]}},”initialProps”:{“pageProps”:{“query”:{“slug”:”d35ddd13cd1b”}},”localeData”:{“common”:{“jianshu”:”简书”,”diamond”:”简书钻”,”totalAssets”:”总资产{num}”,”diamondValue”:” (约{num}元)”,”login”:”登录”,”logout”:”注销”,”register”:”注册”,”on”:”开”,”off”:”关”,”follow”:”关注”,”followBook”:”关注连载”,”following”:”已关注”,”cancelFollow”:”取消关注”,”publish”:”发布”,”wordage”:”字数”,”audio”:”音频”,”read”:”阅读”,”reward”:”赞赏”,”zan”:”赞”,”comment”:”评论”,”expand”:”展开”,”prevPage”:”上一页”,”nextPage”:”下一页”,”userIp”:”IP属地: {ip}”,”floor”:”楼”,”confirm”:”确定”,”delete”:”删除”,”report”:”举报”,”fontSong”:”宋体”,”fontBlack”:”黑体”,”chs”:”简体”,”cht”:”繁体”,”jianChat”:”简信”,”postRequest”:”投稿请求”,”likeAndZan”:”喜欢和赞”,”rewardAndPay”:”赞赏和付费”,”home”:”我的主页”,”markedNotes”:”收藏的文章”,”likedNotes”:”喜欢的文章”,”paidThings”:”已购内容”,”wallet”:”我的钱包”,”setting”:”设置”,”feedback”:”帮助与反馈”,”loading”:”加载中…”,”needLogin”:”请登录后进行操作”,”trialing”:”文章正在审核中…”,”reprintTip”:”禁止转载，如需转载请通过简信或评论联系作者。”},”error”:{“rewardSelf”:”无法打赏自己的文章哟~”},”message”:{“paidNoteTip”:”付费购买后才可以参与评论哦”,”CommentDisableTip”:”作者关闭了评论功能”,”contentCanNotEmptyTip”:”回复内容不能为空”,”addComment”:”评论发布成功”,”deleteComment”:”评论删除成功”,”likeComment”:”评论点赞成功”,”setReadMode”:”阅读模式设置成功”,”setFontType”:”字体设置成功”,”setLocale”:”显示语言设置成功”,”follow”:”关注成功”,”cancelFollow”:”取消关注成功”,”copySuccess”:”复制代码成功”},”header”:{“homePage”:”首页”,”download”:”下载APP”,”discover”:”发现”,”message”:”消息”,”reward”:”赞赏支持”,”editNote”:”编辑文章”,”writeNote”:”写文章”,”techarea”:”IT技术”,”vips”:”会员”},”note”:{},”noteMeta”:{“lastModified”:”最后编辑于 “,”wordage”:”字数 {num}”,”viewsCount”:”阅读 {num}”,”userIp”:”IP属地: {ip}”},”divider”:{“selfText”:”以下内容为付费内容，定价 ¥{price}”,”paidText”:”已付费，可查看以下内容”,”notPaidText”:”还有 {percent} 的精彩内容”,”modify”:”点击修改”},”paidPanel”:{“buyNote”:”支付 ¥{price} 继续阅读”,”buyBook”:”立即拿下 ¥{price}”,”freeTitle”:”该作品为付费连载”,”freeText”:”购买即可永久获取连载内的所有内容，包括将来更新的内容”,”paidTitle”:”还没看够？拿下整部连载！”,”paidText”:”永久获得连载内的所有内容, 包括将来更新的内容”,”vipLinkTitle”:”加入会员,18元/月,畅读全站付费文章”},”book”:{“last”:”已是最后”,”lookCatalog”:”查看连载目录”,”header”:”文章来自以下连载”},”action”:{“like”:”{num}人点赞”,”collection”:”收入专题”,”report”:”举报文章”},”comment”:{“allComments”:”全部评论”,”featuredComments”:”精彩评论”,”closed”:”评论已关闭”,”close”:”关闭评论”,”open”:”打开评论”,”desc”:”按时间倒序”,”asc”:”按时间正序”,”disableText1″:”用户已关闭评论，”,”disableText2″:”与Ta简信交流”,”placeholder”:”写下你的评论…”,”publish”:”发表”,”create”:” 添加新评论”,”reply”:” 回复”,”restComments”:”还有{num}条评论，”,”expandImage”:”展开剩余{num}张图”,”deleteText”:”确定要删除评论么？”},”collection”:{“title”:”被以下专题收入，发现更多相似内容”,”putToMyCollection”:”收入我的专题”},”seoList”:{“title”:”推荐阅读”,”more”:”更多精彩内容”},”sideList”:{“title”:”推荐阅读”},”wxShareModal”:{“desc”:”打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮”},”bookChapterModal”:{“try”:”试读”,”toggle”:”切换顺序”},”collectionModal”:{“title”:”收入到我管理的专题”,”search”:”搜索我管理的专题”,”newCollection”:”新建专题”,”create”:”创建”,”nothingFound”:”未找到相关专题”,”loadMore”:”展开查看更多”},”contributeModal”:{“search”:”搜索专题投稿”,”newCollection”:”新建专题”,”addNewOne”:”去新建一个”,”nothingFound”:”未找到相关专题”,”loadMore”:”展开查看更多”,”managed”:”我管理的专题”,”recommend”:”推荐专题”},”QRCodeShow”:{“payTitle”:”微信扫码支付”,”payText”:”支付金额”},”rewardModal”:{“title”:”给作者送糖”,”custom”:”自定义”,”placeholder”:”给Ta留言…”,”choose”:”选择支付方式”,”balance”:”简书余额”,”tooltip”:”网站该功能暂时下线，如需使用，请到简书App操作”,”confirm”:”确认支付”,”success”:”赞赏成功”},”payModal”:{“payBook”:”购买连载”,”payNote”:”购买文章”,”promotion”:”优惠券”,”promotionFetching”:”优惠券获取中…”,”noPromotion”:”无可用优惠券”,”promotionNum”:”{num}张可用”,”noUsePromotion”:”不使用优惠券”,”validPromotion”:”可用优惠券”,”invalidPromotion”:”不可用优惠券”,”total”:”支付总额”,”tip1″:”· 你将购买的商品为虚拟内容服务，购买后不支持退订、转让、退换，请斟酌确认。”,”tip2″:”· 购买后可在“已购内容”中查看和使用。”,”success”:”购买成功”},”reportModal”:{“abuse”:”辱骂、人身攻击等不友善内容”,”minors_forbidden”:”未成年违规内容”,”ad”:”广告及垃圾信息”,”plagiarism”:”抄袭或未授权转载”,”placeholder”:”写下举报的详情情况（选填）”,”it_algorithm_rec”:”涉互联网算法推荐”,”it_violence”:”涉网络暴力有害信息”,”success”:”举报成功”},”guidModal”:{“modalAText”:”相似文章推荐”,”subText”:”下载简书APP，浏览更多相似文章”,”btnAText”:”先不下载，下次再说”,”followOkText”:”关注作者成功！”,”followTextTip”:”下载简书APP，作者更多精彩内容更新及时提醒！”,”followBtn”:”下次再说”,”downloadTipText”:”更多精彩内容，就在简书APP”,”footerDownLoadText”:”下载简书APP”,”modabTitle”:”免费送你2次抽奖机会”,”modalbTip”:”抽取10000收益加成卡，下载简书APP概率翻倍”,”modalbFooterTip”:”下载简书APP，天天参与抽大奖”,”modalReward”:”抽奖”,”scanQrtip”:”扫码下载简书APP”,”downloadAppText”:”下载简书APP，随时随地发现和创作内容”,”redText”:”阅读”,”likesText”:”赞”,”downLoadLeft”:”更多好文”,”leftscanText”:”把文字装进口袋”}},”currentLocale”:”zh-CN”,”asPath”:”/p/d35ddd13cd1b”}},”page”:”/p/[slug]”,”query”:{“slug”:”d35ddd13cd1b”},”buildId”:”1aXlynfG63p3ph1-9X0TZ”,”assetPrefix”:”https://cdn2.jianshu.io/shakespeare”}

版权声明 1 本网站名称：黑猫博客
2 本站永久网址：https://lt2.cc
3 本网站的文章所有内容可能来源于网络，为图片防止失效转存至博客服务器，仅供大家学习与参考。
4 本站一切资源不代表本站立场，并不代表本站授权赞同其观点和对其真实性负责。
5 本站一律禁止以任何方式发布或转载任何违法的相关信息，访客发现请向站长举报
6 本站资源大多存储在云盘，如发现链接失效，请联系我们我们会第一时间更新。
7 本站所有内容皆为百度搜索转在，如有侵权，请联系站长 QQ85997338 进行删除处理。

THE END