Working with the HTML DOM in PHP the Way You Would in JavaScript

Skills · 2022-06-16 · 1791 views

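I currently need to scrape a comics site: its list pages (paginated), its comic detail pages, and its chapter pages. The requests must carry a Cookie so that member-only chapters can be collected. The scraper also has to move on to the next page automatically and be suitable for running as a scheduled task.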

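Step 1: Include the simple_html_dom library

To work with the HTML DOM the way JavaScript does, we use the simple_html_dom library. Download it from GitHub and include the simple_html_dom.php file.

The framework I am using is TP3 (ThinkPHP 3), so simple_html_dom.php has to go in the /ThinkPHP/Library/Org/Util directory and be renamed to simple_html_dom.class.php. (With another framework, or with plain PHP, include it in the file that uses it, following that framework's conventions or your own preference.)

Next, at the top of the file that uses it (below the namespace declaration), add use Org\Util\simple_html_dom;

After that, any method that needs it can call $dom = new simple_html_dom();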

Step 2: Get the page DOM

First we need to fetch the HTML document.

The following method takes a URL and returns the HTML document as a string.

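// Fetch the HTML of a page
public function getPageHtml($url) {
    // Cookie fields: copy them from the Cookie header of a normal browser visit; separate fields with ";"
    $cookie = "key=M2M7t4k0;";

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);

    // Send the cookie fields with the request
    curl_setopt($ch, CURLOPT_COOKIE, $cookie);

    // Note: these two lines turn off TLS certificate verification
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    // Leave the response headers out of the returned body
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Sends the request as POST; remove this line if the target page expects GET
    curl_setopt($ch, CURLOPT_POST, 1);
    // Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    $html = curl_exec($ch);
    curl_close($ch);

    // curl_exec returns false on failure, so callers should check the result
    return $html;
}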
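
When several cookie fields are needed, the string passed to CURLOPT_COOKIE can be assembled from an array rather than written by hand. The sketch below is only an illustration: buildCookieHeader() and the uid field are hypothetical and do not appear in the original code.

```php
<?php
/**
 * Build a Cookie header value from name => value pairs.
 * buildCookieHeader() is a hypothetical helper used for illustration.
 */
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    // Fields are joined with a semicolon followed by a space
    return implode('; ', $pairs);
}

// 'uid' is an invented second field, shown only to demonstrate the separator
$cookie = buildCookieHeader(['key' => 'M2M7t4k0', 'uid' => '42']);
// $cookie can now be passed to curl_setopt($ch, CURLOPT_COOKIE, $cookie)
```

Keeping the fields in an array makes it easier to update a single value, for example after logging in again.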

Once we have the HTML, the load method of simple_html_dom parses it into an HTML DOM model.

$dom = new simple_html_dom();
$dom->load($html);

$dom now holds the parsed HTML DOM model. Here is an example.

Example: scrape the Baidu hot-search list shown below the search box on www.baidu.com

$html = $this->getPageHtml('https://www.baidu.com');
$dom = new simple_html_dom();
$dom->load($html);
// find works like jQuery's $('.title-content'). Passing 0 as the second argument returns the first match; omitting it returns all matches.
$dom->find('.title-content', 0);
// This gives us one HTML DOM node: an a tag with class title-content that has four child tags.
// What if we now want the .title-content-title inside .title-content? There are a few ways.
// Chained
$dom->find('.title-content', 0)->find('.title-content-title');
// Direct selector
$dom->find('.title-content-title');

// find supports other selectors as well
$dom->find('span');
$dom->find('span[class=title-content-title]');

// When find returns a single node, read its text with ->plaintext
$dom->find('.title-content-title',0)->plaintext;
// The same goes for an image's src and a link's href
$dom->find('img',0)->src;
$dom->find('a',0)->href;

// When find returns an array, loop over it
$arr = $dom->find('.title-content-title');
foreach ($arr as $val) {
    echo $val->plaintext;
}

// Always free the memory when you are done
$dom->clear();
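
For comparison, the same kind of lookup can be sketched without simple_html_dom, using PHP's built-in DOMDocument and DOMXPath classes. This is a swapped-in technique, not the approach used in this article, and the markup in $html is invented sample data rather than Baidu's real page.

```php
<?php
// Invented sample markup standing in for a fetched page
$html = '<ul>'
      . '<li><a class="title-content" href="/s?wd=one"><span class="title-content-title">First topic</span></a></li>'
      . '<li><a class="title-content" href="/s?wd=two"><span class="title-content-title">Second topic</span></a></li>'
      . '</ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML often triggers parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// XPath has no class selector, so match the class name as a whole word
$linkQuery = '//a[contains(concat(" ", normalize-space(@class), " "), " title-content ")]';
$spanQuery = './/span[contains(concat(" ", normalize-space(@class), " "), " title-content-title ")]';

$titles = [];
foreach ($xpath->query($linkQuery) as $a) {
    // Look for the title span inside the current link only
    $span = $xpath->query($spanQuery, $a)->item(0);
    if ($span === null) {
        continue;
    }
    $titles[] = [
        'text' => trim($span->textContent),
        'href' => $a->getAttribute('href'),
    ];
}
```

The XPath expressions are more verbose than find('.title-content'), which is the convenience simple_html_dom provides.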

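Warning: never pass a node to dump, or the script will crash, because dump tries to format the array or object it is given. To inspect the structure, use var_dump instead.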

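About redirects

There are two ways to jump to the next page, suited to different situations: a JavaScript redirect, and a redirect through PHP's built-in header function.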

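// A JavaScript redirect works when the page is opened directly in a browser; you can clearly see what each page outputs
echo '<script>window.location.href="URL of the next page";</script>';
// With a header redirect, testing showed that the code runs but no output is shown, so it suits scheduled tasks.
// The space after "Location:" should be kept.
header("Location: URL of the next page");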

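In testing, a scheduled task that called the collection endpoint with curl did not follow the redirects: it collected the first page and stopped. The curl documentation explains why: curl only follows redirects when given -L, and --max-redirs sets the maximum number of redirects it will follow (50 by default). The 5000 below is that maximum; change it to suit your needs.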

curl -L --max-redirs 5000 "https://www.xxx.com/api/autoCollection?page=1&fornum=1"
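
On the PHP side, the redirect chain needs a stopping point, otherwise curl keeps following until it reaches --max-redirs. A minimal sketch, assuming the endpoint knows the last page number: nextPageUrl() and $lastPage are hypothetical names, and only the page parameter from the URL above is modelled.

```php
<?php
/**
 * Return the URL of the next page to collect, or null when there is none.
 * nextPageUrl() and $lastPage are hypothetical, for illustration only.
 */
function nextPageUrl(string $endpoint, int $currentPage, int $lastPage): ?string
{
    if ($currentPage >= $lastPage) {
        return null; // nothing left to collect, so no redirect is sent
    }
    return $endpoint . '?' . http_build_query(['page' => $currentPage + 1]);
}

$next = nextPageUrl('https://www.xxx.com/api/autoCollection', 1, 3);
if ($next !== null) {
    // In the real endpoint the redirect would be sent here, for example:
    // header('Location: ' . $next);
    // exit;
}
```

Returning null on the last page lets the final request finish with a normal response, which ends the curl run.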

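1. When scraping, you may need to strip certain characters. For example, if the site uses relative paths, remove the leading ./ and prepend the site's domain.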

$website_url = 'https://www.baidu.com/';
$link = './manhua/view/91367';
$new_link = $website_url . str_replace('./','',$link);

// $new_link is now https://www.baidu.com/manhua/view/91367

// To be safe, wrap $new_link in trim() to remove whitespace from both ends.
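
The same idea can be wrapped in a small helper that strips only a leading ./ and avoids doubled slashes. This is a sketch, not part of the original code: resolveLink() is a hypothetical name, and it assumes every link is relative to the site root.

```php
<?php
/**
 * Join a site root and a relative link.
 * resolveLink() is a hypothetical helper; it assumes $baseUrl is the site root.
 */
function resolveLink(string $baseUrl, string $link): string
{
    $link = trim($link);
    // Remove "./" only when it is at the start of the link
    if (strpos($link, './') === 0) {
        $link = substr($link, 2);
    }
    // Normalise the slash between the two parts
    return rtrim($baseUrl, '/') . '/' . ltrim($link, '/');
}

$newLink = resolveLink('https://www.baidu.com/', ' ./manhua/view/91367 ');
```

Unlike a plain str_replace, this leaves any ./ that appears later in the path untouched.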

2. Sometimes two pieces of information share one span tag, for example "VIP Chapter 3", and we need to tell whether the chapter is a VIP chapter.

if (strpos('VIP Chapter 3','VIP') !== false) {
    // It is a VIP chapter
    // After the check, str_replace can remove the "VIP" marker and trim can remove the leftover whitespace
} else {
    // It is not a VIP chapter
}
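
Put together, the check and the cleanup might look like the sketch below. parseChapterLabel() and the array keys are hypothetical names chosen for illustration, and the labels are sample strings.

```php
<?php
/**
 * Split a chapter label into a VIP flag and a clean title.
 * parseChapterLabel() is a hypothetical helper, for illustration only.
 */
function parseChapterLabel(string $label): array
{
    $isVip = strpos($label, 'VIP') !== false;
    // Remove the marker only when it is present, then tidy the whitespace
    $title = $isVip ? str_replace('VIP', '', $label) : $label;

    return ['is_vip' => $isVip, 'title' => trim($title)];
}

$chapter = parseChapterLabel('VIP Chapter 3');
$free    = parseChapterLabel('Chapter 4');
```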
curl php html dom scraping simple_html_dom string-handling
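  16. Great article! You explained in detail how to use simple_html_dom in PHP. Have you also used Guzzle to manage HTTP requests?

    1. SK (author)  08-13
      @IT Flashcards

      Hello, do you mean this request library, [https://github.com/guzzle/guzzle]? Perhaps we could compare notes.

      1. @SK

        Yes, that is the library. I find it very easy to use, and it makes managing HTTP requests in PHP especially convenient. If you have not tried it yet, I strongly recommend it!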
