Python HTML Parser BeautifulSoup: Usage Explained with Examples [Crawler Parser]


Source: 中文源码网 | Views: 154 | Date: 2024-05-09 03:58:14

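This article walks through the usage of BeautifulSoup, a Python HTML parser, with examples for reference.
BeautifulSoup Introduction
Python ships with a capable built-in HTML parser module, HTMLParser (html.parser in Python 3). BeautifulSoup is a third-party library for parsing HTML or XML that offers considerably more. In short, its main job is extracting data from web pages. The sections below go through its core features.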
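BeautifulSoup Installation
BeautifulSoup 3 is no longer developed, so new projects should use BeautifulSoup 4. It lives in the bs4 package, which means it is imported from bs4. It can be installed with pip or easy_install; pip is used below.
pip install beautifulsoup4
pip install lxml
Installing the lxml module as well is recommended. BeautifulSoup supports the HTML parser in the Python standard library (HTMLParser) and several third-party parsers. If no parser is named and lxml is not installed, BeautifulSoup falls back to Python's default parser. The lxml parser is more capable and faster.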
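Creating the Object
After installation, import the class and create the object from the HTML text:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')  # html_doc: a string holding the HTML to parse
Formatted output:
print(soup.prettify())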
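As a minimal sketch of the two steps above, the snippet below parses a short string and pretty-prints it. The markup is a hypothetical stand-in, and "html.parser" is used only so the example runs without lxml; pass "lxml" instead if it is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# The markup goes first, the parser name second.
soup = BeautifulSoup(markup, "html.parser")

# prettify() returns a string with one tag per line, indented by depth.
pretty = soup.prettify()
print(pretty)
```

Note that prettify() returns a string rather than printing it, so it has to be passed to print() or written to a file.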
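The Four BeautifulSoup Object Types
BeautifulSoup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into 4 types:
Tag (a tag)
NavigableString (text content)
BeautifulSoup (the document)
Comment (a comment)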
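1. Tag
A Tag is a complete HTML tag. For example, to get the title tag:
print(soup.title)
# <title>The Dormouse's story</title>
A Tag has two important attributes: name and attrs.
name
The name of the HTML tag:
print(soup.name)
# [document]
print(soup.head.name)
# head
attrs
The tag's HTML attributes, as a dictionary:
print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# (this output assumes the first <p> also carries name="dromouse")
To get a single attribute:
print(soup.p['class'])
# ['title']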
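The attribute access described above can be sketched on a one-line document. The markup is made up for illustration; the point is that class is a multi-valued attribute and comes back as a list, while other attributes come back as strings.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document for illustration.
soup = BeautifulSoup('<p class="title big" id="intro">Hi</p>', "html.parser")
tag = soup.p

tag_name = tag.name        # the tag name as a string
classes = tag["class"]     # multi-valued attribute: returned as a list
tag_id = tag["id"]         # single-valued attribute: returned as a string
missing = tag.get("lang")  # .get() returns None instead of raising KeyError

print(tag_name, classes, tag_id, missing)
```

Indexing with tag["lang"] would raise KeyError here, so .get() is the safer choice when an attribute may be absent.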
2. NavigableString
Once we have the whole tag, how do we get the text inside it? Use .string:
print(soup.p.string)
# The Dormouse's story
3. BeautifulSoup
The BeautifulSoup object represents the entire content of a document:
print(soup.name)
# [document]
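4. Comment
The content of an HTML comment, without the comment markers. Check whether the string is a Comment before doing anything else with it, such as printing:
import bs4
if type(soup.a.string) == bs4.element.Comment:
    print(soup.a.string)
# prints only the comment text, without the <!-- and --> markers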
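A small sketch of that check, assuming a hypothetical document in which the first link holds a comment and the second holds ordinary text. It uses isinstance() rather than comparing types directly, which is the more idiomatic test.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: the first <a> contains a comment, the second plain text.
markup = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(markup, "html.parser")

results = []
for a in soup.find_all("a"):
    # .string is a Comment for the first link and a NavigableString for the second.
    is_comment = isinstance(a.string, Comment)
    results.append((a["id"], is_comment, str(a.string)))

print(results)
```

Skipping this check is a common source of scraped data that silently includes commented-out text.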
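Traversing the Document Tree
1. Child nodes
contents
Returns all direct children as a list:
print(soup.head.contents)
# [<title>The Dormouse's story</title>]
children
Returns an iterator over the direct children rather than a list:
print(soup.head.children)
# prints an iterator object, so it has to be looped over
for child in soup.body.children:
    print(child)
# Result (whitespace-only text nodes print as blank lines)
#
# <p class="title"><b>The Dormouse's story</b></p>
#
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
#
# <p class="story">...</p>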
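The blank lines in the result above come from whitespace text nodes between tags. As a sketch under the assumption of a small made-up document, the snippet below shows .contents including those nodes and one way to keep only the tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with line breaks between the tags.
soup = BeautifulSoup("<body>\n<p>one</p>\n<p>two</p>\n</body>", "html.parser")
body = soup.body

# .contents is a real list and includes the newline text nodes.
all_children = body.contents

# .children is an iterator; filter on Tag to drop the text nodes.
tags_only = [child for child in body.children if isinstance(child, Tag)]

print(len(all_children), [t.get_text() for t in tags_only])
```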

2. Node content
string
Returns a single piece of text. If a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one child tag, .string returns the text of that innermost tag. If a tag contains several children, it is ambiguous which child .string should refer to, so the result is None. For example:
print(soup.head.string)
print(soup.title.string)
# The Dormouse's story
# The Dormouse's story
print(soup.html.string)
# None
strings
Yields every piece of text under the node, including blank lines and spaces.
stripped_strings
Yields every piece of text with surrounding whitespace removed, skipping whitespace-only strings:
for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
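To make the difference between the three accessors concrete, here is a short sketch on a hypothetical fragment with whitespace around its text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with padding and line breaks around the text.
soup = BeautifulSoup("<div>\n <p> Hello </p>\n <p>World</p>\n</div>", "html.parser")
div = soup.div

single = div.string                 # None: <div> has more than one child
raw = list(div.strings)             # every text node, whitespace included
clean = list(div.stripped_strings)  # trimmed, whitespace-only nodes skipped

print(single, raw, clean)
```

For scraping, stripped_strings is usually the more convenient of the two generators, since the raw text nodes mostly reflect source formatting.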
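get_text()
Returns the text of the current node and all of its descendants.
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(markup=html_doc, features='lxml')
node_p_text = soup.find('p', class_='story').get_text()  # note the trailing underscore in class_
print(node_p_text)
# Result
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.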
3. Parent nodes
parent
Returns the direct parent of a node:
p = soup.p
print(p.parent.name)
# body
parents
Iterates over all ancestors of a node, from the direct parent upward:
content = soup.head.title.string
for parent in content.parents:
    print(parent.name)
# Result
# title
# head
# html
# [document]
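The same walk up the tree can be sketched on a small nested fragment. The markup is hypothetical; the last entry in the chain is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup for illustration.
markup = "<html><body><div><p><b>x</b></p></div></body></html>"
soup = BeautifulSoup(markup, "html.parser")
b = soup.b

direct_parent = b.parent.name                      # only the immediate parent
ancestor_chain = [node.name for node in b.parents]  # every ancestor, innermost first

print(direct_parent, ancestor_chain)
```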
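4. Sibling nodes
next_sibling
The next_sibling attribute returns the node's next sibling. The result is often a string of whitespace, because whitespace and line breaks between tags count as nodes too.
previous_sibling
The previous_sibling attribute returns the node's previous sibling.
print(soup.p.next_sibling)
# (blank: this node is whitespace)
print(soup.p.previous_sibling)
# also whitespace in the sample document (the line break after <body>); None is returned only when there is no previous sibling
print(soup.p.next_sibling.next_sibling)
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# The sibling after the whitespace node is the next visible tag
next_siblings, previous_siblings
Iterate over all following or all preceding siblings.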
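Because whitespace counts as a sibling, code that wants "the next tag" usually has to filter. The sketch below assumes a hypothetical list with line breaks between items and keeps only Tag siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical list markup with a line break after every tag.
soup = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n<li>c</li>\n</ul>", "html.parser")
first = soup.li

raw_next = first.next_sibling      # the newline text node, not the next <li>
raw_prev = first.previous_sibling  # the newline right after <ul>

# Keep only the following siblings that are tags.
following_tags = [s for s in first.next_siblings if isinstance(s, Tag)]
following_text = [s.get_text() for s in following_tags]

print(repr(raw_next), repr(raw_prev), following_text)
```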
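5. Next and previous elements
next_element, previous_element
These are not limited to siblings. They return the node parsed immediately after or before the current one, regardless of its level in the tree.
next_elements, previous_elements
Iterate over all following or all preceding nodes in parse order.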
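The difference from sibling navigation is easiest to see side by side. In this sketch (hypothetical markup), next_sibling of <b> is the <i> tag, while next_element is the string inside <b>, because that string was parsed first.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: two inline tags inside one paragraph.
soup = BeautifulSoup("<p><b>bold</b><i>italic</i></p>", "html.parser")
b = soup.b

sibling = b.next_sibling  # same level: the <i> tag
element = b.next_element  # parse order: the text inside <b>

# Walk everything parsed after <p>, recording tag names and text.
parse_order = [
    node.name if isinstance(node, Tag) else str(node)
    for node in soup.p.next_elements
]

print(sibling.name, repr(element), parse_order)
```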
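Searching the Document Tree
1. find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
find_all() searches all tag descendants of the current tag and returns those that match the filter.
Parameters
name
The name parameter accepts several kinds of filter and finds every tag whose name matches. String nodes are ignored automatically.
(a) A tag name
The simplest filter is a tag name. Pass one to a search method and BeautifulSoup returns the tags whose name matches it exactly. The example below finds all <a> tags in the document:
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
The elements of the returned list are Tag objects, so they can be navigated and searched in turn.
(b) A regular expression
If a compiled regular expression is passed, BeautifulSoup filters tag names with its search() method. The example below finds every tag whose name starts with b, so both <body> and <b> are found:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
(c) A list
If a list is passed, BeautifulSoup returns everything that matches any item in the list. The code below finds all <a> tags and <b> tags in the document:
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]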
(d) True
True matches every tag, but string nodes are not returned:
for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
(e) A function
If none of the filters fit, define a function that takes a single element as its argument. It should return True when the element matches and False otherwise:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
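The filter kinds for the name parameter can be compared on one small document. This is a sketch with hypothetical markup; has_class_but_no_id is the function from the text above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one <b>, one <a> with an id, one <p> without an id.
markup = '<body><b>x</b><a id="k" class="c">y</a><p class="c">z</p></body>'
soup = BeautifulSoup(markup, "html.parser")

def has_class_but_no_id(tag):
    # Called once per tag; return True to keep the tag.
    return tag.has_attr("class") and not tag.has_attr("id")

by_name = [t.name for t in soup.find_all("a")]                # exact tag name
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]  # names starting with b
by_list = [t.name for t in soup.find_all(["a", "b"])]         # any name in the list
by_true = [t.name for t in soup.find_all(True)]               # every tag
by_func = [t.name for t in soup.find_all(has_class_but_no_id)]

print(by_name, by_regex, by_list, by_true, by_func)
```

Results always come back in document order, which is why the list filter returns b before a here regardless of the order in the list.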
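Keyword arguments
Any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name. For example, passing an argument named id makes BeautifulSoup filter on each tag's "id" attribute:
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Passing href makes BeautifulSoup filter on each tag's "href" attribute:
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Several keyword arguments can be combined to filter on several attributes at once:
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
To filter on class, which is a reserved word in Python, add a trailing underscore:
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]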
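attrs
Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* custom attributes:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression (the exact wording depends on the Python version)
# Such attributes can still be searched by passing a dictionary to the attrs parameter of find_all()
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]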
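A runnable sketch of the attrs workaround, on a hypothetical pair of div tags. It also shows that True as a dictionary value matches any tag that has the attribute at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two different data-foo values.
markup = '<div data-foo="value">foo!</div><div data-foo="other">bar</div>'
soup = BeautifulSoup(markup, "html.parser")

# Exact match on an attribute whose name is not a valid Python identifier.
exact = soup.find_all(attrs={"data-foo": "value"})
exact_text = [d.get_text() for d in exact]

# True matches every tag that carries the attribute, whatever its value.
any_value = soup.find_all(attrs={"data-foo": True})
any_text = [d.get_text() for d in any_value]

print(exact_text, any_text)
```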
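text
The text parameter searches the string content of the document instead of tags. Like name, it accepts a string, a regular expression, a list, or True. Beautiful Soup 4.4.0 and later also accept this argument under the name string.
soup.find_all(text="Elsie")
# ['Elsie']
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']
soup.find_all(text=re.compile("Dormouse"))  # partial match
# ["The Dormouse's story", "The Dormouse's story"]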
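limit
find_all() returns every match, which can be slow on a large document tree. If not all results are needed, the limit parameter caps the number returned. It works like the LIMIT keyword in SQL: once the number of matches reaches the limit, the search stops and the results are returned.
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]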
recursive
When find_all() is called on a tag, BeautifulSoup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []
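The effect of recursive and limit can be sketched together on a hypothetical fragment with one direct and one nested paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> directly under <div>, one nested deeper.
markup = "<div><p>direct</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

all_p = [p.get_text() for p in div.find_all("p")]                      # all descendants
direct_p = [p.get_text() for p in div.find_all("p", recursive=False)]  # children only
first_only = [p.get_text() for p in div.find_all("p", limit=1)]        # stop after one match

print(all_p, direct_p, first_only)
```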
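2. find(name, attrs, recursive, text, **kwargs)
The only difference from find_all() is the return value: find_all() returns a list of every match (an empty list if nothing matches), while find() returns the first match directly, or None if nothing matches.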
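The return-value difference matters most when nothing matches. A brief sketch, using a hypothetical two-item list:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")

first = soup.find("li")                # a single Tag
as_list = soup.find_all("li", limit=1)  # a list holding that same Tag
missing = soup.find("table")           # None when nothing matches
missing_all = soup.find_all("table")   # empty list when nothing matches

# Guard against None before chaining attribute access on a find() result.
table_text = missing.get_text() if missing is not None else ""

print(first.get_text(), len(as_list), missing, len(missing_all))
```

Chaining directly on a find() result, as in soup.find("table").get_text(), raises AttributeError when the tag is absent, hence the guard.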
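3. find_parents() and find_parent()
find_all() and find() only search downward, through the current node's children, grandchildren, and so on. find_parents() and find_parent() search the current node's ancestors instead. They take the same filters as the ordinary tag search methods.
4. find_next_siblings() and find_next_sibling()
These two methods iterate over the siblings parsed after the current tag, using .next_siblings. find_next_siblings() returns all following siblings that match, and find_next_sibling() returns only the first one.
5. find_previous_siblings() and find_previous_sibling()
These two methods iterate over the siblings parsed before the current tag, using .previous_siblings. find_previous_siblings() returns all preceding siblings that match, and find_previous_sibling() returns the first one.
6. find_all_next() and find_next()
These two methods iterate over the tags and strings that come after the current tag, using .next_elements. find_all_next() returns all matching nodes, and find_next() returns the first one.
7. find_all_previous() and find_previous()
These two methods iterate over the tags and strings that come before the current node, using .previous_elements. find_all_previous() returns all matching nodes, and find_previous() returns the first one.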
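The directional search methods in sections 3 to 7 can be sketched from a single starting tag. The markup below is hypothetical and chosen so that every direction has something to find.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a heading, two paragraphs and a span inside one div.
markup = '<div id="box"><h2>Title</h2><p>one</p><p>two</p><span>end</span></div>'
soup = BeautifulSoup(markup, "html.parser")
start = soup.find("p")  # the first <p>, used as the starting point

parent_id = start.find_parent("div")["id"]                    # upward
next_p = start.find_next_sibling("p").get_text()              # later sibling
prev_h2 = start.find_previous_sibling("h2").get_text()        # earlier sibling
later_tags = [t.name for t in start.find_all_next(True)]      # every tag parsed afterwards
earlier_h2 = start.find_previous("h2").get_text()             # first match parsed before

print(parent_id, next_p, prev_h2, later_tags, earlier_h2)
```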
CSS Selectors
In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used to pick elements here through soup.select(), which returns a list.
Selecting by tag name
print(soup.select('title'))
# [<title>The Dormouse's story</title>]
print(soup.select('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('b'))
# [<b>The Dormouse's story</b>]
Selecting by class name
print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Selecting by id
print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Combined selectors
Combined selectors follow the same rules as in a CSS file, where tag names, class names, and ids are combined. For example, to find the element with id link1 inside a p tag, separate the two parts with a space:
print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Selecting direct children:
print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]
Selecting by attribute
A selector can also include attributes, written in square brackets. The attribute belongs to the same node as the tag, so there must be no space between them, or nothing will match.
print(soup.select('a[class="sister"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Attribute selectors can be combined with the other forms as well: parts that refer to different nodes are separated by a space, and parts that refer to the same node are not:
print(soup.select('p a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
All of the select() calls above return lists, so the results can be looped over and their content read with .string or get_text():
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())
for title in soup.select('title'):
    print(title.get_text())
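The selector forms above can be tried together in one short sketch. The markup is a trimmed, hypothetical version of the sample document. select_one() is not covered in the text above; it is a companion method in bs4 that returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the sample document.
markup = (
    '<div id="main"><p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p></div>'
)
soup = BeautifulSoup(markup, "html.parser")

by_class = [a["id"] for a in soup.select(".sister")]   # class selector
by_id = soup.select("#link2")[0].get_text()            # id selector
children = [a["id"] for a in soup.select("p > a")]     # direct children of <p>
by_attr = [a.get_text() for a in soup.select('a[href="http://example.com/elsie"]')]

first = soup.select_one("a.sister")  # first match, or None
nothing = soup.select_one("table")   # no <table> in this markup

print(by_class, by_id, children, by_attr, first.get_text(), nothing)
```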
Hopefully this article is of some help with your Python programming.
