Matching the Content of All HTML Tags with Regular Expressions | 极客笔记

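In web development we often need to extract specific content from an HTML document, such as the tags themselves or the text inside them. For simple, predictable markup, regular expressions can do this. This article walks through several patterns, using Python's re module, for matching HTML tags and the content between them.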

Example 1: Matching all HTML tags

import re

html = "<div>Hello, <span>world</span></div>"
pattern = re.compile(r'<.*?>')
result = pattern.findall(html)

for tag in result:
    print(tag)

Output:

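<div>
<span>
</span>
</div>

In this example, the regular expression <.*?> matches every HTML tag. It matches a <, then any character except a newline zero or more times, then a >. The ? after * makes the match non-greedy, so it stops at the first > instead of the last one. The result is therefore <div>, <span>, </span>, and </div>.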
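To see why the ? matters, the short sketch below runs the greedy and non-greedy forms of the pattern on the same string. The variable names greedy and lazy are chosen here for illustration only.

```python
import re

html = "<div>Hello, <span>world</span></div>"

# Greedy: .* runs to the LAST '>' in the string, so the whole string is one match.
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? stops at the FIRST '>' after each '<'.
lazy = re.findall(r'<.*?>', html)

print(greedy)  # ['<div>Hello, <span>world</span></div>']
print(lazy)    # ['<div>', '<span>', '</span>', '</div>']
```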

Example 2: Matching an HTML tag with an attribute

import re

html = '<a href="https://www.deepinout.com">Deepinout</a>'
pattern = re.compile(r'<.*?>')
result = pattern.findall(html)

for tag in result:
    print(tag)

Output:

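<a href="https://www.deepinout.com">
</a>

This example uses the same regular expression, <.*?>, but this time the tag carries an attribute, href="https://www.deepinout.com". The pattern still matches both <a href="https://www.deepinout.com"> and </a>, because .*? accepts any characters between < and the next >.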
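One case where the simple pattern can go wrong is an attribute value that itself contains a >. The sketch below compares <.*?> with a quote-aware variant that treats quoted values as single units. The sample string and the names naive and quote_aware are assumptions made for this illustration, not part of the original example.

```python
import re

# Hypothetical input: the title attribute contains a literal '>'.
html = '<a title="a > b" href="x">link</a>'

# The simple pattern stops at the first '>', which is inside the attribute value.
naive = re.findall(r'<.*?>', html)

# Quote-aware: between '<' and '>', accept ordinary characters or a whole
# double-quoted / single-quoted value, so a '>' inside quotes is skipped.
quote_aware = re.findall(r'''<(?:[^>"']|"[^"]*"|'[^']*')*>''', html)

print(naive)        # ['<a title="a >', '</a>']
print(quote_aware)  # ['<a title="a > b" href="x">', '</a>']
```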

Example 3: Matching an HTML tag with multiple attributes

import re

html = '<img src="image.jpg" alt="Image" width="100" height="100">'
pattern = re.compile(r'<.*?>')
result = pattern.findall(html)

for tag in result:
    print(tag)

Output:

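<img src="image.jpg" alt="Image" width="100" height="100">

This example shows a tag with several attributes. The regular expression <.*?> matches the whole tag, <img src="image.jpg" alt="Image" width="100" height="100">, as a single result.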
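Once a tag has been matched, its attributes can be pulled apart with a second pattern. The following is a minimal sketch that assumes every attribute has the form name="value" with double quotes; unquoted or single-quoted values would need a broader pattern.

```python
import re

tag = '<img src="image.jpg" alt="Image" width="100" height="100">'

# Each match is a (name, value) pair; dict() turns the list of pairs into a mapping.
attrs = dict(re.findall(r'([\w-]+)="([^"]*)"', tag))

print(attrs)
# {'src': 'image.jpg', 'alt': 'Image', 'width': '100', 'height': '100'}
```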

Example 4: Matching nested HTML tags

import re

html = '<div><p>Hello, <strong>world</strong></p></div>'
pattern = re.compile(r'<.*?>')
result = pattern.findall(html)

for tag in result:
    print(tag)

Output:

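<div>
<p>
<strong>
</strong>
</p>
</div>

This example shows nested HTML tags. The regular expression <.*?> matches each tag in document order: <div>, <p>, <strong>, </strong>, </p>, and </div>. The pattern finds tags one at a time; it does not track which closing tag belongs to which opening tag.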
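Because the pattern only lists tags, any nesting logic has to be added on top. The sketch below uses a stack to check that tags close in the reverse order they opened. The helper name is_well_nested is hypothetical, and the code assumes the input contains no void or self-closing elements such as <br> or <img>.

```python
import re

def is_well_nested(html):
    """Return True if every opening tag is closed in reverse order of opening."""
    stack = []
    # Group 1 is '/' for a closing tag and '' otherwise; group 2 is the tag name.
    for closing, name in re.findall(r'<(/?)(\w+)[^>]*>', html):
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack

print(is_well_nested('<div><p>Hello, <strong>world</strong></p></div>'))  # True
print(is_well_nested('<div><p>Hello</div></p>'))                          # False
```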

Example 5: Matching a self-closing HTML tag

import re

html = '<input type="text" value="Deepinout">'
pattern = re.compile(r'<.*?>')
result = pattern.findall(html)

for tag in result:
    print(tag)

Output:

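<input type="text" value="Deepinout">

This example shows a tag that has no closing counterpart. (In HTML, <input> is a void element; the XHTML-style self-closing form would end in />.) The regular expression <.*?> matches <input type="text" value="Deepinout"> in either form.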
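If the kind of tag matters, the matches can be classified afterwards. The following is a rough sketch: the helper classify and the sample string are assumptions for illustration, and the code expects every match to be an ordinary element tag (not a comment or doctype).

```python
import re

# Elements that HTML defines as void: they never take a closing tag.
VOID = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
        'link', 'meta', 'source', 'track', 'wbr'}

def classify(tag):
    """Label a matched tag as 'opening', 'closing', or 'void'."""
    name = re.match(r'</?\s*([a-zA-Z][\w-]*)', tag).group(1).lower()
    if tag.startswith('</'):
        return 'closing'
    if tag.endswith('/>') or name in VOID:
        return 'void'
    return 'opening'

html = '<p>Name: <input type="text" value="Deepinout"><br/></p>'
kinds = [(tag, classify(tag)) for tag in re.findall(r'<.*?>', html)]

for tag, kind in kinds:
    print(kind, tag)
```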

Example 6: Matching HTML tags that contain text

import re

html = '<p>Hello, <strong>world</strong></p>'
pattern = re.compile(r'<.*?>(.*?)</.*?>')
result = pattern.findall(html)

for tag_content in result:
    print(tag_content)

Output:

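Hello, <strong>world

In this example, the regular expression <.*?>(.*?)</.*?> is meant to capture the content between an opening and a closing tag. The group (.*?) matches any characters zero or more times, non-greedily. Because the pattern does not tie the closing tag to the opening tag's name, the match starts at <p> and ends at the first closing tag it meets, </strong>. findall therefore returns a single item, Hello, <strong>world, rather than Hello, and world as separate results.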
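One way to tie the closing tag to the opening tag is a backreference (\1). The sketch below compares the pattern above with two backreference variants; the names paired and leaves are illustrative. Elements nested inside an element of the same name (for example a div inside a div) would still be paired incorrectly.

```python
import re

html = '<p>Hello, <strong>world</strong></p>'

# Pattern from the example: the closing tag is not tied to the opening tag.
original = re.findall(r'<.*?>(.*?)</.*?>', html)

# \1 requires the closing tag to repeat the name captured by (\w+).
paired = re.findall(r'<(\w+)[^>]*>(.*?)</\1>', html)

# [^<]* forbids child tags, so only innermost (leaf) elements match.
leaves = re.findall(r'<(\w+)[^>]*>([^<]*)</\1>', html)

print(original)  # ['Hello, <strong>world']
print(paired)    # [('p', 'Hello, <strong>world</strong>')]
print(leaves)    # [('strong', 'world')]
```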

Example 7: Matching HTML tags across multiple lines

import re

html = '''
<div>
    <p>Hello, <strong>world</strong></p>
</div>
'''
pattern = re.compile(r'<.*?>', re.S)
result = pattern.findall(html)

for tag in result:
    print(tag)

Output:

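<div>
<p>
<strong>
</strong>
</p>
</div>

This example handles HTML that spans several lines. The re.S flag (also spelled re.DOTALL) makes . match newline characters as well, so <.*?> can match a tag whose attributes are split across lines. In this sample every tag sits on a single line, so the result is the same with or without the flag: <div>, <p>, <strong>, </strong>, </p>, and </div>.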
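The effect of re.S only shows up when a single tag contains a line break. The sketch below uses a hypothetical snippet in which the opening tag is split over two lines, and also shows a negated character class, [^>]*, which matches newlines without needing any flag.

```python
import re

# Hypothetical input: the opening <div ...> tag is split across two lines.
html = '<div\n    class="box">text</div>'

# Without re.S, '.' cannot cross the newline, so the opening tag is missed.
without_flag = re.findall(r'<.*?>', html)

# With re.S, '.' also matches '\n'.
with_flag = re.findall(r'<.*?>', html, re.S)

# [^>] already includes newlines, so no flag is needed.
negated = re.findall(r'<[^>]*>', html)

print(without_flag)  # ['</div>']
print(with_flag)     # ['<div\n    class="box">', '</div>']
print(negated == with_flag)  # True
```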

Example 8: Matching a specific HTML tag

import re

html = '<p>Hello, <strong>world</strong></p>'
pattern = re.compile(r'<p>(.*?)</p>')
result = pattern.findall(html)

for tag_content in result:
    print(tag_content)

Output:

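Hello, <strong>world</strong>

In this example, the regular expression <p>(.*?)</p> targets one specific tag, <p>, and captures everything between it and the next </p>. The result is Hello, <strong>world</strong>. Note that the literal <p> only matches a p tag written without attributes.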
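To allow attributes on the target tag without also matching tags such as <pre>, the tag name can be followed by a word boundary and [^>]*. The helper tag_contents and the sample string below are assumptions for illustration; the sketch does not handle an element nested inside another element of the same name.

```python
import re

def tag_contents(html, tag):
    """Return the inner content of every <tag ...>...</tag> pair (non-nested)."""
    name = re.escape(tag)
    # \b stops 'p' from matching the start of 'pre'; [^>]* allows attributes.
    pattern = re.compile(rf'<{name}\b[^>]*>(.*?)</{name}>', re.S | re.I)
    return pattern.findall(html)

html = '<p class="intro">First</p><pre>code</pre><p>Second</p>'

print(re.findall(r'<p>(.*?)</p>', html))  # ['Second']
print(tag_contents(html, 'p'))            # ['First', 'Second']
```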

Example 9: Matching all text content

import re

html = '<p>Hello, <strong>world</strong></p>'
pattern = re.compile(r'>(.*?)<')
result = pattern.findall(html)

for text in result:
    print(text)

Output:

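Hello, 
world

This example uses the regular expression >(.*?)< to capture the text that sits between a > and the next <. The group (.*?) matches any characters zero or more times, non-greedily. The results are Hello, (with a trailing space), world, and an empty string, which prints as a blank line. The empty string comes from the >< between </strong> and </p>, where there is no text at all.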
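If only non-empty text is wanted, the capture can require at least one character. The sketch below compares the two forms and then strips surrounding whitespace; the variable names are illustrative.

```python
import re

html = '<p>Hello, <strong>world</strong></p>'

# (.*?) may match nothing, so adjacent tags produce an empty string.
with_empty = re.findall(r'>(.*?)<', html)

# [^<]+ needs at least one character that is not '<'.
non_empty = re.findall(r'>([^<]+)<', html)

# Optionally trim whitespace and drop chunks that were whitespace only.
cleaned = [text.strip() for text in non_empty if text.strip()]

print(with_empty)  # ['Hello, ', 'world', '']
print(non_empty)   # ['Hello, ', 'world']
print(cleaned)     # ['Hello,', 'world']
```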

Example 10: Matching all HTML tags and text content

import re

html = '<p>Hello, <strong>world</strong></p>'
pattern = re.compile(r'<.*?>(.*?)</.*?>')
result = pattern.findall(html)

for tag_content in result:
    print(tag_content)

Output:

Hello, <strong>world
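The pattern <.*?>(.*?)</.*?> yields a single captured string here, not the individual tags and text pieces. One possible way to get both, sketched below as an assumption and not as the article's own method, is an alternation that matches either a tag or a run of text, which splits the input into tokens in document order.

```python
import re

html = '<p>Hello, <strong>world</strong></p>'

# Either a complete tag (<...>) or a run of characters containing no '<'.
tokens = re.findall(r'<[^>]+>|[^<]+', html)

# Separate the two kinds of token afterwards.
tags = [t for t in tokens if t.startswith('<')]
text = ''.join(t for t in tokens if not t.startswith('<'))

print(tokens)  # ['<p>', 'Hello, ', '<strong>', 'world', '</strong>', '</p>']
print(tags)    # ['<p>', '<strong>', '</strong>', '</p>']
print(text)    # Hello, world
```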