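Matching HTML Tags with Regular Expressions
Web development often involves processing and matching HTML tags. Regular expressions are a quick way to match tags in small, predictable snippets of HTML, although they cannot reliably handle arbitrary markup, where an HTML parser is the safer choice. This article shows how to match HTML tags with regular expressions in Python and provides example code.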
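Matching HTML Tags
Example 1: Matching HTML tags
import re
html = "<div class='content'>Hello, deepinout.com</div>"
# "<", then as few characters as possible (non-greedy), then ">"
pattern = r"<.*?>"
result = re.findall(pattern, html)
print(result)
Output:
["<div class='content'>", '</div>']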
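The `?` after `.*` in the pattern above is what keeps each match inside a single tag. The following is a minimal sketch of the difference, reusing the sample string from Example 1; the variable names `greedy`, `lazy`, and `negated` are illustrative only.

```python
import re

html = "<div class='content'>Hello, deepinout.com</div>"

# Greedy: .* runs to the LAST ">" in the string, so both tags and the
# text between them come back as one match.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the FIRST ">" it can reach, one match per tag.
lazy = re.findall(r"<.*?>", html)

# A negated character class is a common alternative to the non-greedy form.
negated = re.findall(r"<[^>]+>", html)

print(greedy)   # the whole string as a single match
print(lazy)     # ["<div class='content'>", '</div>']
print(negated)  # same result as the non-greedy pattern
```

For simple input like this, the non-greedy form and the negated class behave the same; the negated class simply states the intent ("anything except `>`") more directly.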
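Example 2: Matching HTML tags with attributes
import re
html = "<a href='https://www.deepinout.com'>Deepinout</a>"
# The same pattern also covers tags that carry attributes
pattern = r"<.*?>"
result = re.findall(pattern, html)
print(result)
Output:
["<a href='https://www.deepinout.com'>", '</a>']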
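When the tag name and its attributes are needed separately, the opening tag can be split further. The sketch below is one possible approach, not a full HTML attribute parser: it assumes every attribute value is wrapped in single or double quotes, and the names `TAG_RE`, `ATTR_RE`, and `parse_open_tags` are hypothetical helpers introduced for illustration.

```python
import re

# Opening tag: a name, then everything up to the closing ">".
TAG_RE = re.compile(r"<(?P<name>[a-zA-Z][a-zA-Z0-9]*)(?P<attrs>[^>]*)>")

# Attribute: name = quoted value; \2 requires the same quote to close it.
ATTR_RE = re.compile(r"""([\w-]+)\s*=\s*(["'])(.*?)\2""")


def parse_open_tags(html):
    """Return a list of (tag_name, {attribute: value}) for each opening tag."""
    tags = []
    for match in TAG_RE.finditer(html):
        attrs = {
            name: value
            for name, _quote, value in ATTR_RE.findall(match.group("attrs"))
        }
        tags.append((match.group("name"), attrs))
    return tags


html = "<a href='https://www.deepinout.com' class=\"link\">Deepinout</a>"
parsed = parse_open_tags(html)
print(parsed)
# [('a', {'href': 'https://www.deepinout.com', 'class': 'link'})]
```

Closing tags such as `</a>` are skipped because the tag name must start directly after `<`.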
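Example 3: Matching nested HTML tags
import re
html = "<div><p>Hello, deepinout.com</p></div>"
# Each opening and closing tag is returned separately; nesting is not tracked
pattern = r"<.*?>"
result = re.findall(pattern, html)
print(result)
Output:
['<div>', '<p>', '</p>', '</div>']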
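A single pattern cannot pair an opening tag with its matching closing tag once elements of the same name are nested. As a rough sketch of one workaround, the tags can be found with a regular expression while a counter tracks the nesting depth. The function name `outermost_divs` is hypothetical, and the sketch assumes well-formed input containing only plain `<div>` elements.

```python
import re

# Matches both <div ...> and </div>; group 1 is "/" for a closing tag.
DIV_TAG_RE = re.compile(r"<(/?)div\b[^>]*>")


def outermost_divs(html):
    """Return each top-level <div>...</div> block, including nested divs."""
    depth = 0
    start = None
    blocks = []
    for match in DIV_TAG_RE.finditer(html):
        if match.group(1) == "":      # opening tag
            if depth == 0:
                start = match.start()
            depth += 1
        elif depth > 0:               # closing tag that has an opener
            depth -= 1
            if depth == 0:
                blocks.append(html[start:match.end()])
    return blocks


html = "<div><div>inner</div></div><div>second</div>"

# The non-greedy pattern stops at the first </div>, cutting the outer div short.
naive = re.findall(r"<div>.*?</div>", html)
balanced = outermost_divs(html)

print(naive)     # ['<div><div>inner</div>', '<div>second</div>']
print(balanced)  # ['<div><div>inner</div></div>', '<div>second</div>']
```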
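Extracting Content from HTML Tags
Example 4: Extracting the text content of a tag
import re
html = "<h1>Welcome to deepinout.com</h1>"
# The capture group (.*?) holds the text between the opening and closing tags
pattern = r"<.*?>(.*?)</.*?>"
result = re.findall(pattern, html)
print(result)
Output:
['Welcome to deepinout.com']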
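One detail worth knowing when extracting text: by default `.` does not match a newline, so content that spans several lines is missed. The sketch below illustrates this with a made-up multi-line heading and the `re.DOTALL` flag; the variable names are illustrative.

```python
import re

# Hypothetical input where the heading text sits on its own line.
html = "<h1>\n  Welcome to deepinout.com\n</h1>"

# Without DOTALL, "." cannot cross the newlines, so nothing matches.
single_line = re.findall(r"<h1>(.*?)</h1>", html)

# With DOTALL, "." also matches "\n" and the text is captured.
multi_line = re.findall(r"<h1>(.*?)</h1>", html, flags=re.DOTALL)

# Strip the surrounding whitespace that came along with the newlines.
cleaned = [text.strip() for text in multi_line]

print(single_line)  # []
print(cleaned)      # ['Welcome to deepinout.com']
```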
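Example 5: Extracting an attribute value from a tag
import re
html = "<a href='https://www.deepinout.com'>Deepinout</a>"
# Capture everything between the single quotes that follow href=
pattern = r"href='(.*?)'"
result = re.findall(pattern, html)
print(result)
Output:
['https://www.deepinout.com']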
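Replacing HTML Tags
Example 6: Replacing HTML tags with plain text
import re
html = "<p>Hello, <strong>deepinout.com</strong></p>"
# Replace every tag with the empty string, leaving only the text
pattern = r"<.*?>"
result = re.sub(pattern, "", html)
print(result)
Output:
Hello, deepinout.com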
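Removing the tags leaves character references such as `&amp;` untouched. A small sketch of one way to handle both steps, using `html.unescape` from the standard library; the helper name `strip_tags` and the sample markup are assumptions made for this example.

```python
import re
from html import unescape


def strip_tags(markup):
    """Remove tags, then decode character references such as &amp;."""
    text = re.sub(r"<[^>]+>", "", markup)
    return unescape(text)


plain = strip_tags("<p>Tom &amp; Jerry at <strong>deepinout.com</strong></p>")
print(plain)  # Tom & Jerry at deepinout.com
```

The tags are removed before decoding so that an escaped `&lt;` in the text is not mistaken for the start of a tag.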
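Example 7: Replacing an HTML tag with specified text
import re
html = "<p>Hello, <strong>deepinout.com</strong></p>"
# The whole <strong>...</strong> element, including its text, is replaced
pattern = r"<strong>(.*?)</strong>"
result = re.sub(pattern, "Deepinout", html)
print(result)
Output:
<p>Hello, Deepinout</p>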
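The capture group in the pattern above is not used by a fixed replacement string. As a brief sketch, `re.sub` can also reuse the captured text, either through a `\1` backreference or through a replacement function; the variable names `kept` and `shouted` are illustrative.

```python
import re

html = "<p>Hello, <strong>deepinout.com</strong></p>"
pattern = r"<strong>(.*?)</strong>"

# \1 inserts the text captured by the first group into the replacement.
kept = re.sub(pattern, r"**\1**", html)

# A function receives the match object and returns the replacement text.
shouted = re.sub(pattern, lambda m: m.group(1).upper(), html)

print(kept)     # <p>Hello, **deepinout.com**</p>
print(shouted)  # <p>Hello, DEEPINOUT.COM</p>
```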
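Matching Specific HTML Tags
Example 8: Matching all link tags
import re
html = "<a href='https://www.deepinout.com'>Deepinout</a> <a href='https://www.example.com'>Example</a>"
# \b keeps "<a" from also matching other tags that start with "a", such as <abbr>
pattern = r"<a\b.*?</a>"
result = re.findall(pattern, html)
print(result)
Output:
["<a href='https://www.deepinout.com'>Deepinout</a>", "<a href='https://www.example.com'>Example</a>"]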
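Example 9: Matching all image tags
import re
html = "<img src='image1.jpg'> <img src='image2.jpg'>"
# <img> has no closing tag, so the match ends at the first ">"
pattern = r"<img\b.*?>"
result = re.findall(pattern, html)
print(result)
Output:
["<img src='image1.jpg'>", "<img src='image2.jpg'>"]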
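HTML tag names are case-insensitive, and image tags are sometimes written in the self-closing form. The following sketch shows how the `re.IGNORECASE` flag affects the result; the mixed-case sample markup is made up for this illustration.

```python
import re

# Hypothetical markup mixing lowercase and uppercase tag names.
html = "<img src='image1.jpg'> <IMG SRC=\"image2.jpg\" />"

# Case-sensitive by default: only the lowercase tag is found.
case_sensitive = re.findall(r"<img\b[^>]*>", html)

# IGNORECASE also picks up <IMG ...>, including the self-closing form.
case_insensitive = re.findall(r"<img\b[^>]*>", html, flags=re.IGNORECASE)

print(case_sensitive)    # ["<img src='image1.jpg'>"]
print(case_insensitive)  # ["<img src='image1.jpg'>", '<IMG SRC="image2.jpg" />']
```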
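Matching Specific Attribute Values
Example 10: Matching the URLs of all links
import re
html = "<a href='https://www.deepinout.com'>Deepinout</a> <a href='https://www.example.com'>Example</a>"
# findall returns only the captured group, one URL per href attribute
pattern = r"href='(.*?)'"
result = re.findall(pattern, html)
print(result)
Output:
['https://www.deepinout.com', 'https://www.example.com']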
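The pattern above only finds `href` values written in lowercase with single quotes. For comparison, here is a sketch of the same task using `html.parser.HTMLParser` from the standard library, which normalizes tag and attribute names to lowercase and accepts either quote style. The class name `LinkCollector` and the second, uppercase link are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased; attrs is a list of pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


html = (
    "<a href='https://www.deepinout.com'>Deepinout</a> "
    "<A HREF=\"https://www.example.com\">Example</A>"
)

collector = LinkCollector()
collector.feed(html)
print(collector.links)
# ['https://www.deepinout.com', 'https://www.example.com']
```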
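Example 11: Matching the URLs of all images
import re
html = "<img src='image1.jpg'> <img src='image2.jpg'>"
# Capture the value of each src attribute
pattern = r"src='(.*?)'"
result = re.findall(pattern, html)
print(result)
Output:
['image1.jpg', 'image2.jpg']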
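Matching Multiple HTML Tags
Example 12: Matching all heading tags
import re
html = "<h1>Title 1</h1> <h2>Title 2</h2> <h3>Title 3</h3>"
# [1-3] limits the match to h1 through h3; the closing level is not checked against the opening one
pattern = r"<h[1-3].*?</h[1-3]>"
result = re.findall(pattern, html)
print(result)
Output:
['<h1>Title 1</h1>', '<h2>Title 2</h2>', '<h3>Title 3</h3>']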
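The heading pattern above accepts any closing level, so `<h1>...</h2>` would also match. A possible refinement, sketched here, captures the tag name and requires the same name in the closing tag through the backreference `\1`. Extending the range to `h1` through `h6` and adding a `class` attribute to the sample are assumptions made for this illustration.

```python
import re

html = "<h1>Title 1</h1> <h2 class='sub'>Title 2</h2> <h3>Title 3</h3>"

# Group 1 is the tag name, group 2 the heading text; \1 must match group 1.
pattern = r"<(h[1-6])\b[^>]*>(.*?)</\1>"
headings = re.findall(pattern, html)
print(headings)
# [('h1', 'Title 1'), ('h2', 'Title 2'), ('h3', 'Title 3')]

# A mismatched pair is accepted by the looser pattern but not by this one.
mismatched = "<h1>Broken</h2>"
loose = re.findall(r"<h[1-3].*?</h[1-3]>", mismatched)
strict = re.findall(pattern, mismatched)
print(loose)   # ['<h1>Broken</h2>']
print(strict)  # []
```

Because the pattern has two capture groups, `re.findall` returns a list of tuples rather than the full matched strings.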
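Example 13: Matching list tags
import re
html = "<ul><li>Item 1</li><li>Item 2</li></ul>"
# Matches each <ul> element together with the <li> items inside it
pattern = r"<ul>.*?</ul>"
result = re.findall(pattern, html)
print(result)
Output:
['<ul><li>Item 1</li><li>Item 2</li></ul>']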
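Matching Specific Content
Example 14: Matching tags that contain specific text
import re
html = "<p>Hello, deepinout.com</p> <p>Welcome to deepinout.com</p>"
# The dot is escaped so that it matches a literal "." only
pattern = r"<p>.*?deepinout\.com.*?</p>"
result = re.findall(pattern, html)
print(result)
Output:
['<p>Hello, deepinout.com</p>', '<p>Welcome to deepinout.com</p>']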
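When the text to search for comes from a variable, `re.escape` takes care of characters such as `.` that have a special meaning in a pattern. The sketch below also keeps the match from running across a `</p>` boundary. The function name `paragraphs_containing` and the `deepinoutXcom` sample are hypothetical, chosen to show what an unescaped dot would let through.

```python
import re


def paragraphs_containing(html, needle):
    """Return each plain <p>...</p> whose text contains needle literally."""
    # (?:(?!</p>).)*? consumes characters but never steps over a "</p>".
    inside = r"(?:(?!</p>).)*?"
    pattern = "<p>" + inside + re.escape(needle) + inside + "</p>"
    return re.findall(pattern, html)


html = "<p>Hello, deepinoutXcom</p> <p>Welcome to deepinout.com</p>"

# An unescaped "." matches any character, so "deepinoutXcom" slips through.
naive = re.findall(r"<p>.*?deepinout.com.*?</p>", html)
strict = paragraphs_containing(html, "deepinout.com")

print(naive)   # both paragraphs
print(strict)  # ['<p>Welcome to deepinout.com</p>']
```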
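Example 15: Matching tags that do not contain specific text
import re
html = "<p>Hello, deepinout.com</p> <p>Welcome to example.com</p>"
# The negative lookahead rejects any position where "example.com" begins
pattern = r"<p>(?:(?!example\.com).)*?</p>"
result = re.findall(pattern, html)
print(result)
Output:
['<p>Hello, deepinout.com</p>']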
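A negative lookahead is compact but can be hard to read. As an alternative sketch, the same result for this sample can be reached by matching every paragraph first and filtering the list in plain Python; the variable names `paragraphs` and `kept` are illustrative.

```python
import re

html = "<p>Hello, deepinout.com</p> <p>Welcome to example.com</p>"

# Step 1: collect every paragraph with a simple pattern.
paragraphs = re.findall(r"<p>.*?</p>", html)

# Step 2: drop the ones that contain the unwanted text.
kept = [p for p in paragraphs if "example.com" not in p]

print(kept)  # ['<p>Hello, deepinout.com</p>']
```

Splitting the work this way keeps the pattern short and makes the exclusion rule easy to change later.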
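Matching Specific Structures
Example 16: Matching all paragraph tags
import re
html = "<p>Paragraph 1</p> <div>Content</div> <p>Paragraph 2</p>"
# Only <p> elements match; the <div> between them is skipped
pattern = r"<p>.*?</p>"
result = re.findall(pattern, html)
print(result)
Output:
['<p>Paragraph 1</p>', '<p>Paragraph 2</p>']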
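Example 17: Matching all div tags
import re
html = "<p>Paragraph 1</p> <div>Content</div> <p>Paragraph 2</p>"
# Only the <div> element matches; the surrounding paragraphs are skipped
pattern = r"<div>.*?</div>"
result = re.findall(pattern, html)
print(result)
Output:
['<div>Content</div>']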
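Matching Specific Formatting
Example 18: Matching all bold text
import re
html = "<p>Hello, <strong>deepinout.com</strong></p> <p>Welcome to <strong>example.com</strong></p>"
# Matches each <strong> element, including its tags
pattern = r"<strong>.*?</strong>"
result = re.findall(pattern, html)
print(result)
Output:
['<strong>deepinout.com</strong>', '<strong>example.com</strong>']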
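Example 19: Matching all italic text
import re
html = "<p>Hello, <em>deepinout.com</em></p> <p>Welcome to <em>example.com</em></p>"
# Matches each <em> element, including its tags
pattern = r"<em>.*?</em>"
result = re.findall(pattern, html)
print(result)
Output:
['<em>deepinout.com</em>', '<em>example.com</em>']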
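Matching a Specific Number of Tags
Example 20: Matching repeated tags
import re
html = "<p>Paragraph 1</p> <p>Paragraph 2</p> <p>Paragraph 3</p>"
# findall returns one entry per occurrence, so repeated tags are all collected
pattern = r"<p>.*?</p>"
result = re.findall(pattern, html)
print(result)
Output:
['<p>Paragraph 1</p>', '<p>Paragraph 2</p>', '<p>Paragraph 3</p>']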
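To work with the number of repetitions itself, the matches can be counted, or a `{n}` quantifier can require a fixed number of consecutive tags. The following is a rough sketch that assumes the paragraphs are separated only by whitespace; the variable names are illustrative.

```python
import re

html = "<p>Paragraph 1</p> <p>Paragraph 2</p> <p>Paragraph 3</p>"

# Counting: one list entry per paragraph.
count = len(re.findall(r"<p>.*?</p>", html))

# {3} requires exactly three consecutive paragraphs (whitespace allowed between).
three_in_a_row = re.search(r"(?:<p>.*?</p>\s*){3}", html)

# {4} cannot succeed because only three <p> tags are present.
four_in_a_row = re.search(r"(?:<p>.*?</p>\s*){4}", html)

print(count)                      # 3
print(three_in_a_row is not None)  # True
print(four_in_a_row)              # None
```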