Matching HTML Tag Content with Regular Expressions

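In front-end development we often need to extract specific content from HTML source, for example the link address or the text inside an <a> tag. Regular expressions are one way to match and extract it.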

Matching HTML tags

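First, here is a regular expression that matches an HTML element, from its opening tag through its closing tag:

/<([a-z]+[1-6]?)\b[^>]*>([\s\S]*?)<\/\1>/g

It matches elements such as the following:

<a href="http://www.baidu.com">Baidu it and you'll know!</a>
<div class="container">
  <p>Welcome to my homepage!</p>
</div>

The first group captures the tag name, and \1 in the closing tag requires the same name, so a <div> is only closed by </div>. The second group, [\s\S]*?, captures the content. The trailing ? makes the quantifier lazy: it matches as little as possible and stops at the first suitable closing tag instead of running on to the last one. [\s\S] is used rather than . because . does not match line breaks, which would make the multi-line <div> above fail to match. g enables global matching, so every match in the string is found rather than only the first. For the <div> above, the captured content is the inner <p> element as a whole. Elements of the same name nested inside each other are beyond what this pattern can pair up correctly.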
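The effect of the lazy quantifier is easiest to see side by side with its greedy counterpart. The following is a minimal sketch; the sample string and the variable names are made up for illustration.

```javascript
// Two sibling tags on one line make the difference visible.
const sample = '<b>one</b><b>two</b>';

// Greedy: .* runs to the last </b>, swallowing the markup in between.
const greedy = sample.match(/<b>(.*)<\/b>/)[1];

// Lazy: .*? stops at the first </b> it can reach.
const lazy = sample.match(/<b>(.*?)<\/b>/)[1];

console.log(greedy); // "one</b><b>two"
console.log(lazy);   // "one"
```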

The following example shows how to extract the matched content in JavaScript:

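const html = `
  <a href="http://www.baidu.com">Baidu it and you'll know!</a>
  <div class="container">
    Welcome to my homepage!
  </div>
`;

// Group 1 is the tag name, group 2 the content. [\s\S]*? is used instead of
// .*? because . does not match the line breaks inside the <div>.
const pattern = /<([a-z]+[1-6]?)\b[^>]*>([\s\S]*?)<\/\1>/g;
const res = [];
let match;

// With the g flag, each exec() call resumes where the previous match ended.
while ((match = pattern.exec(html)) !== null) {
  const content = match[2].trim();
  res.push(content);
}

console.log(res); // ["Baidu it and you'll know!", "Welcome to my homepage!"]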

The code above calls exec() repeatedly to walk through the whole HTML string and stores the content of each match in the res array.
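The same extraction can also be written without an explicit loop. The sketch below is one possible variant, assuming an environment that supports String.prototype.matchAll (ES2020). The helper name extractTags and the sample markup are hypothetical. The pattern uses [\s\S]*? so that content spanning several lines is captured, and a \1 backreference so that the closing tag must carry the same name as the opening tag.

```javascript
// Hypothetical helper: returns the tag name and trimmed inner content of each match.
function extractTags(html) {
  // Group 1 = tag name, group 2 = inner content; \1 requires a matching closing tag name.
  const pattern = /<([a-z]+[1-6]?)\b[^>]*>([\s\S]*?)<\/\1>/g;
  // matchAll yields one match object per hit; Array.from maps each to a plain object.
  return Array.from(html.matchAll(pattern), (m) => ({
    tag: m[1],
    content: m[2].trim(),
  }));
}

const tags = extractTags(`
  <h1>Title</h1>
  <div class="box">
    Hello
  </div>
`);

console.log(tags);
// [ { tag: 'h1', content: 'Title' }, { tag: 'div', content: 'Hello' } ]
```

Returning the tag name alongside the content makes it possible to tell afterwards which element each piece of text came from.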

Matching a specific HTML tag

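If we only want to match one particular kind of tag, we can put the tag name into the regular expression. For example, to match only <p> tags, the expression becomes: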

/<p\b[^>]*>(.*?)<\/p>/g

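In the same way, we can change the tag name to get whatever content we need. Note that .*? only covers content that stays on a single line, since . does not match line breaks; replace it with [\s\S]*? if a paragraph may span several lines. The following example matches the content of <p> tags: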

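const html = `
  <div class="container">
    <p>Welcome to my homepage!</p>
    <p>This is my personal blog!</p>
    <p>Thank you for visiting!</p>
  </div>
`;

// \b after p keeps the pattern from matching other tags that start with "p", such as <pre>.
const pattern = /<p\b[^>]*>(.*?)<\/p>/g;
const res = [];
let match;

while ((match = pattern.exec(html)) !== null) {
  const content = match[1].trim();
  res.push(content);
}

console.log(res); // ["Welcome to my homepage!", "This is my personal blog!", "Thank you for visiting!"]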
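Since only the tag name changes from one case to the next, the pattern can be built from a parameter. The following is a minimal sketch under that assumption; the helper name contentsOf and the sample markup are hypothetical, and the tag name is restricted to letters and digits so that nothing unexpected ends up inside the pattern.

```javascript
// Hypothetical helper: collects the trimmed inner content of every <tagName> element.
function contentsOf(tagName, html) {
  // Allow only simple tag names; anything else is rejected before building the pattern.
  if (!/^[a-z][a-z0-9]*$/i.test(tagName)) {
    throw new Error(`Unsupported tag name: ${tagName}`);
  }
  // Backslashes are doubled because the pattern is written as a string.
  const pattern = new RegExp(`<${tagName}\\b[^>]*>([\\s\\S]*?)</${tagName}>`, 'gi');
  return Array.from(html.matchAll(pattern), (m) => m[1].trim());
}

const page = '<ul><li>first</li><li class="x">second</li></ul>';

console.log(contentsOf('li', page)); // ["first", "second"]
```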

Matching HTML tag attributes

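In front-end development we also often need the value of a particular attribute, such as the link address of an <a> tag or the image address of an <img> tag.

Regular expressions can be used for this as well. The following example matches the link address in an <a> tag: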

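const html = `
  <a href="http://www.baidu.com">Baidu it and you'll know!</a>
`;

// The trailing [^>]* lets href be followed by other attributes before the closing >.
const pattern = /<a\b[^>]*\bhref="([^"]+)"[^>]*>/i;
const match = pattern.exec(html);
// exec() returns null when nothing matches, so guard before reading the group.
const url = match ? match[1] : null;

console.log(url); // "http://www.baidu.com"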

The regular expression above matches an <a> tag that has an href attribute and captures the link address.
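The same idea carries over to the <img> case mentioned earlier. The sketch below is one possible generalization, not a complete attribute parser: the helper name attributeValues and the sample markup are hypothetical, tag and attr are assumed to be plain names without regex metacharacters, and values are assumed to be quoted with either single or double quotes.

```javascript
// Hypothetical helper: returns every value of `attr` found on <tag ...> elements.
// Assumes `tag` and `attr` are plain names such as "img" and "src".
function attributeValues(html, tag, attr) {
  // (["']) captures the opening quote; \1 makes the closing quote match it.
  const pattern = new RegExp(
    `<${tag}\\b[^>]*?\\s${attr}\\s*=\\s*(["'])(.*?)\\1`,
    'gi'
  );
  return Array.from(html.matchAll(pattern), (m) => m[2]);
}

const snippet = `
  <img class="logo" src="/img/logo.png" alt="logo">
  <img src='/img/banner.jpg'>
`;

console.log(attributeValues(snippet, 'img', 'src'));
// ["/img/logo.png", "/img/banner.jpg"]
```

Requiring whitespace before the attribute name keeps the pattern from picking up attributes that merely end in the same letters, such as data-src.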