Validating URLs with Python Regular Expressions | 极客笔记

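Web applications often need to validate URLs. A URL (Uniform Resource Locator) is a string that identifies the location of a resource on the internet, and a well-formed URL is a prerequisite for users reaching that resource. In Python, regular expressions can be used to check whether a URL is well formed.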

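What Is a Regular Expression

A regular expression is a tool for describing character patterns. It can be used to match text, and also to parse it or perform replacements. In Python, regular expressions are handled by the re module. A regular expression is a string made up of ordinary characters and operators that together express a rule.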

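Some common regular expression symbols and their meanings:

.: matches any character except a newline
*: matches the preceding element zero or more times
+: matches the preceding element one or more times
?: matches the preceding element zero or one time
^: matches the start of the string
$: matches the end of the string (or just before a trailing newline)
\d: matches a digit
\w: matches a letter, a digit, or an underscore
[]: matches any one of the characters inside the brackets
(): captures the matched substring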
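
To make the symbols above concrete, here is a small sketch that exercises each of them with the re module. The sample strings are made up for illustration only.

```python
import re

# '.' matches any character except a newline
print(bool(re.fullmatch(r'a.c', 'abc')))    # True
print(bool(re.fullmatch(r'a.c', 'a\nc')))   # False

# '*', '+' and '?' quantify the element immediately before them
print(bool(re.fullmatch(r'ab*', 'a')))      # True
print(bool(re.fullmatch(r'ab+', 'a')))      # False
print(bool(re.fullmatch(r'ab?', 'abb')))    # False

# '\d' and '\w' are predefined classes; '[]' defines a custom class
digits = re.findall(r'\d+', 'port 8080, code 200')
labels = re.findall(r'[\w-]+', 'www.my-site.com')
print(digits)   # ['8080', '200']
print(labels)   # ['www', 'my-site', 'com']

# '^' anchors at the start; '()' captures the matched substring
m = re.match(r'^(\w+)://', 'https://www.example.com')
print(m.group(1))   # https
```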

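Validating a URL with a Regular Expression

A regular expression that covers common URLs does not have to be complicated. It mainly matches the format of the protocol, domain name, port, and path. Below is a simple Python function that checks whether a URL follows this common format: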

import re

def validate_url(url):
    # Regular expression for a URL: protocol, host, optional port, optional path.
    # The hyphen sits last in the character class so it is read as a literal,
    # not as a range; [\w-./] raises "bad character range" in Python.
    pattern = r'^(http|https)://([\w-]+\.)+[\w-]+(:\d+)?(/[\w./?%&=-]*)?$'
    return re.match(pattern, url) is not None

# Test the URL validation function
urls = [
    "http://www.example.com",
    "https://www.example.com/path/to/page",
    "http://192.168.1.1:8080/index.html",
    "ftp://ftp.example.com"
]

for url in urls:
    if validate_url(url):
        print(f"{url} is a valid URL")
    else:
        print(f"{url} is not a valid URL")

Running the code above prints:

http://www.example.com is a valid URL
https://www.example.com/path/to/page is a valid URL
http://192.168.1.1:8080/index.html is a valid URL
ftp://ftp.example.com is not a valid URL
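
One detail of anchored patterns is worth a quick illustration: in Python, $ also matches just before a single trailing newline, so a pattern written as ^...$ accepts a URL followed by "\n". The sketch below contrasts this with re.fullmatch, which requires the whole string to match. URL_BODY and both function names are hypothetical, and the pattern is a variant of the one in validate_url that assumes an optional port is allowed.

```python
import re

# Variant of the validate_url pattern without '^' and '$' (assumed for this sketch)
URL_BODY = r'https?://([\w-]+\.)+[\w-]+(:\d+)?(/[\w./?%&=-]*)?'

def validate_with_dollar(url):
    # '$' matches at the end of the string and also just before a trailing newline
    return re.match(URL_BODY + r'$', url) is not None

def validate_with_fullmatch(url):
    # re.fullmatch succeeds only if the entire string matches the pattern
    return re.fullmatch(URL_BODY, url) is not None

candidate = "http://www.example.com\n"
print(validate_with_dollar(candidate))     # True
print(validate_with_fullmatch(candidate))  # False
```

If input may contain stray whitespace, stripping it first or using re.fullmatch is probably the safer choice.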

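Common Parts of a URL

A standard URL consists of the following parts:

Protocol (scheme): e.g. http://, https://, ftp://
Hostname: e.g. www.example.com, 192.168.1.1
Port: e.g. 80, 8080
Path: e.g. /path/to/page, /index.html
Query parameters: e.g. ?key1=value1&key2=value2
Anchor (fragment): e.g. #section1, #top

Each of these parts has its own format requirements, and a regular expression can check whether a URL satisfies them.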
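
The parts listed above can be pulled out of a URL with named groups. This is a minimal sketch, not a full URL parser: the pattern URL_PARTS, the group names, and the helper split_url are all chosen for this illustration, and the pattern only assumes the simple layout shown in the list.

```python
import re

# Illustrative pattern; one named group per URL part
URL_PARTS = re.compile(
    r'^(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*)://'   # protocol
    r'(?P<host>[^/:?#]+)'                        # hostname
    r'(?::(?P<port>\d+))?'                       # optional port
    r'(?P<path>/[^?#]*)?'                        # optional path
    r'(?:\?(?P<query>[^#]*))?'                   # optional query parameters
    r'(?:#(?P<fragment>.*))?$'                   # optional anchor
)

def split_url(url):
    """Return a dict of URL parts, or None if the string does not match."""
    m = URL_PARTS.match(url)
    return m.groupdict() if m else None

parts = split_url(
    "http://www.example.com:8080/path/to/page?key1=value1&key2=value2#section1"
)
for name, value in parts.items():
    print(f"{name}: {value}")
```

Parts that are absent from the URL come back as None, which makes it easy to tell "no port given" apart from "port present but malformed".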

Validating the URL Protocol

First, the protocol part. For web URLs the protocol is usually http:// or https://:

import re

def validate_protocol(url):
    # Match the protocol part at the start of the URL: http:// or https://
    protocol_pattern = r'^https?://'
    return re.match(protocol_pattern, url) is not None

# Test the protocol validation function
urls = [
    "http://www.example.com",
    "https://www.example.com",
    "ftp://ftp.example.com"
]

for url in urls:
    if validate_protocol(url):
        print(f"{url}: protocol format is correct")
    else:
        print(f"{url}: protocol format is incorrect")

Running the code above prints:

http://www.example.com: protocol format is correct
https://www.example.com: protocol format is correct
ftp://ftp.example.com: protocol format is incorrect
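
URL schemes are case-insensitive, so HTTP:// and http:// name the same protocol. If uppercase input should be accepted, one option is the re.IGNORECASE flag. The function name below is hypothetical and the sketch is only one possible way to handle this.

```python
import re

def validate_protocol_ignore_case(url):
    # re.IGNORECASE lets 'HTTP://' and 'Https://' match as well
    return re.match(r'^https?://', url, re.IGNORECASE) is not None

for url in ["HTTP://www.example.com", "Https://www.example.com", "ftp://ftp.example.com"]:
    print(url, validate_protocol_ignore_case(url))
```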

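Validating the URL Hostname

Next, the hostname part, which usually looks like www.example.com or 192.168.1.1. Because re.search looks for the pattern anywhere in the string, the function below only checks that something shaped like a hostname occurs in the URL. It does not check the protocol, which is why the ftp:// URL also passes: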

import re

def validate_hostname(url):
    # Look for dot-separated labels (a hostname-like substring) anywhere in the URL
    hostname_pattern = r'([\w-]+\.)+[\w-]+'
    return re.search(hostname_pattern, url) is not None

# Test the hostname validation function
urls = [
    "http://www.example.com",
    "http://192.168.1.1:8080",
    "https://www.example.com/path/to/page",
    "ftp://ftp.example.com"
]

for url in urls:
    if validate_hostname(url):
        print(f"{url}: hostname format is correct")
    else:
        print(f"{url}: hostname format is incorrect")

Running the code above prints:

http://www.example.com: hostname format is correct
http://192.168.1.1:8080: hostname format is correct
https://www.example.com/path/to/page: hostname format is correct
ftp://ftp.example.com: hostname format is correct
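
The search-based check in validate_hostname is loose: a dotted string in the path, such as a.b in http://localhost/a.b, is enough to pass. A stricter variant might first isolate the host that follows ://, then test it separately as an IPv4 address or a domain name. The sketch below does that. All names are hypothetical, and the domain rule (labels of letters, digits, and inner hyphens, ending in an alphabetic label of at least two letters) is a simplifying assumption, not a complete specification.

```python
import re

# Hypothetical helpers; the rules are simplified assumptions
HOST_IN_URL = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*://([^/:?#]+)')
DOMAIN = re.compile(r'([A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z]{2,}')
IPV4 = re.compile(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})')

def validate_hostname_strict(url):
    m = HOST_IN_URL.match(url)
    if not m:
        return False
    host = m.group(1)
    ip = IPV4.fullmatch(host)
    if ip:
        # Every octet of an IPv4 address must be in the range 0-255
        return all(int(octet) <= 255 for octet in ip.groups())
    return DOMAIN.fullmatch(host) is not None

for url in [
    "http://www.example.com",
    "http://192.168.1.1:8080",
    "http://999.1.1.1",
    "http://localhost/a.b",
]:
    print(url, validate_hostname_strict(url))
```

Under these assumptions the first two URLs pass and the last two fail. Single-label hosts such as localhost are rejected by design in this sketch.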

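Validating the URL Port

Next, the port part, which is a number such as 80 or 8080. The port is optional, so a negative result below means only that the URL contains no explicit port, not that the URL itself is invalid: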

import re

def validate_port(url):
    # Look for a colon followed by digits (a port-like substring) in the URL
    port_pattern = r':[0-9]+'
    return re.search(port_pattern, url) is not None

# Test the port validation function
urls = [
    "http://www.example.com",
    "http://192.168.1.1:8080",
    "https://www.example.com/path/to/page",
    "ftp://ftp.example.com:21"
]

for url in urls:
    if validate_port(url):
        print(f"{url}: port number found")
    else:
        print(f"{url}: no port number found")

Running the code above prints:

http://www.example.com: no port number found
http://192.168.1.1:8080: port number found
https://www.example.com/path/to/page: no port number found
ftp://ftp.example.com:21: port number found
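
The pattern :[0-9]+ matches a colon followed by digits anywhere in the URL, including inside a query string such as ?t=12:30, and it places no limit on the number. A stricter sketch could look for the port only directly after the host and check its numeric range. The names below are hypothetical, and treating 1 to 65535 as the accepted range is an assumption of this example.

```python
import re

# Port is captured only when it directly follows the host
PORT_IN_URL = re.compile(
    r'^[a-zA-Z][a-zA-Z0-9+.-]*://[^/:?#]+(?::(\d+))?(?:[/?#]|$)'
)

def check_port(url):
    """Return 'absent', 'valid' or 'invalid' for the port of a URL."""
    m = PORT_IN_URL.match(url)
    if not m:
        return 'invalid'
    port = m.group(1)
    if port is None:
        return 'absent'
    # Assumed accepted range for this sketch: 1-65535
    return 'valid' if 0 < int(port) <= 65535 else 'invalid'

for url in [
    "http://www.example.com",
    "http://192.168.1.1:8080",
    "http://www.example.com:99999/",
    "http://www.example.com/x?t=12:30",
]:
    print(url, check_port(url))
```

Returning three states instead of a boolean keeps "no port given" distinct from "port given but out of range".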

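Validating the URL Path

Finally, the path part, which usually looks like /path/to/page or /index.html. The pattern has to be anchored after the protocol and host. Otherwise the // that follows the protocol would itself be matched as a path, and every URL would pass:

import re

def validate_path(url):
    # Skip the protocol and the host (everything up to the first '/', '?' or '#'
    # after '://'), then require a path that starts with '/'.
    # The hyphen sits last in the character class so it is read as a literal.
    path_pattern = r'^[a-zA-Z][a-zA-Z0-9+.-]*://[^/?#]+/[\w./?%&=-]*$'
    return re.match(path_pattern, url) is not None

# Test the path validation function
urls = [
    "http://www.example.com",
    "https://www.example.com/path/to/page",
    "http://192.168.1.1:8080/index.html",
    "ftp://ftp.example.com"
]

for url in urls:
    if validate_path(url):
        print(f"{url}: path format is correct")
    else:
        print(f"{url}: path format is incorrect")

Running the code above prints:

http://www.example.com: path format is incorrect
https://www.example.com/path/to/page: path format is correct
http://192.168.1.1:8080/index.html: path format is correct
ftp://ftp.example.com: path format is incorrect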
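
A character class that simply contains % accepts a percent sign anywhere, even though in a URL it is meant to introduce a two-digit hexadecimal escape such as %20. The sketch below checks an already extracted path more carefully. The name is_well_formed_path and the set of allowed plain characters are assumptions made for this example.

```python
import re

# Assumed rule for this sketch: a path is one or more '/'-separated segments,
# each made of letters, digits, '.', '_', '~', '-' or %XX escapes
PATH = re.compile(r'(?:/(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2})*)+')

def is_well_formed_path(path):
    # fullmatch ensures no stray characters before or after the path
    return PATH.fullmatch(path) is not None

for path in ["/path/to/page", "/index.html", "/a%20b", "/a%2", "/a%zz"]:
    print(path, is_well_formed_path(path))
```

Under these assumptions the first three paths pass, while /a%2 and /a%zz fail because the escape is incomplete or not hexadecimal.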