Python: Measuring and comparing the performance of regular expressions and BeautifulSoup for stripping HTML from a string

Created 2022.09.10
python

This post shows code that measures and compares the performance of stripping HTML tags from a string in Python with a regular expression and with BeautifulSoup, together with the results. Python 3.10.0 is used.

1. Environment
2. Performance measurement

Environment

OS Windows 11 Home 64bit
Python 3.10.0

Performance measurement

Using time.perf_counter, we run the tag-stripping operation 100,000 times with a regular expression and 100,000 times with BeautifulSoup, then compare the elapsed times.

import re
import time

from bs4 import BeautifulSoup  # the "lxml" parser used below also needs the lxml package

txt = "<div><b>world</b></div>"
n = 100_000

# Start timing
time_sta = time.perf_counter()

# Strip tags with a non-greedy regular expression
for _ in range(n):
    re.sub(re.compile('<.*?>'), '', txt)

# Stop timing
time_end = time.perf_counter()

# Show the result in milliseconds
result = time_end - time_sta
print(f"Regex : {result * 1000:.1f} ms")

# Start timing
time_sta = time.perf_counter()

# Parse the string with BeautifulSoup and read its text content
for _ in range(n):
    BeautifulSoup(txt, "lxml").text

# Stop timing
time_end = time.perf_counter()

# Show the result in milliseconds
result = time_end - time_sta
print(f"BeautifulSoup : {result * 1000:.1f} ms")
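A timing comparison only means something if both approaches produce the same text for the input. The following is a minimal sketch of such a check, not part of the original measurement. The helper name `strip_with_regex` is hypothetical, and the BeautifulSoup part is skipped when bs4 is not installed.

```python
import re

# Same non-greedy pattern as in the benchmark, compiled once
TAG_RE = re.compile(r"<.*?>")


def strip_with_regex(html: str) -> str:
    """Remove every substring that looks like an HTML tag."""
    return TAG_RE.sub("", html)


txt = "<div><b>world</b></div>"
regex_text = strip_with_regex(txt)
print(repr(regex_text))  # 'world'

try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:
    BeautifulSoup = None  # bs4 is unavailable, so skip the comparison

if BeautifulSoup is not None:
    try:
        soup_text = BeautifulSoup(txt, "lxml").text
    except FeatureNotFound:
        # lxml is not installed; fall back to the parser bundled with Python
        soup_text = BeautifulSoup(txt, "html.parser").text
    print("same result:", regex_text == soup_text)
```

Note that the regular expression matches only by tag shape, so the two approaches can diverge on input that is less tidy than this sample string.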

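In all three runs the regular expression was much faster, by a factor of roughly 60 to 100 for this input. That is not surprising, since BeautifulSoup parses the string into a document tree on every call.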

[Run 1]
Regex : 1503.2 ms
BeautifulSoup : 154576.6 ms

[Run 2]
Regex : 1330.3 ms
BeautifulSoup : 84497.0 ms

[Run 3]
Regex : 1312.7 ms
BeautifulSoup : 79985.6 ms
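The BeautifulSoup figure for run 1 is almost double that of runs 2 and 3, which suggests that a single pass can be noisy. One way to reduce that noise is to repeat the measurement and keep the fastest pass. The sketch below illustrates the idea with a hypothetical helper, `best_of`, built on the same time.perf_counter calls. The call count is lowered to 10,000 here purely to keep the example quick, and this is not the code that produced the numbers above.

```python
import re
import time
from typing import Callable


def best_of(func: Callable[[], object], n: int = 10_000, repeats: int = 3) -> float:
    """Return the fastest total time in seconds for n calls of func."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _call in range(n):
            func()
        timings.append(time.perf_counter() - start)
    # The minimum is the pass least disturbed by other activity on the machine
    return min(timings)


txt = "<div><b>world</b></div>"
tag_re = re.compile(r"<.*?>")  # compiled once, outside the timed loop

elapsed = best_of(lambda: tag_re.sub("", txt))
print(f"Regex : {elapsed * 1000:.1f} ms for 10,000 calls")
```

The BeautifulSoup side can be timed the same way by passing `lambda: BeautifulSoup(txt, "lxml").text` to `best_of`, assuming bs4 and lxml are installed.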
