Reply #4
Anonymous
Posted 2021-3-28 18:16:25
Last edited by Anonymous on 2021-3-28 18:21
"""Fetch 'http://rhetoric.byu.edu/'.

python 3.7+
pip install playwright
python -m playwright install
"""
import re
from pathlib import Path

from playwright.sync_api import sync_playwright


def main():
    url = "http://rhetoric.byu.edu/"
    # The context manager stops Playwright even if an error occurs.
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(url)
        # The links are read from the frame named "flowers".
        frame = page.frame("flowers")
        if not frame:
            raise SystemExit("Could not get the frame; check the network, etc.")
        html = frame.inner_html("html > body", timeout=40 * 1000)
        browser.close()
    # Prefix every href so the links point into the Figures/ directory.
    prefix = "http://rhetoric.byu.edu/Figures/"
    html = re.sub(r'href="([^"]+)', rf'href="{prefix}\1', html)
    Path("byu.html").write_text(html, "utf8")


if __name__ == "__main__":
    try:
        main()
    except Exception as exc:
        print(exc)
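A side note on the href rewrite in the script: the regex prepends the prefix to every href, so any link that is already absolute would get the prefix stuck in front of it. Here is a rough sketch of an alternative using urllib.parse.urljoin, which leaves absolute links alone. The helper name absolutize_hrefs and the sample link are made up for illustration, and I am assuming the frame's links are relative to the Figures/ directory, as the prefix in the script suggests.

```python
import re
from urllib.parse import urljoin

# Assumption: links in the frame are relative to this directory.
BASE = "http://rhetoric.byu.edu/Figures/"


def absolutize_hrefs(html, base=BASE):
    """Rewrite every href="..." value into an absolute URL resolved against base."""

    def repl(match):
        # urljoin returns already-absolute URLs unchanged and resolves
        # relative ones (including "../") against base.
        return 'href="{}"'.format(urljoin(base, match.group(1)))

    return re.sub(r'href="([^"]*)"', repl, html)
```

This could stand in for the re.sub line in the script if mixed relative and absolute links turn out to be a problem; for purely relative links the result should be the same.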
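Running the code above in a Python environment saves a file named byu.html (see the attachment); all the links are in byu.html. You can then use requests.get to fetch the content for each term. Anyone in the group who is interested can tinker with it. OP, no need to give me any points either; I was just bored and practicing with playwright, a handy scraping tool.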
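For anyone who wants to try the requests.get step, here is a minimal sketch, not tested against the live site. It pulls the links out of byu.html with the standard-library HTML parser and downloads each one. The names LinkCollector, extract_links and fetch_pages, the pages/ output directory, and the one-second delay are all my own placeholders.

```python
import time
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the unique links found in html, keeping first-seen order."""
    parser = LinkCollector()
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


def fetch_pages(links, out_dir="pages", delay=1.0):
    """Download each link with requests.get into out_dir (hypothetical layout)."""
    import requests  # third-party: pip install requests

    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, link in enumerate(links):
        resp = requests.get(link, timeout=30)
        resp.raise_for_status()
        # Save the raw bytes; numbered names avoid clashes between pages.
        (out / "{:04d}.html".format(i)).write_bytes(resp.content)
        time.sleep(delay)  # pause between requests to go easy on the server


# Usage, assuming byu.html is in the current directory:
#   links = extract_links(Path("byu.html").read_text(encoding="utf8"))
#   fetch_pages(links)
```

The pages are saved as raw bytes so that any encoding issues can be sorted out afterwards when parsing each term's page.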
Attachment: byu.zip (5.76 KB)