Posted on 2023-9-25 10:00:15
I put together a small Python scraper to download them; it also works for other threads with lots of attachments.
import os

import requests
from bs4 import BeautifulSoup

prefix = 'https://www.ouryao.com/'
download_directory = 'path/to/download/directory'  # replace with the directory to save the files into

# The HTML file can be obtained by saving this thread's page source from the browser.
with open('path/to/saved/thread.html', 'r') as file:  # replace with the saved HTML file (name and extension included)
    html = file.read()

soup = BeautifulSoup(html, 'html.parser')

# Each attachment on the page is wrapped in an <ignore_js_op> tag.
ignore_js_op_tags = soup.find_all('ignore_js_op')

for ignore_js_op_tag in ignore_js_op_tags:
    a_tag = ignore_js_op_tag.find('a')
    if a_tag is None or not a_tag.get('href'):
        continue  # skip attachment blocks without a download link
    text = a_tag.text
    link = prefix + a_tag['href']
    filename = text + '.pdf'
    file_path = os.path.join(download_directory, filename)
    # Get the cookie from the request headers in the browser's developer tools.
    response = requests.get(link, headers={
        "cookie": "your cookie after logging in"  # replace with your own cookie
    })
    with open(file_path, 'wb') as file:
        file.write(response.content)
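If saving the page source by hand is inconvenient, the same cookie can probably be used to fetch the thread HTML directly with requests and hand it to BeautifulSoup; the download loop above stays unchanged. A minimal sketch, where the thread URL is only a placeholder:

import requests
from bs4 import BeautifulSoup

# Placeholder thread URL; replace with the thread you want to download from.
thread_url = 'https://www.ouryao.com/thread-xxxxxx-1-1.html'
cookie = 'your cookie after logging in'  # same cookie as in the script above

# Fetch the thread page while logged in, then parse it the same way as the saved file.
response = requests.get(thread_url, headers={'cookie': cookie})
response.encoding = response.apparent_encoding  # let requests guess the page encoding
soup = BeautifulSoup(response.text, 'html.parser')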