Commit 345820f1 by qilei

qilei:add pdf author crawler

parent 1e5d8dd4
# PDF Crawler for Academic Websites
This tool batch-crawls PDF files and related metadata for papers hosted on major academic websites. It supports resuming interrupted runs and can collect fields such as authors, journal, and institution (some fields depend on the accuracy of the site's data and should be manually re-checked). **Currently supported sites:**
* mdpi
* acm
* springer
* ieee
* proquest
* patents
## Requirements
**1. Install dependencies**
```bash
pip install selenium pandas webdriver-manager openpyxl
```
(`webdriver-manager` fetches a matching chromedriver automatically; `openpyxl` is the engine pandas uses to read and write `.xlsx` files.)
**2. Install the [Chrome browser](https://www.google.com/chrome/)**
---
## Quick Start
### Step 1. Configure config.py
Open `config.py` in the project directory and set the following three paths to absolute paths on your machine:
```python
CHROME_USER_OPTION_PATH = "C:\\Users\\YourUsername\\AppData\\Local\\Google\\Chrome\\User Data"
DOWNLOAD_DIR = "D:\\your\\download\\dir\\pdf_tmp"
SAVED_DIR = "D:\\your\\save\\dir\\pdf_saved"
```
#### Path reference
* **CHROME_USER_OPTION_PATH**
Chrome user-data directory. Selenium launches Chrome with your real, already-logged-in profile (cookies and extensions included), so downloads do not require logging in on every run. Enter `chrome://version/` in Chrome's address bar to find the local "Profile Path".
* **DOWNLOAD_DIR**
Temporary download directory. PDFs are downloaded here first; make sure it is writable.
* **SAVED_DIR**
Final archive directory. Completed (and failed) downloads are moved and organized into this directory for archiving.
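Before the first run, a quick sanity check can catch path typos early. A minimal sketch (the helper name `check_config_paths` is illustrative, not part of the project):

```python
import os

def check_config_paths(chrome_profile, download_dir, saved_dir):
    """Return a list of path problems; an empty list means the config looks usable."""
    problems = []
    # The Chrome profile must already exist (it is created by Chrome itself).
    if not os.path.isdir(chrome_profile):
        problems.append(f"Chrome profile not found: {chrome_profile}")
    for name, d in (("DOWNLOAD_DIR", download_dir), ("SAVED_DIR", saved_dir)):
        os.makedirs(d, exist_ok=True)   # create the directory if it is missing
        if not os.access(d, os.W_OK):   # PDFs must be writable here
            problems.append(f"{name} is not writable: {d}")
    return problems
```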
---
### Step 2. Configure the crawl links in input.xlsx
Create `data/input.xlsx` with the following layout:
| url | author | pdf |
| -------------------------------------------- | ------ | --- |
| [https://xx.xx/xxxx](https://xx.xx/xxxx.pdf) | 0      | 0   |
**Notes**
* First column `url`: the page or PDF link you want to crawl.
* Initialize every other column to 0; the program sets a cell to 1 once that resource has been processed/crawled, which enables resuming and progress tracking.
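For a long link list, the sheet can be bootstrapped with pandas instead of hand-editing (a sketch; `make_input_sheet` is an illustrative helper, not part of the project, and writing `.xlsx` requires `openpyxl`):

```python
import pandas as pd

def make_input_sheet(urls, path=None):
    """Build the sheet main.py expects: one url per row, all status flags start at 0."""
    df = pd.DataFrame({"url": list(urls), "author": 0, "pdf": 0})
    if path:
        df.to_excel(path, index=False)  # e.g. path="data/input.xlsx"
    return df
```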
---
### Step 3. Log in with Chrome
**Note: some academic sites require a login before papers can be downloaded (especially institution-licensed PDF access). To make sure downloads work, complete the following steps in advance:**
1. **Start the Chrome browser.**
2. **Log in to the sites you plan to crawl and authenticate your institutional access. Common sites include:**
* mdpi
* acm
* springer
* ieee
* sciencedirect
* ...
3. **Confirm the login is still valid (you can download a PDF directly without a login page or captcha appearing).**
---
### Step 4. Close all Chrome windows
> **You must close every Chrome instance** (including background services); otherwise Selenium and the local Chrome will contend for the profile, which can cause startup failures or lost cookies.
---
### Step 5. Run the crawler
```bash
python main.py
```
---
## Features
1. **Resumable runs**: re-running never re-downloads papers that were already crawled; work can continue across multiple launches.
2. **PDF download**: supports common academic and patent sites: `["mdpi", "acm", "springer", "ieee", "proquest", "patents"]`.
3. **Metadata collection**: automatically extracts the paper's authors, journal name, and (on some sites) institution; results are saved to `modified_output.csv`.
* Note: institution scraping only works on some sites, and e.g. IEEE yields only an author's current affiliation, which may differ from the affiliation printed on the paper.
4. **Manual review recommended**: automatically crawled fields are for reference only and must be verified by hand.
---
### Notes
* Institution fields from some sites carry real uncertainty; treat them as auxiliary statistics only.
* **Manually re-check all results to ensure the data is accurate.**
* If the chromedriver download fails on the first run, check your proxy settings.
---
## Field support by site
| Site | Authors | Journal | Institution |
| --------------------------------------------------- | ------- | ------- | ------------------------------------------------ |
| ieeexplore.ieee.org | √ | √ | √ (author's current affiliation from their profile page; may differ from the paper) |
| dl.acm.org | √ | √ | |
| arxiv.org | √ | √ | |
| [www.mdpi.com](http://www.mdpi.com/) | √ | √ | √ |
| link.springer.com | √ | √ | |
| patents.google.com (patents) | √ | √ | √ |
| [www.sciencedirect.com](http://www.sciencedirect.com/) | √ | √ | √ |
CHROME_USER_OPTION_PATH = "C:\\Users\\26224\\AppData\\Local\\Google\\Chrome\\User Data"
DOWNLOAD_DIR = "D:\\Project\\杂项\\202505引用统计\\tmp"
SAVED_DIR = "D:\\Project\\杂项\\202505引用统计\\saved"
https://search.proquest.com/openview/29c6f7968a99bbc9b481d85968ed494b/1?pq-origsite=gscholar&cbl=18750&diss=y
https://search.proquest.com/openview/4ab8b9b7b6c338ab4729cc0a11279c45/1?pq-origsite=gscholar&cbl=18750&diss=y
https://search.proquest.com/openview/3a29614b2f0e4addb4b59b10af0d8f2b/1?pq-origsite=gscholar&cbl=18750&diss=y
https://search.proquest.com/openview/19ab6db3db1a1658316f5eb66aa4bc51/1?pq-origsite=gscholar&cbl=18750&diss=y
https://search.proquest.com/openview/16d4ea03d4889fbf706e65ec5534151b/1?pq-origsite=gscholar&cbl=18750&diss=y
https://ieeexplore.ieee.org/abstract/document/10487089/
https://link.springer.com/article/10.1007/s11432-021-3596-x
https://ieeexplore.ieee.org/abstract/document/10247776/
https://arxiv.org/abs/2303.12397
https://dl.acm.org/doi/abs/10.1145/3600092
https://arxiv.org/abs/2312.06086
https://ieeexplore.ieee.org/abstract/document/10173478/
https://ieeexplore.ieee.org/abstract/document/10085688/
https://ieeexplore.ieee.org/abstract/document/10590081/
https://ieeexplore.ieee.org/abstract/document/10476057/
https://ieeexplore.ieee.org/abstract/document/10233885/
https://www.sciencedirect.com/science/article/pii/S0141933123000224
https://www.mdpi.com/2079-9268/13/1/5
https://ieeexplore.ieee.org/abstract/document/10265716/
https://patents.google.com/patent/US11675676B2/en
https://link.springer.com/chapter/10.1007/978-3-031-19568-6_10
https://patents.google.com/patent/US11630997B2/en
https://ieeexplore.ieee.org/abstract/document/10317994/
https://ieeexplore.ieee.org/abstract/document/10114404/
https://link.springer.com/chapter/10.1007/978-3-031-42478-6_6
https://link.springer.com/chapter/10.1007/978-3-031-19568-6_6
https://arxiv.org/abs/2302.09564
https://www.jstage.jst.go.jp/article/elex/20/21/20_20.20230379/_article/-char/ja/
https://ietresearch.onlinelibrary.wiley.com/doi/abs/10.1049/cdt2.12060
https://link.springer.com/chapter/10.1007/978-3-031-19568-6_13
https://link.springer.com/chapter/10.1007/978-3-031-19568-6_12
https://era.library.ualberta.ca/items/bf8101dd-1663-489b-9b6a-51c5ab62f6f5
https://search.proquest.com/openview/fbc658c5123012204f7d0c6ec839cbc0/1?pq-origsite=gscholar&cbl=18750&diss=y
https://ieeexplore.ieee.org/abstract/document/10365973/
https://ieeexplore.ieee.org/abstract/document/10130257/
https://patents.google.com/patent/US11609760B2/en
https://ieeexplore.ieee.org/abstract/document/10078001/
https://search.proquest.com/openview/ad3dbd625f5a6efc7cd59bb46892bbfe/1?pq-origsite=gscholar&cbl=18750&diss=y
https://ieeexplore.ieee.org/abstract/document/10168637/
https://link.springer.com/chapter/10.1007/978-3-031-19568-6_1
https://link.springer.com/chapter/10.1007/978-3-031-39932-9_18
https://link.springer.com/chapter/10.1007/978-3-031-42785-5_9
https://link.springer.com/article/10.1007/s11390-021-1161-y
https://www.worldscientific.com/doi/abs/10.1142/S0218126623502183
https://theses.hal.science/tel-04561235/
https://www.ideals.illinois.edu/items/127346
https://search.proquest.com/openview/8936ef6cce004c2ca50f00606cc46237/1?pq-origsite=gscholar&cbl=18750&diss=y
https://ieeexplore.ieee.org/abstract/document/10466951/
https://ieeexplore.ieee.org/abstract/document/10168589/
https://search.proquest.com/openview/f1dc7b70d091857ef09a7b30cca06c14/1?pq-origsite=gscholar&cbl=18750&diss=y
https://patents.google.com/patent/US11582481B2/en
https://link.springer.com/chapter/10.1007/978-981-99-2897-2_9
https://search.proquest.com/openview/6727545e290befd7c91973960744f5d7/1?pq-origsite=gscholar&cbl=18750&diss=y
https://patents.google.com/patent/US11675624B2/en
https://search.proquest.com/openview/d9d75aa1648608f2c5335c4fe8e9d2f1/1?pq-origsite=gscholar&cbl=18750&diss=y
https://link.springer.com/chapter/10.1007/978-3-031-29970-4_5
https://search.proquest.com/openview/b25dc9caa6d074de23a3ce113e7f8f0e/1?pq-origsite=gscholar&cbl=18750&diss=y
https://drpress.org/ojs/index.php/HSET/article/view/15880
https://advance.sagepub.com/doi/full/10.36227/techrxiv.170326747.73509974/v1
http://asianssr.org/index.php/ajct/article/view/1317
https://elartu.tntu.edu.ua/handle/lib/42599
https://ieeexplore.ieee.org/abstract/document/10176816/
https://ieeexplore.ieee.org/abstract/document/10458910
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4435954
https://patents.google.com/patent/US11681529B2/en
https://patents.google.com/patent/US11829862B2/en
https://repositorio.unal.edu.co/handle/unal/84550
http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2107046
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4399163
https://patents.google.com/patent/US11775313B2/en
https://jilindaxuexuebao.org/dashboard/uploads/8.10016126.pdf
https://drpress.org/ojs/index.php/HSET/article/view/6544
https://www.researching.cn/ArticlePdf/m00002/2023/60/8/0811010.pdf
https://search.ebscohost.com/login.aspx?direct=true&profile=ehost&scope=site&authtype=crawler&jrnl=10042954&AN=172904540&h=K8%2F2hsy2EdlM9qTMyx6NMB2LevnxdgOTy3vgJ3%2FxBwJwZOP%2Buv%2FzcjvfdqXmO555xyP3P2VtluotDFnlTc1lGQ%3D%3D&crl=c
https://www.mlsoft.in/jespublication.com/upload/2023-V14I111.pdf
https://www.jcad.cn/en/article/id/7e14653a-57fd-4c28-97fe-b881ba06a856
https://dl.ccf.org.cn/article/articleDetail.html?type=qkwz&_ack=1&id=6335372890540032
https://archiv.ub.uni-heidelberg.de/volltextserver/32994/
https://dspace.lib.ntua.gr/xmlui/bitstream/handle/123456789/56672/dimploma_thesis_Strakosi_Lazaros.pdf?sequence=1
https://www.jcad.cn/cn/article/pdf/preview/10.3724/SP.J.1089.2023.19439.pdf
https://patents.google.com/patent/US11763153B2/en
https://patents.google.com/patent/US20230146689A1/en
https://patents.google.com/patent/US11816480B2/en
https://patents.google.com/patent/US11836497B2/en
https://patents.google.com/patent/US11816045B2/en
https://patents.google.com/patent/US11816572B2/en
https://patents.google.com/patent/US11703939B2/en
https://patents.google.com/patent/US11762690B2/en
https://patents.google.com/patent/US11797830B2/en
https://patents.google.com/patent/US11755683B2/en
https://patents.google.com/patent/US11676029B2/en
https://patents.google.com/patent/US11676028B2/en
https://patents.google.com/patent/US11636173B2/en
https://patents.google.com/patent/US11709672B2/en
https://patents.google.com/patent/US11704125B2/en
https://patents.google.com/patent/US11740898B2/en
https://patents.google.com/patent/US11663002B2/en
from typing import Type
from urllib.parse import urlparse

from pdfDownloader.downloader import (
    PaperDownloader,
    IEEEExploreDownloader,
    ScienceDirectDownloader,
    ProQuestDownloader,
    ArxivDownloader,
    SpringerDownloader,
    acmDownloader,
    mdpiDownloader,
    patentDownloader
)


class DownloaderFactory:
    @staticmethod
    def get_downloader(url: str) -> Type[PaperDownloader]:
        """Map a URL's domain to the downloader class that handles it."""
        domain = urlparse(url).netloc
        if "patents.google.com" in domain:
            return patentDownloader
        elif "arxiv.org" in domain:
            return ArxivDownloader
        elif "www.mdpi.com" in domain:
            return mdpiDownloader
        elif "dl.acm.org" in domain:
            return acmDownloader
        elif "link.springer.com" in domain:
            return SpringerDownloader
        elif "ieeexplore.ieee.org" in domain:
            return IEEEExploreDownloader
        elif "search.proquest.com" in domain:
            return ProQuestDownloader
        elif "sciencedirect.com" in domain:
            return ScienceDirectDownloader
        else:
            raise ValueError(f"Unsupported website: {domain}")
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

from config import CHROME_USER_OPTION_PATH, DOWNLOAD_DIR


class ChromeDriver:
    """Singleton wrapper around a Chrome WebDriver that reuses the local profile."""

    _instance = None
    download_dir = DOWNLOAD_DIR

    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super(ChromeDriver, cls).__new__(cls)
            cls._instance._initialized = False
        return cls._instance

    def __init__(self, start_fullscreen=True, user_agent=None):
        if not self._initialized:
            driver_path = ChromeDriverManager().install()
            service = Service(executable_path=driver_path)
            options = webdriver.ChromeOptions()
            # Reuse the real Chrome profile so logins and cookies carry over.
            options.add_argument(f"--user-data-dir={CHROME_USER_OPTION_PATH}")
            options.add_experimental_option(
                "prefs",
                {
                    "download.default_directory": self.download_dir,
                    "download.prompt_for_download": False,
                    "download.directory_upgrade": True,
                    # Download PDFs instead of opening them in the viewer.
                    "plugins.always_open_pdf_externally": True,
                },
            )
            # if start_fullscreen:
            #     options.add_argument("--start-fullscreen")
            # if user_agent:
            #     options.add_argument(f"user-agent={user_agent}")
            self.driver = webdriver.Chrome(service=service, options=options)
            self._initialized = True

    def get_driver(self):
        return self.driver
from downloadFactory import DownloaderFactory
from driver import ChromeDriver
import csv
import os
import pandas as pd


def write_to_csv(file_name, keys, detail, is_modified=False):
    """Append one record to a CSV file, writing the header on first use."""
    with open(file_name, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=keys)
        if csvfile.tell() == 0:  # empty file -> write the header first
            writer.writeheader()
        if is_modified:
            # Flatten list fields into single cells for the "modified" output.
            if isinstance(detail.get("authors"), list):
                detail["authors"] = "; ".join(detail["authors"])
            if isinstance(detail.get("institutions"), list):
                detail["institutions"] = ";\n".join(detail["institutions"])
        writer.writerow(detail)


def process_txt_files_in_directory(input_dir):
    """List all txt files in the given directory."""
    file_path_list = []
    for file_name in os.listdir(input_dir):
        if file_name.endswith(".txt"):
            file_path_list.append((input_dir, file_name))
    return file_path_list


def update_xlsx(file_name, row_index, author_status, pdf_status):
    """Mark a row's author/pdf flags as done (1) so interrupted runs can resume."""
    df = pd.read_excel(file_name)
    if author_status:
        df.at[row_index, 'author'] = 1
    if pdf_status:
        df.at[row_index, 'pdf'] = 1
    df.to_excel(file_name, index=False)


if __name__ == "__main__":
    driver = ChromeDriver()
    input_file = "data/input.xlsx"  # the input Excel file
    supported_pdf_sites = ["mdpi", "acm", "springer", "ieee", "proquest", "patents"]
    df = pd.read_excel(input_file)
    for idx, row in df.iterrows():
        url = row['url']
        author_status = row['author']
        pdf_status = row['pdf']
        detail = {}
        # Skip rows that are fully processed already
        if author_status == 1 and pdf_status == 1:
            continue
        # 1. Get author, institution, and journal info
        try:
            if url.endswith(".pdf"):
                print(f"Skipping PDF file: {url}")
                continue
            print(f"Crawling {url}.....................")
            downloaderClass = DownloaderFactory.get_downloader(url)
            downloader = downloaderClass(driver)
            updated_author_status = author_status
            updated_pdf_status = pdf_status
            if author_status != 1:
                detail = downloader.get_author_institution_journal(url)
                if detail.get("title"):
                    updated_author_status = 1
        except Exception as e:
            print(e)
            continue
        # 2. Download the PDF if it is not already done
        try:
            if any(key in url for key in supported_pdf_sites) and updated_pdf_status != 1:
                state = downloader.download(url, str(idx + 1))
                updated_pdf_status = 1 if state else 0
        except Exception as e:
            print(e)
            continue
        # Update the Excel file for this URL
        update_xlsx(input_file, idx, updated_author_status, updated_pdf_status)
        if detail:
            keys = detail.keys()
            # Write each record to CSV as it is processed
            write_to_csv("output.csv", keys, detail)
            write_to_csv("modified_output.csv", keys, detail, is_modified=True)
        print(f"Crawling {url} done.")
    # Quit the driver after all rows are processed
    driver.get_driver().quit()
from abc import ABC, abstractmethod
import os
import shutil
import time

from config import SAVED_DIR, DOWNLOAD_DIR


class PaperDownloader(ABC):
    def __init__(self, driver_manager):
        """Initialize the paper downloader with a driver manager."""
        self.driver_manager = driver_manager
        self.download_dir = DOWNLOAD_DIR
        self.saved_dir = SAVED_DIR
        print("Driver Manager initialized.")

    def download(self, url: str):
        """Optional per-site override; the base implementation does nothing."""
        pass

    @abstractmethod
    def get_author_institution_journal(self, url: str):
        pass

    @staticmethod
    def clear_download_dir(download_dir):
        """Remove leftover PDFs and partial downloads from the temp directory."""
        for f in os.listdir(download_dir):
            if f.endswith('.crdownload') or f.endswith('.pdf'):
                os.remove(os.path.join(download_dir, f))

    @staticmethod
    def wait_for_pdf_download(download_dir, timeout=120, check_interval=1, stable_times=3):
        """Poll the download dir until the newest PDF's size is stable, or time out."""
        start_time = time.time()
        file_path = None
        last_size = -1
        stable_count = 0
        while True:
            files = [f for f in os.listdir(download_dir)
                     if f.endswith('.pdf') or f.endswith('.pdf.crdownload')]
            if files:
                files = sorted(files, key=lambda x: os.path.getmtime(os.path.join(download_dir, x)), reverse=True)
                file_path = os.path.join(download_dir, files[0])
                try:
                    curr_size = os.path.getsize(file_path)
                except Exception:
                    curr_size = -1
                # The download is considered done once the size stops changing.
                if curr_size == last_size and curr_size > 0:
                    stable_count += 1
                else:
                    stable_count = 0
                last_size = curr_size
                if stable_count >= stable_times:
                    if file_path.endswith('.crdownload'):
                        final_pdf = file_path[:-len('.crdownload')]
                        if os.path.exists(final_pdf):
                            file_path = final_pdf
                        else:
                            file_path = None
                    break
            else:
                file_path = None
            if time.time() - start_time > timeout:
                print("Timed out waiting for the download")
                file_path = None
                break
            time.sleep(check_interval)
        if file_path and file_path.endswith('.pdf') and os.path.exists(file_path):
            return file_path
        else:
            return None

    def trigger_and_save_pdf(self, driver, pdf_url, download_dir, saved_dir, paper_idx, wait_sec=15):
        # Clear out the download directory
        self.clear_download_dir(download_dir)
        # Navigate to the PDF URL to trigger the download
        try:
            driver.get(pdf_url)
        except Exception as e:
            print(f"Failed to open the PDF link: {e}")
            return False
        # TODO: replace the fixed sleep with polling via self.wait_for_pdf_download
        time.sleep(wait_sec)
        # Look for the downloaded PDF
        files = [f for f in os.listdir(download_dir) if f.endswith('.pdf')]
        if not files:
            print("Download timed out or failed")
            return False
        file_path = os.path.join(download_dir, files[0])
        file_name = files[0]
        # Rename the file and move it into the archive directory
        save_path = os.path.join(saved_dir, f"{paper_idx}-{file_name}")
        try:
            shutil.move(file_path, save_path)
            print("[✓] PDF saved:", save_path)
            return True
        except Exception as e:
            print("Failed to rename or save:", e)
            return False