Commit 345820f1 by qilei

qilei:add pdf author crawler

parent 1e5d8dd4
# PDF Crawler for Academic Websites
This tool batch-crawls PDF files and related metadata for papers hosted on major academic websites. It supports resuming interrupted runs and can collect fields such as authors, journal, and institution (some fields depend on the accuracy of the site's data and should be manually re-checked). **Currently supported sites:**
* mdpi
* acm
* springer
* ieee
* proquest
* patents
## Requirements
**1. Install dependencies**
```bash
pip install selenium pandas webdriver-manager openpyxl
```
(`webdriver-manager` fetches a matching chromedriver automatically; `openpyxl` is the engine pandas uses to read and write `.xlsx` files.)
**2. Install the [Chrome browser](https://www.google.com/chrome/)**
---
## Quick Start
### Step 1. Configure config.py
Open `config.py` in the project directory and set the following three paths to absolute paths on your machine:
```python
CHROME_USER_OPTION_PATH = "C:\\Users\\YourUsername\\AppData\\Local\\Google\\Chrome\\User Data"
DOWNLOAD_DIR = "D:\\your\\download\\dir\\pdf_tmp"
SAVED_DIR = "D:\\your\\save\\dir\\pdf_saved"
```
#### Path reference
* **CHROME_USER_OPTION_PATH**
Chrome user-data directory. Selenium launches Chrome with your real, already-logged-in profile (cookies and extensions included), so downloads do not require logging in on every run. Enter `chrome://version/` in Chrome's address bar to find the local "Profile Path".
* **DOWNLOAD_DIR**
Temporary download directory. PDFs are downloaded here first; make sure it is writable.
* **SAVED_DIR**
Final archive directory. Completed (and failed) downloads are moved and organized into this directory for archiving.
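Before the first run, a quick sanity check can catch path typos early. A minimal sketch (the helper name `check_config_paths` is illustrative, not part of the project):

```python
import os

def check_config_paths(chrome_profile, download_dir, saved_dir):
    """Return a list of path problems; an empty list means the config looks usable."""
    problems = []
    # The Chrome profile must already exist (it is created by Chrome itself).
    if not os.path.isdir(chrome_profile):
        problems.append(f"Chrome profile not found: {chrome_profile}")
    for name, d in (("DOWNLOAD_DIR", download_dir), ("SAVED_DIR", saved_dir)):
        os.makedirs(d, exist_ok=True)   # create the directory if it is missing
        if not os.access(d, os.W_OK):   # PDFs must be writable here
            problems.append(f"{name} is not writable: {d}")
    return problems
```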
---
### Step 2. Configure the crawl links in input.xlsx
Create `data/input.xlsx` with the following layout:
| url | author | pdf |
| -------------------------------------------- | ------ | --- |
| [https://xx.xx/xxxx](https://xx.xx/xxxx.pdf) | 0      | 0   |
**Notes**
* First column `url`: the page or PDF link you want to crawl.
* Initialize every other column to 0; the program sets a cell to 1 once that resource has been processed/crawled, which enables resuming and progress tracking.
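For a long link list, the sheet can be bootstrapped with pandas instead of hand-editing (a sketch; `make_input_sheet` is an illustrative helper, not part of the project, and writing `.xlsx` requires `openpyxl`):

```python
import pandas as pd

def make_input_sheet(urls, path=None):
    """Build the sheet main.py expects: one url per row, all status flags start at 0."""
    df = pd.DataFrame({"url": list(urls), "author": 0, "pdf": 0})
    if path:
        df.to_excel(path, index=False)  # e.g. path="data/input.xlsx"
    return df
```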
---
### Step 3. Log in with Chrome
**Note: some academic sites require a login before papers can be downloaded (especially institution-licensed PDF access). To make sure downloads work, complete the following steps in advance:**
1. **Start the Chrome browser.**
2. **Log in to the sites you plan to crawl and authenticate your institutional access. Common sites include:**
* mdpi
* acm
* springer
* ieee
* sciencedirect
* ...
3. **Confirm the login is still valid (you can download a PDF directly without a login page or captcha appearing).**
---
### Step 4. Close all Chrome windows
> **You must close every Chrome instance** (including background services); otherwise Selenium and the local Chrome will contend for the profile, which can cause startup failures or lost cookies.
---
### Step 5. Run the crawler
```bash
python main.py
```
---
## Features
1. **Resumable runs**: re-running never re-downloads papers that were already crawled; work can continue across multiple launches.
2. **PDF download**: supports common academic and patent sites: `["mdpi", "acm", "springer", "ieee", "proquest", "patents"]`.
3. **Metadata collection**: automatically extracts the paper's authors, journal name, and (on some sites) institution; results are saved to `modified_output.csv`.
* Note: institution scraping only works on some sites, and e.g. IEEE yields only an author's current affiliation, which may differ from the affiliation printed on the paper.
4. **Manual review recommended**: automatically crawled fields are for reference only and must be verified by hand.
---
### Notes
* Institution fields from some sites carry real uncertainty; treat them as auxiliary statistics only.
* **Manually re-check all results to ensure the data is accurate.**
* If the chromedriver download fails on the first run, check your proxy settings.
---
## Field support by site
| Site | Authors | Journal | Institution |
| --------------------------------------------------- | ------- | ------- | ------------------------------------------------ |
| ieeexplore.ieee.org | √ | √ | √ (author's current affiliation from their profile page; may differ from the paper) |
| dl.acm.org | √ | √ | |
| arxiv.org | √ | √ | |
| [www.mdpi.com](http://www.mdpi.com/) | √ | √ | √ |
| link.springer.com | √ | √ | |
| patents.google.com (patents) | √ | √ | √ |
| [www.sciencedirect.com](http://www.sciencedirect.com/) | √ | √ | √ |
CHROME_USER_OPTION_PATH = "C:\\Users\\26224\\AppData\\Local\\Google\\Chrome\\User Data"
DOWNLOAD_DIR = "D:\\Project\\杂项\\202505引用统计\\tmp"
SAVED_DIR = "D:\\Project\\杂项\\202505引用统计\\saved"
https://search.proquest.com/openview/29c6f7968a99bbc9b481d85968ed494b/1?pq-origsite=gscholar&cbl=18750&diss=y
https://search.proquest.com/openview/4ab8b9b7b6c338ab4729cc0a11279c45/1?pq-origsite=gscholar&cbl=18750&diss=y
https://search.proquest.com/openview/3a29614b2f0e4addb4b59b10af0d8f2b/1?pq-origsite=gscholar&cbl=18750&diss=y
https://search.proquest.com/openview/19ab6db3db1a1658316f5eb66aa4bc51/1?pq-origsite=gscholar&cbl=18750&diss=y
https://search.proquest.com/openview/16d4ea03d4889fbf706e65ec5534151b/1?pq-origsite=gscholar&cbl=18750&diss=y
https://ieeexplore.ieee.org/abstract/document/10487089/
https://link.springer.com/article/10.1007/s11432-021-3596-x
https://ieeexplore.ieee.org/abstract/document/10247776/
https://arxiv.org/abs/2303.12397
https://dl.acm.org/doi/abs/10.1145/3600092
https://arxiv.org/abs/2312.06086
https://ieeexplore.ieee.org/abstract/document/10173478/
https://ieeexplore.ieee.org/abstract/document/10085688/
https://ieeexplore.ieee.org/abstract/document/10590081/
https://ieeexplore.ieee.org/abstract/document/10476057/
https://ieeexplore.ieee.org/abstract/document/10233885/
https://www.sciencedirect.com/science/article/pii/S0141933123000224
https://www.mdpi.com/2079-9268/13/1/5
https://ieeexplore.ieee.org/abstract/document/10265716/
https://patents.google.com/patent/US11675676B2/en
https://link.springer.com/chapter/10.1007/978-3-031-19568-6_10
https://patents.google.com/patent/US11630997B2/en
https://ieeexplore.ieee.org/abstract/document/10317994/
https://ieeexplore.ieee.org/abstract/document/10114404/
https://link.springer.com/chapter/10.1007/978-3-031-42478-6_6
https://link.springer.com/chapter/10.1007/978-3-031-19568-6_6
https://arxiv.org/abs/2302.09564
https://www.jstage.jst.go.jp/article/elex/20/21/20_20.20230379/_article/-char/ja/
https://ietresearch.onlinelibrary.wiley.com/doi/abs/10.1049/cdt2.12060
https://link.springer.com/chapter/10.1007/978-3-031-19568-6_13
https://link.springer.com/chapter/10.1007/978-3-031-19568-6_12
https://era.library.ualberta.ca/items/bf8101dd-1663-489b-9b6a-51c5ab62f6f5
https://search.proquest.com/openview/fbc658c5123012204f7d0c6ec839cbc0/1?pq-origsite=gscholar&cbl=18750&diss=y
https://ieeexplore.ieee.org/abstract/document/10365973/
https://ieeexplore.ieee.org/abstract/document/10130257/
https://patents.google.com/patent/US11609760B2/en
https://ieeexplore.ieee.org/abstract/document/10078001/
https://search.proquest.com/openview/ad3dbd625f5a6efc7cd59bb46892bbfe/1?pq-origsite=gscholar&cbl=18750&diss=y
https://ieeexplore.ieee.org/abstract/document/10168637/
https://link.springer.com/chapter/10.1007/978-3-031-19568-6_1
https://link.springer.com/chapter/10.1007/978-3-031-39932-9_18
https://link.springer.com/chapter/10.1007/978-3-031-42785-5_9
https://link.springer.com/article/10.1007/s11390-021-1161-y
https://www.worldscientific.com/doi/abs/10.1142/S0218126623502183
https://theses.hal.science/tel-04561235/
https://www.ideals.illinois.edu/items/127346
https://search.proquest.com/openview/8936ef6cce004c2ca50f00606cc46237/1?pq-origsite=gscholar&cbl=18750&diss=y
https://ieeexplore.ieee.org/abstract/document/10466951/
https://ieeexplore.ieee.org/abstract/document/10168589/
https://search.proquest.com/openview/f1dc7b70d091857ef09a7b30cca06c14/1?pq-origsite=gscholar&cbl=18750&diss=y
https://patents.google.com/patent/US11582481B2/en
https://link.springer.com/chapter/10.1007/978-981-99-2897-2_9
https://search.proquest.com/openview/6727545e290befd7c91973960744f5d7/1?pq-origsite=gscholar&cbl=18750&diss=y
https://patents.google.com/patent/US11675624B2/en
https://search.proquest.com/openview/d9d75aa1648608f2c5335c4fe8e9d2f1/1?pq-origsite=gscholar&cbl=18750&diss=y
https://link.springer.com/chapter/10.1007/978-3-031-29970-4_5
https://search.proquest.com/openview/b25dc9caa6d074de23a3ce113e7f8f0e/1?pq-origsite=gscholar&cbl=18750&diss=y
https://drpress.org/ojs/index.php/HSET/article/view/15880
https://advance.sagepub.com/doi/full/10.36227/techrxiv.170326747.73509974/v1
http://asianssr.org/index.php/ajct/article/view/1317
https://elartu.tntu.edu.ua/handle/lib/42599
https://ieeexplore.ieee.org/abstract/document/10176816/
https://ieeexplore.ieee.org/abstract/document/10458910
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4435954
https://patents.google.com/patent/US11681529B2/en
https://patents.google.com/patent/US11829862B2/en
https://repositorio.unal.edu.co/handle/unal/84550
http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2107046
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4399163
https://patents.google.com/patent/US11775313B2/en
https://jilindaxuexuebao.org/dashboard/uploads/8.10016126.pdf
https://drpress.org/ojs/index.php/HSET/article/view/6544
https://www.researching.cn/ArticlePdf/m00002/2023/60/8/0811010.pdf
https://search.ebscohost.com/login.aspx?direct=true&profile=ehost&scope=site&authtype=crawler&jrnl=10042954&AN=172904540&h=K8%2F2hsy2EdlM9qTMyx6NMB2LevnxdgOTy3vgJ3%2FxBwJwZOP%2Buv%2FzcjvfdqXmO555xyP3P2VtluotDFnlTc1lGQ%3D%3D&crl=c
https://www.mlsoft.in/jespublication.com/upload/2023-V14I111.pdf
https://www.jcad.cn/en/article/id/7e14653a-57fd-4c28-97fe-b881ba06a856
https://dl.ccf.org.cn/article/articleDetail.html?type=qkwz&_ack=1&id=6335372890540032
https://archiv.ub.uni-heidelberg.de/volltextserver/32994/
https://dspace.lib.ntua.gr/xmlui/bitstream/handle/123456789/56672/dimploma_thesis_Strakosi_Lazaros.pdf?sequence=1
https://www.jcad.cn/cn/article/pdf/preview/10.3724/SP.J.1089.2023.19439.pdf
https://patents.google.com/patent/US11763153B2/en
https://patents.google.com/patent/US20230146689A1/en
https://patents.google.com/patent/US11816480B2/en
https://patents.google.com/patent/US11836497B2/en
https://patents.google.com/patent/US11816045B2/en
https://patents.google.com/patent/US11816572B2/en
https://patents.google.com/patent/US11703939B2/en
https://patents.google.com/patent/US11762690B2/en
https://patents.google.com/patent/US11797830B2/en
https://patents.google.com/patent/US11755683B2/en
https://patents.google.com/patent/US11676029B2/en
https://patents.google.com/patent/US11676028B2/en
https://patents.google.com/patent/US11636173B2/en
https://patents.google.com/patent/US11709672B2/en
https://patents.google.com/patent/US11704125B2/en
https://patents.google.com/patent/US11740898B2/en
https://patents.google.com/patent/US11663002B2/en
from typing import Type
from urllib.parse import urlparse

from pdfDownloader.downloader import (
    PaperDownloader,
    IEEEExploreDownloader,
    ScienceDirectDownloader,
    ProQuestDownloader,
    ArxivDownloader,
    SpringerDownloader,
    acmDownloader,
    mdpiDownloader,
    patentDownloader
)


class DownloaderFactory:
    @staticmethod
    def get_downloader(url: str) -> Type[PaperDownloader]:
        """Map a URL's domain to the downloader class that handles it."""
        domain = urlparse(url).netloc
        if "patents.google.com" in domain:
            return patentDownloader
        elif "arxiv.org" in domain:
            return ArxivDownloader
        elif "www.mdpi.com" in domain:
            return mdpiDownloader
        elif "dl.acm.org" in domain:
            return acmDownloader
        elif "link.springer.com" in domain:
            return SpringerDownloader
        elif "ieeexplore.ieee.org" in domain:
            return IEEEExploreDownloader
        elif "search.proquest.com" in domain:
            return ProQuestDownloader
        elif "sciencedirect.com" in domain:
            return ScienceDirectDownloader
        else:
            raise ValueError(f"Unsupported website: {domain}")
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

from config import CHROME_USER_OPTION_PATH, DOWNLOAD_DIR


class ChromeDriver:
    """Singleton wrapper around a Chrome WebDriver that reuses the local profile."""

    _instance = None
    download_dir = DOWNLOAD_DIR

    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super(ChromeDriver, cls).__new__(cls)
            cls._instance._initialized = False
        return cls._instance

    def __init__(self, start_fullscreen=True, user_agent=None):
        if not self._initialized:
            driver_path = ChromeDriverManager().install()
            service = Service(executable_path=driver_path)
            options = webdriver.ChromeOptions()
            # Reuse the real Chrome profile so logins and cookies carry over.
            options.add_argument(f"--user-data-dir={CHROME_USER_OPTION_PATH}")
            options.add_experimental_option(
                "prefs",
                {
                    "download.default_directory": self.download_dir,
                    "download.prompt_for_download": False,
                    "download.directory_upgrade": True,
                    # Download PDFs instead of opening them in the viewer.
                    "plugins.always_open_pdf_externally": True,
                },
            )
            # if start_fullscreen:
            #     options.add_argument("--start-fullscreen")
            # if user_agent:
            #     options.add_argument(f"user-agent={user_agent}")
            self.driver = webdriver.Chrome(service=service, options=options)
            self._initialized = True

    def get_driver(self):
        return self.driver
from downloadFactory import DownloaderFactory
from driver import ChromeDriver
import csv
import os
import pandas as pd


def write_to_csv(file_name, keys, detail, is_modified=False):
    """Append one record to a CSV file, writing the header on first use."""
    with open(file_name, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=keys)
        if csvfile.tell() == 0:  # empty file -> write the header first
            writer.writeheader()
        if is_modified:
            # Flatten list fields into single cells for the "modified" output.
            if isinstance(detail.get("authors"), list):
                detail["authors"] = "; ".join(detail["authors"])
            if isinstance(detail.get("institutions"), list):
                detail["institutions"] = ";\n".join(detail["institutions"])
        writer.writerow(detail)


def process_txt_files_in_directory(input_dir):
    """List all txt files in the given directory."""
    file_path_list = []
    for file_name in os.listdir(input_dir):
        if file_name.endswith(".txt"):
            file_path_list.append((input_dir, file_name))
    return file_path_list


def update_xlsx(file_name, row_index, author_status, pdf_status):
    """Mark a row's author/pdf flags as done (1) so interrupted runs can resume."""
    df = pd.read_excel(file_name)
    if author_status:
        df.at[row_index, 'author'] = 1
    if pdf_status:
        df.at[row_index, 'pdf'] = 1
    df.to_excel(file_name, index=False)


if __name__ == "__main__":
    driver = ChromeDriver()
    input_file = "data/input.xlsx"  # the input Excel file
    supported_pdf_sites = ["mdpi", "acm", "springer", "ieee", "proquest", "patents"]
    df = pd.read_excel(input_file)
    for idx, row in df.iterrows():
        url = row['url']
        author_status = row['author']
        pdf_status = row['pdf']
        detail = {}
        # Skip rows that are fully processed already
        if author_status == 1 and pdf_status == 1:
            continue
        # 1. Get author, institution, and journal info
        try:
            if url.endswith(".pdf"):
                print(f"Skipping PDF file: {url}")
                continue
            print(f"Crawling {url}.....................")
            downloaderClass = DownloaderFactory.get_downloader(url)
            downloader = downloaderClass(driver)
            updated_author_status = author_status
            updated_pdf_status = pdf_status
            if author_status != 1:
                detail = downloader.get_author_institution_journal(url)
                if detail.get("title"):
                    updated_author_status = 1
        except Exception as e:
            print(e)
            continue
        # 2. Download the PDF if it is not already done
        try:
            if any(key in url for key in supported_pdf_sites) and updated_pdf_status != 1:
                state = downloader.download(url, str(idx + 1))
                updated_pdf_status = 1 if state else 0
        except Exception as e:
            print(e)
            continue
        # Update the Excel file for this URL
        update_xlsx(input_file, idx, updated_author_status, updated_pdf_status)
        if detail:
            keys = detail.keys()
            # Write each record to CSV as it is processed
            write_to_csv("output.csv", keys, detail)
            write_to_csv("modified_output.csv", keys, detail, is_modified=True)
        print(f"Crawling {url} done.")
    # Quit the driver after all rows are processed
    driver.get_driver().quit()
from abc import ABC, abstractmethod
import os
import shutil
import time

from config import SAVED_DIR, DOWNLOAD_DIR


class PaperDownloader(ABC):
    def __init__(self, driver_manager):
        """Initialize the paper downloader with a driver manager."""
        self.driver_manager = driver_manager
        self.download_dir = DOWNLOAD_DIR
        self.saved_dir = SAVED_DIR
        print("Driver Manager initialized.")

    def download(self, url: str):
        """Optional per-site override; the base implementation does nothing."""
        pass

    @abstractmethod
    def get_author_institution_journal(self, url: str):
        pass

    @staticmethod
    def clear_download_dir(download_dir):
        """Remove leftover PDFs and partial downloads from the temp directory."""
        for f in os.listdir(download_dir):
            if f.endswith('.crdownload') or f.endswith('.pdf'):
                os.remove(os.path.join(download_dir, f))

    @staticmethod
    def wait_for_pdf_download(download_dir, timeout=120, check_interval=1, stable_times=3):
        """Poll the download dir until the newest PDF's size is stable, or time out."""
        start_time = time.time()
        file_path = None
        last_size = -1
        stable_count = 0
        while True:
            files = [f for f in os.listdir(download_dir)
                     if f.endswith('.pdf') or f.endswith('.pdf.crdownload')]
            if files:
                files = sorted(files, key=lambda x: os.path.getmtime(os.path.join(download_dir, x)), reverse=True)
                file_path = os.path.join(download_dir, files[0])
                try:
                    curr_size = os.path.getsize(file_path)
                except Exception:
                    curr_size = -1
                # The download is considered done once the size stops changing.
                if curr_size == last_size and curr_size > 0:
                    stable_count += 1
                else:
                    stable_count = 0
                last_size = curr_size
                if stable_count >= stable_times:
                    if file_path.endswith('.crdownload'):
                        final_pdf = file_path[:-len('.crdownload')]
                        if os.path.exists(final_pdf):
                            file_path = final_pdf
                        else:
                            file_path = None
                    break
            else:
                file_path = None
            if time.time() - start_time > timeout:
                print("Timed out waiting for the download")
                file_path = None
                break
            time.sleep(check_interval)
        if file_path and file_path.endswith('.pdf') and os.path.exists(file_path):
            return file_path
        else:
            return None

    def trigger_and_save_pdf(self, driver, pdf_url, download_dir, saved_dir, paper_idx, wait_sec=15):
        # Clear out the download directory
        self.clear_download_dir(download_dir)
        # Navigate to the PDF URL to trigger the download
        try:
            driver.get(pdf_url)
        except Exception as e:
            print(f"Failed to open the PDF link: {e}")
            return False
        # TODO: replace the fixed sleep with polling via self.wait_for_pdf_download
        time.sleep(wait_sec)
        # Look for the downloaded PDF
        files = [f for f in os.listdir(download_dir) if f.endswith('.pdf')]
        if not files:
            print("Download timed out or failed")
            return False
        file_path = os.path.join(download_dir, files[0])
        file_name = files[0]
        # Rename the file and move it into the archive directory
        save_path = os.path.join(saved_dir, f"{paper_idx}-{file_name}")
        try:
            shutil.move(file_path, save_path)
            print("[✓] PDF saved:", save_path)
            return True
        except Exception as e:
            print("Failed to rename or save:", e)
            return False