extract journal and session

Commit d6a62c92 authored May 08, 2025 by jiangdongchen
13 changed files
--- a/README.md
+++ b/README.md
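 # Environment setup
 - Make sure the Python scripts are run with the papertools repository folder as the working directory (cwd)
 - All paths and parameters are configured in config.json
+    - api_key
+        - The current key was paid for by Dongchen with money he earned from advertising on Zhihu and has only 100 yuan of credit, so please use your own key whenever possible
+        - If you use a key for a different API, remember to change how the OpenAI client is called; SiliconFlow (硅基流动) is recommended here because that is what I used to get it working
+    - base_url
+        - URL of the API endpoint
+    - pdf_dir
+        - Folder containing the paper PDFs
+    - result_dir
+        - Folder for the output JSON files of key information
+    - source_excel_path
+        - The Excel workbook to be checked
+        - Actual entries start at row 4
+        - Column 1: index
+        - Column 3: paper title
+        - Column 7: paper authors
+    - target_excel_path
+        - The formatted output workbook
    - logLevel
        - 10 means DEBUG level
        - 20 means INFO level
-    - tableNum: number of worksheets to process
+    - sheetNum: number of worksheets to process
    - maxItem: maximum number of entries per worksheet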
- python3.12.10
+- python3.12
 - Install any libraries that fail to import one by one with pip install
    - `openai`, `pypdf`
    - `python-Levenshtein`
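- The current key was paid for by Dongchen with money he earned from advertising on Zhihu and has only 100 yuan of credit, so please use your own key whenever possible
- If you use a key for a different API, remember to change how the OpenAI client is called; SiliconFlow (硅基流动) is recommended here because that is what I used to get it working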

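 # Usage
 - Check config.json and set the parameters correctly so the program can find the files and parameters it needs
+    - Default configuration
+        - Paper PDFs are grouped by sheet under Papers/sheetname
+        - The Excel workbook to be checked goes in the others folder
+        - The output workbook goes in the target folder; PDFs are renamed in place to their standardized names
 - Run the program with python main.py
 - Do not open the target Excel file while the program is running, otherwise contention for file access will cause an error
 - Multi-model cross-validation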
@@ -28,16 +47,20 @@
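    2. Output the entries that could not be downloaded
 2. Automated information extraction and formatting
    1. Read the configuration object from config.json
-    2. Iterate over the worksheets of the Excel workbook
-        1. Read the paper titles and indices from the Excel sheet
-        2. Loop:
-            1. Read the paper title and key information from the PDF and store them in the json folder
-            2. Fuzzy-match against the paper titles in the Excel sheet
-            3. On a successful match
-                1. Use the paper title from the PDF and the index to rename the PDF file and the paper title in the Excel sheet to a standardized form
-                2. Write the key information from the PDF into the Excel sheet, including author names, institutions, and countries
-            4. On a failed match, output the unmatched entry
-                1. Log unmatched entries with warning so they can be handled later
+    2. **Iterate** over the sheets of the Excel workbook
+        1. **Iterate** over the paper titles and indices in the sheet
+            1. Use the LLM to read the paper title and key information from the first page of the PDF and store them in the json folder
+            2. **Iterate** over the paper titles in the Excel sheet and fuzzy-match against them
+                1. On a successful match
+                    1. Use the paper title from the PDF and the index to rename the PDF file and the paper title in the Excel sheet to a standardized form
+                    2. Write the key information from the PDF into the Excel sheet, including
+                        - Title
+                        - Conference name
+                        - Author names
+                        - Institutions
+                        - Countries
+                2. On a failed match, output the unmatched entry
+                    1. Log unmatched entries with warning so they can be handled later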

 # Code structure
 1. Library functions live in the psrc folder
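
The fuzzy-matching step in the README workflow above can be sketched as follows. This is a minimal illustration, not the code in `psrc/citationProcess.py`: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzzywuzzy`'s `fuzz` so that it runs without extra dependencies, it picks the best-scoring title rather than the first one over the threshold, and the names `similarity` and `best_match` are hypothetical. The threshold of 85 is the one that appears in `citationProcess.py`.

```python
import logging
from difflib import SequenceMatcher
from typing import Dict, Optional

logger = logging.getLogger(__name__)

# Same cut-off as the `similarity >= 85` check in citationProcess.py.
MATCH_THRESHOLD = 85


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (stdlib stand-in for a fuzzywuzzy ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(pdf_title: str, excel_titles: Dict[int, str]) -> Optional[int]:
    """Return the index of the best-matching Excel title, or None if nothing reaches the threshold."""
    best_idx, best_score = None, 0.0
    for idx, title in excel_titles.items():
        score = similarity(pdf_title, title)
        if score > best_score:
            best_idx, best_score = idx, score
    if best_score >= MATCH_THRESHOLD:
        return best_idx
    # Mirror the README: unmatched entries are logged with warning for later handling.
    logger.warning("No Excel entry matches PDF title %r (best score %.1f)", pdf_title, best_score)
    return None
```

Scores from `SequenceMatcher` are not identical to those from `fuzzywuzzy`, so the threshold may need retuning if the stand-in is ever used in place of the real library.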

--- a/google_scholar_citedby/README.md
+++ b/google_scholar_citedby/README.md
-# crawler
-
-## update
-
-Updates Prof. Chen's publication list `yunjichen.json`; `main.py` later relies on this file to look up citing papers.
-
-## main
-
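-Create `papers.txt` and add the titles of the papers you want to look up, one per line. Entering the full title is not recommended; a partial title is enough, as shown below: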
-
-```
-Reproducing Concurrency Bugs Using Local Clocks
-binary translator with post-optimization
-timing error mitigation for hardware neural networks
-A Polyvalent Machine Learning Accelerator
-```
-
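-URLs that have already been crawled are recorded in `urls.txt` to avoid crawling them again. `main.py` can be run repeatedly; if crawling fails, check your network connection.
-
-### 2. **Run the script**
-
-Make sure the required dependencies (such as `pandas`, `openpyxl`, `tqdm`) are installed in your environment.
-
-Run from the command line: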
-
-```bash
-python main.py
-```
-
-Or **specify a year** (crawl only citations from that year):
-
-```bash
-python main.py --year 2023
-```
-
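-### 3. **View the results**
-
-After crawling, `citations.xlsx` is generated under `results/`, with one sheet per paper containing the citing papers' Title, URL, author information, and so on.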
-
------
-
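-## **Parameters**
-
- `--year` specifies a year and crawls only citations from that year. Without this parameter, citations from **all years** are crawled.
-
-  **Usage examples:**
-
-  - Crawl everything (default): `python main.py`
-  - Crawl only citations from 2022: `python main.py --year 2022`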
\ No newline at end of file
--- a/google_scholar_citedby/main.py
+++ b/google_scholar_citedby/main.py
-import os
-import re
-import string
-import json
-import pandas as pd
-from tqdm import tqdm
-from urllib.parse import urlparse, parse_qs, urlencode, urlunparse
-import time
-import random
-import argparse
-from bs4 import BeautifulSoup
-import requests
-import re
-import string
-
-JSON_FILE = "yunjichen.json"
-PUB_FILE = "papers.txt"
-URL_FILE = "urls.txt"
-DATA_FILE = "citations.xlsx"
-
-
-def get_cited_url_list(citedby_url, year=None):
-    prefix, suffix = citedby_url.split("oi=bibs&hl=en")
-    prefix += "start="
-
-    if year:
-        # Use the year parameter for filtering
-        suffix = f"&hl=en&as_sdt=2005&sciodt=0,5{suffix}&scipsc=&as_ylo={year}&as_yhi={year}&scisbd="
-    else:
-        # No year filter, fetch all
-        suffix = f"&hl=en&as_sdt=2005&sciodt=0,5{suffix}&scipsc="
-
-    for i in range(0, 10000, 10):
-        yield prefix + str(i) + suffix
-
-
-SYMBOL_MORE_AUTHORS = "…"
-
-# Add request headers
-headers = {
-    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
-}
-
-
-def parse_scholar_results(html_content):
-    """
-    Parses HTML content to extract Google Scholar results.
-
-    Args:
-        html_content (str): The HTML content to parse.
-
-    Returns:
-        list: A list of dictionaries containing '标题', '作者', and '期刊信息'.
-    """
-    soup = BeautifulSoup(html_content, "html.parser")
-    results = soup.find_all("div", class_="gs_r")
-    extracted_data = []
-
-    for result in results:
-        # Extract title
-        title_tag = result.find(["h2", "h3"], class_="gs_rt")
-        if title_tag:
-            a_tag = title_tag.find("a")
-            if a_tag:
-                title = a_tag.get_text()
-            else:
-                title = title_tag.get_text()
-            title = title.replace("[CITATION][C] ", "")
-        else:
-            title = "未找到标题"
-
-        # Extract authors and journal information
-        authors_journal_tag = result.find("div", class_="gs_a")
-        if authors_journal_tag:
-            authors_journal_text = authors_journal_tag.get_text()
-            parts = authors_journal_text.split("-")
-            if len(parts) >= 2:
-                authors = parts[0].strip()
-                journal_info = parts[1].split(",")[0].strip()
-            else:
-                authors = "格式不符合预期"
-                journal_info = "格式不符合预期"
-        else:
-            authors = "未找到作者和期刊信息"
-            journal_info = "未找到作者和期刊信息"
-
-        # Add extracted data to list
-        extracted_data.append(
-            {"标题": title, "作者": authors, "期刊信息": journal_info}
-        )
-
-    return extracted_data
-
-
-def parse_html(url):
-    """
-    return:
-    title: str
-    new_url: url to the paper
-    authors: str
-    more_authors: bool, if there are more authors than shown
-    """
-    # print(f"Fetching {url}")
-    results = []
-    session = requests.Session()
-    response = session.get(url, headers=headers)
-    if response.status_code != 200:
-        # print("Failed to get the page")
-        return False, results
-    soup = BeautifulSoup(response.content, "html.parser")
-    papers = soup.find_all("div", class_="gs_r gs_or gs_scl")
-    if len(papers) == 0:
-        return False, results
-    for paper in papers:
-        div_title = paper.find("h3", class_="gs_rt")
-        try:
-            title = div_title.find("a").get_text()
-            new_url = div_title.find("a")["href"]
-        except:
-            # span blocks
-            title = div_title.find_all("span")[-1].get_text()
-            continue
-        authors = paper.find("div", class_="gs_a").get_text().split("-")[0].split(",")
-        authors[-1] = " ".join(authors[-1].split())
-        authors = ";".join(authors)
-        if SYMBOL_MORE_AUTHORS in authors[-1]:
-            results.append((title, new_url, authors, 1))
-        else:
-            results.append((title, new_url, authors, 0))
-    return True, results
-
-
-def add_scisbd_sort(url):
-    """Ensure the URL contains scisbd=1 so results are sorted by date."""
-    parsed_url = urlparse(url)
-    query = parse_qs(parsed_url.query)
-    query["scisbd"] = ["1"]
-    new_query = urlencode(query, doseq=True)
-
-    new_url = urlunparse(
-        (parsed_url.scheme, parsed_url.netloc, parsed_url.path, "", new_query, "")
-    )
-    return new_url
-
-
-def main(publications, year):
-    if not os.path.exists(URL_FILE):
-        with open(URL_FILE, "w") as file:
-            file.write("")
-    publications = [pub.lower() for pub in publications]
-    with open(JSON_FILE, "r") as file:
-        author = json.load(file)
-    author_publications = [
-        pub for pub in author["publications"] if pub["container_type"] == "Publication"
-    ]
-    titles = [pub["bib"]["title"] for pub in author_publications]
-    index_publications = []
-    for i, publication in enumerate(publications):
-        found = False
-        for idx, title in enumerate(titles):
-            if publication in title.lower():
-                found = True
-                index_publications.append(idx)
-                publications[i] = title
-                break
-        if not found:
-            index_publications.append(None)
-
-    for idx, publication in enumerate(publications):
-        with open(URL_FILE, "a+") as file:
-            file.seek(0)
-            know_urls = set([line.strip() for line in file.readlines()])
-        name = "_".join(publication.split())
-        name = re.sub(f"[{string.punctuation}]", "", name[:20])
-        print(f"Processing {publication}...to {name}")
-        columns = [
-            "paper idx",
-            "paper Title",
-            "Cite idx",
-            "Cite Title",
-            "URL",
-            "Authors",
-            "More Authors",
-        ]
-
-        if os.path.exists(DATA_FILE) and name in pd.ExcelFile(DATA_FILE).sheet_names:
-            old_df = pd.read_excel(DATA_FILE, sheet_name=name)
-            data = (
-                old_df
-                if set(old_df.columns) == set(columns)
-                else pd.DataFrame(columns=columns)
-            )
-            current_start_idx = old_df["Cite idx"].max() if not old_df.empty else 0
-        else:
-            data = pd.DataFrame(columns=columns)
-            current_start_idx = 0
-
-        index_publication = index_publications[idx]
-        if index_publication is None:
-            continue
-        citations = author_publications[index_publication]["num_citations"]
-        citedby_url = author_publications[index_publication]["citedby_url"]
-
-        citation_count = current_start_idx
-        for i, url in tqdm(enumerate(get_cited_url_list(citedby_url, year))):
-            # url = add_scisbd_sort(url)
-            if url in know_urls:
-                continue
-            if i * 10 > citations:
-                break
-            # Random delay
-            time.sleep(random.uniform(0.5, 2.0))
-            mark, results = parse_html(url)
-            if not mark:
-                break
-            for rec in results:
-                citation_count += 1
-                title = rec[0] if len(rec) > 0 else ""
-                cite_url = rec[1] if len(rec) > 1 else ""
-                authors = rec[2] if len(rec) > 2 else ""
-                more_authors = rec[3] if len(rec) > 3 else ""
-                row = {
-                    "paper idx": idx + 1,
-                    "paper Title": publication,
-                    "Cite idx": citation_count,
-                    "Cite Title": title,
-                    "URL": cite_url,
-                    "Authors": authors,
-                    "More Authors": more_authors,
-                }
-                data = data._append(row, ignore_index=True)
-            know_urls.add(url)
-
-        if os.path.exists(DATA_FILE):
-            with pd.ExcelWriter(
-                DATA_FILE, mode="a", if_sheet_exists="replace", engine="openpyxl"
-            ) as writer:
-                data.to_excel(writer, sheet_name=name, index=False)
-        else:
-            data.to_excel(DATA_FILE, sheet_name=name, index=False)
-
-        with open(URL_FILE, "w") as file:
-            file.writelines([url + "\n" for url in know_urls])
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(
-        description="Crawl citations with optional year filtering"
-    )
-    parser.add_argument(
-        "--year", type=int, help="The year to filter citations by (optional)"
-    )
-    args = parser.parse_args()
-
-    # Load the publication file
-    with open(PUB_FILE, "r", encoding="utf-8") as file:
-        publications = file.readlines()
-    publications = [pub.strip().lower() for pub in publications]
-    main(publications, args.year)
--- a/google_scholar_citedby/papers.txt
+++ b/google_scholar_citedby/papers.txt
-anNao: A Machine-Learning Supercompute
-
--- a/google_scholar_citedby/update.py
+++ b/google_scholar_citedby/update.py
-from scholarly import scholarly
-from scholarly import ProxyGenerator
-import json
-
-# Activates proxy because Google Scholar otherwise might block the IP address
-pg = ProxyGenerator()
-scholarly.use_proxy(pg, pg)
-
-def main(name):
-    author = next(scholarly.search_author(name))
-    author = scholarly.fill((author),sections = ['publications'])
-    json.dump(author, open(f"{name}.json", "w"), indent=4)
-
-if __name__=="__main__":
-    main("yunji chen")
\ No newline at end of file
--- a/google_scholar_citedby/yunjichen.json
+++ b/google_scholar_citedby/yunjichen.json
--- a/logs/citation_process.log
+++ b/logs/citation_process.log
--- a/main.py
+++ b/main.py
@@ -15,8 +15,6 @@ if __name__ == "__main__":

    # A Path object followed by / joins path components

-    # print(excel_path)
-
     # Create the log directory
    log_dir = cwd_dir / "logs"
    log_dir.mkdir(exist_ok=True)
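
How the README's config keys and the log directory created above might fit together can be sketched as follows. This is an illustration under stated assumptions, not the contents of `main.py`: the keys are assumed to sit at the top level of `config.json`, the helper names `load_config` and `setup_logging` are hypothetical, and the log file name is taken from the `logs/citation_process.log` path that appears in this commit.

```python
import json
import logging
from pathlib import Path

# Keys documented in the README; assumed here to be top-level entries of config.json.
REQUIRED_KEYS = (
    "api_key", "base_url", "pdf_dir", "result_dir",
    "source_excel_path", "target_excel_path",
    "logLevel", "sheetNum", "maxItem",
)


def load_config(config_path: Path) -> dict:
    """Load config.json and fail early if a documented key is missing."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing keys: {', '.join(missing)}")
    return config


def setup_logging(cwd_dir: Path, level: int) -> Path:
    """Create <cwd>/logs and log to logs/citation_process.log at the given level."""
    log_dir = cwd_dir / "logs"  # Path / str joins path components
    log_dir.mkdir(exist_ok=True)
    log_file = log_dir / "citation_process.log"
    # level is the README's logLevel: 10 is DEBUG, 20 is INFO.
    logging.basicConfig(filename=log_file, level=level, encoding="utf-8", force=True)
    return log_file
```

Validating the keys up front means a misnamed entry (for example the old `tableNum` instead of `sheetNum`) fails immediately with a readable message instead of surfacing later as a `KeyError` deep inside the processing loop.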

--- a/psrc/citationProcess.py
+++ b/psrc/citationProcess.py
@@ -6,7 +6,7 @@ import openpyxl
 from fuzzywuzzy import fuzz
 import json

-def get_authors( content, configModel, client):
+def get_key_info( content, configModel, client):
    system_prompt = """
    Act as an expert metadata extraction assistant.
    Analyze the following text, which is extracted from the first page of a document (likely a scientific paper or report).
@@ -21,9 +21,37 @@ def get_authors( content, configModel, client):
        -   Extract all associated institutions of authors.
    -   **Countrys:**
        -   Extract all associated countrys of authors.
+    -   **ISSUE:**
+        -   Extract where the paper is published like journal or session.
    -   Title, authors, institutions and countrys should be four separate keys, not nested together.
    -   Use highcase for first letter of key.
    -   **Handling Missing Data:** If no data of a field can be identified in the text, the field in the JSON should be an empty list `[]`.
+    
+    Example Output:
+    {
+        "Title": "Laius: Towards Latency Awareness and Improved Utilization of Spatial Multitasking Accelerators in Datacenters",
+        "Authors": [
+            "Quan Chen",
+            "Daniel Edward Mawhirter",
+            "Bo Wu",
+            "Chao Li",
+        ],
+        "Institutions": [
+            "Shanghai Jiao Tong University",
+            "Colorado School of Mines",
+            "Colorado School of Mines",
+            "Shanghai Jiao Tong University",
+        ],
+        "Countrys": [
+            "China",
+            "United States",
+            "United States",
+            "China",
+        ],
+        "ISSUE": [
+            "IEEE Transactions on Computers" 
+        ]
+    }
    """

    response = client.chat.completions.create(  
@@ -130,7 +158,7 @@ def citationProcess(config: dict):
            configModel = config["model"]

            # Extract key information
-            result = get_authors(first_page_text, configModel, client)
+            result = get_key_info(first_page_text, configModel, client)

            if result is not None:
                # Parse the JSON result and extract the paper title
@@ -149,7 +177,7 @@ def citationProcess(config: dict):

                    if similarity >= 85:
                        # Rename the PDF file
-                        new_pdf_name = f"{idx}-{pdf_title.replace(':', '_')}.pdf"  # replace colons with underscores
+                        new_pdf_name = f"{idx}-{pdf_title.replace(':', '_').replace(' ', '_').replace('?', '_')}.pdf"  # replace colons, spaces, and question marks with underscores
                        new_pdf_path = file.parent / new_pdf_name
                        try:
                            file.rename(new_pdf_path)
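
The chained `.replace()` calls above cover colons, spaces, and question marks. Paper titles can also contain other characters that Windows rejects in file names (`\ / * " < > |`), so a single regular expression may be easier to extend. The sketch below is one possible alternative, not the behavior of this commit: `safe_pdf_name` is a hypothetical helper, and unlike the chained replaces it collapses each run of unsafe characters into a single underscore.

```python
import re

# Characters Windows does not allow in file names, plus whitespace.
_UNSAFE = re.compile(r'[\\/:*?"<>|\s]+')


def safe_pdf_name(idx: int, pdf_title: str) -> str:
    """Build '<idx>-<title>.pdf', replacing each run of unsafe characters with one underscore."""
    cleaned = _UNSAFE.sub("_", pdf_title).strip("_")
    return f"{idx}-{cleaned}.pdf"
```

For example, `safe_pdf_name(3, "Laius: Towards Latency")` gives `3-Laius_Towards_Latency.pdf`, whereas the chained replaces in the diff would produce a double underscore after `Laius`.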

--- a/share_is_ccfa/CCF_A_list.csv
+++ b/share_is_ccfa/CCF_A_list.csv
-abbr,fullname
-PPoPP,ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming
-FAST,USENIX Conference on File and Storage Technologies
-DAC,Design Automation Conference
-HPCA,IEEE International Symposium on High Performance Computer Architecture
-MICRO,IEEE/ACM International Symposium on Microarchitecture
-SC,"International Conference for High Performance Computing, Networking, Storage, and Analysis"
-ASPLOS,International Conference on Architectural Support for Programming Languages and Operating Systems
-ISCA,International Symposium on Computer Architecture
-USENIX ATC,USENIX Annual Technical Conference
-EuroSys,European Conference on Computer Systems
-SIGCOMM,"ACM International Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication"
-MobiCom,ACM International Conference on Mobile Computing and Networking
-INFOCOM,IEEE International Conference on Computer Communications
-NSDI,Symposium on Network System Design and Implementation
-CCS,ACM Conference on Computer and Communications Security
-EUROCRYPT,International Conference on the Theory and Applications of Cryptographic Techniques
-S&P,IEEE Symposium on Security and Privacy
-CRYPTO,International Cryptology Conference
-USENIX Security,USENIX Security Symposium
-NDSS,Network and Distributed System Security Symposium
-PLDI,ACM SIGPLAN Conference on Programming Language Design and Implementation
-POPL,ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
-FSE,ACM International Conference on the Foundations of Software Engineering
-SOSP,ACM Symposium on Operating Systems Principles
-OOPSLA,"Conference on Object-Oriented Programming Systems, Languages,and Applications"
-ASE,International Conference on Automated Software Engineering
-ICSE,International Conference on Software Engineering
-ISSTA,International Symposium on Software Testing and Analysis
-OSDI,USENIX Symposium on Operating Systems Design and Implementations
-FM,International Symposium on Formal Methods
-SIGMOD,ACM SIGMOD Conference
-SIGKDD,ACM SIGKDD Conference on Knowledge Discovery and Data Mining
-ICDE,IEEE International Conference on Data Engineering
-SIGIR,International ACM SIGIR Conference on Research and Development in Information Retrieval
-VLDB,International Conference on Very Large Data Bases
-STOC,ACM Symposium on Theory of Computing
-SODA,ACM-SIAM Symposium on Discrete Algorithms
-CAV,International Conference on Computer Aided Verification
-FOCS,IEEE Annual Symposium on Foundations of Computer Science
-LICS,ACM/IEEE Symposium on Logic in Computer Science
-ACM MM,ACM International Conference on Multimedia
-SIGGRAPH,ACM Special Interest Group on Computer Graphics
-VR,IEEE Virtual Reality
-IEEE VIS,IEEE Visualization Conference
-AAAI,AAAI Conference on Artificial Intelligence
-NeurIPS,Conference on Neural Information Processing Systems
-ACL,Annual Meeting of the Association for Computational Linguistics
-CVPR,IEEE/CVF Computer Vision and Pattern Recognition Conference
-ICCV,International Conference on Computer Vision
-ICML,International Conference on Machine Learning
-IJCAI,International Joint Conference on Artificial Intelligence
-CSCW,ACM Conference on Computer Supported Cooperative Work and Social Computing
-CHI,ACM Conference on Human Factors in Computing Systems
-UbiComp/IMWUT,"ACM international joint conference on Pervasive and Ubiquitous Computing/ Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"
-UIST,ACM Symposium on User Interface Software and Technology
-WWW,International World Wide Web Conference
-RTSS,IEEE Real-Time Systems Symposium
-WINE,Conference on Web and Internet Economics
-TOCS,ACM Transactions on Computer Systems
-TOS,ACM Transactions on Storage
-TCAD,IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
-TC,IEEE Transactions on Computers
-TPDS,IEEE Transactions on Parallel and Distributed Systems
-TACO,ACM Transactions on Architecture and Code Optimization
-JSAC,IEEE Journal on Selected Areas in Communications
-TMC,IEEE Transactions on Mobile Computing
-TON,IEEE/ACM Transactions on Networking
-TDSC,IEEE Transactions on Dependable and Secure Computing
-TIFS,IEEE Transactions on Information Forensics and Security
-,Journal of Cryptology
-TOPLAS,ACM Transactions on Programming Languages and Systems
-TOSEM,ACM Transactions on Software Engineering and Methodology
-TSE,IEEE Transactions on Software Engineering
-TSC,IEEE Transactions on Services Computing
-TODS,ACM Transactions on Database Systems
-TOIS,ACM Transactions on Information Systems
-TKDE,IEEE Transactions on Knowledge and Data Engineering
-VLDBJ,The VLDB Journal
-TIT,IEEE Transactions on Information Theory
-IANDC,Information and Computation
-SICOMP,SIAM Journal on Computing
-TOG,ACM Transactions on Graphics
-TIP,IEEE Transactions on Image Processing
-TVCG,IEEE Transactions on Visualization and Computer Graphics
-AI,Artificial Intelligence
-TPAMI,IEEE Transactions on Pattern Analysis and Machine Intelligence
-IJCV,International Journal of Computer Vision
-JMLR,Journal of Machine Learning Research
-TOCHI,ACM Transactions on Computer-Human Interaction
-IJHCS,International Journal of Human-Computer Studies
-JACM,Journal of the ACM
-Proc. IEEE,Proceedings of the IEEE
-SCIS,Science China Information Sciences
--- a/share_is_ccfa/data/title_venue/c23.csv
+++ b/share_is_ccfa/data/title_venue/c23.csv
-title,venue
-Learning to Generalize With Object-Centric Agents in the Open World Survival Game Crafter,"IEEE Transactions on Games ( Volume: 16, Issue: 2, June 2024)"
-Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning,"Proceedings of the AAAI Conference on Artificial Intelligence, 3"
-"Advancing DRL Agents in Commercial Fighting Games: Training, Integration,
-and Agent-Human Alignment",Proceedings of the 41th International Conference on Machine Learning (ICML 2024)
-Discovering and Using Structure in Autonomous Machine Learning,ETH Zurich thesis
-Jose Luis Flores Campana,博士论文
--- a/share_is_ccfa/is_ccfa.py
+++ b/share_is_ccfa/is_ccfa.py
--- a/share_is_ccfa/readme.md
+++ b/share_is_ccfa/readme.md
-# Warning: this script's judgments may be inaccurate, so always check the results manually!
-## Usage: python is_ccfa.py
-
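-## Function: determine whether a paper belongs to a CCF-A conference
- Input file: `CCF_A_list.csv`, with two columns: `abbr,fullname`
- Input folder: `data/title_venue`, containing several CSV files, each with the header row `title,venue`
- Output folder: `data/is_ccfa`; each CSV file under `title_venue` is processed and written to a file of the same name with the header row `title,venue,is_ccf_a`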
\ No newline at end of file