CCFA judge by LLM

Commit 3e83f4af authored May 08, 2025 by jiangdongchen
11 changed files
--- a/.gitignore
+++ b/.gitignore
 .vscode/
-others/
 Papers/
 psrc/__pycache__/
 json/
\ No newline at end of file
--- a/README.md
+++ b/README.md
@@ -18,6 +18,8 @@
        - Column 7: paper authors
    - target_excel_path
        - The formatted output spreadsheet
+    - ccfa_excel_path
+        - The CCF-A reference spreadsheet
    - logLevel
        - 10 means DEBUG level
        - 20 means INFO level
@@ -36,10 +38,10 @@
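        - The output spreadsheet is written to the target folder; PDFs are renamed in place to a standardized name
 - Run the program with python main.py
 - Do not open the target Excel file while the program is running; otherwise the two compete for file access and the save fails
- Multi-model cross-validation
 - Sample logs from a successful run are in the logs folder
- If every PDF before a given index in the Excel sheet was extracted correctly and the Excel file was updated correctly, but the PDF at the next index fails
+- Resuming after a failure: if every PDF before a given index in the Excel sheet was extracted correctly and the Excel file was updated correctly, but the PDF at the next index fails
    - Move the correctly processed PDFs to another folder, so that rerunning the script handles only the remaining PDFs
+- TODO: multi-model cross-validation
 # Requirements and solution
 1. Download the paper PDFs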
@@ -49,9 +51,11 @@
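    1. Read the configuration object from config.json
    2. **Iterate** over the sheets of the Excel workbook
        1. **Iterate** over the paper names and indices in the sheet
-            1. Use an LLM to read the paper title and key information from the first page of the PDF, and store them in the json folder
+            1. Use an **LLM** to read the paper title and key information from the first page of the PDF, and store them in the json folder
-            2. **Iterate** over the paper names in the Excel sheet and fuzzy-match them
+            2. **Iterate** over the paper names in the Excel sheet and fuzzy-match them; on a successful match
-                1. On a successful match, use the paper title from the PDF and the index to rename the PDF file and the paper name in the Excel sheet to a standardized form, and save the key information from the PDF to a JSON file, including title, conference name, author names, institution, and country.
+                1. Use the paper title from the PDF and the index to rename the PDF file and the paper name in the Excel sheet to a standardized form
+                2. Save the key information from the PDF to a JSON file, including title, conference name, author names, institution, and country.
+                3. Pass the conference or journal name from the PDF, together with the CCF-A list of conference and journal names, to the **LLM** for matching; write the result to the target Excel sheet as “是/否” (yes/no).
            2. On a failed match, log the unmatched entry at warning level so it can be handled later.
 # Code structure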
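
The fuzzy-matching step of the README workflow (compare the title read from the PDF with the paper names in the Excel sheet, then act only on a sufficiently close match) could be sketched roughly as follows. This is a standard-library illustration that swaps in `difflib.SequenceMatcher` for the `fuzzywuzzy` scorer the project imports; the helper names `normalize` and `best_match` and the 0.85 threshold are assumptions, not taken from the repository.

```python
import difflib
import re


def normalize(title: str) -> str:
    """Lowercase a title and collapse punctuation and underscores to single spaces."""
    return re.sub(r"[\W_]+", " ", title).strip().lower()


def best_match(pdf_title, excel_titles, threshold=0.85):
    """Return (position, score) of the closest Excel title, or None if below threshold."""
    target = normalize(pdf_title)
    best = None
    for pos, name in enumerate(excel_titles):
        score = difflib.SequenceMatcher(None, target, normalize(name)).ratio()
        if best is None or score > best[1]:
            best = (pos, score)
    if best is not None and best[1] >= threshold:
        return best
    return None
```

A caller would rename the PDF and update the sheet only when `best_match` returns a result, and log a warning for the unmatched entry otherwise.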

--- a/config.json
+++ b/config.json
@@ -5,8 +5,9 @@
    "pdf_dir": "./Papers",
    "result_dir": "./json",
    "source_excel_path": "./others/论文被引用情况-陈老师-2025.05.01.xlsx",
+    "ccfa_excel_path": "./others/CCFA.xlsx",
    "target_excel_path": "./others/target.xlsx",
    "logLevel": 20,
    "sheetNum": 1,
-    "maxItem": 64
+    "maxItem": 10
 }
\ No newline at end of file
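
Because this commit renames a config key (`excel_path2` becomes `ccfa_excel_path`), a stale config.json would only fail when the key is first read. A minimal sketch of an up-front check is shown below; `load_config` and `REQUIRED_KEYS` are hypothetical names, and the key list covers only the keys visible in this hunk.

```python
import json
from pathlib import Path

# Keys visible in this commit's config.json hunk; the real file may define more.
REQUIRED_KEYS = ("pdf_dir", "result_dir", "source_excel_path",
                 "ccfa_excel_path", "target_excel_path",
                 "logLevel", "sheetNum", "maxItem")


def load_config(path):
    """Read config.json and fail early if a required key is missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {', '.join(missing)}")
    return config
```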
--- a/logs/citation_process.log
+++ b/logs/citation_process.log
@@ -864,3 +864,31 @@
 2025-05-08 15:23:40,370 - INFO - Change: Towards_Efficient_Elastic_Parallelism_for_Deep_Learning_Processor.pdf -> 46-Towards_Efficient_Elastic_Parallelism_for_Deep_Learning_Processor.pdf
 2025-05-08 15:23:40,370 - INFO - Processing Zero-cost abstractions for irregular data shapes in a high-performance parallel language.pdf
 2025-05-08 15:23:44,782 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
+2025-05-08 16:05:22,916 - INFO - 程序启动，日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
+2025-05-08 16:05:23,454 - INFO - Processing sheet: j24-DianNao family
+2025-05-08 16:05:23,455 - INFO - Processing 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
+2025-05-08 16:06:19,879 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
+2025-05-08 16:09:29,927 - INFO - 程序启动，日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
+2025-05-08 16:09:30,443 - INFO - [94mProcessing sheet: j24-DianNao family[0m
+2025-05-08 16:09:30,444 - INFO - [94mProcessing 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf[0m
+2025-05-08 16:10:26,109 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
+2025-05-08 16:10:26,113 - INFO - Renamed: 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf -> 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
+2025-05-08 16:10:26,162 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 400 Bad Request"
+2025-05-08 16:13:03,786 - INFO - 程序启动，日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
+2025-05-08 16:13:04,301 - INFO - [94mProcessing sheet: j24-DianNao family[0m
+2025-05-08 16:13:04,302 - INFO - [94mProcessing 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf[0m
+2025-05-08 16:14:02,701 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
+2025-05-08 16:14:02,704 - INFO - Renamed: 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf -> 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
+2025-05-08 16:14:02,756 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 400 Bad Request"
+2025-05-08 16:16:03,169 - INFO - 程序启动，日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
+2025-05-08 16:16:03,684 - INFO - [94mProcessing sheet: j24-DianNao family[0m
+2025-05-08 16:16:03,685 - INFO - [94mProcessing 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf[0m
+2025-05-08 16:16:58,281 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
+2025-05-08 16:16:58,283 - INFO - Renamed: 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf -> 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
+2025-05-08 16:17:05,708 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
+2025-05-08 16:18:28,748 - INFO - 程序启动，日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
+2025-05-08 16:18:29,264 - INFO - [94mProcessing sheet: j24-DianNao family[0m
+2025-05-08 16:18:29,265 - INFO - [94mProcessing 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf[0m
+2025-05-08 16:19:23,091 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
+2025-05-08 16:19:23,093 - INFO - Renamed: 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf -> 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
+2025-05-08 16:19:30,059 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
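
The `[94m` and `[0m` fragments in the log lines above are the ANSI color codes that this commit adds to the `logging.info` calls, written into the log file along with the message. One possible way to keep colors on the console while leaving the file clean is a formatter that strips them for the file handler. The sketch below is an assumption about how that might be wired, not the project's code; `PlainFormatter` and `make_file_handler` are hypothetical names, and the format string is inferred from the log lines.

```python
import logging
import re

# Matches SGR color sequences such as "\033[94m" and "\033[0m".
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


class PlainFormatter(logging.Formatter):
    """Formatter that removes ANSI color sequences from the rendered record."""

    def format(self, record):
        return ANSI_ESCAPE.sub("", super().format(record))


def make_file_handler(log_path):
    """Hypothetical helper: a file handler whose output carries no color codes."""
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(PlainFormatter("%(asctime)s - %(levelname)s - %(message)s"))
    return handler
```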
--- a/others/CCFA.xlsx
+++ b/others/CCFA.xlsx
--- a/others/target.xlsx
+++ b/others/target.xlsx
--- a/others/李昊晨.xlsx
+++ b/others/李昊晨.xlsx
--- a/others/论文被引用情况-陈老师-2025.05.01 copy.xlsx
+++ b/others/论文被引用情况-陈老师-2025.05.01 copy.xlsx
--- a/others/论文被引用情况-陈老师-2025.05.01.xlsx
+++ b/others/论文被引用情况-陈老师-2025.05.01.xlsx
--- a/psrc/checkCCFA.py
+++ b/psrc/checkCCFA.py
@@ -52,8 +52,8 @@ if __name__ == "__main__":
    client = OpenAI(api_key=config["api_key"], base_url=config["base_url"])
    configModel = config["model"]
-    excel_path2 = Path(config["excel_path2"])
+    ccfa_excel_path = Path(config["ccfa_excel_path"])
-    wb = openpyxl.load_workbook(excel_path2)
+    wb = openpyxl.load_workbook(ccfa_excel_path)
    sheetCCF = wb["CCF-A列表"]
    # No.	Abbr.	Full name
    # 1	PPoPP	ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming
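
The sheet layout in the comment above (index, abbreviation, full name) suggests a small helper for collecting the CCF-A names. The sketch below works on plain tuples of the kind `sheetCCF.iter_rows(min_row=2, values_only=True)` yields; `build_ccfa_names` is a hypothetical name, and skipping incomplete rows is an assumption about the desired behavior.

```python
def build_ccfa_names(rows):
    """Collect abbreviations and full names from (index, abbreviation, full name) rows."""
    names = []
    for row in rows:
        # Skip blank or truncated rows instead of failing on None.
        if len(row) < 3 or not row[0] or not row[1]:
            continue
        names.append(str(row[1]).strip())
        if row[2]:  # the full name may be missing
            names.append(str(row[2]).strip())
    return names
```

Since a full name can itself contain punctuation, joining the result with newlines (`"\n".join(names)`) may give the model a less ambiguous list than a comma-separated string.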

--- a/psrc/citationProcess.py
+++ b/psrc/citationProcess.py
 from pathlib import Path
 import logging
+from xml.etree.ElementPath import get_parent_map
 from openai import OpenAI
 import pypdf
 import openpyxl
 from fuzzywuzzy import fuzz
 import json
+RED = '\033[91m'
+GREEN = '\033[92m'
+BLUE = '\033[94m'
+RESET = '\033[0m'
+def chechCCFA( conferenceJournal, CCFA, configModel, client):
+    system_prompt = f"""
+    You are an expert academic conference/journal classifier. Your task is to determine if the given conference/journal name matches any entry in the provided CCF-A list.
+    CCF-A List (comma-separated): {CCFA}
+    Analysis Guidelines:
+    1. Perform fuzzy matching considering:
+       - Abbreviations vs full names (e.g. 'PPoPP' vs 'ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming')
+       - Common variations (e.g. 'IEEE Transactions' vs 'IEEE Trans.')
+       - Minor spelling differences
+    2. Return JSON with:
+       - "IsCCFA": true/false
+       - "MatchedName": the matched name from CCF-A list (empty string if no match)
+       - "Confidence": your confidence score (0-100)
+       - "Reason": a one-sentence justification
+    Example Output:
+    {{
+        "IsCCFA": true,
+        "MatchedName": "IEEE International Symposium on High Performance Computer Architecture",
+        "Confidence": 95,
+        "Reason": "The input matches HPCA's full name"
+    }}
+    """
+    response = client.chat.completions.create(  
+        model=configModel,  
+        messages=[  
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": conferenceJournal},
+        ],  
+        temperature=0.2,  
+        max_tokens=4096,
+        # stream=True,
+        response_format={"type": "json_object"}  
+    ) 
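+    # Parse the JSON reply and return a real boolean; the raw reply is a non-empty string and would always be truthy.
+    verdict = json.loads(response.choices[0].message.content).get("IsCCFA")
+    return verdict is True or str(verdict).strip().lower() == "true"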
 def get_key_info( content, configModel, client):
    system_prompt = """
    Act as an expert metadata extraction assistant.
@@ -124,16 +167,26 @@ def citationProcess(config: dict):
    excel_path = Path(config["source_excel_path"])
    target_path = Path(config["target_excel_path"])
+    ccfa_excel_path = Path(config["ccfa_excel_path"])
    # Load the source Excel workbook
    wb = openpyxl.load_workbook(excel_path)
+    # Load the CCF-A list
+    # No.	Abbr.	Full name
+    # 1	PPoPP	ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming
+    # 2	FAST	USENIX Conference on File and Storage Technologies
+    # 3	DAC	Design Automation Conference
+    # 4	HPCA	IEEE International Symposium on High Performance Computer Architecture
+    # 5	MICRO	IEEE/ACM International Symposium on Microarchitecture
+    ccfa_wb = openpyxl.load_workbook(ccfa_excel_path)
+    sheetCCF = ccfa_wb["CCF-A列表"]
    # Iterate over the worksheets in the workbook
    for idx, sheet_name in enumerate(wb.sheetnames):
        if idx == config["sheetNum"]:
            break
        sheet = wb[sheet_name]
-        logging.info(f"Processing sheet: {sheet_name}")
+        logging.info(f"{BLUE}Processing sheet: {sheet_name}{RESET}")
        index_list, paperName_list = read_rough_nameIndex_from_excel(sheet, config["maxItem"])
@@ -147,7 +200,7 @@ def citationProcess(config: dict):
        # Iterate over all PDF files for the current worksheet
        for file in pdf_files:
-            logging.info(f"Processing {file.name}")
+            logging.info(f"{BLUE}Processing {file.name}{RESET}")
            first_page_text = extract_first_page_text(file)
@@ -164,6 +217,7 @@ def citationProcess(config: dict):
                # Parse the JSON result and extract the paper title
                result_dict = json.loads(result)
                pdf_title = result_dict["Title"]
+                pdf_issue = result_dict["ISSUE"]
                # Fuzzy-match against each Excel entry
                for idx, excel_name in zip(index_list, paperName_list):
@@ -197,6 +251,24 @@ def citationProcess(config: dict):
                        authors = ";".join(authors_list) if isinstance(authors_list, list) else ""
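                        sheet.cell(row=idx+4, column=7, value=authors)  # column 7 holds the author names
+                        # CCF-A check
+                        logging.info("Judge CCFA.")
+                        CCFA_list = []
+                        for row in sheetCCF.iter_rows(min_row=2, values_only=True):  # skip the header row
+                            if row[0] and row[1]:  # require both the index and the abbreviation
+                                CCFA_list.append(row[1])
+                                if row[2]:  # the full name may be missing
+                                    CCFA_list.append(row[2])
+                        # Join the list into one comma-separated string
+                        CCFA = ','.join(CCFA_list)
+                        # ISSUE may arrive as a list or a plain string (assumption); take the first venue either way
+                        conferenceJournal = pdf_issue[0] if isinstance(pdf_issue, list) and pdf_issue else (pdf_issue or "")
+                        CCFA_flag = "否"
+                        logging.info(f"Venue: {conferenceJournal}")
+                        if conferenceJournal:
+                            CCFA_flag = "是" if chechCCFA(conferenceJournal, CCFA, configModel, client) else "否"
+                        sheet.cell(row=idx+4, column=5, value=CCFA_flag)  # column 5 holds the CCF-A flag
                        # Save the modified Excel workbook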
                        wb.save(target_path)
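
The model's reply arrives as a JSON string, and any non-empty string is truthy in Python, so the reply has to be parsed before it can drive the “是/否” flag. A minimal sketch of such a parser is shown below; `parse_ccfa_verdict` and its `min_confidence` parameter are hypothetical, and the field names follow the prompt in this commit.

```python
import json


def parse_ccfa_verdict(reply, min_confidence=0):
    """Map the model's JSON reply to the "是"/"否" flag written to the sheet."""
    try:
        data = json.loads(reply)
    except (TypeError, ValueError):
        return "否"  # unparseable reply: treat as no match
    if not isinstance(data, dict):
        return "否"
    verdict = data.get("IsCCFA")
    # Accept both a JSON boolean and the string "true".
    is_match = verdict is True or str(verdict).strip().lower() == "true"
    try:
        confidence = float(data.get("Confidence", 0))
    except (TypeError, ValueError):
        confidence = 0.0
    return "是" if is_match and confidence >= min_confidence else "否"
```

Treating an unparseable reply as "否" keeps the run going; logging a warning for such cases would make them easier to review by hand.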