Commit 3e83f4af by jiangdongchen

CCFA judge by LLM

parent 3a8c6d38
.vscode/ .vscode/
others/
Papers/ Papers/
psrc/__pycache__/ psrc/__pycache__/
json/ json/
\ No newline at end of file
...@@ -18,6 +18,8 @@ ...@@ -18,6 +18,8 @@
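- Column 7: paper authors - Column 7: paper authors
- target_excel_path - target_excel_path
- The formatted output spreadsheet - The formatted output spreadsheet
- ccfa_excel_path
- Reference spreadsheet listing the CCF-A venues
- logLevel - logLevel
- 10 selects the DEBUG level - 10 selects the DEBUG level
- 20 selects the INFO level - 20 selects the INFO level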
...@@ -36,10 +38,10 @@ ...@@ -36,10 +38,10 @@
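- The output spreadsheet is written to the target folder; PDFs are renamed in place to the standardized form - The output spreadsheet is written to the target folder; PDFs are renamed in place to the standardized form
- Run the program with python main.py - Run the program with python main.py
- Do not open the target Excel file while the program is running; the competing file access causes an error - Do not open the target Excel file while the program is running; the competing file access causes an error
- Multi-model cross-validation
- Sample logs from a successful run are in the logs folder - Sample logs from a successful run are in the logs folder
- If every PDF before a given index in the Excel sheet was extracted and written to the sheet correctly, but the PDF at the next index failed - Resuming after a failure: if every PDF before a given index in the Excel sheet was extracted and written to the sheet correctly, but the PDF at the next index failed
- it is recommended to move the correctly processed PDFs to another folder, so that rerunning the script handles only the remaining PDFs - it is recommended to move the correctly processed PDFs to another folder, so that rerunning the script handles only the remaining PDFs
- TODO: multi-model cross-validation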
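
The resume step described above (moving finished PDFs aside before rerunning) could be scripted roughly as follows. This is a minimal sketch, not part of the repository: the helper name `move_processed_pdfs` and the `done` folder are hypothetical, and it assumes that a finished PDF can be recognized by the `<index>-` prefix visible in the sample logs. A file that was renamed but failed at a later step would also match, so check the log before relying on it.

```python
import re
import shutil
from pathlib import Path

# Assumption: a processed PDF carries the "<index>-" prefix seen in the logs,
# e.g. "46-Towards_Efficient_Elastic_Parallelism_for_Deep_Learning_Processor.pdf".
PROCESSED = re.compile(r"^\d+-.+\.pdf$", re.IGNORECASE)


def move_processed_pdfs(pdf_dir: Path, done_dir: Path) -> list[str]:
    """Move index-prefixed PDFs out of pdf_dir and return their names, sorted."""
    done_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        if PROCESSED.match(pdf.name):
            shutil.move(str(pdf), str(done_dir / pdf.name))
            moved.append(pdf.name)
    return moved
```

Calling `move_processed_pdfs(Path("./Papers"), Path("./Papers_done"))` before the next `python main.py` run would leave only the unprocessed files in `pdf_dir`.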
# Requirements and Solutions # Requirements and Solutions
1. Download the paper PDFs 1. Download the paper PDFs
...@@ -49,9 +51,11 @@ ...@@ -49,9 +51,11 @@
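1. Read the configuration object from config.json 1. Read the configuration object from config.json
2. **Iterate** over the sheets of the Excel workbook 2. **Iterate** over the sheets of the Excel workbook
1. **Iterate** over the paper names and indices in the sheet 1. **Iterate** over the paper names and indices in the sheet
1. Use an LLM to read the paper title and key information from the first page of the PDF and store them in the json folder 1. An **LLM** reads the paper title and key information from the first page of the PDF and stores them in the json folder
2. **Iterate** over the paper names in the Excel sheet and fuzzy-match them 2. **Iterate** over the paper names in the Excel sheet and fuzzy-match them; on a successful match
1. On a successful match, use the paper title from the PDF and the index to rename the PDF file and the paper name in the Excel sheet to the standardized form, and save the key information from the PDF to a JSON file, including title, venue name, author names, institutions, and countries. 1. Use the paper title from the PDF and the index to rename the PDF file and the paper name in the Excel sheet to the standardized form
2. Save the key information from the PDF to a JSON file, including title, venue name, author names, institutions, and countries.
3. Pass the conference or journal name from the PDF, together with the CCF-A table of conference and journal names, to an **LLM** for matching; write the result to the target Excel sheet as “是/否” (yes/no).
2. On a failed match, report the unmatched entry and record it with a warning for later handling. 2. On a failed match, report the unmatched entry and record it with a warning for later handling.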
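
The fuzzy-matching and renaming steps above can be sketched as follows. The repository imports `fuzzywuzzy`; to keep this example dependency-free it swaps in `difflib.SequenceMatcher` from the standard library, so the scores are not identical to `fuzz.ratio`. The function names, the 0.85 threshold, and the exact renaming rule (index prefix, whitespace replaced by underscores, as the sample logs suggest) are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; a stand-in for fuzz.ratio / 100."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_match(pdf_title, indices, excel_names, threshold=0.85):
    """Return (index, name) of the closest Excel entry, or None below threshold."""
    scored = [(similarity(pdf_title, name), idx, name)
              for idx, name in zip(indices, excel_names)]
    if not scored:
        return None
    score, idx, name = max(scored)
    return (idx, name) if score >= threshold else None


def standardized_name(index: int, title: str) -> str:
    """Build '<index>-<Title_with_underscores>.pdf', as in the sample logs."""
    return f"{index}-{'_'.join(title.split())}.pdf"
```

When `best_match` returns `None`, the caller would log a warning for the unmatched entry, which corresponds to the failure branch in the list above.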
# Code Structure # Code Structure
......
...@@ -5,8 +5,9 @@ ...@@ -5,8 +5,9 @@
"pdf_dir": "./Papers", "pdf_dir": "./Papers",
"result_dir": "./json", "result_dir": "./json",
"source_excel_path": "./others/论文被引用情况-陈老师-2025.05.01.xlsx", "source_excel_path": "./others/论文被引用情况-陈老师-2025.05.01.xlsx",
"ccfa_excel_path": "./others/CCFA.xlsx",
"target_excel_path": "./others/target.xlsx", "target_excel_path": "./others/target.xlsx",
"logLevel": 20, "logLevel": 20,
"sheetNum": 1, "sheetNum": 1,
"maxItem": 64 "maxItem": 10
} }
\ No newline at end of file
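
Because this commit adds a new required key (`ccfa_excel_path`), an older config.json would fail only when the key is first read. A minimal sketch of an early check is shown below; `load_config` and `REQUIRED_KEYS` are hypothetical names, and the key list covers only what is visible in this hunk (the script also reads `api_key`, `base_url`, and `model`).

```python
import json
import logging
from pathlib import Path

# Keys visible in the config.json hunk; the real file may require more.
REQUIRED_KEYS = ("pdf_dir", "result_dir", "source_excel_path",
                 "ccfa_excel_path", "target_excel_path",
                 "logLevel", "sheetNum", "maxItem")


def load_config(path: Path) -> dict:
    """Read config.json and fail early if a required key is missing."""
    config = json.loads(path.read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    return config
```

The numeric `logLevel` can be passed directly to `logging.basicConfig(level=config["logLevel"])`; in the standard library, 10 corresponds to `logging.DEBUG` and 20 to `logging.INFO`.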
...@@ -864,3 +864,31 @@ ...@@ -864,3 +864,31 @@
2025-05-08 15:23:40,370 - INFO - Change: Towards_Efficient_Elastic_Parallelism_for_Deep_Learning_Processor.pdf -> 46-Towards_Efficient_Elastic_Parallelism_for_Deep_Learning_Processor.pdf 2025-05-08 15:23:40,370 - INFO - Change: Towards_Efficient_Elastic_Parallelism_for_Deep_Learning_Processor.pdf -> 46-Towards_Efficient_Elastic_Parallelism_for_Deep_Learning_Processor.pdf
2025-05-08 15:23:40,370 - INFO - Processing Zero-cost abstractions for irregular data shapes in a high-performance parallel language.pdf 2025-05-08 15:23:40,370 - INFO - Processing Zero-cost abstractions for irregular data shapes in a high-performance parallel language.pdf
2025-05-08 15:23:44,782 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK" 2025-05-08 15:23:44,782 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-08 16:05:22,916 - INFO - 程序启动,日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
2025-05-08 16:05:23,454 - INFO - Processing sheet: j24-DianNao family
2025-05-08 16:05:23,455 - INFO - Processing 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-08 16:06:19,879 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-08 16:09:29,927 - INFO - 程序启动,日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
2025-05-08 16:09:30,443 - INFO - Processing sheet: j24-DianNao family
2025-05-08 16:09:30,444 - INFO - Processing 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-08 16:10:26,109 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-08 16:10:26,113 - INFO - Renamed: 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf -> 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-08 16:10:26,162 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 400 Bad Request"
2025-05-08 16:13:03,786 - INFO - 程序启动,日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
2025-05-08 16:13:04,301 - INFO - Processing sheet: j24-DianNao family
2025-05-08 16:13:04,302 - INFO - Processing 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-08 16:14:02,701 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-08 16:14:02,704 - INFO - Renamed: 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf -> 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-08 16:14:02,756 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 400 Bad Request"
2025-05-08 16:16:03,169 - INFO - 程序启动,日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
2025-05-08 16:16:03,684 - INFO - Processing sheet: j24-DianNao family
2025-05-08 16:16:03,685 - INFO - Processing 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-08 16:16:58,281 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-08 16:16:58,283 - INFO - Renamed: 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf -> 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-08 16:17:05,708 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-08 16:18:28,748 - INFO - 程序启动,日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
2025-05-08 16:18:29,264 - INFO - Processing sheet: j24-DianNao family
2025-05-08 16:18:29,265 - INFO - Processing 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-08 16:19:23,091 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-08 16:19:23,093 - INFO - Renamed: 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf -> 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-08 16:19:30,059 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
...@@ -52,8 +52,8 @@ if __name__ == "__main__": ...@@ -52,8 +52,8 @@ if __name__ == "__main__":
client = OpenAI(api_key=config["api_key"], base_url=config["base_url"]) client = OpenAI(api_key=config["api_key"], base_url=config["base_url"])
configModel = config["model"] configModel = config["model"]
excel_path2 = Path(config["excel_path2"]) ccfa_excel_path = Path(config["ccfa_excel_path"])
wb = openpyxl.load_workbook(excel_path2) wb = openpyxl.load_workbook(ccfa_excel_path)
sheetCCF = wb["CCF-A列表"] sheetCCF = wb["CCF-A列表"]
# No. Abbreviation Full name # No. Abbreviation Full name
# 1 PPoPP ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming # 1 PPoPP ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming
......
from pathlib import Path from pathlib import Path
import logging import logging
from openai import OpenAI from openai import OpenAI
import pypdf import pypdf
import openpyxl import openpyxl
from fuzzywuzzy import fuzz from fuzzywuzzy import fuzz
import json import json
RED = '\033[91m'
GREEN = '\033[92m'
BLUE = '\033[94m'
RESET = '\033[0m'
def chechCCFA(conferenceJournal, CCFA, configModel, client):
    """Ask the LLM whether conferenceJournal matches a CCF-A entry; return the raw JSON string."""
    system_prompt = f"""
    You are an expert academic conference/journal classifier. Your task is to determine if the given conference/journal name matches any entry in the provided CCF-A list.
    CCF-A List (comma-separated): {CCFA}
    Analysis Guidelines:
    1. Perform fuzzy matching considering:
        - Abbreviations vs full names (e.g. 'PPoPP' vs 'ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming')
        - Common variations (e.g. 'IEEE Transactions' vs 'IEEE Trans.')
        - Minor spelling differences
    2. Return JSON with:
        - "IsCCFA": true/false
        - "MatchedName": the matched name from CCF-A list (empty string if no match)
        - "Confidence": your confidence score (0-100)
        - "Reason": a one-sentence justification
    Example Output:
    {{
        "IsCCFA": true,
        "MatchedName": "IEEE International Symposium on High Performance Computer Architecture",
        "Confidence": 95,
        "Reason": "The input matches HPCA's full name"
    }}
    """
    response = client.chat.completions.create(
        model=configModel,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": conferenceJournal},
        ],
        temperature=0.2,
        max_tokens=4096,
        # stream=True,
        response_format={"type": "json_object"}
    )
    # The caller receives a JSON string and must parse it before using "IsCCFA"
    return response.choices[0].message.content
def get_key_info( content, configModel, client): def get_key_info( content, configModel, client):
system_prompt = """ system_prompt = """
Act as an expert metadata extraction assistant. Act as an expert metadata extraction assistant.
...@@ -124,16 +167,26 @@ def citationProcess(config: dict): ...@@ -124,16 +167,26 @@ def citationProcess(config: dict):
excel_path = Path(config["source_excel_path"]) excel_path = Path(config["source_excel_path"])
target_path = Path(config["target_excel_path"]) target_path = Path(config["target_excel_path"])
ccfa_excel_path = Path(config["ccfa_excel_path"])
# Load the source Excel workbook # Load the source Excel workbook
wb = openpyxl.load_workbook(excel_path) wb = openpyxl.load_workbook(excel_path)
# Load the CCF-A list
# No. Abbreviation Full name
# 1 PPoPP ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming
# 2 FAST USENIX Conference on File and Storage Technologies
# 3 DAC Design Automation Conference
# 4 HPCA IEEE International Symposium on High Performance Computer Architecture
# 5 MICRO IEEE/ACM International Symposium on Microarchitecture
ccfa_wb = openpyxl.load_workbook(ccfa_excel_path)
sheetCCF = ccfa_wb["CCF-A列表"]
# Iterate over all worksheets in the workbook # Iterate over all worksheets in the workbook
for idx, sheet_name in enumerate(wb.sheetnames): for idx, sheet_name in enumerate(wb.sheetnames):
if idx == config["sheetNum"]: if idx == config["sheetNum"]:
break break
sheet = wb[sheet_name] sheet = wb[sheet_name]
logging.info(f"Processing sheet: {sheet_name}") logging.info(f"{BLUE}Processing sheet: {sheet_name}{RESET}")
index_list, paperName_list = read_rough_nameIndex_from_excel(sheet, config["maxItem"]) index_list, paperName_list = read_rough_nameIndex_from_excel(sheet, config["maxItem"])
...@@ -147,7 +200,7 @@ def citationProcess(config: dict): ...@@ -147,7 +200,7 @@ def citationProcess(config: dict):
# Iterate over all PDF files that belong to the current worksheet # Iterate over all PDF files that belong to the current worksheet
for file in pdf_files: for file in pdf_files:
logging.info(f"Processing {file.name}") logging.info(f"{BLUE}Processing {file.name}{RESET}")
first_page_text = extract_first_page_text(file) first_page_text = extract_first_page_text(file)
...@@ -164,6 +217,7 @@ def citationProcess(config: dict): ...@@ -164,6 +217,7 @@ def citationProcess(config: dict):
# Parse the JSON result and extract the paper title # Parse the JSON result and extract the paper title
result_dict = json.loads(result) result_dict = json.loads(result)
pdf_title = result_dict["Title"] pdf_title = result_dict["Title"]
pdf_issue = result_dict["ISSUE"]
# Iterate over the Excel entries and fuzzy-match them # Iterate over the Excel entries and fuzzy-match them
for idx, excel_name in zip(index_list, paperName_list): for idx, excel_name in zip(index_list, paperName_list):
...@@ -197,6 +251,24 @@ def citationProcess(config: dict): ...@@ -197,6 +251,24 @@ def citationProcess(config: dict):
authors = ";".join(authors_list) if isinstance(authors_list, list) else "" authors = ";".join(authors_list) if isinstance(authors_list, list) else ""
sheet.cell(row=idx+4, column=7, value=authors) # column 7 holds the author names sheet.cell(row=idx+4, column=7, value=authors) # column 7 holds the author names
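# Decide whether the venue is CCF-A
logging.info("Judge CCFA.")
CCFA_list = []
for row in sheetCCF.iter_rows(min_row=2, values_only=True):  # start at the second row (skip the header)
    if row[1] and row[2]:  # make sure both the abbreviation and the full name exist
        CCFA_list.append(row[1])
        CCFA_list.append(row[2])
# Join the list into one long comma-separated string
CCFA = ','.join(CCFA_list)
conferenceJournal = pdf_issue[0]
CCFA_flag = "否"
logging.debug(conferenceJournal)
if conferenceJournal != "":
    # chechCCFA returns a JSON string, which is always truthy, so parse it first
    verdict = json.loads(chechCCFA(conferenceJournal, CCFA, configModel, client)).get("IsCCFA")
    # The model may answer with a JSON boolean or a string (possibly misspelled "ture")
    CCFA_flag = "是" if str(verdict).strip().lower() in ("true", "ture") else "否"
sheet.cell(row=idx+4, column=5, value=CCFA_flag)  # column 5 holds the CCF-A verdict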
# Save the modified Excel workbook # Save the modified Excel workbook
wb.save(target_path) wb.save(target_path)
......
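
One point about the CCF-A step in this diff is worth illustrating: `chechCCFA` returns the model's reply as a JSON string, and any non-empty string is truthy in Python, so the reply has to be parsed before it can drive the “是/否” decision. The sketch below shows one way to do that, together with a list builder that skips blank cells. Both helper names are hypothetical, and the row layout (number, abbreviation, full name) is taken from the comment in the diff.

```python
import json


def build_ccfa_string(rows) -> str:
    """Join abbreviations and full names from (no, abbr, full_name) rows."""
    names = []
    for row in rows:
        # Skip blank cells so that ",".join never receives None.
        names.extend(str(cell).strip() for cell in row[1:3] if cell)
    return ",".join(names)


def is_ccfa(raw_reply: str) -> bool:
    """Interpret the model's JSON reply; anything unparseable counts as no match."""
    try:
        verdict = json.loads(raw_reply).get("IsCCFA")
    except (json.JSONDecodeError, AttributeError, TypeError):
        return False
    if isinstance(verdict, bool):
        return verdict
    # Some models return the flag as a string rather than a JSON boolean.
    return str(verdict).strip().lower() == "true"
```

With `openpyxl`, `rows` would come from `sheetCCF.iter_rows(min_row=2, values_only=True)`, and the list only needs to be built once per run rather than once per PDF.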