Commit 39026d58 by jiangdongchen

README

parent f6d43a6b
@@ -3,4 +3,5 @@ psrc/__pycache__/
 psrc/stage1/__pycache__/
 psrc/stage2/__pycache__/
 Papers/j20/
+Papers/j29/
 json/
\ No newline at end of file
@@ -15,7 +15,7 @@
 - The Excel workbook to be checked
 - Actual entries start at row context_start+1
 - https://docs.qq.com/sheet/DZEVmZ2thTEd4R1Zh?tab=000001&nlc=1
-- **You absolutely must follow the format in the link above and put the BibTeX in the third row! Otherwise the run will fail**
+- **You absolutely must follow the format in the link above, use the header row from that link, and put the BibTeX in the third row! Otherwise the run will fail**
 - The j24 workbook in others is a sample Excel file; lightly adapt your original workbook to that format so it conforms, and remember to put the BibTeX in the third row
 - target_excel_path
 - The formatted output workbook
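@@ -71,9 +71,7 @@
 1. Use each PDF's index to find the corresponding Excel entry, check the entry against the PDF, then mark that Excel entry as processed
 2. PDFs without an index are unmatched PDFs and need a second, manual matching pass
 3. Unmatched Excel entries need a second, manual matching pass
-3. stage2: identify well-known companies and notable researchers (牛人)
-1. Go to the psrc/2-qiye directory and follow the steps in README.md to produce the Excel file
-2. Go to the psrc/3-niurenshaixuan directory and follow the steps in README.md to produce the Excel file
+3. stage2: identify well-known companies and notable researchers (牛人), using other people's scripts
 # Code structure
 1. The psrc folder contains the library functions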
......
@@ -4,11 +4,11 @@
 "model": "Pro/deepseek-ai/DeepSeek-V3",
 "pdf_dir": "./Papers",
 "result_dir": "./json",
-"source_excel_path": "./others/j24.xlsx",
+"source_excel_path": "./others/j29.xlsx",
 "content_start": 4,
 "ccfa_excel_path": "./others/CCFA.xlsx",
 "target_excel_path": "./others/target.xlsx",
 "logLevel": 20,
 "sheetNum": 1,
-"maxItem": 64
+"maxItem": 70
 }
\ No newline at end of file
2025-05-09 14:06:56,845 - INFO - 程序启动,日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
2025-05-09 14:06:57,373 - INFO - Processing sheet: j20
2025-05-09 14:06:57,383 - WARNING - No BibTeX entry found in sheet j20 row 3
2025-05-09 14:06:57,384 - INFO - Processing 10331530.pdf
2025-05-09 14:07:25,227 - INFO - 程序启动,日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
2025-05-09 14:07:25,754 - INFO - Processing sheet: j24-DianNao family
2025-05-09 14:07:25,761 - INFO - Processing 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-09 14:08:20,887 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:08:22,118 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:08:25,593 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:08:25,652 - INFO - Renamed: 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf -> 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-09 14:08:25,653 - INFO - Standardization issue info.
2025-05-09 14:08:25,654 - INFO - Standardization author info.
2025-05-09 14:08:25,654 - INFO - Standardization institution info.
2025-05-09 14:08:27,014 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:08:27,017 - INFO - Standardization countrys info.
2025-05-09 14:08:27,017 - INFO - Judge CCFA.
2025-05-09 14:08:34,888 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:08:34,935 - INFO - CCFA: International Symposium on Computer Architecture
2025-05-09 14:08:34,935 - INFO - Reason: The input 'ISCA '17' is an abbreviation and year variation of the CCF-A listed conference 'International Symposium on Computer Architecture'
2025-05-09 14:08:34,936 - INFO - 是
2025-05-09 14:08:36,251 - INFO - Matched: 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf -> idx: 1, excel_name: In-datacenter performance analysis of a tensor processing unit
2025-05-09 14:08:36,252 - INFO - Change: 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf -> 1-In-Datacenter_Performance_Analysis_of_a_Tensor_Processing_Unit.pdf
2025-05-09 14:08:36,252 - INFO - Processing 11-From_Cloud_Down_to_Things_An_Overview_of_Machine_Learning_in_Internet_of_Things.pdf
2025-05-09 14:08:49,455 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:08:52,552 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:08:52,585 - INFO - Renamed: 11-From_Cloud_Down_to_Things_An_Overview_of_Machine_Learning_in_Internet_of_Things.pdf -> 11-From_Cloud_Down_to_Things_An_Overview_of_Machine_Learning_in_Internet_of_Things.pdf
2025-05-09 14:08:52,588 - INFO - Standardization issue info.
2025-05-09 14:08:52,588 - INFO - Standardization author info.
2025-05-09 14:08:52,589 - INFO - Standardization institution info.
2025-05-09 14:08:53,414 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:08:53,415 - INFO - Standardization countrys info.
2025-05-09 14:08:53,416 - INFO - Judge CCFA.
2025-05-09 14:08:59,286 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:08:59,292 - INFO - 否
2025-05-09 14:09:00,366 - INFO - Matched: 11-From_Cloud_Down_to_Things_An_Overview_of_Machine_Learning_in_Internet_of_Things.pdf -> idx: 11, excel_name: From Cloud Down to Things: An Overview of Machine Learning in Internet of Things
2025-05-09 14:09:00,366 - INFO - Change: 11-From_Cloud_Down_to_Things_An_Overview_of_Machine_Learning_in_Internet_of_Things.pdf -> 11-From_Cloud_Down_to_Things_An_Overview_of_Machine_Learning_in_Internet_of_Things.pdf
2025-05-09 14:09:00,367 - INFO - Processing 12-CASH_Compiler_Assisted_Hardware_Design_for_Improving_DRAM_Energy_Efficiency_in_CNN_Inference.pdf
2025-05-09 14:09:13,651 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:09:18,523 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:09:22,513 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:09:22,517 - INFO - Renamed: 12-CASH_Compiler_Assisted_Hardware_Design_for_Improving_DRAM_Energy_Efficiency_in_CNN_Inference.pdf -> 12-CASH_Compiler_Assisted_Hardware_Design_for_Improving_DRAM_Energy_Efficiency_in_CNN_Inference.pdf
2025-05-09 14:09:22,547 - INFO - Standardization issue info.
2025-05-09 14:09:22,548 - INFO - Standardization author info.
2025-05-09 14:09:22,550 - INFO - Standardization institution info.
2025-05-09 14:09:23,808 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:09:23,810 - INFO - Standardization countrys info.
2025-05-09 14:09:23,811 - INFO - Judge CCFA.
2025-05-09 14:09:29,079 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:09:29,080 - INFO - 否
2025-05-09 14:09:30,267 - INFO - Matched: 12-CASH_Compiler_Assisted_Hardware_Design_for_Improving_DRAM_Energy_Efficiency_in_CNN_Inference.pdf -> idx: 12, excel_name: CASH: Compiler Assisted Hardware Design for Improving DRAM Energy Efficiency in CNN Inference
2025-05-09 14:09:30,268 - INFO - Change: 12-CASH_Compiler_Assisted_Hardware_Design_for_Improving_DRAM_Energy_Efficiency_in_CNN_Inference.pdf -> 12-CASH_Compiler_Assisted_Hardware_Design_for_Improving_DRAM_Energy_Efficiency_in_CNN_Inference.pdf
2025-05-09 14:09:30,268 - INFO - Processing 13-Laius_Towards_Latency_Awareness_and_Improved_Utilization_of_Spatial_Multitasking_Accelerators_in_Datacenters.pdf
2025-05-09 14:09:50,795 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:09:52,218 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:09:52,222 - INFO - Renamed: 13-Laius_Towards_Latency_Awareness_and_Improved_Utilization_of_Spatial_Multitasking_Accelerators_in_Datacenters.pdf -> 13-Laius_Towards_Latency_Awareness_and_Improved_Utilization_of_Spatial_Multitasking_Accelerators_in_Datacenters.pdf
2025-05-09 14:09:52,252 - INFO - Standardization issue info.
2025-05-09 14:09:52,253 - INFO - Standardization author info.
2025-05-09 14:09:52,253 - INFO - Standardization institution info.
2025-05-09 14:09:55,843 - INFO - HTTP Request: POST https://api.siliconflow.cn/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-09 14:09:55,846 - INFO - Standardization countrys info.
2025-05-09 14:09:55,846 - INFO - Judge CCFA.
2025-05-09 15:33:19,128 - INFO - 程序启动,日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
2025-05-09 15:33:34,364 - INFO - 程序启动,日志文件保存在: C:\Users\17046\Documents\papertools\logs\citation_process.log
2025-05-09 15:33:34,559 - INFO - Processing sheet: Sheet1
File added
No preview for this file type
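This script supports multiple sheets; it takes an Excel file as input and writes an Excel file as output.
Make sure the first row of each sheet is the paper title and the second row holds the target-format column names such as 序号 (serial number); otherwise the script will fail. Example: ![1746716733551](1746716733551.png)
Set `input_name` (without the .xlsx suffix) in `qiyeguojia.py`; the script writes `input_name-qiye.xlsx`.
Separate different institutions with `;` or `\n`; otherwise they are treated as a single institution.
Make sure no `\n` appears inside a single institution name; otherwise it is treated as two institutions and cannot be matched.
Well-known institutions are matched by institution name against `知名企业5.8.xlsx`, and the match is case-sensitive (keep `知名企业5.8.xlsx` updated to the latest version).
Countries are matched in two steps. The institution name is first matched against `机构国家汇总5.8.xlsx` to get the English country name: names are lowercased and matched as whole words, and the name is split on ',' so that institution names containing a school, a country, and so on are still handled; `?` is written when nothing matches. The result is then looked up in `全局国家地区5.8.xlsx` to get the corresponding Chinese country name and index.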
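
The splitting and country-lookup rules above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script's actual code: `split_institutions`, `lookup_country`, and the `country_by_name` dict (standing in for the contents of `机构国家汇总5.8.xlsx`) are hypothetical names, and "whole-word matching" is approximated here by an exact match on the full name or on one of its comma-separated parts.

```python
import re


def split_institutions(cell):
    """Split one cell on ';' or newlines and drop empty pieces."""
    return [part.strip() for part in re.split(r"[;\n]", str(cell)) if part.strip()]


def lookup_country(institution, country_by_name):
    """Return the English country name for an institution, or '?' if nothing matches.

    country_by_name is a hypothetical dict {lowercase institution name: country}.
    """
    normalized = institution.lower()
    # Try the full name first, then each comma-separated part on its own
    candidates = [normalized] + [part.strip() for part in normalized.split(",")]
    for candidate in candidates:
        if candidate in country_by_name:
            return country_by_name[candidate]
    return "?"


# Made-up lookup table, for illustration only
table = {"tsinghua university": "china", "mit": "united states"}
cell = "Dept. of CS, Tsinghua University;MIT\nUnknown Lab"
countries = [lookup_country(name, table) for name in split_institutions(cell)]
```

With this input, the cell splits into three institutions; the first matches through its second comma-separated part, and the last one falls back to `?`.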
# Siming Lan
# Import the required libraries
from copy import copy  # copying openpyxl style objects
import pandas as pd  # data processing
import re  # regular expressions
import xlwings as xw  # Excel automation
from openpyxl import load_workbook  # Excel file access
from openpyxl.utils.dataframe import dataframe_to_rows  # DataFrame -> Excel rows
import pdb  # debugger

# Input and output file names
input_name = 'j24_target'  # base name of the input file (no .xlsx suffix)
input_file = input_name + '.xlsx'  # input Excel file
output_file = input_name + '-qiye.xlsx'  # output Excel file


def find_header_rows(file_path):
    """
    Find the header row of every worksheet in an Excel file.
    Args:
        file_path: path to the Excel file
    Returns:
        dict: {sheet name: header row index}
    """
    workbook = xw.Book(file_path)  # open the Excel file
    header_rows_dict = {}  # results
    for sheet in workbook.sheets:  # iterate over all worksheets
        header = None
        # Look for "序号" in column A of the first 5 rows to locate the header
        for row in range(1, 6):
            if sheet.range(f'A{row}').value == "序号":
                header = row - 1  # 0-based index of the "序号" row, as pandas expects
                break
        header_rows_dict[sheet.name] = header  # record the header row index
    workbook.close()  # close the Excel file
    return header_rows_dict


# Read the workbook that lists the well-known companies
famous_companies_df = pd.read_excel("知名企业5.8.xlsx")
company_dict = {}  # maps company name or alias -> index

# Build the company-name dictionary
for idx, row in famous_companies_df.iterrows():
    company_name = row["企业名"]  # company name
    aliases = row.get("别名", "")  # aliases (may be empty)
    # Aliases are separated by semicolons
    if isinstance(aliases, str):
        aliases = aliases.split(';')
    else:
        aliases = []
    # Register the company name (indexes start at 1)
    company_dict[company_name.strip()] = idx + 1
    # Register every alias under the same index
    for alias in aliases:
        alias = alias.strip()
        if alias:  # skip empty aliases
            company_dict[alias] = idx + 1

# Processed results, keyed by sheet name
results = {}
# Header row index of every worksheet in the input file
header_rows = find_header_rows(input_file)


def process_column_name(col_name):
    """Normalize a column name: remove newlines and surrounding whitespace."""
    return str(col_name).replace('\n', '').strip()


# Process each worksheet
for sheet_name, header in header_rows.items():
    # Read the worksheet, using the detected row as the header
    df = pd.read_excel(input_file, sheet_name=sheet_name, header=header)
    # Normalize column names: remove newlines and whitespace
    df.columns = [process_column_name(col) for col in df.columns]
    # Check that the required columns exist
    required_columns = ["引文机构", "知名企业名称(参考知名企业列表)", "引文机构在知名企业中的索引"]
    if not all(col in df.columns for col in required_columns):
        print(f"跳过 {sheet_name}:缺少必要列")
        continue  # skip worksheets that lack a required column
    # Per-row results
    company_names_list = []  # matched company names
    company_indexes_list = []  # matched company indexes
    # Process each row
    for _, row in df.iterrows():
        # Split the "引文机构" column on semicolons or newlines and drop empty pieces
        institutions = [inst for inst in re.split(r'[;\n]', str(row["引文机构"]).strip()) if inst]
        names = []  # company names matched in this row
        indexes = []  # company indexes matched in this row
        # Process each institution name
        for inst in institutions:
            inst_clean = inst.strip()
            # Convert full-width (Chinese) commas so they do not break matching
            inst_clean = inst_clean.replace(",", ",")
            inst_clean = inst_clean.replace(",", " , ").strip()
            # Whether this institution matched a company
            matched = False
            # Try every entry in the company dictionary
            for company in company_dict:
                # Does the company name contain Latin letters?
                if re.search(r'[a-zA-Z]', company):
                    # English names must be delimited by whitespace or the
                    # string boundaries (prevents partial matches)
                    pattern = r'(^|\s)' + re.escape(company) + r'(\s|$)'
                    if re.search(pattern, inst_clean):
                        names.append(company)
                        indexes.append(str(company_dict[company]))
                        matched = True
                        break
                else:
                    # Chinese names: plain substring test
                    if company in inst_clean:
                        names.append(company)
                        indexes.append(str(company_dict[company]))
                        matched = True
                        break
            # No company matched this institution
            if not matched:
                names.append("")  # empty string marks "no match"
                indexes.append("无")  # "无" marks "no match"
        # Join this row's matches into strings
        company_names = ";".join([n for n in names if n])  # drop empty names
        company_indexes = ";".join(indexes)
        company_names_list.append(company_names)
        company_indexes_list.append(company_indexes)
    # Update the DataFrame
    df["知名企业名称(参考知名企业列表)"] = company_names_list
    df["引文机构在知名企业中的索引"] = company_indexes_list
    # Store the processed sheet
    results[sheet_name] = df

# Load the source file (used to carry over formatting)
wb = load_workbook(input_file)
# Create the new Excel file
with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
    # Process each worksheet
    for sheet_name, df in results.items():
        # Source worksheet
        source_sheet = wb[sheet_name]
        # New worksheet
        dest_sheet = writer.book.create_sheet(sheet_name)
        # Copy the content and font of cell A1 (usually the title)
        source_cell = source_sheet.cell(row=1, column=1)
        dest_cell = dest_sheet.cell(row=1, column=1)
        dest_cell.value = source_cell.value
        if source_cell.font.color:
            # openpyxl style objects are immutable, so assign a copy of the
            # whole font (its color included) instead of setting attributes
            dest_cell.font = copy(source_cell.font)
        # Write the DataFrame into the worksheet, starting at row 2
        for r_idx, row in enumerate(dataframe_to_rows(df, index=False, header=True), start=2):
            for c_idx, value in enumerate(row, start=1):
                dest_cell = dest_sheet.cell(row=r_idx, column=c_idx)
                dest_cell.value = value
    # Make the same worksheet active as in the source file
    if wb.active:
        writer.book.active = writer.book[wb.active.title]
\ No newline at end of file
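# Usage
Set `input_file_path` and `output_file_path` in `main.py`.
The output Excel file contains three columns: `牛人` (notable researcher), `牛人署名顺序` (the notable researcher's position in the author list), and `疑似牛人` (suspected notable researcher). A suspected entry has the format `suspected researcher's name(index of that researcher in info/new_niuren_format-merged_turing.csv)`.
Make a first-pass judgment on whether each suspected person is a notable researcher. If you think so, ask Li Hui (李慧老师) to review it; she will add the person to the global notable-researcher table. Please do not modify the global table yourself.
For problems with running the script, contact Ma Tianyun (马天云).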
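
When reviewing the `疑似牛人` column, it can help to split an entry back into its name and CSV index. The following is a minimal sketch, assuming each entry has exactly the `name(index)` shape described above with an integer index; `parse_suspect` and `SUSPECT_PATTERN` are hypothetical helpers, not part of `main.py`, and accepting full-width parentheses is an extra assumption.

```python
import re

# name, then an integer index in half-width or full-width parentheses
SUSPECT_PATTERN = re.compile(
    r"^\s*(?P<name>.+?)\s*[(\uff08](?P<index>\d+)[)\uff09]\s*$"
)


def parse_suspect(entry):
    """Return (name, index) for an entry shaped like 'name(index)', else None."""
    match = SUSPECT_PATTERN.match(entry)
    if match is None:
        return None
    return match.group("name"), int(match.group("index"))


# Placeholder entry, for illustration only
parsed = parse_suspect("Jane Doe(42)")
```

The returned index can then be used to look up the corresponding row in info/new_niuren_format-merged_turing.csv before asking for a review.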
\ No newline at end of file
This source diff could not be displayed because it is too large. You can view the blob instead.