Commit d6a62c92 by jiangdongchen

extract journal and session

parent 7e4247f9
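# Environment setup
- Make sure the Python scripts are run with the cwd set to the papertools repository folder
- All paths and parameters are configured in config.json
  - api_key
    - The current key was earned by Dongchen himself by running ads on Zhihu and has a quota of only 100 yuan, so please use your own key whenever possible
    - If you use a key for a different API, remember to adjust how the OpenAI client is called. SiliconFlow (硅基流动) is recommended here, because that is what I used to get everything working
  - base_url
    - URL of the API endpoint
  - pdf_dir
    - Folder containing the paper PDFs
  - result_dir
    - Folder where the key-information JSON files are written
  - source_excel_path
    - The Excel workbook to be checked
      - Actual entries start at row 4
      - Column 1: index
      - Column 3: paper title
      - Column 7: paper authors
  - target_excel_path
    - The formatted output workbook
  - logLevel
    - 10 means DEBUG level
    - 20 means INFO level
  - tableNum: number of worksheets to process
  - sheetNum: number of worksheets to process
  - maxItem: maximum number of entries per worksheet
- python3.12.10
- python3.12
- Install any library that fails to import one at a time with pip install
  - `openai`, `pypdf`
  - `python-Levenshtein`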
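As a rough illustration of how the keys listed above might be consumed, here is a minimal sketch that loads config.json and fails early when a documented key is absent. The flat layout, the `REQUIRED_KEYS` list, and the `load_config` helper are assumptions made for this example; the project's real config.json may nest some values (the code also reads a `model` entry) and may load them differently.

```python
import json
import logging

# Keys named in the README list; treating them all as required is an assumption.
REQUIRED_KEYS = [
    "api_key", "base_url", "pdf_dir", "result_dir",
    "source_excel_path", "target_excel_path",
    "logLevel", "sheetNum", "maxItem",
]


def load_config(path="config.json"):
    """Read the JSON config and raise if a documented key is missing."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing keys: {missing}")
    # logLevel follows the numeric levels of the logging module (10 = DEBUG, 20 = INFO).
    logging.basicConfig(level=int(config["logLevel"]))
    return config
```

Checking the keys up front gives a clearer error than a `KeyError` raised halfway through processing a workbook.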
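# Usage
- Check that the parameters in config.json are set correctly so the program can find the files and settings it needs
  - Default configuration
    - Paper PDFs are placed under Papers/sheetname, one folder per sheet
    - The Excel workbook to be checked goes in the others folder
    - The output workbook goes in the target folder; PDFs are renamed in place to standardized names
- Run the program with python main.py
- While the program is running, do not open the target Excel file; otherwise the two will contend for file access and an error will occur
- Cross-validation with multiple models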
......@@ -28,16 +47,20 @@
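   2. Output the entries that could not be downloaded
2. Automated information extraction and formatting
   1. Read the configuration object from config.json
   2. Iterate over the worksheets in the Excel workbook
      1. Read the paper titles and indexes from the Excel sheet
      2. Loop:
         1. Read the paper title and key information from the PDF and store them in the json folder
         2. Fuzzy-match against the paper titles in the Excel sheet
         3. On a successful match
            1. Use the paper title from the PDF and the index to give standardized names to the PDF file and to the paper title in the Excel sheet
            2. Write the key information from the PDF into the Excel sheet, including author names, institutions, and countries
         4. On a failed match, output the unmatched entry
            1. Log unmatched entries with warning so they can be handled later
   2. **Iterate** over the sheets in the Excel workbook
      1. **Iterate** over the paper titles and indexes in the sheet
         1. Use the large language model to read the paper title and key information from the first page of the PDF and store them in the json folder
         2. **Iterate** over the paper titles in the Excel sheet and fuzzy-match against them
            1. On a successful match
               1. Use the paper title from the PDF and the index to give standardized names to the PDF file and to the paper title in the Excel sheet
               2. Write the key information from the PDF into the Excel sheet, including
                  - Title
                  - Conference name
                  - Author names
                  - Institutions
                  - Countries
            2. On a failed match, output the unmatched entry
               1. Log unmatched entries with warning so they can be handled later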
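
The matching and renaming steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard-library `difflib.SequenceMatcher` as a stand-in for the `fuzzywuzzy` scorer the project imports, and the helper names `similarity`, `match_title`, and `safe_pdf_name` are hypothetical. The threshold of 85 and the character replacements mirror the values visible in the diff.

```python
from difflib import SequenceMatcher

THRESHOLD = 85  # same cut-off as the `similarity >= 85` check in the diff


def similarity(a, b):
    """Return a 0-100 similarity score for two titles, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_title(pdf_title, excel_titles):
    """Return (index, title) of the best Excel match, or None below the threshold."""
    best, best_score = None, -1
    for idx, title in excel_titles.items():
        score = similarity(pdf_title, title)
        if score > best_score:
            best, best_score = (idx, title), score
    return best if best_score >= THRESHOLD else None


def safe_pdf_name(idx, title):
    """Build the standardized file name: ':', ' ' and '?' become underscores."""
    cleaned = title.replace(":", "_").replace(" ", "_").replace("?", "_")
    return f"{idx}-{cleaned}.pdf"
```

A caller would log a warning when `match_title` returns `None` and otherwise rename the PDF to `safe_pdf_name(idx, title)`.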
# Code structure
1. The psrc folder contains the library functions
......
# crawler
## update
Updates Prof. Chen's publication list `yunjichen.json`; `main.py` later relies on this file to look up citing papers.
## main
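Create `papers.txt` and add the titles of the papers you want to look up, one per line. Entering the full title is not recommended; part of the title is enough, as shown below: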
```
Reproducing Concurrency Bugs Using Local Clocks
binary translator with post-optimization
timing error mitigation for hardware neural networks
A Polyvalent Machine Learning Accelerator
```
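URLs that have already been crawled are recorded in `urls.txt` to avoid crawling them again. `main.py` can be run repeatedly; if crawling fails, check your network connection.
### 2. **Run the script**
Make sure the required dependencies are installed in your environment (such as `pandas`, `openpyxl`, `tqdm`).
Run from the command line: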
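
To show why a partial title is enough, here is a minimal sketch of case-insensitive substring matching against a list of known publication titles. The `resolve_titles` helper and the sample titles are hypothetical; the crawler's own lookup runs inside `main.py` against the titles stored in `yunjichen.json`.

```python
def resolve_titles(queries, known_titles):
    """Map each partial title to the first full title containing it (case-insensitive)."""
    resolved = {}
    for query in queries:
        needle = query.strip().lower()
        if not needle:
            # A blank line would match every title, so treat it as unresolved.
            resolved[query] = None
            continue
        resolved[query] = next(
            (title for title in known_titles if needle in title.lower()), None
        )
    return resolved
```

Because the first containing title wins, a fragment that is too short can resolve to the wrong paper, so a distinctive part of the title works best.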
```bash
python main.py
```
**Specify a year** (crawl only the citations from that year):
```bash
python main.py --year 2023
```
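### 3. **View the results**
After crawling, `citations.xlsx` is generated under `results/`, with one sheet per paper containing the Title, URL, author information, and other details of the citing papers.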
------
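## **Parameters**
- `--year`: restricts crawling to citations from the given year. Without this parameter, citations from **all years** are crawled.
**Usage examples:**
- Crawl everything (default): `python main.py`
- Crawl only citations from 2022: `python main.py --year 2022`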
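
For readers curious how a single-year filter can be expressed in the result URL, the sketch below sets both year bounds to the same value and adds the page offset. It is an illustration only: the `page_url` helper is hypothetical and builds the query with `urllib.parse`, whereas the crawler itself assembles the URL by string concatenation; `as_ylo`, `as_yhi`, and `start` are the parameter names that appear in the crawler's URLs.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def page_url(citedby_url, start, year=None):
    """Return the cited-by URL for one result page, optionally limited to one year."""
    parts = urlparse(citedby_url)
    query = parse_qs(parts.query)
    query["start"] = [str(start)]  # offset of the first result on this page
    if year is not None:
        # Lower and upper year bounds are both set to the requested year.
        query["as_ylo"] = [str(year)]
        query["as_yhi"] = [str(year)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))
```

Calling `page_url(url, 10, 2023)` yields the second page of results restricted to 2023, while omitting `year` leaves the year bounds out entirely.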
\ No newline at end of file
import argparse
import json
import os
import random
import re
import string
import time
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
JSON_FILE = "yunjichen.json"
PUB_FILE = "papers.txt"
URL_FILE = "urls.txt"
DATA_FILE = "citations.xlsx"
def get_cited_url_list(citedby_url, year=None):
    prefix, suffix = citedby_url.split("oi=bibs&hl=en")
    prefix += "start="
    if year:
        # Use the year parameter for filtering
        suffix = f"&hl=en&as_sdt=2005&sciodt=0,5{suffix}&scipsc=&as_ylo={year}&as_yhi={year}&scisbd="
    else:
        # No year filter, fetch all
        suffix = f"&hl=en&as_sdt=2005&sciodt=0,5{suffix}&scipsc="
    for i in range(0, 10000, 10):
        yield prefix + str(i) + suffix
SYMBOL_MORE_AUTHORS = "…"
# Add request headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
def parse_scholar_results(html_content):
    """
    Parses HTML content to extract Google Scholar results.

    Args:
        html_content (str): The HTML content to parse.

    Returns:
        list: A list of dictionaries containing '标题' (title), '作者' (authors),
            and '期刊信息' (journal info).
    """
    soup = BeautifulSoup(html_content, "html.parser")
    results = soup.find_all("div", class_="gs_r")
    extracted_data = []
    for result in results:
        # Extract title
        title_tag = result.find(["h2", "h3"], class_="gs_rt")
        if title_tag:
            a_tag = title_tag.find("a")
            if a_tag:
                title = a_tag.get_text()
            else:
                title = title_tag.get_text()
                title = title.replace("[CITATION][C] ", "")
        else:
            title = "未找到标题"  # "title not found"
        # Extract authors and journal information
        authors_journal_tag = result.find("div", class_="gs_a")
        if authors_journal_tag:
            authors_journal_text = authors_journal_tag.get_text()
            parts = authors_journal_text.split("-")
            if len(parts) >= 2:
                authors = parts[0].strip()
                journal_info = parts[1].split(",")[0].strip()
            else:
                authors = "格式不符合预期"  # "unexpected format"
                journal_info = "格式不符合预期"
        else:
            authors = "未找到作者和期刊信息"  # "authors and journal info not found"
            journal_info = "未找到作者和期刊信息"
        # Add extracted data to list
        extracted_data.append(
            {"标题": title, "作者": authors, "期刊信息": journal_info}
        )
    return extracted_data
def parse_html(url):
    """
    Fetch one result page and parse the papers listed on it.

    return:
        ok: bool, False if the request failed or the page lists no papers
        results: list of (title, new_url, authors, more_authors) tuples
            title: str
            new_url: url to the paper
            authors: str, author names joined with ";"
            more_authors: 1 if there are more authors than shown, else 0
    """
    # print(f"Fetching {url}")
    results = []
    session = requests.Session()
    response = session.get(url, headers=headers)
    if response.status_code != 200:
        # print("Failed to get the page")
        return False, results
    soup = BeautifulSoup(response.content, "html.parser")
    papers = soup.find_all("div", class_="gs_r gs_or gs_scl")
    if len(papers) == 0:
        return False, results
    for paper in papers:
        div_title = paper.find("h3", class_="gs_rt")
        try:
            title = div_title.find("a").get_text()
            new_url = div_title.find("a")["href"]
        except (AttributeError, KeyError, TypeError):
            # Entries without a link (span blocks) are skipped
            continue
        authors = paper.find("div", class_="gs_a").get_text().split("-")[0].split(",")
        authors[-1] = " ".join(authors[-1].split())
        authors = ";".join(authors)
        # authors is a string here, so check whether it ends with the ellipsis
        more_authors = 1 if authors.endswith(SYMBOL_MORE_AUTHORS) else 0
        results.append((title, new_url, authors, more_authors))
    return True, results
def add_scisbd_sort(url):
    """Ensure the URL contains scisbd=1 so results are sorted by date."""
    parsed_url = urlparse(url)
    query = parse_qs(parsed_url.query)
    query["scisbd"] = ["1"]
    new_query = urlencode(query, doseq=True)
    new_url = urlunparse(
        (parsed_url.scheme, parsed_url.netloc, parsed_url.path, "", new_query, "")
    )
    return new_url
def main(publications, year):
    # Create the URL record file on the first run
    if not os.path.exists(URL_FILE):
        with open(URL_FILE, "w") as file:
            file.write("")
    publications = [pub.lower() for pub in publications]
    with open(JSON_FILE, "r") as file:
        author = json.load(file)
    author_publications = [
        pub for pub in author["publications"] if pub["container_type"] == "Publication"
    ]
    titles = [pub["bib"]["title"] for pub in author_publications]
    # Resolve each partial title to the first full title that contains it
    index_publications = []
    for i, publication in enumerate(publications):
        found = False
        for idx, title in enumerate(titles):
            if publication in title.lower():
                found = True
                index_publications.append(idx)
                publications[i] = title
                break
        if not found:
            index_publications.append(None)
    for idx, publication in enumerate(publications):
        with open(URL_FILE, "a+") as file:
            file.seek(0)
            know_urls = set([line.strip() for line in file.readlines()])
        # Sheet name: first 20 characters of the title with punctuation removed
        name = "_".join(publication.split())
        name = re.sub(f"[{re.escape(string.punctuation)}]", "", name[:20])
        print(f"Processing {publication}...to {name}")
        columns = [
            "paper idx",
            "paper Title",
            "Cite idx",
            "Cite Title",
            "URL",
            "Authors",
            "More Authors",
        ]
        if os.path.exists(DATA_FILE) and name in pd.ExcelFile(DATA_FILE).sheet_names:
            old_df = pd.read_excel(DATA_FILE, sheet_name=name)
            data = (
                old_df
                if set(old_df.columns) == set(columns)
                else pd.DataFrame(columns=columns)
            )
            current_start_idx = old_df["Cite idx"].max() if not old_df.empty else 0
        else:
            data = pd.DataFrame(columns=columns)
            current_start_idx = 0
        index_publication = index_publications[idx]
        if index_publication is None:
            continue
        citations = author_publications[index_publication]["num_citations"]
        citedby_url = author_publications[index_publication]["citedby_url"]
        citation_count = current_start_idx
        for i, url in tqdm(enumerate(get_cited_url_list(citedby_url, year))):
            # url = add_scisbd_sort(url)
            if url in know_urls:
                continue
            if i * 10 > citations:
                break
            # Random delay between requests
            time.sleep(random.uniform(0.5, 2.0))
            mark, results = parse_html(url)
            if not mark:
                break
            for rec in results:
                citation_count += 1
                title = rec[0] if len(rec) > 0 else ""
                cite_url = rec[1] if len(rec) > 1 else ""
                authors = rec[2] if len(rec) > 2 else ""
                more_authors = rec[3] if len(rec) > 3 else ""
                row = {
                    "paper idx": idx + 1,
                    "paper Title": publication,
                    "Cite idx": citation_count,
                    "Cite Title": title,
                    "URL": cite_url,
                    "Authors": authors,
                    "More Authors": more_authors,
                }
                # pd.concat is the public replacement for the private DataFrame._append
                data = pd.concat([data, pd.DataFrame([row])], ignore_index=True)
            know_urls.add(url)
        if os.path.exists(DATA_FILE):
            with pd.ExcelWriter(
                DATA_FILE, mode="a", if_sheet_exists="replace", engine="openpyxl"
            ) as writer:
                data.to_excel(writer, sheet_name=name, index=False)
        else:
            data.to_excel(DATA_FILE, sheet_name=name, index=False)
        with open(URL_FILE, "w") as file:
            file.writelines([url + "\n" for url in know_urls])
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Crawl citations with optional year filtering"
    )
    parser.add_argument(
        "--year", type=int, help="The year to filter citations by (optional)"
    )
    args = parser.parse_args()
    # Load the publication file
    with open(PUB_FILE, "r", encoding="utf-8") as file:
        publications = file.readlines()
    publications = [pub.strip().lower() for pub in publications]
    main(publications, args.year)
anNao: A Machine-Learning Supercompute
from scholarly import scholarly
from scholarly import ProxyGenerator
import json
# Activates proxy because Google Scholar otherwise might block the IP address
pg = ProxyGenerator()
scholarly.use_proxy(pg, pg)
def main(name):
    author = next(scholarly.search_author(name))
    author = scholarly.fill(author, sections=["publications"])
    # Use a context manager so the output file is closed after writing
    with open(f"{name}.json", "w") as file:
        json.dump(author, file, indent=4)


if __name__ == "__main__":
    main("yunji chen")
\ No newline at end of file
......@@ -15,8 +15,6 @@ if __name__ == "__main__":
# The / operator on a Path object joins path components
# print(excel_path)
# Create the log directory
log_dir = cwd_dir / "logs"
log_dir.mkdir(exist_ok=True)
......
......@@ -6,7 +6,7 @@ import openpyxl
from fuzzywuzzy import fuzz
import json
def get_authors( content, configModel, client):
def get_key_info( content, configModel, client):
system_prompt = """
Act as an expert metadata extraction assistant.
Analyze the following text, which is extracted from the first page of a document (likely a scientific paper or report).
......@@ -21,9 +21,37 @@ def get_authors( content, configModel, client):
- Extract all associated institutions of authors.
- **Countrys:**
- Extract all associated countrys of authors.
- **ISSUE:**
- Extract where the paper is published like journal or session.
- Title, authors, institutions and countrys should be four separate keys, not nested together.
- Use highcase for first letter of key.
- **Handling Missing Data:** If no data of a field can be identified in the text, the field in the JSON should be an empty list `[]`.
Example Output:
{
"Title": "Laius: Towards Latency Awareness and Improved Utilization of Spatial Multitasking Accelerators in Datacenters",
"Authors": [
"Quan Chen",
"Daniel Edward Mawhirter",
"Bo Wu",
"Chao Li",
],
"Institutions": [
"Shanghai Jiao Tong University",
"Colorado School of Mines",
"Colorado School of Mines",
"Shanghai Jiao Tong University",
],
"Countrys": [
"China",
"United States",
"United States",
"China",
],
"ISSURE": [
"IEEE Transactions on Computers"
]
}
"""
response = client.chat.completions.create(
......@@ -130,7 +158,7 @@ def citationProcess(config: dict):
configModel = config["model"]
# Extract key information
result = get_authors(first_page_text, configModel, client)
result = get_key_info(first_page_text, configModel, client)
if result is not None:
# Parse the JSON result and extract the paper title
......@@ -149,7 +177,7 @@ def citationProcess(config: dict):
if similarity >= 85:
# Rename the PDF file
new_pdf_name = f"{idx}-{pdf_title.replace(':', '_')}.pdf" # replace colons with underscores
new_pdf_name = f"{idx}-{pdf_title.replace(':', '_').replace(' ', '_').replace('?', '_')}.pdf" # replace colons, spaces, and question marks with underscores
new_pdf_path = file.parent / new_pdf_name
try:
file.rename(new_pdf_path)
......
abbr,fullname
PPoPP,ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming
FAST,USENIX Conference on File and Storage Technologies
DAC,Design Automation Conference
HPCA,IEEE International Symposium on High Performance Computer Architecture
MICRO,IEEE/ACM International Symposium on Microarchitecture
SC,"International Conference for High Performance Computing, Networking, Storage, and Analysis"
ASPLOS,International Conference on Architectural Support for Programming Languages and Operating Systems
ISCA,International Symposium on Computer Architecture
USENIX ATC,USENIX Annual Technical Conference
EuroSys,European Conference on Computer Systems
SIGCOMM,"ACM International Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication"
MobiCom,ACM International Conference on Mobile Computing and Networking
INFOCOM,IEEE International Conference on Computer Communications
NSDI,Symposium on Network System Design and Implementation
CCS,ACM Conference on Computer and Communications Security
EUROCRYPT,International Conference on the Theory and Applications of Cryptographic Techniques
S&P,IEEE Symposium on Security and Privacy
CRYPTO,International Cryptology Conference
USENIX Security,USENIX Security Symposium
NDSS,Network and Distributed System Security Symposium
PLDI,ACM SIGPLAN Conference on Programming Language Design and Implementation
POPL,ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
FSE,ACM International Conference on the Foundations of Software Engineering
SOSP,ACM Symposium on Operating Systems Principles
OOPSLA,"Conference on Object-Oriented Programming Systems, Languages, and Applications"
ASE,International Conference on Automated Software Engineering
ICSE,International Conference on Software Engineering
ISSTA,International Symposium on Software Testing and Analysis
OSDI,USENIX Symposium on Operating Systems Design and Implementation
FM,International Symposium on Formal Methods
SIGMOD,ACM SIGMOD Conference
SIGKDD,ACM SIGKDD Conference on Knowledge Discovery and Data Mining
ICDE,IEEE International Conference on Data Engineering
SIGIR,International ACM SIGIR Conference on Research and Development in Information Retrieval
VLDB,International Conference on Very Large Data Bases
STOC,ACM Symposium on Theory of Computing
SODA,ACM-SIAM Symposium on Discrete Algorithms
CAV,International Conference on Computer Aided Verification
FOCS,IEEE Annual Symposium on Foundations of Computer Science
LICS,ACM/IEEE Symposium on Logic in Computer Science
ACM MM,ACM International Conference on Multimedia
SIGGRAPH,ACM Special Interest Group on Computer Graphics
VR,IEEE Virtual Reality
IEEE VIS,IEEE Visualization Conference
AAAI,AAAI Conference on Artificial Intelligence
NeurIPS,Conference on Neural Information Processing Systems
ACL,Annual Meeting of the Association for Computational Linguistics
CVPR,IEEE/CVF Computer Vision and Pattern Recognition Conference
ICCV,International Conference on Computer Vision
ICML,International Conference on Machine Learning
IJCAI,International Joint Conference on Artificial Intelligence
CSCW,ACM Conference on Computer Supported Cooperative Work and Social Computing
CHI,ACM Conference on Human Factors in Computing Systems
UbiComp/IMWUT,"ACM international joint conference on Pervasive and Ubiquitous Computing/ Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies"
UIST,ACM Symposium on User Interface Software and Technology
WWW,International World Wide Web Conference
RTSS,IEEE Real-Time Systems Symposium
WINE,Conference on Web and Internet Economics
TOCS,ACM Transactions on Computer Systems
TOS,ACM Transactions on Storage
TCAD,IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
TC,IEEE Transactions on Computers
TPDS,IEEE Transactions on Parallel and Distributed Systems
TACO,ACM Transactions on Architecture and Code Optimization
JSAC,IEEE Journal on Selected Areas in Communications
TMC,IEEE Transactions on Mobile Computing
TON,IEEE/ACM Transactions on Networking
TDSC,IEEE Transactions on Dependable and Secure Computing
TIFS,IEEE Transactions on Information Forensics and Security
,Journal of Cryptology
TOPLAS,ACM Transactions on Programming Languages and Systems
TOSEM,ACM Transactions on Software Engineering and Methodology
TSE,IEEE Transactions on Software Engineering
TSC,IEEE Transactions on Services Computing
TODS,ACM Transactions on Database Systems
TOIS,ACM Transactions on Information Systems
TKDE,IEEE Transactions on Knowledge and Data Engineering
VLDBJ,The VLDB Journal
TIT,IEEE Transactions on Information Theory
IANDC,Information and Computation
SICOMP,SIAM Journal on Computing
TOG,ACM Transactions on Graphics
TIP,IEEE Transactions on Image Processing
TVCG,IEEE Transactions on Visualization and Computer Graphics
AI,Artificial Intelligence
TPAMI,IEEE Transactions on Pattern Analysis and Machine Intelligence
IJCV,International Journal of Computer Vision
JMLR,Journal of Machine Learning Research
TOCHI,ACM Transactions on Computer-Human Interaction
IJHCS,International Journal of Human-Computer Studies
JACM,Journal of the ACM
Proc. IEEE,Proceedings of the IEEE
SCIS,Science China Information Sciences
title,venue
Learning to Generalize With Object-Centric Agents in the Open World Survival Game Crafter,"IEEE Transactions on Games ( Volume: 16, Issue: 2, June 2024)"
Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning,"Proceedings of the AAAI Conference on Artificial Intelligence, 3"
"Advancing DRL Agents in Commercial Fighting Games: Training, Integration,
and Agent-Human Alignment",Proceedings of the 41th International Conference on Machine Learning (ICML 2024)
Discovering and Using Structure in Autonomous Machine Learning,ETH Zurich thesis
Jose Luis Flores Campana,博士论文
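# Warning: this script's judgments may be inaccurate, so be sure to check the results manually!
## Usage: python is_ccfa.py
## Function: determine whether a paper belongs to a CCF-A conference
- Input file: `CCF_A_list.csv`, with two columns: `abbr,fullname`
- Input folder: `data/title_venue`, containing several csv files, each with the header row `title,venue`
- Output folder: `data/is_ccfa`; each csv file under `title_venue` is processed and written out under the same file name, with the header row `title,venue,is_ccf_a`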
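
One plausible way to make this judgment is sketched below. It is a heuristic written for illustration, not the logic of `is_ccfa.py`, which is not shown here; the helper names `load_ccf_a` and `is_ccf_a` are hypothetical. A venue counts as CCF-A if it contains a full name from the list, or an abbreviation as a whole word.

```python
import csv
import io
import re


def load_ccf_a(csv_text):
    """Parse the `abbr,fullname` list into (abbr, fullname) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["abbr"].strip(), row["fullname"].strip()) for row in reader]


def is_ccf_a(venue, ccf_list):
    """Heuristic check: full name as a substring, or abbreviation as a whole word."""
    lowered = venue.lower()
    for abbr, fullname in ccf_list:
        if fullname.lower() in lowered:
            return True
        # Skip rows with an empty abbreviation (for example, Journal of Cryptology).
        if abbr and re.search(
            rf"(?<![A-Za-z0-9]){re.escape(abbr)}(?![A-Za-z0-9])", venue
        ):
            return True
    return False
```

Short abbreviations such as `AI`, `SC`, or `TC` can still match unrelated venue strings, which is one reason the manual check called for above remains necessary.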
\ No newline at end of file