research v1.0.0

arxiv

Search and retrieve academic papers through arXiv's free REST API: keyword, author, category, and ID search

Hermes Agent
MIT

arXiv Research

Search and retrieve academic papers from arXiv via their free REST API. No API key, no dependencies — just curl.

Quick Reference

| Action | Command |
|--------|---------|
| Search papers | curl "https://export.arxiv.org/api/query?search_query=all:QUERY&max_results=5" |
| Get specific paper | curl "https://export.arxiv.org/api/query?id_list=2402.03300" |
| Read abstract (web) | web_extract(urls=["https://arxiv.org/abs/2402.03300"]) |
| Read full paper (PDF) | web_extract(urls=["https://arxiv.org/pdf/2402.03300"]) |

Searching Papers

The API returns Atom XML. Parse with grep/sed or pipe through python3 for clean output.

Basic search

curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5"

Clean output (parse XML to readable format)

curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5&sortBy=submittedDate&sortOrder=descending" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom'}
root = ET.parse(sys.stdin).getroot()
for i, entry in enumerate(root.findall('a:entry', ns)):
    title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
    arxiv_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
    published = entry.find('a:published', ns).text[:10]
    authors = ', '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
    summary = entry.find('a:summary', ns).text.strip()[:200]
    cats = ', '.join(c.get('term') for c in entry.findall('a:category', ns))
    print(f'{i+1}. [{arxiv_id}] {title}')
    print(f' Authors: {authors}')
    print(f' Published: {published} | Categories: {cats}')
    print(f' Abstract: {summary}...')
    print(f' PDF: https://arxiv.org/pdf/{arxiv_id}')
    print()
"

Search Query Syntax

| Prefix | Searches | Example |
|--------|----------|---------|
| all: | All fields | all:transformer+attention |
| ti: | Title | ti:large+language+models |
| au: | Author | au:vaswani |
| abs: | Abstract | abs:reinforcement+learning |
| cat: | Category | cat:cs.AI |
| co: | Comment | co:accepted+NeurIPS |

Boolean operators

# AND (default when terms are joined with +, which encodes a space)
search_query=all:transformer+attention

# OR
search_query=all:GPT+OR+all:BERT

# AND NOT
search_query=all:language+model+ANDNOT+all:vision

# Exact phrase (%22 is a URL-encoded double quote)
search_query=ti:%22chain+of+thought%22

# Combined
search_query=au:hinton+AND+cat:cs.LG
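
The prefixes and operators above can also be assembled in code rather than by hand. The following is a minimal sketch using only urllib.parse from the standard library; the helper names field and build_search_url are hypothetical and are not part of the arXiv API or of the helper script.

```python
from urllib.parse import urlencode

API_URL = "https://export.arxiv.org/api/query"


def field(prefix: str, value: str) -> str:
    """Return one search term, e.g. ti:"chain of thought".

    Multi-word values are wrapped in double quotes so they match as a phrase.
    """
    value = value.strip()
    if " " in value:
        value = f'"{value}"'
    return f"{prefix}:{value}"


def build_search_url(query: str, max_results: int = 5) -> str:
    """URL-encode the query: spaces become + and double quotes become %22."""
    params = {"search_query": query, "max_results": max_results}
    # safe=":" keeps the prefix separator readable in the resulting URL.
    return f"{API_URL}?{urlencode(params, safe=':')}"


query = f"{field('au', 'hinton')} AND {field('cat', 'cs.LG')}"
print(build_search_url(query))
# https://export.arxiv.org/api/query?search_query=au:hinton+AND+cat:cs.LG&max_results=5
```

Writing the query with plain spaces and letting urlencode do the escaping avoids hand-encoding mistakes, especially around quoted phrases.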

Sort and Pagination

| Parameter | Options |
|-----------|---------|
| sortBy | relevance, lastUpdatedDate, submittedDate |
| sortOrder | ascending, descending |
| start | Result offset (0-based) |
| max_results | Number of results (default 10; at most 2000 per request, 30000 in total when paging with start) |

# Latest 10 papers in cs.AI
curl -s "https://export.arxiv.org/api/query?search_query=cat:cs.AI&sortBy=submittedDate&sortOrder=descending&max_results=10"
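
Larger result sets are usually fetched page by page with start and max_results. The sketch below only generates the request URLs; the function name page_urls and the page size of 100 are illustrative assumptions, not values prescribed by arXiv.

```python
from urllib.parse import urlencode

API_URL = "https://export.arxiv.org/api/query"


def page_urls(search_query: str, total: int, page_size: int = 100):
    """Yield one request URL per page, advancing the 0-based start offset."""
    for start in range(0, total, page_size):
        params = {
            "search_query": search_query,
            "sortBy": "submittedDate",
            "sortOrder": "descending",
            "start": start,
            # The last page may be shorter than page_size.
            "max_results": min(page_size, total - start),
        }
        yield f"{API_URL}?{urlencode(params, safe=':')}"


for url in page_urls("cat:cs.AI", total=250, page_size=100):
    print(url)
```

When actually fetching these URLs, pause between requests in line with the limits listed under Rate Limits.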

Fetching Specific Papers

# By arXiv ID
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300"

# Multiple papers
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300,2401.12345,2403.00001"

BibTeX Generation

After fetching metadata for a paper, generate a BibTeX entry:

curl -s "https://export.arxiv.org/api/query?id_list=1706.03762" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
root = ET.parse(sys.stdin).getroot()
entry = root.find('a:entry', ns)
if entry is None: sys.exit('Paper not found')
title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
authors = ' and '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
year = entry.find('a:published', ns).text[:4]
raw_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
cat = entry.find('arxiv:primary_category', ns)
primary = cat.get('term') if cat is not None else 'cs.LG'
last_name = entry.find('a:author', ns).find('a:name', ns).text.split()[-1]
print(f'@article{{{last_name}{year}_{raw_id.replace(\".\", \"\")},')
print(f' title = {{{title}}},')
print(f' author = {{{authors}}},')
print(f' year = {{{year}}},')
print(f' eprint = {{{raw_id}}},')
print(f' archivePrefix = {{arXiv}},')
print(f' primaryClass = {{{primary}}},')
print(f' url = {{https://arxiv.org/abs/{raw_id}}}')
print('}')
"

Reading Paper Content

After finding a paper, read it:

# Abstract page (fast, metadata + abstract)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper (PDF → markdown via Firecrawl)
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

For local PDF processing, see the ocr-and-documents skill.

Common Categories

| Category | Field |
|----------|-------|
| cs.AI | Artificial Intelligence |
| cs.CL | Computation and Language (NLP) |
| cs.CV | Computer Vision |
| cs.LG | Machine Learning |
| cs.CR | Cryptography and Security |
| stat.ML | Machine Learning (Statistics) |
| math.OC | Optimization and Control |
| physics.comp-ph | Computational Physics |

Full list: https://arxiv.org/category_taxonomy

Helper Script

The scripts/search_arxiv.py script handles XML parsing and provides clean output:

python scripts/search_arxiv.py "GRPO reinforcement learning"
python scripts/search_arxiv.py "transformer attention" --max 10 --sort date
python scripts/search_arxiv.py --author "Yann LeCun" --max 5
python scripts/search_arxiv.py --category cs.AI --sort date
python scripts/search_arxiv.py --id 2402.03300
python scripts/search_arxiv.py --id 2402.03300,2401.12345

No dependencies — uses only Python stdlib.


Semantic Scholar (Citations, Related Papers, Author Profiles)

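arXiv doesn't provide citation data or recommendations. The Semantic Scholar API covers both: it is free, returns JSON, and works without a key for basic use. Unauthenticated traffic is rate-limited, so keep to roughly 1 request per second and expect occasional HTTP 429 responses.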

Get paper details + citations

# By arXiv ID
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300?fields=title,authors,citationCount,referenceCount,influentialCitationCount,year,abstract" | python3 -m json.tool

# By Semantic Scholar paper ID or DOI
curl -s "https://api.semanticscholar.org/graph/v1/paper/DOI:10.1234/example?fields=title,citationCount"

Get citations OF a paper (who cited it)

curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/citations?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool

Get references FROM a paper (what it cites)

curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/references?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool
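
Raw JSON from the citations and references endpoints is easier to scan once it is ranked. The sketch below assumes the response wraps each paper as {"data": [{"citingPaper": {...}}]} for citations and uses "citedPaper" for references; the function name top_cited is hypothetical and the sample payload is made up for illustration.

```python
import json


def top_cited(payload: dict, key: str = "citingPaper", n: int = 3) -> list[tuple[int, str]]:
    """Return (citationCount, title) pairs sorted by citation count, highest first."""
    papers = [item[key] for item in payload.get("data", []) if item.get(key)]
    # A missing or null citationCount is treated as 0.
    ranked = sorted(papers, key=lambda p: p.get("citationCount") or 0, reverse=True)
    return [(p.get("citationCount") or 0, p.get("title", "")) for p in ranked[:n]]


# Made-up sample in the assumed response shape, not real citation data.
sample = json.loads("""
{"data": [
  {"citingPaper": {"title": "Paper A", "year": 2024, "citationCount": 12}},
  {"citingPaper": {"title": "Paper B", "year": 2025, "citationCount": 40}},
  {"citingPaper": {"title": "Paper C", "year": 2024, "citationCount": null}}
]}
""")

for count, title in top_cited(sample):
    print(f"{count:>5}  {title}")
```

For the references endpoint, pass key="citedPaper" under the same assumption about the response shape.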

Search papers (alternative to arXiv search, returns JSON)

curl -s "https://api.semanticscholar.org/graph/v1/paper/search?query=GRPO+reinforcement+learning&limit=5&fields=title,authors,year,citationCount,externalIds" | python3 -m json.tool

Get paper recommendations

curl -s -X POST "https://api.semanticscholar.org/recommendations/v1/papers/" \
-H "Content-Type: application/json" \
-d '{"positivePaperIds": ["arXiv:2402.03300"], "negativePaperIds": []}' | python3 -m json.tool
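
The same POST can be prepared from Python with the standard library. This sketch mirrors the curl command above and only builds the request object without sending it; the function name build_recommendation_request is hypothetical.

```python
import json
import urllib.request

RECS_URL = "https://api.semanticscholar.org/recommendations/v1/papers/"


def build_recommendation_request(positive_ids, negative_ids=()):
    """Build (but do not send) the POST request for the recommendations endpoint."""
    body = {
        "positivePaperIds": list(positive_ids),
        "negativePaperIds": list(negative_ids),
    }
    return urllib.request.Request(
        RECS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_recommendation_request(["arXiv:2402.03300"])
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req, timeout=30), then json.load() the response.
```

Keeping request construction separate from sending makes it easy to inspect the payload before spending a rate-limited call.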

Author profile

curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=Yann+LeCun&fields=name,hIndex,citationCount,paperCount" | python3 -m json.tool

Useful Semantic Scholar fields

title, authors, year, abstract, citationCount, referenceCount, influentialCitationCount, isOpenAccess, openAccessPdf, fieldsOfStudy, publicationVenue, externalIds (contains arXiv ID, DOI, etc.)


Complete Research Workflow

  • Discover: python scripts/search_arxiv.py "your topic" --sort date --max 10
  • Assess impact: curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID?fields=citationCount,influentialCitationCount"
  • Read abstract: web_extract(urls=["https://arxiv.org/abs/ID"])
  • Read full paper: web_extract(urls=["https://arxiv.org/pdf/ID"])
  • Find related work: curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID/references?fields=title,citationCount&limit=20"
  • Get recommendations: POST to Semantic Scholar recommendations endpoint
  • Track authors: curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=NAME"

Rate Limits

| API | Rate | Auth |
|-----|------|------|
| arXiv | ~1 req / 3 seconds | None needed |
| Semantic Scholar | ~1 req / second | None needed (an API key gives a dedicated limit) |
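
One way to stay within these limits in a script is a small throttle that spaces out calls. This is a minimal sketch, not part of either API; the class name Throttle is hypothetical, and the clock and sleep parameters exist only so the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Block so that consecutive wait() calls are at least `interval` seconds apart."""

    def __init__(self, interval: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first one

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay


arxiv_throttle = Throttle(3.0)  # about 1 request every 3 seconds
s2_throttle = Throttle(1.0)     # about 1 request per second

# Call wait() immediately before each request to the matching API.
arxiv_throttle.wait()  # the first call returns without sleeping
```

A separate Throttle per API keeps a slow arXiv loop from delaying Semantic Scholar calls.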

Notes

  • arXiv returns Atom XML — use the helper script or parsing snippet for clean output
  • Semantic Scholar returns JSON — pipe through python3 -m json.tool for readability
  • arXiv IDs: old format (hep-th/0601001) vs new (2402.03300)
  • PDF: https://arxiv.org/pdf/{id} — Abstract: https://arxiv.org/abs/{id}
  • HTML (when available): https://arxiv.org/html/{id}
  • For local PDF processing, see the ocr-and-documents skill

ID Versioning

  • arxiv.org/abs/1706.03762 always resolves to the latest version
  • arxiv.org/abs/1706.03762v1 points to a specific immutable version
  • When generating citations, preserve the version suffix you actually read to prevent citation drift (a later version may substantially change content)
  • The API <id> field returns the versioned URL (e.g., http://arxiv.org/abs/1706.03762v7)
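
Separating the base ID from its version suffix makes it straightforward to record exactly which version was read. The regular expression below is a simplified sketch covering the two ID formats mentioned in the Notes; it is an assumption for illustration, not arXiv's official grammar, and the function name split_arxiv_id is hypothetical.

```python
import re

# New-style IDs: YYMM.NNNN or YYMM.NNNNN; old-style: archive[.SUBJ]/YYMMNNN.
_ID_RE = re.compile(
    r"^(?P<base>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v(?P<version>\d+))?$"
)


def split_arxiv_id(text: str):
    """Return (base_id, version) from an ID or abs/pdf URL; version is None if absent."""
    # Strip a URL prefix such as http://arxiv.org/abs/ if one is present.
    candidate = re.sub(r"^https?://(?:export\.)?arxiv\.org/(?:abs|pdf)/", "", text.strip())
    match = _ID_RE.match(candidate)
    if match is None:
        raise ValueError(f"not an arXiv ID: {text!r}")
    version = match.group("version")
    return match.group("base"), int(version) if version else None


print(split_arxiv_id("http://arxiv.org/abs/1706.03762v7"))  # ('1706.03762', 7)
print(split_arxiv_id("hep-th/0601001"))                      # ('hep-th/0601001', None)
```

A version of None means the identifier resolves to whatever the latest version is at read time.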

Withdrawn Papers

Papers can be withdrawn after submission. When this happens:

  • The <summary> field contains a withdrawal notice (look for "withdrawn" or "retracted")
  • Metadata fields may be incomplete
  • Always check the summary before treating a result as a valid paper
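
The summary check can be automated as a rough filter. The sketch below is a keyword heuristic only: it can flag a legitimate abstract that happens to contain one of the marker words. The function name looks_withdrawn is hypothetical and the two-entry feed is made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://www.w3.org/2005/Atom"}
MARKERS = ("withdrawn", "retracted")


def looks_withdrawn(entry: ET.Element) -> bool:
    """True if the entry's summary mentions one of the withdrawal markers."""
    summary = entry.findtext("a:summary", default="", namespaces=NS).lower()
    return any(marker in summary for marker in MARKERS)


# Made-up feed for illustration; real responses carry many more fields.
feed = ET.fromstring("""
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Paper A</title><summary>We study X.</summary></entry>
  <entry><title>Paper B</title><summary>This paper has been withdrawn by the author.</summary></entry>
</feed>
""")

valid = [
    entry.findtext("a:title", namespaces=NS)
    for entry in feed.findall("a:entry", NS)
    if not looks_withdrawn(entry)
]
print(valid)  # ['Paper A']
```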

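Related Skills

blogwatcher

Monitor blogs and RSS/Atom feeds with blogwatcher-cli: scan for new posts and track read status

llm-wiki

Karpathy's LLM Wiki: build and maintain a persistent, interlinked Markdown knowledge base

polymarket

Query Polymarket prediction market data: market search, prices, order books, price history

research-paper-writing

End-to-end ML/AI research paper writing: from experiment design through analysis, drafting, revision, and submission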