# Searching for Research Papers in arXiv

This notebook demonstrates how to interact with the `arXiv` API using `dapr-agents`, specifically through the `ArxivFetcher` class. We will explore:

* How to search for papers using advanced query strings.
* How to filter results by date (e.g., last 15 days).
* How to retrieve metadata for papers.
* How to download the top 5 papers for further exploration.
* How to extract and process text from the downloaded PDFs, with each page stored as a separate document.

## Install Required Libraries
Before starting, ensure the required libraries are installed:

In [None]:
!pip install dapr-agents python-dotenv arxiv

## Initialize Logging

In [1]:
import logging
logging.basicConfig(level=logging.INFO)

## Importing Necessary Modules

Import the required module and set up the `ArxivFetcher` to start searching for papers.

In [2]:
from dapr_agents.document import ArxivFetcher

# Initialize the fetcher
fetcher = ArxivFetcher()

## Basic Search by Query String

In this example, we search for papers related to "machine learning". The results are returned as `Document` objects with `text` as the summary and `metadata` containing details.

In [3]:
# Search for papers related to "machine learning"
results = fetcher.search(query="machine learning", max_results=5)

# Display the metadata and summaries of the retrieved documents
for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:dapr_agents.document.fetcher.arxiv:Searching for query: machine learning
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=machine+learning&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 100 of 378290 total results
INFO:dapr_agents.document.fetcher.arxiv:Found 5 results for query: machine learning


Title: CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation
Authors: Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat
Summary: Large language models (LLMs) have shown great potential in natural language
processing tasks, but their application to machine translation (MT) remains
challenging due to pretraining on English-centric data and the complexity of
reinforcement learning from human feedback (RLHF). Direct Preference
Optimization (DPO) has emerged as a simpler and more efficient alternative, but
its performance depends heavily on the quality of preference data. To address
this, we propose Confidence-Reward driven Preference Optimization (CRPO), a
novel method that combines reward scores with model confidence to improve data
selection for fine-tuning. CRPO selects challenging sentence pairs where the
model is uncertain or underperforms, leading to more effective learning. While
primarily designed for LLMs, CRPO also generalizes to encoder-
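
The summaries come back with hard line breaks and can run long, as the output above shows. A minimal sketch of a preview helper, using only the standard library: `preview` is a hypothetical name, not part of `dapr-agents`, and the sample string stands in for `doc.text`.

```python
import textwrap


def preview(summary: str, width: int = 80) -> str:
    """Collapse all whitespace (including newlines) and truncate to at most `width` characters."""
    return textwrap.shorten(summary, width=width, placeholder="...")


# Stand-in for doc.text; real summaries come from fetcher.search(...)
sample = (
    "Large language models (LLMs) have shown great potential in natural language\n"
    "processing tasks, but their application to machine translation (MT) remains\n"
    "challenging"
)

print(preview(sample, width=60))
```

In the loops above, `print(f"Summary: {preview(doc.text)}\n")` would keep each result to a single line.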

## Advanced Query Strings

Here we demonstrate using advanced query strings with logical operators like `AND`, `OR`, and `NOT`.

Search for papers where "agents" and "cybersecurity" both appear:

In [4]:
results = fetcher.search(query="all:(agents AND cybersecurity)", max_results=10)

for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:dapr_agents.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 96 of 96 total results
INFO:dapr_agents.document.fetcher.arxiv:Found 10 results for query: all:(agents AND cybersecurity)


Title: VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework
Authors: He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, Bingzhen Wu
Summary: Penetration testing is a vital practice for identifying and mitigating
vulnerabilities in cybersecurity systems, but its manual execution is
labor-intensive and time-consuming. Existing large language model
(LLM)-assisted or automated penetration testing approaches often suffer from
inefficiencies, such as a lack of contextual understanding and excessive,
unstructured data generation. This paper presents VulnBot, an automated
penetration testing framework that leverages LLMs to simulate the collaborative
workflow of human penetration testing teams through a multi-agent system. To
address the inefficiencies and reliance on manual intervention in traditional
penetration testing methods, VulnBot decomposes complex tasks into three
specialized phases: reconnaissance, scanning, and exploitation. These phases
are guided by

Search for papers where "quantum" appears but not "computing":

In [5]:
results = fetcher.search(query="all:(quantum NOT computing)", max_results=10)

for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:dapr_agents.document.fetcher.arxiv:Searching for query: all:(quantum NOT computing)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28quantum+NOT+computing%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 100 of 356985 total results
INFO:dapr_agents.document.fetcher.arxiv:Found 10 results for query: all:(quantum NOT computing)


Title: Exponentially slow thermalization in 1D fragmented dynamics
Authors: Cheng Wang, Shankar Balasubramanian, Yiqiu Han, Ethan Lake, Xiao Chen, Zhi-Cheng Yang
Summary: We investigate the thermalization dynamics of 1D systems with local
constraints coupled to an infinite temperature bath at one boundary. The
coupling to the bath eventually erases the effects of the constraints, causing
the system to tend towards a maximally mixed state at long times. We show that
for a large class of local constraints, the time at which thermalization occurs
can be extremely long. In particular, we present evidence for the following
conjecture: when the constrained dynamics displays strong Hilbert space
fragmentation, the thermalization time diverges exponentially with system size.
We show that this conjecture holds for a wide range of dynamical constraints,
including dipole-conserving dynamics, the $tJ_z$ model, and a large class of
group-based dynamics, and relate a general proof of our conjecture 

Search for papers authored by a specific person:

In [6]:
results = fetcher.search(query='au:"John Doe"', max_results=10)

for doc in results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Summary: {doc.text}\n")

INFO:dapr_agents.document.fetcher.arxiv:Searching for query: au:"John Doe"
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=au%3A%22John+Doe%22&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 1 of 1 total results
INFO:dapr_agents.document.fetcher.arxiv:Found 1 results for query: au:"John Doe"


Title: Double Deep Q-Learning in Opponent Modeling
Authors: Yangtianze Tao, John Doe
Summary: Multi-agent systems in which secondary agents with conflicting agendas also
alter their methods need opponent modeling. In this study, we simulate the main
agent's and secondary agents' tactics using Double Deep Q-Networks (DDQN) with
a prioritized experience replay mechanism. Then, under the opponent modeling
setup, a Mixture-of-Experts architecture is used to identify various opponent
strategy patterns. Finally, we analyze our models in two environments with
several agents. The findings indicate that the Mixture-of-Experts model, which
is based on opponent modeling, performs better than DDQN.
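
The log lines above show how each query string ends up URL-encoded in the request to `export.arxiv.org`. The sketch below rebuilds those URLs with the standard library so the encoding is easier to follow. The helper names (`field_query`, `author_query`, `request_url`) are hypothetical, and the fixed parameters are copied from the logged requests rather than taken from the `ArxivFetcher` source.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"  # endpoint as it appears in the logs


def field_query(field: str, *terms: str, op: str = "AND") -> str:
    """Build a fielded expression such as all:(agents AND cybersecurity)."""
    joined = f" {op} ".join(terms)
    return f"{field}:({joined})"


def author_query(name: str) -> str:
    """Build an exact-phrase author expression such as au:"John Doe"."""
    return f'au:"{name}"'


def request_url(query: str, start: int = 0, page_size: int = 100) -> str:
    """Assemble a request URL with the same parameters seen in the logged requests."""
    params = {
        "search_query": query,
        "id_list": "",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": start,
        "max_results": page_size,
    }
    # urlencode applies quote_plus: spaces become '+', ':' becomes %3A, and so on
    return f"{ARXIV_API}?{urlencode(params)}"


print(request_url(field_query("all", "agents", "cybersecurity")))
print(request_url(field_query("all", "quantum", "computing", op="NOT")))
print(request_url(author_query("John Doe")))
```

The logged requests ask for 100 results per page regardless of `max_results`, so the page size is kept separate from the number of results returned to the caller.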



## Filter Papers by Date (e.g., Last 15 Days)

In [7]:
from datetime import datetime, timedelta

# Calculate the date 15 days ago
from_date = (datetime.now() - timedelta(days=15)).strftime("%Y%m%d")

# Search for recent papers
recent_results = fetcher.search(
    query="all:(agents AND cybersecurity)",
    from_date=from_date,
    to_date=datetime.now().strftime("%Y%m%d"),
    max_results=5
)

# Display recent papers
for doc in recent_results:
    print(f"Title: {doc.metadata['title']}")
    print(f"Authors: {', '.join(doc.metadata['authors'])}")
    print(f"Published: {doc.metadata['published']}")
    print(f"Summary: {doc.text}\n")

INFO:dapr_agents.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29+AND+submittedDate%3A%5B20250110+TO+20250125%5D&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 2 of 2 total results
INFO:dapr_agents.document.fetcher.arxiv:Found 2 results for query: all:(agents AND cybersecurity) AND submittedDate:[20250110 TO 20250125]


Title: VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework
Authors: He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, Bingzhen Wu
Published: 2025-01-23
Summary: Penetration testing is a vital practice for identifying and mitigating
vulnerabilities in cybersecurity systems, but its manual execution is
labor-intensive and time-consuming. Existing large language model
(LLM)-assisted or automated penetration testing approaches often suffer from
inefficiencies, such as a lack of contextual understanding and excessive,
unstructured data generation. This paper presents VulnBot, an automated
penetration testing framework that leverages LLMs to simulate the collaborative
workflow of human penetration testing teams through a multi-agent system. To
address the inefficiencies and reliance on manual intervention in traditional
penetration testing methods, VulnBot decomposes complex tasks into three
specialized phases: reconnaissance, scanning, and exploitation. Thes
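
The log above shows the date range being appended to the query as a `submittedDate:[FROM TO]` clause. A minimal sketch of that composition, with a fixed reference date so the result is reproducible: `date_window` and `with_date_filter` are hypothetical helpers that mirror the logged query string, not the fetcher's internal implementation.

```python
from datetime import datetime, timedelta


def date_window(days: int, now: datetime):
    """Return (from_date, to_date) as YYYYMMDD strings covering the last `days` days."""
    start = now - timedelta(days=days)
    return start.strftime("%Y%m%d"), now.strftime("%Y%m%d")


def with_date_filter(query: str, from_date: str, to_date: str) -> str:
    """Append a submittedDate range clause in the form seen in the logged query."""
    return f"{query} AND submittedDate:[{from_date} TO {to_date}]"


# Fixed reference date matching the logged run (2025-01-25)
start, end = date_window(15, now=datetime(2025, 1, 25))
print(with_date_filter("all:(agents AND cybersecurity)", start, end))
```

Passing `now` explicitly instead of calling `datetime.now()` inside the helper keeps both ends of the window tied to the same instant.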

## Download Top 5 Papers as PDF Files

In [8]:
import os
from pathlib import Path

# Create a directory for downloaded papers
os.makedirs("arxiv_papers", exist_ok=True)

# Search and download PDFs
download_results = fetcher.search(query="all:(agents AND cybersecurity)", max_results=5, download=True, dirpath=Path("arxiv_papers"))

for paper in download_results:
    print(f"Downloaded Paper: {paper['title']}")
    print(f"File Path: {paper['file_path']}\n")

INFO:dapr_agents.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 96 of 96 total results
INFO:dapr_agents.document.fetcher.arxiv:Found 5 results for query: all:(agents AND cybersecurity)
INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to arxiv_papers/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf
INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to arxiv_papers/2501.09709v1.CyberMentor__AI_Powered_Learning_Tool_Platform_to_Address_Diverse_Student_Needs_in_Cybersecurity_Education.pdf
INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to arxiv_papers/2501.00855v1.What_is_a_Social_Media_Bot__A_Global_Comparison_of_Bot_and_Human_Characteristics.pdf
INFO:da

Downloaded Paper: VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework
File Path: arxiv_papers/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf

Downloaded Paper: CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education
File Path: arxiv_papers/2501.09709v1.CyberMentor__AI_Powered_Learning_Tool_Platform_to_Address_Diverse_Student_Needs_in_Cybersecurity_Education.pdf

Downloaded Paper: What is a Social Media Bot? A Global Comparison of Bot and Human Characteristics
File Path: arxiv_papers/2501.00855v1.What_is_a_Social_Media_Bot__A_Global_Comparison_of_Bot_and_Human_Characteristics.pdf

Downloaded Paper: SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
File Path: arxiv_papers/2412.20787v3.SecBench__A_Comprehensive_Multi_Dimensional_Benchmarking_Dataset_for_LLMs_in_Cybersecurity.pdf

Downloaded Paper: BotSim: LLM-Powered 

In [9]:
download_results[0]

{'entry_id': 'http://arxiv.org/abs/2501.13411v1',
 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework',
 'authors': ['He Kong',
  'Die Hu',
  'Jingguo Ge',
  'Liangxiong Li',
  'Tong Li',
  'Bingzhen Wu'],
 'published': '2025-01-23',
 'updated': '2025-01-23',
 'primary_category': 'cs.SE',
 'categories': ['cs.SE'],
 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1',
 'file_path': 'arxiv_papers/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf'}
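
Each result dictionary carries an `entry_id` URL and a `file_path`. The sketch below pulls the arXiv identifier and version out of `entry_id` and rebuilds the file name from the title. The naming rule (every non-word character replaced by `_`) is inferred from the paths in the output above, not from the library source, and all three helper names are hypothetical.

```python
import re
from typing import Optional, Tuple


def short_id(entry_id: str) -> str:
    """Return the trailing identifier of an entry URL, e.g. '2501.13411v1'."""
    return entry_id.rsplit("/abs/", 1)[-1]


def split_version(sid: str) -> Tuple[str, Optional[int]]:
    """Split '2501.13411v1' into ('2501.13411', 1); version is None if absent."""
    match = re.fullmatch(r"(.+?)v(\d+)", sid)
    if match is None:
        return sid, None
    return match.group(1), int(match.group(2))


def expected_filename(entry_id: str, title: str) -> str:
    """Rebuild the file name pattern seen in the logs (assumed, not confirmed)."""
    safe_title = re.sub(r"[^\w]", "_", title)
    return f"{short_id(entry_id)}.{safe_title}.pdf"


entry = {
    "entry_id": "http://arxiv.org/abs/2501.13411v1",
    "title": "VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework",
}
print(split_version(short_id(entry["entry_id"])))
print(expected_filename(entry["entry_id"], entry["title"]))
```

A check such as `Path(paper["file_path"]).exists()` on each result is a simple way to confirm that every download actually reached disk before moving on.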

## Download Top 5 Papers as PDF Files (Include Summary)

In [10]:
import os
from pathlib import Path

# Create a directory for downloaded papers
os.makedirs("more_arxiv", exist_ok=True)

# Search and download PDFs
download_results = fetcher.search(query="all:(agents AND cybersecurity)", max_results=5, download=True, dirpath=Path("more_arxiv"), include_summary=True)

INFO:dapr_agents.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)
INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100
INFO:arxiv:Got first page: 96 of 96 total results
INFO:dapr_agents.document.fetcher.arxiv:Found 5 results for query: all:(agents AND cybersecurity)
INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf
INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to more_arxiv/2501.09709v1.CyberMentor__AI_Powered_Learning_Tool_Platform_to_Address_Diverse_Student_Needs_in_Cybersecurity_Education.pdf
INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to more_arxiv/2501.00855v1.What_is_a_Social_Media_Bot__A_Global_Comparison_of_Bot_and_Human_Characteristics.pdf
INFO:dapr_age

In [11]:
download_results[0]

{'entry_id': 'http://arxiv.org/abs/2501.13411v1',
 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework',
 'authors': ['He Kong',
  'Die Hu',
  'Jingguo Ge',
  'Liangxiong Li',
  'Tong Li',
  'Bingzhen Wu'],
 'published': '2025-01-23',
 'updated': '2025-01-23',
 'primary_category': 'cs.SE',
 'categories': ['cs.SE'],
 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1',
 'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf',
 'summary': 'Penetration testing is a vital practice for identifying and mitigating\nvulnerabilities in cybersecurity systems, but its manual execution is\nlabor-intensive and time-consuming. Existing large language model\n(LLM)-assisted or automated penetration testing approaches often suffer from\ninefficiencies, such as a lack of contextual understanding and excessive,\nunstructured data generation. This paper presents VulnBot, an automated\npenetration testin

In [12]:
print(download_results[0]["summary"])

Penetration testing is a vital practice for identifying and mitigating
vulnerabilities in cybersecurity systems, but its manual execution is
labor-intensive and time-consuming. Existing large language model
(LLM)-assisted or automated penetration testing approaches often suffer from
inefficiencies, such as a lack of contextual understanding and excessive,
unstructured data generation. This paper presents VulnBot, an automated
penetration testing framework that leverages LLMs to simulate the collaborative
workflow of human penetration testing teams through a multi-agent system. To
address the inefficiencies and reliance on manual intervention in traditional
penetration testing methods, VulnBot decomposes complex tasks into three
specialized phases: reconnaissance, scanning, and exploitation. These phases
are guided by a penetration task graph (PTG) to ensure logical task execution.
Key design features include role specialization, penetration path planning,
inter-agent communication, and

## Reading Downloaded PDFs

To read the downloaded PDF files, we'll use the `PyPDFReader` class from `dapr_agents.document`. This allows us to extract the content of each page while retaining the associated metadata for further processing.

In [None]:
# Ensure you have the required library for reading PDFs installed. If not, you can install it using the following command:
!pip install pypdf

The following code reads each downloaded PDF file and extracts its pages. Each page is stored as a separate `Document` object, containing both the page's text and the metadata from the original PDF.

In [14]:
from pathlib import Path
from dapr_agents.document import PyPDFReader

# Initialize the PDF reader
docs_read = []
reader = PyPDFReader()

# Remove 'summary' from metadata in download_results
for paper in download_results:
    paper.pop("summary", None)  # Remove the 'summary' key if it exists

# Process each downloaded PDF
for paper in download_results:
    local_pdf_path = Path(paper["file_path"])  # Ensure the key matches the output
    documents = reader.load(local_pdf_path, additional_metadata=paper)  # Load the PDF with metadata
    
    # Append each page's document to the main list
    docs_read.extend(documents)  # Flatten into one list of all documents

# Verify the results
print(f"Extracted {len(docs_read)} documents from the PDFs.")

Extracted 93 documents from the PDFs.
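
Because every page becomes its own document, it is often useful to regroup pages by paper. The sketch below counts pages per title and reassembles one paper's text in page order. `Page` is a hypothetical stand-in with the same `text` and `metadata` attributes as the page-level documents, so the example runs without any PDFs; the `title` and `page_number` keys are the ones visible in the notebook's output.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a page-level document (text plus metadata)."""
    text: str
    metadata: dict = field(default_factory=dict)


def pages_per_paper(pages):
    """Count how many page documents belong to each paper title."""
    counts = defaultdict(int)
    for page in pages:
        counts[page.metadata["title"]] += 1
    return dict(counts)


def full_text(pages, title):
    """Join the pages of one paper in page order, separated by newlines."""
    selected = [p for p in pages if p.metadata["title"] == title]
    selected.sort(key=lambda p: p.metadata["page_number"])
    return "\n".join(p.text for p in selected)


# Toy data; with real results, pass docs_read instead
pages = [
    Page("second page", {"title": "Paper A", "page_number": 2}),
    Page("first page", {"title": "Paper A", "page_number": 1}),
    Page("only page", {"title": "Paper B", "page_number": 1}),
]
print(pages_per_paper(pages))
print(full_text(pages, "Paper A"))
```

With the real list, `pages_per_paper(docs_read)` should sum to the document count reported above.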


In [15]:
docs_read[0:15]

[Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 1, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative\nFramework\nHe Kong1,2, Die Hu1,2, Jingguo Ge1,2, Liangxiong Li1, Tong Li1 , and Bingzhen Wu1\n1State Key Laboratory of Cyberspace Security Defense, Institute of Information Engineering,\nChinese Academy of Sciences\n2School of Cyber Security, University of Chinese Academy of Sciences\nAbstract\nPenetration testing is a vital practice for identifyi