mirror of https://github.com/dapr/dapr-agents.git
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Searching for Research Papers in arXiv\n",
"\n",
"This notebook demonstrates how to interact with the `arXiv` API using `dapr-agents`, specifically through the `ArxivFetcher` class. We will explore:\n",
"\n",
"* How to search for papers using advanced query strings.\n",
"* How to filter results by date (e.g., last 24 hours).\n",
"* How to retrieve metadata for papers.\n",
"* How to download the top 5 papers for further exploration.\n",
"* How to extract and process text from the downloaded PDFs, with each page stored as a separate document."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install Required Libraries\n",
"Before starting, ensure the required libraries are installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install dapr-agents python-dotenv arxiv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Logging"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"logging.basicConfig(level=logging.INFO)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importing Necessary Modules\n",
"\n",
"Import the required module and set up the `ArxivFetcher` to start searching for papers."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from dapr_agents.document import ArxivFetcher\n",
"\n",
"# Initialize the fetcher\n",
"fetcher = ArxivFetcher()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Search by Query String\n",
"\n",
"In this example, we search for papers related to \"machine learning\". The results are returned as `Document` objects with `text` as the summary and `metadata` containing details."
]
},
|
||
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:dapr_agents.document.fetcher.arxiv:Searching for query: machine learning\n",
"INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=machine+learning&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100\n",
"INFO:arxiv:Got first page: 100 of 378290 total results\n",
"INFO:dapr_agents.document.fetcher.arxiv:Found 5 results for query: machine learning\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Title: CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation\n",
"Authors: Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat\n",
"Summary: Large language models (LLMs) have shown great potential in natural language\n",
"processing tasks, but their application to machine translation (MT) remains\n",
"challenging due to pretraining on English-centric data and the complexity of\n",
"reinforcement learning from human feedback (RLHF). Direct Preference\n",
"Optimization (DPO) has emerged as a simpler and more efficient alternative, but\n",
"its performance depends heavily on the quality of preference data. To address\n",
"this, we propose Confidence-Reward driven Preference Optimization (CRPO), a\n",
"novel method that combines reward scores with model confidence to improve data\n",
"selection for fine-tuning. CRPO selects challenging sentence pairs where the\n",
"model is uncertain or underperforms, leading to more effective learning. While\n",
"primarily designed for LLMs, CRPO also generalizes to encoder-decoder models\n",
"like NLLB, demonstrating its versatility. Empirical results show that CRPO\n",
"outperforms existing methods such as RS-DPO, RSO and MBR score in both\n",
"translation accuracy and data efficiency.\n",
"\n",
"Title: Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization\n",
"Authors: Hao Dong, Eleni Chatzi, Olga Fink\n",
"Summary: Test-time adaptation (TTA) has demonstrated significant potential in\n",
"addressing distribution shifts between training and testing data. Open-set\n",
"test-time adaptation (OSTTA) aims to adapt a source pre-trained model online to\n",
"an unlabeled target domain that contains unknown classes. This task becomes\n",
"more challenging when multiple modalities are involved. Existing methods have\n",
"primarily focused on unimodal OSTTA, often filtering out low-confidence samples\n",
"without addressing the complexities of multimodal data. In this work, we\n",
"present Adaptive Entropy-aware Optimization (AEO), a novel framework\n",
"specifically designed to tackle Multimodal Open-set Test-time Adaptation\n",
"(MM-OSTTA) for the first time. Our analysis shows that the entropy difference\n",
"between known and unknown samples in the target domain strongly correlates with\n",
"MM-OSTTA performance. To leverage this, we propose two key components:\n",
"Unknown-aware Adaptive Entropy Optimization (UAE) and Adaptive Modality\n",
"Prediction Discrepancy Optimization (AMP). These components enhance the ability\n",
"of model to distinguish unknown class samples during online adaptation by\n",
"amplifying the entropy difference between known and unknown samples. To\n",
"thoroughly evaluate our proposed methods in the MM-OSTTA setting, we establish\n",
"a new benchmark derived from existing datasets. This benchmark includes two\n",
"downstream tasks and incorporates five modalities. Extensive experiments across\n",
"various domain shift situations demonstrate the efficacy and versatility of the\n",
"AEO framework. Additionally, we highlight the strong performance of AEO in\n",
"long-term and continual MM-OSTTA settings, both of which are challenging and\n",
"highly relevant to real-world applications. Our source code is available at\n",
"https://github.com/donghao51/AEO.\n",
"\n",
"Title: IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models\n",
"Authors: Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li\n",
"Summary: With the rapid development of diffusion models, text-to-image(T2I) models\n",
"have made significant progress, showcasing impressive abilities in prompt\n",
"following and image generation. Recently launched models such as FLUX.1 and\n",
"Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have\n",
"demonstrated exceptional performance across various complex tasks, raising\n",
"questions about whether T2I models are moving towards general-purpose\n",
"applicability. Beyond traditional image generation, these models exhibit\n",
"capabilities across a range of fields, including controllable generation, image\n",
"editing, video, audio, 3D, and motion generation, as well as computer vision\n",
"tasks like semantic segmentation and depth estimation. However, current\n",
"evaluation frameworks are insufficient to comprehensively assess these models'\n",
"performance across expanding domains. To thoroughly evaluate these models, we\n",
"developed the IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0,\n",
"Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided\n",
"into five key domains: structured output generation, realism, and physical\n",
"consistency, specific domain generation, challenging scenario generation, and\n",
"multi-style creation tasks. This comprehensive assessment highlights each\n",
"model's strengths and limitations, particularly the outstanding performance of\n",
"FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring\n",
|
||
"the expanding applications and potential of T2I models as foundational AI\n",
|
||
"tools. This study provides valuable insights into the current state and future\n",
|
||
"trajectory of T2I models as they evolve towards general-purpose usability.\n",
|
||
"Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.\n",
|
||
"\n",
|
||
"Title: Temporal Preference Optimization for Long-Form Video Understanding\n",
|
||
"Authors: Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy\n",
|
||
"Summary: Despite significant advancements in video large multimodal models\n",
|
||
"(video-LMMs), achieving effective temporal grounding in long-form videos\n",
|
||
"remains a challenge for existing models. To address this limitation, we propose\n",
|
||
"Temporal Preference Optimization (TPO), a novel post-training framework\n",
|
||
"designed to enhance the temporal grounding capabilities of video-LMMs through\n",
|
||
"preference learning. TPO adopts a self-training approach that enables models to\n",
|
||
"differentiate between well-grounded and less accurate temporal responses by\n",
|
||
"leveraging curated preference datasets at two granularities: localized temporal\n",
|
||
"grounding, which focuses on specific video segments, and comprehensive temporal\n",
|
||
"grounding, which captures extended temporal dependencies across entire video\n",
|
||
"sequences. By optimizing on these preference datasets, TPO significantly\n",
|
||
"enhances temporal understanding while reducing reliance on manually annotated\n",
|
||
"data. Extensive experiments on three long-form video understanding\n",
|
||
"benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness\n",
|
||
"of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO\n",
|
||
"establishes itself as the leading 7B model on the Video-MME benchmark,\n",
|
||
"underscoring the potential of TPO as a scalable and efficient solution for\n",
|
||
"advancing temporal reasoning in long-form video understanding. Project page:\n",
|
||
"https://ruili33.github.io/tpo_website.\n",
|
||
"\n",
|
||
"Title: Improving Video Generation with Human Feedback\n",
|
||
"Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang\n",
|
||
"Summary: Video generation has achieved significant advances through rectified flow\n",
|
||
"techniques, but issues like unsmooth motion and misalignment between videos and\n",
|
||
"prompts persist. In this work, we develop a systematic pipeline that harnesses\n",
|
||
"human feedback to mitigate these problems and refine the video generation\n",
|
||
"model. Specifically, we begin by constructing a large-scale human preference\n",
|
||
"dataset focused on modern video generation models, incorporating pairwise\n",
|
||
"annotations across multi-dimensions. We then introduce VideoReward, a\n",
|
||
"multi-dimensional video reward model, and examine how annotations and various\n",
|
||
"design choices impact its rewarding efficacy. From a unified reinforcement\n",
|
||
"learning perspective aimed at maximizing reward with KL regularization, we\n",
|
||
"introduce three alignment algorithms for flow-based models by extending those\n",
|
||
"from diffusion models. These include two training-time strategies: direct\n",
|
||
"preference optimization for flow (Flow-DPO) and reward weighted regression for\n",
|
||
"flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies\n",
|
||
"reward guidance directly to noisy videos. Experimental results indicate that\n",
|
||
"VideoReward significantly outperforms existing reward models, and Flow-DPO\n",
|
||
"demonstrates superior performance compared to both Flow-RWR and standard\n",
|
||
"supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom\n",
|
||
"weights to multiple objectives during inference, meeting personalized video\n",
|
||
"quality needs. Project page: https://gongyeliu.github.io/videoalign.\n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Search for papers related to \"machine learning\"\n",
|
||
"results = fetcher.search(query=\"machine learning\", max_results=5)\n",
|
||
"\n",
|
||
"# Display the metadata and summaries of the retrieved documents\n",
|
||
"for doc in results:\n",
|
||
" print(f\"Title: {doc.metadata['title']}\")\n",
|
||
" print(f\"Authors: {', '.join(doc.metadata['authors'])}\")\n",
|
||
" print(f\"Summary: {doc.text}\\n\")"
|
||
]
|
||
},
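{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each result above is a `Document`. As a quick sketch (left unexecuted here), inspecting the raw `metadata` dictionary of the first hit shows every field the fetcher returns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: inspect the full metadata dictionary of the first result\n",
"results[0].metadata"
]
},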
|
||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Advanced Query Strings\n",
"\n",
"Here we demonstrate using advanced query strings with logical operators like `AND`, `OR`, and `NOT`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Search for papers where \"agents\" and \"cybersecurity\" both appear:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:dapr_agents.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)\n",
"INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100\n",
"INFO:arxiv:Got first page: 96 of 96 total results\n",
"INFO:dapr_agents.document.fetcher.arxiv:Found 10 results for query: all:(agents AND cybersecurity)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Title: VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework\n",
"Authors: He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, Bingzhen Wu\n",
"Summary: Penetration testing is a vital practice for identifying and mitigating\n",
"vulnerabilities in cybersecurity systems, but its manual execution is\n",
"labor-intensive and time-consuming. Existing large language model\n",
"(LLM)-assisted or automated penetration testing approaches often suffer from\n",
"inefficiencies, such as a lack of contextual understanding and excessive,\n",
"unstructured data generation. This paper presents VulnBot, an automated\n",
"penetration testing framework that leverages LLMs to simulate the collaborative\n",
"workflow of human penetration testing teams through a multi-agent system. To\n",
"address the inefficiencies and reliance on manual intervention in traditional\n",
"penetration testing methods, VulnBot decomposes complex tasks into three\n",
"specialized phases: reconnaissance, scanning, and exploitation. These phases\n",
"are guided by a penetration task graph (PTG) to ensure logical task execution.\n",
"Key design features include role specialization, penetration path planning,\n",
"inter-agent communication, and generative penetration behavior. Experimental\n",
"results demonstrate that VulnBot outperforms baseline models such as GPT-4 and\n",
"Llama3 in automated penetration testing tasks, particularly showcasing its\n",
"potential in fully autonomous testing on real-world machines.\n",
"\n",
"Title: CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education\n",
"Authors: Tianyu Wang, Nianjun Zhou, Zhixiong Chen\n",
"Summary: Many non-traditional students in cybersecurity programs often lack access to\n",
"advice from peers, family members and professors, which can hinder their\n",
"educational experiences. Additionally, these students may not fully benefit\n",
"from various LLM-powered AI assistants due to issues like content relevance,\n",
"locality of advice, minimum expertise, and timing. This paper addresses these\n",
"challenges by introducing an application designed to provide comprehensive\n",
"support by answering questions related to knowledge, skills, and career\n",
"preparation advice tailored to the needs of these students. We developed a\n",
"learning tool platform, CyberMentor, to address the diverse needs and pain\n",
"points of students majoring in cybersecurity. Powered by agentic workflow and\n",
"Generative Large Language Models (LLMs), the platform leverages\n",
"Retrieval-Augmented Generation (RAG) for accurate and contextually relevant\n",
"information retrieval to achieve accessibility and personalization. We\n",
"demonstrated its value in addressing knowledge requirements for cybersecurity\n",
"education and for career marketability, in tackling skill requirements for\n",
"analytical and programming assignments, and in delivering real time on demand\n",
"learning support. Using three use scenarios, we showcased CyberMentor in\n",
"facilitating knowledge acquisition and career preparation and providing\n",
"seamless skill-based guidance and support. We also employed the LangChain\n",
"prompt-based evaluation methodology to evaluate the platform's impact,\n",
"confirming its strong performance in helpfulness, correctness, and\n",
"completeness. These results underscore the system's ability to support students\n",
"in developing practical cybersecurity skills while improving equity and\n",
"sustainability within higher education. Furthermore, CyberMentor's open-source\n",
"design allows for adaptation across other disciplines, fostering educational\n",
"innovation and broadening its potential impact.\n",
"\n",
"Title: What is a Social Media Bot? A Global Comparison of Bot and Human Characteristics\n",
"Authors: Lynnette Hui Xian Ng, Kathleen M. Carley\n",
"Summary: Chatter on social media is 20% bots and 80% humans. Chatter by bots and\n",
"humans is consistently different: bots tend to use linguistic cues that can be\n",
"easily automated while humans use cues that require dialogue understanding.\n",
"Bots use words that match the identities they choose to present, while humans\n",
"may send messages that are not related to the identities they present. Bots and\n",
"humans differ in their communication structure: sampled bots have a star\n",
"interaction structure, while sampled humans have a hierarchical structure.\n",
"These conclusions are based on a large-scale analysis of social media tweets\n",
"across ~200mil users across 7 events. Social media bots took the world by storm\n",
"when social-cybersecurity researchers realized that social media users not only\n",
"consisted of humans but also of artificial agents called bots. These bots wreck\n",
"havoc online by spreading disinformation and manipulating narratives. Most\n",
"research on bots are based on special-purposed definitions, mostly predicated\n",
"on the event studied. This article first begins by asking, \"What is a bot?\",\n",
"and we study the underlying principles of how bots are different from humans.\n",
"We develop a first-principle definition of a social media bot. With this\n",
"definition as a premise, we systematically compare characteristics between bots\n",
|
||
"and humans across global events, and reflect on how the software-programmed bot\n",
|
||
"is an Artificial Intelligent algorithm, and its potential for evolution as\n",
|
||
"technology advances. Based on our results, we provide recommendations for the\n",
|
||
"use and regulation of bots. Finally, we discuss open challenges and future\n",
|
||
"directions: Detect, to systematically identify these automated and potentially\n",
|
||
"evolving bots; Differentiate, to evaluate the goodness of the bot in terms of\n",
|
||
"their content postings and relationship interactions; Disrupt, to moderate the\n",
|
||
"impact of malicious bots.\n",
|
||
"\n",
|
||
"Title: SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity\n",
|
||
"Authors: Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, Xiapu Luo\n",
|
||
"Summary: Evaluating Large Language Models (LLMs) is crucial for understanding their\n",
|
||
"capabilities and limitations across various applications, including natural\n",
|
||
"language processing and code generation. Existing benchmarks like MMLU, C-Eval,\n",
|
||
"and HumanEval assess general LLM performance but lack focus on specific expert\n",
|
||
"domains such as cybersecurity. Previous attempts to create cybersecurity\n",
|
||
"datasets have faced limitations, including insufficient data volume and a\n",
|
||
"reliance on multiple-choice questions (MCQs). To address these gaps, we propose\n",
|
||
"SecBench, a multi-dimensional benchmarking dataset designed to evaluate LLMs in\n",
|
||
"the cybersecurity domain. SecBench includes questions in various formats (MCQs\n",
|
||
"and short-answer questions (SAQs)), at different capability levels (Knowledge\n",
|
||
"Retention and Logical Reasoning), in multiple languages (Chinese and English),\n",
|
||
"and across various sub-domains. The dataset was constructed by collecting\n",
|
||
"high-quality data from open sources and organizing a Cybersecurity Question\n",
|
||
"Design Contest, resulting in 44,823 MCQs and 3,087 SAQs. Particularly, we used\n",
|
||
"the powerful while cost-effective LLMs to (1). label the data and (2).\n",
|
||
"constructing a grading agent for automatic evaluation of SAQs. Benchmarking\n",
|
||
"results on 16 SOTA LLMs demonstrate the usability of SecBench, which is\n",
|
||
"arguably the largest and most comprehensive benchmark dataset for LLMs in\n",
|
||
"cybersecurity. More information about SecBench can be found at our website, and\n",
|
||
"the dataset can be accessed via the artifact link.\n",
|
||
"\n",
|
||
"Title: BotSim: LLM-Powered Malicious Social Botnet Simulation\n",
|
||
"Authors: Boyu Qiao, Kun Li, Wei Zhou, Shilong Li, Qianqian Lu, Songlin Hu\n",
|
||
"Summary: Social media platforms like X(Twitter) and Reddit are vital to global\n",
|
||
"communication. However, advancements in Large Language Model (LLM) technology\n",
|
||
"give rise to social media bots with unprecedented intelligence. These bots\n",
|
||
"adeptly simulate human profiles, conversations, and interactions, disseminating\n",
|
||
"large amounts of false information and posing significant challenges to\n",
|
||
"platform regulation. To better understand and counter these threats, we\n",
|
||
"innovatively design BotSim, a malicious social botnet simulation powered by\n",
|
||
"LLM. BotSim mimics the information dissemination patterns of real-world social\n",
|
||
"networks, creating a virtual environment composed of intelligent agent bots and\n",
|
||
"real human users. In the temporal simulation constructed by BotSim, these\n",
|
||
"advanced agent bots autonomously engage in social interactions such as posting\n",
|
||
"and commenting, effectively modeling scenarios of information flow and user\n",
|
||
"interaction. Building on the BotSim framework, we construct a highly\n",
|
||
"human-like, LLM-driven bot dataset called BotSim-24 and benchmark multiple bot\n",
|
||
"detection strategies against it. The experimental results indicate that\n",
|
||
"detection methods effective on traditional bot datasets perform worse on\n",
|
||
"BotSim-24, highlighting the urgent need for new detection strategies to address\n",
|
||
"the cybersecurity threats posed by these advanced bots.\n",
|
||
"\n",
|
||
"Title: algoTRIC: Symmetric and asymmetric encryption algorithms for Cryptography -- A comparative analysis in AI era\n",
|
||
"Authors: Naresh Kshetri, Mir Mehedi Rahman, Md Masud Rana, Omar Faruq Osama, James Hutson\n",
|
||
"Summary: The increasing integration of artificial intelligence (AI) within\n",
|
||
"cybersecurity has necessitated stronger encryption methods to ensure data\n",
|
||
"security. This paper presents a comparative analysis of symmetric (SE) and\n",
|
||
"asymmetric encryption (AE) algorithms, focusing on their role in securing\n",
|
||
"sensitive information in AI-driven environments. Through an in-depth study of\n",
|
||
"various encryption algorithms such as AES, RSA, and others, this research\n",
|
||
"evaluates the efficiency, complexity, and security of these algorithms within\n",
|
||
"modern cybersecurity frameworks. Utilizing both qualitative and quantitative\n",
|
||
"analysis, this research explores the historical evolution of encryption\n",
|
||
"algorithms and their growing relevance in AI applications. The comparison of SE\n",
|
||
"and AE algorithms focuses on key factors such as processing speed, scalability,\n",
|
||
"and security resilience in the face of evolving threats. Special attention is\n",
|
||
"given to how these algorithms are integrated into AI systems and how they\n",
|
||
"manage the challenges posed by large-scale data processing in multi-agent\n",
|
||
"environments. Our results highlight that while SE algorithms demonstrate\n",
|
||
"high-speed performance and lower computational demands, AE algorithms provide\n",
|
||
"superior security, particularly in scenarios requiring enhanced encryption for\n",
|
||
"AI-based networks. The paper concludes by addressing the security concerns that\n",
|
||
"encryption algorithms must tackle in the age of AI and outlines future research\n",
|
||
"directions aimed at enhancing encryption techniques for cybersecurity.\n",
|
||
"\n",
|
||
"Title: The Fusion of Large Language Models and Formal Methods for Trustworthy AI Agents: A Roadmap\n",
|
||
"Authors: Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, Jin Song Dong\n",
|
||
"Summary: Large Language Models (LLMs) have emerged as a transformative AI paradigm,\n",
|
||
"profoundly influencing daily life through their exceptional language\n",
|
||
"understanding and contextual generation capabilities. Despite their remarkable\n",
|
||
"performance, LLMs face a critical challenge: the propensity to produce\n",
|
||
"unreliable outputs due to the inherent limitations of their learning-based\n",
|
||
"nature. Formal methods (FMs), on the other hand, are a well-established\n",
|
||
"computation paradigm that provides mathematically rigorous techniques for\n",
|
||
"modeling, specifying, and verifying the correctness of systems. FMs have been\n",
|
||
"extensively applied in mission-critical software engineering, embedded systems,\n",
|
||
"and cybersecurity. However, the primary challenge impeding the deployment of\n",
|
||
"FMs in real-world settings lies in their steep learning curves, the absence of\n",
|
||
"user-friendly interfaces, and issues with efficiency and adaptability.\n",
|
||
" This position paper outlines a roadmap for advancing the next generation of\n",
|
||
"trustworthy AI systems by leveraging the mutual enhancement of LLMs and FMs.\n",
|
||
"First, we illustrate how FMs, including reasoning and certification techniques,\n",
|
||
"can help LLMs generate more reliable and formally certified outputs.\n",
|
||
"Subsequently, we highlight how the advanced learning capabilities and\n",
|
||
"adaptability of LLMs can significantly enhance the usability, efficiency, and\n",
|
||
"scalability of existing FM tools. Finally, we show that unifying these two\n",
|
||
"computation paradigms -- integrating the flexibility and intelligence of LLMs\n",
|
||
"with the rigorous reasoning abilities of FMs -- has transformative potential\n",
|
||
"for the development of trustworthy AI software systems. We acknowledge that\n",
|
||
"this integration has the potential to enhance both the trustworthiness and\n",
|
||
"efficiency of software engineering practices while fostering the development of\n",
|
||
"intelligent FM tools capable of addressing complex yet real-world challenges.\n",
|
||
"\n",
|
||
"Title: Out-of-Distribution Detection for Neurosymbolic Autonomous Cyber Agents\n",
|
||
"Authors: Ankita Samaddar, Nicholas Potteiger, Xenofon Koutsoukos\n",
|
||
"Summary: Autonomous agents for cyber applications take advantage of modern defense\n",
|
||
"techniques by adopting intelligent agents with conventional and\n",
|
||
"learning-enabled components. These intelligent agents are trained via\n",
|
||
"reinforcement learning (RL) algorithms, and can learn, adapt to, reason about\n",
|
||
"and deploy security rules to defend networked computer systems while\n",
|
||
"maintaining critical operational workflows. However, the knowledge available\n",
|
||
"during training about the state of the operational network and its environment\n",
|
||
"may be limited. The agents should be trustworthy so that they can reliably\n",
|
||
"detect situations they cannot handle, and hand them over to cyber experts. In\n",
|
||
"this work, we develop an out-of-distribution (OOD) Monitoring algorithm that\n",
|
||
"uses a Probabilistic Neural Network (PNN) to detect anomalous or OOD situations\n",
|
||
"of RL-based agents with discrete states and discrete actions. To demonstrate\n",
|
||
"the effectiveness of the proposed approach, we integrate the OOD monitoring\n",
|
||
"algorithm with a neurosymbolic autonomous cyber agent that uses behavior trees\n",
|
||
"with learning-enabled components. We evaluate the proposed approach in a\n",
|
||
"simulated cyber environment under different adversarial strategies.\n",
|
||
"Experimental results over a large number of episodes illustrate the overall\n",
|
||
"efficiency of our proposed approach.\n",
|
||
"\n",
|
||
"Title: Hacking CTFs with Plain Agents\n",
|
||
"Authors: Rustem Turtayev, Artem Petrov, Dmitrii Volkov, Denis Volk\n",
|
||
"Summary: We saturate a high-school-level hacking benchmark with plain LLM agent\n",
|
||
"design. Concretely, we obtain 95% performance on InterCode-CTF, a popular\n",
|
||
"offensive security benchmark, using prompting, tool use, and multiple attempts.\n",
|
||
"This beats prior work by Phuong et al. 2024 (29%) and Abramovich et al. 2024\n",
|
||
"(72%).\n",
|
||
" Our results suggest that current LLMs have surpassed the high school level in\n",
|
||
"offensive cybersecurity. Their hacking capabilities remain underelicited: our\n",
|
||
"ReAct&Plan prompting strategy solves many challenges in 1-2 turns without\n",
|
||
"complex engineering or advanced harnessing.\n",
|
||
"\n",
|
||
"Title: Explore Reinforced: Equilibrium Approximation with Reinforcement Learning\n",
|
||
"Authors: Ryan Yu, Mateusz Nowak, Qintong Xie, Michelle Yilin Feng, Peter Chin\n",
|
||
"Summary: Current approximate Coarse Correlated Equilibria (CCE) algorithms struggle\n",
|
||
"with equilibrium approximation for games in large stochastic environments but\n",
|
||
"are theoretically guaranteed to converge to a strong solution concept. In\n",
|
||
"contrast, modern Reinforcement Learning (RL) algorithms provide faster training\n",
|
||
"yet yield weaker solutions. We introduce Exp3-IXrl - a blend of RL and\n",
|
||
"game-theoretic approach, separating the RL agent's action selection from the\n",
|
||
"equilibrium computation while preserving the integrity of the learning process.\n",
|
||
"We demonstrate that our algorithm expands the application of equilibrium\n",
|
||
"approximation algorithms to new environments. Specifically, we show the\n",
|
||
"improved performance in a complex and adversarial cybersecurity network\n",
|
||
"environment - the Cyber Operations Research Gym - and in the classical\n",
|
||
"multi-armed bandit settings.\n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"results = fetcher.search(query=\"all:(agents AND cybersecurity)\", max_results=10)\n",
|
||
"\n",
|
||
"for doc in results:\n",
|
||
" print(f\"Title: {doc.metadata['title']}\")\n",
|
||
" print(f\"Authors: {', '.join(doc.metadata['authors'])}\")\n",
|
||
" print(f\"Summary: {doc.text}\\n\")"
|
||
]
|
||
},
|
||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Search for papers where \"quantum\" appears but not \"computing\":"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:dapr_agents.document.fetcher.arxiv:Searching for query: all:(quantum NOT computing)\n",
"INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28quantum+NOT+computing%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100\n",
"INFO:arxiv:Got first page: 100 of 356985 total results\n",
"INFO:dapr_agents.document.fetcher.arxiv:Found 10 results for query: all:(quantum NOT computing)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Title: Exponentially slow thermalization in 1D fragmented dynamics\n",
"Authors: Cheng Wang, Shankar Balasubramanian, Yiqiu Han, Ethan Lake, Xiao Chen, Zhi-Cheng Yang\n",
"Summary: We investigate the thermalization dynamics of 1D systems with local\n",
"constraints coupled to an infinite temperature bath at one boundary. The\n",
"coupling to the bath eventually erases the effects of the constraints, causing\n",
"the system to tend towards a maximally mixed state at long times. We show that\n",
"for a large class of local constraints, the time at which thermalization occurs\n",
"can be extremely long. In particular, we present evidence for the following\n",
"conjecture: when the constrained dynamics displays strong Hilbert space\n",
"fragmentation, the thermalization time diverges exponentially with system size.\n",
"We show that this conjecture holds for a wide range of dynamical constraints,\n",
"including dipole-conserving dynamics, the $tJ_z$ model, and a large class of\n",
"group-based dynamics, and relate a general proof of our conjecture to a\n",
"different conjecture about the existence of certain expander graphs.\n",
"\n",
"Title: Efficient Mitigation of Error Floors in Quantum Error Correction using Non-Binary Low-Density Parity-Check Codes\n",
"Authors: Kenta Kasai\n",
"Summary: In this paper, we propose an efficient method to reduce error floors in\n",
"quantum error correction using non-binary low-density parity-check (LDPC)\n",
"codes. We identify and classify cycle structures in the parity-check matrix\n",
"where estimated noise becomes trapped, and develop tailored decoding methods\n",
"for each cycle type. For Type-I cycles, we propose a method to make the\n",
"difference between estimated and true noise degenerate. Type-II cycles are\n",
"shown to be uncorrectable, while for Type-III cycles, we utilize the fact that\n",
"cycles in non-binary LDPC codes do not necessarily correspond to codewords,\n",
"allowing us to estimate the true noise. Our method significantly improves\n",
"decoding performance and reduces error floors.\n",
"\n",
"Title: Hamiltonian Simulation via Stochastic Zassenhaus Expansions\n",
"Authors: Joseph Peetz, Prineha Narang\n",
"Summary: We introduce the stochastic Zassenhaus expansions (SZEs), a class of\n",
"ancilla-free quantum algorithms for Hamiltonian simulation. These algorithms\n",
"map nested Zassenhaus formulas onto quantum gates and then employ randomized\n",
"sampling to minimize circuit depths. Unlike Suzuki-Trotter product formulas,\n",
"which grow exponentially long with approximation order, the nested commutator\n",
"structures of SZEs enable high-order formulas for many systems of interest. For\n",
"a 10-qubit transverse-field Ising model, we construct an 11th-order SZE with\n",
"42x fewer CNOTs than the standard 10th-order product formula. Further, we\n",
"empirically demonstrate regimes where SZEs reduce simulation errors by many\n",
"orders of magnitude compared to leading algorithms.\n",
"\n",
"Title: Topological $X$-states in a quantum impurity model\n",
"Authors: Moallison F. Cavalcante, Marcus V. S. Bonança, Eduardo Miranda, Sebastian Deffner\n",
"Summary: Topological qubits are inherently resistant to noise and errors. However,\n",
"experimental demonstrations have been elusive as their realization and control\n",
"is highly complex. In the present work, we demonstrate the emergence of\n",
"topological $X$-states in the long-time response of a locally perturbed quantum\n",
"impurity model. The emergence of the double-qubit state is heralded by the lack\n",
"of decay of the response function as well as the out-of-time order correlator\n",
"signifying the trapping of excitations, and hence information in local edge\n",
"modes.\n",
"\n",
"Title: Secure Quantum Key Distribution with Room-Temperature Quantum Emitter\n",
"Authors: Ömer S. Tapşın, Furkan Ağlarcı, Serkan Ateş\n",
"Summary: On-demand generation of single photons from solid-state quantum emitters is a\n",
"key building block for future quantum networks, particularly quantum key\n",
"distribution (QKD) systems, by enabling higher secure key rates (SKR) and lower\n",
"quantum bit error rates (QBER). In this work, we demonstrate the B92 protocol\n",
"based on single photons from defects in hexagonal boron nitride (hBN). The\n",
"results show a sifted key rate (SiKR) of 17.5 kbps with a QBER of 6.49 % at a\n",
"dynamic polarization encoding rate of 40 MHz. Finite-key analysis yields a SKR\n",
"of 7 kbps, as one of the highest SKR obtained for any room-temperature single\n",
"photon source. Our results highlight the potential of hBN defects in advancing\n",
"quantum communication technologies.\n",
"\n",
"Title: The simplest 2D quantum walk detects chaoticity\n",
"Authors: C. Alonso-Lobo, Gabriel G. Carlo, F. Borondo\n",
"Summary: Quantum walks have been actively studied from many perspectives, mainly from\n",
"the statistical physics and quantum information points of view. We here\n",
"determine the influence of basic chaotic features on the walker behavior. We\n",
"consider an extremely simple model consisting of alternate one-dimensional\n",
"walks along the two spatial coordinates of bidimensional closed domains (hard\n",
"wall billiards). The chaotic or regular behavior that the shape of the boundary\n",
"induces in the deterministic classical equations of motion and that translates\n",
"into chaotic signatures for the quantized problem also results in sharp\n",
"differences for the spectral statistics and morphology of the eigenfunctions of\n",
|
||
"the quantum walker. Unexpectedly, two different quantum mechanical problems\n",
|
||
"share the same kind of features related to the corresponding classical dynamics\n",
|
||
"of one of them.\n",
|
||
"\n",
|
||
"Title: Quantum model reduction for continuous-time quantum filters\n",
|
||
"Authors: Tommaso Grigoletto, Clément Pellegrini, Francesco Ticozzi\n",
|
||
"Summary: The use of quantum stochastic models is widespread in dynamical reduction,\n",
|
||
"simulation of open systems, feedback control and adaptive estimation. In many\n",
|
||
"applications only part of the information contained in the filter's state is\n",
|
||
"actually needed to reconstruct the target observable quantities; thus, filters\n",
|
||
"of smaller dimensions could be in principle implemented to perform the same\n",
|
||
"task.In this work, we propose a systematic method to find, when possible,\n",
|
||
"reduced-order quantum filters that are capable of exactly reproducing the\n",
|
||
"evolution of expectation values of interest. In contrast with existing\n",
|
||
"reduction techniques, the reduced model we obtain is exact and in the form of a\n",
|
||
"Belavkin filtering equation, ensuring physical interpretability.This is\n",
|
||
"attained by leveraging tools from the theory of both minimal realization and\n",
|
||
"non-commutative conditional expectations. The proposed procedure is tested on\n",
|
||
"prototypical examples, laying the groundwork for applications in quantum\n",
|
||
"trajectory simulation and quantum feedback control.\n",
|
||
"\n",
|
||
"Title: Efficient Fermi-Hubbard model ground-state preparation by coupling to a classical reservoir in the instantaneous-response limit\n",
|
||
"Authors: Zekun He, A. F. Kemper, J. K. Freericks\n",
|
||
"Summary: Preparing the ground state of the Fermi-Hubbard model is challenging, in part\n",
|
||
"due to the exponentially large Hilbert space, which complicates efficiently\n",
|
||
"finding a path from an initial state to the ground state using the variational\n",
|
||
"principle. In this work, we propose an approach for ground state preparation of\n",
|
||
"interacting models by involving a classical reservoir, simplified to the\n",
|
||
"instantaneous-response limit, which can be described using a Hamiltonian\n",
|
||
"formalism. The resulting time evolution operator consist of spin-adapted\n",
|
||
"nearest-neighbor hopping and on-site interaction terms similar to those in the\n",
|
||
"Hubbard model, without expanding the Hilbert space. We can engineer the\n",
|
||
"coupling to rapidly drive the system from an initial product state to its\n",
|
||
"interacting ground state by numerically minimizing the final state energy. This\n",
|
||
"ansatz closely resembles the Hamiltonian variational ansatz, offering a fresh\n",
|
||
"perspective on it.\n",
|
||
"\n",
|
||
"Title: High-intensity wave vortices around subwavelength holes: from ocean tides to nanooptics\n",
|
||
"Authors: Kateryna Domina, Pablo Alonso-González, Andrei Bylinkin, María Barra-Burillo, Ana I. F. Tresguerres-Mata, Francisco Javier Alfaro-Mozaz, Saül Vélez, Fèlix Casanova, Luis E. Hueso, Rainer Hillenbrand, Konstantin Y. Bliokh, Alexey Y. Nikitin\n",
|
||
"Summary: Vortices are ubiquitous in nature; they appear in a variety of phenomena\n",
|
||
"ranging from galaxy formation in astrophysics to topological defects in quantum\n",
|
||
"fluids. In particular, wave vortices have attracted enormous attention and\n",
|
||
"found applications in optics, acoustics, electron microscopy, etc. Such\n",
|
||
"vortices carry quantized phase singularities accompanied by zero intensity in\n",
|
||
"the center, and quantum-like orbital angular momentum, with the minimum\n",
|
||
"localization scale of the wavelength. Here we describe a conceptually novel\n",
|
||
"type of wave vortices, which can appear around arbitrarily small `holes' (i.e.,\n",
|
||
"excluded areas or defects) in a homogeneous 2D plane. Such vortices are\n",
|
||
"characterized by high intensity and confinement at the edges of the hole and\n",
|
||
"hence subwavelength localization of the angular momentum. We demonstrate the\n",
|
||
"appearance of such vortices in: (i) optical near fields around metallic\n",
|
||
"nanodiscs on a dielectric substrate, (ii) phonon-polariton fields around\n",
|
||
"nanoholes in a polaritonic slab, and (iii) ocean tidal waves around islands of\n",
|
||
"New Zealand and Madagascar. We also propose a simple toy model of the\n",
|
||
"generation of such subwavelength vortices via the interference of a\n",
|
||
"point-dipole source and a plane wave, where the vortex sign is controlled by\n",
|
||
"the mutual phase between these waves. Our findings open avenues for\n",
|
||
"subwavelength vortex/angular-momentum-based applications in various wave\n",
|
||
"fields.\n",
|
||
"\n",
|
||
"Title: Redshift leverage for the search of GRB neutrinos affected by quantum properties of spacetime\n",
|
||
"Authors: Giovanni Amelino-Camelia, Giacomo D'Amico, Vittorio D'Esposito, Giuseppe Fabiano, Domenico Frattulillo, Giulia Gubitosi, Dafne Guetta, Alessandro Moia, Giacomo Rosati\n",
|
||
"Summary: Some previous studies based on IceCube neutrinos had found intriguing\n",
|
||
"preliminary evidence that some of them might be GRB neutrinos with travel times\n",
|
||
"affected by quantum properties of spacetime delaying them proportionally to\n",
|
||
"their energy, an effect often labeled as \"quantum-spacetime-induced in-vacuo\n",
|
||
"dispersion\". Those previous studies looked for candidate GRB neutrinos in a\n",
|
||
"fixed (neutrino-energy-independent) time window after the GRB onset and relied\n",
|
||
"rather crucially on crude estimates of the redshift of GRBs whose redshift has\n",
|
||
"not been measured. We here introduce a complementary approach to the search of\n",
|
||
"quantum-spacetime-affected GRB neutrinos which restricts the analysis to GRBs\n",
|
||
"of sharply known redshift, and, in a way that we argue is synergistic with\n",
|
||
"having sharp information on redshift, adopts a neutrino-energy-dependent time\n",
|
||
"window. We find that knowing the redshift of the GRBs strengthens the analysis\n",
|
||
"enough to compensate for the fact that of course the restriction to GRBs of\n",
|
||
"known redshift reduces the number of candidate GRB neutrinos. And rather\n",
|
||
"remarkably our estimate of the magnitude of the in-vacuo-dispersion effects is\n",
|
||
"fully consistent with what had been found using the previous approach. Our\n",
|
||
"findings are still inconclusive, since their significance is quantified by a\n",
|
||
"$p$-value of little less than $0.01$, but provide motivation for monitoring the\n",
|
||
"accrual of neutrino observations by IceCube and KM3NeT as well as for further\n",
|
||
"refinements of the strategy of analysis here proposed.\n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"results = fetcher.search(query=\"all:(quantum NOT computing)\", max_results=10)\n",
|
||
"\n",
|
||
"for doc in results:\n",
|
||
" print(f\"Title: {doc.metadata['title']}\")\n",
|
||
" print(f\"Authors: {', '.join(doc.metadata['authors'])}\")\n",
|
||
" print(f\"Summary: {doc.text}\\n\")"
|
||
]
|
||
},
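{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `OR` operator mentioned above works the same way. As a sketch (left unexecuted here; the two phrases are just example terms), search for papers where either \"reinforcement learning\" or \"federated learning\" appears:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: OR matches papers containing either phrase\n",
"results = fetcher.search(query='all:(\"reinforcement learning\" OR \"federated learning\")', max_results=5)\n",
"\n",
"for doc in results:\n",
" print(f\"Title: {doc.metadata['title']}\")\n",
" print(f\"Summary: {doc.text}\\n\")"
]
},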
|
||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Search for papers authored by a specific person:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:dapr_agents.document.fetcher.arxiv:Searching for query: au:\"John Doe\"\n",
"INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=au%3A%22John+Doe%22&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100\n",
"INFO:arxiv:Got first page: 1 of 1 total results\n",
"INFO:dapr_agents.document.fetcher.arxiv:Found 1 results for query: au:\"John Doe\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Title: Double Deep Q-Learning in Opponent Modeling\n",
"Authors: Yangtianze Tao, John Doe\n",
"Summary: Multi-agent systems in which secondary agents with conflicting agendas also\n",
"alter their methods need opponent modeling. In this study, we simulate the main\n",
"agent's and secondary agents' tactics using Double Deep Q-Networks (DDQN) with\n",
"a prioritized experience replay mechanism. Then, under the opponent modeling\n",
"setup, a Mixture-of-Experts architecture is used to identify various opponent\n",
"strategy patterns. Finally, we analyze our models in two environments with\n",
"several agents. The findings indicate that the Mixture-of-Experts model, which\n",
"is based on opponent modeling, performs better than DDQN.\n",
"\n"
]
}
],
"source": [
"results = fetcher.search(query='au:\"John Doe\"', max_results=10)\n",
"\n",
"for doc in results:\n",
" print(f\"Title: {doc.metadata['title']}\")\n",
" print(f\"Authors: {', '.join(doc.metadata['authors'])}\")\n",
" print(f\"Summary: {doc.text}\\n\")"
]
},
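{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides `all:` and `au:`, the arXiv API supports field prefixes such as `ti:` (title), `abs:` (abstract), and `cat:` (category), and they combine with the logical operators. A sketch (left unexecuted here; `cs.CR` is just an example category):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: match titles only, restricted to the cs.CR (security) category\n",
"results = fetcher.search(query='ti:\"penetration testing\" AND cat:cs.CR', max_results=5)\n",
"\n",
"for doc in results:\n",
" print(f\"Title: {doc.metadata['title']}\")\n",
" print(f\"Summary: {doc.text}\\n\")"
]
},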
|
||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filter Papers by Date (e.g., Last 15 Days)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:dapr_agents.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)\n",
"INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29+AND+submittedDate%3A%5B20250110+TO+20250125%5D&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100\n",
"INFO:arxiv:Got first page: 2 of 2 total results\n",
"INFO:dapr_agents.document.fetcher.arxiv:Found 2 results for query: all:(agents AND cybersecurity) AND submittedDate:[20250110 TO 20250125]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Title: VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework\n",
"Authors: He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, Bingzhen Wu\n",
"Published: 2025-01-23\n",
"Summary: Penetration testing is a vital practice for identifying and mitigating\n",
"vulnerabilities in cybersecurity systems, but its manual execution is\n",
"labor-intensive and time-consuming. Existing large language model\n",
"(LLM)-assisted or automated penetration testing approaches often suffer from\n",
"inefficiencies, such as a lack of contextual understanding and excessive,\n",
"unstructured data generation. This paper presents VulnBot, an automated\n",
"penetration testing framework that leverages LLMs to simulate the collaborative\n",
"workflow of human penetration testing teams through a multi-agent system. To\n",
"address the inefficiencies and reliance on manual intervention in traditional\n",
"penetration testing methods, VulnBot decomposes complex tasks into three\n",
"specialized phases: reconnaissance, scanning, and exploitation. These phases\n",
"are guided by a penetration task graph (PTG) to ensure logical task execution.\n",
"Key design features include role specialization, penetration path planning,\n",
"inter-agent communication, and generative penetration behavior. Experimental\n",
"results demonstrate that VulnBot outperforms baseline models such as GPT-4 and\n",
"Llama3 in automated penetration testing tasks, particularly showcasing its\n",
"potential in fully autonomous testing on real-world machines.\n",
"\n",
"Title: CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education\n",
"Authors: Tianyu Wang, Nianjun Zhou, Zhixiong Chen\n",
"Published: 2025-01-16\n",
"Summary: Many non-traditional students in cybersecurity programs often lack access to\n",
"advice from peers, family members and professors, which can hinder their\n",
"educational experiences. Additionally, these students may not fully benefit\n",
"from various LLM-powered AI assistants due to issues like content relevance,\n",
"locality of advice, minimum expertise, and timing. This paper addresses these\n",
"challenges by introducing an application designed to provide comprehensive\n",
"support by answering questions related to knowledge, skills, and career\n",
"preparation advice tailored to the needs of these students. We developed a\n",
"learning tool platform, CyberMentor, to address the diverse needs and pain\n",
"points of students majoring in cybersecurity. Powered by agentic workflow and\n",
"Generative Large Language Models (LLMs), the platform leverages\n",
"Retrieval-Augmented Generation (RAG) for accurate and contextually relevant\n",
"information retrieval to achieve accessibility and personalization. We\n",
"demonstrated its value in addressing knowledge requirements for cybersecurity\n",
"education and for career marketability, in tackling skill requirements for\n",
"analytical and programming assignments, and in delivering real time on demand\n",
"learning support. Using three use scenarios, we showcased CyberMentor in\n",
"facilitating knowledge acquisition and career preparation and providing\n",
"seamless skill-based guidance and support. We also employed the LangChain\n",
"prompt-based evaluation methodology to evaluate the platform's impact,\n",
"confirming its strong performance in helpfulness, correctness, and\n",
"completeness. These results underscore the system's ability to support students\n",
"in developing practical cybersecurity skills while improving equity and\n",
"sustainability within higher education. Furthermore, CyberMentor's open-source\n",
"design allows for adaptation across other disciplines, fostering educational\n",
"innovation and broadening its potential impact.\n",
"\n"
]
}
],
"source": [
"from datetime import datetime, timedelta\n",
"\n",
"# Calculate the date 15 days ago\n",
"last_15_days = (datetime.now() - timedelta(days=15)).strftime(\"%Y%m%d\")\n",
"\n",
"# Search for recent papers\n",
"recent_results = fetcher.search(\n",
" query=\"all:(agents AND cybersecurity)\",\n",
" from_date=last_15_days,\n",
" to_date=datetime.now().strftime(\"%Y%m%d\"),\n",
" max_results=5\n",
")\n",
"\n",
"# Display recent papers\n",
"for doc in recent_results:\n",
" print(f\"Title: {doc.metadata['title']}\")\n",
" print(f\"Authors: {', '.join(doc.metadata['authors'])}\")\n",
" print(f\"Published: {doc.metadata['published']}\")\n",
" print(f\"Summary: {doc.text}\\n\")"
]
},
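{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same pattern gives the last-24-hours window mentioned in the introduction. Since dates are passed as `YYYYMMDD` strings here, the filter is day-granular; this sketch (left unexecuted) effectively searches from yesterday onward:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: last 24 hours, rounded down to day granularity\n",
"last_24_hours = (datetime.now() - timedelta(hours=24)).strftime(\"%Y%m%d\")\n",
"\n",
"recent_results = fetcher.search(\n",
" query=\"all:(agents AND cybersecurity)\",\n",
" from_date=last_24_hours,\n",
" to_date=datetime.now().strftime(\"%Y%m%d\"),\n",
" max_results=5\n",
")\n",
"\n",
"for doc in recent_results:\n",
" print(f\"Title: {doc.metadata['title']}\")\n",
" print(f\"Published: {doc.metadata['published']}\\n\")"
]
},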
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Download Top 5 Papers as PDF Files"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"INFO:dapr_agents.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)\n",
|
||
"INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100\n",
|
||
"INFO:arxiv:Got first page: 96 of 96 total results\n",
|
||
"INFO:dapr_agents.document.fetcher.arxiv:Found 5 results for query: all:(agents AND cybersecurity)\n",
|
||
"INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to arxiv_papers/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf\n",
|
||
"INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to arxiv_papers/2501.09709v1.CyberMentor__AI_Powered_Learning_Tool_Platform_to_Address_Diverse_Student_Needs_in_Cybersecurity_Education.pdf\n",
|
||
"INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to arxiv_papers/2501.00855v1.What_is_a_Social_Media_Bot__A_Global_Comparison_of_Bot_and_Human_Characteristics.pdf\n",
|
||
"INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to arxiv_papers/2412.20787v3.SecBench__A_Comprehensive_Multi_Dimensional_Benchmarking_Dataset_for_LLMs_in_Cybersecurity.pdf\n",
|
||
"INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to arxiv_papers/2412.13420v1.BotSim__LLM_Powered_Malicious_Social_Botnet_Simulation.pdf\n"
|
||
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloaded Paper: VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework\n",
"File Path: arxiv_papers/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf\n",
"\n",
"Downloaded Paper: CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education\n",
"File Path: arxiv_papers/2501.09709v1.CyberMentor__AI_Powered_Learning_Tool_Platform_to_Address_Diverse_Student_Needs_in_Cybersecurity_Education.pdf\n",
"\n",
"Downloaded Paper: What is a Social Media Bot? A Global Comparison of Bot and Human Characteristics\n",
"File Path: arxiv_papers/2501.00855v1.What_is_a_Social_Media_Bot__A_Global_Comparison_of_Bot_and_Human_Characteristics.pdf\n",
"\n",
"Downloaded Paper: SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity\n",
"File Path: arxiv_papers/2412.20787v3.SecBench__A_Comprehensive_Multi_Dimensional_Benchmarking_Dataset_for_LLMs_in_Cybersecurity.pdf\n",
"\n",
"Downloaded Paper: BotSim: LLM-Powered Malicious Social Botnet Simulation\n",
"File Path: arxiv_papers/2412.13420v1.BotSim__LLM_Powered_Malicious_Social_Botnet_Simulation.pdf\n",
"\n"
]
}
],
"source": [
"import os\n",
|
||
"from pathlib import Path\n",
|
||
"\n",
|
||
"# Create a directory for downloaded papers\n",
|
||
"os.makedirs(\"arxiv_papers\", exist_ok=True)\n",
|
||
"\n",
|
||
"# Search and download PDFs\n",
|
||
"download_results = fetcher.search(query=\"all:(agents AND cybersecurity)\", max_results=5, download=True, dirpath=Path(\"arxiv_papers\"))\n",
|
||
"\n",
|
||
"for paper in download_results:\n",
|
||
" print(f\"Downloaded Paper: {paper['title']}\")\n",
|
||
" print(f\"File Path: {paper['file_path']}\\n\")"
|
||
]
|
||
},
|
||
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'entry_id': 'http://arxiv.org/abs/2501.13411v1',\n",
" 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework',\n",
" 'authors': ['He Kong',\n",
"  'Die Hu',\n",
"  'Jingguo Ge',\n",
"  'Liangxiong Li',\n",
"  'Tong Li',\n",
"  'Bingzhen Wu'],\n",
" 'published': '2025-01-23',\n",
" 'updated': '2025-01-23',\n",
" 'primary_category': 'cs.SE',\n",
" 'categories': ['cs.SE'],\n",
" 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1',\n",
" 'file_path': 'arxiv_papers/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf'}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"download_results[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download Top 5 Papers as PDF Files (Include Summary)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:dapr_agents.document.fetcher.arxiv:Searching for query: all:(agents AND cybersecurity)\n",
"INFO:arxiv:Requesting page (first: True, try: 0): https://export.arxiv.org/api/query?search_query=all%3A%28agents+AND+cybersecurity%29&id_list=&sortBy=submittedDate&sortOrder=descending&start=0&max_results=100\n",
"INFO:arxiv:Got first page: 96 of 96 total results\n",
"INFO:dapr_agents.document.fetcher.arxiv:Found 5 results for query: all:(agents AND cybersecurity)\n",
"INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf\n",
"INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to more_arxiv/2501.09709v1.CyberMentor__AI_Powered_Learning_Tool_Platform_to_Address_Diverse_Student_Needs_in_Cybersecurity_Education.pdf\n",
"INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to more_arxiv/2501.00855v1.What_is_a_Social_Media_Bot__A_Global_Comparison_of_Bot_and_Human_Characteristics.pdf\n",
"INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to more_arxiv/2412.20787v3.SecBench__A_Comprehensive_Multi_Dimensional_Benchmarking_Dataset_for_LLMs_in_Cybersecurity.pdf\n",
"INFO:dapr_agents.document.fetcher.arxiv:Downloading paper to more_arxiv/2412.13420v1.BotSim__LLM_Powered_Malicious_Social_Botnet_Simulation.pdf\n"
]
}
],
"source": [
"import os\n",
"from pathlib import Path\n",
"\n",
"# Create a directory for downloaded papers\n",
"os.makedirs(\"more_arxiv\", exist_ok=True)\n",
"\n",
"# Search and download PDFs, including each paper's summary in the metadata\n",
"download_results = fetcher.search(\n",
"    query=\"all:(agents AND cybersecurity)\",\n",
"    max_results=5,\n",
"    download=True,\n",
"    dirpath=Path(\"more_arxiv\"),\n",
"    include_summary=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'entry_id': 'http://arxiv.org/abs/2501.13411v1',\n",
" 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework',\n",
" 'authors': ['He Kong',\n",
"  'Die Hu',\n",
"  'Jingguo Ge',\n",
"  'Liangxiong Li',\n",
"  'Tong Li',\n",
"  'Bingzhen Wu'],\n",
" 'published': '2025-01-23',\n",
" 'updated': '2025-01-23',\n",
" 'primary_category': 'cs.SE',\n",
" 'categories': ['cs.SE'],\n",
" 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1',\n",
" 'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf',\n",
" 'summary': 'Penetration testing is a vital practice for identifying and mitigating\\nvulnerabilities in cybersecurity systems, but its manual execution is\\nlabor-intensive and time-consuming. Existing large language model\\n(LLM)-assisted or automated penetration testing approaches often suffer from\\ninefficiencies, such as a lack of contextual understanding and excessive,\\nunstructured data generation. This paper presents VulnBot, an automated\\npenetration testing framework that leverages LLMs to simulate the collaborative\\nworkflow of human penetration testing teams through a multi-agent system. To\\naddress the inefficiencies and reliance on manual intervention in traditional\\npenetration testing methods, VulnBot decomposes complex tasks into three\\nspecialized phases: reconnaissance, scanning, and exploitation. These phases\\nare guided by a penetration task graph (PTG) to ensure logical task execution.\\nKey design features include role specialization, penetration path planning,\\ninter-agent communication, and generative penetration behavior. Experimental\\nresults demonstrate that VulnBot outperforms baseline models such as GPT-4 and\\nLlama3 in automated penetration testing tasks, particularly showcasing its\\npotential in fully autonomous testing on real-world machines.'}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"download_results[0]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Penetration testing is a vital practice for identifying and mitigating\n",
|
||
"vulnerabilities in cybersecurity systems, but its manual execution is\n",
|
||
"labor-intensive and time-consuming. Existing large language model\n",
|
||
"(LLM)-assisted or automated penetration testing approaches often suffer from\n",
|
||
"inefficiencies, such as a lack of contextual understanding and excessive,\n",
|
||
"unstructured data generation. This paper presents VulnBot, an automated\n",
|
||
"penetration testing framework that leverages LLMs to simulate the collaborative\n",
|
||
"workflow of human penetration testing teams through a multi-agent system. To\n",
|
||
"address the inefficiencies and reliance on manual intervention in traditional\n",
|
||
"penetration testing methods, VulnBot decomposes complex tasks into three\n",
|
||
"specialized phases: reconnaissance, scanning, and exploitation. These phases\n",
|
||
"are guided by a penetration task graph (PTG) to ensure logical task execution.\n",
|
||
"Key design features include role specialization, penetration path planning,\n",
|
||
"inter-agent communication, and generative penetration behavior. Experimental\n",
|
||
"results demonstrate that VulnBot outperforms baseline models such as GPT-4 and\n",
|
||
"Llama3 in automated penetration testing tasks, particularly showcasing its\n",
|
||
"potential in fully autonomous testing on real-world machines.\n"
|
||
]
}
],
"source": [
"print(download_results[0][\"summary\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading Downloaded PDFs\n",
"\n",
"To read the downloaded PDF files, we'll use the `PyPDFReader` class from `dapr_agents.document`. This allows us to extract the content of each page while retaining the associated metadata for further processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Ensure you have the required library for reading PDFs installed. If not, you can install it using the following command:\n",
"!pip install pypdf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code reads each downloaded PDF file and extracts its pages. Each page is stored as a separate `Document` object, containing both the page's text and the metadata associated with the original paper."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extracted 93 documents from the PDFs.\n"
]
}
],
"source": [
"from pathlib import Path\n",
"from dapr_agents.document import PyPDFReader\n",
"\n",
"# Initialize the PDF reader and a list to collect page-level documents\n",
"docs_read = []\n",
"reader = PyPDFReader()\n",
"\n",
"# Remove 'summary' from metadata in download_results to keep page metadata compact\n",
"for paper in download_results:\n",
"    paper.pop(\"summary\", None)  # Remove the 'summary' key if it exists\n",
"\n",
"# Process each downloaded PDF\n",
"for paper in download_results:\n",
"    local_pdf_path = Path(paper[\"file_path\"])  # Path returned by the download step\n",
"    documents = reader.load(local_pdf_path, additional_metadata=paper)  # One Document per page, with paper metadata attached\n",
"\n",
"    # Append each page's document to the main list\n",
"    docs_read.extend(documents)  # Flatten into one list of all documents\n",
"\n",
"# Verify the results\n",
"print(f\"Extracted {len(docs_read)} documents from the PDFs.\")"
]
},
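{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a minimal sketch, assuming the `Document` objects expose the `text` and `metadata` attributes shown in the output below), we can peek at the first extracted page:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the first page-level document produced above\n",
"first_page = docs_read[0]\n",
"print(f\"Title: {first_page.metadata['title']}\")\n",
"print(f\"Page: {first_page.metadata['page_number']} of {first_page.metadata['total_pages']}\")\n",
"print(first_page.text[:300])  # First 300 characters of the page text"
]
},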
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 1, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative\\nFramework\\nHe Kong1,2, Die Hu1,2, Jingguo Ge1,2, Liangxiong Li1, Tong Li1 , and Bingzhen Wu1\\n1State Key Laboratory of Cyberspace Security Defense, Institute of Information Engineering,\\nChinese Academy of Sciences\\n2School of Cyber Security, University of Chinese Academy of Sciences\\nAbstract\\nPenetration testing is a vital practice for identifying and miti-\\ngating vulnerabilities in cybersecurity systems, but its manual\\nexecution is labor-intensive and time-consuming. Existing\\nlarge language model (LLM)-assisted or automated penetra-\\ntion testing approaches often suffer from inefficiencies, such\\nas a lack of contextual understanding and excessive, unstruc-\\ntured data generation. This paper presents VulnBot, an au-\\ntomated penetration testing framework that leverages LLMs\\nto simulate the collaborative workflow of human penetration\\ntesting teams through a multi-agent system. To address the in-\\nefficiencies and reliance on manual intervention in traditional\\npenetration testing methods, VulnBot decomposes complex\\ntasks into three specialized phases: reconnaissance, scanning,\\nand exploitation. These phases are guided by a penetration\\ntask graph (PTG) to ensure logical task execution. Key design\\nfeatures include role specialization, penetration path plan-\\nning, inter-agent communication, and generative penetration\\nbehavior. Experimental results demonstrate that VulnBot out-\\nperforms baseline models such as GPT-4 and Llama3 in auto-\\nmated penetration testing tasks, particularly showcasing its\\npotential in fully autonomous testing on real-world machines.\\n1 Introduction\\nPenetration testing is a critical methodology for proactively\\nidentifying network vulnerabilities and mitigating potential\\ncyberattacks [5, 21]. It enables the timely detection of weak-\\nnesses in target systems, facilitating targeted remediation and\\nreinforcement efforts. According to market forecasts, the pen-\\netration testing market is projected to grow significantly, ex-\\npanding from US $1.92 billion in 2023 to US $6.98 billion\\nby 2032 [11]. Despite its importance, traditional penetration\\ntesting remains a labor-intensive and time-consuming pro-\\ncess, requiring highly skilled professionals to execute com-\\nplex workflows manually. As network threats continue to\\ngrow in both complexity and scale, there is an urgent need\\nfor more efficient, scalable, and automated penetration testing\\nmethodologies to reduce manual effort and enhance testing\\nefficiency [9].\\nRecent advancements in large language models (LLMs)\\nand multi-agent systems have opened new avenues for au-\\ntomating penetration testing [2, 10, 18, 30, 35, 47, 50]. Con-\\nsequently, researchers have proposed various approaches to\\nleverage LLMs for automated penetration testing. For in-\\nstance, Deng et al. 
introduced PentestGPT, a pioneering effort\\nto utilize LLMs for automating penetration testing [14]. Pen-\\ntestGPT addresses the issue of context loss in LLMs during\\npenetration testing through three interconnected modules: a\\nreasoning module, a generation module, and a parsing module.\\nHowever, PentestGPT heavily relies on human intervention\\nand cannot assess the extent of such involvement, resulting\\nin limited agent autonomy. In contrast, AutoAttacker, a novel\\nmethod, focuses on automating the post-penetration phase\\nof simulated network attacks (i.e., \"keyboard-operated\" at-\\ntacks) [56]. By employing a modular design, AutoAttacker\\nleverages the planning, summarization, and code generation\\ncapabilities of LLMs, combined with tools like Metasploit, to\\ndemonstrate the efficacy of LLMs in isolated security tasks.\\nNevertheless, AutoAttacker primarily targets specific tasks\\nrather than real-world environments. While existing studies\\nhave explored the use of LLMs for automated penetration test-\\ning, they are often limited in scope, focusing on specific tasks\\nor relying on detailed vulnerability descriptions, which are\\ndifficult to apply in real-world scenarios. Additionally, these\\nmethods predominantly depend on the GPT-4 model [2], mak-\\ning it challenging to execute complex tasks using open-source\\nmodels.\\nIn this paper, we present VulnBot, an autonomous, multi-\\nagent penetration testing framework based on LLMs, designed\\nto emulate the collaborative workflows of human penetration\\ntesting teams. By integrating specialized modules that focus\\non different phases of penetration testing, VulnBot aims to\\nautomate and streamline the process of identifying vulner-\\nabilities in target systems. The framework incorporates dis-\\ntinct roles, such as reconnaissance, scanning, and exploitation,\\nalong with a Penetration Task Graph (PTG)-based approach\\n1\\narXiv:2501.13411v1 [cs.SE] 23 Jan 2025'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 2, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='Target Environment\\nTask Description:I want to pentest 10.10.0.4 ……\\n(a) LLM Assistant-Guided Pentest Agent (b) Conventional Automated Pentest Agent (c) Collaborative Multi-Agent (Ours)\\nAction\\nnmap -sS -p- 10.10.0.4\\nHelp\\nAnalyse\\nTarget Environment\\nFirst\\nExtract\\nAction\\nnmap -sS -p- [Target IP]\\nFirst, perform a Nmap \\nscan … For this … Run \\nthe following command\\nto perform a SYN scan (-\\nsS) and … \\nToo many \\ninformation\\nTarget Environment\\nExtract\\nAction\\nnmap -sS -p- 10.10.0.4\\nFirst task, \\nyou should\\n(Thinking)\\nPhase 1\\nPhase 2\\nPhase 3\\nPlanner\\nGenerator\\nContext Loss\\nInefficient\\nSummarizer\\nGood\\nGood\\nFirst, perform a \\nNmap scan … For \\nthis, you can use \\nnmap, a powerful \\nnetwork scanning \\ntool …\\nFailed \\nCommand \\nPlan & Next Task\\nCommand\\nFigure 1: The workflow comparison of three approaches to automated penetration testing: (a) LLM Assistant-Guided Pentest\\nAgent, which requires assistance due to inefficiency; (b) Conventional Automated Pentest Agent, which struggles with information\\noverload and context loss; and (c) Collaborative Multi-Agent system, which employs a phased and modular approach, enhancing\\nthe overall efficiency and autonomy of the penetration testing process through multi-agent coordination.\\nto penetration path planning, inter-agent communication, and\\ngenerative penetration behavior. These components work to-\\ngether to simulate a robust and comprehensive penetration\\ntesting workflow.\\nAt the core of VulnBot’s design is its ability to model the\\npenetration testing process as a series of interdependent tasks,\\neach contributing to the overarching goal of identifying and\\nexploiting vulnerabilities in the target system. The PTG or-\\nganizes these tasks, ensuring that each step is executed in\\nthe correct sequence and context. VulnBot enhances inter-\\nagent communication through a Summarizer module, which\\nacts as a bridge between different phases of penetration test-\\ning. By summarizing key task outcomes and transmitting\\nthem to subsequent roles, the Summarizer ensures that crit-\\nical information is preserved and prioritized across stages.\\nThis targeted communication minimizes redundancy, ensures\\nclarity, and optimizes the flow of information across agents,\\nthereby maintaining the integrity and continuity of the pen-\\netration testing process. Furthermore, VulnBot’s Generator\\nand Executor modules translate these tasks into tool-specific\\ncommands, simulate human-like interactions with the target\\nsystem, and autonomously execute them, significantly reduc-\\ning the need for continuous human oversight. This paper in-\\ntroduces three operational modes: automatic, semi-automatic,\\nand human-involved. 
The experimental evaluation focuses\\non the automatic mode, as human involvement introduces\\nsubjectivity and variability that are difficult to quantify.\\nWe evaluated VulnBot across two distinct benchmarks to as-\\nsess its performance and real-world applicability. On the AU-\\nTOPENBENCH, VulnBot significantly outperformed base-\\nline models, including GPT-4o, Llama3.3-70B, and Llama3.1-\\n405B. Specifically, VulnBot-Llama3.1-405B achieved a com-\\npletion rate of 30.3%, compared to 9.09% for Llama3.1-405B\\nand 21.21% for GPT-4o. Additionally, VulnBot demonstrated\\nsuperior performance in the early stages of the test. By de-\\nlaying the automation of penetration testing to later stages,\\nVulnBot ensures that critical subtasks are executed with\\ngreater precision, thereby increasing the likelihood of complet-\\ning the testing process. Ablation studies further confirmed the\\neffectiveness of the various components within the framework.\\nOn real-world machines using the AI-Pentest-Benchmark,\\nVulnBot—when paired with Llama3.1-405B and DeepSeek-\\nv3—surpassed other baseline models. When integrated with\\nRetrieval Augmented Generation (RAG), VulnBot’s perfor-\\nmance further improved. In real-world machines, VulnBot\\nwith RAG autonomously completed tasks end-to-end, a feat\\nthat GPT-4o and Llama3.1-405B, which relied on human inter-\\nvention, could not achieve. These results highlight VulnBot’s\\npotential for fully autonomous penetration testing.\\nThe contributions of this work are as follows:\\n• We introduce VulnBot, an autonomous penetration test-\\ning framework that leverages the capabilities of LLMs\\nand multi-agent systems to automate complex penetra-\\ntion testing workflows. Inspired by the collaborative\\ndynamics of human penetration testing teams, VulnBot\\nemploys a tri-phase design—reconnaissance, scanning,\\nand exploitation. This design ensures that agents focus\\non specific tasks, minimizing information loss and en-\\nhancing efficiency.\\n• We propose a task-driven mechanism based on a PTG,\\nwhich models tasks and their dependencies as a directed\\n2'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 3, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='acyclic graph. The PTG ensures that tasks are executed\\nin a logical and conflict-free order, providing a struc-\\ntured framework for tracking task progress and outcomes.\\nThis mechanism, combined with a Check and Reflection\\nMechanism, enables continuous improvement and adap-\\ntation of the plan based on task status, achieving effective\\nerror handling and feedback.\\n• By utilizing open-source models such as Llama3.3-70B,\\nLlama3.1-405B, and DeepSeek-V3, we demonstrate the\\nfeasibility and effectiveness of leveraging open-source\\nLLMs for automated penetration testing. Our experi-\\nmental results show that VulnBot outperforms baseline\\nmodels such as GPT-4 and Llama3, achieving a 69.05%\\nsubtask completion rate and a 30.3% overall completion\\nrate on the AUTOPENBENCH. Additionally, VulnBot\\nachieved the best performance on six real-world ma-\\nchines in the AI-Pentest-Benchmark. Through the inte-\\ngration of RAG, VulnBot successfully realized complete\\nend-to-end penetration of real-world machines.\\n2 Background & Motivation\\n2.1 Background\\nPenetration testing, also referred to as ethical hacking, is a\\nmethod employed to evaluate the security of computer sys-\\ntems, networks, or applications by simulating potential ma-\\nlicious attacks. The primary objective is to identify and re-\\nmediate potential vulnerabilities before they can be exploited\\nby real attackers [1, 11]. According to the OWASP Testing\\nGuide [45], penetration testing typically consists of five key\\nphases: reconnaissance, scanning, vulnerability exploitation,\\nmaintaining access, and reporting [8]. The duration of these\\nphases varies depending on the scope of the test. On average,\\nthe entire process takes approximately 10 days, with the recon-\\nnaissance phase being the most time-consuming, often lasting\\nbetween 4 to 6 days [7, 53]. The cost of penetration testing\\nis also influenced by the type and scope of the assessment.\\nFor example, a basic website scan typically costs between US\\n$349 and US $1499, while more comprehensive assessments,\\nsuch as SaaS or web application scanning, can range from US\\n$700 to US $5999 [46, 49].\\n2.2 Motivation\\nTraditional penetration testing is both time-intensive and\\ncostly, highlighting the need for more efficient, automated\\nsolutions. Current approaches leveraging LLM-assisted or\\nautomated agents for penetration testing face notable ineffi-\\nciencies. As illustrated in Figure 1, the LLM Assistant-Guided\\nPentest Agent (a) lacks autonomy and requires frequent user\\nintervention to clarify tasks, resulting in inefficiency. Simi-\\nlarly, the Conventional Automated Penetration Testing Agent\\n(b) generates an excessive amount of unstructured data but\\nfails to provide actionable insights or clear next steps, lead-\\ning to context loss and command failures. 
The Collaborative\\nMulti-Agent System (c) addresses these limitations by lever-\\naging specialized agents for reconnaissance, scanning, and\\nexploitation. This system employs a modular and phased\\napproach, effectively managing tasks through coordinated\\nplanning, generation, and summarization, thereby enhancing\\nthe overall efficiency and autonomy of the penetration testing\\nprocess.\\n2.2.1 Task Definition\\nIn this paper, we define autonomous penetration testing as\\nencompassing two types of tasks: those requiring human in-\\ntervention, where penetration testers provide guidance, and\\nthose conducted entirely without human intervention. This\\nwork specifically focuses on the latter—tasks performed au-\\ntonomously, without the need for human oversight. Due to\\ntime and cost constraints, we leverage open-source models to\\nminimize expenses.\\n2.2.2 Exploratory Study\\nBefore investigating methods for automated penetration test-\\ning, we conducted an empirical study to address three research\\nquestions that explore the challenges of using open-source\\nLLMs in this domain:\\nRQ1: To what extent can open-source LLMs perform pen-\\netration testing tasks?\\nTo address RQ1, we reviewed the existing literature and\\nevaluated the performance of open-source LLMs in penetra-\\ntion testing contexts. Isozaki et al. conducted an analysis of\\nopen-source LLMs, introducing the AI-Pentest-Benchmark,\\nwhich comprises 13 real machines from Vulnhub [33, 51].\\nThey tested two prominent models, GPT-4o and Llama3.1-\\n405B, using the PentestGPT tool. The study revealed that\\nLlama3.1-405B outperformed GPT-4o in reconnaissance and\\nexploitation tasks for machines of easy and medium difficulty.\\nHowever, both models encountered challenges in privilege\\nescalation and tasks involving high-difficulty machines.\\nRQ2: What are the reasons for the failure of penetration\\ntesting using open-source LLMs?\\nRQ3: How do open-source LLMs perform in the different\\nphases of penetration testing?\\nTo answer RQ2 and RQ3, we utilized the AUTOPEN-\\nBENCH benchmark [22], which includes 33 tasks designed to\\nsimulate real-world penetration testing scenarios. These tasks\\nare categorized into two difficulty levels: in-vitro tasks (basic\\nnetwork security scenarios) and real-world tasks (based on\\npublicly disclosed CVEs). We conducted further analysis of\\nthe in-vitro tasks using the 128k-context Llama3.3-70B and\\nLlama3.1-405B models, with each test executed five times, to\\n3'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 4, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='Model Phase Failure Count Cause of Failure\\nSession Context LossFalse Output InterpretationFailed ToolDeadlock OperationFailed Command ParamOther\\nLlama3.3-70B\\nReconnaissance 28 18 (64.29%) 3 (10.71%) 2 (7.14%) 0 (0.00%) 4 (14.29%) 1 (3.57%)\\nScanning 27 5 (18.52%) 2 (7.41%) 8 (29.63%) 3 (11.11%) 7 (25.93%) 2 (7.41%)\\nExploitation 45 16 (35.56%) 4 (8.89%) 13 (28.89%) 2 (4.44%) 8 (17.78%) 2 (4.44%)\\nLlama3.1-405B\\nReconnaissance 43 28 (65.12%) 4 (9.30%) 1 (2.33%) 5 (11.63%) 3 (6.98%) 2 (4.65%)\\nScanning 27 4 (14.81%) 3 (11.11%) 9 (33.33%) 0 (0.00%) 11 (40.74%) 0 (0.00%)\\nExploitation 33 15 (45.45%) 2 (6.06%) 7 (21.21%) 1 (3.03%) 6 (18.18%) 1 (3.03%)\\nTotal 203 86 (42.36%) 18 (8.87%) 40 (19.70%) 11 (5.42%) 39 (19.21%) 8 (3.94%)\\nTable 1: Failure counts and causes for open-source LLMs in different phases\\ngain a deeper understanding of the limitations of open-source\\nmodels in penetration testing tasks.\\nTo investigate the causes of failure in penetration testing ex-\\nperiments, we conducted a detailed analysis and classification\\nof the results from 220 experiments, as summarized in Table\\n1. The primary cause of failure across both the reconnais-\\nsance and exploitation stages was the loss of session context.\\nHowever, the specific causes of failure varied between stages.\\nDuring the reconnaissance phase, the models frequently strug-\\ngled to understand the initial description provided by the user,\\nsuch as failing to execute commands like nmap -p 10.10.1.x\\nfor a comprehensive port scan. In the exploitation phase, the\\nmodels often forget previously scanned target machines or\\nrelevant information obtained during earlier stages. This con-\\ntext loss is primarily attributed to two factors: the limited size\\nof the context window and the token constraints inherent to\\nLLMs. When the critical data from a complex task exceeds\\nthe available context, important details may be truncated, lead-\\ning to the loss of vital information. Moreover, if the execution\\nresults are excessively lengthy, the context can become over-\\nloaded, further diminishing the model’s effectiveness. Overall,\\nTable 1 emphasizes the need for targeted improvements in\\nsession management, tool reliability, and command parameter\\naccuracy to enhance the robustness of open-source LLMs in\\npractical applications.\\n2.3 Challenge\\nTakeaway#1: LLM Context Length.A significant limitation\\nof LLMs is their fixed context length, which impedes their\\nability to maintain a coherent understanding of the entire\\npenetration testing process. 
As the model progresses through\\nthe various stages of the test, it often loses track of earlier\\ndiscoveries, leading to a failure to leverage prior insights.\\nThis context loss causes the model to forget critical steps or\\nfindings, thereby hindering task completion and adversely\\naffecting overall performance.\\nTakeaway#2: Penetration Command Generation.LLMs\\nfrequently encounter difficulties in generating accurate pene-\\ntration testing commands. They may produce incorrect tool\\nusage or fabricate non-existent parameters. Automated pene-\\ntration testing requires precise translation of natural language\\ninstructions into executable commands. However, the inabil-\\nity of current LLMs to reliably perform this translation intro-\\nduces significant errors and inefficiencies, undermining the\\naccuracy and reliability of the testing process.\\nTakeaway#3: Lack of Effective Error-Handling Mech-\\nanism. Current LLM-based systems lack an effective error-\\nhandling mechanism to manage command execution failures\\nor anomalies. When an error occurs, the model typically can-\\nnot autonomously diagnose the issue or take corrective ac-\\ntions. Consequently, manual intervention is often required to\\nresolve problems and resume testing, which diminishes the\\noverall automation and efficiency of the system.\\nTakeaway#4: Dynamic Reasoning Across Testing\\nPhases. Penetration testing involves multiple, interdependent\\nphases—reconnaissance, scanning, exploitation, and post-\\nexploitation—each of which builds on the information gath-\\nered in previous stages. For effective automation, it is not\\nenough for a system to perform well in isolated phases; it\\nmust also integrate findings dynamically to guide subsequent\\nactions. For example, insights gained during scanning must\\ninform exploitation strategies. Current systems struggle to\\nmaintain this dynamic flow, often requiring human oversight\\nto link findings across phases. This limitation results in frag-\\nmented analyses, where critical connections between discover-\\nies are missed. The inability to synthesize information across\\nmultiple stages becomes especially problematic in complex\\nscenarios.\\n3 Design\\nIn this section, we present the design of VulnBot, an au-\\ntonomous penetration testing framework for LLM-based\\nmulti-agent systems. We begin by providing an overview\\nof the overall architecture of VulnBot in Section 3.1. Sub-\\nsequently, we elaborate on the four key design aspects of\\nVulnBot: (1) specialization of roles (Section 3.2), (2) penetra-\\ntion path planning, which incorporates the Planner and Mem-\\nory Retriever modules (Section 3.3), (3) inter-agent commu-\\nnication, facilitated by the Summarizer module (Section 3.4),\\nand (4) generative penetration behavior and interaction, en-\\n4'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 5, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='abled by the Generator and Executor modules (Section 3.5).\\n3.1 Overview\\nThe overall architecture of VulnBot is illustrated in Figure 2.\\nVulnBot is an autonomous penetration testing framework de-\\nsigned to emulate the collaborative and strategic workflows of\\nhuman penetration testing teams. The system is built around\\nfive core modules—Planner, Memory Retriever, Generator,\\nExecutor, and Summarizer—which collectively automate the\\nthree primary phases of penetration testing: Reconnaissance,\\nScanning, and Exploitation. This design addresses the com-\\nplexities of automating penetration testing tasks while ensur-\\ning adaptability to unforeseen challenges, thereby maintaining\\nrobustness across diverse testing scenarios.\\n3.2 Specialization of Roles\\nDrawing from Takeaways 1 and 4, we employ a specialization\\nof roles mechanism. Clear role specialization is a cornerstone\\nof effective problem-solving in complex systems. By decom-\\nposing intricate tasks into smaller, well-defined subtasks, spe-\\ncialized roles enable agents to focus on specific objectives,\\nleveraging their unique expertise to contribute to the overall\\ngoal. In the context of penetration testing, this approach is\\nparticularly critical, as the process involves multiple interde-\\npendent phases, each requiring distinct skills and tools.\\nOur design encountered a significant challenge stemming\\nfrom the context length limitations of LLMs. When executing\\nthe five-phase process, critical information from earlier phases\\nis often lost or diluted as the process progresses. This occurs\\nbecause each phase must retain and reference information\\nfrom all preceding phases, not just the immediate prior one.\\nTo address this limitation, we restructured the penetration\\ntesting process into three specialized phases: reconnaissance,\\nscanning, and exploitation. This streamlined approach ensures\\nthat each phase maintains a clear focus while minimizing in-\\nformation loss across transitions. We provide the agent with\\ntask instructions in the form of text, including the task descrip-\\ntion, a role-playing jailbreak method [13, 38, 56] to bypass\\nLLM usage policies, and additional preliminary information\\nabout the agent.\\nReconnaissance The reconnaissance phase serves as the\\nfoundation of the penetration testing process, aimed at gath-\\nering comprehensive information about the target system. In\\nthis phase, agents are tasked with performing a full scan of the\\ntarget to identify all open ports and services. 
To achieve this,\\nwe equip reconnaissance agents with tools such as Nmap [44]\\nand Dirb [17], which are widely used for network discovery.\\nBy systematically collecting and organizing this data, the re-\\nconnaissance phase provides the necessary context for the\\nsubsequent scanning phase.\\nScanning Building on the data gathered during reconnais-\\nsance, the scanning phase focuses on identifying vulnerabili-\\nties and misconfigurations within the target system. Agents in\\nthis phase utilize specialized tools such as Nikto [43] (for web\\nserver vulnerability scanning) and WPScan [54] (for identi-\\nfying issues with WordPress sites) to detect potential weak-\\nnesses. The scanning phase is critical for narrowing down the\\nattack surface and prioritizing vulnerabilities that are most\\nlikely to be exploitable. By maintaining a clear separation\\nbetween reconnaissance and scanning, we ensure that agents\\ncan focus on their specific tasks without being overwhelmed\\nby extraneous information.\\nExploitation The exploitation phase marks the culmina-\\ntion of the penetration testing process, where vulnerabilities\\ndiscovered during reconnaissance and vulnerability scanning\\nare exploited to gain access to the target system and escalate\\nprivileges. In this phase, agents are equipped with tools such\\nas Metasploit [39] (for developing and executing exploit code)\\nand Hydra [32] (for brute-forcing credentials).\\nThis design ensures that each phase builds upon the pre-\\nvious one, enabling a seamless and effective workflow that\\nadapts to the complexities of real-world systems.\\n3.3 Penetration Path Planning\\nPenetration path planning is a critical component of VulnBot,\\nwhich incorporates the Planner and Memory Retriever mod-\\nules. The Planner module is responsible for generating and\\nmaintaining the penetration testing plan. It operates through\\ntwo distinct sessions: the Plan Session and the Task Session,\\neach serving a specific purpose in the planning and execution\\nprocess.\\nPlan Session: The Planner initially generates an action\\nplan in a JSON-compliant structure, tailored to the user’s re-\\nquirements and the characteristics of the target system. This\\nplan is decomposed into structured task lists, each defined\\nby unique identifiers, dependencies, instructions, and action\\ntypes, As shown in Figure 3. The primary objective is to con-\\nstruct a Penetration Testing Task Graph (PTG), which outlines\\nthe logical sequence of tasks to be executed. Subsequently,\\nthe plan is dynamically updated based on the results of task\\nexecution, incorporating feedback from both successful and\\nfailed tasks.\\nThis session is governed by two key mechanisms: theTask-\\ndriven Mechanism (Section 3.3.1), which organizes tasks\\ninto a directed acyclic graph, and the Check and Reflection\\nMechanism (Section 3.3.2), which ensures continuous im-\\nprovement and adaptation of the plan through iterative feed-\\nback from task execution result.\\nTask Session: This session focuses on generating specific\\ntask details for each instruction, which are then fed into the\\nGenerator for execution. Additionally, it is responsible for\\nchecking task execution results success.\\nTo mitigate the hallucination problem often associated with\\nLLMs, we employ a third-party retrieval-augmented genera-\\ntion framework, Langchain-Chatchat [37]. The Memory Re-\\n5'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 6, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='Name\\nTools\\nGoal\\nScanner\\nBased on the reconnaissance results, further enumeration and \\ncheck for vulnerabilities and misconfigurations in the target.\\nNikto, WPScan, ...\\nReconnaissance\\nScanning\\n Exploitation\\nScanning\\nMessage\\nTarget Enviroment\\nExecution\\nCommand\\nNext Task\\nHistory Message\\nSuccessful Tasks\\nMessage\\nPlanner Generator\\nMessage\\nSummarizer Executor\\nMemory \\nRetriever\\nNext Task\\nDetails\\nGenerated\\nPlan DAG\\nProcessing\\nBuild\\nFilter\\nTask\\nNode\\nPenetration Path Planning\\nRole\\nProfile\\nFeedback\\nFigure 2: Overview of VulnBot.\\ntriever module utilizes a vector database to store embeddings\\nof successful tasks and prior penetration knowledge. When\\ngenerating or updating plans, the system converts the current\\nplan into embedding vectors and computes their similarity\\nwith stored vectors using a text embedding model. The top k\\nmost similar vectors are retrieved, and a re-ranking algorithm\\nis applied to select the optimal option. This approach ensures\\nthat the system can leverage past experiences and knowledge\\nto enhance its planning decisions. The role of the Memory\\nRetriever module in supporting the Planner is discussed in\\ngreater detail in Section 5.4.\\n3.3.1 Task-driven mechanism\\nThe task-driven mechanism is centered around the concept of\\na Penetration Testing Task Graph (PTG), a structured repre-\\nsentation of tasks and their dependencies in the penetration\\ntesting process. The PTG ensures that tasks are executed in a\\nlogical and conflict-free order while providing a framework\\nfor tracking task progress and execution result status.\\nDefinition 1 (Penetration Task Graph) A Penetration\\nTask Graph (PTG) is a directed acyclic graph G = (V, E)\\nwhere:\\n• V is the set of nodes, each representing an individual\\ntask in the penetration testing process. Each task node\\nv ∈V has a unique identifier and contains the following\\nattributes:\\n– Instruction: Describes the primary task directive\\n(e.g., “enumerate open ports on the target ma-\\nchine”).\\n– Action: Defines the operation type, such as shell\\nor manual.\\n– Dependencies: A list of other task identifiers that\\nmust be completed before this task can be executed,\\nensuring proper sequencing.\\n– Command: The specific command to execute, is\\ngenerated by the Generator module.\\n– Result: The result returned from executing the\\ntask.\\n– Finished Status: Indicates whether the task has\\nbeen completed or is pending.\\n– Success Status: Indicates whether the task was\\nsuccessful or not.\\n• E is the set of directed edges, representing dependencies\\nbetween tasks. If task T1 must be executed before task\\n6'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 7, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='T2, then there exists a directed edge from T1 to T2. These\\ndependencies determine the execution order of tasks in\\nthe overall penetration testing process.\\n{\\n \"id\": \"1\", \"dependencies\": [],\\n \"instruction\": \"Use the credentials (wavex:door+open) to SSH into \\nthe target machine (IP: 192.168.1.104, Port: 22.\",\\n \"action\": \"Shell\"\\n},\\n{\\n \"id\": \"2\", \"dependencies\": [\"1\"],\\n \"instruction\": \"Search for writable directories on the target machine \\nusing the command: \\'find / -writable -type d 2>/dev/null\\'.\",\\n \"action\": \"Shell\"\\n},\\n{\\n \"id\": \"3\", \"dependencies\": [\"1\"],\\n \"instruction\": \"Enumerate running processes on the target machine \\nusing the command: \\'ps aux\\'.\",\\n \"action\": \"Shell\"\\n},\\n……\\n{\\n \"id\": \"9\", \"dependencies\": [\"5\", \"8\"],\\n \"instruction\": \"Exploit the sudo permissions to escalate privileges to \\nroot using the command \\'sudo su\\'.\",\\n \"action\": \"Shell\"\\n}\\nTask ListTask List\\nTask 1\\nTask 2\\nTask 3\\nTask 5\\nTask 4\\nTask 6\\nTask 7Task 8\\nTask 9\\nFigure 3: The process of generating Penetration Task Graph\\n(PTG). The green circle represents the current task being\\nexecuted, while the dark circle indicates that the task has been\\nsuccessfully completed.\\nThe PTG is designed such that each task is dependent on one\\nor more preceding tasks. This structure ensures that tasks\\nare organized to facilitate efficient sequencing and execution,\\nthereby maintaining a logical and systematic order through-\\nout the penetration testing process. As depicted in Figure 3,\\nan example of a PTG is presented. On the left side, a task\\nlist is provided in JSON format, detailing each task along\\nwith its dependencies, instructions, and actions. For instance,\\nTask 1 involves using specific credentials to SSH into a tar-\\nget machine located at IP address 192.168.1.104 on port 22.\\nSubsequent tasks, such as searching for writable directories\\n(Task 2) and enumerating running processes (Task 3), are con-\\ntingent upon the successful completion of Task 1. The right\\nside of the figure illustrates these tasks in a dependency graph,\\nwhere each node represents a task, and the arrows indicate\\nthe dependencies between them. This visual representation\\nelucidates the sequence and interdependencies of tasks, en-\\nsuring that each step is executed only after its prerequisites\\nhave been satisfied. This structured approach enhances the\\nefficiency and effectiveness of the penetration testing process\\nby systematically guiding the system through each required\\naction.\\n3.3.2 Check and Reflection Mechanism\\nThe ability to reanalyze failed tasks is critical for the success\\nof penetration testing. As highlighted in Takeaway 3, LLMs\\noften lack effective error-handling mechanisms. Can this limi-\\ntation be addressed through a reflection mechanism? 
Existing\\nmethods frequently lack self-correction capabilities, and due\\nto the hallucination problem associated with LLMs, they of-\\nten generate erroneous commands and parameters, as noted\\nin Takeaway 2. Another significant challenge is enabling the\\nLLM to accurately interpret the status of task execution results.\\nTo address these issues, we introduce a Check and Reflection\\nMechanism within the task session.\\nThe Task Session evaluates the results of task execution\\nand updates the task success status. The Plan Session then\\nreflects on the feedback from both successful and failed tasks,\\nautomatically updates the prompt words, and revises the plan\\naccordingly. Successful tasks are retained in the plan, while\\nfailed tasks are flagged for reanalysis. This iterative process\\nensures continuous improvement and adaptation, enhancing\\nthe system’s ability to recover from errors and optimize its\\nperformance.\\nTo facilitate this process, we employ the Merge Plan Al-\\ngorithm (Algorithm 1), which integrates new tasks into the\\nexisting plan while preserving completed tasks and their de-\\npendencies. The algorithm first identifies completed tasks that\\nare not present in the new task list and adds them to the merged\\nplan. It then processes new tasks, updating their sequences\\nand dependencies if they already exist in the completed tasks,\\nor creating new tasks if they do not.\\nAlgorithm 1 Merge Plan Algorithm\\n1: Input:\\n2: newTasks (List of new tasks)\\n3: oldTasks (List of old tasks)\\n4: Output:\\n5: mergedTasks (List of merged tasks)\\n6: completedTasks ← GETCOMPLETED TASKS (oldTasks )\\n7: mergedTasks ← []\\nStep 1: Add completed tasks not in the new task list\\n8: for all task ∈ completedTasks do\\n9: if EXISTS IN(task, newTasks) = false then\\n10: mergedTasks ← mergedTasks ∪{task}\\n11: end if\\n12: end for\\nStep 2: Process new tasks and merge with completed\\ntasks\\n13: for all newTask ∈ newTasks do\\n14: task ← GETTASK (newTask, completedTasks )\\n15: if task ̸= null then\\n16: UPDATE SEQUENCE (task)\\n17: UPDATE DEPENDENCIES (task)\\n18: else\\n19: task ← CREATE NEWTASK (newTask)\\n20: end if\\n21: mergedTasks ← mergedTasks ∪{task}\\n22: end for\\n23: return mergedTasks\\n7'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 8, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='3.4 Inter-Agent Communication\\nEffective message passing between agents is a critical com-\\nponent for successful collaboration in multi-agent systems.\\nIn this work, agents communicate using natural language,\\nwhich ensures clarity and interoperability. Accurate informa-\\ntion extraction is essential to optimize token usage and avoid\\nverbosity, which is particularly important given the constraints\\nof LLMs. The Summarizer module acts as a communication\\nbridge between roles, ensuring that key information from\\nsuccessfully completed tasks in one stage is seamlessly trans-\\nferred to the next. For instance, during the reconnaissance\\nphase, the Summarizer consolidates data such as identified\\nopen ports, service banners, operating system fingerprints,\\nand software versions. This enables the scanning role to effi-\\nciently locate its tasks, reduce redundant work, and streamline\\nworkflows, thereby minimizing information overload. The\\nPlanner module of the next role can easily interpret these nat-\\nural language summaries. For instance, if the scanning task\\nidentifies a vulnerability in a web application, the Summarizer\\nhighlights this vulnerability, enabling the exploitation role to\\nprioritize its actions effectively. This preserves the integrity\\nof the penetration testing process and maintains a coherent\\nworkflow.\\nAdditionally, the Summarizer maintains a summary of the\\ncurrent shell state to facilitate shell sharing across roles. For\\nexample, if the system gains access to a low-privileged user\\naccount (e.g., a student account) on the target machine from\\nthe attack machine (e.g., Kali Linux), the Summarizer records\\nthis state. This allows subsequent penetration paths to be\\nplanned based on the current shell status, ensuring continuity\\nand context preservation. By facilitating seamless communi-\\ncation between roles and prioritizing actionable insights, the\\nSummarizer enhances the efficiency and effectiveness of the\\nmulti-agent system.\\n3.5 Generative Penetration Behavior and In-\\nteraction\\nVulnBot operates in three distinct modes to accommodate\\nvarying levels of automation and user involvement: automatic,\\nmanual, and semi-automatic. These modes provide flexibility\\nin task execution, ensuring the system can adapt to diverse\\noperational requirements and user preferences.\\nAutomatic Mode: In this mode, VulnBot operates fully\\nautonomously, executing all tasks without human interven-\\ntion. The experimental evaluation in this paper focuses on the\\nautomatic mode, as it provides a consistent and objective basis\\nfor assessing system performance. While human participation\\ncan add value, it introduces subjectivity and variability that\\nare challenging to quantify.\\nManual Mode: In manual mode, the user actively executes\\ncommands and provides the results to the system. 
This mode\\nis particularly useful in scenarios requiring human expertise\\nto interpret complex or ambiguous outputs, ensuring nuanced\\ndecision-making.\\nSemi-Automatic Mode: Semi-automatic mode combines\\nthe strengths of both automatic and manual modes. In this\\nmode, tasks in the Penetration Testing Task Graph (PTG) are\\nexecuted based on their action type:\\n• If the action is classified as a shell command, the system\\nexecutes it automatically.\\n• If the action is marked as manual, the user executes the\\ncommand and returns the results to the system.\\nThis hybrid approach offers greater flexibility and control, en-\\nabling users to intervene when necessary while still leveraging\\nthe system’s automation capabilities.\\nThe Generator module plays a critical role in converting\\nthe next task provided by the Planner module into a tool-\\nspecific command tailored to the target and context of the\\ncurrent role. For example, an instruction for a reconnaissance\\ntask to enumerate open ports might be translated into a\\ncommand such as nmap -sV -p 22,80 <target-ip>, with\\nparameters optimized for the specific tool and scenario.\\nThe Executor module handles the execution of generated\\ncommands, maintaining an interactive shell with the attack\\nmachine (e.g., Kali Linux) using the Python Paramiko tool\\nlibrary. This module simulates human keyboard operations,\\nenabling seamless interaction with the target system. After\\nexecuting a command, the Executor returns the results to the\\nPlanner module for further analysis. To address the challenge\\nof overly long or redundant output, the system employs a fil-\\ntering mechanism: if the task execution result exceeds 8,000\\ncharacters, the LLM is used to extract key information. This\\nensures that only relevant and actionable insights are passed\\nto subsequent stages, improving system performance and re-\\nducing the risk of information overload.\\nTogether, the Generator and Executor modules create a\\nseamless and adaptive workflow for penetration testing. By\\ntransforming abstract plans into precise actions and ensuring\\ntheir effective execution, these modules provide the system\\nwith a robust and efficient execution pipeline.\\n4 Implementation\\nThe VulnBot prototype was implemented using approximately\\n3,000 lines of Python code, complemented by meticulously\\ndesigned prompts. The source code is publicly accessible at:\\nhttps://github.com/KHenryAegis/VulnBot.\\n4.1 Evaluation Settings\\nThe experiments were conducted in a controlled environment\\nusing the 2023 Kali Linux platform [34], which served as the\\nattacking machine. This platform was chosen for its compre-\\nhensive and reliable suite of penetration testing tools. The\\n8'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 9, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='primary models utilized in our evaluation include: Llama3.3-\\n70B and Llama3.1-405B, both with a context length of 128k\\ntokens. DeepSeek-v3, configured with a context length of 64k\\ntokens. The key experimental parameters were configured\\nas follows: The LLM Temperature was set to 0.5 to strike a\\nbalance between creativity and determinism in the model’s\\nresponses. For the AUTOPENBENCH, task execution was re-\\nstricted to 15 steps (i.e., VulnBot was allowed a maximum of\\n5 steps per phase). Similarly, for the AI-Pentest-Benchmark,\\ntask execution was limited to 24 steps (i.e., VulnBot was al-\\nlowed a maximum of 8 steps per phase).\\nThe Memory Retriever module was employed exclusively\\nfor the Retrieval Augmented Generation (RAG) experiments.\\nThis module enhances the system’s ability to retrieve rel-\\nevant historical data and prior knowledge, thereby mitigat-\\ning the hallucination problem often encountered with LLMs.\\nThe Memory Retriever is integrated with a vector database\\nand utilizes text embedding models to retrieve and leverage\\npast tasks and penetration knowledge efficiently. We use the\\nbce-embedding-base-v1 model [42] for embedding and the\\nbce-reranker-base-v1 model [42] for re-ranking. The system\\nretrieves the top 3 most similar vectors with a relevance score\\ngreater than 0.5, after which the re-ranking algorithm selects\\nthe most relevant tasks.\\nTo benchmark and validate our framework, we modified\\nPentestGPT, retaining all its original prompts. The Generator\\nand Executor modules extracted operational commands from\\nPentestGPT’s output and executed them in the target environ-\\nment. The results were then directly fed back into PentestGPT\\nfor analysis, simulating a manual copy-paste process without\\nadditional interpretation. Specifically, we used the ‘next‘ in\\nPentestGPT, which takes the results of previous test execu-\\ntions as input and generates the next step. Then, the ‘tool‘\\nprocesses security tool outputs and determines the next action.\\nWe also ensure the accuracy of command extraction and exe-\\ncution by the Generator and Executor in the baseline method,\\npreventing any potential impact on the reliability of our com-\\nparison and, in turn, guaranteeing the overall reliability of our\\nautomated penetration testing framework.\\nAfter evaluating both initial prompts for the Base LLM, we\\ndetermined that the following prompt yielded better results\\n[14, 22]. The Base LLM Initialization Prompt used in this\\npaper is as follows:\\nBase LLM Initialization\\nRole: You are a CTF player. You are working towards\\nthe final task in a step-by-step manner.\\nInstruction: At each run, focus on the observations\\nto provide the next action.\\n5 Evaluation\\nIn this section, we present a series of experiments designed to\\nevaluate the performance and functionality of VulnBot. 
The\\nfollowing research questions (RQs) guide this evaluation:\\nRQ1: How does the performance of VulnBot using the\\nopen-source model compare to that of baseline models (Sec-\\ntion 5.1)?\\nRQ2: How do key components such as role specialization,\\nPTG, and Summarizer influence the performance of VulnBot\\nin penetration testing tasks (Section 5.2)?\\nRQ3: How effective is VulnBot in real-world penetration\\ntesting scenarios (Section 5.3)?\\nRQ4: How does the integration of the Memory Retriever\\nmodule improve the performance of VulnBot in real-world\\npenetration testing tasks (Section 5.4)?\\n5.1 Performance Evaluation (RQ1)\\nWe evaluate the performance of VulnBot using the AU-\\nTOPENBENCH, which encompasses a diverse set of pen-\\netration testing tasks categorized into Access Control (AC),\\nWeb Security (WS), Network Security (NS), Cryptography\\n(CRPT), and Real-world scenarios. The experiments were\\nconducted using several state-of-the-art models, including\\nGPT-4o (gpt-4o-2024-08-06), Llama3.3-70B, and Llama3.1-\\n405B, both in their base configurations and integrated into\\nour framework. The data for GPT-4o is sourced from the [22].\\nGPT-4o step limits were set to 30 for in-vitro tasks and 60\\nfor real-world tasks. Table 2 presents the overall penetration\\ntask completion rates, while Table 3 details the subtask com-\\npletion rates. The term \"1 Experiment\" refers to the overall\\nsubtask completion rate across five experiments, where a sub-\\ntask is considered successful if it succeeds in at least one\\nexperiment. The term \"5 Experiments\" denotes the number\\nof subtasks completed in all five experiments. Additionally,\\nFigure 4 illustrates the stages at which failures occurred in\\nthe five experiments.\\nThe overall task completion rates across different models\\nare summarized in Table 2. As indicated, VulnBot consistently\\noutperforms the baseline models across various categories.\\nSpecifically, the VulnBot-Llama3.1-405B model achieves a\\n30.30% completion rate in overall tasks, representing a sig-\\nnificant improvement over the baseline models. This result\\nsuggests that VulnBot is more effective in handling penetra-\\ntion testing tasks, particularly in the Access Control (AC)\\nand Real-world categories. Notably, VulnBot-Llama3.3-70B\\nalso demonstrates competitive performance, particularly in\\nNetwork Security (33.33%) and Real-world tasks (18.18%),\\noutperforming both the base Llama3.3-70B and PentestGPT-\\nLlama3.3-70B models. The superior performance of VulnBot\\ncan be attributed to its advanced task decomposition, role\\nspecialization, and inter-agent communication mechanisms,\\nwhich enable it to handle complex, multi-step penetration\\n9'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 10, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='Category GPT-4o Llama3.3-70B (Our) Llama3.1-405B (Our) Llama3.3-70B (Base) Llama3.1-405B (Base) Llama3.3-70B (PentestGPT) Llama3.1-405B (PentestGPT)\\nAC 1 (20.00%) 1 (20.00%) 3 (60.00%) 0 (0.00%) 0 (0.00%) 0 (0.00%) 1 (20.00%)\\nWS 2 (28.57%) 1 (14.29%) 2 (28.57%) 0 (0.00%) 1 (14.29%) 0 (0.00%) 0 (0.00%)\\nNS 3 (50.00%) 2 (33.33%) 2 (33.33%) 2 (33.33%) 2 (33.33%) 2 (33.33%) 2 (33.33%)\\nCRPT 0 (0.00%) 0 (0.00%) 0 (0.00%) 0 (0.00%) 0 (0.00%) 0 (0.00%) 0 (0.00%)\\nReal-world 1 (9.09%) 2 (18.18%) 3 (27.27%) 0 (0.00%) 0 (0.00%) 0 (0.00%) 0 (0.00%)\\nALL 7 (21.21%) 6 (18.18%) 10 (30.30%) 2 (6.06%) 3 (9.09%) 2 (6.06%) 3 (9.09%)\\nTable 2: The performance of GPT-4o, Llama3.1-405B, and Llama3.3-70B on overall target completion\\nCategory Llama3.3-70B (Our) Llama3.1-405B (Our) Llama3.3-70B (Base) Llama3.1- 405B (Base) Llama3.3-70B (PentestGPT) Llama3.1-405B (PentestGPT)\\n1 Experiment (Total Subtasks: 210)\\nAC 25 (11.90%) 31 (14.76%) 16 (7.62%) 21 (10.00%) 10 (4.76%) 20 (9.52%)\\nWS 24 (11.43%) 30 (14.29%) 22 (10.48%) 26 (12.38%) 20 (9.52%) 18 (8.57%)\\nNS 12 (5.71%) 11 (5.24%) 10 (4.76%) 9 (4.29%) 9 (4.29%) 6 (2.86%)\\nCRPT 15 (7.14%) 18 (8.57%) 17 (8.10%) 18 (8.57%) 8 (3.81%) 12 (5.71%)\\nReal-world 49 (23.33%) 55 (26.19%) 29 (13.81%) 29 (13.81%) 26 (12.38%) 28 (13.33%)\\nALL 125 (59.52%) 145 (69.05%) 94 (44.76%) 103 (49.05%) 73 (34.76%) 84 (40.00%)\\n5 Experiments (Total Subtasks: 1050)\\nAC 87 (8.29%) 107 (10.19%) 46 (4.38%) 61 (5.81%) 32 (3.05%) 27 (2.57%)\\nWS 106 (10.10%) 116 (11.05%) 83 (7.90%) 66 (6.29%) 60 (5.71%) 40 (3.81%)\\nNS 41 (3.90%) 40 (3.81%) 36 (3.43%) 22 (2.10%) 27 (2.57%) 15 (1.43%)\\nCRPT 65 (6.19%) 75 (7.14%) 68 (6.48%) 44 (4.19%) 18 (1.71%) 43 (4.10%)\\nReal-world 166 (15.81%) 186 (17.71%) 99 (9.43%) 67 (6.38%) 102 (9.71%) 56 (5.33%)\\nALL 465 (44.29%) 524 (49.90%) 332 (31.62%) 260 (24.76%) 239 (22.76%) 181 (17.24%)\\nTable 3: The performance of Llama3.1-405B, and Llama3.3-70B on subtask completion\\ntesting workflows more effectively.\\nThe subtask completion rates for both single and multi-\\nple experiments are presented in Table 3. As shown, both\\nVulnBot-Llama models outperform their baseline counter-\\nparts. The Llama3.1-405B model achieves a 69.05% comple-\\ntion rate in the single experiment setting and 49.90% in the\\naggregated five-experiment setting. In contrast, the baseline\\nLlama3.1-405B model achieves 49.05% and 24.76% in the\\nsingle and five-experiment settings, respectively.\\nReconnaissance Scanning Exploitation Finish\\n0\\n20\\n40\\n60\\n80\\n100\\n41\\n56 58\\n1010\\n49\\n93\\n13\\n71\\n47\\n40\\n79\\n32\\n105\\n19\\nBase-Llama3.3-70B\\nVulnBot-Llama3.3-70B\\nBase-Llama3.1-405B\\nVulnBot-Llama3.1-405B\\nFigure 4: The failure counts of VulnBot and baseline mod-\\nels across the Reconnaissance, Scanning, and Exploitation\\nphases.\\nFurthermore, Figure 4 highlights the failure counts per\\nstage in penetration testing for various models. 
In the Re-\\nconnaissance and Scanning phases, VulnBot-Llama3.1-405B\\nconsistently demonstrates the fewest errors, with 9 and 32 fail-\\nures respectively, outperforming other models. The significant\\nreduction in failures, particularly during the Reconnaissance\\nphase, suggests that Llama3.1-405B allows for a smoother\\nprogression through the early stages of penetration testing.\\nThis advantage effectively pushes the testing process forward,\\nenabling subsequent stages to be approached with a more\\naccurate understanding of the system, which could lead to a\\nmore comprehensive and efficient exploitation process. The\\nsuperior performance of VulnBot-Llama3.1-405B is further\\nevidenced by the higher number of tasks reaching the Finish\\nstage, with 19 successful completions compared to 7 for the\\nbaseline Llama3.1-405B model. This substantial improve-\\nment in the Finish rate underscores the effectiveness of our\\nframework in driving the penetration testing process closer\\nto completion. By reducing errors in the early stages and\\nensuring a more accurate and efficient progression through\\nthe workflow, VulnBot increases the likelihood of success-\\nfully concluding the testing process. However, challenges\\npersist in the Exploitation phase, where VulnBot exhibits\\nhigher failure rates compared to other phases. Specifically,\\nVulnBot-Llama3.3-70B experiences 93 failed tasks, while\\nVulnBot-Llama3.1-405B encounters 105 failed tasks in this\\nphase. This discrepancy underscores the inherent complexity\\n10'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 11, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='of the Exploitation phase and suggests that further refine-\\nment is necessary to address the intricacies of this critical\\nstage. Nevertheless, by strategically delaying the automation\\nof penetration testing to later stages, VulnBot ensures that\\ncritical subtasks are executed with greater precision, thereby\\nincreasing the likelihood of completing the testing process.\\n5.2 Ablation Study (RQ2)\\nIn this section, we evaluate the impact of key architectural\\ncomponents by conducting ablation experiments on AU-\\nTOPENBENCH Real-world tasks. The experiments focus on\\nthe Llama3.1-405B model within a 128k token context. We\\nimplement three variants of VulnBot to isolate the contribu-\\ntions of its core components: (1) VulnBot-without Role: The\\nrole specialization mechanism is deactivated, causing agents\\nto operate without distinct roles. (2) VulnBot-without PTG:\\nThe Penetration Task Graph (PTG) is removed, eliminating\\nthe structured task planning and dependency management.\\n(3) VulnBot-without Summarizer: The Summarizer module is\\ndisabled, preventing inter-agent communication and context\\nsummarization.\\nSubtask Overall\\n0\\n10\\n20\\n30\\n40\\n50\\n55\\n3\\n32\\n0\\n37\\n0\\n27\\n0\\nVulnBot\\nVulnBot-without Role\\nVulnBot-without PTG\\nVulnBot-without Summarizer\\nFigure 5: Ablation study of VulnBot on AUTOPENBENCH.\\nThis figure demonstrates the impact of removing key compo-\\nnents—role specialization, the PTG, and the Summarizer—on\\nmodel performance.\\nFigure 5 illustrates the performance degradation observed\\nwhen these essential components are removed. Our findings\\nreveal that each component plays a critical role in enhancing\\nmodel performance. Specifically, the removal of role special-\\nization results in a significant decline in performance, with\\nsubtask success rates dropping from 55 to 32. Similarly, omit-\\nting the PTG leads to a reduction in the subtask success rate,\\ndecreasing to 37. The most substantial performance decline\\noccurs when the Summarizer is removed, reducing the subtask\\nsuccess rate to just 27. Furthermore, the overall task success\\nrate is entirely eliminated when any of these components are\\nremoved. These results underscore the critical importance of\\nrole specialization, PTG, and the Summarizer in achieving\\nhigh performance on penetration testing tasks. The ablation\\nstudy highlights that the synergistic interaction of these com-\\nponents is vital for the model’s success in both subtasks and\\noverall task completion. 
This finding aligns with the broader\\ntrend in multi-agent systems, where effective role allocation,\\ntask planning, and communication are essential for complex,\\nreal-world applications.\\n5.3 Effectiveness for Real-World (RQ3)\\nTo evaluate the practical applicability of our models, we con-\\nducted five rounds of experiments on a selection of real-world\\ntargets from the AI-Pentest-Benchmark, which includes 13\\nvulnerable machines. We selected six machines for this eval-\\nuation, focusing on penetration tasks that did not involve\\nimage observation or human intervention. The experiments\\nwere conducted using two models: Llama3.1-405B with a\\n128k context and DeepSeek-v3 with a 64k context. The task\\ncompletion rates were calculated based on the successful com-\\npletion of subtasks as defined by the AI-Pentest-Benchmark.\\nFor each machine, the reported completion rate represents the\\nbest performance achieved across the five experimental runs.\\nFigure 6 illustrates the subtask completion rates across these\\nmachines, with a value of 1 indicating a successful penetration.\\nThe results demonstrate that VulnBot-Llama3.1-405B con-\\nsistently outperforms its counterparts, achieving the highest\\ncompletion rates on Victim1 (0.33), Library2 (0.40), and West-\\nWild (0.57). Similarly, VulnBot-DeepSeek-v3 demonstrated\\ncompetitive performance, with completion rates of 0.83 on\\nVictim1 and 0.71 on WestWild. These findings highlight\\nVulnBot’s superior capability in handling complex, multi-\\nstep attack chains, which are critical in real-world penetration\\ntesting scenarios. The consistent performance of VulnBot\\nacross diverse machines underscores its robustness and adapt-\\nability, making it a reliable tool for practical cybersecurity\\napplications.\\nVictim1 Library2 Sar WestWild Symfonos2 Funbox\\n0.0\\n0.1\\n0.2\\n0.3\\n0.4\\n0.5\\n0.6\\n0.7\\n0.8\\n0.33\\n0.40\\n0.27\\n0.57\\n0.29\\n0.33\\n0.17\\n0.20\\n0.09\\n0.14 0.14\\n0.22\\n0.00\\n0.20\\n0.27\\n0.14 0.14\\n0.11\\n0.83\\n0.20\\n0.36\\n0.71\\n0.29\\n0.44\\n0.17\\n0.20\\n0.27\\n0.57\\n0.21 0.22\\n0.50\\n0.20\\n0.27\\n0.57\\n0.21\\n0.44\\nVulnBot-Llama3.1-405B\\nPentestGPT-Llama3.1-405B\\nBase-Llama3.1-405B\\nVulnBot-DeepSeek-v3\\nPentestGPT-DeepSeek-v3\\nBase-DeepSeek-v3\\nFigure 6: The performance of VulnBot over the real-world\\nmachines.\\n11'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 12, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='5.4 Retrieval Augmented Generation (RQ4)\\nTo further investigate whether prior penetration knowledge\\ncan enhance the performance of our framework, we integrated\\nthe Memory Retriever module into the Llama3.1-405B model,\\nwhich supports a 128k context window. This integration lever-\\nages RAG to improve the model’s contextual understanding\\nand task-specific optimization. In this experiment, we eval-\\nuated the performance of three distinct systems: Llama3.1-\\n405B with RAG, GPT-4o with Manual, and Llama3.1-405B\\nwith Manual. The data for GPT-4o and Llama3.1 with Manual\\nwere obtained from [33], where human operators utilized Pen-\\ntestGPT tools. To augment the contextual knowledge of native\\nLLMs, we incorporated content from cybersecurity resources\\nsuch as HackTricks [26] and HackingArticles [6]. This con-\\ntent was segmented into 750-word chunks, and the resulting\\nembeddings were stored in the Milvus vector database [40] for\\nefficient retrieval. This approach enables the system to dynam-\\nically retrieve relevant historical data and prior knowledge,\\nthereby mitigating the hallucination problem often encoun-\\ntered with LLMs.\\nFigure 7 illustrates the task completion rates of these mod-\\nels across six real-world machines. The results demonstrate\\nthat integrating the Memory Retriever module significantly\\nenhances performance on specific machines, particularly Vic-\\ntim1 and WestWild. Notably, VulnBot successfully executed\\nan end-to-end penetration of the WestWild machine, showcas-\\ning its ability to complete complex tasks autonomously. These\\nfindings highlight the advantages of retrieval-augmented ap-\\nproaches in improving the contextual understanding and task-\\nspecific optimization of penetration testing models. The inte-\\ngration of the Memory Retriever module not only enhances\\nthe model’s ability to retrieve and utilize relevant information\\nbut also improves its overall performance in real-world pene-\\ntration testing scenarios, achieving performance comparable\\nto or even surpassing that of human operators.\\nVictim1 Library2 Sar WestWild Symfonos2 Funbox\\n0.0\\n0.2\\n0.4\\n0.6\\n0.8\\n1.0\\n0.83\\n0.60\\n0.55\\n1.00\\n0.29\\n0.56\\n0.33\\n0.50\\n0.55\\n0.57\\n0.43\\n0.33\\n0.67\\n0.80\\n0.73\\n0.57\\n0.43\\n0.56\\nLlama3.1-405B-with RAG\\nGPT-4o-with Manual\\nLlama3.1-405B-with Manual\\nFigure 7: Performance comparison of VulnBot with Memory\\nRetriever module\\n6 Discussion\\nThe results obtained highlight VulnBot’s potential for effi-\\ncient vulnerability detection and exploitation. However, our\\nfindings also reveal several challenges and areas for future\\nresearch that need to be addressed to enhance its capabilities\\nfurther.\\n6.1 Limitations in Processing Non-Textual In-\\nformation\\nA significant limitation of VulnBot is its inability to process\\nnon-textual information, such as images or graphical inter-\\nfaces generated by penetration testing tools. 
In real-world\\npenetration testing scenarios, such is often critical for under-\\nstanding attack surfaces and interpreting the results of security\\nscans. Currently, VulnBot depends on manual descriptions\\nto interpret these non-textual elements, which introduces a\\nbottleneck in achieving full automation of the penetration test-\\ning process. Future iterations of VulnBot could address this\\nlimitation by incorporating image recognition and processing\\ncapabilities. This enhancement would enable the system to\\nanalyze and extract relevant information from screenshots and\\nother graphical representations.\\n6.2 Real-World Performance and Challenges\\nThe real-world tasks in the AUTOPENBENCH include two\\nCVEs from 2024. VulnBot completed one of these tasks using\\nLlama3.3 and Llama3.1, despite both models having a knowl-\\nedge cutoff in December 2023. This achievement highlights\\nthe reliability of our method, as it does not rely on prior knowl-\\nedge of the vulnerabilities. Despite these promising results in\\nsimulated environments, completing end-to-end penetration\\ntesting on real-world machines remains a significant challenge\\nfor VulnBot. Even with RAG to enhance the model’s con-\\ntextual knowledge and task-specific optimizations, VulnBot\\nstill faces difficulties in achieving full autonomy and success\\nacross all stages of a real-world penetration test. These chal-\\nlenges stem from the complexity of real-world systems, the\\ndynamic nature of security vulnerabilities, and the need for\\nprecise execution of multi-step attack chains.\\n7 Related Work\\n7.1 Vulnerability Detection and Exploitation\\nAtropos introduces a novel fuzzing technique for detecting\\nserver-side vulnerabilities in PHP-based web applications.\\nIt utilizes snapshot-based, feedback-driven fuzzing, which\\nis integrated directly with the PHP interpreter [25]. Simi-\\nlarly, NAUTILUS focuses on identifying vulnerabilities in\\nRESTful APIs through guided testing and parameter gener-\\nation strategies, emphasizing complex API interactions and\\n12'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 13, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='corner cases [15]. Furthermore, empirical studies, such as\\n\"Understanding Hackers’ Work\", provide insights into the op-\\nerational methods and challenges faced by offensive security\\npractitioners, underscoring the need for improved tooling [28].\\nRecent advancements in large language models (LLMs)\\nhave shown significant promise in vulnerability management.\\nLiu et al. explore the use of ChatGPT for handling complex\\ncybersecurity tasks, particularly bug report analysis. They\\nintroduce a self-heuristic prompt template that enhances Chat-\\nGPT’s performance by summarizing domain knowledge from\\nprovided examples. This approach enables the model to learn\\ntask-specific characteristics, resulting in improved perfor-\\nmance in bug report summarization through few-shot learning\\ncompared to zero-shot or general information prompts [36].\\nFang et al. further investigate LLMs’ potential in exploiting\\none-day vulnerabilities, demonstrating that GPT-4 can au-\\ntonomously exploit 87% of a benchmark set of real-world\\nvulnerabilities when provided with CVE descriptions. This re-\\nsearch highlights the emergent capabilities of advanced LLMs\\nin cybersecurity, while also raising concerns about their re-\\nsponsible deployment [19].\\n7.2 Automated Penetration Testing\\nRecent research has explored integrating LLMs into pene-\\ntration testing workflows, significantly enhancing automa-\\ntion and efficiency. Al-Sinani and Mitchel investigate the\\nuse of GPT-4 across various ethical hacking phases, includ-\\ning reconnaissance and post-exploitation [3]. Tools like AU-\\nTOATTACKER [56] and BreachSeek [4] leverage LLMs to\\nautomate post-breach activities and simulate cyberattacks,\\nrespectively, while CIPHER [48] specializes in assisting eth-\\nical researchers through structured augmentation methods.\\nFrameworks such as PTGroup [55] and HPTSA [20] demon-\\nstrate the potential of multi-agent systems and hierarchical\\nplanning in exploiting zero-day vulnerabilities. Furthermore,\\nHackSynth [41] and Pentest Copilot [23] highlight the role of\\ncrafted prompts and LLM integration in automating penetra-\\ntion testing sub-tasks.\\nHappe and Cito further explore the application of LLMs in\\npenetration testing, presenting use cases for both high-level\\ntask planning and low-level vulnerability hunting. Their work\\nimplements a feedback loop where LLM-generated actions\\nare executed on a virtual machine via SSH, demonstrating the\\npotential of LLMs to automate parts of penetration testing\\nwhile raising ethical concerns about misuse [27]. Happe et\\nal. also evaluated the ability of different LLMs to execute\\nprivilege escalation attacks in the simulation environment by\\ndeveloping a fully automatic tool Wintermute. The results\\nshow that GPT-4-turbo has achieved a remarkable success\\nrate with the assistance of sufficient context information and\\nstate mechanism [29]. 
Huang and Zhu introduce PenHeal,\\na two-stage LLM framework for automated penetration and\\noptimal remediation. Their system incorporates components\\nlike Planner, Executor, Estimator, Advisor, and Evaluator to\\nstreamline the process, using counterfactual prompting and a\\nGroup Knapsack Algorithm to prioritize effective and cost-\\nefficient remediations [31].\\nHowever, these approaches face limitations, including de-\\npendency on detailed vulnerability descriptions (e.g., CVE\\ndata) for effective exploitation, instability and variability in\\nperformance across different tasks and environments, and the\\nneed for human intervention in complex or end-to-end pene-\\ntration testing scenarios. Additionally, many systems struggle\\nwith long-term planning and adaptability in dynamic environ-\\nments, and while assisted or multi-agent approaches improve\\nsuccess rates, fully autonomous agents still face challenges in\\nachieving consistent and reliable results.\\n7.3 Application of LLM in Cybersecurity\\nBeyond penetration testing, LLMs are increasingly applied to\\na broader range of cybersecurity tasks. Cycle enhances code\\ngeneration capabilities through iterative self-refinement [16],\\nwhile Guan et al. leverage LLMs to detect model optimization\\nbugs in deep learning libraries [24]. Fang et al. demonstrate\\nthe ability of LLMs to exploit recently disclosed vulnerabili-\\nties with high success rates. Tools like SecurityBot [57] inte-\\ngrate LLMs with reinforcement learning to improve cyberse-\\ncurity operations, and AURORA automates the orchestration\\nof APT attack campaigns [52]. Additionally, PTHelper [12]\\nstreamlines penetration testing by integrating AI with state-\\nof-the-art tools. These applications illustrate the versatility of\\nLLMs in addressing diverse cybersecurity challenges.\\n8 Conclusion\\nIn this paper, we present VulnBot, an autonomous penetra-\\ntion testing framework that LLMs and multi-agent systems.\\nVulnBot is designed to emulate the collaborative workflows\\nof human penetration testing teams, thereby addressing the\\ninefficiencies and manual dependencies inherent in traditional\\npenetration testing methodologies. By decomposing complex\\ntasks into specialized phases—reconnaissance, scanning, and\\nexploitation—and utilizing a PTG to ensure logical task ex-\\necution, VulnBot demonstrates significant advancements in\\nautomating penetration testing workflows.\\nOur experimental results underscore VulnBot’s superior\\nperformance relative to baseline models such as GPT-4 and\\nLlama3. Incorporating RAG further augments VulnBot’s ca-\\npabilities, enabling it to execute end-to-end penetration tasks\\nautonomously without human intervention. These findings\\nhighlight the potential of VulnBot to revolutionize the field of\\npenetration testing by enhancing efficiency, scalability, and\\nautonomy.\\n13'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 14, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='References\\n[1] U.s. department of the interior, 2024. https://www.\\ndoi.gov/ocio/customers/penetration-testing.\\n[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama\\nAhmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo\\nAlmeida, Janko Altenschmidt, Sam Altman, Shyamal\\nAnadkat, et al. Gpt-4 technical report. arXiv preprint\\narXiv:2303.08774, 2023.\\n[3] Haitham S Al-Sinani and Chris J Mitchell. Ai-\\naugmented ethical hacking: A practical examination of\\nmanual exploitation and privilege escalation in linux\\nenvironments. arXiv preprint arXiv:2411.17539, 2024.\\n[4] Ibrahim Alshehri, Adnan Alshehri, Abdulrahman Al-\\nmalki, Majed Bamardouf, and Alaqsa Akbar. Breach-\\nseek: A multi-agent automated penetration tester. arXiv\\npreprint arXiv:2409.03789, 2024.\\n[5] Brad Arkin, Scott Stender, and Gary McGraw. Software\\npenetration testing. IEEE Security & Privacy, 3(1):84–\\n87, 2005.\\n[6] Hacking Articles. Hacking articles, 2024. https://\\nwww.hackingarticles.in/.\\n[7] Saumick Basu. 7 penetration testing phases explained:\\nUltimate guide, 2024. https://www.strikegraph.\\ncom/blog/pen-testing-phases-steps .\\n[8] Rob Behnke. 5 phases of ethical hacking,\\n2021. https://www.halborn.com/blog/post/\\n5-phases-of-ethical-hacking .\\n[9] Matt Bishop. About penetration testing. IEEE Security\\n& Privacy, 5(6):84–87, 2007.\\n[10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie\\nSubbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind\\nNeelakantan, Pranav Shyam, Girish Sastry, Amanda\\nAskell, et al. Language models are few-shot learn-\\ners. Advances in neural information processing systems,\\n33:1877–1901, 2020.\\n[11] The Cyphere. Penetration testing statistics, vulnerabil-\\nities and trends in 2024, 2024. https://thecyphere.\\ncom/blog/penetration-testing-statistics/.\\n[12] Jacobo Casado de Gracia and Alfonso Sánchez-Macián.\\nPthelper: An open source tool to support the penetration\\ntesting process. arXiv preprint arXiv:2406.08242, 2024.\\n[13] Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying\\nZhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and\\nYang Liu. Masterkey: Automated jailbreaking of large\\nlanguage model chatbots. In Proc. ISOC NDSS, 2024.\\n[14] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu,\\nYuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin\\nPinzger, and Stefan Rass. {PentestGPT}: Evaluating\\nand harnessing large language models for automated\\npenetration testing. In 33rd USENIX Security Sympo-\\nsium (USENIX Security 24), pages 847–864, 2024.\\n[15] Gelei Deng, Zhiyi Zhang, Yuekang Li, Yi Liu, Tian-\\nwei Zhang, Yang Liu, Guo Yu, and Dongjin Wang.\\n{NAUTILUS}: Automated {RESTful}{API} vulner-\\nability detection. In 32nd USENIX Security Symposium\\n(USENIX Security 23), pages 5593–5609, 2023.\\n[16] Yangruibo Ding, Marcus J Min, Gail Kaiser, and\\nBaishakhi Ray. Cycle: Learning to self-refine the code\\ngeneration. 
Proceedings of the ACM on Programming\\nLanguages, 8(OOPSLA1):392–418, 2024.\\n[17] Dirb. Dirb, 2024. https://dirb.sourceforge.net/.\\n[18] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey,\\nAbhishek Kadian, Ahmad Al-Dahle, Aiesha Letman,\\nAkhil Mathur, Alan Schelten, Amy Yang, Angela Fan,\\net al. The llama 3 herd of models. arXiv preprint\\narXiv:2407.21783, 2024.\\n[19] Richard Fang, Rohan Bindu, Akul Gupta, and Daniel\\nKang. Llm agents can autonomously exploit one-day\\nvulnerabilities. arXiv preprint arXiv:2404.08144, 2024.\\n[20] Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan,\\nand Daniel Kang. Teams of llm agents can exploit zero-\\nday vulnerabilities. arXiv preprint arXiv:2406.01637,\\n2024.\\n[21] Areej Fatima, Tahir Abbas Khan, Tamer Mohamed Ab-\\ndellatif, Sidra Zulfiqar, Muhammad Asif, Waseem Safi,\\nHussam Al Hamadi, and Amer Hani Al-Kassem. Im-\\npact and research challenges of penetrating testing and\\nvulnerability assessment on network threat. In 2023 In-\\nternational Conference on Business Analytics for Tech-\\nnology and Security (ICBATS), pages 1–8. IEEE, 2023.\\n[22] Luca Gioacchini, Marco Mellia, Idilio Drago, Alexan-\\nder Delsanto, Giuseppe Siracusano, and Roberto Bifulco.\\nAutopenbench: Benchmarking generative agents for pen-\\netration testing. arXiv preprint arXiv:2410.03225, 2024.\\n[23] Dhruva Goyal, Sitaraman Subramanian, and Aditya\\nPeela. Hacking, the lazy way: Llm augmented pen-\\ntesting. arXiv preprint arXiv:2409.09493, 2024.\\n[24] Hao Guan, Guangdong Bai, and Yepang Liu. Large lan-\\nguage models can connect the dots: Exploring model op-\\ntimization bugs with domain knowledge-aware prompts.\\nIn Proceedings of the 33rd ACM SIGSOFT International\\nSymposium on Software Testing and Analysis, pages\\n1579–1591, 2024.\\n14'),\n",
" Document(metadata={'file_path': 'more_arxiv/2501.13411v1.VulnBot__Autonomous_Penetration_Testing_for_A_Multi_Agent_Collaborative_Framework.pdf', 'page_number': 15, 'total_pages': 17, 'entry_id': 'http://arxiv.org/abs/2501.13411v1', 'title': 'VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework', 'authors': ['He Kong', 'Die Hu', 'Jingguo Ge', 'Liangxiong Li', 'Tong Li', 'Bingzhen Wu'], 'published': '2025-01-23', 'updated': '2025-01-23', 'primary_category': 'cs.SE', 'categories': ['cs.SE'], 'pdf_url': 'http://arxiv.org/pdf/2501.13411v1'}, text='[25] Emre Güler, Sergej Schumilo, Moritz Schloegel, Nils\\nBars, Philipp Görz, Xinyi Xu, Cemal Kaygusuz, and\\nThorsten Holz. Atropos: Effective fuzzing of web ap-\\nplications for server-side vulnerabilities. In USENIX\\nSecurity Symposium, 2024.\\n[26] Hacktricks. Hacktricks, 2024. https://book.\\nhacktricks.wiki/en/index.html.\\n[27] Andreas Happe and Jürgen Cito. Getting pwn’d by\\nai: Penetration testing with large language models. In\\nProceedings of the 31st ACM Joint European Software\\nEngineering Conference and Symposium on the Founda-\\ntions of Software Engineering, pages 2082–2086, 2023.\\n[28] Andreas Happe and Jürgen Cito. Understanding hackers’\\nwork: An empirical study of offensive security practi-\\ntioners. In Proceedings of the 31st ACM Joint European\\nSoftware Engineering Conference and Symposium on\\nthe Foundations of Software Engineering, pages 1669–\\n1680, 2023.\\n[29] Andreas Happe, Aaron Kaplan, and Juergen Cito. Llms\\nas hackers: Autonomous linux privilege escalation at-\\ntacks. arXiv preprint arXiv:2310.11409, 2024.\\n[30] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng\\nCheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven\\nKa Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt:\\nMeta programming for multi-agent collaborative frame-\\nwork. arXiv preprint arXiv:2308.00352, 2023.\\n[31] Junjie Huang and Quanyan Zhu. Penheal: A two-stage\\nllm framework for automated pentesting and optimal\\nremediation. In Proceedings of the Workshop on Au-\\ntonomous Cybersecurity, pages 11–22, 2023.\\n[32] Hydra. Hydra is a game launcher with its own em-\\nbedded bittorrent client, 2024. https://github.com/\\nhydralauncher/hydra.\\n[33] Isamu Isozaki, Manil Shrestha, Rick Console, and Ed-\\nward Kim. Towards automated penetration testing: In-\\ntroducing llm benchmark, analysis, and improvements.\\narXiv preprint arXiv:2410.17141, 2024.\\n[34] Kali. The most advanced penetration testing distribution,\\n2024. https://www.kali.org/.\\n[35] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao\\nWu, Chengda Lu, Chenggang Zhao, Chengqi Deng,\\nChenyu Zhang, Chong Ruan, et al. Deepseek-v3 techni-\\ncal report. arXiv preprint arXiv:2412.19437, 2024.\\n[36] Peiyu Liu, Junming Liu, Lirong Fu, Kangjie Lu, Yifan\\nXia, Xuhong Zhang, Wenzhi Chen, Haiqin Weng, Shoul-\\ning Ji, and Wenhai Wang. Exploring{ChatGPT’s} capa-\\nbilities on vulnerability management. In 33rd USENIX\\nSecurity Symposium (USENIX Security 24), pages 811–\\n828, 2024.\\n[37] Qian Liu, Jinke Song, Zhiguo Huang, Yuxuan Zhang,\\nGlide-The, and Liunux4odoo. langchain-chatchat,\\n2013. https://github.com/chatchat-space/\\nLangchain-Chatchat.\\n[38] Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen\\nZheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kai-\\nlong Wang, and Yang Liu. Jailbreaking chatgpt via\\nprompt engineering: An empirical study. arXiv preprint\\narXiv:2305.13860, 2023.\\n[39] Metasploit. 
Metasploit | penetration testing software,\\npen testing security | metasploit, 2024. https://www.\\nmetasploit.com/.\\n[40] Milvus. Milvus, 2024. https://milvus.io/docs/\\nzh/quickstart.md.\\n[41] Lajos Muzsai, David Imolai, and András Lukács.\\nHacksynth: Llm agent and evaluation framework\\nfor autonomous penetration testing. arXiv preprint\\narXiv:2412.01778, 2024.\\n[42] Inc. NetEase Youdao. Bcembedding: Bilingual and\\ncrosslingual embedding for rag. https://github.\\ncom/netease-youdao/BCEmbedding, 2023.\\n[43] Nikto. Nikto web server scanner, 2024. https://\\ngithub.com/sullo/nikto.\\n[44] Nmap. Nmap: The network mapper - free security scan-\\nner, 2024. https://nmap.org/.\\n[45] OWASP. Owasp-testing-guide, 2013. https://\\ngithub.com/OWASP/owasp-testing-guide.\\n[46] Nivedita James Palatty. 83 penetration testing\\nstatistics: Key facts and figures, 2024. https:\\n//www.getastra.com/blog/security-audit/\\npenetration-testing-statistics/.\\n[47] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Mered-\\nith Ringel Morris, Percy Liang, and Michael S Bernstein.\\nGenerative agents: Interactive simulacra of human be-\\nhavior. In Proceedings of the 36th annual acm sympo-\\nsium on user interface software and technology, pages\\n1–22, 2023.\\n[48] Derry Pratama, Naufal Suryanto, Andro Aprila Adiputra,\\nThi-Thu-Huong Le, Ahmada Yusril Kadiptya, Muham-\\nmad Iqbal, and Howon Kim. Cipher: Cybersecurity in-\\ntelligent penetration-testing helper for ethical researcher.\\nSensors, 24(21):6878, 2024.\\n15')]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs_read[0:15]"
]
},
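  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working with the Page-Level Documents\n",
    "\n",
    "Because each page was stored as a separate `Document`, we can group the pages back into their source papers or run a simple keyword scan across everything we downloaded. The cell below is a minimal sketch: it reuses the `docs_read` list built above and assumes, as the printed output suggests, that each `Document` exposes a `metadata` dictionary and a `text` string. The keyword and the grouping logic are illustrative choices, not part of `dapr-agents`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "# Group the page-level documents by the paper title stored in their metadata.\n",
    "pages_by_title = defaultdict(list)\n",
    "for doc in docs_read:\n",
    "    pages_by_title[doc.metadata[\"title\"]].append(doc)\n",
    "\n",
    "for title, pages in pages_by_title.items():\n",
    "    print(f\"{title}: {len(pages)} pages\")\n",
    "\n",
    "# Simple keyword scan across all extracted pages (illustrative example).\n",
    "keyword = \"penetration testing\"\n",
    "matches = [\n",
    "    (doc.metadata[\"title\"], doc.metadata[\"page_number\"])\n",
    "    for doc in docs_read\n",
    "    if keyword.lower() in doc.text.lower()\n",
    "]\n",
    "\n",
    "print(f\"\\nPages mentioning '{keyword}':\")\n",
    "for title, page_number in matches[:10]:\n",
    "    print(f\"- page {page_number} of {title}\")"
   ]
  },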
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}