ScrapeGraphAI: LLM-Powered Web Scraping, a New Kind of Scraper for Python Developers

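Today's pick is ScrapeGraphAI, a Python library for intelligent web scraping. It uses large language models (LLMs) to work out a page's structure on its own: you describe what you want to extract, and the library handles the scraping. The project has passed 23,000 GitHub stars and is updated frequently.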

Project Overview

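ScrapeGraphAI is a Python web scraping library that replaces the usual scraper development workflow. A traditional scraper requires you to analyze the page structure by hand and write CSS selectors or XPath expressions. With ScrapeGraphAI you describe the information you want in natural language, and the LLM analyzes the page and extracts the data.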
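For contrast, here is a minimal sketch of the hand-written approach described above, using only Python's standard `html.parser`. The HTML snippet, the `product`, `name`, and `price` class names, and the `ProductParser` and `parse_products` names are all made up for illustration. A real page needs its own analysis, and a parser like this stops working as soon as the markup changes.

```python
from html.parser import HTMLParser

# Hypothetical markup; a real page needs its own structure analysis.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">19.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects name/price pairs by tracking hard-coded class names."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # field that the current text node belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})  # start a new product record
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a tracked <span>.
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_products(html):
    parser = ProductParser()
    parser.feed(html)
    return parser.products


if __name__ == "__main__":
    print(parse_products(SAMPLE_HTML))
```

Every class name here is knowledge the developer had to dig out of the page source, which is the step a natural-language prompt is meant to replace.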

GitHub: https://github.com/VinciGit00/Scrapegraph-ai
Stars: 23284+ | Language: Python | License: MIT

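Key Features

1. Natural-language driven: describe the task in plain language, for example "extract all product prices and names", and the LLM analyzes the page and extracts the data.

2. Multiple pipelines: SmartScraperGraph (single-page scraping), SearchGraph (scraping across multiple pages returned by a search engine), SpeechGraph (scrape a page and generate audio), ScriptCreatorGraph (scrape a page and generate a Python script), and others.

3. Multiple LLM backends: works with commercial APIs such as OpenAI GPT, Claude, Gemini, and MiniMax, and with local models (such as Llama3) through Ollama.

4. LangChain/LlamaIndex integration: fits into the LangChain and LlamaIndex ecosystems.

5. MCP support: ships an MCP server, so it can be plugged into AI agent frameworks directly.

6. Local documents: besides web pages, it can extract data from local HTML, XML, JSON, and Markdown documents.

7. Parallel processing: the multi variants of the pipelines issue their LLM calls in parallel, which speeds up scraping when there are several sources.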

Installation

Option 1: install with pip (recommended)

pip install scrapegraphai

# Install the Playwright browsers (used to fetch dynamically rendered pages)
playwright install

Option 2: install into a virtual environment

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install scrapegraphai
playwright install

Option 3: install from source

git clone https://github.com/VinciGit00/Scrapegraph-ai.git
cd Scrapegraph-ai
pip install -e .
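
Whichever option you pick, it can be useful to confirm that the packages landed in the interpreter you are actually running. The following is a small sketch using the standard library's `importlib.metadata`; the `installed_version` helper is a hypothetical name, and the check only looks at Python package metadata, so it does not tell you whether `playwright install` has downloaded the browsers.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("scrapegraphai", "playwright"):
        version = installed_version(name)
        print(f"{name}: {version or 'not installed'}")
```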

Usage

Basic example: scraping a single page

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": True,
}

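# The prompt says what to extract; source is the page to scrape.
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract the name, price, and description of every product on the page",
    source="https://example.com/products",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)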
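
The example above hard-codes the API key, which is easy to leak through version control. Below is a minimal sketch of one alternative: building the same config dict from an environment variable and pretty-printing the result. The `OPENAI_API_KEY` variable name and the `build_graph_config` and `format_result` helpers are assumptions made for illustration, not part of ScrapeGraphAI, and `format_result` assumes that `run()` returns JSON-serializable data.

```python
import json
import os


def build_graph_config(env=None, model="openai/gpt-4o-mini"):
    """Build a config dict shaped like the example above.

    OPENAI_API_KEY is an assumed variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before running the scraper")
    return {
        "llm": {"api_key": api_key, "model": model},
        "verbose": True,
        "headless": True,
    }


def format_result(result):
    """Pretty-print a result, keeping non-ASCII text readable."""
    return json.dumps(result, indent=2, ensure_ascii=False)
```

The returned dict can be passed as `config=` in place of `graph_config`, and `print(format_result(result))` replaces the plain `print(result)`.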

Using a local model (Ollama)

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "model_tokens": 8192,
    },
    "verbose": True,
    "headless": False,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract the company name, founders, and social media links",
    source="https://scrapegraphai.com/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Multi-page search scraping (SearchGraph)

from scrapegraphai.graphs import SearchGraph

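# SearchGraph takes no source: it queries a search engine and scrapes the results.
search_graph = SearchGraph(
    prompt="Find the latest trends in artificial intelligence for 2026",
    config={
        "llm": {"model": "openai/gpt-4o-mini", "api_key": "YOUR_KEY"},
        "search_engine": "duckduckgo",
    }
)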

result = search_graph.run()
print(result)

Generating a scraper script (ScriptCreatorGraph)

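from scrapegraphai.graphs import ScriptCreatorGraph

# "library" names the scraping library the generated script should use.
script_config = {
    "llm": {"api_key": "YOUR_OPENAI_API_KEY", "model": "openai/gpt-4o-mini"},
    "library": "beautifulsoup",
}

script_graph = ScriptCreatorGraph(
    prompt="Create a scraper that extracts every news headline and publication date from the page",
    source="https://news.example.com",
    config=script_config
)

result = script_graph.run()
print(result)  # prints the generated, reusable Python script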
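
Since the generated script is LLM output, it is worth checking before you save or run it. The sketch below assumes the result is a string of Python source, possibly wrapped in a Markdown code fence. It strips the fence, verifies that the code parses with the standard `ast` module (nothing is executed), and writes it to disk. The `strip_code_fence` and `save_script` helpers are hypothetical names, not part of ScrapeGraphAI.

```python
import ast
from pathlib import Path

FENCE = "`" * 3  # Markdown code-fence marker


def strip_code_fence(text):
    """Drop a surrounding Markdown code fence if the model added one."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines) + "\n"


def save_script(source, path):
    """Write the script to disk only if it parses as valid Python."""
    code = strip_code_fence(source)
    ast.parse(code)  # raises SyntaxError on invalid code; nothing is executed
    target = Path(path)
    target.write_text(code, encoding="utf-8")
    return target
```

A call such as `save_script(result, "news_scraper.py")` would then store the script for review. A successful parse only rules out syntax errors, so read the code before running it.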