> ## Documentation Index
> Fetch the complete documentation index at: https://docs.gurubase.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Start Website Crawl

> Initiate a crawl to explore the URLs of a website

Starts an **asynchronous** website crawl operation to find all the sub URLs of the provided root URL.

You can check the status of created crawl operation using the [Get Crawl Status](/api-reference/endpoints/crawler-status) endpoint or stop it using the [Stop Crawl](/api-reference/endpoints/crawler-stop) endpoint.

Crawling does not mean that the discovered URLs are indexed immediately. You need to manually add the discovered URLs as a data source by passing them to [Create Data Source](/api-reference/endpoints/create-datasources) endpoint.

<Note>
  Crawls are rate limited to 1 concurrent operation per Guru type. Subsequent requests will fail if a crawl is already running.
</Note>

<Note>
  **Skipped Paths**: The crawler automatically skips non-content paths like `/feed/`, `/rss/`, `/static/`, `/assets/`, `/media/`, `/wp-admin/`, `/wp-json/`, `/_static/`, `/_sources/`, and common file extensions (images, PDFs, CSS, JS, etc.).
</Note>

## Path Parameters

<ParamField path="guru_slug" type="string" required>
  The slug of the Guru type to associate the crawled content with
</ParamField>

## Headers

<ParamField header="x-api-key" type="string" required>
  Your API key for authentication. You can obtain your API key from the [Gurubase dashboard](https://app.gurubase.io/api-keys).
</ParamField>

### Body Parameters

<ParamField body="url" type="string" required>
  The root URL to start crawling from. Must include http\:// or https\:// protocol.

  **Important**: The crawler only discovers URLs that start with this path. For example:

  * `https://example.com/` → crawls entire site
  * `https://example.com/docs/` → only crawls URLs under `/docs/`
</ParamField>

<ParamField body="ignore_query_params" type="boolean" default="true">
  When `true`, query parameters are stripped from discovered URLs (e.g., `?utm_source=...`, `?id=123`).
  Set to `false` if you need to crawl paginated content like `?page=1`, `?page=2`.
</ParamField>

### Response

<ResponseField name="id" type="string">
  Unique identifier for the crawl operation
</ResponseField>

<ResponseField name="url" type="string">
  The root URL to be crawled. The crawler will start from this URL and follow all links that begin with it. For example, if the URL is [https://example.com/a/b/c](https://example.com/a/b/c), the crawler will extract all links that start with [https://example.com/a/b/c](https://example.com/a/b/c).
</ResponseField>

<ResponseField name="status" type="string" enum={["RUNNING", "COMPLETED", "FAILED", "STOPPED"]}>
  Current status of the crawl operation
</ResponseField>

<ResponseField name="guru_type" type="string">
  The Guru type that the crawl was initiated for
</ResponseField>

<ResponseField name="discovered_urls" type="list">
  List of URLs discovered during crawling
</ResponseField>

<ResponseField name="start_time" type="string">
  Timestamp when crawl started (ISO 8601 format)
</ResponseField>

<ResponseField name="end_time" type="string">
  Timestamp when crawl ended (ISO 8601 format)
</ResponseField>

<ResponseField name="ignore_query_params" type="boolean">
  Whether query parameters are being stripped from discovered URLs
</ResponseField>

<ResponseExample>
  ```json 200 theme={null}
  {
      "id": 211,
      "url": "https://getanteon.com/",
      "status": "RUNNING",
      "guru_type": "anteon",
      "discovered_urls": [],
      "start_time": "2025-02-21T10:25:22.710211Z",
      "end_time": null,
      "link_limit": 1500,
      "ignore_query_params": true
  }
  ```

  ```json 400 theme={null}
  {
      "msg": "Invalid URL format"
  }
  ```

  ```json 429 theme={null}
  {
      "msg": "Request was throttled. Expected available in 56 seconds."
  }
  ```
</ResponseExample>
