Data Sources - Gurubase

Build comprehensive knowledge bases by combining content from websites, documents, code repositories, and enterprise platforms. Gurubase indexes your content to create AI assistants that understand your specific domain.

At a Glance

Source	What it covers
Website	Crawl or sitemap import
PDF	Documents & manuals
Excel	Spreadsheets & data
GitHub	Repositories & code
YouTube	Video transcripts
Text	Custom snippets
Zendesk	Tickets & articles
Confluence	Wiki & docs
Jira	Issues & projects
Slack	Conversations
Google Drive	Docs & files
Salesforce	Knowledge base

File-Based Sources

PDF Documents

Upload PDF files to index text and images. Perfect for documentation, research papers, and manuals.

Feature	Description
Text extraction	Full text content from all pages
Image indexing	Images within PDFs are also indexed
Batch upload	Upload multiple PDFs at once

Excel Files

Upload spreadsheets (.xls, .xlsx) to index tabular data with header relationships.

Feature	Description
Cell extraction	Text content from all cells
Structure preservation	Maintains table relationships
Batch upload	Upload multiple files at once
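
One way to picture "header relationships" is that each cell stays paired with its column header when a row is turned into text. The sketch below illustrates that idea only; it is an assumption about the general technique, not Gurubase's actual extraction logic, and `rows_to_records` and `record_to_text` are hypothetical helper names.

```python
def rows_to_records(rows):
    """Pair every cell with its column header, skipping empty cells."""
    header, *body = rows
    return [
        {name: cell for name, cell in zip(header, row) if cell not in ("", None)}
        for row in body
    ]


def record_to_text(record):
    """Flatten one record into a 'Header: value' line that keeps its context."""
    return "; ".join(f"{name}: {value}" for name, value in record.items())


# Hypothetical sheet content: the first row holds the headers.
SHEET = [
    ["Product", "Price", "Notes"],
    ["Widget", 9.5, "In stock"],
    ["Gadget", 12, ""],
]

for record in rows_to_records(SHEET):
    print(record_to_text(record))
```

A sheet with a single, clearly labeled header row fits this model most directly.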

Excel Extraction Best Practices

Learn how to prepare Excel files for optimal extraction

Custom Text

Add custom text directly for FAQs, instructions, or any content that doesn’t fit other categories.

Option	Description
Subtype	Choose “Text” for plain content or “Mermaid” for flowcharts/diagrams
Labels	Add labels to organize and filter text sources

Zendesk Playbooks: To create visual playbooks for the Zendesk app, select “Mermaid” as the subtype and add the “playbooks” label. These will appear in the Playbooks tab with interactive flowcharts.
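
As an illustration of what Mermaid flowchart text can look like, here is a minimal sketch that renders a small decision flow. The refund scenario, the node ids, and the `to_mermaid` helper are all hypothetical; only the `flowchart TD` syntax comes from Mermaid itself.

```python
def to_mermaid(edges):
    """Render (source, label, target) triples as Mermaid flowchart text."""
    lines = ["flowchart TD"]
    for source, label, target in edges:
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {source} {arrow} {target}")
    return "\n".join(lines)


# Hypothetical refund playbook; node ids and wording are illustrative only.
PLAYBOOK = to_mermaid([
    ("A[Customer asks for a refund]", "", "B{Within 30 days}"),
    ("B", "Yes", "C[Issue refund]"),
    ("B", "No", "D[Escalate to billing]"),
])

print(PLAYBOOK)
```

The printed text is what you would paste into a custom text source with the “Mermaid” subtype and the “playbooks” label.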

Web Sources

Website Indexing

Index entire websites or specific pages. Gurubase extracts text, headings, and structured content.

Method	Description
Sitemap Import	Import all URLs from a sitemap
Crawl Website	Automatically discover and crawl all internal pages
Manual URLs	Add specific page URLs
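
To preview which URLs a sitemap import would pick up, you can list the `<loc>` entries yourself. This is a minimal sketch that assumes a standard sitemaps.org `urlset` document; the sample content and the `urls_from_sitemap` helper are hypothetical, and Gurubase may handle sitemap indexes or other variants differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/</loc></url>
  <url><loc>https://example.com/docs/install</loc></url>
</urlset>"""

print(urls_from_sitemap(SAMPLE))
```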

Images from websites are also indexed when your plan includes this feature.

Crawl Options

Option	Description
URL Scope	The crawler only discovers URLs that start with the provided path. Use `https://example.com/` for the entire site, or `https://example.com/docs/` to crawl only the `/docs/` section.
Skip Query Params	Enabled by default. Strips query parameters from URLs (e.g., `?utm_source=...`). Disable for paginated content like `?page=1`, `?page=2`.
Sort URLs	Click the sort button (A-Z icon) to alphabetically sort discovered URLs.
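
The URL Scope and Skip Query Params options can be pictured as a prefix check plus a query-string strip. The sketch below is an approximation under stated assumptions: the function names are hypothetical, and the de-duplication step is an assumption rather than documented crawler behavior.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string, keeping scheme, host, path, and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def in_scope(url: str, scope: str) -> bool:
    """True when the URL starts with the scope prefix."""
    return url.startswith(scope)


def normalize(urls, scope, skip_query_params=True):
    """Apply both options; duplicates are dropped (an assumption of this sketch)."""
    kept = []
    for url in urls:
        if skip_query_params:
            url = strip_query(url)
        if in_scope(url, scope) and url not in kept:
            kept.append(url)
    return kept


# Hypothetical discovered URLs.
FOUND = [
    "https://example.com/docs/intro?utm_source=newsletter",
    "https://example.com/docs/intro",
    "https://example.com/blog/post",
    "https://example.com/docs/list?page=2",
]

print(normalize(FOUND, "https://example.com/docs/"))
print(normalize(FOUND, "https://example.com/docs/", skip_query_params=False))
```

With stripping enabled, the tracking variant collapses into the plain page and `?page=2` loses its page number, which is why paginated content needs the option disabled.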

Skipped Paths: The crawler automatically skips non-content paths like `/feed/`, `/rss/`, `/static/`, `/assets/`, `/media/`, `/wp-admin/`, `/wp-json/`, `/_static/`, and `/_sources/`, as well as common file extensions (images, PDFs, CSS, JS, etc.).
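
If you want to predict which URLs are likely to be skipped, a filter along these lines may help. The path list comes from the note above; the extension list is an assumed subset (the source only says "images, PDFs, CSS, JS, etc."), and matching a segment anywhere in the path is also an assumption of this sketch.

```python
from urllib.parse import urlsplit

SKIPPED_SEGMENTS = (
    "/feed/", "/rss/", "/static/", "/assets/", "/media/",
    "/wp-admin/", "/wp-json/", "/_static/", "/_sources/",
)
# Assumed subset of the skipped extensions; the real list is not documented here.
SKIPPED_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf", ".css", ".js")


def is_skipped(url: str) -> bool:
    """Guess whether a URL falls under the skipped paths or file extensions."""
    path = urlsplit(url).path.lower()
    # Add a trailing slash so "/feed" is treated the same as "/feed/".
    probe = path if path.endswith("/") else path + "/"
    if any(segment in probe for segment in SKIPPED_SEGMENTS):
        return True
    return path.endswith(SKIPPED_EXTENSIONS)


print(is_skipped("https://example.com/wp-admin/options.php"))
print(is_skipped("https://example.com/docs/feedback"))
```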

YouTube Videos

Import video transcripts and metadata from YouTube channels, playlists, or individual videos.

Method	Description
Channel Import	All videos from a YouTube channel
Playlist Import	All videos from a specific playlist
Manual URLs	Specific video URLs
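
The three import methods correspond to three common YouTube URL shapes. The sketch below classifies a URL by shape; it is a rough guide based on public YouTube URL formats, and which exact formats Gurubase accepts is an assumption. `classify_youtube_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlsplit


def classify_youtube_url(url: str) -> str:
    """Return 'video', 'playlist', 'channel', or 'unknown' for a URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    query = parse_qs(parts.query)

    # Short links always point at a single video.
    if host == "youtu.be" and parts.path.strip("/"):
        return "video"
    if host not in ("youtube.com", "m.youtube.com"):
        return "unknown"
    # A watch URL counts as a video even if it also carries a list parameter.
    if parts.path == "/watch" and "v" in query:
        return "video"
    if parts.path == "/playlist" and "list" in query:
        return "playlist"
    if parts.path.startswith(("/channel/", "/@", "/c/", "/user/")):
        return "channel"
    return "unknown"


print(classify_youtube_url("https://www.youtube.com/playlist?list=PL123"))
```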

Code Repositories

GitHub

Index repositories to make code, documentation, and README files searchable. Supports public and private repositories.

Feature	Description
Glob patterns	Control which files are indexed
Private repos	Use GitHub tokens for access
Code + docs	Index code files and documentation

Token Requirements:

Token Type	Permissions Required
Classic	`repo` scope
Fine-grained	“Contents” (read) + “Metadata” (read)
Public repos	No token needed

Classic Token Setup

1. Go to GitHub Tokens (classic) (GitHub Settings → Developer settings → Personal access tokens → Tokens (classic))
2. Click Generate new token (classic)
3. Under Select scopes, check the `repo` checkbox (this grants full control of private repositories)

[Screenshot: GitHub classic token scopes configuration]

Fine-grained Token Setup

1. Go to GitHub Fine-grained tokens (GitHub Settings → Developer settings → Personal access tokens → Fine-grained tokens)
2. Click Generate new token
3. Enter a Token name (e.g., “Gurubase Token”)
4. Set Expiration as needed
5. Under Repository access, select one of:
   - All repositories - Access all current and future repositories
   - Only select repositories - Choose specific repositories (max 50)
6. Under Permissions → Repository permissions, add:
   - Contents: Read-only
   - Metadata: Read-only (required)

[Screenshot: GitHub fine-grained token permissions configuration]
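
Before adding a token to Gurubase, it can be worth confirming that the token can actually read the repository. The sketch below builds a request against GitHub's REST endpoint for repository contents; the helper names are hypothetical, and the owner, repository, and token values are placeholders you supply.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_contents_request(owner, repo, token=None, path=""):
    """Build a GET request for the repository contents endpoint."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/contents"
    if path:
        url += "/" + path
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    # Public repositories can be read without a token.
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


def list_root_entries(owner, repo, token=None):
    """Perform the network call and return the names in the repository root."""
    request = build_contents_request(owner, repo, token)
    with urllib.request.urlopen(request) as response:
        return [entry["name"] for entry in json.load(response)]
```

Calling `list_root_entries("your-org", "your-repo", token="...")` raises `urllib.error.HTTPError` when the request fails; for a private repository, a 404 usually means the token lacks access to it.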

Glob Pattern Examples:

Pattern	Matches
`**/*.py`	All Python files (recursive)
`**/*.{js,ts}`	All JavaScript and TypeScript files
`packages/**`	Everything inside packages folder
`!test_*.py`	Exclude test files

Test your patterns with globster.xyz before adding.
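
For a quick local check, the sketch below implements a small subset of glob matching (`**`, `*`, `?`, `{a,b}`, and a leading `!` for exclusion). It is a simplified approximation, not the matcher Gurubase uses: in particular, comparing slash-free patterns against the file name only is an assumption, and all function names are hypothetical.

```python
import re
from pathlib import PurePosixPath


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a small glob subset (**, *, ?, {a,b}) into a compiled regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")  # anything, including slashes
            i += 2
        elif ch == "*":
            out.append("[^/]*")  # anything within one path segment
            i += 1
        elif ch == "?":
            out.append("[^/]")  # exactly one character within a segment
            i += 1
        elif ch == "{":
            end = pattern.index("}", i)  # raises ValueError if the brace is unclosed
            options = pattern[i + 1:end].split(",")
            out.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
            i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return re.compile("".join(out))


def matches(pattern: str, path: str) -> bool:
    # Assumption: a pattern with no "/" is compared with the file name only.
    target = path if "/" in pattern else PurePosixPath(path).name
    return glob_to_regex(pattern).fullmatch(target) is not None


def select(paths, patterns):
    """Keep paths that match an include pattern and no "!" exclude pattern."""
    includes = [p for p in patterns if not p.startswith("!")]
    excludes = [p[1:] for p in patterns if p.startswith("!")]
    return [
        path for path in paths
        if any(matches(p, path) for p in includes)
        and not any(matches(p, path) for p in excludes)
    ]


# Hypothetical repository layout.
FILES = [
    "setup.py", "src/app/main.py", "src/app/test_main.py",
    "web/index.ts", "packages/ui/button.js", "README.md",
]

print(select(FILES, ["**/*.py", "!test_*.py"]))
```

Real glob engines differ on edge cases such as slash-free patterns and dotfiles, so treat the result as a first pass and confirm against the indexed file list.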

Platform Integrations

Platform integrations provide automated syncing and deeper integration with enterprise tools.

Zendesk

Index support tickets and help center articles. Set up backfill jobs for automated syncing.

Zendesk Integration

Import tickets, articles, comments, and attachments

Confluence

Index Atlassian Confluence spaces and pages. Supports CQL queries for advanced filtering.

Confluence Integration

Sync team documentation and wiki pages
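
As an illustration of CQL filtering, the sketch below assembles a query from a space key, optional labels, and a modification date. The fields used (`space`, `type`, `label`, `lastmodified`) are standard CQL fields; the helper names and sample values are hypothetical, and how Gurubase expects the query to be entered is covered in the Confluence integration guide rather than assumed here.

```python
def cql_quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_cql(space: str, labels=(), modified_since=None) -> str:
    """Build a CQL filter for pages in one space."""
    clauses = [f"space = {cql_quote(space)}", "type = page"]
    if labels:
        quoted = ", ".join(cql_quote(label) for label in labels)
        clauses.append(f"label in ({quoted})")
    if modified_since:
        clauses.append(f"lastmodified >= {cql_quote(modified_since)}")
    return " AND ".join(clauses)


# Hypothetical space key, label, and date.
print(build_cql("DOCS", labels=["howto"], modified_since="2024-01-01"))
```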

Jira

Index project issues, tickets, and related documentation.

Jira Integration

Import issues and project data

Slack

Index Slack conversations and threads. Configure trusted users for targeted content.

Slack Integration

Capture team knowledge from conversations

Google Drive

Connect Google Drive to index documents, spreadsheets, and files.

Google Drive Integration

Sync Google Docs, Sheets, and files

Salesforce

Index Salesforce Knowledge Base articles. Supports SOQL queries for filtering.

Salesforce Integration

Import knowledge base articles
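
A SOQL filter for knowledge articles might look like the query built below. This sketch assumes the Lightning Knowledge object `Knowledge__kav` with its standard `Title`, `UrlName`, `PublishStatus`, and `Language` fields; your org's object name may differ, the helper names are hypothetical, and whether Gurubase expects a full query or only a filter clause is not specified here.

```python
def soql_quote(value: str) -> str:
    """Wrap a value in single quotes, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_knowledge_soql(language="en_US", status="Online", limit=None) -> str:
    """Build a SOQL query for published knowledge articles in one language."""
    query = (
        "SELECT Id, Title, UrlName FROM Knowledge__kav "
        f"WHERE PublishStatus = {soql_quote(status)} "
        f"AND Language = {soql_quote(language)}"
    )
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


print(build_knowledge_soql(limit=100))
```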

Best Practices

Content Organization

- Mix source types - Combine documents, websites, and integrations for comprehensive coverage
- Use labels - Organize text sources with labels (e.g., playbooks, faq)
- Platform integrations - Use for content that changes frequently

Data Quality

- Review indexed content - Check what’s actually indexed in your Guru’s sources
- Use filtering - Glob patterns for GitHub, CQL for Confluence, SOQL for Salesforce
- Test your Guru - Ask sample questions to verify content quality

Keeping Content Fresh

- Backfill jobs - Set up automated syncing for Zendesk, Confluence, Jira, Slack
- Reindex regularly - Use the reindex option for websites and documents
- Monitor status - Check integration status in your Guru’s settings

Next Steps

Create Your First Guru

Deploy a Widget
