Build comprehensive knowledge bases by combining content from websites, documents, code repositories, and enterprise platforms. Gurubase indexes your content to create AI assistants that understand your specific domain.

At a Glance

  • Website - Crawl or sitemap import
  • PDF - Documents & manuals
  • Excel - Spreadsheets & data
  • GitHub - Repositories & code
  • YouTube - Video transcripts
  • Text - Custom snippets
  • Zendesk - Tickets & articles
  • Confluence - Wiki & docs
  • Jira - Issues & projects
  • Slack - Conversations
  • Google Drive - Docs & files
  • Salesforce - Knowledge base

File-Based Sources

PDF Documents

Upload PDF files to index text and images. Perfect for documentation, research papers, and manuals.
Feature | Description
Text extraction | Full text content from all pages
Image indexing | Images within PDFs are also indexed
Batch upload | Upload multiple PDFs at once

Excel Files

Upload spreadsheets (.xls, .xlsx) to index tabular data with header relationships.
Feature | Description
Cell extraction | Text content from all cells
Structure preservation | Maintains table relationships
Batch upload | Upload multiple files at once

Excel Extraction Best Practices

Learn how to prepare Excel files for optimal extraction

Custom Text

Add custom text directly for FAQs, instructions, or any content that doesn’t fit other categories.
Option | Description
Subtype | Choose “Text” for plain content or “Mermaid” for flowcharts/diagrams
Labels | Add labels to organize and filter text sources
Zendesk Playbooks: To create visual playbooks for the Zendesk app, select “Mermaid” as the subtype and add the “playbooks” label. These will appear in the Playbooks tab with interactive flowcharts.
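To give a feel for what a “Mermaid” text source might contain, here is a minimal sketch that assembles flowchart text from a list of steps. The helper name `mermaid_flowchart` and the sample playbook steps are hypothetical and only illustrate the shape of the content; the printed text is what you would paste into the source.

```python
def mermaid_flowchart(steps):
    """Build top-down Mermaid flowchart text from (id, label, next_id) tuples.

    next_id may be None for the final step. Hypothetical helper for illustration.
    """
    lines = ["flowchart TD"]
    for node_id, label, next_id in steps:
        # Declare the node with a quoted label.
        lines.append(f'    {node_id}["{label}"]')
        if next_id:
            # Draw an arrow to the following step.
            lines.append(f"    {node_id} --> {next_id}")
    return "\n".join(lines)


# Hypothetical support playbook, used only as sample data.
playbook = mermaid_flowchart([
    ("A", "Customer reports login issue", "B"),
    ("B", "Ask for account email", "C"),
    ("C", "Send password reset link", None),
])
print(playbook)
```

Per the note above, a source like this would also need the “playbooks” label to appear in the Playbooks tab.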

Web Sources

Website Indexing

Index entire websites or specific pages. Gurubase extracts text, headings, and structured content.
Method | Description
Sitemap Import | Import all URLs from a sitemap
Crawl Website | Automatically discover and crawl all internal pages
Manual URLs | Add specific page URLs
Images from websites are also indexed (availability depends on your plan).
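To preview which pages a sitemap import would likely pick up, you can list the `<loc>` entries yourself. The sketch below assumes a standard sitemaps.org `urlset` document. The function name `sitemap_urls` and the sample XML are illustrative, not part of Gurubase.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


# Hypothetical sitemap content used as sample input.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/</loc></url>
  <url><loc>https://example.com/docs/install</loc></url>
</urlset>"""

print(sitemap_urls(SAMPLE))
```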

Crawl Options

Option | Description
URL Scope | The crawler only discovers URLs that start with the provided path. Use https://example.com/ for the entire site, or https://example.com/docs/ to crawl only the /docs/ section.
Skip Query Params | Enabled by default. Strips query parameters from URLs (e.g., ?utm_source=...). Disable for paginated content like ?page=1, ?page=2.
Sort URLs | Click the sort button (A-Z icon) to alphabetically sort discovered URLs.
Skipped Paths: The crawler automatically skips non-content paths like /feed/, /rss/, /static/, /assets/, /media/, /wp-admin/, /wp-json/, /_static/, /_sources/, and common file extensions (images, PDFs, CSS, JS, etc.).
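The crawl options above can be pictured as a small filter applied to each discovered URL. This is a rough sketch, not Gurubase's actual crawler: the function names are hypothetical, the extension list is a short illustrative sample, and matching skipped directories anywhere in the path is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit

# Directory names copied from the "Skipped Paths" note.
SKIPPED_DIRS = ("/feed/", "/rss/", "/static/", "/assets/", "/media/",
                "/wp-admin/", "/wp-json/", "/_static/", "/_sources/")
# Illustrative sample only; the full extension list is not given here.
SKIPPED_EXTENSIONS = (".png", ".jpg", ".pdf", ".css", ".js")


def normalize(url, skip_query_params=True):
    """Optionally strip the query string, mirroring "Skip Query Params"."""
    parts = urlsplit(url)
    query = "" if skip_query_params else parts.query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


def should_crawl(url, scope, skip_query_params=True):
    """Return the normalized URL if it passes the filters, else None."""
    url = normalize(url, skip_query_params)
    if not url.startswith(scope):  # URL Scope: simple prefix check
        return None
    path = urlsplit(url).path
    if any(d in path for d in SKIPPED_DIRS):  # assumption: match anywhere in path
        return None
    if path.lower().endswith(SKIPPED_EXTENSIONS):
        return None
    return url


scope = "https://example.com/docs/"
print(should_crawl("https://example.com/docs/intro?utm_source=x", scope))
print(should_crawl("https://example.com/docs/list?page=2", scope, skip_query_params=False))
```

With query stripping enabled, `?page=1` and `?page=2` collapse to the same URL, which is why the option should be disabled for paginated content.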

YouTube Videos

Import video transcripts and metadata from YouTube channels, playlists, or individual videos.
Method | Description
Channel Import | All videos from a YouTube channel
Playlist Import | All videos from a specific playlist
Manual URLs | Specific video URLs

Code Repositories

GitHub

Index repositories to make code, documentation, and README files searchable. Supports public and private repositories.
Feature | Description
Glob patterns | Control which files are indexed
Private repos | Use GitHub tokens for access
Code + docs | Index code files and documentation
Token Requirements:
Token Type | Permissions Required
Classic | repo scope
Fine-grained | “Contents” (read) + “Metadata” (read)
Public repos | No token needed
Classic token:
  1. Go to GitHub Tokens (classic) (GitHub Settings → Developer settings → Personal access tokens → Tokens (classic))
  2. Click Generate new token (classic)
  3. Under Select scopes, check the repo checkbox (this grants full control of private repositories)
[Screenshot: GitHub classic token scopes configuration]
Fine-grained token:
  1. Go to GitHub Fine-grained tokens (GitHub Settings → Developer settings → Personal access tokens → Fine-grained tokens)
  2. Click Generate new token
  3. Enter a Token name (e.g., “Gurubase Token”)
  4. Set Expiration as needed
  5. Under Repository access, select one of:
    • All repositories - Access all current and future repositories
    • Only select repositories - Choose specific repositories (max 50)
  6. Under Permissions → Repository permissions, add:
    • Contents: Read-only
    • Metadata: Read-only (required)
[Screenshot: GitHub fine-grained token permissions configuration]
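Before adding a private repository, it can help to confirm that the token can actually read its contents. The sketch below builds a request against GitHub's REST "repository contents" endpoint. It is an independent check, not something Gurubase runs, and the function name, the repository, and the token value are placeholders.

```python
import urllib.request


def build_contents_request(owner, repo, path, token=None):
    """Build a GET request for a file or directory in a GitHub repository."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        # Public repositories work without this header.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


# Placeholder values; substitute your own repository and token.
req = build_contents_request("octocat", "Hello-World", "README", token="YOUR_TOKEN")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` sends the request. A successful response suggests the token has read access to contents, while an HTTP error usually points to a missing permission or a repository not covered by the token.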
Glob Pattern Examples:
Pattern | Matches
**/*.py | All Python files (recursive)
**/*.{js,ts} | All JavaScript and TypeScript files
packages/** | Everything inside packages folder
!test_*.py | Exclude test files
Test your patterns with globster.xyz before adding.
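To reason about how the example patterns combine, here is a small matcher. It is a simplified model, not Gurubase's implementation. It assumes that a pattern without a slash is compared against the file name at any depth, that `!` exclusions always win regardless of order, and that `**/` may match zero directories. All function names and file paths are illustrative.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob with **, *, ? and {a,b} into a compiled regex."""
    i, out = 0, []
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")      # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")            # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")         # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        elif pattern[i] == "{" and "}" in pattern[i:]:
            end = pattern.index("}", i)
            options = pattern[i + 1:end].split(",")
            out.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
            i = end + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def select_files(paths, patterns):
    """Keep paths matched by an include pattern and by no "!" exclusion."""
    includes = [glob_to_regex(p) for p in patterns if not p.startswith("!")]
    excludes = [glob_to_regex(p[1:]) for p in patterns if p.startswith("!")]

    def matches(regexes, path):
        name = path.rsplit("/", 1)[-1]
        # Try the full path, then the bare file name (assumption, see above).
        return any(rx.fullmatch(path) or rx.fullmatch(name) for rx in regexes)

    return [p for p in paths
            if (not includes or matches(includes, p)) and not matches(excludes, p)]


# Hypothetical repository layout used as sample input.
files = ["setup.py", "src/app.py", "src/test_app.py",
         "web/index.ts", "web/style.css", "packages/core/README.md"]
print(select_files(files, ["**/*.py", "!test_*.py"]))
```

Because real glob engines differ on details such as basename matching, checking patterns in a tester before adding them is still worthwhile.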

Platform Integrations

Platform integrations provide automated syncing and deeper integration with enterprise tools.

Zendesk

Index support tickets and help center articles. Set up backfill jobs for automated syncing.

Zendesk Integration

Import tickets, articles, comments, and attachments

Confluence

Index Atlassian Confluence spaces and pages. Supports CQL queries for advanced filtering.

Confluence Integration

Sync team documentation and wiki pages

Jira

Index project issues, tickets, and related documentation.

Jira Integration

Import issues and project data

Slack

Index Slack conversations and threads. Configure trusted users for targeted content.

Slack Integration

Capture team knowledge from conversations

Google Drive

Connect Google Drive to index documents, spreadsheets, and files.

Google Drive Integration

Sync Google Docs, Sheets, and files

Salesforce

Index Salesforce Knowledge Base articles. Supports SOQL queries for filtering.

Salesforce Integration

Import knowledge base articles

Best Practices

  • Mix source types - Combine documents, websites, and integrations for comprehensive coverage
  • Use labels - Organize text sources with labels (e.g., playbooks, faq)
  • Platform integrations - Use for content that changes frequently
  • Review indexed content - Check what’s actually indexed in your Guru’s sources
  • Use filtering - Glob patterns for GitHub, CQL for Confluence, SOQL for Salesforce
  • Test your Guru - Ask sample questions to verify content quality
  • Backfill jobs - Set up automated syncing for Zendesk, Confluence, Jira, Slack
  • Reindex regularly - Use the reindex option for websites and documents
  • Monitor status - Check integration status in your Guru’s settings

Next Steps