Webcrawler

The Webcrawler connection crawls a URL and any links it finds, creating a page per URL.

Creating a Connection

  1. Select an API key from the dropdown at the top. All documents created by the connection will be attributed to the selected API key.
  2. Enter a URL in the URL input.
  3. Check "Restrict domain" if you do NOT want the crawler to follow links to other domains.
  4. Optionally set a max depth. This controls how many links deep the crawler will go: a link on the starting page has a depth of 1, a link on that link's page has a depth of 2, and so on.
  5. Optionally set a max pages. This caps the total number of links followed; for example, set it to 10 to follow at most 10 links. (Steps 3-5 are illustrated in the sketch after this list.)
  6. Fill out any metadata you want to associate with the resulting files, in JSON format. You can use this metadata to filter the data later. You can leave it blank, or set it to something like:
     {
       "company": "acme"
     }
      
  7. Select the import mode.
  8. Click "Create Connection" at the top right.
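
To make the crawl settings concrete, here is a minimal, illustrative breadth-first crawler in Python. It is not the connector's implementation; the function and parameter names (crawl, restrict_domain, max_depth, max_pages) are assumptions chosen to mirror the settings above.

  # Illustration only: a minimal breadth-first crawler showing how the
  # "Restrict domain", "Max depth", and "Max pages" settings interact.
  # This is NOT the connector's implementation.
  from collections import deque
  from html.parser import HTMLParser
  from urllib.parse import urljoin, urlparse
  from urllib.request import urlopen

  class LinkExtractor(HTMLParser):
      """Collects href values from <a> tags."""
      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def crawl(start_url, restrict_domain=True, max_depth=2, max_pages=10):
      start_host = urlparse(start_url).netloc
      queue = deque([(start_url, 0)])            # (url, depth); the start URL has depth 0
      seen, pages = {start_url}, []

      while queue and len(pages) < max_pages:    # "Max pages" caps total pages imported
          url, depth = queue.popleft()
          try:
              html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
          except OSError:
              continue                           # skip pages that fail to load
          pages.append(url)                      # each crawled URL becomes one page

          if depth >= max_depth:                 # "Max depth": don't follow links any deeper
              continue
          extractor = LinkExtractor()
          extractor.feed(html)
          for href in extractor.links:
              link = urljoin(url, href)
              if restrict_domain and urlparse(link).netloc != start_host:
                  continue                       # "Restrict domain": stay on the starting domain
              if link not in seen:
                  seen.add(link)
                  queue.append((link, depth + 1))
      return pages

With max_depth=2 and max_pages=10, the sketch follows links at most two hops from the starting URL and stops after ten pages, matching the semantics described in steps 3-5.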

You will be taken back to the connectors page, and the system will schedule a sync process as soon as possible.

What is synced?

  • The URL you entered, imported as a Markdown file
  • Any links it finds and follows (subject to the settings above), also imported as Markdown files; a sketch of the Markdown conversion follows below.
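
"Imported as Markdown" means each crawled page is converted from HTML to a Markdown document. The sketch below illustrates the idea using the third-party markdownify package; the connector performs its own conversion, so treat this only as an approximation.

  # Illustration only: converting a fetched page's HTML to Markdown.
  # The connector performs its own conversion; this just shows the idea.
  from urllib.request import urlopen
  from markdownify import markdownify   # third-party: pip install markdownify

  html = urlopen("https://example.com", timeout=10).read().decode("utf-8", errors="ignore")

  # Write the converted page out as a Markdown file, one file per URL.
  with open("example-com.md", "w", encoding="utf-8") as f:
      f.write(markdownify(html))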