API Crawler Collector

The API Crawler collector pulls data from external APIs based on a declarative configuration. It supports complex and dynamic API interactions, including multi-step calls, data transformations, context-based processing, pagination, authentication, and nested requests.

This collector is ideal for scenarios where:

  • You need to integrate with a third-party service that provides a REST API but does not offer a push or streaming mechanism.
  • Data needs to be fetched on a regular, predictable schedule (e.g., every 15 minutes, once a day).
  • The integration requires handling complex interactions like authentication, pagination, or simple data transformations defined in a configuration file.

Key Features

  • Scheduled Execution: Uses CRON expressions for flexible, automated polling of APIs.
  • Pull-Based Model: Actively fetches data from source APIs, making it independent of the source's ability to push.
  • Highly Configurable Crawling: Powered by the go-apigorowler library, allowing for complex API interaction logic—including authentication, pagination, and dynamic requests—to be defined in a single YAML file.
  • Batch and Stream Processing: Can be configured to either wait for the entire crawl to finish and publish a single result, or to publish data items individually as they are discovered during the crawl.

How it Works

The API Crawler reads a YAML configuration file that defines the sequence of API calls to be made, how to process their responses, and how to manage the data flow. The collector then executes these crawls periodically, according to the configured cron schedule.

Data fetched and processed by the crawler is published to the Open Data Hub via the message queue (RabbitMQ) as raw JSON, similar to other collectors.

Configuration

The API Crawler's behavior is driven by environment variables and by the YAML configuration file referenced by CONFIG_PATH.

Environment Variables

The following environment variables are essential for configuring the API Crawler collector:

| Variable Name | Description | Example Value |
| --- | --- | --- |
| MQ_URI | The connection URI for the RabbitMQ instance. | amqp://guest:guest@rabbitmq |
| MQ_CLIENT | A unique identifier for the data collector client connecting to RabbitMQ. | dc-api-crawler-client |
| MQ_EXCHANGE | The RabbitMQ exchange to which the collected data will be published. | ingress |
| LOGLEVEL | The logging level for the collector. | DEBUG |
| PROVIDER | The unique identifier for the data provider and dataset. This value is included in the published data. | myprovider/mydataset |
| CRON | The cron schedule string defining how often the crawler should run, including seconds. | 0 * * * * * |
| CONFIG_PATH | The file path to the API crawler's configuration file. | configuration_file.yaml |
| SERVICE_NAME | The name of the service for telemetry purposes. | dc-api-crawler-myservice |
| TELEMETRY_TRACE_GRPC_ENDPOINT | The gRPC endpoint for sending telemetry traces. | localhost:4317 |
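
As an illustration, these variables might be supplied through a container definition such as the following docker-compose sketch. The service and image names are placeholders; the values simply mirror the examples in the table above.

```yaml
services:
  dc-api-crawler:
    image: dc-api-crawler:latest            # placeholder image name
    environment:
      MQ_URI: amqp://guest:guest@rabbitmq
      MQ_CLIENT: dc-api-crawler-client
      MQ_EXCHANGE: ingress
      LOGLEVEL: DEBUG
      PROVIDER: myprovider/mydataset
      CRON: "0 * * * * *"                   # six-field cron (seconds first): runs once per minute, at second 0
      CONFIG_PATH: configuration_file.yaml
      SERVICE_NAME: dc-api-crawler-myservice
      TELEMETRY_TRACE_GRPC_ENDPOINT: localhost:4317
```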

configuration_file.yaml Configuration

The CONFIG_PATH environment variable points to this file, which is the core of the API Crawler's functionality. The YAML file uses the ApiGorowler library's declarative syntax to define:

  • API requests: URLs, HTTP methods, headers, and body content.
  • Authentication: How to authenticate with the target APIs.
  • Data extraction and transformation: Using jq for JSON parsing and transformation.
  • Flow control: Using foreach loops for iterating over collections or paginated results.
  • Context management: How data from different steps is combined and made available to subsequent steps.

For detailed information on configuring this file, please refer to the ApiGorowler package documentation. It covers the schema for request and foreach steps, authentication methods, jq and Go template usage, and context management.
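
As a rough, hypothetical illustration, a configuration might combine a request step whose response is filtered with a jq expression and a foreach step that issues a nested request per item. The endpoint URLs, field names, and exact keys shown here are illustrative assumptions, not the authoritative schema; consult the ApiGorowler documentation for the real step definitions.

```yaml
# Hypothetical configuration sketch: key names and structure are illustrative only;
# the authoritative schema is documented in the ApiGorowler package.
steps:
  - request:
      name: stations                              # result stored in the crawl context under this name
      method: GET
      url: https://api.example.com/v1/stations    # placeholder endpoint
      headers:
        Authorization: "Bearer {{ .token }}"      # Go template reading a value from the context
      transform: ".data[]"                        # jq expression applied to the JSON response
  - foreach:
      items: ".stations"                          # iterate over the items extracted above
      steps:
        - request:
            name: measurements
            method: GET
            url: "https://api.example.com/v1/stations/{{ .item.id }}/measurements"
```

Each step's output is merged into the crawl context, which is what allows later steps to reference earlier results; the exact context semantics are described in the ApiGorowler documentation.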

The ApiGorowler configuration builder IDE is highly recommended for developing and debugging your configuration files; it lets you execute and inspect a configuration in real time.