dataproductpoc-docs

Data Product Proof of Concept (PoC)

The aim of the PoC is to derive a viable set of architectural standards by proving that the concepts outlined in this Architecture Wiki are feasible in practice.

This documentation will initially be based on a data product with a REST API at its core.

The REST API is particularly useful to:-

Other methods are possible, however.

The choice of a REST API does not affect the way that target datasets are consumed. The PoC will demonstrate that data & metadata can also be retrieved securely from a relational database and a queue.
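As an illustration of retrieving data from the queue, a minimal consumer sketch is shown below. It assumes a Confluent Kafka topic named countries and SASL credentials issued to the consumer; the bootstrap server, topic, group id and credentials are placeholders rather than the PoC's actual configuration.

```python
# Minimal sketch: consuming data product messages from a Confluent Kafka topic.
# Bootstrap server, credentials, topic and group id are placeholder assumptions.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "pkc-xxxxx.westeurope.azure.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
    "group.id": "data-product-poc-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["countries"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        # Each message carries one countries/continents record.
        print(msg.key(), msg.value().decode("utf-8"))
finally:
    consumer.close()
```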

Data Product PoC System Architecture

data product system architecture

For the PoC, we simplify what would be needed in production by providing the following:-

  1. The ports can be built out as REST API endpoints, database connections or queue connections
  2. The data pipeline that moves data between the layers can simply be executed as SQL statements. We will demonstrate the ability to POST transformation SQL so that new datasets can be created dynamically (see the sketch after this list).
  3. We will upload countries and continents files via the input data port and load these into relational tables
  4. We will also stream countries and continents data into a Confluent Kafka queue in the input data layer.
  5. The metadata that can be retrieved via the discovery port will be stored in a relational database, so it can also be accessed via standard SQL queries
  6. A dataset authorisation database will link users to roles and roles to datasets. These dataset authorisation permissions will be implemented as grant policies against the target datasets held in the dataset database, thus preventing unauthorised access by users who directly query this information.
  7. To ensure consistent builds on multiple cloud platforms, the code will be containerised. For the PoC, the container will be hosted on Azure App Service; for local deployment, we will use Docker.
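As a hedged illustration of points 1 and 2 above (the framework, routes and table handling shown here are assumptions, not the PoC's actual code), an input data port exposing a source-file upload endpoint and a pipeline-SQL endpoint might look like the following FastAPI sketch:

```python
# Sketch of an input data port built out as REST API endpoints.
# Framework, paths and database are illustrative assumptions.
import sqlite3

from fastapi import FastAPI, File, UploadFile

app = FastAPI(title="Data Product PoC - input data port")
DB_PATH = "dataproduct.db"  # stand-in for the relational dataset database


@app.post("/input/files/{dataset}")
async def upload_source_file(dataset: str, file: UploadFile = File(...)):
    """Accept a CSV/JSON source file (e.g. countries or continents)."""
    contents = await file.read()
    # In the PoC this would be parsed and loaded into a relational table.
    return {"dataset": dataset, "bytes_received": len(contents)}


@app.post("/input/pipeline")
async def upload_pipeline_sql(file: UploadFile = File(...)):
    """Accept pipeline SQL and execute it so new datasets can be created dynamically."""
    sql = (await file.read()).decode("utf-8")
    with sqlite3.connect(DB_PATH) as conn:
        conn.executescript(sql)
    return {"pipeline_executed": True}
```

Run locally (for example with uvicorn), a FastAPI app like this would also expose the standard OpenAPI /docs page referred to later in this document.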

What will be proved?

We will prove that a data product can:-

  1. Be containerised and deployed to a cloud platform (Azure App Service)
  2. Provide metadata & docs via an API endpoint and a relational database. Metadata can also be exported from the relational database.
  3. Provide the data pipeline SQL as an output so a user knows what transformation has been applied.
  4. Accept source files in both CSV & JSON format into an input data port
  5. Be triggered to pull in source datasets from an Azure Data Lake via a Databricks SQL warehouse.
  6. Accept messages into a Confluent Kafka queue via an input data port
  7. Accept a data dictionary (aka schema) that defines the source files into an input data port
  8. Transform the source files/messages into target datasets using a pipeline.sql script injected into an input data port
  9. Store the target datasets in a queue, file or relational database table
  10. Provide the target datasets in more than one format (JSON or CSV) to the output data port
  11. Be secured via:-
    • user authentication
    • data product authorisation
    • target dataset authorisation
  12. Have its target data securely consumed by a client application, e.g. Power BI or Python code (see the sketch after this list).
  13. Have its data, metadata, pipeline and schema files maintained via a mockup of a data product admin website
  14. Be discovered via a mockup of a data marketplace.
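For point 12, a minimal sketch of a Python consumer follows; the App Service host, dataset name and bearer-token authentication scheme are placeholder assumptions rather than the PoC's actual interface.

```python
# Sketch of a Python client securely consuming a target dataset from the
# output data port. URLs, dataset name and auth scheme are placeholder assumptions.
import io

import pandas as pd
import requests

BASE_URL = "https://dataproduct-poc.azurewebsites.net"  # hypothetical App Service host
TOKEN = "<bearer-token-obtained-via-user-authentication>"

response = requests.get(
    f"{BASE_URL}/output/datasets/countries_by_continent",
    headers={"Authorization": f"Bearer {TOKEN}", "Accept": "text/csv"},
    timeout=30,
)
response.raise_for_status()

# Load the CSV payload into a DataFrame, ready for analysis.
df = pd.read_csv(io.StringIO(response.text))
print(df.head())
```

A Power BI report could point its Web connector at the same endpoint.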

This should be sufficient to prove the concept.

Out of Scope for the PoC

There are other items that are not being proved but could utilise techniques similar to those demonstrated above. For example:-

Other factors aren’t being demonstrated simply because solutions for these are already known. For example, the following docs/metadata can be linked to the data product using URLs:-

Other factors aren’t being demonstrated because they are more relevant for production than for a proof of concept, e.g.:-

What tooling will be used in the PoC?

The choices were made for speed of delivery. Every organisation’s delivery teams will undoubtedly make different tool choices, but this is a Proof of Concept, so tools can easily be swapped out so long as they provide the same functionality.

Coding

Data

Security

Infrastructure

What data will be used in the PoC?

Input Data

For simplicity, we will be using 2 CSV files as input:-

We will be demonstrating 2 main methods for getting these source datasets into the data product:-

Transformation

Data pipelining SQL will be uploaded via the input data port (using an HTTP POST request). This will then fire a transformation that merges the data into a combined continents-and-countries dataset, which will be saved in a relational database.
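As an illustration of this flow, the upload might look like the sketch below; the host, endpoint path, table names and join columns are assumptions made for the example, not the PoC's actual pipeline.

```python
# Sketch of uploading transformation SQL to the input data port via HTTP POST.
# Host, endpoint path, table names and join columns are illustrative assumptions.
import requests

PIPELINE_SQL = """
-- Merge the staged source tables into a single target dataset.
CREATE TABLE IF NOT EXISTS countries_by_continent AS
SELECT c.country_code,
       c.country_name,
       k.continent_name
FROM countries AS c
JOIN continents AS k
  ON c.continent_code = k.continent_code;
"""

response = requests.post(
    "https://dataproduct-poc.azurewebsites.net/input/pipeline",
    files={"file": ("pipeline.sql", PIPELINE_SQL, "text/plain")},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```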

Output Data

This abstracted dataset will then be further transformed into JSON and CSV files, which will be made available to the data consumer from:-

What docs/metadata will be available in the PoC?

The REST API will provide standard OpenAPI docs via the /docs discovery port. The description information will also include extra data product metadata. This metadata will also be stored in a relational database so that it can easily be extracted to Collibra or viewed by data consumers who do not use the REST API.
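To illustrate, assuming the API also serves its machine-readable OpenAPI document at the conventional /openapi.json path (the host below is a placeholder), the metadata could be pulled programmatically like this:

```python
# Sketch of retrieving the data product's OpenAPI document, whose description
# carries the extra data product metadata. Host and path are assumptions.
import requests

BASE_URL = "https://dataproduct-poc.azurewebsites.net"

openapi = requests.get(f"{BASE_URL}/openapi.json", timeout=30).json()

info = openapi.get("info", {})
print("Data product:", info.get("title"))
print("Version:", info.get("version"))
print("Description / extra metadata:")
print(info.get("description"))
```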

How will the data consumer be able to consume the data product data?

What has been completed so far?

Completed data product PoC architecture