A data lake is a system or repository where all of an enterprise's data, both structured and unstructured, is stored. The data is stored as-is in its raw form and is used to provide the enterprise with business capabilities such as data analytics, operational reporting and business intelligence.
In this blog, I will focus on data lakes deployed in the cloud and give an overview of the architectural aspects to consider when building one.
What does a data lake comprise?
A data lake comprises the main components shown in the diagram above:
Ingest and Store
A data lake has to be able to “collect anything”, regardless of whether it is structured data (e.g. enterprise ERP and CRM applications), unstructured data (e.g. system log files) or real-time streaming data (e.g. IoT, click stream).
Moreover, it needs to keep up with the storage needs of the business as new data sources are added, and to retain the data for the multiple years required for process and compliance reasons.
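One common way to keep raw data organised as sources multiply is to encode the source and arrival date in the object key. The snippet below is a minimal sketch of that idea. The `raw/` prefix, the `source=`/`year=`/`month=`/`day=` layout and the function name are my own illustrative assumptions, not something a data lake mandates.

```python
from datetime import datetime, timezone


def raw_object_key(source: str, filename: str, received_at: datetime) -> str:
    """Build a date-partitioned object key for the raw zone of the lake.

    The key layout is a hypothetical convention chosen for illustration.
    """
    ts = received_at.astimezone(timezone.utc)  # normalise to UTC before partitioning
    return (
        f"raw/source={source}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
        f"{filename}"
    )


key = raw_object_key(
    "crm", "accounts.json", datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)
)
print(key)  # raw/source=crm/year=2024/month=01/day=15/accounts.json
```

Keeping the date in the key makes it easier to apply retention rules to a prefix later and to scan only the days a query needs.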
Catalog and Search
Having a large amount of data in a data lake without knowing how it is organised is not going to deliver any value to the business. We would instead end up with a data swamp.
A data catalog (i.e. metadata information) is needed to act as the source of truth of the content of the data lake. A data catalog supports data discovery and search by business users such as data scientists.
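To make the idea concrete, here is a toy in-memory catalog. It is only a sketch of the concept: the class names, fields and the `s3://example-lake/...` locations are hypothetical, and a real catalog service stores far richer metadata.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one dataset in the lake (illustrative fields only)."""
    name: str
    location: str
    data_format: str
    columns: dict[str, str]  # column name -> type
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """A toy catalog: register datasets, then search them by keyword."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[str]:
        """Return names of datasets whose name, columns or tags contain `term`."""
        term = term.lower()
        hits = []
        for entry in self._entries.values():
            haystack = [entry.name, *entry.columns, *entry.tags]
            if any(term in item.lower() for item in haystack):
                hits.append(entry.name)
        return sorted(hits)


catalog = DataCatalog()
catalog.register(CatalogEntry(
    "crm_accounts", "s3://example-lake/raw/source=crm/", "json",
    {"customer_id": "string", "region": "string"}, {"pii"},
))
catalog.register(CatalogEntry(
    "web_clicks", "s3://example-lake/raw/source=web/", "parquet",
    {"session_id": "string", "url": "string"},
))
print(catalog.search("customer"))  # ['crm_accounts']
```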
A data lake contains all the business data, so security is paramount. Data has to be secured at rest and in transit. Access to the data should be limited to authorised personnel. Any data governance required by the business (transaction, record keeping, data privacy rules etc.) needs to be supported.
Why Cloud Data Lake
The capabilities described in the last section make a data lake particularly well suited to a cloud solution. I will be using AWS in the examples, but similar services are offered by other cloud vendors.
Ingest & Store
All major cloud providers offer data storage solutions, e.g. AWS S3, that hold a virtually unlimited amount of data with high availability and durability. The elastic nature of cloud computing means a data lake can be built with relatively little storage (and hence a low initial cost) provisioned at the beginning and scaled up as the amount of data grows.
Managed cloud services are also available for ingesting data from various data sources. Examples include AWS Kinesis Data Firehose for streaming data and AWS Data Pipeline for application log files.
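As a rough sketch of what streaming ingestion can look like from the producer side, the snippet below sends one event to a Firehose delivery stream. I am assuming the client is created with `boto3.client("firehose")` and that the delivery stream already exists and writes to S3. The helper names and the newline-delimited JSON encoding are my own choices.

```python
import json


def to_firehose_record(event: dict) -> dict:
    """Serialise one event as a newline-terminated JSON record."""
    payload = json.dumps(event, sort_keys=True) + "\n"
    return {"Data": payload.encode("utf-8")}


def send_event(firehose_client, stream_name: str, event: dict) -> dict:
    """Send one event to a delivery stream.

    `firehose_client` is assumed to be boto3.client("firehose");
    `stream_name` is a delivery stream you have already created.
    """
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record=to_firehose_record(event),
    )


record = to_firehose_record({"user": "u1", "page": "/home"})
print(record["Data"])  # b'{"page": "/home", "user": "u1"}\n'
```

The trailing newline means the objects Firehose writes to S3 can be read line by line by downstream tools.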
Catalog & Search
Building and maintaining a data catalog for the data lake manually is non-trivial, time-consuming and error-prone. Instead, a managed cloud service such as AWS Glue can be used to create an integrated data catalog. Metadata about the incoming data can be discovered automatically by an AWS Glue crawler.
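For illustration, a crawler pointed at an S3 prefix could be defined along these lines. This is a sketch under assumptions: the client is `boto3.client("glue")`, and the IAM role, database name, S3 path and nightly schedule are placeholders you would replace with your own.

```python
def crawler_request(name: str, role_arn: str, database: str, s3_path: str,
                    schedule: str = "cron(0 2 * * ? *)") -> dict:
    """Build the arguments for Glue's create_crawler call.

    The default schedule (02:00 UTC daily) is an arbitrary example value.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": schedule,
    }


def create_and_start(glue_client, request: dict) -> None:
    """Create the crawler, then trigger a first run immediately.

    `glue_client` is assumed to be boto3.client("glue").
    """
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


request = crawler_request(
    name="raw-crm-crawler",                                      # hypothetical name
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",   # placeholder role
    database="example_lake_raw",                                 # hypothetical database
    s3_path="s3://example-lake/raw/source=crm/",                 # placeholder path
)
print(request["Targets"])
```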
A Cloud Data Lake can leverage standard security features in the cloud to satisfy data security and compliance requirements. For example, the encryption features of AWS S3 can encrypt data before it is stored to secure it at rest, and SSL/TLS protects data in transit. AWS Identity & Access Management (IAM) can be used to implement access control to the Cloud Data Lake.
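A minimal sketch of both controls on a single bucket might look like this. I am assuming a `boto3.client("s3")` client and a placeholder bucket name. The policy below denies any request made without TLS, and SSE-S3 (`AES256`) is just one of the available encryption options.

```python
import json


def default_encryption_config() -> dict:
    """Server-side encryption applied by default to new objects (SSE-S3)."""
    return {
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    }


def tls_only_policy(bucket: str) -> str:
    """Bucket policy that rejects requests not sent over SSL/TLS."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }
    return json.dumps(policy)


def secure_bucket(s3_client, bucket: str) -> None:
    """Apply both settings; `s3_client` is assumed to be boto3.client("s3")."""
    s3_client.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=default_encryption_config(),
    )
    s3_client.put_bucket_policy(Bucket=bucket, Policy=tls_only_policy(bucket))
```

Access for individual users and roles would still be granted separately through IAM policies, which this sketch does not cover.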
Data governance is an area where the business can benefit from services provided by the cloud. For example, object lifecycle management in AWS S3 can be used to support record keeping.
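As an illustration, a retention rule that archives objects and later deletes them could be expressed as follows. The 90-day and 2555-day (roughly seven years) figures are made-up example values, not a recommendation, and the client is assumed to be `boto3.client("s3")`.

```python
def retention_rule(prefix: str, archive_after_days: int, expire_after_days: int) -> dict:
    """Build one S3 lifecycle rule: move to Glacier, then expire."""
    if expire_after_days <= archive_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "ID": "retain-" + prefix.strip("/").replace("/", "-"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }


def apply_rules(s3_client, bucket: str, rules: list[dict]) -> None:
    """Replace the bucket's lifecycle configuration with `rules`."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


rule = retention_rule("raw/source=crm/", archive_after_days=90, expire_after_days=2555)
print(rule["ID"])  # retain-raw-source=crm
```

Note that this call overwrites any lifecycle configuration already on the bucket, so all rules need to be passed together.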
Another important thing to consider is that cloud vendors upgrade their big data and cloud capabilities regularly and frequently. A Cloud Data Lake makes it easier to scale and to adopt new capabilities as they become available.
An example is Amazon Macie, a relatively new service that uses machine learning to identify sensitive data such as personally identifiable information (PII).
Building a data lake in the cloud makes sense, and as an architect it is important to take advantage of the services provided in the cloud in the solution design. I would start with the following best practices:
- Separate storage and compute – so they can be provisioned and scaled individually.
- Start small – pull in data only when needed, to avoid creating a data swamp. Data is only worth storing when it can deliver value to the business.
- Use monitoring tools – so you know what is in the data lake and how it is being used.
- Go serverless – leverage managed services in the cloud as much as possible to reduce time and complexity.