Data types
There are three types of data:
- Structured data is organized according to a predefined schema and can be easily stored and queried in databases.
- Unstructured data is more complex and may require specialized storage solutions. It does not have a predefined structure or schema and can be in the form of text, images, videos, audio, etc. Examples of unstructured data include social media posts, customer reviews, and multimedia content. Unstructured data can be more challenging to store and analyze than structured data due to its lack of organization and consistency.
- Semi-structured data is a mix of structured and unstructured data. It has some organizational properties but does not conform to a rigid schema. Examples of semi-structured data include JSON, XML, and CSV files, email headers, log files, etc. Semi-structured data can be easier to store and analyze than unstructured data while still providing flexibility in data representation (a small JSON sketch follows this list).
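As an illustration, here is a minimal, hedged sketch (the review record and its fields are made up) of how a semi-structured JSON document carries its own keys, so nested and optional fields can vary from record to record:

```python
import json

# Hypothetical semi-structured review record: nested fields and optional keys
# are possible because the data carries its own structure instead of a rigid schema.
raw = '{"id": 42, "text": "Great product", "metadata": {"source": "web", "tags": ["electronics"]}}'

review = json.loads(raw)                        # parse the JSON string into a dict
rating = review.get("rating", "not provided")   # optional field: absent in this record
print(review["metadata"]["tags"])               # nested structure, no predefined schema needed
```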
Properties of data
- Volume: The amount of data being generated and stored at any given time.
- Velocity: The speed at which data is generated, collected and processed.
- Variety: The different types and formats of data being generated, such as structured, unstructured, and semi-structured data.
Data warehouse vs Data lake
A data warehouse is a centralized repository that stores structured data from various sources. It is designed for complex queries and analysis: data is cleaned, transformed, and loaded via the ETL (Extract, Transform, Load) process and typically organized in a star or snowflake schema. It is less agile due to the predefined schema and typically more expensive because of the optimizations for complex queries.
A data lake is a storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data. Data is stored in its raw form and can be processed and analyzed later using various tools and frameworks; here we use the ELT (Extract, Load, Transform) process. Data lakes are more flexible and scalable than data warehouses, but they may require more effort to manage and maintain. Examples of data lake storage solutions include Amazon S3, Azure Data Lake Storage, Google Cloud Storage, Hadoop Distributed File System (HDFS), and Apache Iceberg.
A data lakehouse is a modern data architecture that combines the best features of data lakes and data warehouses. It provides a unified platform for storing, processing, and analyzing both structured and unstructured data. Data lakehouses use a single storage layer that can handle all types of data, eliminating the need for separate data lakes and data warehouses. They also support ACID transactions, which ensure data consistency and reliability. Examples of data lakehouse solutions include AWS Lake Formation (with S3 and Redshift Spectrum), the Databricks Lakehouse Platform, Snowflake, and Apache Hudi.
Data Mesh
Data mesh is more about governance and organization. Individual teams own “data products” within a given domain, and these data products serve various “use cases” around the organization. This is called “domain-based data management”. It relies on federated governance with central standards and self-serve data infrastructure.
ETL Pipelines
ETL stands for Extract, Transform, Load. It’s a process used to move data from source systems into a data warehouse.
- Extract: The first step is to extract data from various source systems, such as databases, APIs, or flat files. This involves connecting to the data sources and retrieving the relevant data.
- Transform: The extracted data is then transformed to fit the target schema of the data warehouse. This may involve cleaning the data, performing calculations, handling missing values, encoding or decoding data (e.g., one-hot encoding), and applying business rules to ensure that the data is in a usable format.
- Load: Move the transformed data into the target data warehouse or another data repository. ETL pipelines must be managed and automated in some reliable way. AWS Glue is a fully managed ETL service that makes it easy to move data between data stores. There are also orchestration services: Amazon EventBridge, Amazon Managed Workflows for Apache Airflow, AWS Step Functions, AWS Lambda, Glue Workflows, etc. A minimal pipeline sketch follows this list.
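As a rough illustration only (not a Glue job; the file, column, and bucket names are made up), the three ETL steps could look like this with pandas and boto3:

```python
import boto3
import pandas as pd

# Extract: read raw records from a source system export (file name is illustrative).
df = pd.read_csv("orders_export.csv")

# Transform: clean the data and reshape it to fit the target schema.
df = df.dropna(subset=["order_id", "amount"])         # handle missing values
df["amount"] = df["amount"].astype(float)             # enforce types
df["order_date"] = pd.to_datetime(df["order_date"])   # normalize dates
daily = df.groupby(df["order_date"].dt.date)["amount"].sum().reset_index()

# Load: write the result as Parquet and stage it in S3 for the warehouse to ingest.
daily.to_parquet("daily_sales.parquet", index=False)  # requires pyarrow or fastparquet
s3 = boto3.client("s3")
s3.upload_file("daily_sales.parquet", "my-staging-bucket", "curated/daily_sales.parquet")
```

In a production pipeline this logic would typically live in a Glue job or an Airflow/Step Functions task rather than a standalone script.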
Data Sources
- JDBC: Java Database Connectivity (JDBC) is a standard API for connecting to relational databases. It is platform-independent but language-dependent (it is a Java API).
- ODBC: Open Database Connectivity (ODBC) is a standard API for connecting to databases. It is platform-dependent (it needs drivers) and language-independent, allowing applications to access data from various database management systems (DBMS) using a common interface (a brief connection sketch follows this list).
- Raw Logs
- APIs
- Streams
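To make the ODBC point concrete, here is a minimal, hedged sketch using the pyodbc library (the driver name, host, and credentials are placeholders and depend on what is installed locally):

```python
import pyodbc  # thin Python wrapper around ODBC; a database-specific driver must be installed

# Connection string values are placeholders; the same application code works against
# different DBMSs once the matching ODBC driver is configured.
conn = pyodbc.connect(
    "DRIVER={PostgreSQL Unicode};"
    "SERVER=db.example.internal;DATABASE=sales;UID=report_user;PWD=secret"
)

cursor = conn.cursor()
cursor.execute("SELECT order_id, amount FROM orders WHERE amount > ?", 100)
for order_id, amount in cursor.fetchall():
    print(order_id, amount)

conn.close()
```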
Different data formats:
- CSV: Comma-Separated Values (CSV) is a simple file format used to store tabular data, where each line represents a record and each field is separated by a comma. For small to medium-sized datasets, CSV files are easy to create and read. However, they can become inefficient for larger datasets due to their lack of compression and support for complex data types. It is also used for importing and exporting data between different applications and databases.
- JSON: Lightweight, text-based, human-readable data interchange format that represents structured or semi-structured data based on key-value pairs.
- Avro: Binary format that stores both the data and its schema, allowing it to be processed later by different systems without the original system’s context.
- Parquet: Columnar storage format optimized for analytics that allows for efficient compression and encoding schemes. It is used for analyzing large datasets with analytics engines, for use cases where reading specific columns instead of entire records is beneficial, and for storing data on distributed systems where I/O operations and storage need optimization (a short read/write sketch follows this list).
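A minimal sketch (columns and values are made up) showing why the columnar layout matters: write a small table to Parquet, then read back only the column the analysis needs:

```python
import pandas as pd  # pandas uses the pyarrow (or fastparquet) engine for Parquet

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["alice", "bob", "carol"],
    "amount": [120.5, 80.0, 42.25],
})

# Write the table in columnar Parquet format (compressed by default).
df.to_parquet("orders.parquet", index=False)

# Column pruning: read only the needed column instead of entire records.
amounts = pd.read_parquet("orders.parquet", columns=["amount"])
print(amounts.head())
```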
Amazon S3
S3 buckets must have a globally unique name (across all regions and all accounts). Buckets are defined at the region level. A bucket is used for storing objects (files).
Each object has a key. The key is the full path:
s3://my-bucket/path/to/my/file.txt
The key is the prefix plus the object name, and there is no folder structure: S3 is a flat storage system. We can use prefixes to organize objects in a way that resembles a folder structure, but it is not a true hierarchical file system.
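A hedged boto3 sketch (bucket and key names are illustrative) of how “folders” are really just key prefixes:

```python
import boto3

s3 = boto3.client("s3")

# The "folders" are just part of the key string; no directory objects are created.
s3.put_object(Bucket="my-bucket", Key="path/to/my/file.txt", Body=b"hello")

# Listing by prefix is what makes flat keys look like a folder hierarchy.
response = s3.list_objects_v2(Bucket="my-bucket", Prefix="path/to/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```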
The maximum size of an object in S3 is 5 TB. If uploading more than 5 GB, we need to use the multipart upload API.
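In practice the SDK can handle the multipart upload API automatically; a hedged sketch (file name, bucket, and threshold are illustrative):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# upload_file switches to the multipart upload API once the file exceeds the threshold.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # start multipart above ~100 MB (illustrative value)
    multipart_chunksize=100 * 1024 * 1024,  # size of each uploaded part
)
s3.upload_file("big_dataset.parquet", "my-bucket", "raw/big_dataset.parquet", Config=config)
```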
Metadata is a list of text key/value pairs (system metadata or user-defined metadata). There are also tags and a version ID.
IAM policies define which API calls should be allowed for a specific user from IAM. S3 bucket policies are JSON-based policies that define permissions for the entire bucket or specific objects within the bucket (a small example follows).
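As a hedged sketch (the bucket name is illustrative, and account-level Block Public Access settings may reject this call), a bucket policy is just a JSON document attached to the bucket:

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical policy allowing public read of every object in "my-bucket".
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicRead",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
    }],
}

s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```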
Versioning
Versioning is a feature in Amazon S3 that allows you to keep multiple versions of an object in the same bucket. When versioning is enabled, each time you upload a new version of an object, S3 assigns it a unique version ID. This allows you to retrieve, restore, or permanently delete specific versions of an object as needed. Versioning provides protection against accidental deletion or overwriting of objects and enables you to maintain a history of changes to your data over time.
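A hedged boto3 sketch of turning versioning on and inspecting versions (bucket and key names are illustrative):

```python
import boto3

s3 = boto3.client("s3")

# Enable versioning on the bucket (it can later be suspended, but not removed).
s3.put_bucket_versioning(
    Bucket="my-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Each overwrite of the same key now creates a new version instead of replacing the object.
versions = s3.list_object_versions(Bucket="my-bucket", Prefix="path/to/my/file.txt")
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"], v["IsLatest"])
```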
Replication
There are two types of replication in S3:
- Cross-Region Replication (CRR): This allows you to automatically replicate objects from one S3 bucket to another bucket in a different AWS region.
- Same-Region Replication (SRR): This allows you to automatically replicate objects from one S3 bucket to another bucket within the same AWS region.
When replication is turned on, only new objects are replicated. We can also replicate existing objects by using the S3 Batch Operations feature. For delete operations, replication can copy delete markers from the source to the target bucket, but deletions with a specific version ID are not replicated. There is no “chaining”: if replication is enabled from bucket A to bucket B and from bucket B to bucket C, objects written to A are not replicated to C. A hedged configuration sketch follows.
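A hedged boto3 sketch of enabling replication (bucket names, account ID, and IAM role ARN are placeholders; versioning must already be enabled on both buckets, and the role must allow S3 to read from the source and write to the destination):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [{
            "ID": "replicate-everything",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},                                    # empty filter = replicate all objects
            "DeleteMarkerReplication": {"Status": "Enabled"},
            "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"},
        }],
    },
)
```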
Kinesis Data Streams
Amazon Kinesis Data Streams is a fully managed service that allows you to collect, process, and analyze real-time streaming data at scale. The collected data can then be passed to other AWS services, such as Lambda, for further processing and analysis.
Retention is configurable up to 365 days, and data can be replayed by consumers during that window. Data cannot be deleted from Kinesis manually (it expires based on the retention period), and each record can be up to 1 MB in size. Each stream is made up of one or more shards, which are the base throughput units of the stream: each shard supports 1 MB/s in and 2 MB/s out. The Kinesis Producer Library (KPL) is a client library that simplifies producing data to Kinesis Data Streams, and the Kinesis Client Library (KCL) simplifies consuming data from them. A hedged producer/consumer sketch follows.
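A minimal boto3 sketch (stream name, shard ID, partition key, and payload are illustrative; a real consumer would use the KCL or a Lambda event source mapping rather than polling by hand):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Producer side: records with the same partition key always land on the same shard.
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user": "u-123", "action": "page_view"}).encode("utf-8"),
    PartitionKey="u-123",
)

# Consumer side (simplified single-shard read).
it = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",   # start from the oldest record still retained
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=it, Limit=10)
for record in records["Records"]:
    print(record["SequenceNumber"], record["Data"])
```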