In this post, we will explore the system design of a financial data ingestion system. Such a system is crucial for collecting, processing, and storing financial data from various sources. Production-grade distributed systems for financial data ingestion can be very complicated, but we will focus on a simplified version that can serve as a starting point for building such systems.
System Design Overview
For financial data ingestion, we generally need to consider two types of data: metadata and time series data.
Metadata
This includes information about financial instruments, such as stock tickers, company names, sector classifications, and other attributes that help identify and categorize the data. These tables are relatively static and can be updated periodically. They also contain far less data than the time series data, so they are easier to handle. We can usually just store metadata as tables in a database, but the schema should be well normalized to avoid data redundancy and ensure data integrity.
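As a minimal sketch of what such a normalized metadata schema might look like (the table and column names are illustrative assumptions, using SQLite only for brevity), sectors get their own table so that a sector rename touches exactly one row:

```python
import sqlite3

# Minimal sketch of a normalized metadata schema; names are illustrative.
conn = sqlite3.connect("metadata.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS sector (
    sector_id   INTEGER PRIMARY KEY,
    sector_name TEXT NOT NULL UNIQUE
);

CREATE TABLE IF NOT EXISTS instrument (
    instrument_id INTEGER PRIMARY KEY,
    ticker        TEXT NOT NULL UNIQUE,
    company_name  TEXT NOT NULL,
    sector_id     INTEGER NOT NULL REFERENCES sector(sector_id)
);
""")
conn.commit()
```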
Time series data
This includes stock prices, net asset values, and other financial metrics that change over time. This data is typically high volume and requires efficient storage and retrieval mechanisms. We believe it is a good idea to have one hot table for recent data, say the last three months, and one cold table for historical data. The hot table can be used for real-time analysis and the cold table for historical analysis. The data points in the time series tables usually have a foreign key to the metadata table, so that we can easily join the two tables to get the full information about a financial instrument.
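A possible shape for the hot and cold time series tables, again as a sketch with assumed names, each keyed back to the instrument metadata table; a periodic job would move rows older than roughly three months from the hot table to the cold one:

```python
import sqlite3

# Sketch of hot/cold time series tables; names and column choices are assumptions.
conn = sqlite3.connect("metadata.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS price_hot (
    instrument_id INTEGER NOT NULL REFERENCES instrument(instrument_id),
    ts            TEXT    NOT NULL,  -- observation timestamp, ISO 8601
    price         REAL    NOT NULL,
    PRIMARY KEY (instrument_id, ts)
);

CREATE TABLE IF NOT EXISTS price_cold (
    instrument_id INTEGER NOT NULL REFERENCES instrument(instrument_id),
    ts            TEXT    NOT NULL,
    price         REAL    NOT NULL,
    PRIMARY KEY (instrument_id, ts)
);
""")
conn.commit()
```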
It is also a good idea to partition or shard the time series tables based on the metadata, since time series with different metadata are normally independent; this makes it easy to query the data for a specific financial instrument. This can be done with a partition key or shard key derived from the metadata, as sketched below.
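One simple way to derive such a key, sketched under the assumption that the instrument id is the relevant piece of metadata and that we hash it into a fixed number of shards (the shard count and function name are illustrative):

```python
import hashlib

NUM_SHARDS = 16  # illustrative shard count

def shard_for(instrument_id: int) -> int:
    """Map an instrument to a shard by hashing its metadata key.

    Time series for different instruments are largely independent, so hashing
    the instrument_id keeps each instrument's history on a single shard and
    lets per-instrument queries touch only that shard.
    """
    digest = hashlib.sha256(str(instrument_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# Example: route a data point for instrument 42 to a shard-specific table,
# e.g. a table named f"price_hot_{shard_for(42)}".
print(shard_for(42))
```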
Data Ingestion Process
The data ingestion process often involves several steps, including data collection (ingesting into the normalize database), data transformation (ingesting into the standardize database), and data storage (feeding the data lake or a backup snapshot).
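Very roughly, these three stages can be pictured as a chain of steps; the function names and record shape below are assumptions for illustration, not a prescribed API:

```python
from typing import Iterable

def collect(source: Iterable[dict]) -> list[dict]:
    """Data collection: pull raw records from a source into the normalize database."""
    return [record for record in source]

def transform(raw_records: list[dict]) -> list[dict]:
    """Data transformation: clean and standardize records for the standardize database."""
    return [{**r, "price": float(r["price"])} for r in raw_records]

def store(standardized: list[dict]) -> None:
    """Data storage: append to the data lake or write a backup snapshot."""
    print(f"writing {len(standardized)} records to the data lake")

# The stages chained together on a toy record.
store(transform(collect([{"ticker": "ABC", "price": "10.5"}])))
```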
There are often multiple pipelines involved in data ingestion, such as:
- Real-time data ingestion: this pipeline collects and processes real-time data from various sources. It runs at second or minute granularity and updates data from the normalize database to the standardize database. Only inserted, deleted, and updated rows are processed in this pipeline and synchronized to the standardize database (see the change-capture sketch after this list).
- Incremental data ingestion: this pipeline collects and processes incremental data from various sources. It runs at hour or day granularity and moves data from the standardize database to the data lake or a backup snapshot. Only inserted, deleted, and updated rows are snapshotted, much like squashing commits in git; this small package of changes is then used to synchronize the normalize and standardize databases, and then fed to the data lake or backup snapshot (see the delta-snapshot sketch after this list).
- Full data ingestion: this pipeline refreshes the whole data set, normally during the weekend. It synchronizes the normalize database to the standardize database based on the latest daily snapshot, and is used to ensure that the data is up-to-date and consistent across all systems. While the full ingestion is running, the real-time and incremental pipelines are paused to avoid data inconsistency. The real-time pipeline is resumed after the full ingestion completes and catches up from the time of the last full run to the current time, so that the data stays up-to-date and consistent across all systems (see the weekend refresh sketch after this list).
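A change-capture sketch for the real-time pipeline. It assumes the normalize database keeps a hypothetical append-only `change_log` table and that the standardize database has the `price_hot` table from earlier; both are assumptions for illustration:

```python
import sqlite3

def sync_changes(normalize: sqlite3.Connection,
                 standardize: sqlite3.Connection,
                 last_change_id: int) -> int:
    """Push new inserts/updates/deletes from normalize to standardize.

    Reads changes recorded after last_change_id and returns the new high-water
    mark, so the caller can poll this function every few seconds or minutes.
    """
    rows = normalize.execute(
        "SELECT change_id, op, instrument_id, ts, price "
        "FROM change_log WHERE change_id > ? ORDER BY change_id",
        (last_change_id,),
    ).fetchall()
    for change_id, op, instrument_id, ts, price in rows:
        if op in ("insert", "update"):
            standardize.execute(
                "INSERT OR REPLACE INTO price_hot (instrument_id, ts, price) "
                "VALUES (?, ?, ?)",
                (instrument_id, ts, price),
            )
        elif op == "delete":
            standardize.execute(
                "DELETE FROM price_hot WHERE instrument_id = ? AND ts = ?",
                (instrument_id, ts),
            )
        last_change_id = change_id
    standardize.commit()
    return last_change_id
```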
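A delta-snapshot sketch for the incremental pipeline: collapse one day's changes into a single small package, in the spirit of squashing commits in git, and hand that package to the data lake or backup snapshot. The `change_log` table and the JSON file layout are assumed for illustration:

```python
import json
import sqlite3
from datetime import date

def write_delta_snapshot(standardize: sqlite3.Connection, day: date) -> str:
    """Write one day's squashed changes to a small JSON package."""
    rows = standardize.execute(
        "SELECT op, instrument_id, ts, price FROM change_log WHERE date(ts) = ?",
        (day.isoformat(),),
    ).fetchall()
    # Keep only the last change per (instrument_id, ts): the "squashed" view.
    squashed = {}
    for op, instrument_id, ts, price in rows:
        squashed[(instrument_id, ts)] = {
            "op": op, "instrument_id": instrument_id, "ts": ts, "price": price,
        }
    path = f"delta_{day.isoformat()}.json"
    with open(path, "w") as f:
        json.dump(list(squashed.values()), f)
    return path
```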
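A scheduling sketch for the weekend full refresh, showing the ordering described above: pause the other pipelines, rebuild the standardize database from the latest daily snapshot, resume, then catch up from the last full run. The flags and the callables passed in are hypothetical placeholders:

```python
import threading

# Hypothetical pause flags that the real-time and incremental loops would check.
realtime_paused = threading.Event()
incremental_paused = threading.Event()

def full_refresh(rebuild_from_snapshot, catch_up_realtime, last_full_time) -> None:
    """Weekend job: pause, rebuild from the latest daily snapshot, resume, catch up."""
    realtime_paused.set()           # pause real-time ingestion
    incremental_paused.set()        # pause incremental ingestion
    try:
        rebuild_from_snapshot()     # normalize -> standardize, from the latest daily snapshot
    finally:
        incremental_paused.clear()  # resume pipelines even if the rebuild fails
        realtime_paused.clear()
    # Re-apply every change made between the last full refresh and now.
    catch_up_realtime(since=last_full_time)
```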