What is the Medallion Architecture?

I have recently been documenting my exploration into the new data space and it's time to cover another important concept: The Medallion Architecture.
I am familiar with design patterns as a concept due to my background in C#, and we can think of the Medallion Architecture in the same way: as a data design pattern. It is intended to support and ensure good data practices, and to introduce consistency and durability into systems.
The Medallion Architecture consists of three data tiers:
- A Bronze "raw" tier
- A Silver "clean" tier
- A Gold "projected" tier
Data moves through these three tiers - from bronze, to silver, to gold - and the structure of the data is refined at each stage. By the time we reach the gold tier, the data should have been refined and restructured into a useful data product. To learn more about data products, and how to reshape your thinking around data and its value, check out Barry's blog!
What is the Bronze Tier?
Generally, operational data is ingested directly into bronze, with little-to-no cleaning or validation. The preservation of data in this state is important as it provides a historical archive of data. No schema is needed at this point as the data is just stored "as is". The bronze layer will usually be in cold (slow access) storage as it likely won't need to be accessed after the initial load and processing.
The ingestion of data directly into bronze follows the "ELT not ETL" principle, where data is extracted, loaded, and then transformed, rather than transformed before being loaded. Usually the Bronze layer will be organised by source and ingestion time, giving you a record of when the data arrived, and where it came from. The advantage of this pattern is that it provides a historical audit of exactly what was ingested. You have each version of a file saved, allowing for complete playback and reprocessing of data from first principles, if necessary.
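To make that concrete, here is a minimal sketch of what landing a file into bronze might look like. The source name, paths, and the `land_to_bronze` helper are illustrative assumptions; in practice you would usually be writing to a data lake (e.g. ADLS or S3) rather than the local filesystem.

```python
# A minimal sketch of landing a raw file into a bronze layer organised by
# source and ingestion time. Paths are illustrative; no cleaning or validation
# happens here - the file is stored "as is".
import shutil
from datetime import datetime, timezone
from pathlib import Path

def land_to_bronze(local_file: str, source_system: str, bronze_root: str = "lake/bronze") -> Path:
    """Copy a raw file, untouched, into bronze/<source>/<ingestion date>/."""
    ingested_at = datetime.now(timezone.utc)
    target_dir = Path(bronze_root) / source_system / ingested_at.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / Path(local_file).name
    shutil.copy2(local_file, target_path)
    return target_path

# e.g. land_to_bronze("orders_2024-11.csv", source_system="erp")
# -> lake/bronze/erp/<year>/<month>/<day>/orders_2024-11.csv
```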
What is the Silver Tier?
Once the raw data has been ingested to bronze, it is then processed into the silver layer.
During this processing, the data is transformed from whatever format it arrived in (be that JSON, XML, CSV, Excel spreadsheets, etc.) into a tabular format - usually Parquet or Delta. For example, you may receive complex AIS binary data that you need to transform into a readable format, or nested JSON that needs to be flattened into a tabular structure.
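As a rough illustration (with a made-up nested structure and paths), flattening nested JSON into Parquet with pandas might look something like this:

```python
# A sketch of flattening nested JSON into a tabular format with pandas.
# The input structure and column names are made up for illustration.
import pandas as pd

raw_records = [
    {"order_id": 1, "customer": {"name": "Ada", "city": "London"}, "total": 42.0},
    {"order_id": 2, "customer": {"name": "Grace", "city": "Leeds"}, "total": 17.5},
]

# json_normalize flattens the nested "customer" object into customer_name / customer_city columns
flat_orders = pd.json_normalize(raw_records, sep="_")

flat_orders.to_parquet("lake/silver/sales/orders.parquet", index=False)
```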
Alongside this transformation, the data is validated. This usually consists of the application of a schema (see Liam's blog on data validation with Pandera for a more detailed explanation of this step). This ensures that the data has the expected shape (all the expected columns are present), and that the data itself follows certain rules. For example, you may have columns that should not contain null values, and you can ensure that here.
You may also want to do additional validation - checking that you have the expected number of rows, or that known values (e.g. the total profit for the year) are correct.
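A sketch of what this validation might look like with Pandera - the column names, rules, and the `flat_orders` input are assumptions made up for illustration:

```python
# A rough sketch of schema validation with Pandera: column-level rules plus a
# DataFrame-wide check. Column names and rules are illustrative assumptions.
import pandera as pa

orders_schema = pa.DataFrameSchema(
    columns={
        "order_id": pa.Column(int, unique=True, nullable=False),
        "customer_name": pa.Column(str, nullable=False),  # must not contain nulls
        "total": pa.Column(float, pa.Check.ge(0)),         # no negative order totals
    },
    # an additional DataFrame-wide check, e.g. "we expect at least one row per load"
    checks=pa.Check(lambda df: len(df) > 0, error="empty load"),
    strict=True,  # fail if unexpected columns appear
)

# assuming flat_orders is the tabular output of the previous transformation step;
# raises a SchemaError if any rule is violated
validated_orders = orders_schema.validate(flat_orders)
```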
You also might do some transformation at this stage - you may want to unpivot data that has been entered in a pivoted state (which is often the case as it's much more human-readable), to make it easier to create reports or perform analysis over it.
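For example, unpivoting a "wide", human-entered table into a "long" one with pandas might look like this (the column names are made up):

```python
# A small sketch of unpivoting: turning human-readable "wide" data (one column
# per month) into a "long" format that is easier to analyse and report over.
import pandas as pd

wide = pd.DataFrame({
    "region": ["North", "South"],
    "Jan": [100, 80],
    "Feb": [120, 95],
})

long = wide.melt(id_vars="region", var_name="month", value_name="sales")
#   region month  sales
# 0  North   Jan    100
# 1  South   Jan     80
# 2  North   Feb    120
# 3  South   Feb     95
```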
However, the output should mimic the initial overall structure of the data - separate files should only be combined in your output tables if they are logical continuations of the same table.
For example, if you had two input files:
- People - an XML file containing people's names, ages, and gender
- Contact Info - a CSV file containing those people's email addresses, cities, and phone numbers
Then these should be individual tables in your silver layer, even if you end up combining them further down the line.
However if you, for example, had two files:
- Orders for November
- Orders for December
Then these may be combined in the silver layer as they are a logical continuation of the same data.
The Silver layer is also usually where you deal with upserts, or modifications. For example, if you receive an update to the orders from November, you would process it in the Silver layer in order to create a complete "current view" of the world. The exception to this is when there is a requirement to track changes over time as part of your reporting or analytics use case!
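As a hedged sketch, an upsert like this could be handled with a Delta Lake merge (using delta-spark). The table path, join key, and the pre-existing `spark` session and `november_update_df` DataFrame are all assumptions for illustration:

```python
# A sketch of handling upserts with a Delta Lake MERGE via delta-spark.
# Assumes an existing SparkSession (spark) and a DataFrame of corrections
# (november_update_df); names and paths are illustrative.
from delta.tables import DeltaTable

orders = DeltaTable.forPath(spark, "lake/silver/sales/orders")

(
    orders.alias("current")
    .merge(
        november_update_df.alias("update"),    # the newly arrived corrections
        "current.order_id = update.order_id",  # match on the business key
    )
    .whenMatchedUpdateAll()      # modify orders we already hold
    .whenNotMatchedInsertAll()   # insert genuinely new orders
    .execute()
)
```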
After this processing, you should be left with an unopinionated (i.e. not shaped for a specific report or purpose), but clean and validated, copy of your data. And, this should be in a form that's easily consumable downstream (i.e. a tabular format). The data will usually be organised by domain (what it will be used for), and the entity from which it originated (i.e. the original files).
It is often stated that data quality improves as it moves through the tiers; however, I think this is an incorrect statement. Data in the silver layer will often be used for feeding ML or statistical models, running simulations, or performing data science experiments and ad-hoc analytics. This means that, at this point, the data should already be high quality and reliable, else you risk unpredictable, and sometimes dangerously incorrect, results.
What is the Gold Tier?
Once you have your cleaned and validated tier, it's time to create your gold layer. The gold layer serves the data in the format that is most useful for a specific use case - often this will be BI reporting, but could also be for serving apps, or to drive analytics. Transformations that you often see in this projection stage include:
- Unpivoting data to better serve Power BI reports
- Aggregating data (e.g. if you only need month-granularity reporting)
- Creating a star / snowflake schema (splitting the data into fact and dimension tables)
- Combining related data from different sources
You "project" your unopinionated tables into the format that best fits your needs. This might mean joining related tables (like all the person-specific info mentioned above), encoding the data, or creating entirely new reference datasets (like a date table that has additional information such as day of the week, etc.).
You might have multiple projections of the same silver data in the gold layer, with each one forming the basis of a semantic model for a different report. For example, you could have one report that requires all of the personally identifiable information from the "person" table, and one for which it's not necessary. In this case you could create one projection that includes this information and one that doesn't, which allows for more partitioned and fine-grained security, with each report only accessing the information it needs.
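A minimal sketch of that idea, assuming made-up column names for the PII:

```python
# A sketch of creating two projections of the same silver table: one with PII
# for a report that needs it, and one without. Names and paths are illustrative.
import pandas as pd

person = pd.read_parquet("lake/silver/people/person.parquet")

pii_columns = ["name", "email", "phone_number"]

# Projection for the report that legitimately needs personal details
person.to_parquet("lake/gold/hr_reporting/person_full.parquet", index=False)

# Projection for everything else - the PII columns are simply never exposed
person.drop(columns=pii_columns).to_parquet(
    "lake/gold/usage_reporting/person_anonymised.parquet", index=False
)
```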
At this point, the data should be organised by domain and use case.
Sometimes your gold layer may be exactly the same as your silver, but often there are adjustments that need to be made in order to better serve your reporting needs.
How does this relate to the Semantic Layer?
The gold layer forms the basis of the semantic layer. But the semantic layer adds additional context and meaning to the tables in the Gold layer, which enables humans (and, increasingly, machines and AI) to understand and engage with the data.
This will include:
- Human- and machine-friendly names, descriptions, and synonyms for objects such as tables and columns.
- Assigning more advanced data types over the primitive types that are used in the Gold layer - e.g. percentage, latitude, currency, city.
- Adding metadata to enable effective governance of the Gold layer, such as data classification tags.
- Defining relationships between objects.
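To illustrate the kind of context this adds, here is a sketch of semantic metadata for one of the gold tables, written out as a plain Python structure. The names, tags, and relationships are made up; in practice this would live in your semantic tooling rather than in code like this.

```python
# An illustrative sketch of the context a semantic layer adds on top of a gold table.
monthly_sales_semantics = {
    "table": "monthly_sales",
    "friendly_name": "Monthly Sales",
    "description": "Total sales value per region per calendar month.",
    "classification": "Internal",  # governance metadata, e.g. a data classification tag
    "columns": {
        "region": {"friendly_name": "Sales Region", "synonyms": ["area", "territory"]},
        "order_month": {"friendly_name": "Month", "semantic_type": "date"},
        "total": {"friendly_name": "Total Sales", "semantic_type": "currency"},
    },
    "relationships": [
        {"from": "monthly_sales.order_month", "to": "dim_date.date", "cardinality": "many-to-one"},
    ],
}
```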
Often the idea of the Gold layer is conflated with the semantic layer but, without defining your objects' meaning and relationships, you can't use the data in any meaningful way.
The semantic layer usually sits outside of the lakehouse, its exact form dependent on your use case. A good example of this is Power BI. When creating a new report you:
- Import the data from the Gold layer; this forms part of the model
- Update column names and types to make the report more readable and enrich the data
- Use the "modelling" tab to define the relationships between your tables and objects
And, only at this point is your semantic model fully defined.
Power BI is a useful example because the semantic model is an in-built concept, but your semantic layer could also be defined using other tools such as Microsoft Purview, or Databricks Unity Catalog. Whatever your use case - be it reporting, analytics, or application development - you will likely need to define a semantic layer in order to describe your data, and give it meaning.
Each output from the semantic layer is a "Data Product" - it is a valuable asset that should be maintained, versioned, and treated as a fully contained product. You may have multiple data products or versions of each, which are consumed by different use cases, or used for different types of analytics.
So, overall, we have data flowing from a raw Bronze tier, through a cleaned and validated Silver tier, to a projected Gold tier, with a semantic layer on top that gives the data meaning and turns it into data products.
When is data "production ready"?
In software development, we often talk about different environments - at endjin we often use a three-tiered approach: development, testing, and production. In general:
- The Development environment is where engineers are currently working, and is therefore the most volatile. New code will be deployed here, and bugs will often be found during this first stage. Things will be changing rapidly if the project is actively being worked on. Nothing production-focused or client-facing should ever depend on the development environment, as it is purely a place for developers to make changes, trial solutions, and update things.
- Once developers are happy with the changes made (and hopefully those changes have passed some kind of quality gate), they will be deployed to the Testing environment. Here the code will undergo more rigorous testing, in a more controlled environment. There may be additional tests at this stage - including integration tests, non-functional tests, and anything else that doesn't generally fit into the quick feedback loop that is necessary during development. Hopefully, at this point, any bugs that slipped through during the development stage will be found.
- Finally, once all the tests have passed, and all validation carried out, the code will be deployed into Production. This is the "live" code that your wider solution depends on. If, for example, you are hosting a client-facing web application, this code will drive your public app - and you would therefore hope that it is reliable and bug-free!
This is a generalised pattern that is applicable in many scenarios. However the number of environments and their purposes can vary on a case-by-case basis. For example, you may require a "preprod" environment for additional validation prior to production deployment, or a specialised QA environment in order to meet regulatory requirements.
So, how does this apply to our data design pattern?
It is slightly confusing, in that we now have three tiers in both of these separate, but related, dimensions:
In the medallion architecture data moves from raw, to cleaned, to projected.
It is clear that we don't want any production systems relying on our Bronze (raw) data, as any new data that arrives needs to be cleaned and validated before it can be used.
But, alongside this, we still want:
- An environment in which engineers can experiment and change things around the data (including in the Silver and Gold tiers) without this impacting anything public facing (Development)
- An environment in which any changes can be validated (Testing)
- And, the environment in which we can be as sure as possible that the data is reliable, and can be depended upon (Production)
As such, we can design a system as follows:
In each of our environments we have a Bronze/Silver/Gold tier.
As is the case for software development, anything that is end user facing should only be relying on the Production environment.
The Development environment is where engineers are currently working, and is therefore volatile. These engineers will need to update things relating to the Silver and Gold tiers, and need to do so without worrying about affecting production systems. There could be necessary schema changes, column renames, bugs introduced into calculated columns, tables accidentally deleted, and much more.
Note: Though the environment is functionally volatile, the data itself may be more tightly controlled. Data needs to be consistent in order to allow for development and testing, test data may be created specifically to hit different edge cases, and synthetic data may be used so that access to production data can be restricted.
Any changes made are then validated in the Testing environment. And, once those quality gates have been passed, they are deployed into Production.
Within the Production environment, nothing should be depending on data in the Bronze tier, as this is data in its raw, unvalidated form. But once we have cleaned and validated the data, both the Silver and Gold tiers should be "production ready". At this point the data is cleaned, validated, and not subject to unpredictable change as it is in the Development, and to some extent Testing, environments.
As I mentioned earlier, it is often stated that data quality improves as you progress through the Medallion Architecture; however, this statement is flawed. The data in the Silver tier is no less "production ready" than that in the Gold tier; it is just an unopinionated representation of your data. It may be used to feed ML models and data science experiments, and the results of those may well be client-facing. Therefore data in your Silver tier, in the production environment, is very much production data and should be viewed as such.
Advantages of the Medallion Architecture
Overall the opinionation of our data increases as it moves through the medallion architecture. We start off with data in its completely unaltered, raw, form, and we end with data that is first cleaned, and then structured for a specific use case.
Using the medallion architecture, we can:
- Land and then work with data in any format whether it is structured, semi-structured or unstructured.
- Impose gates that limit data quality issues
- Support multiple use cases
- Support different workload types (reporting, ML, data science)
- Support the recreation of tables at any time
- Support auditing with data versioning
- Allow for historical playback as you have each version of the raw data saved
- Allow for greater agility, where e.g. customer change requests can be dealt with by just updating the Gold projection
- Define data lineage (how data moves from source, through processing, to consumption)
- Allow for more flexibility in security - with certain groups being given access to different projections with different levels of sensitivity
An important point is that there are no hard and fast rules for implementing this pattern. Though the Medallion Architecture provides us with useful guidance on how to structure our data solutions, there are a lot of nuanced decisions that need to be made:
- Exactly where to draw lines between the layers
- How much validation is done at each stage
- How much processing is done in the gold storage layer vs. how much is done in Power BI
- etc.
And each of these requires balancing performance, flexibility, data copying and storage costs, data volumes, historical data support, security, regulatory requirements, and much more.
In Conclusion
Overall the combination of this data design pattern, and a multi-environment system in which data reliability increases as it is promoted through the environments, provides a reliable and flexible architecture that can support many different scenarios.
If you want more information about this data design pattern, check out Ed Freeman's video on how to implement the Medallion Architecture in Microsoft Fabric, and Barry's worked example of using the Kaggle Titanic data set for predictive analysis!