Troublesome Tenants

A look at Wiz's Cloud Isolation Framework (PEACH) - Cloud Multi-Tenancy Security


One of the core principles of cloud computing is multi-tenancy. Documented in one of the most cited references on cloud computing, NIST’s Definition of Cloud Computing, Resource Pooling is listed as one of the essential characteristics of cloud computing. In short, NIST states that resource pooling is when the Cloud Service Provider (CSP) pools resources to serve multiple customers using a multi-tenant model.

This multi-tenant model provides numerous benefits, such as efficiency, economies of scale and resource abstraction to consumers. However, it also comes with additional security concerns and risks, particularly when security incidents and malicious behavior are able to expand beyond a single tenant and impact neighbors, whether physically, logically or metaphorically as organizations.

CSPs and Cloud Consumers alike can take steps to mitigate the risk of multi-tenancy. One excellent resource that was recently released by Cloud Security Leader Wiz is what they have dubbed the “PEACH” framework, which aims to help consumers mitigate the risk of isolation escape. We will be taking a look at the framework in this article.

Cross-tenant vulnerabilities on the CSP’s side can be particularly concerning because they are outside of the consumer’s control, and the CSP must address the issue to avoid impacting a potentially enormous number of customers. Cloud computing’s success has an underlying dependency on cloud consumers being able to trust the CSP’s measures to prevent multi-tenant security issues on their side of the Shared Responsibility Model. If you aren’t familiar with the Shared Responsibility Model, please check out my article on CSO Online here.

A couple of recent examples from December 2022 include a vulnerability in Amazon Web Services (AWS) Elastic Container Registry (ECR) service, which was documented by cloud security vendor Lightspin, in this excellent article.

Another example is a vulnerability in Azure’s Cognitive Search service which revolved around a cross-tenant network bypass vulnerability, and was summarized by Mnemonic in this article.

Christophe Parisel did a great job documenting both vulnerabilities in two of the largest CSPs in this post, including their Cloud Vuln DB scores and citing the articles describing the vulnerabilities that I linked to above.

While the above examples are tied to the CSP, multi-tenant security concerns exist among cloud consumers as well. For example, some cloud consumers may have specific regulatory or security requirements that drive them to use dedicated virtual instances, such as AWS’s Dedicated Instances, which ensure that the consumer’s virtual instances run on hardware that is dedicated to a single customer or tenant.

There are also, of course, examples where entire cloud environments and models are oriented around a specific customer, tenant or organization, such as AWS and Azure’s Government regions. One such example includes the U.S. Department of Defense (DoD), which has specific separation requirements captured in its Cloud Security Requirements Guide (SRG). They utilize specific Impact Levels (ILs) to discuss data and its associated sensitivity and segmentation requirements (along with other requirements such as personnel and their citizenship and security clearance).

Another example is when consumers increasingly adopt Platform-as-a-Service (PaaS) models in environments leveraging technologies such as Kubernetes and Containerization. Kubernetes provides robust documentation defining best practices for various tenancy models, including what it considers “soft” and “hard” multi-tenancy, each with different levels of logical segmentation. Examples of this are common among the DoD/Federal “Software Factory” ecosystem. Notable examples in the DoD include Army’s Software Factory, Air Force’s Platform One and also Kessel Run, among others.

This is a blossoming ecosystem of nearly 30 entities that include an environment, developers, users and management working together to create and deliver software for DoD/Federal missions and use cases. These environments often utilize PaaS implementations building on Cloud environments with Kubernetes and Containers.

While cross-tenant vulnerabilities in the commercial sector may involve impacting various businesses and data such as payment card or healthcare data, on the DoD and Federal Government front, cross-tenant vulnerabilities and incidents could impact critical citizen services and data or military missions and functions.

As organizations build out internal PaaS environments for their various development teams and business units, or SaaS providers utilize Kubernetes to host various SaaS consumers, these sorts of best practices can be implemented to ensure the risk between tenants is mitigated while still taking advantage of the value of container orchestration for workloads. This framework is also useful for modeling tenant isolation in IaaS models using their respective boundaries and mechanisms, which I will discuss below.

PEACH Framework

The PEACH Framework was created to address the tenant isolation problem in the industry. Wiz cites various examples, such as ChaosDB and ExtraReplica, where cross-tenant vulnerabilities allowed malicious actors to access data across customers and tenants that reside in cloud environments, in this case using vulnerabilities on the CSP’s side of the responsibility model. That said, the model is also applicable to and useful for internally managed PaaS implementations like the ones I’ve discussed above.

PEACH, as defined by the whitepaper, stands for hardening Privilege, Encryption, Authentication and Connectivity and ensuring Proper Hygiene - hence PEACH. The guidance states that service providers can use the framework to properly describe their tenant isolation measures to consumers, and that it can also be used as a resource for internal teams employing multi-tenancy in their deployments and architecture.

The PEACH Framework is broken into two sections, which include Modeling Tenant Isolation and Improving Tenant Isolation. We will take a look at each of them below.

The guidance points out that common root causes of notable cross-tenant escape vulnerabilities have largely revolved around bugs in customer-facing interfaces or insufficient hardening of security boundaries. One overarching goal of the guidance and framework is to lead to an industry standard approach when discussing tenant isolation and common lexicon for vendors providing assurance to consumers regarding isolation.

The PEACH Framework focuses on isolating complex customer interfaces and can be used as part of broader Threat Modeling processes for cloud environments.

Modeling Tenant Isolation

Part One of the PEACH Framework focuses on modeling tenant isolation. This includes key areas such as external interfaces, security boundaries, hardening factors, isolation review, and vendor transparency.

In this first step of modeling the tenant isolation they cover areas such as interface complexity, shared interfaces, existing security boundaries and the implementation of the 5 PEACH parameters (e.g. Privilege, Encryption, Authentication, Connectivity and Hygiene).

External Interfaces

In part one of modeling tenant isolation they look at external interfaces. They define these as the closest tenant-facing applications or processes that ingest user input. This can be understood as the various means by which a customer can change the state of a back-end surface, or in other words, the exposed attack surface.

These interfaces are ingesting user-controlled and untrusted data and need to be secured appropriately. As the framework points out, these interfaces need to be isolated among tenants to mitigate the potential for a security incident on one interface impacting multiple tenants.

The guidance provides examples of various interface types and their associated levels of complexity. It is often said complex systems fail in complex ways, and the PEACH framework seems to support this notion, because they mention that “complexity correlates positively with likelihood of vulnerability” and suggest stronger isolation based on the interface’s complexity.

Security Boundaries

Moving on to the second part of modeling tenant isolation, the framework looks at security boundaries. Security boundaries are far from a new security concept; however, their complexity and implementation have evolved with emerging technologies such as Cloud and Kubernetes.

The guidance recommends using well-known, mature mechanisms to facilitate security boundaries and tenant isolation. As the guidance points out, there are both security boundary types and then actual hardening of said types of boundaries. This essentially means even known valid types of security boundaries can be nullified or weakened by poor configurations.

The framework also points out that tenant isolation can be bolstered by combining multiple independent boundaries, a concept we all know in security as defense-in-depth. Taking this approach can help address the previous point: where one type of boundary is poorly configured, additional boundaries can mitigate the impact of an escape or lateral movement beyond that single boundary.
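To put rough numbers on the defense-in-depth idea, here is a minimal Python sketch. It is my own illustration, not something from the PEACH whitepaper, and it assumes the boundaries fail independently of one another, which real boundaries rarely do perfectly. The function name and the probabilities are hypothetical.

```python
from math import prod
from typing import Sequence


def combined_escape_probability(per_boundary: Sequence[float]) -> float:
    """Chance an attacker gets through every boundary in the stack.

    Assumption (not from the PEACH whitepaper): each boundary fails
    independently, so the individual probabilities simply multiply.
    """
    for probability in per_boundary:
        if not 0.0 <= probability <= 1.0:
            raise ValueError("each probability must be between 0 and 1")
    return prod(per_boundary)


# Illustrative, made-up numbers: a weakly configured container boundary
# backed by an independent network segmentation boundary.
single_boundary = combined_escape_probability([0.10])
layered_boundaries = combined_escape_probability([0.10, 0.05])
```

The takeaway is only directional: each additional independent boundary multiplies down the chance of a full escape, while boundaries that share a weakness (the same misconfiguration, the same kernel) give far less benefit than the arithmetic suggests.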

One aspect I found refreshing and rational among the guidance was that this section points out that low isolation levels may be warranted and valid in some environments, and may have operational and business benefits that outweigh security considerations.

This is a tough pill to swallow for many security professionals who can’t or won’t accept that the world doesn’t always revolve around security and that the business has something referred to as a risk tolerance, which means some risk will always be acceptable to the business or mission. The boundary methods and configurations chosen should align with the organization’s risk tolerance, data types involved, regulatory requirements and so on.

The guidance lays out what they refer to as Primary and Secondary boundaries. They state that primary boundaries may be used on their own, whereas secondary boundaries should be thought of as enhancements and additional measures that accompany one or more primary boundaries.

(Hey you, U.S. Federal Community) Also, a quick note before diving into the Primary and Secondary Boundaries as defined by the PEACH Framework. For systems working in U.S. Federal and DoD environments like the Software Factory/PaaS ecosystem I previously described as well as FedRAMP systems, the concept of an Authorization Boundary exists.

This is more of a construct oriented around Compliance and System Authorizations. How you architect and design your System Authorization Boundary has implications for Security Control Inheritance, Responsibilities between Control Providers and Control Consumers, Assessment and Authorization (A&A) and more.

For a deeper dive on the concept of System Authorization Boundaries in the U.S. Federal context, you can find more in NIST’s 800-137 or the FedRAMP Authorization Boundary Guidance. This is to point out that multi-tenancy models and implementations should be considered from both the security and compliance perspective depending on your industry.

Primary Boundaries

Among the primary boundaries laid out are Hardware Separation, Hardware Virtualization, Containerization and Data Segmentation.

Among these 4, hardware obviously has the most innate tenant isolation since it is physical and not a logical abstraction, unlike some of the other boundaries we will discuss.

As we’ve discussed earlier in the article, some of the more mature CSPs offer bare metal servers, dedicated instances and even dedicated cloud environments for specific communities of interest (e.g. Governments, Nations etc.).

Hardware Virtualization comes next and involves having virtual machines that run on shared hardware. Many of us know this paradigm from prior to the advent and acceleration of technologies such as Kubernetes and Containers.

In Cloud environments this will come down to understanding things such as IAM, VPCs, Subnets, NACLs and so on to secure traffic and communication among the various VMs and instances running different workloads. It will also involve remediating critical vulnerabilities that malicious actors look to exploit to move laterally once they are in an environment.

For an excellent discussion on lateral movement risks in the cloud, see this article from Wiz. It walks through the nuances of Cloud vs. On-Prem, network lateral movement, VPCs and more. In the context of IaaS, logical boundaries and abstractions such as Cloud Accounts, VPCs, Subnets and so on become increasingly important to mitigate the malicious actor’s ability to move laterally and have a larger blast radius across tenants or within an organization across workloads.

Containerization is next on the list and involves different tenants running in separate containers that run on shared hardware and utilize operating system (OS) level virtualization. The guidance points out that Containers are not considered especially effective security boundaries on their own.

This of course is backed by examples such as CVE-2022-0185, which could potentially allow container escape and is represented on the MITRE ATT&CK Containers Matrix in T1611 aka “Escape to Host”. Much like the article I cited in the paragraph above, here is the second part of a two-part article from Wiz, this one focusing on lateral risks from Kubernetes to the cloud. This can be thought of as a malicious actor increasingly escaping abstractions and boundaries to get closer to desired targets, impact more targets and have an increased impact on victims. Rather than an escalation of privilege per se, it becomes an escalation of impact and access.

One useful way to visualize and think of this paradigm is through the use of “The 4 C’s of Cloud Native Security” as defined in the Kubernetes documentation, showing the C’s of: Cloud (and even Data Centers), Clusters, Containers and Code.

As organizations increasingly implement internal Platform-as-a-Service (PaaS) environments utilizing Cloud, Kubernetes and Containers, it is key for organizations to understand the various models of Kubernetes Multi-Tenancy (as captured in the Kubernetes Docs) along with key security considerations and methods for hardening and securing containers. Another valuable resource on this front is the CNCF Cloud Native Security Whitepaper. 

In the PaaS construct, especially in Cloud-native environments utilizing Kubernetes and Containers, logical constructs such as Clusters, Namespaces, Network Policies and so on serve as important mechanisms to mitigate lateral movement from malicious actors. For an informative talk on this topic, you can check out this session titled “Multi-tenancy vs. Multi-cluster: When Should you Use What?”

Another key point I feel is worth adding is that depending on the Kubernetes implementation, the responsibility for the Kubernetes Control Plane will vary. If a tenant decides to “Roll Their Own” Kubernetes, they will be responsible for the Kubernetes Control Plane, whereas if a tenant is using a Managed Service such as AWS Elastic Kubernetes Service (EKS), then the CSP is responsible for the Kubernetes Control Plane.

It is recommended and often the case that organizations opt for the latter, due to the complexity and administrative overhead associated with managing the Kubernetes Control Plane at scale. Even in this scenario, organizations should be familiar with AWS EKS security guidance, EKS Best Practices and can even use tools such as “Harden EKS” to ensure their EKS implementations follow the previously mentioned best practices.

Wiz’s article on lateral movement from containers/clusters to the cloud that I cited above also provides a great graphic demonstrating how malicious actors may look to escalate access and impact pivoting from compromising containers and clusters to the broader cloud accounts and environments that they reside in, which of course has broad implications for tenants in said environments.

Lastly, the final primary security boundary listed is Data Segmentation. As the guidance points out, it isn’t uncommon to see shared storage components used with unique keys for encrypting and accessing the various tenants’ data. The guidance points out this is also a low level of isolation and requires mature and effective key management practices to mitigate the risk of inadvertent disclosure of multiple tenants’ data. Some good resources on this front include Cloud Security Alliance’s “Key Management in Cloud Services” whitepaper, along with CSP-specific knowledge, such as AWS’s Key Management Service and associated whitepaper(s), and understanding storage in Kubernetes environments.
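As a rough illustration of the "shared storage, unique key per tenant" pattern, here is a minimal Python sketch using only the standard library. It is an assumption-laden toy: the function name, the label prefix and the idea of deriving keys from a single master key with HMAC-SHA256 are mine, not the whitepaper's, and a production system would normally lean on a managed key service rather than hand-rolled derivation.

```python
import hashlib
import hmac
import secrets


def derive_tenant_key(master_key: bytes, tenant_id: str) -> bytes:
    """Derive a distinct 32-byte data key for one tenant.

    Hypothetical helper: HMAC-SHA256 keyed with a master key is used as
    a simple derivation function so that no two tenants share a key.
    """
    if not tenant_id:
        raise ValueError("tenant_id must not be empty")
    label = b"tenant-data-key:" + tenant_id.encode("utf-8")
    return hmac.new(master_key, label, hashlib.sha256).digest()


# The master key would live in a KMS or HSM; it is generated here only
# so the sketch runs on its own.
master_key = secrets.token_bytes(32)
key_a = derive_tenant_key(master_key, "tenant-a")
key_b = derive_tenant_key(master_key, "tenant-b")
```

The point the guidance makes still holds in the sketch: everything hinges on how well the master key is protected, which is why this boundary is considered a low level of isolation on its own.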

Malicious actors are primarily after data - you should prepare accordingly.

Secondary Boundaries

Among the secondary boundaries discussed are Network Segmentation and Identity Segmentation. Given our goal is about tenant isolation and mitigating impact of lateral movements from malicious actors, it should come as no surprise that segmentation at both the Network and Identity layer are cited here. They are also of course core focus areas for Zero Trust, showing up prominently in sources such as CISA’s Zero Trust Maturity Model.

For networking the guidance recommends segmenting tenants through methods such as encapsulation, encryption and firewalling when you have tenants on different virtual or physical machines on the same network.
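One small, checkable precondition for network segmentation is that tenant address ranges don't overlap in the first place. The sketch below is my own illustration using Python's standard ipaddress module; the function name and the tenant-to-CIDR mapping are hypothetical and it says nothing about the firewall or encryption layers themselves.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlapping_tenant_networks(
    tenant_cidrs: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return pairs of tenants whose assigned CIDR blocks overlap."""
    networks = {
        tenant: ipaddress.ip_network(cidr)
        for tenant, cidr in tenant_cidrs.items()
    }
    overlaps = []
    # Compare every pair of tenants once, in a stable (sorted) order.
    for first, second in combinations(sorted(networks), 2):
        if networks[first].overlaps(networks[second]):
            overlaps.append((first, second))
    return overlaps


# Made-up address plan: tenant-c was carved out of tenant-a's range.
address_plan = {
    "tenant-a": "10.0.0.0/24",
    "tenant-b": "10.0.1.0/24",
    "tenant-c": "10.0.0.128/25",
}
overlapping = find_overlapping_tenant_networks(address_plan)
```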

On the identity front, the guidance points to role-based access control for tenants coupled with least-permissive and deny-by-default policies to limit actions from one tenant impacting the resources of another. These of course are prominent in Zero Trust guidance as well.
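Here is a minimal Python sketch of what deny-by-default, role-based access between tenants can look like. It is an illustration under my own assumptions: the Grant type, the "<tenant>/<path>" resource naming convention and the role names are all hypothetical, not part of the PEACH guidance.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Grant:
    tenant: str
    role: str
    action: str
    resource_prefix: str


def is_allowed(
    grants: Iterable[Grant], tenant: str, role: str, action: str, resource: str
) -> bool:
    """Deny by default: allow only when an explicit grant matches."""
    # Hypothetical convention: resources are named "<tenant>/<path>", so a
    # tenant can never be granted access to another tenant's resources.
    if not resource.startswith(tenant + "/"):
        return False
    return any(
        grant.tenant == tenant
        and grant.role == role
        and grant.action == action
        and resource.startswith(grant.resource_prefix)
        for grant in grants
    )


grants = [Grant("tenant-a", "reader", "read", "tenant-a/reports/")]
```

With no matching grant the function returns False, so forgetting to write a rule fails closed rather than open.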

In the IaaS context these may be items such as Accounts, Users, Roles, Permissions based on your CSP(s) in use and any federated identity implementations, where in the Kubernetes context it may be items such as roles and role-bindings, as captured in the Kubernetes documentation.
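For the Kubernetes side, the sketch below builds a namespaced Role and RoleBinding as plain Python dictionaries that can be dumped to JSON and applied with kubectl. The namespace, group and object names are hypothetical, and the read-only rule set is just one example of a least-privilege starting point.

```python
import json
from typing import Dict, List


def tenant_read_only_rbac(namespace: str, group: str) -> List[Dict]:
    """Build a read-only Role and RoleBinding scoped to one tenant namespace."""
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "tenant-read-only", "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group
                "resources": ["pods", "configmaps"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "tenant-read-only-binding", "namespace": namespace},
        "subjects": [
            {
                "kind": "Group",
                "name": group,
                "apiGroup": "rbac.authorization.k8s.io",
            }
        ],
        "roleRef": {
            "kind": "Role",
            "name": "tenant-read-only",
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }
    return [role, binding]


manifests = tenant_read_only_rbac("tenant-a", "tenant-a-developers")
rendered = json.dumps(manifests, indent=2)
```

Because both objects are namespaced, the permissions stop at the tenant's namespace; anything cluster-wide would need a ClusterRole, which is exactly the kind of grant worth scrutinizing in a multi-tenant cluster.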

Hardening Factors

Third among tenant modeling is what they call Hardening Factors. This involves hardening complex interfaces and security boundaries to mitigate the impact of a weak link that could compromise tenant isolation. The recommendations in this section they say are backed by their experience exploiting complex PaaS and SaaS environments.

This is where their mnemonic “PEACH” is laid out in depth as hardening factors.

Privilege Hardening - This involves minimal privileges for tenants and hosts in the various service environments and ensuring tenants aren’t authorized to read or write other tenants’ data unless explicitly approved by the other parties. These privileges get verified prior to execution.

Encryption Hardening - This factor involves encrypting each tenant’s data with a unique key and retaining logs related to the data and the keys associated with it. Logs of the tenant’s activity are also encrypted and visible only to the tenant and the respective control plane of the hosting environment and architecture.

Authentication Hardening - This factor focuses on ensuring a unique key is used for communication between each tenant and the control plane bi-directionally (e.g. mTLS). It also requires the use of validated authentication keys and blocks the use of self-signed keys.
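To show roughly what "mutual TLS, no self-signed keys" means on the control plane side, here is a minimal sketch using Python's standard ssl module. The function name and the idea of a single private CA bundle are my assumptions; a real deployment would also load the server's own certificate with load_cert_chain and handle per-tenant certificate mapping.

```python
import ssl
from typing import Optional


def build_control_plane_tls_context(
    trusted_ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Server-side TLS context that requires a client certificate (mTLS)."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Refuse the handshake unless the client presents a certificate that
    # chains to a CA this context trusts.
    context.verify_mode = ssl.CERT_REQUIRED
    if trusted_ca_file is not None:
        # Only the provider's own CA is loaded (hypothetical path), so a
        # self-signed tenant certificate will not validate.
        context.load_verify_locations(cafile=trusted_ca_file)
    return context


context = build_control_plane_tls_context()
```

A context created this way does not load the system's default trust store, so only certificates issued under the CA file you pass in are accepted.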

Connectivity Hardening - The most detailed of the factors, this one lays out several requirements. These include blocking inter-host connectivity unless explicitly approved by both tenants. It also involves not accepting incoming connection requests among tenants unless explicitly approved and limiting arbitrary access from tenants to external resources. These may be internal to the hosting environment and architecture or more broad, such as the Internet. Communications are limited to only pre-approved resources approved by the tenant.
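The "approved by both tenants" requirement can be modeled as a very small check. This is a hypothetical sketch: the approvals structure and function name are mine, and in practice the decision would be enforced by network policy or firewall rules rather than application code.

```python
from typing import Dict, Set


def connection_allowed(
    source: str, destination: str, approvals: Dict[str, Set[str]]
) -> bool:
    """Allow cross-tenant traffic only when both sides have opted in.

    approvals maps a tenant to the set of peer tenants it has approved.
    """
    if source == destination:
        # Traffic inside a single tenant is out of scope for this check.
        return True
    source_approves = destination in approvals.get(source, set())
    destination_approves = source in approvals.get(destination, set())
    return source_approves and destination_approves


approvals = {
    "tenant-a": {"tenant-b"},
    "tenant-b": {"tenant-a"},
    "tenant-c": {"tenant-a"},  # one-sided: tenant-a never approved tenant-c
}
```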

Hygiene - Clean your damn room! Kidding, sort of, but this factor involves ensuring unnecessary data that can be used to aid malicious actors in their efforts isn’t prevalent across the environment. This may include secrets, such as keys and credentials that allow authentication or encryption activities, software that can be exploited to enable lateral movement, or logs. Logs shouldn’t be exposed to other tenants and shouldn’t include information related to other tenants’ activities.
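A tiny example of a hygiene check is scanning configuration or log text for material that looks like a secret. The sketch below is illustrative only: the two patterns (the common AWS access key ID shape and a PEM private key header) are a small, hypothetical rule set, and dedicated secret scanners cover far more cases.

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately small rule set for illustration.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_secrets(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, pattern_name) for each suspicious line."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


# The key below is the placeholder value used in AWS documentation.
sample = "user=alice\naws_key=AKIAIOSFODNN7EXAMPLE\nregion=us-east-1\n"
findings = find_secrets(sample)
```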

Many of us know both studies and time have shown that a majority of cloud data breaches are due to poor customer configurations (e.g. hygiene), as well as secrets sprawl and poor secrets management. I wrote about the Secrets Management issue here, for those interested. Secrets have been involved in many of the recent cloud data breaches impacting many large enterprise organizations, and are often a key target for malicious actors and used for subsequent attack activities.

Isolation Review

The 4th step is the Isolation Design Review. This activity facilitates the activities above. You’re able to decompose a service, the associated customer-facing interfaces and start to estimate the complexity and isolation levels and then make design changes, if necessary.

After you’ve mapped all of the external interfaces you can do the following activities:

  • Identify the type of input and components of the interfaces

  • Sketch a design diagram

  • Assess the complexity level of the components

  • Determine the interface’s isolation level

  • Summarize findings into a table

  • Identify potential vulnerabilities

  • Note potential issues

  • Modify the design - if warranted
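The "summarize findings into a table" step can be sketched as a simple data structure. This is my own hypothetical layout, not the table format from the whitepaper: the field names, the complexity labels and the example interface are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

PEACH_FACTORS = (
    "privilege",
    "encryption",
    "authentication",
    "connectivity",
    "hygiene",
)


@dataclass
class InterfaceReview:
    name: str
    complexity: str  # hypothetical scale, e.g. "low", "medium", "high"
    boundaries: List[str]
    hardened: Dict[str, bool] = field(default_factory=dict)

    def gaps(self) -> List[str]:
        """PEACH factors not yet marked as hardened for this interface."""
        return [f for f in PEACH_FACTORS if not self.hardened.get(f, False)]


def summarize(reviews: List[InterfaceReview]) -> List[str]:
    """Render one pipe-separated row per reviewed interface."""
    return [
        f"{r.name} | {r.complexity} | {', '.join(r.boundaries)} | "
        f"gaps: {', '.join(r.gaps()) or 'none'}"
        for r in reviews
    ]


review = InterfaceReview(
    name="sql-query-api",
    complexity="high",
    boundaries=["container", "network segmentation"],
    hardened={"privilege": True, "encryption": True, "authentication": True},
)
rows = summarize([review])
```

Any factor left unmarked counts as a gap, which keeps the review honest about what hasn't been assessed yet.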

Their Isolation Design Review procedure is captured in the below diagram:

The guidance also provides examples of Design Diagrams and Implementation Tables, but for the sake of brevity we won’t list those here. At a high level they can be thought of as simplified Data Flow Diagrams (DFDs), which are a common artifact used in Threat Modeling, and the implementation table serves as the checklist to assess the hardening factors.

Vendor Transparency

The last step in Modeling Tenant Isolation is listed as Vendor Transparency. This step, as called out by the guidance, exists because of a lack of transparency from vendors around security controls in cloud environments.

Tenants understandably have concerns about the Confidentiality, Integrity and Availability (CIA) of their data and environments that are using multi-tenancy. The Wiz PEACH Framework suggests that vendors not only use this method to model tenant isolation but go as far as publicly sharing the most essential information about the design and implementation of their preventative controls, while not necessarily releasing the full details of their service architecture.

As consumers, either externally using IaaS/PaaS and SaaS providers, or internally using internally developed Platforms, become more familiar and concerned with multi-tenancy security risks, they will inevitably have questions. Communicating proactively and transparently can build consumer trust, both within an organization and externally among your customers and stakeholders.

The guidance cites examples from Microsoft, Google Cloud and AWS who all provide examples of how they implement tenant isolation in their respective environments.

The framework also suggests vendors publicly disclose significant isolation failures discovered in their services, such as proven cross-tenant vulnerabilities brought to their attention by researchers or, in some unfortunate cases, successful malicious actors. These disclosures can be accompanied by security advisories detailing the vulnerabilities and modifications that were made to mitigate them. These sorts of activities would fall in the wheelhouse of Vulnerability Disclosure Programs (VDPs) and Product Security Incident Response Teams (PSIRTs), which I previously discussed in this article.

For a mature example of this, look no further than Microsoft, who released an advisory for a cross-tenant vulnerability, ironically identified by Wiz, in their Azure Database services environment. They lay out how it was brought to their attention, customer impact, Microsoft’s response, and Technical Details.

Improving Tenant Isolation

Now that organizations have modeled their tenant isolation they can begin the second part of the framework, which is Improving Tenant Isolation.

They lay out three primary methods that can be taken to minimize the risk of unauthorized cross-tenant activities. These include Reducing Complexity, Improving Separation and Increasing Duplication. Let’s step through each of these briefly to see what they entail.

Reducing Complexity - This step involves right-sizing the type of interface to the type of expected output and focusing on API security. This helps limit the actions users, as well as would-be attackers, can perform. There are a variety of resources and best practices to pursue on this front, such as OWASP’s Top Ten and Shieldfy’s API Security Checklist that the guidance cites.

Improving Separation - The guidance states the separation deficiencies can be addressed by either hardening existing isolation mechanisms, replacing them with stronger isolation boundaries/methods as discussed previously, or augmenting them with secondary measures, which we also covered.

The guidance goes on to make some statements that should catch the attention of cloud security practitioners and cloud consumers alike: “any single application-level security boundary encapsulating a non-simple interface should be assumed breachable” (assume breach being another ZT principle).

They say this requires the use of additional secondary boundaries that are independent of the first, such as a boundary of a different type, so that the compromise of one boundary type doesn’t inherently lead to all-out compromise and unimpeded lateral movement across the various tenants.

Increasing Duplication - The guidance makes the case that another way to mitigate risk is to shift shared functionality away from the shared control plane (whether run by a vendor or an internal platform provider) and duplicate it among the tenants (e.g. localizing it).

Examples could include per tenant, cluster, region etc. This limits the blast radius of a single shared service or interface across all tenants to just the environment it is localized in. It is worth emphasizing, as the guidance does, that this approach comes at the cost of efficiency and with increased complexity and additional administrative overhead. Meaning organizations need to weigh their cybersecurity concerns and constraints against other relevant constraints, such as budget, resources, schedule and capabilities.

The image below depicts a shared interface or service/functionality being moved from the centralized provider or control plane, and instead out to one of the tenants. This would need to be replicated across tenants and doing so should be weighed carefully at the expense of other tradeoffs.

Additional Considerations

As we wrap up, I wanted to bring up that the guidance goes on to lay out various additional considerations. While I won’t list them all here, I did want to call out a few.

Among them:

Over Isolation - The guidance points out that in pursuit of isolation, vendors and organizations may instead end up “over isolating” which brings a corresponding increase in complexity and potentially new vulnerabilities. Isolation MUST be weighed against other external concerns and considerations, such as budget, compliance requirements, and specific use-case characteristics.

Detecting Controls - As they say, it isn’t if, but when. Even the most secured and hardened architectures and environments can still experience security incidents, and it is critical to have host and network level monitoring measures in place to detect when these incidents do occur and empower you to respond accordingly. Of course, incidents can also serve as lessons learned and funnel back into improved architecture and isolation mechanisms, as well as empower customers and consumers to perhaps make changes to their consumption if vendors and platform providers aren’t managing risk to an acceptable level.

Drift Prevention - The guidance points out that isolation levels for various services and environments need to be evaluated regularly. They discuss Isolation Drift, where the real-world environment doesn’t match the intended design. One method to mitigate this, although not specifically cited in the document, is the use of a declarative GitOps environment to reduce the manual, bespoke changes being made. This involves adopting declarative Infrastructure-as-Code (IaC) deployment models coupled with DevOps methodologies applied to Infrastructure Operations.
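At its simplest, drift detection is a comparison between the declared design and what is actually observed. The Python sketch below is a hypothetical illustration: the setting names and values are made up, and real GitOps tooling performs this reconciliation continuously against the live environment.

```python
from typing import Dict, Tuple

ABSENT = "<absent>"


def find_isolation_drift(
    declared: Dict[str, str], observed: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Map each drifted setting to its (declared, observed) values."""
    drift = {}
    for key in sorted(set(declared) | set(observed)):
        declared_value = declared.get(key, ABSENT)
        observed_value = observed.get(key, ABSENT)
        if declared_value != observed_value:
            drift[key] = (declared_value, observed_value)
    return drift


# Made-up example: a network policy was loosened by hand and an
# undeclared debug sidecar appeared in the running environment.
declared_state = {"network_policy": "deny-all", "pod_security": "restricted"}
observed_state = {
    "network_policy": "allow-all",
    "pod_security": "restricted",
    "debug_sidecar": "enabled",
}
drift = find_isolation_drift(declared_state, observed_state)
```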

Bringing It Home

As you can see from the article, hardening complex multi-tenant cloud environments, whether IaaS, PaaS or SaaS can be and is challenging.

There are a myriad of concerns that must be taken into consideration as well as a variety of hardening techniques cloud and platform providers can take to mitigate the risk to tenants residing in multi-tenant environments.

Cloud, Kubernetes and Containers bring with them a wealth of innovation and capability to help more efficiently and reliably deliver business value to customers and stakeholders - but they aren’t without their own security, compliance and privacy concerns.

Maintaining customer and consumer trust, either internally for in-house developed Platforms (such as Software Factories/PaaS) or externally for a CSP’s multi-tenant cloud service offerings, is key to a successful adoption of Cloud and Containerization at scale. The PEACH Framework provides a structured methodology to address multi-tenant cloud security concerns, building on proven practices such as Threat Modeling and leveraging expertise in the involved technologies such as Cloud, Kubernetes and Containerization.

As it turns out, much like our physical world, while there are many benefits to densely populated environments, they also come with their own unique considerations and risks and organizations should architect accordingly.

If you want to check out the PEACH Framework, you can visit https://peach.wiz.io to learn more.