1 month 2 weeks ago

Author: Jiawei Wang (Google)

The Kubernetes in-tree storage plugin to Container Storage Interface (CSI) migration infrastructure has already been beta since v1.17. CSI migration was introduced as alpha in Kubernetes v1.14.

Since then, SIG Storage and other Kubernetes special interest groups are working to ensure feature stability and compatibility in preparation for GA. This article is intended to give a status update to the feature as well as changes between Kubernetes 1.17 and 1.23. In addition, I will also cover the future roadmap for the CSI migration feature GA for each storage plugin.

Quick recap: What is CSI Migration, and why migrate?

The Container Storage Interface (CSI) was designed to help Kubernetes replace its existing, in-tree storage driver mechanisms - especially vendor specific plugins. Kubernetes support for the Container Storage Interface has been generally available since Kubernetes v1.13. Support for using CSI drivers was introduced to make it easier to add and maintain new integrations between Kubernetes and storage backend technologies. Using CSI drivers allows for for better maintainability (driver authors can define their own release cycle and support lifecycle) and reduce the opportunity for vulnerabilities (with less in-tree code, the risks of a mistake are reduced, and cluster operators can select only the storage drivers that their cluster requires).

As more CSI Drivers were created and became production ready, SIG Storage group wanted all Kubernetes users to benefit from the CSI model. However, we cannot break API compatibility with the existing storage API types. The solution we came up with was CSI migration: a feature that translates in-tree APIs to equivalent CSI APIs and delegates operations to a replacement CSI driver.

The CSI migration effort enables the replacement of existing in-tree storage plugins such as kubernetes.io/gce-pd or kubernetes.io/aws-ebs with a corresponding CSI driver from the storage backend. If CSI Migration is working properly, Kubernetes end users shouldn’t notice a difference. Existing StorageClass, PersistentVolume and PersistentVolumeClaim objects should continue to work. When a Kubernetes cluster administrator updates a cluster to enable CSI migration, existing workloads that utilize PVCs which are backed by in-tree storage plugins will continue to function as they always have. However, behind the scenes, Kubernetes hands control of all storage management operations (previously targeting in-tree drivers) to CSI drivers.

For example, suppose you are a kubernetes.io/gce-pd user, after CSI migration, you can still use kubernetes.io/gce-pd to provision new volumes, mount existing GCE-PD volumes or delete existing volumes. All existing API/Interface will still function correctly. However, the underlying function calls are all going through the GCE PD CSI driver instead of the in-tree Kubernetes function.

This enables a smooth transition for end users. Additionally as storage plugin developers, we can reduce the burden of maintaining the in-tree storage plugins and eventually remove them from the core Kubernetes binary.

What has been changed, and what's new?

Building on the work done in Kubernetes v1.17 and earlier, the releases since then have made a series of changes:

New feature gates

The Kubernetes v1.21 release deprecated the CSIMigration{provider}Complete feature flags, and stopped honoring them. In their place came new feature flags named InTreePlugin{vendor}Unregister, that replace the old feature flag and retain all the functionality that CSIMigration{provider}Complete provided.

CSIMigration{provider}Complete was introduced before as a supplementary feature gate once CSI migration is enabled on all of the nodes. This flag unregisters the in-tree storage plugin you specify with the {provider} part of the flag name.

When you enable that feature gate, then instead of using the in-tree driver code, your cluster directly selects and uses the relevant CSI driver. This happens without any check for whether CSI migration is enabled on the node, or whether you have in fact deployed that CSI driver.

While this feature gate is a great helper, SIG Storage (and, I'm sure, lots of cluster operators) also wanted a feature gate that lets you disable an in-tree storage plugin, even without also enabling CSI migration. For example, you might want to disable the EBS storage plugin on a GCE cluster, because EBS volumes are specific to a different vendor's cloud (AWS).

To make this possible, Kubernetes v1.21 introduced a new feature flag set: InTreePlugin{vendor}Unregister.

InTreePlugin{vendor}Unregister is a standalone feature gate that can be enabled and disabled independently from CSI Migration. When enabled, the component will not register the specific in-tree storage plugin to the supported list. If the cluster operator only enables this flag, end users will get an error from PVC saying it cannot find the plugin when the plugin is used. The cluster operator may want to enable this regardless of CSI Migration if they do not want to support the legacy in-tree APIs and only support CSI moving forward.

Observability

Kubernetes v1.21 introduced metrics for tracking CSI migration. You can use these metrics to observe how your cluster is using storage services and whether access to that storage is using the legacy in-tree driver or its CSI-based replacement.

Components Metrics Notes Kube-Controller-Manager storage_operation_duration_seconds A new label migrated is added to the metric to indicate whether this storage operation is a CSI migration operation(string value true for enabled and false for not enabled). Kubelet csi_operations_seconds The new metric exposes labels including driver_name, method_name, grpc_status_code and migrated. The meaning of these labels is identical to csi_sidecar_operations_seconds. CSI Sidecars(provisioner, attacher, resizer) csi_sidecar_operations_seconds A new label migrated is added to the metric to indicate whether this storage operation is a CSI migration operation(string value true for enabled and false for not enabled). Bug fixes and feature improvement

We have fixed numerous bugs like dangling attachment, garbage collection, incorrect topology label through the help of our beta testers.

Cloud Provider && Cluster Lifecycle Collaboration

SIG Storage has been working closely with SIG Cloud Provider and SIG Cluster Lifecycle on the rollout of CSI migration.

If you are a user of a managed Kubernetes service, check with your provider if anything needs to be done. In many cases, the provider will manage the migration and no additional work is required.

If you use a distribution of Kubernetes, check its official documentation for information about support for this feature. For the CSI Migration feature graduation to GA, SIG Storage and SIG Cluster Lifecycle are collaborating towards making the migration mechanisms available in tooling (such as kubeadm) as soon as they're available in Kubernetes itself.

What is the timeline / status?

The current and targeted releases for each individual driver is shown in the table below:

Driver Alpha Beta (in-tree deprecated) Beta (on-by-default) GA Target "in-tree plugin" removal AWS EBS 1.14 1.17 1.23 1.24 (Target) 1.26 (Target) GCE PD 1.14 1.17 1.23 1.24 (Target) 1.26 (Target) OpenStack Cinder 1.14 1.18 1.21 1.24 (Target) 1.26 (Target) Azure Disk 1.15 1.19 1.23 1.24 (Target) 1.26 (Target) Azure File 1.15 1.21 1.24 (Target) 1.25 (Target) 1.27 (Target) vSphere 1.18 1.19 1.24 (Target) 1.25 (Target) 1.27 (Target) Ceph RBD 1.23 Portworx 1.23

The following storage drivers will not have CSI migration support. The ScaleIO driver was already removed; the others are deprecated and will be removed from core Kubernetes.

Driver Deprecated Code Removal ScaleIO 1.16 1.22 Flocker 1.22 1.25 (Target) Quobyte 1.22 1.25 (Target) StorageOS 1.22 1.25 (Target) What's next?

With more CSI drivers graduating to GA, we hope to soon mark the overall CSI Migration feature as GA. We are expecting cloud provider in-tree storage plugins code removal to happen by Kubernetes v1.26 and v1.27.

What should I do as a user?

Note that all new features for the Kubernetes storage system (such as volume snapshotting) will only be added to the CSI interface. Therefore, if you are starting up a new cluster, creating stateful applications for the first time, or require these new features we recommend using CSI drivers natively (instead of the in-tree volume plugin API). Follow the updated user guides for CSI drivers and use the new CSI APIs.

However, if you choose to roll a cluster forward or continue using specifications with the legacy volume APIs, CSI Migration will ensure we continue to support those deployments with the new CSI drivers. However, if you want to leverage new features like snapshot, it will require a manual migration to re-import an existing intree PV as a CSI PV.

How do I get involved?

The Kubernetes Slack channel #csi-migration along with any of the standard SIG Storage communication channels are great mediums to reach out to the SIG Storage and migration working group teams.

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. We offer a huge thank you to the contributors who stepped up these last quarters to help move the project forward:

  • Michelle Au (msau42)
  • Jan Šafránek (jsafrane)
  • Hemant Kumar (gnufied)

Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CSI migration feature:

  • Andy Zhang (andyzhangz)
  • Divyen Patel (divyenpatel)
  • Deep Debroy (ddebroy)
  • Humble Devassy Chirammal (humblec)
  • Jing Xu (jingxu97)
  • Jordan Liggitt (liggitt)
  • Matthew Cary (mattcary)
  • Matthew Wong (wongma7)
  • Neha Arora (nearora-msft)
  • Oksana Naumov (trierra)
  • Saad Ali (saad-ali)
  • Tim Bannister (sftim)
  • Xing Yang (xing-yang)

Those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.

1 month 2 weeks ago

Authors: Jim Angel (Google), Lachlan Evenson (Microsoft)

With the release of Kubernetes v1.23, Pod Security admission has now entered beta. Pod Security is a built-in admission controller that evaluates pod specifications against a predefined set of Pod Security Standards and determines whether to admit or deny the pod from running.

Pod Security is the successor to PodSecurityPolicy which was deprecated in the v1.21 release, and will be removed in Kubernetes v1.25. In this article, we cover the key concepts of Pod Security along with how to use it. We hope that cluster administrators and developers alike will use this new mechanism to enforce secure defaults for their workloads.

Why Pod Security

The overall aim of Pod Security is to let you isolate workloads. You can run a cluster that runs different workloads and, without adding extra third-party tooling, implement controls that require Pods for a workload to restrict their own privileges to a defined bounding set.

Pod Security overcomes key shortcomings of Kubernetes' existing, but deprecated, PodSecurityPolicy (PSP) mechanism:

  • Policy authorization model — challenging to deploy with controllers.
  • Risks around switching — a lack of dry-run/audit capabilities made it hard to enable PodSecurityPolicy.
  • Inconsistent and Unbounded API — the large configuration surface and evolving constraints led to a complex and confusing API.

The shortcomings of PSP made it very difficult to use which led the community to reevaluate whether or not a better implementation could achieve the same goals. One of those goals was to provide an out-of-the-box solution to apply security best practices. Pod Security ships with predefined Pod Security levels that a cluster administrator can configure to meet the desired security posture.

It's important to note that Pod Security doesn't have complete feature parity with the deprecated PodSecurityPolicy. Specifically, it doesn't have the ability to mutate or change Kubernetes resources to auto-remediate a policy violation on behalf of the user. Additionally, it doesn't provide fine-grained control over each allowed field and value within a pod specification or any other Kubernetes resource that you may wish to evaluate. If you need more fine-grained policy control then take a look at these other projects which support such use cases.

Pod Security also adheres to Kubernetes best practices of declarative object management by denying resources that violate the policy. This requires resources to be updated in source repositories, and tooling to be updated prior to being deployed to Kubernetes.

How Does Pod Security Work?

Pod Security is a built-in admission controller starting with Kubernetes v1.22, but can also be run as a standalone webhook. Admission controllers function by intercepting requests in the Kubernetes API server prior to persistence to storage. They can either admit or deny a request. In the case of Pod Security, pod specifications will be evaluated against a configured policy in the form of a Pod Security Standard. This means that security sensitive fields in a pod specification will only be allowed to have specific values.

Configuring Pod Security Pod Security Standards

In order to use Pod Security we first need to understand Pod Security Standards. These standards define three different policy levels that range from permissive to restrictive. These levels are as follows:

  • privileged — open and unrestricted
  • baseline — Covers known privilege escalations while minimizing restrictions
  • restricted — Highly restricted, hardening against known and unknown privilege escalations. May cause compatibility issues

Each of these policy levels define which fields are restricted within a pod specification and the allowed values. Some of the fields restricted by these policies include:

  • spec.securityContext.sysctls
  • spec.hostNetwork
  • spec.volumes[*].hostPath
  • spec.containers[*].securityContext.privileged

Policy levels are applied via labels on Namespace resources, which allows for granular per-namespace policy selection. The AdmissionConfiguration in the API server can also be configured to set cluster-wide default levels and exemptions.

Policy modes

Policies are applied in a specific mode. Multiple modes (with different policy levels) can be set on the same namespace. Here is a list of modes:

  • enforce — Any Pods that violate the policy will be rejected
  • audit — Violations will be recorded as an annotation in the audit logs, but don't affect whether the pod is allowed.
  • warn — Violations will send a warning message back to the user, but don't affect whether the pod is allowed.

In addition to modes you can also pin the policy to a specific version (for example v1.22). Pinning to a specific version allows the behavior to remain consistent if the policy definition changes in future Kubernetes releases.

Hands on demo Prerequisites Deploy a kind cluster kind create cluster --image kindest/node:v1.23.0

It might take a while to start and once it's started it might take a minute or so before the node becomes ready.

kubectl cluster-info --context kind-kind

Wait for the node STATUS to become ready.

kubectl get nodes

The output is similar to this:

NAME STATUS ROLES AGE VERSION kind-control-plane Ready control-plane,master 54m v1.23.0 Confirm Pod Security is enabled

The best way to confirm the API's default enabled plugins is to check the Kubernetes API container's help arguments.

kubectl -n kube-system exec kube-apiserver-kind-control-plane -it -- kube-apiserver -h | grep "default enabled ones"

The output is similar to this:

... --enable-admission-plugins strings admission plugins that should be enabled in addition to default enabled ones (NamespaceLifecycle, LimitRanger, ServiceAccount, TaintNodesByCondition, PodSecurity, Priority, DefaultTolerationSeconds, DefaultStorageClass, StorageObjectInUseProtection, PersistentVolumeClaimResize, RuntimeClass, CertificateApproval, CertificateSigning, CertificateSubjectRestriction, DefaultIngressClass, MutatingAdmissionWebhook, ValidatingAdmissionWebhook, ResourceQuota). ...

PodSecurity is listed in the group of default enabled admission plugins.

If using a cloud provider, or if you don't have access to the API server, the best way to check would be to run a quick end-to-end test:

kubectl create namespace verify-pod-security kubectl label namespace verify-pod-security pod-security.kubernetes.io/enforce=restricted # The following command does NOT create a workload (--dry-run=server) kubectl -n verify-pod-security run test --dry-run=server --image=busybox --privileged kubectl delete namespace verify-pod-security

The output is similar to this:

Error from server (Forbidden): pods "test" is forbidden: violates PodSecurity "restricted:latest": privileged (container "test" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "test" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "test" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "test" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "test" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") Configure Pod Security

Policies are applied to a namespace via labels. These labels are as follows:

  • pod-security.kubernetes.io/<MODE>: <LEVEL> (required to enable pod security)
  • pod-security.kubernetes.io/<MODE>-version: <VERSION> (optional, defaults to latest)

A specific version can be supplied for each enforcement mode. The version pins the policy to the version that was shipped as part of the Kubernetes release. Pinning to a specific Kubernetes version allows for deterministic policy behavior while allowing flexibility for future updates to Pod Security Standards. The possible <MODE(S)> are enforce, audit and warn.

When to use warn?

The typical uses for warn are to get ready for a future change where you want to enforce a different policy. The most two common cases would be:

  • warn at the same level but a different version (e.g. pin enforce to restricted+v1.23 and warn at restricted+latest)
  • warn at a stricter level (e.g. enforce baseline, warn restricted)

It's not recommended to use warn for the exact same level+version of the policy as enforce. In the admission sequence, if enforce fails, the entire sequence fails before evaluating the warn.

First, create a namespace called verify-pod-security if not created earlier. For the demo, --overwrite is used when labeling to allow repurposing a single namespace for multiple examples.

kubectl create namespace verify-pod-security Deploy demo workloads

Each workload represents a higher level of security that would not pass the profile that comes after it.

For the following examples, use the busybox container runs a sleep command for 1 million seconds (≅11 days) or until deleted. Pod Security is not interested in which container image you chose, but rather the Pod level settings and their implications for security.

Privileged level and workload

For the privileged pod, use the privileged policy. This allows the process inside a container to gain new processes (also known as "privilege escalation") and can be dangerous if untrusted.

First, let's apply a restricted Pod Security level for a test.

# enforces a "restricted" security policy and audits on restricted kubectl label --overwrite ns verify-pod-security \ pod-security.kubernetes.io/enforce=restricted \ pod-security.kubernetes.io/audit=restricted

Next, try to deploy a privileged workload in the namespace.

cat <<EOF | kubectl -n verify-pod-security apply -f - apiVersion: v1 kind: Pod metadata: name: busybox-privileged spec: containers: - name: busybox image: busybox args: - sleep - "1000000" securityContext: allowPrivilegeEscalation: true EOF

The output is similar to this:

Error from server (Forbidden): error when creating "STDIN": pods "busybox-privileged" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "busybox" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "busybox" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "busybox" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "busybox" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Now let's apply the privileged Pod Security level and try again.

# enforces a "privileged" security policy and warns / audits on baseline kubectl label --overwrite ns verify-pod-security \ pod-security.kubernetes.io/enforce=privileged \ pod-security.kubernetes.io/warn=baseline \ pod-security.kubernetes.io/audit=baseline cat <<EOF | kubectl -n verify-pod-security apply -f - apiVersion: v1 kind: Pod metadata: name: busybox-privileged spec: containers: - name: busybox image: busybox args: - sleep - "1000000" securityContext: allowPrivilegeEscalation: true EOF

The output is similar to this:

pod/busybox-privileged created

We can run kubectl -n verify-pod-security get pods to verify it is running. Clean up with:

kubectl -n verify-pod-security delete pod busybox-privileged Baseline level and workload

The baseline policy demonstrates sensible defaults while preventing common container exploits.

Let's revert back to a restricted Pod Security level for a quick test.

# enforces a "restricted" security policy and audits on restricted kubectl label --overwrite ns verify-pod-security \ pod-security.kubernetes.io/enforce=restricted \ pod-security.kubernetes.io/audit=restricted

Apply the workload.

cat <<EOF | kubectl -n verify-pod-security apply -f - apiVersion: v1 kind: Pod metadata: name: busybox-baseline spec: containers: - name: busybox image: busybox args: - sleep - "1000000" securityContext: allowPrivilegeEscalation: false capabilities: add: - NET_BIND_SERVICE - CHOWN EOF

The output is similar to this:

Error from server (Forbidden): error when creating "STDIN": pods "busybox-baseline" is forbidden: violates PodSecurity "restricted:latest": unrestricted capabilities (container "busybox" must set securityContext.capabilities.drop=["ALL"]; container "busybox" must not include "CHOWN" in securityContext.capabilities.add), runAsNonRoot != true (pod or container "busybox" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "busybox" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Let's apply the baseline Pod Security level and try again.

# enforces a "baseline" security policy and warns / audits on restricted kubectl label --overwrite ns verify-pod-security \ pod-security.kubernetes.io/enforce=baseline \ pod-security.kubernetes.io/warn=restricted \ pod-security.kubernetes.io/audit=restricted cat <<EOF | kubectl -n verify-pod-security apply -f - apiVersion: v1 kind: Pod metadata: name: busybox-baseline spec: containers: - name: busybox image: busybox args: - sleep - "1000000" securityContext: allowPrivilegeEscalation: false capabilities: add: - NET_BIND_SERVICE - CHOWN EOF

The output is similar to the following. Note that the warnings match the error message from the test above, but the pod is still successfully created.

Warning: would violate PodSecurity "restricted:latest": unrestricted capabilities (container "busybox" must set securityContext.capabilities.drop=["ALL"]; container "busybox" must not include "CHOWN" in securityContext.capabilities.add), runAsNonRoot != true (pod or container "busybox" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "busybox" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") pod/busybox-baseline created

Remember, we set the verify-pod-security namespace to warn based on the restricted profile. We can run kubectl -n verify-pod-security get pods to verify it is running. Clean up with:

kubectl -n verify-pod-security delete pod busybox-baseline Restricted level and workload

The restricted policy requires rejection of all privileged parameters. It is the most secure with a trade-off for complexity. The restricted policy allows containers to add the NET_BIND_SERVICE capability only.

While we've already tested restricted as a blocking function, let's try to get something running that meets all the criteria.

First we need to reapply the restricted profile, for the last time.

# enforces a "restricted" security policy and audits on restricted kubectl label --overwrite ns verify-pod-security \ pod-security.kubernetes.io/enforce=restricted \ pod-security.kubernetes.io/audit=restricted cat <<EOF | kubectl -n verify-pod-security apply -f - apiVersion: v1 kind: Pod metadata: name: busybox-restricted spec: containers: - name: busybox image: busybox args: - sleep - "1000000" securityContext: allowPrivilegeEscalation: false capabilities: add: - NET_BIND_SERVICE EOF

The output is similar to this:

Error from server (Forbidden): error when creating "STDIN": pods "busybox-restricted" is forbidden: violates PodSecurity "restricted:latest": unrestricted capabilities (container "busybox" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "busybox" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "busybox" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

This is because the restricted profile explicitly requires that certain values are set to the most secure parameters.

By requiring explicit values, manifests become more declarative and your entire security model can shift left. With the restricted level of enforcement, a company could audit their cluster's compliance based on permitted manifests.

Let's fix each warning resulting in the following file:

cat <<EOF | kubectl -n verify-pod-security apply -f - apiVersion: v1 kind: Pod metadata: name: busybox-restricted spec: containers: - name: busybox image: busybox args: - sleep - "1000000" securityContext: seccompProfile: type: RuntimeDefault runAsNonRoot: true allowPrivilegeEscalation: false capabilities: drop: - ALL add: - NET_BIND_SERVICE EOF

The output is similar to this:

pod/busybox-restricted created

Run kubectl -n verify-pod-security get pods to verify it is running. The output is similar to this:

NAME READY STATUS RESTARTS AGE busybox-restricted 0/1 CreateContainerConfigError 0 2m26s

Let's figure out why the container is not starting with kubectl -n verify-pod-security describe pod busybox-restricted. The output is similar to this:

Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning Failed 2m29s (x8 over 3m55s) kubelet Error: container has runAsNonRoot and image will run as root (pod: "busybox-restricted_verify-pod-security(a4c6a62d-2166-41a9-b288-20df17cf5c90)", container: busybox)

To solve this, set the effective UID (runAsUser) to a non-zero (root) value or use the nobody UID (65534).

# delete the original pod kubectl -n verify-pod-security delete pod busybox-restricted # create the pod again with new runAsUser cat <<EOF | kubectl -n verify-pod-security apply -f - apiVersion: v1 kind: Pod metadata: name: busybox-restricted spec: securityContext: runAsUser: 65534 containers: - name: busybox image: busybox args: - sleep - "1000000" securityContext: seccompProfile: type: RuntimeDefault runAsNonRoot: true allowPrivilegeEscalation: false capabilities: drop: - ALL add: - NET_BIND_SERVICE EOF

Run kubectl -n verify-pod-security get pods to verify it is running. The output is similar to this:

NAME READY STATUS RESTARTS AGE busybox-restricted 1/1 Running 0 25s

Clean up the demo (restricted pod and namespace) with:

kubectl delete namespace verify-pod-security

At this point, if you wanted to dive deeper into linux permissions or what is permitted for a certain container, exec into the control plane and play around with containerd and crictl inspect.

# if using docker, shell into the control plane docker exec -it kind-control-plane bash # list running containers crictl ps # inspect each one by container ID crictl inspect <CONTAINER ID> Applying a cluster-wide policy

In addition to applying labels to namespaces to configure policy you can also configure cluster-wide policies and exemptions using the AdmissionConfiguration resource.

Using this resource, policy definitions are applied cluster-wide by default and any policy that is applied via namespace labels will take precedence.

There is no runtime configurable API for the AdmissionConfiguration configuration file so a cluster administrator would need to specify a path to the file below via the --admission-control-config-file flag on the API server.

In the following resource we are enforcing the baseline policy and warning and auditing the baseline policy. We are also making the kube-system namespace exempt from this policy.

It's not recommended to alter control plane / clusters after install, so let's build a new cluster with a default policy on all namespaces.

First, delete the current cluster.

kind delete cluster

Create a Pod Security configuration that enforce and audit baseline policies while using a restricted profile to warn the end user.

cat <<EOF > pod-security.yaml apiVersion: apiserver.config.k8s.io/v1 kind: AdmissionConfiguration plugins: - name: PodSecurity configuration: apiVersion: pod-security.admission.config.k8s.io/v1beta1 kind: PodSecurityConfiguration defaults: enforce: "baseline" enforce-version: "latest" audit: "baseline" audit-version: "latest" warn: "restricted" warn-version: "latest" exemptions: # Array of authenticated usernames to exempt. usernames: [] # Array of runtime class names to exempt. runtimeClasses: [] # Array of namespaces to exempt. namespaces: [kube-system] EOF

For additional options, check out the official standards admission controller docs.

We now have a default baseline policy. Next pass it to the kind configuration to enable the --admission-control-config-file API server argument and pass the policy file. To pass a file to a kind cluster, use a configuration file to pass additional setup instructions. Kind uses kubeadm to provision the cluster and the configuration file has the ability to pass kubeadmConfigPatches for further customization. In our case, the local file is mounted into the control plane node as /etc/kubernetes/policies/pod-security.yaml which is then mounted into the apiServer container. We also pass the --admission-control-config-file argument pointing to the policy's location.

cat <<EOF > kind-config.yaml kind: Cluster apiVersion: kind.x-k8s.io/v1alpha4 nodes: - role: control-plane kubeadmConfigPatches: - | kind: ClusterConfiguration apiServer: # enable admission-control-config flag on the API server extraArgs: admission-control-config-file: /etc/kubernetes/policies/pod-security.yaml # mount new file / directories on the control plane extraVolumes: - name: policies hostPath: /etc/kubernetes/policies mountPath: /etc/kubernetes/policies readOnly: true pathType: "DirectoryOrCreate" # mount the local file on the control plane extraMounts: - hostPath: ./pod-security.yaml containerPath: /etc/kubernetes/policies/pod-security.yaml readOnly: true EOF

Create a new cluster using the kind configuration file defined above.

kind create cluster --image kindest/node:v1.23.0 --config kind-config.yaml

Let's look at the default namespace.

kubectl describe namespace default

The output is similar to this:

Name: default Labels: kubernetes.io/metadata.name=default Annotations: <none> Status: Active No resource quota. No LimitRange resource.

Let's create a new namespace and see if the labels apply there.

kubectl create namespace test-defaults kubectl describe namespace test-defaults

Same.

Name: test-defaults Labels: kubernetes.io/metadata.name=test-defaults Annotations: <none> Status: Active No resource quota. No LimitRange resource.

Can a privileged workload be deployed?

cat <<EOF | kubectl -n test-defaults apply -f - apiVersion: v1 kind: Pod metadata: name: busybox-privileged spec: containers: - name: busybox image: busybox args: - sleep - "1000000" securityContext: allowPrivilegeEscalation: true EOF

Hmm... yep. The default warn level is working at least.

Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "busybox" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "busybox" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "busybox" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "busybox" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") pod/busybox-privileged created

Let's delete the pod with kubectl -n test-defaults delete pod/busybox-privileged.

Is my config even working?

# if using docker, shell into the control plane docker exec -it kind-control-plane bash # cat out the file we mounted cat /etc/kubernetes/policies/pod-security.yaml # check the api server logs cat /var/log/containers/kube-apiserver*.log # check the api server config cat /etc/kubernetes/manifests/kube-apiserver.yaml

UPDATE: The baseline policy permits allowPrivilegeEscalation. While I cannot see the Pod Security default levels of enforcement, they are there. Let's try to provide a manifest that violates the baseline by requesting hostNetwork access.

# delete the original pod kubectl -n test-defaults delete pod busybox-privileged cat <<EOF | kubectl -n test-defaults apply -f - apiVersion: v1 kind: Pod metadata: name: busybox-privileged spec: containers: - name: busybox image: busybox args: - sleep - "1000000" hostNetwork: true EOF

The output is similar to this:

Error from server (Forbidden): error when creating "STDIN": pods "busybox-privileged" is forbidden: violates PodSecurity "baseline:latest": host namespaces (hostNetwork=true) Yes!!! It worked! 🎉🎉🎉

I later found out, another way to check if things are operating as intended is to check the raw API server metrics endpoint.

Run the following command:

kubectl get --raw /metrics | grep pod_security_evaluations_total

The output is similar to this:

# HELP pod_security_evaluations_total [ALPHA] Number of policy evaluations that occurred, not counting ignored or exempt requests. # TYPE pod_security_evaluations_total counter pod_security_evaluations_total{decision="allow",mode="enforce",policy_level="baseline",policy_version="latest",request_operation="create",resource="pod",subresource=""} 2 pod_security_evaluations_total{decision="allow",mode="enforce",policy_level="privileged",policy_version="latest",request_operation="create",resource="pod",subresource=""} 0 pod_security_evaluations_total{decision="allow",mode="enforce",policy_level="privileged",policy_version="latest",request_operation="update",resource="pod",subresource=""} 0 pod_security_evaluations_total{decision="deny",mode="audit",policy_level="baseline",policy_version="latest",request_operation="create",resource="pod",subresource=""} 1 pod_security_evaluations_total{decision="deny",mode="enforce",policy_level="baseline",policy_version="latest",request_operation="create",resource="pod",subresource=""} 1 pod_security_evaluations_total{decision="deny",mode="warn",policy_level="restricted",policy_version="latest",request_operation="create",resource="controller",subresource=""} 2 pod_security_evaluations_total{decision="deny",mode="warn",policy_level="restricted",policy_version="latest",request_operation="create",resource="pod",subresource=""} 2

A monitoring tool could ingest these metrics too for reporting, assessments, or measuring trends.

Clean up

When finished, delete the kind cluster.

kind delete cluster Auditing

Auditing is another way to track what policies are being enforced in your cluster. To set up auditing with kind, review the official docs for enabling auditing. As of version 1.11, Kubernetes audit logs include two annotations that indicate whether or not a request was authorized (authorization.k8s.io/decision) and the reason for the decision (authorization.k8s.io/reason). Audit events can be streamed to a webhook for monitoring, tracking, or alerting.

The audit events look similar to the following:

{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"","pod-security.kubernetes.io/audit":"allowPrivilegeEscalation != false (container \"busybox\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"busybox\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"busybox\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"busybox\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")"}}

Auditing is also a good first step in evaluating your cluster's current compliance with Pod Security. The Kubernetes Enhancement Proposal (KEP) hints at a future where baseline could be the default for unlabeled namespaces.

Example audit-policy.yaml configuration tuned for Pod Security events:

apiVersion: audit.k8s.io/v1 kind: Policy rules: - level: RequestResponse resources: - group: "" # core API group resources: ["pods", "pods/ephemeralcontainers", "podtemplates", "replicationcontrollers"] - group: "apps" resources: ["daemonsets", "deployments", "replicasets", "statefulsets"] - group: "batch" resources: ["cronjobs", "jobs"] verbs: ["create", "update"] omitStages: - "RequestReceived" - "ResponseStarted" - "Panic"

Once auditing is enabled, look at the configured local file if using --audit-log-path or the destination of a webhook if using --audit-webhook-config-file.

If using a file (--audit-log-path), run cat /PATH/TO/API/AUDIT.log | grep "is forbidden:" to see all rejected workloads audited.

PSP migrations

If you're already using PSP, SIG Auth has created a guide and published the steps to migrate off of PSP.

To summarize the process:

  • Update all existing PSPs to be non-mutating
  • Apply Pod Security policies in warn or audit mode
  • Upgrade Pod Security policies to enforce mode
  • Remove PodSecurityPolicy from --enable-admission-plugins

Listed as "optional future extensions" and currently out of scope, SIG Auth has kicked around the idea of providing a tool to assist with migrations. More details in the KEP.

Wrap up

Pod Security is a promising new feature that provides an out-of-the-box way to allow users to improve the security posture of their workloads. Like any new enhancement that has matured to beta, we ask that you try it out, provide feedback, or share your experience via either raising a Github issue or joining SIG Auth community meetings. It's our hope that Pod Security will be deployed on every cluster in our ongoing pursuit as a community to make Kubernetes security a priority.

For a step by step guide on how to enable "baseline" Pod Security Standards with Pod Security Admission feature please refer to these dedicated tutorials that cover the configuration needed at cluster level and namespace level.

Additional resources
1 month 2 weeks ago

Author: Bridget Kromhout (Microsoft)

"When will Kubernetes have IPv6?" This question has been asked with increasing frequency ever since alpha support for IPv6 was first added in k8s v1.9. While Kubernetes has supported IPv6-only clusters since v1.18, migration from IPv4 to IPv6 was not yet possible at that point. At long last, dual-stack IPv4/IPv6 networking has reached general availability (GA) in Kubernetes v1.23.

What does dual-stack networking mean for you? Let’s take a look…

Service API updates

Services were single-stack before 1.20, so using both IP families meant creating one Service per IP family. The user experience was simplified in 1.20, when Services were re-implemented to allow both IP families, meaning a single Service can handle both IPv4 and IPv6 workloads. Dual-stack load balancing is possible between services running any combination of IPv4 and IPv6.

The Service API now has new fields to support dual-stack, replacing the single ipFamily field.

  • You can select your choice of IP family by setting ipFamilyPolicy to one of three options: SingleStack, PreferDualStack, or RequireDualStack. A service can be changed between single-stack and dual-stack (within some limits).
  • Setting ipFamilies to a list of families assigned allows you to set the order of families used.
  • clusterIPs is inclusive of the previous clusterIP but allows for multiple entries, so it’s no longer necessary to run duplicate services, one in each of the two IP families. Instead, you can assign cluster IP addresses in both IP families.

Note that Pods are also dual-stack. For a given pod, there is no possibility of setting multiple IP addresses in the same family.

Default behavior remains single-stack

Starting in 1.20 with the re-implementation of dual-stack services as alpha, the underlying networking for Kubernetes has included dual-stack whether or not a cluster was configured with the feature flag to enable dual-stack.

Kubernetes 1.23 removed that feature flag as part of graduating the feature to stable. Dual-stack networking is always available if you want to configure it. You can set your cluster network to operate as single-stack IPv4, as single-stack IPv6, or as dual-stack IPv4/IPv6.

While Services are set according to what you configure, Pods default to whatever the CNI plugin sets. If your CNI plugin assigns single-stack IPs, you will have single-stack unless ipFamilyPolicy specifies PreferDualStack or RequireDualStack. If your CNI plugin assigns dual-stack IPs, pod.status.PodIPs defaults to dual-stack.

Even though dual-stack is possible, it is not mandatory to use it. Examples in the documentation show the variety possible in dual-stack service configurations.

Try dual-stack right now

While upstream Kubernetes now supports dual-stack networking as a GA or stable feature, each provider’s support of dual-stack Kubernetes may vary. Nodes need to be provisioned with routable IPv4/IPv6 network interfaces. Pods need to be dual-stack. The network plugin is what assigns the IP addresses to the Pods, so it's the network plugin being used for the cluster that needs to support dual-stack. Some Container Network Interface (CNI) plugins support dual-stack, as does kubenet.

Ecosystem support of dual-stack is increasing; you can create dual-stack clusters with kubeadm, try a dual-stack cluster locally with KIND, and deploy dual-stack clusters in cloud providers (after checking docs for CNI or kubenet availability).

Get involved with SIG Network

SIG-Network wants to learn from community experiences with dual-stack networking to find out more about evolving needs and your use cases. The SIG-network update video from KubeCon NA 2021 summarizes the SIG’s recent updates, including dual-stack going to stable in 1.23.

The current SIG-Network KEPs and issues on GitHub illustrate the SIG’s areas of emphasis. The dual-stack API server is one place to consider contributing.

SIG-Network meetings are a friendly, welcoming venue for you to connect with the community and share your ideas. Looking forward to hearing from you!

Acknowledgments

The dual-stack networking feature represents the work of many Kubernetes contributors. Thanks to all who contributed code, experience reports, documentation, code reviews, and everything in between. Bridget Kromhout details this community effort in Dual-Stack Networking in Kubernetes. KubeCon keynotes by Tim Hockin & Khaled (Kal) Henidak in 2019 (The Long Road to IPv4/IPv6 Dual-stack Kubernetes) and by Lachlan Evenson in 2021 (And Here We Go: Dual-stack Networking in Kubernetes) talk about the dual-stack journey, spanning five years and a great many lines of code.

1 month 2 weeks ago

Authors: Kubernetes 1.23 Release Team

We’re pleased to announce the release of Kubernetes 1.23, the last release of 2021!

This release consists of 47 enhancements: 11 enhancements have graduated to stable, 17 enhancements are moving to beta, and 19 enhancements are entering alpha. Also, 1 feature has been deprecated.

Major Themes Deprecation of FlexVolume

FlexVolume is deprecated. The out-of-tree CSI driver is the recommended way to write volume drivers in Kubernetes. See this doc for more information. Maintainers of FlexVolume drivers should implement a CSI driver and move users of FlexVolume to CSI. Users of FlexVolume should move their workloads to the CSI driver.

Deprecation of klog specific flags

To simplify the code base, several logging flags were marked as deprecated in Kubernetes 1.23. The code which implements them will be removed in a future release, so users of those need to start replacing the deprecated flags with some alternative solutions.

Software Supply Chain SLSA Level 1 Compliance in the Kubernetes Release Process

Kubernetes releases now generate provenance attestation files describing the staging and release phases of the release process. Artifacts are now verified as they are handed over from one phase to the next. This final piece completes the work needed to comply with Level 1 of the SLSA security framework (Supply-chain Levels for Software Artifacts).

IPv4/IPv6 Dual-stack Networking graduates to GA

IPv4/IPv6 dual-stack networking graduates to GA. Since 1.21, Kubernetes clusters have been enabled to support dual-stack networking by default. In 1.23, the IPv6DualStack feature gate is removed. The use of dual-stack networking is not mandatory. Although clusters are enabled to support dual-stack networking, Pods and Services continue to default to single-stack. To use dual-stack networking Kubernetes nodes must have routable IPv4/IPv6 network interfaces, a dual-stack capable CNI network plugin must be used, Pods must be configured to be dual-stack and Services must have their .spec.ipFamilyPolicy field set to either PreferDualStack or RequireDualStack.

HorizontalPodAutoscaler v2 graduates to GA

The HorizontalPodAutscaler autoscaling/v2 stable API moved to GA in 1.23. The HorizontalPodAutoscaler autoscaling/v2beta2 API has been deprecated.

Generic Ephemeral Volume feature graduates to GA

The generic ephemeral volume feature moved to GA in 1.23. This feature allows any existing storage driver that supports dynamic provisioning to be used as an ephemeral volume with the volume’s lifecycle bound to the Pod. All StorageClass parameters for volume provisioning and all features supported with PersistentVolumeClaims are supported.

Skip Volume Ownership change graduates to GA

The feature to configure volume permission and ownership change policy for Pods moved to GA in 1.23. This allows users to skip recursive permission changes on mount and speeds up the pod start up time.

Allow CSI drivers to opt-in to volume ownership and permission change graduates to GA

The feature to allow CSI Drivers to declare support for fsGroup based permissions graduates to GA in 1.23.

PodSecurity graduates to Beta

PodSecurity moves to Beta. PodSecurity replaces the deprecated PodSecurityPolicy admission controller. PodSecurity is an admission controller that enforces Pod Security Standards on Pods in a Namespace based on specific namespace labels that set the enforcement level. In 1.23, the PodSecurity feature gate is enabled by default.

Container Runtime Interface (CRI) v1 is default

The Kubelet now supports the CRI v1 API, which is now the project-wide default. If a container runtime does not support the v1 API, Kubernetes will fall back to the v1alpha2 implementation. There is no intermediate action required by end-users, because v1 and v1alpha2 do not differ in their implementation. It is likely that v1alpha2 will be removed in one of the future Kubernetes releases to be able to develop v1.

Structured logging graduate to Beta

Structured logging reached its Beta milestone. Most log messages from kubelet and kube-scheduler have been converted. Users are encouraged to try out JSON output or parsing of the structured text format and provide feedback on possible solutions for the open issues, such as handling of multi-line strings in log values.

Simplified Multi-point plugin configuration for scheduler

The kube-scheduler is adding a new, simplified config field for Plugins to allow multiple extension points to be enabled in one spot. The new multiPoint plugin field is intended to simplify most scheduler setups for administrators. Plugins that are enabled via multiPoint will automatically be registered for each individual extension point that they implement. For example, a plugin that implements Score and Filter extensions can be simultaneously enabled for both. This means entire plugins can be enabled and disabled without having to manually edit individual extension point settings. These extension points can now be abstracted away due to their irrelevance for most users.

CSI Migration updates

CSI Migration enables the replacement of existing in-tree storage plugins such as kubernetes.io/gce-pd or kubernetes.io/aws-ebs with a corresponding CSI driver. If CSI Migration is working properly, Kubernetes end users shouldn’t notice a difference. After migration, Kubernetes users may continue to rely on all the functionality of in-tree storage plugins using the existing interface.

  • CSI Migration feature is turned on by default but stays in Beta for GCE PD, AWS EBS, and Azure Disk in 1.23.
  • CSI Migration is introduced as an Alpha feature for Ceph RBD and Portworx in 1.23.
Expression language validation for CRD is alpha

Expression language validation for CRD is in alpha starting in 1.23. If the CustomResourceValidationExpressions feature gate is enabled, custom resources will be validated by validation rules using the Common Expression Language (CEL).

Server Side Field Validation is Alpha

If the ServerSideFieldValidation feature gate is enabled starting 1.23, users will receive warnings from the server when they send Kubernetes objects in the request that contain unknown or duplicate fields. Previously unknown fields and all but the last duplicate fields would be dropped by the server.

With the feature gate enabled, we also introduce the fieldValidation query parameter so that users can specify the desired behavior of the server on a per request basis. Valid values for the fieldValidation query parameter are:

  • Ignore (default when feature gate is disabled, same as pre-1.23 behavior of dropping/ignoring unkonwn fields)
  • Warn (default when feature gate is enabled).
  • Strict (this will fail the request with an Invalid Request error)
OpenAPI v3 is Alpha

If the OpenAPIV3 feature gate is enabled starting 1.23, users will be able to request the OpenAPI v3.0 spec for all Kubernetes types. OpenAPI v3 aims to be fully transparent and includes support for a set of fields that are dropped when publishing OpenAPI v2: default, nullable, oneOf, anyOf. A separate spec is published per Kubernetes group version (at the $cluster/openapi/v3/apis/<group>/<version> endpoint) for improved performance and discovery, for all group versions can be found at the $cluster/openapi/v3 path.

Other Updates Graduated to Stable Major Changes Release Notes

Check out the full details of the Kubernetes 1.23 release in our release notes.

Availability

Kubernetes 1.23 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using Docker container “nodes” with kind. You can also easily install 1.23 using kubeadm.

Release Team

This release was made possible by a very dedicated group of individuals, who came together as a team to deliver technical content, documentation, code, and a host of other components that go into every Kubernetes release.

A huge thank you to the release lead Rey Lejano for leading us through a successful release cycle, and to everyone else on the release team for supporting each other, and working so hard to deliver the 1.23 release for the community.

Release Theme and Logo

Kubernetes 1.23: The Next Frontier

"The Next Frontier" theme represents the new and graduated enhancements in 1.23, Kubernetes' history of Star Trek references, and the growth of community members in the release team.

Kubernetes has a history of Star Trek references. The original codename for Kubernetes within Google is Project 7, a reference to Seven of Nine from Star Trek Voyager. And of course Borg was the name for the predecessor to Kubernetes. "The Next Frontier" theme continues the Star Trek references. "The Next Frontier" is a fusion of two Star Trek titles, Star Trek V: The Final Frontier and Star Trek the Next Generation.

"The Next Frontier" represents a line in the SIG Release charter, "Ensure there is a consistent group of community members in place to support the release process across time." With each release team, we grow the community with new release team members and for many it's their first contribution in their open source frontier.

Reference: https://kubernetes.io/blog/2015/04/borg-predecessor-to-kubernetes/ Reference: https://github.com/kubernetes/community/blob/master/sig-release/charter.md

The Kubernetes 1.23 release logo continues with the theme's Star Trek reference. Every star is a helm from the Kubernetes logo. The ship represents the collective teamwork of the release team.

Rey Lejano designed the logo.

User Highlights Ecosystem Updates Project Velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing, and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.23 release cycle, which ran for 16 weeks (August 23 to December 7), we saw contributions from 1032 companies and 1084 individuals.

Event Update
  • KubeCon + CloudNativeCon China 2021 is happening this month from December 9 - 11. After taking a break last year, the event will be virtual this year and includes 105 sessions. Check out the event schedule here.
  • KubeCon + CloudNativeCon Europe 2022 will take place in Valencia, Spain, May 4 – 7, 2022! You can find more information about the conference and registration on the event site.
  • Kubernetes Community Days has upcoming events scheduled in Pakistan, Brazil, Chengdu, and in Australia.
Upcoming Release Webinar

Join members of the Kubernetes 1.23 release team on January 4, 2022 to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below: