Collection of issues we faced and external pages/links that actually helped resolve them:
2021-08
- User needed to change ulimits on worker nodes so his "heavy" app can run without errors like "Too many open files." With RHCOS, for system configs like ulimits to persist, the change has to be made with MachineConfig, which has a rather straightforward way of adding (or replacing) any file on the system. After tweaking the example for /etc/security/limits.conf, the rest was just executing the commands to convert the input .bu file to YAML, then oc apply it, and waiting for all (worker) nodes to reboot.
By the way, the quay.io/coreos/butane image referenced by this page is x86 only. If you want to execute that step on Power, go get the Linux binary from here.
==> https://docs.openshift.com/container-platform/4.8/post_installation_configuration/machine-configuration-tasks.html#machineconfig-modify-journald_post-install-machine-configuration-tasks
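As a rough illustration of the MachineConfig route described above, the sketch below writes a Butane input file and wraps the convert-and-apply steps in a function. The object name 99-worker-limits, the nofile values, and the helper name apply_limits_mc are assumptions made for the example, not what was actually used, and the target path is assumed to be /etc/security/limits.conf.

```shell
# Write a Butane config that replaces /etc/security/limits.conf on worker nodes.
# The object name, limit values, and path are illustrative assumptions.
cat > 99-worker-limits.bu <<'EOF'
variant: openshift
version: 4.8.0
metadata:
  name: 99-worker-limits
  labels:
    machineconfiguration.openshift.io/role: worker
storage:
  files:
    - path: /etc/security/limits.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          * soft nofile 65536
          * hard nofile 65536
EOF

# Hypothetical helper: convert the .bu file to a MachineConfig and apply it.
# Applying it makes the MCO roll through (and reboot) the worker nodes.
apply_limits_mc() {
  butane 99-worker-limits.bu -o 99-worker-limits.yaml &&
    oc apply -f 99-worker-limits.yaml
}
```

After calling apply_limits_mc, oc get mcp worker shows how far the rollout has progressed.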
(2021-07-27 4.8 GA)
2021-06
- Namespace (aka Project in OpenShift) stuck in Terminating state after an attempt to delete it. oc get all shows no resources in the namespace, yet oc get ns shows the "deleted" namespace in Terminating state.
==> https://www.redhat.com/sysadmin/openshift-terminating-state
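One reason oc get all can come back empty while the namespace still hangs is that "all" covers only a subset of resource types. A minimal sketch for listing whatever is left, assuming a hypothetical helper name list_leftovers and a namespace name of your own:

```shell
# List every namespaced resource type that supports "list",
# then query each one in the given namespace.
# list_leftovers is a hypothetical helper name.
list_leftovers() {
  ns="$1"
  oc api-resources --verbs=list --namespaced -o name |
    while read -r kind; do
      oc get "$kind" -n "$ns" --ignore-not-found --show-kind --no-headers \
        </dev/null 2>/dev/null
    done
}

# Example (namespace name is a placeholder):
#   list_leftovers my-stuck-project
```

Anything this prints is a candidate for what is holding the namespace; the linked article covers what to do about it.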
2021-05
- Upgrading a cluster (from 4.7.6 to 4.7.7) hit a snag and one master node remained SchedulingDisabled, while all other nodes upgraded fine. The machine-config operator was Degraded, and oc get clusterversion was stuck at 83%. Understanding how OCP upgrades work and how the MCO works helps you feel comfortable going into a failing cluster node and running commands to nudge the MCD, get the machine-config operator back into a good state, and finish the outstanding upgrade operation.
=> https://guifreelife.com/blog/2021/03/09/Understanding-OpenShift-Over-The-Air-Updates/
=> https://access.redhat.com/solutions/5598401
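Before going onto the node, it can help to see what the MCO thinks is going on. A hedged sketch of the read-only checks, using only standard oc commands; the helper name mco_status and the node name in the example are made up for illustration:

```shell
# Print the pieces that usually tell the story of a stuck upgrade.
# mco_status is a hypothetical helper; pass the node stuck in SchedulingDisabled.
mco_status() {
  node="$1"
  oc get clusterversion
  oc get co machine-config
  oc get mcp
  # The MCD records its current/desired config and state as node annotations.
  oc describe node "$node" | grep 'machineconfiguration.openshift.io/'
}

# Example (node name is a placeholder):
#   mco_status master-1.ocp4.example.com
```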
2021-04
- Customizing a KVM RHEL8 image fails if you try to do that (virt-customize) on a RHEL7 host. This was experienced when following the preparatory steps in https://github.com/ocp-power-automation/ocp4-upi-kvm.
=> https://access.redhat.com/solutions/4073061
2021-03
- A compute or storage resource got deleted, perhaps by accident, and now your terraform.tfstate is out of sync. terraform refresh is the first thing to try, as it queries the target IaaS and updates the .tfstate file accordingly. If for some reason that doesn't work, use the terraform state rm command, which properly deletes resources from the file. First, run terraform state list and find out which module corresponds to the infrastructure resource(s) that are now gone. After determining which one(s), go ahead and run terraform state rm with those, perhaps in quotes, one at a time. Once done, the terraform.tfstate file is back in sync, and other Terraform commands (e.g. refresh, apply, destroy) should be able to run happily.
=> https://www.terraform.io/docs/cli/commands/state/rm.html
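The list-then-remove sequence above can be sketched as a small loop. The helper name tf_state_rm_each and the resource addresses in the example are placeholders; only terraform state list, terraform state rm, and terraform refresh are actual Terraform subcommands.

```shell
# Remove the given resource addresses from the state, one at a time.
# Stops at the first failure so a mistyped address does not go unnoticed.
tf_state_rm_each() {
  for addr in "$@"; do
    terraform state rm "$addr" || return 1
  done
}

# Example (addresses are placeholders; quote them so [0] is not globbed):
#   terraform state list
#   tf_state_rm_each 'module.example.some_resource.node[0]' 'module.example.some_resource.node[1]'
#   terraform refresh
```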
(2021-02-24 4.7 GA)
2021-02
- image-registry operator reporting Progressing, while Available and not Degraded, after starting with "EmptyDir" storage, switching to PVC (with nfs-provisioner), then deleting and recreating the provisioner Pod. The cluster config of image-registry ends up with 2 storage types, which it cannot support at this time (tested/verified with 4.6.13 & 4.6.19).
To edit the config, use: $ oc edit configs.imageregistry.operator.openshift.io
=> https://access.redhat.com/solutions/4516391
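To see whether the config really carries two storage types, and to drop the stale one without an interactive edit, something like the following may work. It assumes the leftover entry is emptyDir and that the PVC entry is the one to keep, so verify that against your own config first. The helper names are hypothetical.

```shell
# Show the storage stanza of the image registry operator config.
registry_storage() {
  oc get configs.imageregistry.operator.openshift.io cluster \
    -o jsonpath='{.spec.storage}{"\n"}'
}

# Remove the emptyDir entry with a JSON patch
# (assumes emptyDir is the stale storage type).
registry_drop_emptydir() {
  oc patch configs.imageregistry.operator.openshift.io cluster --type json \
    -p '[{"op":"remove","path":"/spec/storage/emptyDir"}]'
}
```

Run registry_storage first; registry_drop_emptydir only makes sense if both emptyDir and pvc show up in its output.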
2021-01
- 4.6 node reporting NotReady after upgrade to 4.6.12, due to CRI-O failing to start up on one worker node (other nodes were fine). The error seen in journalctl on the troubled node was like
Jan 21 16:00:48 worker1.ocp4.example.com bash[1881]: Error: readlink /var/lib/containers/storage/overlay/l/6MDZKPV63T>
Running # systemctl daemon-reload was necessary before the storage wipe steps in the below doc.
=> https://access.redhat.com/solutions/5350721 (again, see 2020-10)
- OCS (OpenShift Container Storage) 4.6 install worked like a champ following this document.
=> https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html-single/monitoring_openshift_container_storage/index
(2020-10-26 4.6 GA)
2020-10
- Pods failing to start on a particular node, often with ImagePullBackOff error, due to CRI-O ephemeral storage getting full
=> https://access.redhat.com/solutions/5350721
- How to troubleshoot ingress, routes, services: what objects to check, and in what order (perhaps by likelihood, for quicker resolution)
=> https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/
- 4.5 disconnected cluster (i.e. no direct access to the Internet): normal upgrade fails, requiring "opting out Insights operator" and "mirroring image repository (again)." oc adm upgrade first checks if the cluster is currently healthy, and any problem with cluster operators (oc get co) blocks the upgrade operation from proceeding. The Insights operator would be running but degraded because it cannot report status to Red Hat due to no Internet access, hence it needs "opting out." Then, the upgrade-to version images need to be mirrored to the internal repository, again, due to no Internet access.
=> https://docs.openshift.com/container-platform/4.5/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html
=> https://docs.openshift.com/container-platform/4.5/updating/updating-restricted-network-cluster.html
- How to install, configure, and use the Node Feature Discovery operator. The article was written for OCP 4.1, therefore the UI is a little different in 4.6, but it's actually easier (without having to deal with YAML) in 4.6.
=> https://access.redhat.com/solutions/4734811
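For the disconnected upgrade item above: since a degraded operator blocks the upgrade, a quick filter over oc get co can show what needs attention first. This is a sketch that assumes the usual column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED) with no empty columns; unhealthy_operators is a hypothetical helper name.

```shell
# Print cluster operators that are not Available or are Degraded.
unhealthy_operators() {
  oc get co --no-headers |
    awk '$3 != "True" || $5 != "False" { print $1 }'
}
```

On a disconnected cluster that has not opted out yet, insights would be expected to show up in this list.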
2020-09
- 4.5 on libvirt, authentication and console operators not available, and console pod's log indicating a DNS issue for "oauth-openshift."
=> https://github.com/openshift/installer/issues/1648#issuecomment-585235423
2020-08
- Node NotReady after reboot, perhaps after being shut down for a few days, with "Kubelet stopped posting node status" in oc describe node
Essentially, sudo systemctl restart kubelet on the bad node, followed by oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve, perhaps multiple times as a few CSRs could come in for approval sequentially.
=> https://docs.openshift.com/container-platform/4.1/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html
- Random intermittent API server errors like "connection refused," "authentication required," "internal error," with etcd running on a slow disk
=> https://github.com/openshift/release/blob/23074b5/ci-operator/templates/openshift/installer/cluster-launch-installer-openstack-e2e.yaml#L375-L382
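The repeated CSR approval from the NotReady item above can be wrapped in a loop. A minimal sketch, where the helper name approve_pending_csrs, the round count, and the pause between rounds are arbitrary choices for illustration:

```shell
# Approve pending CSRs for a few rounds, since new ones may appear
# only after the previous batch has been approved.
approve_pending_csrs() {
  rounds="${1:-3}"   # how many passes to make
  pause="${2:-10}"   # seconds to wait between passes
  i=0
  while [ "$i" -lt "$rounds" ]; do
    oc get csr --no-headers | awk '/Pending/ { print $1 }' |
      while read -r csr; do
        oc adm certificate approve "$csr" </dev/null
      done
    i=$((i + 1))
    if [ "$i" -lt "$rounds" ]; then
      sleep "$pause"
    fi
  done
}

# Example: three rounds, ten seconds apart
#   approve_pending_csrs 3 10
```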
(2020-07-30 4.5 GA)
(2020-06-23 4.4 GA)
2020-06
- OOMKilled on a large POWER system, while the same set of pods runs fine on a smaller x86 system; tinkering with the slub_max_order kernel param is a possible remedy
=> https://medium.com/ibm-cloud/fortifying-ibm-cloud-private-for-large-enterprise-power-systems-6119804f0103
(2020-04-30 4.3 GA)
2020-02
- 4.x image registry is not accessible externally by default; need default-route created
=> https://docs.openshift.com/container-platform/4.3/registry/securing-exposing-registry.html
=> https://docs.openshift.com/container-platform/4.4/registry/configuring-registry-operator.html#registry-operator-default-crd_configuring-registry-operator
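Creating the default route comes down to a one-line patch of the registry operator config, as described in the linked docs. The wrapper function names below are just for illustration.

```shell
# Ask the registry operator to create the default external route.
enable_registry_route() {
  oc patch configs.imageregistry.operator.openshift.io/cluster \
    --type merge --patch '{"spec":{"defaultRoute":true}}'
}

# Print the hostname of the resulting route.
registry_route_host() {
  oc get route default-route -n openshift-image-registry \
    -o jsonpath='{.spec.host}{"\n"}'
}
```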