Before You Begin: Set Env Vars
The following are required (specific tasks may require just a subset of these):
export COMPANY=ev.younite
export KOPS_STATE_STORE="s3://<your-s3-bucket-name>"
export KOPS_CLUSTER_NAME=<your-cluster-name>
export KUBECONFIG="<path-to-kube-config-file>"
export AWS_ACCESS_KEY_ID=<cluster-aws-access-key>
export AWS_SECRET_ACCESS_KEY=<cluster-aws-secret>
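If a command below fails unexpectedly, a quick sanity check is to confirm none of these variables are unset. A minimal bash sketch (the variable list simply mirrors the exports above):
# Print a warning for any required variable that is empty or unset
for v in COMPANY KOPS_STATE_STORE KOPS_CLUSTER_NAME KUBECONFIG AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
  [ -z "${!v}" ] && echo "WARNING: $v is not set"
done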
Useful Commands for Troubleshooting the Cluster
Permission Denied When Running KOPS, Kubectl and/or AWS CLI Commands
Make sure you are the cluster admin:
aws iam get-user # This requires IAM get-user permission
aws sts get-caller-identity # this always works
Log back into the cluster:
kops export kubecfg --admin # stays logged in for 18 hours
Cluster Can’t Retrieve YOUnite Docker Images
Log the cluster back into the YOUnite ECR.
For this you will need both the YOUnite ECR AWS keys and the cluster’s AWS keys:
export AWS_ACCESS_KEY_ID=<younite-ecr-access-key>
export AWS_SECRET_ACCESS_KEY=<younite-ecr-secret>
password=`aws ecr get-login-password --region us-west-2`
Reset AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY back to the cluster key values, then run:
kubectl create secret docker-registry younite-registry --docker-server=https://160909222919.dkr.ecr.us-west-2.amazonaws.com --docker-username=AWS --docker-password=$password --docker-email=notused@younite.us
unset password
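Putting the whole sequence together, it looks roughly like this. ECR_KEY/ECR_SECRET and CLUSTER_KEY/CLUSTER_SECRET are hypothetical placeholder variables for your two key pairs:
# Use the YOUnite ECR keys only to fetch the registry password
export AWS_ACCESS_KEY_ID=$ECR_KEY AWS_SECRET_ACCESS_KEY=$ECR_SECRET
password=$(aws ecr get-login-password --region us-west-2)
# Switch back to the cluster keys before touching the cluster
export AWS_ACCESS_KEY_ID=$CLUSTER_KEY AWS_SECRET_ACCESS_KEY=$CLUSTER_SECRET
kubectl create secret docker-registry younite-registry \
  --docker-server=https://160909222919.dkr.ecr.us-west-2.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$password" \
  --docker-email=notused@younite.us
unset password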
Inspect Cluster Variables
kops edit cluster --name=$KOPS_CLUSTER_NAME
Check Status of Nodes and Pods
kubectl get nodes
kubectl get nodes --show-labels
kubectl get pods
Check What Services are Running
kubectl get services -o=custom-columns="NAME:.metadata.name,EXTERNAL-IP:.status.loadBalancer.ingress[0].hostname"
Check What Pods are Running on Which Nodes
kubectl get pods --all-namespaces -o jsonpath="{range .items[*]}{.metadata.namespace}{','}{.metadata.name}{','}{.spec.nodeName}{'\n'}{end}" | awk -F',' '{printf "%-40s %-40s %-40s\n",$1,$2,$3}' | column -t
Start/Restart a Service
Log in to the cluster and apply secrets and any necessary settings (e.g. "rbac" and "storage-class"):
kubectl apply -f <service's .service file.yml>
kubectl apply -f <service's .deployment file.yml>
Note: Some services may combine the service and deployment configuration into a single service yml file.
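As a concrete sketch, with hypothetical file names (substitute the service's actual spec files):
kubectl apply -f rbac.yml                     # settings, if the service needs them
kubectl apply -f storage-class.yml
kubectl apply -f younite-api.service.yml      # the service's .service file
kubectl apply -f younite-api.deployment.yml   # the service's .deployment file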
Delete (Stop) a Service
Log in to the cluster, then delete the service:
kubectl delete service <service-name>
Then either:
kubectl delete deployment <service-name>
#
# or, if a stateful set:
#
kubectl delete statefulset <service-name>
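If you are not sure which kind backs the service, list both and look for the service's name:
kubectl get deployments
kubectl get statefulsets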
Get, Create, Validate, Edit and Update a Cluster
kops get cluster
kops validate cluster
kops create cluster --name=$KOPS_CLUSTER_NAME --zones=us-west-2a
An example of adding an instance group to a cluster:
kops edit ig --name $KOPS_CLUSTER_NAME nodes-us-west-2a # e.g. add an instance group "nodes-us-west-2a"
kops update cluster --name=$KOPS_CLUSTER_NAME --yes
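Some edits only take effect on existing instances after a rolling update; if kops update cluster reports that one is required, run:
kops rolling-update cluster --name=$KOPS_CLUSTER_NAME --yes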
Restoring the Local Kubeconfig File
If the kube config file is accidentally deleted on your Local IT Host/control machine, it can be recreated with the following:
kops export kubecfg --name $KOPS_CLUSTER_NAME
Starting All Over - Deleting a Cluster
Deleting a cluster properly is important: if it is not deleted cleanly, artifacts are left behind and inherited by future invocations. Use the following to properly shut down a cluster, substituting your cluster’s KOPS_STATE_STORE:
kops delete cluster --name=$KOPS_CLUSTER_NAME --state=$KOPS_STATE_STORE --yes
Useful Network Debugging
Here are some tools to help resolve network connectivity problems. An issue that frequently arises is the inconsistent or delayed propagation of CNAME records to the cluster, test host, or test clients.
Adaptor Health Check Issue
Adaptors fail to start due to health check failures. The best place to start is to look at the adaptor’s log and find the first error:
kubectl get pods
kubectl logs <adaptor-pod-name>
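If the pod is restarting (CrashLoopBackOff), the current container's log may not show the failure; the previous container's log usually does:
kubectl logs --previous <adaptor-pod-name>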
Adaptor DB Connectivity Issue
The DB adaptor will not register with the master node’s health checks if it cannot connect to its datasource (e.g. database).
If the adaptor pod is showing:
younite-zone-producer1-oracle-db1-adaptor-d6cf44f-n26zk 0/1 Running 6 (3m44s ago) 63m
and kubectl describe pods <adaptor-pod> shows:
Warning Unhealthy 3s (x21 over 3m23s) kubelet Startup probe failed: HTTP probe failed with statuscode: 503
Look at the adaptor log; the first error usually includes:
IO Error: The Network Adapter could not establish the connection
That message refers not to the YOUnite adaptor but to the Java network adapter (the TCP layer). You will also see some errors later in the log about Domain Versions, but they are not relevant; they are just a thread unable to do its work.
Common sources of the problem:
- The DB is down.
- There is no route between the YOUnite adaptor and the DB (a bad peering connection, perhaps). Run the busybox pod to debug whether it is a networking/routing issue. Peering connections need to be made between the DB VPC and the cluster VPC, and the scripts that start the cluster do this. The peering connection also needs two routes: one in the DB VPC using the cluster VPC CIDR, and vice versa. This works sometimes and other times it doesn't; I have not been able to solve this riddle yet. (See the AWS CLI sketch below for inspecting the peering connection and routes.)
- Security Groups: use the "data-virtualization" security group.
- Network ACLs: each VPC has one or more subnets, and typically all of the subnets in a VPC share the same Network ACLs. By default they allow all traffic, so there shouldn't be any need to change anything here if all defaults are taken.
- The adaptor config has the wrong database IP or port (look at the adaptor's config in the YOUnite UI).
BOTTOM LINE: if the peering connection between the private default VPC and the cluster VPC doesn't work, just use the public IP of the database in the adaptor DB URL config.
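To inspect the peering connection and routes mentioned above, something like the following should work from the AWS CLI (the <vpc-id> is a placeholder; run the second command once with the DB VPC ID and once with the cluster VPC ID):
# List peering connections and their status
aws ec2 describe-vpc-peering-connections \
  --query 'VpcPeeringConnections[].{Id:VpcPeeringConnectionId,Status:Status.Code}'
# Confirm the VPC's route tables contain a route to the other VPC's CIDR
aws ec2 describe-route-tables --filters Name=vpc-id,Values=<vpc-id> \
  --query 'RouteTables[].Routes[].{Dest:DestinationCidrBlock,Peering:VpcPeeringConnectionId}'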
Curl
The Kubernetes pods do have curl loaded on them. Once you are logged in to the cluster (kops export kubecfg --admin), you can log directly into a pod and run curl, e.g. log in to an Oracle adaptor and test its connection with an Oracle DB:
kubectl exec -it younite-zone-consumer1-oracle-db1-adaptor-55f859d758-rvf2m -- /bin/sh
curl 172.31.0.117:27021
Logging in to a cluster node (instance) has proved unpredictable at best; the following has worked, but not always:
ssh -i test-keys.pem ubuntu@<public-ip-of-node>
Busybox
Busybox in a Kubernetes configuration has limitations: it runs as an instance, not a pod, and therefore tests node connectivity rather than pod connectivity.
Run Busybox
A busybox.yml file is in this test's specs directory.
kubectl apply -f busybox.yml
This will run busybox for 12 hours.
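For reference, a minimal manifest along these lines would behave as described; the actual busybox.yml in the specs directory may differ (for example, in how it attaches to the node's network):
apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "43200"]   # 43200 seconds = the 12 hours noted above
  restartPolicy: Never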
Use Busybox
kubectl exec -it busybox -- /bin/sh
Useful
- Busybox does not have curl but it has wget. It is part of the Docker network so it can use the docker hostnames. For example, to check the health of a service (note that sending the response payload to stdout, i.e. "-O -", does not work with this version of busybox):
wget -O response.txt younite-api:8080/health
cat response.txt
- Checking the health of a specific pod using the pod's IP:
wget -O response.txt 100.96.4.32:8080/health
- ping <host>
- traceroute <host>
- nslookup <host>
- To test a database connection: nc -zv <db-ip> <db-port>. For example:
nc -zv 172.31.10.174 27021
Terminate Busybox
kubectl delete -f busybox.yml
Check the CNAME Value of the YOUnite API Service
If the CNAME for the API server is api.younite.myco.com, run the following:
nslookup api.younite.myco.com
You should get a response similar to the following:
Server: 172.31.0.2
Address: 172.31.0.2#53
Non-authoritative answer:
api.younite.myco.com canonical name = a505346f7839d41dab018c1c9f95b0f4-1518693669.us-west-2.elb.amazonaws.com.
Name: a505346f7839d41dab018c1c9f95b0f4-1518693669.us-west-2.elb.amazonaws.com
Address: 52.39.170.78
Name: a505346f7839d41dab018c1c9f95b0f4-1518693669.us-west-2.elb.amazonaws.com
Address: 52.39.52.89
Slow CNAME Updates - Refresh the DNS cache
Flushing a system's DNS cache allows it to pick up refreshed DNS entries. Note that this does not guarantee that CNAME update issues are resolved: the ISP's DNS cache may not have been updated yet.
Windows
ipconfig /flushdns
OS X
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
Linux
- For systemd-based distributions:
sudo systemd-resolve --flush-caches
- For non-systemd or older distributions:
sudo service nscd restart
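To see whether a stale record is coming from the local cache or from an upstream resolver, query a public resolver directly and compare the answers (assuming dig is available on the host):
dig +short api.younite.myco.com @8.8.8.8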
Health Endpoints
Public Facing Services
Curl can be used to check any of the public facing services:
* YOUnite API
* YOUnite Data Virtualization Service
* YOUnite Notification Service
* Kibana
* YOUnite UI
See the kubernetes service spec file for the httpGet path and port number. The default values for each are supplied here:
| YOUnite Stack Service | CNAME | Default Port | Path |
|---|---|---|---|
| younite-api | api | 8080 | /health |
| younite-ui | ui | 443 | / (should get a redirect to IDP) |
| younite-notification-service | notifications | 8080 | /actuator/health |
| younite-data-virtualization-service | dvs | 8080 | /actuator/health |
| younite-kibana | kibana | 5601 | / |
For example, to check the API service endpoint for the DNS organization name younite.myco.com, run the following:
curl api.younite.myco.com:8080/health
It should respond with:
{"status":"UP","groups":["liveness","readiness"]}
Note: The default port is shown above. To get the service's actual port, see its deployment file in the Kubernetes spec directory.
Private Services
Not all services have public facing IPs, but they do have health endpoints. To test them, you will need to:
- Get the IP address of the instance/node that the service is running on. See above:
  - Check What Services are Running
  - Check What Pods are Running on Which Nodes
- Start a busybox instance in the cluster and use the wget command (instead of curl). See Use Busybox above.
| YOUnite Stack Service | Default Port | Path |
|---|---|---|
| YOUnite Off-the-Shelf Adaptors* | 8080; if multiple adaptors are running on a single node, each will have its own port (see each adaptor's deployment file) | /health |
| younite-mb (message bus) | 61613 | / |
| younite-elastic | 9200 | / |
| younite-logstash | 4560 | / |
* All adaptors are supposed to supply a health endpoint; however, implementations may choose not to provide one.
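For example, from the busybox shell (see Use Busybox above), with placeholder pod IPs:
wget -O response.txt <elastic-pod-ip>:9200/
cat response.txt
nc -zv <mb-pod-ip> 61613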
Note: The default port is shown above. To get the service's actual port, see its deployment file in the Kubernetes spec directory.