Zscalertwo.net Outages

We are new Zscaler customers (less than a year in) and have noticed that the US datacenters have been experiencing frequent outages (6-10 so far). This is obviously an issue for our business operations.

For long-standing Zscaler customers: is it normal for Zscaler to have outages? If so, what have you found is the best way to cope with these issues?

If a datacenter were to go down, we have logic in our PAC files to route traffic to the secondary datacenter – the issue is that this logic never kicks in, because the datacenter doesn’t actually go down, it just has “service degradations” or similar.

We have built additional logic to route traffic away from Zscaler’s problematic datacenter, but this requires manual intervention and sync time to pick up the new PAC changes. Is Zscaler not capable of routing traffic away from problematic datacenters automatically during outages?
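For anyone unfamiliar with why the PAC logic never kicks in: a minimal sketch of the standard primary/secondary PAC ordering is below. The hostnames are hypothetical placeholders (a real Zscaler PAC would normally reference your cloud's gateway variables instead), and the internal-bypass rule is purely illustrative.

```javascript
// Minimal PAC failover sketch -- hostnames are hypothetical placeholders.
function FindProxyForURL(url, host) {
    // Illustrative rule: internal hosts bypass the proxy entirely.
    if (host.indexOf(".corp.example.com") !== -1) {
        return "DIRECT";
    }
    // The browser walks this list left to right, but it only fails over
    // when the TCP connection to a proxy fails outright. A DC that is
    // degraded yet still accepting connections is never skipped, which
    // is exactly the gap described above.
    return "PROXY primary-dc.example.net:80; " +
           "PROXY secondary-dc.example.net:80; " +
           "DIRECT";
}
```

In other words, PAC failover is connection-level, not health-level: as long as the degraded DC completes the TCP handshake, the secondary entry is never consulted.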

Zoshinsky,

Same here. I am roughly 1 month into my Zscaler journey. My company has had it for about 6 months total.

One thing that helps here is to log in at trust.zscaler.com and subscribe to your specific datacenters. You will get alerts on the status of the selected DCs as incidents are pushed to the site. Make sure to uncheck all the default options in the other tabs as well.

This does still require manual intervention, as someone has to respond to the issue, and I don’t think there is any automation on Zscaler’s end to reroute your traffic to your backup in case of a service degradation.

Thanks

I experienced a major issue almost exactly a year ago with an old, saturated data centre in zscalergov being selected automatically by users at my main client. What I did was build two PAC files: one specifying WAS1-2 as the primary DC and CHI1-2 as secondary, with the other PAC being the reverse of that. An App Profile was tied to each, and in the event of degradation I would disable the offending App Profile in the Client Connector Portal and enable the backup. Users were then instructed to click “Update Policy” in ZCC, and they would be routed to the designated backup.
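The mirrored-PAC approach above can be sketched as follows. The gateway hostnames are hypothetical placeholders for the WAS1-2 / CHI1-2 VIPs (check your admin portal for the real ones), and the two functions are given distinct names here only so they can sit side by side; in the real files each would simply be named `FindProxyForURL`.

```javascript
// Hypothetical gateway entries for the two DCs.
var WAS = "PROXY was1-2.example.zscaler.net:80";
var CHI = "PROXY chi1-2.example.zscaler.net:80";

// PAC file A (tied to App Profile A): WAS1-2 primary, CHI1-2 secondary.
function FindProxyForURL_A(url, host) {
    return WAS + "; " + CHI + "; DIRECT";
}

// PAC file B (tied to App Profile B): the exact mirror, CHI1-2 first.
// During a degradation at WAS1-2, disable Profile A, enable Profile B,
// and have users click "Update Policy" in ZCC.
function FindProxyForURL_B(url, host) {
    return CHI + "; " + WAS + "; DIRECT";
}
```

The failover itself is still manual (flipping profiles in the portal), but because each PAC hard-codes its primary, the switch does not depend on the degraded DC actually refusing connections.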

I have since reverted this configuration back to Zscaler best practice since all of the old DCs have been decommissioned, but it is an option for you to consider if your user base is not too geographically dispersed and the outages are isolated to only a few data centres in your cloud.

Something very similar happens to us. We are seeing excessive issues on Zscloud and ZsTwo, and although our architectures are designed to be fault-tolerant, we cannot detect these degradations.

Also, in some cases we lose Zscaler’s control plane, so we can’t do much either.

Thanks Ryan, we too have the notifications set up via trust.zscaler.com, specifically for zscalertwo.net. The only downside is that they are not always timely.

Been doing further research on this - ZCC 4.0 should help the situation with ZIA DR mode configurations, and 6.2 brings easier sub-cloud management (if you are using a sub-cloud):

r/Zscaler

@MATIAS_DIAZ_GONZALEZ RE: losing the control plane - that’s concerning; we have not run into that issue.

Is there not redundancy in the datacenters that would prevent customers from experiencing these issues during service degradations? In the FAQ page below, Zscaler outlines what to expect when datacenter issues happen and what to do:

FAQ - ZscalerTrust

“For issues with intermittent timeouts and connection to a DC leading to traffic impact, we may explicitly message: failover to the configured secondary DC” – we don’t understand why Zscaler cannot do this automatically for customers. Also, there is no direct guidance on how to do this or the best way of doing it (we found a way via the PAC, through PS guidance).

It’s a real issue when a DC is degraded rather than down, since, as you said, it stays in the sub-cloud rotation.
It’s a bit easier to identify and “self-heal” these sorts of things with GRE tunnels, but very difficult for road warriors.

We’ve discussed this before, but an option for ZCC to connect to the fastest DC instead of the closest DC in these situations would help.
