Problem with zScaler DNS hijacking (advanced)

There seems to be a serious problem with how zScaler interferes with the TCP/IP stack to hijack DNS requests.

I noticed the problem when trying to resolve a random name such as jsgdfgsdf.yahoo.fr (Yahoo resolves all subdomains, so it is great for testing without worrying about the cache). I get:

> nslookup kjsdjkdssdssqdqsddd.yahoo.fr 1.1.1.1
Server:  one.one.one.one
Address:  1.1.1.1

DNS request timed out.
    timeout was 2 seconds.

Non-authoritative answer:
Name:    src.g03.yahoodns.net
Address:  212.82.100.150
Aliases:  kjsdjkdssdssqdqsddd.yahoo.fr
          rc.yahoo.com

Note the 2-second timeout. It does not depend on the target DNS resolver (1.1.1.1 in this case).

I checked with Wireshark and the behavior is peculiar to say the least (the grayed-out section is the internal suffix; you can skip these lines):

As you can see, there is a ~2-second delay after the first successful query (which turns 1.1.1.1 into one.one.one.one for display reasons), and then the request goes on correctly.

EXCEPT that there is an ICMP 3.3 (destination unreachable, port unreachable) sent back from me to 1.1.1.1. Why? And why here?

It looks like zScaler messes seriously with the resolver.

Funnily enough, a fully qualified name (with a dot at the end) is OK:

> nslookup kjhjkhjddodfdru.yahoo.fr. 1.1.1.1
Server:  one.one.one.one
Address:  1.1.1.1

Non-authoritative answer:
Name:    src.g03.yahoodns.net
Address:  212.82.100.150
Aliases:  kjhjkhjddodfdru.yahoo.fr
          rc.yahoo.com

This is even more surprising: how could the use of suffixes (automatically added by Windows, unfortunately) matter?

The next funny thing is that when you use the Windows resolver (ping kjhkjehkjze.yahoo.fr), there is no problem.

The problem really only occurs when the DNS server is queried directly (by sending the request straight to its port 53).
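
For anyone who wants to reproduce the comparison outside nslookup, here is a rough Python sketch. It sends one hand-built A query straight to port 53 and times it against the OS resolver. Unlike nslookup, it sends the name exactly as given (no suffixes are appended), so it corresponds to the trailing-dot case. The function names and the 5-second timeout are my own choices for illustration, not anything taken from Zscaler or Windows.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (class IN)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    # QTYPE = 1 (A), QCLASS = 1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def time_direct_query(name, server="1.1.1.1", timeout=5.0):
    """Send the query straight to port 53 and return the elapsed seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.sendto(build_query(name), (server, 53))
        sock.recvfrom(512)  # raises a timeout error if nothing comes back
        return time.perf_counter() - start


def time_os_resolver(name):
    """Resolve through the operating system resolver, as ping would."""
    start = time.perf_counter()
    socket.getaddrinfo(name, None)
    return time.perf_counter() - start
```

If `time_direct_query("sometest.yahoo.fr")` comes back quickly while nslookup without the trailing dot still stalls, that points at the suffix attempts rather than at the query for the name itself.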

Ah - and of course, there is no problem when zScaler is switched off.

You’ve already explained the cause and the solution :wink:

The cause is the DNS Search suffix. Windows is appending the suffixes in order, and you have a wildcard policy in ZPA. This causes ZPA to intercept the FQDN you looked up, with the appended suffix, since it matches the wildcard. The delay is caused by Zscaler querying the App Connectors to resolve this host - once it’s returned as non-resolvable, then it tries to resolve without the suffix appended.
Note that without ZPA enabled, the client would still apply the DNS Search Suffixes, except that the DNS server may respond quicker to NXDOMAIN.

I would suggest removing the wildcard in your segments. This would prevent the match on the client, and therefore mean the NXDOMAIN is returned by ZPA quicker. However - this isn’t a long term solution since you’d likely need the wildcard for your segmentation to occur.

Is this actually causing a problem, or is this simply a question about how the functionality works? Fundamentally if you have search suffixes and attempt to resolve a host, then that host will always have the suffixes appended (unless you put the dot at the end). If those suffixes match an app segment (wildcard) then ZCC will need to intercept and ask ZPA to resolve. There is caching in the cloud to optimise this, but there will also be times where it needs to query app connectors’ ability to resolve before returning the answer to the client for “fail-through” to occur.
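
To make the order of events concrete, here is a minimal Python sketch of the logic described above. The suffixes `corp.example` and `lab.example`, the wildcard `*.corp.example`, and both function names are hypothetical. The order (suffixes first, then the bare name) follows the description in this thread and is not a documented ZCC algorithm.

```python
from fnmatch import fnmatchcase


def candidate_names(name, search_suffixes):
    """Return the names in the order a suffix-appending client would try them."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no suffixes.
        return [name]
    candidates = [f"{name}.{suffix}." for suffix in search_suffixes]
    candidates.append(f"{name}.")  # the bare name is tried last
    return candidates


def intercepted(fqdn, wildcard_segments):
    """True if the name matches any wildcard application segment."""
    host = fqdn.rstrip(".").lower()
    return any(fnmatchcase(host, pattern.lower()) for pattern in wildcard_segments)


if __name__ == "__main__":
    suffixes = ["corp.example", "lab.example"]   # hypothetical search list
    segments = ["*.corp.example"]                # hypothetical wildcard segment
    for candidate in candidate_names("test.yahoo.fr", suffixes):
        print(candidate, "intercepted" if intercepted(candidate, segments) else "passed through")
```

In this sketch the yahoo.fr name itself never matches a segment. Only the suffixed candidate under the wildcard does, and that is the lookup that has to wait on the App Connectors.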

I do not have wildcards for yahoo.fr so it is not managed by zScaler and there is no reason for the delay.

Then, there is the ICMP packet which is sent back - why? Also please see where in the timeline the ICMP is sent back and how the interruption is right after the first, instantaneous PTR request.

You have a wildcard for the domains which are appended by the OS. So - these need to be resolved first. Those get resolved through interception. Once those fail, then the OS will attempt to resolve the initial FQDN.
It’s not possible to explain the ICMP response without seeing the detail. This might be spurious, since the source is the client and the destination is the server. I would expect this is due to source port translation occurring, and a response packet into the client triggers the client to send the ICMP.
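
One way to get more detail out of the capture is to look at what the ICMP message carries. A destination-unreachable message embeds the IP header and the first 8 bytes of the packet that triggered it, so the UDP ports of the rejected packet can be read back. Below is a small Python sketch of that decoding, assuming you export the raw ICMP bytes from Wireshark. The function names are mine, and only the most common unreachable codes are listed.

```python
import struct

ICMP_DEST_UNREACHABLE = 3
UNREACHABLE_CODES = {
    0: "network unreachable",
    1: "host unreachable",
    2: "protocol unreachable",
    3: "port unreachable",
}


def describe_icmp(icmp_bytes):
    """Turn the ICMP type and code into a readable label."""
    icmp_type, code, _checksum = struct.unpack("!BBH", icmp_bytes[:4])
    if icmp_type == ICMP_DEST_UNREACHABLE:
        reason = UNREACHABLE_CODES.get(code, f"code {code}")
        return f"destination unreachable ({reason})"
    return f"type {icmp_type}, code {code}"


def embedded_udp_ports(icmp_bytes):
    """Return (source port, destination port) of the rejected UDP packet."""
    inner = icmp_bytes[8:]            # skip the 8-byte ICMP header
    ihl = (inner[0] & 0x0F) * 4       # length of the embedded IP header
    src_port, dst_port = struct.unpack("!HH", inner[ihl:ihl + 4])
    return src_port, dst_port
```

If the embedded source port is 53 and the destination port is a local port that nslookup had already closed, that would be consistent with a late DNS response arriving after the client gave up on it.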

OK, I thought you meant a wildcard in the ZTA.

Yes, all the suffixes are attempted one by one and they fail. The difference is that it takes a fraction of a second without zScaler and a timeout with zScaler on.
When you say “intercepted”, you mean “intercepted by ZTA”? Even if these domains are not part of the profile? And in this case where is the resolution done? And by what?

If all DNS requests are intercepted by ZTA and resolved by ZTA, why is nslookup thisdoessdfsnotexist.skdjhfksjhfdj.com. 1.1.1.1 resolved without delay?

Why is the same resolution (with failed domains due to suffixes) immediate without zScaler?

There are no more details, which is one of my worries. This is everything related to the calls.
I am not sure why you call it “spurious”. It is a typical ICMP 3.3 reply, but of course not expected here.