Old-school troubleshooting based on the OSI model

Intro

In college, the OSI model was drilled into me, as one of the IT admin courses I was taking included the full Cisco Certified Network Associate (CCNA) track. While I never did take up a job doing pure network stuff, what I learned about the OSI model is still useful to this day.

We have paid tools such as the excellent ControlUp / eG Innovations to aid us with troubleshooting issues in virtualized environments, but it’s always good to triage your performance issue / outage with some good ol’ fashioned troubleshooting that doesn’t need paid tools. If you’re a consultant like me, you aren’t always going to be able to install/configure paid tools for your clients, or perhaps the client is VERY small, with 1 or 2 virtualization hosts, so it doesn’t make sense to set up these tools either.

This blog post reviews the OSI model and goes over the basics of each layer, with examples that are applicable to virtualized environments.

What is the OSI model?

Let’s start with the below image from Imperva

As mentioned above, I learned the OSI model in college, so I had test/exam/quiz questions on it. To remember the 7 layers of the cake, there are a few mnemonics you can use. I chose the following, which runs from layer 7 down to layer 1:

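ALL (7 – Application)
PEOPLE (6 – Presentation)
SEEM (5 – Session)
TO (4 – Transport)
NEED (3 – Network)
DATA (2 – Data Link)
PROCESSING (1 – Physical)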

I’ve been working in IT professionally (whatever that means to you) for 21 years. I believe that’s a pretty good chunk of time to have formed an efficient method for troubleshooting networked computers. From experience, the most complicated (and therefore lengthy) issues live at layers 4-7, so that’s where you DON’T want to start your troubleshooting. The worst is application troubleshooting. Why? Way too many variables at play: settings, interoperability with the underlying operating system / drivers / runtimes. Trust me, you don’t want to start your path here. Instead, I always start with the ground floor of the network device / computer: the physical layer.

1 – Physical layer

Here, you can review the basics at the physical layer that might be the culprit for your performance issues. Our first bullet list will mostly cover those physical / hardware settings that would be relevant to check on your virtualization hypervisor

  • What is the expected link speed of the network card? I had an HP desktop unit I’d re-purposed for use with ESXi last year, and I was noting issues with the mgmt. interface going down. The built-in NIC was supposed to run at 1 GbE, but refused to go beyond 10 Mbit/s, yikes! I ended up disabling it, and cutting over to a dual-head 1 GbE Intel NIC
  • Modern SSDs use the NVMe protocol and connect via M.2 or U.2 on the motherboard. However, there are still plenty of regular SATA-based drives that connect using traditional SATA cables. SATA ports come in 3 Gbit/s and 6 Gbit/s variants, and cables are often sold labelled the same way, so check that the drive is on the right port with a decent cable to ensure full bandwidth
  • Not all PCI Express slots are created equal. In some systems, only one of your x16 PCI Express slots runs at full speed. So, if you have a fancy 10 GbE NIC in your virtualization hosts like me, but also have dual or quad head 1 GbE NICs, place the slower NICs in the down-shifted x4 slot, and the 10 GbE NIC in the x16
  • RAM: Ensure your virtualization host isn’t mixing / matching RAM speeds, else all of the modules run at the slowest speed
  • ESXi by default will install and set itself to the “Balanced” power policy, which means your CPU will max out at around the 80% mark. Unless you’re working in a “green” or LEED certified datacenter, you’ll want to set it to “High Performance”
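
Going back to the first bullet, the link speed check is easy to script on a Windows machine (a guest VM or a physical box, not ESXi itself). The below is a minimal sketch, assuming the built-in NetAdapter module and its Get-NetAdapter cmdlet are available. The function names and the 1 Gbit/s default are my own placeholders, not something from a vendor.

```powershell
# Hypothetical helper: returns $true when a link speed (bits per second)
# is at least what you expect from the NIC.
function Test-LinkSpeed {
    param(
        [Parameter(Mandatory)] [uint64] $SpeedBps,
        [uint64] $ExpectedBps = 1000000000   # assumption: 1 Gbit/s expected
    )
    return ($SpeedBps -ge $ExpectedBps)
}

# Hypothetical wrapper: lists connected adapters on a Windows machine and
# flags whether each one negotiated the expected speed.
function Get-AdapterSpeedReport {
    param([uint64] $ExpectedBps = 1000000000)
    Get-NetAdapter |
        Where-Object { $_.Status -eq 'Up' } |
        Select-Object Name, InterfaceDescription, LinkSpeed,
            @{ Name = 'MeetsExpected'; Expression = { Test-LinkSpeed -SpeedBps $_.Speed -ExpectedBps $ExpectedBps } }
}
```

Run Get-AdapterSpeedReport on the machine in question. A NIC like my HP one that only negotiated 10 Mbit/s would show MeetsExpected as False.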

2 – Data Link

A few simple items to check at the data link layer include the following:

  • VLANs set on your network
  • A manually coded MAC address set on your network card. It’s possible you or someone else set one as part of a PVS implementation
  • The data link layer is also where error control is handled, so if you’re using PVS, you’ll want to inspect your network switches for any corresponding error messages
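
For the manually coded MAC check, here is a rough sketch for a Windows machine. It assumes Get-NetAdapter exposes both MacAddress (what the NIC is using now) and PermanentAddress (the burned-in value), and that the two only differ in their separators when nothing has been overridden. The helper names are hypothetical, and virtual NICs may not report a permanent address at all, so treat the result as a hint.

```powershell
# Hypothetical helper: strips separators so "00-15-5D-01-02-03" and
# "00155D010203" compare as equal.
function ConvertTo-BareMac {
    param([string] $Mac)
    return ($Mac -replace '[-:.]', '').ToUpperInvariant()
}

# Hypothetical helper: $true when the current MAC differs from the permanent one.
function Test-MacOverridden {
    param(
        [string] $CurrentMac,
        [string] $PermanentMac
    )
    # No permanent address reported: nothing to compare against, so don't flag it.
    if (-not $PermanentMac) { return $false }
    return ((ConvertTo-BareMac $CurrentMac) -ne (ConvertTo-BareMac $PermanentMac))
}

# Hypothetical wrapper: one row per adapter with its VLAN ID and override flag.
function Get-MacOverrideReport {
    Get-NetAdapter |
        Select-Object Name, MacAddress, PermanentAddress, VlanID,
            @{ Name = 'Overridden'; Expression = { Test-MacOverridden -CurrentMac $_.MacAddress -PermanentMac $_.PermanentAddress } }
}
```

Get-MacOverrideReport also prints the VLAN ID the adapter reports, which covers the first bullet on the guest side. The switch / port group side still needs checking separately.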

3 – Network

As we move up the layers, our troubleshooting steps become more complex and involved. This is where you’ll more than likely need to start using the built-in CLI tools for your OS, *nix or Windows. Here’s a good starting list, ordered based on what to start with:

-Ping: from a regular command line, or Test-NetConnection via PowerShell. You can also test individual ports via Test-NetConnection -ComputerName XXYY -Port 123

-Netstat -ano: shows a mass list of port bindings to IP addresses, along with the owning process ID. Very handy if you’re troubleshooting issues with service ports on on-prem CVAD controllers where you suspect a binding isn’t working as expected

-Tracert: I’ve used this countless times to ID issues with default gateways, that is, where a virtual device isn’t getting the correct default gateway. You’ll see this in a tracert where the routing process fails after hop 1 or 2. In a large enterprise, you could have 5-6 hops to get all the way from source to destination

-Nslookup: useful for testing basic name resolution. Citrix / VMware / Nutanix require static DNS entries for some functionality. If you suspect an entry is incorrect / missing, and need to update the DNS record in MS Active Directory DNS or Unix BIND, start with an nslookup first

-Wireshark: This tool is extremely powerful, and it’s actually useful for troubleshooting problems from layer 2 all the way to 7! However, it can certainly be used for the NETWORK LAYER to start
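
If you have more than one port to check, the Test-NetConnection call from the list above is easy to wrap in a loop. This is a minimal sketch: Test-ServicePort is a name I made up, and the ports you pass in depend entirely on what you’re testing.

```powershell
# Hypothetical wrapper around Test-NetConnection: one result row per port.
function Test-ServicePort {
    param(
        [Parameter(Mandatory)] [string] $ComputerName,
        [Parameter(Mandatory)] [int[]]  $Port
    )
    foreach ($p in $Port) {
        # Suppress the warning Test-NetConnection writes on failure;
        # the Open column in the output already tells us.
        $result = Test-NetConnection -ComputerName $ComputerName -Port $p -WarningAction SilentlyContinue
        [pscustomobject]@{
            ComputerName  = $ComputerName
            Port          = $p
            RemoteAddress = $result.RemoteAddress
            Open          = $result.TcpTestSucceeded
        }
    }
}
```

For example, Test-ServicePort -ComputerName XXYY -Port 80, 443 gives you one row per port. A blank RemoteAddress points at name resolution, so that’s your cue to move on to nslookup.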

4 – Transport

Here is where TCP and UDP based communication lives. There are some quick wins to be had here in terms of troubleshooting performance. The first relates to a cursed feature set that Microsoft developed for use with Windows Server 2003, and it’s been a thorn in many an IT person’s side ever since! It’s called the Scalable Networking Pack (both Wikipedia and Microsoft have articles on it).

For years, blogs / tech articles / even MS support engineers would recommend disabling RSS (receive side scaling). When PCs only had a single vCPU or pCPU assigned, it made sense to have it disabled, but when was the last time you worked on a system with a single core? The catch is that an RSS setting you might have put in place as best practice years ago via script/GPO/GPP/BAT/3rd party tool doesn’t actually get reset when you update VMware Tools. As such, you should evaluate ALL of the following settings on the network card properties of your physical/virtual computer. Each should be tested and enabled where possible to ensure optimal performance:

4k jumbo frames setting: Here you will want to use a proper method to ID the maximum MTU size you can set on your physical network cards, switches and virtual network cards. You can do this via vmkping -d -s on ESXi, and ping -f -l on Windows (the first flag forbids fragmentation, the second sets the payload size)
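
One thing that trips people up with the MTU test is the payload size: the number you pass to ping is the MTU minus 28 bytes (20 for the IPv4 header, 8 for the ICMP header). Here is a small sketch that does the math and builds the commands for you. The function names are hypothetical, and it assumes plain IPv4 with no IP options.

```powershell
# Largest ICMP payload that fits in one unfragmented IPv4 packet for a given MTU.
function Get-PingPayloadSize {
    param([Parameter(Mandatory)] [int] $Mtu)
    $ipv4Header = 20   # bytes, assuming no IP options
    $icmpHeader = 8    # bytes
    return ($Mtu - $ipv4Header - $icmpHeader)
}

# Builds the test commands as strings so you can review them before running.
function Get-MtuTestCommand {
    param(
        [Parameter(Mandatory)] [int]    $Mtu,
        [Parameter(Mandatory)] [string] $Target
    )
    $size = Get-PingPayloadSize -Mtu $Mtu
    [pscustomobject]@{
        Windows = "ping -f -l $size $Target"       # -f = don't fragment, -l = payload size
        ESXi    = "vmkping -d -s $size $Target"    # -d = don't fragment, -s = payload size
    }
}
```

For a 9000 byte MTU that works out to a payload of 8972. If that ping fails but a payload of 1472 (the standard 1500 byte MTU) works, something in the path isn’t passing jumbo frames.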

TCP chimney offload

Receive side scaling
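
To review RSS across a machine’s adapters without clicking through each NIC’s properties, something like the below can help. It’s a sketch that assumes the NetAdapter module’s Get-NetAdapterRss cmdlet on a reasonably current Windows build. The function names and the “Review” / “OK” labels are mine.

```powershell
# Hypothetical helper: RSS being off is only worth a second look when there
# is more than one logical CPU to spread the receive load over.
function Get-RssFinding {
    param(
        [Parameter(Mandatory)] [bool] $RssEnabled,
        [Parameter(Mandatory)] [int]  $LogicalCpuCount
    )
    if ((-not $RssEnabled) -and ($LogicalCpuCount -gt 1)) { return 'Review' }
    return 'OK'
}

# Hypothetical wrapper: one row per adapter that supports RSS.
function Get-RssReport {
    $cpuCount = [Environment]::ProcessorCount
    Get-NetAdapterRss | ForEach-Object {
        [pscustomobject]@{
            Adapter    = $_.Name
            RssEnabled = $_.Enabled
            Finding    = Get-RssFinding -RssEnabled $_.Enabled -LogicalCpuCount $cpuCount
        }
    }
}
```

Get-NetOffloadGlobalSetting is worth running alongside this. It shows the global RSS state and, on OS versions that still expose it, the chimney offload state.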

5 – Session

To be honest, you will probably never do any pure troubleshooting at the session layer. Instead, you will troubleshoot Citrix “session” disconnects that show up as layer 7 application issues but are caused by presentation layer (TLS security) problems or by network issues along the way through layers 1-4. The session layer could be considered a “legacy” layer as we transition to connectionless protocols such as UDP. However, TCP is a valid fallback for UDP sessions, so it’s important to know the basics of session-based troubleshooting. For troubleshooting Citrix-based environments using Receiver / Workspace app, refer to the previous layers’ troubleshooting notes and read on for the final two layers!

6 – Presentation layer

This is where encryption-related issues can occur. All the tech vendors remove support for older versions of TLS as time goes on, so ensure that you’ve got the right TLS versions configured on your networking devices and end-points to align with industry best practices. Protip: a certificate that’s expired on your ADC might not initially appear as a “your cert has expired” error in Workspace app. Use a daily health check, or set a calendar reminder for when your ADC / StoreFront certs are coming up for renewal
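
A daily health check for cert expiry doesn’t need much. The below is a minimal sketch, assuming PowerShell 5 or later and a host that answers TLS on the port you give it. The function names are hypothetical. It reads whatever certificate the endpoint presents and works out the days remaining. It deliberately does not validate trust, since an expired cert is exactly what we want to be able to read.

```powershell
# Hypothetical helper: whole days left until a certificate's NotAfter date.
function Get-CertDaysRemaining {
    param(
        [Parameter(Mandatory)] [datetime] $NotAfter,
        [datetime] $Now = (Get-Date)
    )
    return [int][math]::Floor(($NotAfter - $Now).TotalDays)
}

# Hypothetical helper: fetches the certificate a TLS endpoint presents.
function Get-RemoteCertificate {
    param(
        [Parameter(Mandatory)] [string] $HostName,
        [int] $Port = 443
    )
    $client = [System.Net.Sockets.TcpClient]::new($HostName, $Port)
    try {
        # Accept any certificate: the goal is to read it, not to trust it.
        $callback = [System.Net.Security.RemoteCertificateValidationCallback] { $true }
        $ssl = [System.Net.Security.SslStream]::new($client.GetStream(), $false, $callback)
        try {
            $ssl.AuthenticateAsClient($HostName)
            return [System.Security.Cryptography.X509Certificates.X509Certificate2]::new($ssl.RemoteCertificate)
        }
        finally { $ssl.Dispose() }
    }
    finally { $client.Dispose() }
}
```

Point Get-RemoteCertificate at your ADC gateway or StoreFront FQDN, pass the NotAfter value of the result to Get-CertDaysRemaining, and warn when the number drops below whatever lead time your renewal process needs.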

7 – Application

As I mentioned in the first part of this blog post, this is the least fun / most complicated layer to troubleshoot. Hopefully, the troubleshooting you’ve done with layers 1-6 has removed the need to keep the party going @ layer 7. If not, let’s get into it!

Anti-virus exclusions:

A classic way to lose CPU cycles is scanning things you don’t want scanned! 

For Citrix, check the following KB:
https://docs.citrix.com/en-us/tech-zone/build/tech-papers/antivirus-best-practices.html#virtual-apps-and-desktops

For VMware Horizon environments, check the following:
https://kb.vmware.com/s/article/2082045

Windows updates:
As per Microsoft’s insane release schedule, Windows is never really the same for long. Monthly patches will fix some issues, close security holes, and open new security holes and create new issues. Identifying whether the most recent Windows update you applied is behind a new problem comes down to good testing. If you’re running a virtualized environment, silo off a section of VMs to deploy the latest Windows updates to, and provide them to dedicated test users. As well, ensure you’ve got a formal testing process that can be filled out and tracked each month

Runtimes:
Vendors such as Citrix/VMware/Nutanix/etc will include updates to the pre-req programming language runtimes required by their apps, but they often don’t install the latest versions. To determine whether your most recently installed vendor binary has created a new issue with pre-existing installed runtimes, reboot the machine, open Event Viewer, and check for any new administrative events related to missing/replaced/corrupted runtimes.
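
If you’d rather not click through Event Viewer after the reboot, the same check can be sketched in PowerShell. This assumes you run it locally on the machine in question. The function names are hypothetical, and the provider pattern is only a starting guess at the sources runtime problems tend to log under, so adjust it to what you actually see.

```powershell
# Hypothetical helper: critical/error/warning events logged since the last boot.
function Get-EventsSinceBoot {
    param([string[]] $LogName = @('Application', 'System'))
    $boot = (Get-CimInstance -ClassName Win32_OperatingSystem).LastBootUpTime
    # Level 1 = Critical, 2 = Error, 3 = Warning
    Get-WinEvent -FilterHashtable @{ LogName = $LogName; Level = 1, 2, 3; StartTime = $boot } -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, LogName, ProviderName, Id, LevelDisplayName, Message
}

# Hypothetical helper: keeps only events whose source looks runtime related.
function Select-RuntimeEvent {
    param(
        [Parameter(Mandatory, ValueFromPipeline)] $InputObject,
        [string] $Pattern = 'SideBySide|\.NET Runtime|MsiInstaller'   # assumption: tune to taste
    )
    process {
        if ($InputObject.ProviderName -match $Pattern) { $InputObject }
    }
}
```

Get-EventsSinceBoot | Select-RuntimeEvent gives you a short list to compare against the same run from before the vendor install.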

Vendor updates:
There are a few schools of thought on how best to deploy vendor updates. One is evergreen, where you are always on the latest version of the application. The other is to follow a semi-annual, LTSR or quarterly approach. I’ve found the best approach is a hybrid based on what makes sense and what will reduce the impact should something go wrong, especially if you have an environment where proper UAT is not possible and new changes can only be validated in production

Here’s what I use for most of my clients that are on a windows 201x / Win 10 based virtual desktop:

To cover the need to patch for critical security vulnerabilities, the following elements can be patched monthly

  • The windows OS
  • Stick with and patch the “current release” channel of MS Office

Also to cover the need to patch for security issues, you’ll want to review the release notes for Google Chrome / Edge Chromium and patch monthly. That being said, both vendors are tightening the noose on legacy browser settings, and you may find your client’s internal web apps stop working as expected if you update to the latest version of either browser. Always test, or if this is not possible, send out a comm to the IT/business contact with a link to the release notes

Middleware apps such as WEM/FSLogix/CVAD/Citrix Optimizer should be updated as required when there are features you want. I’ve had to back out of plenty of changes for multiple clients where I blindly installed the latest version of each. FSLogix QA has certainly suffered since the MS takeover a few years back, and the last thing you want is to introduce a prod issue with your profile management solution “just to be on the latest” version. Again, unless there’s a critical feature / fix you want in your middleware app, and as long as you’re still supported should you need to escalate to the vendor, leave it alone

General tips / Wrap-up

The meat of the above blog post advocates for troubleshooting based on a basic understanding of the OSI model. That being said, it’s not always practical to go through all the layers. From experience, a huge chunk of the time we spend on “Why is Citrix slow” comes down to the application level. The ace in the hole for me in troubleshooting application / system issues in virtualized Citrix environments is basic PowerShell loops. You can make these more advanced, and I certainly recommend my own GitHub, or Sasha Tomet’s, for daily environment health checks if you want to go that route, but let’s start with a basic example:

Users are intermittently complaining they can’t connect to a non-persistent Citrix VDI desktop

Obviously, your first step should be to review Citrix Studio / Director for anything obvious. However, this route may not provide you with the root cause of your issue

One of my favorite uses for PowerShell is to mass collect event IDs. The below code reads all virtual desktops from a delivery group, filters out offline assets, and checks for a specific Windows event ID. You can edit the event ID to suit your own needs

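# Timestamp for the report file name, and the folder to save the report in
$ReportTime = (Get-Date).ToString('MM-dd-yyyy-hhmm-tt')
$ReportsPath = "\\UNC\PATH\TOFOLDERTOSAVEXLS"

# Load the Citrix snap-ins; stay quiet if none are installed so the check below can report it
Add-PSSnapin Citrix* -ErrorAction SilentlyContinue

If (Get-PSSnapin -Name Citrix.Sdk.Proxy.V1 -ErrorAction SilentlyContinue) {

    # Authenticate to Citrix Cloud; replace CtxCloudID with your own customer ID
    Get-XDAuthentication -CustomerId CtxCloudID

}

Else {

    Write-Warning "Please install the Citrix PowerShell SDK to access Citrix Cloud"
    Write-Warning "The script will now exit"
    Exit
}

# ImportExcel provides Export-Excel, which is used for the report at the end
If (-not (Get-Module -ListAvailable ImportExcel)) {

    Write-Host "Installing ImportExcel module and pre-reqs"
    Install-Module ImportExcel

}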


### Collect the powered-on machines from the delivery group
Write-Host "Collecting asset info from Citrix cloud" -ForegroundColor Cyan

$Assets = Get-BrokerMachine -DesktopGroupName "Your DG NAME HERE" -MaxRecordCount 2000 | Where-Object {$_.PowerState -eq "On"} | Select-Object MachineName, AssociatedUserFullNames, AssociatedUserNames, SessionClientName, SessionClientVersion, SessionStartTime, DesktopGroupName

## #Start!

$AssetsTotal = $Assets | Measure | Select-Object -ExpandProperty Count
$AssetsLeft = $AssetsTotal

$OutArray = @()

ForEach ($i in $Assets) {

    # MachineName comes back as DOMAIN\HOST; keep the host part
    $VM = $i.MachineName.Split("\")[1]
    # The user fields can hold several names; join them so they export as readable text
    $AssociatedUserFullNames = $i.AssociatedUserFullNames -join '; '
    $AssociatedUserNames = $i.AssociatedUserNames -join '; '
    $SessionClientName = $i.SessionClientName
    $SessionClientVersion = $i.SessionClientVersion
    $SessionStartTime = $i.SessionStartTime
    $DeliveryGroup = $i.DesktopGroupName

    Write-Host "Checking $VM now" -ForegroundColor Green
    Write-Host "$AssetsLeft remaining to process.." -ForegroundColor Cyan

    if (Test-Connection -ComputerName $VM -Count 1 -ErrorAction SilentlyContinue) {

        $Ping = "Online"
        
        write-host "$VM is online" -ForegroundColor Green
        
        # Get-WinEvent returns the newest events first, so -MaxEvents 1 is the most recent 1058
        $ID = Get-WinEvent -ComputerName $VM -FilterHashtable @{LogName='System' ; ID=1058} -MaxEvents 1 -ErrorAction SilentlyContinue

        If ($ID) {

            Write-Warning "Sys event 1058 found on $VM"

            # Index 6 of the event data is the field this script reports as the domain controller
            $IDXML = [xml]$ID.ToXml()
            $DC = $IDXML.Event.EventData.Data[6].'#text'

            $MSG = $ID.Message
            $Time = $ID.TimeCreated
        }

        Else {

            $MSG = "No event ID 1058 found"
            $Time = "N/A"
            $DC = "N/A"
        }

    }
    
    Else {
        
        write-host "$VM is offline" -ForegroundColor yellow
        $Ping = "Offline"
        $ID = "N/A"
        $MSG = "N/A"
        $DC = "N/A"
        $Time = "N/A"

    }

    $OutArray += New-Object PSObject -property @{

    VM = $VM
    Ping = $Ping
    MSG = $MSG
    Time = $Time
    DC = $DC
    AssociatedUserFullNames = $AssociatedUserFullNames
    AssociatedUserNames = $AssociatedUserNames
    SessionClientName = $SessionClientName
    SessionClientVersion = $SessionClientVersion
    SessionStartTime = $SessionStartTime
    DeliveryGroup = $DeliveryGroup
    }

    $AssetsLeft--

} #ForEach asset

# Full report to Excel
$OutArray | Select-Object VM, Ping, Time, DC, MSG, @{E={$_.AssociatedUserFullNames};Label='AssociatedUserFullName'}, @{E={$_.AssociatedUserNames};Label='AssociatedUserName'},`
SessionClientName, SessionClientVersion, SessionStartTime, DeliveryGroup | Export-Excel -Path "$ReportsPath\EventID-1058-$ReportTime.xlsx" -AutoFilter -AutoSize -FreezeTopRow -BoldTopRow

# Quick on-screen view of only the machines where the event was found
$OutArray | Where-Object {$_.MSG -like "The Processing*"} | Select-Object VM, Ping, Time, DC, MSG, DeliveryGroup | Out-GridView
