The file server in question is used for user home folder storage and users are accessing Outlook Personal Storage (.pst) files stored on the server from their client. The issue will manifest as either a server hang, or PagedPool depletion (Event ID 2020). Oftentimes the issue will occur first thing in the morning - when users are logging on and launching Outlook. In especially severe cases, the issue occurs several times daily. Sometimes the server will hang for a few minutes and then continue operating for a few minutes - and then hang again. Rinse & repeat. The users are frustrated because of slow access to their data, the server administrators are frustrated because they are tasked with fixing the problem, and upper management is frustrated because everyone else is frustrated.
If you're in this situation - there's good news ... and very bad news. The good news is that this problem is very common and is a known issue. The very bad news (from the customer's standpoint) is that PST files on a LAN/WAN is an unsupported configuration. Some customers are very surprised to hear this but Network Stored PST files have been unsupported since the days of Exchange 4.0. Microsoft KB Article 297019 goes into some detail about the effects of Network PST files:
"A .pst file is a file-access-driven method of message storage. File-access-driven means that the computer uses special file access commands that the operating system provides to read and write data to the file.
This is not efficient on WAN or LAN links because WAN/LAN links use network-access-driven methods, commands the operating system provides to send data to or receive from another networked computer. If there is a remote .pst (over a network link), Microsoft Outlook tries to use the file commands to read from the file or write to the file, but the operating system then has to send those commands over the network because the file is not on the local computer. This creates a great deal of overhead and increases the time it takes to read and write to the file. Additionally, the use of a .pst file over a network connection may result in a corrupted .pst file if the connection degrades or fails."
Let's use an example to illustrate the problem and also follow the problem through to its end result.
Let's say that a user sends an e-mail message to 500 users within the company. All of these users have their e-mail delivered directly to their PST file which is stored on the File Server. Some of these 500 users may need to extend their PST files to receive it. To extend a PST, an extra allocation on disk has to be made via NTFS. This locks out the whole volume while free space is allocated and the Master File Table (MFT) is updated. While this is happening for each user, all I/O for the other 499 users is on hold.
Allocating free space can take an extended time, especially if the disk is fragmented. Now factor in multiple users extending their PST's in the same timeframe, and significant periods of MFT lockout might be observed, which in turn is seen as inability to access any other file on the volume, resulting in queueing in the server service work queues, and sometimes SRV 2019, 2020, 2021, or 2022 events being logged. This scenario might overload the disk(s).
Setting aside the example of one email being sent to a group of users, imagine if you had a couple of hundred users who each have two or three PST files. These users have been with the company for a while, and they rarely (if ever!) delete their email from their PST files. The files continue to grow in size - let's use an average of 1 GB as the size of the PST file. Now consider that when each user launches Outlook, they make a request for two (or three) files, each of them being about 1 GB in size. Then consider what happens when 200 users all launch Outlook around the same time when they get to work. 200 x 3 x 1 = 600 GB of data being requested at the same time. That's an awful lot of Disk & Network I/O to process simultaneously. This is a very common scenario - the file server "freezing" for a few minutes at a time while it tries to service these requests.
The queuing in the server service work queues is what causes this temporary hang. The server service uses work items to handle I/O requests that come in over the network - for example: a request to extend a PST file. These work items are queued in the server service work queues, and from there they are handled by the server service worker threads. The work items are allocated from a kernel resource called Non-Paged Pool (NPP).
The server service sends these I/O requests down to the disk subsystem. If, for reasons mentioned above, the disk subsystem does not respond in time, the incoming I/O requests are queued via work items in the server work queues. Since these work items are allocated from NPP, eventually this resource runs empty. Running out of NPP causes systems to hang eventually (logging an Event ID 2019 in the process).
Digging down into this from more of a troubleshooting perspective, we can usually see issues caused by the PST files manifested in Poolmon and Perfmon captures. For example, we may see the LSwn pool tag allocation climbing in a Poolmon trace. These allocations are made by SRV.SYS. The size of the allocation is configurable via the SizReqBuf registry value. One allocation is made for each work item used by the server service. When looking at this through Perfmon, you will notice a steady decrease in the "Available Work Items" counter. If Available Work Items reaches zero, then clients may experience difficulties accessing files (any files, not just the PST files!). You may also experience 2019 errors if the problem lies with LSwn allocations (Non-Paged Pool depletion)
Another tag that highlights the issues with the PST files is the MmSt tag. This tag represents Mm section object prototype PTEs - a memory management-related structure used for mapped files. Put a different way, this is the pool tag that is used to map the OS memory used to track shared files. MmSt issues often manifest as Paged Pool depletion (Event ID 2020).
Is there any server-side tweaking that can be done to mitigate some of these effects? Yes. Is there any guarantee that this will resolve the problem completely and indefinitely? No. As an environment continues to scale up, the problem will continue to manifest itself despite all the tweaking that we can do. At some point, the tweaking itself may contribute to the problem because we've reached a point where the server simply cannot handle the workload.
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment