Scale Out File Server SMB redirection locking up CSVs


Problem: the physical hosts run Hyper-V, with the VHDX files located on SOFS CSVs (the Hyper-V hosts are separate from the SOFS cluster nodes). During the start of a VM when SMB redirection occurs, or when trying to move a CSV that has an active SMB connection between cluster nodes, the CSV locks up.

Physical hosts and VMs: Windows Server 2012 R2, updates as of ~July 2016
Physical hosts: Cisco C220s, latest OS updates, 1 update behind on firmware
SOFS: 2-node physical cluster with SAS-connected JBOD
4 CSVs exist, all exhibiting the same issue
The SOFS cluster nodes have the networks below:
mgmt - teamed 10G - no cluster use
cluster0 - single 10G NIC - cluster only
cluster1 - single 10G NIC - cluster only
sofs0 - single 10G NIC - cluster/client
sofs1 - single 10G NIC - cluster/client (currently set to none for troubleshooting)
backup - teamed 10G - no cluster use
livemigration - teamed 10G - no cluster use, the only network for live migrations
Cluster validation runs clean.
When nothing is connected to the CSV shares, I can fail over the CSVs and the SOFS role without errors.
Each CSV is used by a single Hyper-V server and has a single VHDX in it.

Hyper-V host networks:
sofs0 - single 10G NIC
sofs1 - single 10G NIC
backup - team
mgmt - team
customer network - team

I believe both problems are related.
Problem 1)
The CSV share is owned by sofsa.
When I boot a VM with a secondary VHDX located on the SOFS (the OS is on a local RAID disk) and check the SMBClient logs on the Hyper-V host and the SMBServer logs on the SOFS hosts, I can see:
The Hyper-V host hits sofsb.
The Hyper-V host connects, and the share is seen as asymmetric/continuous availability transfer. Witness registration completes.
sofsb issues a redirect to sofsa.
The Hyper-V host gets the redirection request and establishes a connection to sofsa (4 event log messages: SMB client reconnect, session reconnect, share reconnect, and witness registration).
In the same second as the previous 4 SMB reconnect messages, and last in the sequence, there is a 5th message: a message was received to redirect to a cluster node.
The Hyper-V host loses the session and share during the reconnect and the SMB client is moved, with no messages about a session or share reconnect.
After 59 seconds, sofsa logs errors: re-open failed (event ID 1016), client session expired.
After 60 seconds, the Hyper-V host registers a request timeout due to no response from the server. The server is responding on TCP but not on SMB (event ID 30809).
The Hyper-V host registers connections to the sofsb share and goes through the same redirection sequence to sofsa (which owns the share). SMB client reconnect, session reconnect, share reconnect, and witness registration are successful.
2 seconds later, sofsa logs reopen failed, file temporarily unavailable (event ID 1016). I can see that the source/destination/share match what is occurring. The error continues every 5 seconds.
If I try to 'inspect' the drive, Hyper-V times out, and on sofsa there is a warning (event ID 30805): client lost its session - error {network name not found} - the specified share name cannot be found, share name \sofsclustername\$ipc
The errors then repeat: client established a session to the server, client lost its session to the server, network name not found, server \sofsclustername - with the same session ID in each connect/disconnect pair.

Now the great part:
If I go to Failover Cluster (FOC) and try to move the CSV to the other node, the CSV gets stuck in Offline Pending. After a few minutes the other CSVs owned by the same node go Offline Pending and hang. I can reboot and wait 10 minutes for it to die and fail over, or wait 20 for FOC to die on both nodes of the cluster. In the cluster logs, the SOFS node never releases the CSV for the move. The last messages I see related to the volume are:
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning 4 2.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved state 2. reson 7; status 0x0.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning 2 1.

I should see:
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning 4 2.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved state 2. reson 7; status 0x0.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning 2 1.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved state 1. reson 5; status 0x0.
volume4; volume target path \??\globalroot\device\harddisk39\clusterpartition1; file system target path \??\globalroot\device\harddisk39\clusterpartition1.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning 1 setdownlevel. local true; flags 0x1; countersname
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved state 3. reson 3; status 0x0.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning 3 4.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved state 4. reson 4; status 0x0.
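
To pin down where a stuck move stops, the two excerpts above can be compared mechanically. This is only a minimal Python sketch, assuming the cluster log text looks like the lines quoted in this post; the function names are made up for illustration, and the state values are treated as opaque labels because their meaning is not documented here.

```python
import re

# Matches the log text as quoted above ("transitioning 4 2.") and also a
# fuller "transitioning from 4 to 2" wording. The exact wording of the real
# cluster log is an assumption.
TRANSITION = re.compile(r"transitioning\s+(?:from\s+)?(\w+)\s+(?:to\s+)?(\w+)")

STUCK_MOVE = """\
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning 4 2.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved state 2. reson 7; status 0x0.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning 2 1.
"""

EXPECTED_MOVE = STUCK_MOVE + """\
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved state 1. reson 5; status 0x0.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning 1 setdownlevel. local true; flags 0x1; countersname
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved state 3. reson 3; status 0x0.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} transitioning 3 4.
volume {c7cdc2d5-e1f9-40c5-b36d-43523e2996f1} moved state 4. reson 4; status 0x0.
"""


def transitions(log_text):
    """Return the (from_state, to_state) pairs in order of appearance."""
    return TRANSITION.findall(log_text)


def first_missing_transition(observed_log, expected_log):
    """Return the first expected transition the observed log never reached."""
    observed = transitions(observed_log)
    expected = transitions(expected_log)
    for index, step in enumerate(expected):
        if index >= len(observed) or observed[index] != step:
            return step
    return None  # the observed log contains the whole expected sequence


if __name__ == "__main__":
    print(first_missing_transition(STUCK_MOVE, EXPECTED_MOVE))
```

With the excerpts from this post, the first transition the stuck move never reaches is the one from state 1 to setdownlevel.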

The issue is consistent across the 4 CSVs I have. I believe the issue has existed for some time. If the Hyper-V hosts line up right and hit the SOFS server that owns the CSV, the VM boots fine. When they don't, the VMs and FOC hang, I have to go through reboots, and the VMs lose their drives and have to be rebooted as well. The issue comes up when a connection gets redirected to a different SOFS server, which leads me to the next problem.

Problem 2:
Assume the VMs connected to the right SOFS CSV owner on boot and have been running/working fine for days/weeks/months (yes, this has been sitting around for a while as an unresolved problem). If I try to move a CSV for SOFS maintenance purposes, the CSV hangs in Offline Pending. Eventually FOC hangs and I have to spend 2 hours getting things lined up right (after whatever I was planning on doing) so the VMs boot.

Things done/verified:
Windows Firewall is off
I've turned off IPv6
Removed teaming on the nodes for the sofs0/1 and cluster0/1 networks (was using a Windows team vs. individual networks)
Turned off client/network access on the sofs1 network
Turned off the CSV balancer - in hindsight it doesn't work without it, due to the redirection of CSVs caused by asymmetric storage
Updated permissions on the SOFS share to include the Hyper-V host and the SOFS cluster nodes - didn't make a difference, and I never see access denied errors

One item I see that I don't understand: on the SOFS cluster nodes, in the SMBClient/Connectivity logs, I see network connection failures to cluster addresses:

Network connection failed.
Error: {Device Timeout}
The specified I/O operation on %hs was not completed before the time-out period expired.
Server name: fe80::98f9:c138:xxxxx%32
Server address: x.x.x.x:445
Connection type: Wsk
Guidance:
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks port 445 or 5445 can also cause this issue.

The server name is 'Tunnel adapter Local Area Connection* 12:' on the other SOFS cluster node. So sofsa is generating errors connecting to sofsb, and sofsb is generating errors connecting to sofsa. This was occurring before and after the cluster0/1 network interfaces were teamed.
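
One thing that can be read off the event itself: a server name starting with fe80 and ending in %32 is an IPv6 link-local address (fe80::/10) followed by a zone index. A small Python sketch for flagging such names when skimming exported connectivity events; the function name and the sample address are made up for illustration, since the real address is masked above.

```python
import ipaddress


def is_ipv6_link_local(server_name):
    """Return True if server_name is an IPv6 link-local address (fe80::/10).

    A trailing zone index such as '%32' is stripped before parsing.
    Host names and other non-address strings return False.
    """
    address_text = server_name.split("%", 1)[0]
    try:
        address = ipaddress.ip_address(address_text)
    except ValueError:
        return False
    return address.version == 6 and address.is_link_local


if __name__ == "__main__":
    # Hypothetical address; the one in the event above is masked.
    print(is_ipv6_link_local("fe80::98f9:c138:1:2%32"))
```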



thanks-








Installed:

KB3185279
KB3179574
KB3172614

and the issue is no longer present.

Root cause of the previous behavior: the CSVs failing to migrate between failover cluster nodes (stuck in Offline Pending) or being online but inaccessible, together with the ResumeKey error messages (reopen failed) in the SMBServer event log, point to SMB key persistence failing, which was fixed in one of the above patches.
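
For anyone checking other nodes for the same three updates, here is a rough Python sketch. It assumes you have saved a listing of installed updates as text (for example the output of the PowerShell Get-HotFix cmdlet); the function name is made up for illustration, and it only looks for KB numbers in the text.

```python
import re

# The three updates named in this post.
REQUIRED_KBS = ("KB3185279", "KB3179574", "KB3172614")


def missing_updates(hotfix_listing, required=REQUIRED_KBS):
    """Return the required KB IDs that do not appear in the listing text."""
    installed = {
        kb.upper()
        for kb in re.findall(r"KB\d+", hotfix_listing, flags=re.IGNORECASE)
    }
    return [kb for kb in required if kb not in installed]


if __name__ == "__main__":
    # Hypothetical listing for one node, not taken from a real system.
    sample_listing = "KB3185279\nkb3172614\nKB3000850\n"
    print(missing_updates(sample_listing))
```

An empty list means all three updates show up in the listing for that node.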


