HDDS-14577. Handle missing metadata dir when updating container state#10565
HDDS-14577. Handle missing metadata dir when updating container state#10565priyeshkaratha wants to merge 2 commits into
Conversation
sreejasahithi
left a comment
There was a problem hiding this comment.
Thanks @priyeshkaratha for this patch ,overall LGTM
just a minor nit
| // Simulate MISSING_METADATA_DIR: delete the entire metadata directory. | ||
| File metadataDir = new File(keyValueContainerData.getMetadataPath()); | ||
| assertTrue(metadataDir.exists(), "Metadata dir should exist before corruption"); | ||
| FileUtils.deleteDirectory(metadataDir); |
There was a problem hiding this comment.
nit: you could use MISSING_METADATA_DIR.applyTo(keyValueContainer) instead because it is already used in scanner tests (TestKeyValueContainerCheck). Reusing it keeps corruption simulation consistent and avoids duplicating delete logic.
|
Hi @priyeshkaratha, the container scanner is not designed to repair containers it finds issues with, so I don't think we want to implement this. The Note that the metadata scanner runs more frequently than the data scanner, so both SCM (which gets the unhealthy ICR and sends the delete) and DN would need to be restarted every time the metadata scanner runs for this container to remain indefinitely. If it doesn't make the code too messy we could clean up the logging by not attempting to update the file if the directory does not exist, but note that this is not a common case so we should not do anything too disruptive to the main code flow to accommodate this. |
Thanks @errose28 for pointing out this. It makes sense. I am justing skipping the creation of temp file in this scenario and instead adding a debug log. |
What changes were proposed in this pull request?
When the container metadata scanner detects a MISSING_METADATA_DIR corruption, it attempts to persist the UNHEALTHY state by writing the .container file. However, writeToContainerFile creates a temp file inside the metadata directory itself — the very directory that is missing — causing a secondary StorageContainerException: Failed to create tmp file. This produces a spurious WARN and leaves the UNHEALTHY state un-persisted on disk; the state is lost on the next datanode restart.
The fix adds a single guard in KeyValueContainer.writeToContainerFile: if the metadata directory does not exist, recreate it with mkdirs() before attempting to create the temp file. This allows the .container file to be written and the UNHEALTHY state to survive a restart.
What is the link to the Apache JIRA
HDDS-14577
How was this patch tested?
Added new testcases