Terminating 50 instance generated ton of errors #21
Comments
|
Searching the CloudWatch log in CloudWatch Logs Insights with the filter @message like 'event: '.
|
|
|
Additional SNS notifications for failures: 1st, 2nd, 3rd. |
|
Here's the query output for 'waiting for retry' - I think 157 times, with the maximum wait being 5. So definitely not showing the same error that you saw last week. Also, I confirmed that the log (searching for "Success: ") showed 200 records being deleted as well. So bottom line, the code worked as expected this time. However, I am not certain that we have closed out issue 20. I did look over the 3 other entries you sent and they were all TXT records. Anyway, I am going to be somewhat unavailable most of the week as I'll be onsite for another customer, but I will try to sync back with Census VDI here and there. |
|
"Message" : "{"instance_id": "i-09690ff614af80899", "account_id": "252999262699", "client": "route53", "boto3_method": "change_resource_record_sets", "message": "change_resource_record_sets could not DELETE record", "change_resource_record_sets": {"HostedZoneId": "Z0335046OYVYX7FSFJ0N", "ChangeBatch": {"Comment": "Updated by Lambda DDNS", "Changes": [{"Action": "DELETE", "ResourceRecordSet": {"Name": "othername-vpc3-testn-apps-us-gov-west-1b-ite.ite.das.rm.census.gov.", "Type": "TXT", "TTL": 60, "ResourceRecords": [{"Value": "\"heritage=dynr53,dynr53/version=0.2.1,dynr53/account_id=252999262699,dynr53/region=us-gov-west-1,dynr53/instance_id=i-09690ff614af80899,dynr53/create_time=1648135504\""}]}}]}}}", "Message" : "{"instance_id": "i-06e47650f168c5efa", "account_id": "252999262699", "client": "route53", "boto3_method": "change_resource_record_sets", "message": "change_resource_record_sets could not DELETE record", "change_resource_record_sets": {"HostedZoneId": "Z03319983G8KBL1Y4PNBN", "ChangeBatch": {"Comment": "Updated by Lambda DDNS", "Changes": [{"Action": "DELETE", "ResourceRecordSet": {"Name": "197.24.191.10.in-addr.arpa", "Type": "TXT", "TTL": 60, "ResourceRecords": [{"Value": "\"heritage=dynr53,dynr53/version=0.2.2,dynr53/account_id=252999262699,dynr53/region=us-gov-west-1,dynr53/instance_id=i-06e47650f168c5efa,dynr53/create_time=1648465378\""}]}}]}}}", "Message" : "{"instance_id": "i-06e47650f168c5efa", "account_id": "252999262699", "client": "route53", "boto3_method": "change_resource_record_sets", "message": "change_resource_record_sets could not DELETE record", "change_resource_record_sets": {"HostedZoneId": "Z0335046OYVYX7FSFJ0N", "ChangeBatch": {"Comment": "Updated by Lambda DDNS", "Changes": [{"Action": "DELETE", "ResourceRecordSet": {"Name": "ip-10-191-24-197.ite.das.rm.census.gov.", "Type": "TXT", "TTL": 60, "ResourceRecords": [{"Value": 
"\"heritage=dynr53,dynr53/version=0.2.2,dynr53/account_id=252999262699,dynr53/region=us-gov-west-1,dynr53/instance_id=i-06e47650f168c5efa,dynr53/create_time=1648465378\""}]}}]}}}", "Message" : "{"instance_id": "i-0444664955f19f510", "account_id": "252999262699", "client": "route53", "boto3_method": "change_resource_record_sets", "message": "change_resource_record_sets could not DELETE record", "change_resource_record_sets": {"HostedZoneId": "Z0335046OYVYX7FSFJ0N", "ChangeBatch": {"Comment": "Updated by Lambda DDNS", "Changes": [{"Action": "DELETE", "ResourceRecordSet": {"Name": "ip-10-191-25-135.ite.das.rm.census.gov.", "Type": "TXT", "TTL": 60, "ResourceRecords": [{"Value": "\"heritage=dynr53,dynr53/version=0.2.2,dynr53/account_id=252999262699,dynr53/region=us-gov-west-1,dynr53/instance_id=i-0444664955f19f510,dynr53/create_time=1648465378\""}]}}]}}}", |
|
Parsing the 4 messages. Checking the Route 53 zone: no A records - those were removed correctly. |
|
Looking at the TXT/A pairing to see what records might have failed. Checking the ite.das.rm.census.gov zone closer for a missing pair. Checking all PTR zones for a missing pair. Will need to check against the CloudWatch log to see if those might have errored out in the same window. |
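The pairing check described above could be scripted roughly as follows. This is only a sketch: the function name `unpaired_txt_names` is hypothetical, and it assumes the input is a list of record-set dicts shaped like the `ResourceRecordSets` entries returned by the boto3 Route 53 `list_resource_record_sets` call. It reports names that still carry a heritage TXT record but no A (or PTR) record.

```python
def unpaired_txt_names(record_sets, heritage="dynr53", data_types=("A", "PTR")):
    """Return names that have a heritage TXT record but no A/PTR record.

    record_sets: list of dicts with "Name", "Type" and, for TXT records,
    "ResourceRecords" (assumed to match the Route 53 record-set shape).
    """
    marker = f"heritage={heritage}"
    txt_names = set()
    data_names = set()
    for rrset in record_sets:
        # Normalize so a trailing dot or case difference does not hide a pair.
        name = rrset["Name"].rstrip(".").lower()
        if rrset["Type"] in data_types:
            data_names.add(name)
        elif rrset["Type"] == "TXT":
            values = [rr["Value"] for rr in rrset.get("ResourceRecords", [])]
            # Only count TXT records that carry the heritage marker.
            if any(marker in value for value in values):
                txt_names.add(name)
    return sorted(txt_names - data_names)
```

Any name it returns is a candidate for a TXT record that failed to delete and can be cross-checked against the CloudWatch log for the same time window.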
|
Rerunning the query to identify ALL errored instances. CloudWatch Logs Insights
|
|
Looks like 4 records failed to delete since code was updated. CloudWatch Logs Insights
|
|
For |
|
For CloudWatch Logs Insights
|
|
CloudWatch Logs Insights
|
Looks like due to CloudWatch Logs Insights
|
|
Not matching. Well, the version number constructed comes from that of the running module. So what you might see in DNS is version=0.2.1 or something else. This is why I either wanted to look up the TXT and verify, or record the DNS records created in the DDB. |
So for the TXT record, we are doing a "lookup" of the existing TXT record - so technically it should always match, but for whatever reason, 6 times, it did not. Looking at it closer to see what might be causing it. |
|
OK, so I thought that the TXT record deletion would LOOK up the existing value and attempt to delete it. Looks like that's NOT the case. In my lab, I manually changed the version ID in the TXT value, looked at the log, and saw that it was generating the Invalid Record Match error (the value it was trying to delete and the existing value did not match). I put some logging inside the get_resource_record function, and none of it was being logged. I looked at the code closer, and it looks like this line (746) controls whether the resource record is retrieved instead of being "generated".
Do you have an issue if I change that to be TRUE so that any time a TXT record deletion is done, the code will LOOK up the value of the TXT record instead of trying to generate it? As you stated, any time the version changes in between, the heritage tag will be different and the TXT record will fail to delete. Given the likelihood of the TXT record being updated between instance start and terminate, I think it is safer to ALWAYS look up the value before deleting. I can move get_rr to a constant and use an ENV variable as well, but really, I don't think it should ever be False; otherwise, we run the risk of the TXT value not matching and not being cleaned up.
I looked over one of the failed logs in MA8 and I can confirm why those TXT records failed to delete.
The value in the Route 53 zone for ip-10-191-25-135.ite.das.rm.census.gov.
What the DDNS Lambda tried to delete (notice the version 0.2.2)
Also looking at your SNS topic.
What I see in Route 53
I think we have a few options: |
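To illustrate the lookup-before-delete idea, here is a minimal sketch, not the actual DDNS Lambda code. The helper names `find_record_set` and `build_delete_batch` are hypothetical. It assumes the existing record sets have already been fetched (for example from the `ResourceRecordSets` list returned by the boto3 Route 53 `list_resource_record_sets` call), and that the resulting batch would be passed as `ChangeBatch` to `change_resource_record_sets`.

```python
def find_record_set(record_sets, name, rtype):
    """Return the existing record set matching name and type, or None."""
    wanted = name.rstrip(".").lower()
    for rrset in record_sets:
        if rrset["Type"] == rtype and rrset["Name"].rstrip(".").lower() == wanted:
            return rrset
    return None


def build_delete_batch(existing_rrset, comment="Updated by Lambda DDNS"):
    """Build a DELETE ChangeBatch that echoes the record exactly as stored.

    The name, TTL and values are copied from the looked-up record rather
    than regenerated, so a changed version in the heritage tag cannot
    cause a mismatch on delete.
    """
    return {
        "Comment": comment,
        "Changes": [
            {
                "Action": "DELETE",
                "ResourceRecordSet": {
                    "Name": existing_rrset["Name"],
                    "Type": existing_rrset["Type"],
                    "TTL": existing_rrset["TTL"],
                    "ResourceRecords": list(existing_rrset["ResourceRecords"]),
                },
            }
        ],
    }
```

If `find_record_set` returns None, there is nothing to delete and the DELETE call can be skipped instead of failing.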
|
Narrowed down the issue to the code that attempts to delete the TXT record. Initially, I thought that the code (written by Don) was doing a lookup to find the value of the TXT record, but it wasn't. So the checker failed if ANY fields were updated in the heritage TXT record. Talking with Don, we are ONLY checking if the TXT was created by the DDNS Lambda AND the instance-id matches what is expected. Working through the debug now, and should be able to finish in the next 1-2 days; target completion is by the end of the week. |
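The relaxed check described above (created by the DDNS Lambda AND matching instance ID, ignoring every other field) might look something like this. It is a sketch under assumptions: the function names are hypothetical, and the TXT value format is inferred from the heritage strings in the SNS messages earlier in this thread.

```python
def parse_heritage(txt_value):
    """Split a heritage TXT value into a dict of key -> value."""
    fields = {}
    # Route 53 stores TXT values wrapped in double quotes; drop them first.
    for part in txt_value.strip().strip('"').split(","):
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_owned_by_instance(txt_value, instance_id, heritage="dynr53"):
    """True if the TXT was written by this heritage for this instance.

    Version, region, account_id and create_time are deliberately ignored,
    so a module upgrade between launch and terminate does not block cleanup.
    """
    fields = parse_heritage(txt_value)
    return (
        fields.get("heritage") == heritage
        and fields.get(f"{heritage}/instance_id") == instance_id
    )
```

With this check, a record written as version 0.2.1 would still be accepted for deletion by code running 0.2.2, as long as the instance ID matches.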
When trying to terminate all 50 instances at the same time, @badra001 received a lot of messages like this.