Government/Veterans Affairs

VA Firewall Debugging & Scaling Breakthrough

Diagnosed and resolved network-level failures blocking VA benefits submissions at national scale

6 engineers

Project Overview

The firewall issue was one of those problems that did not look like a firewall issue at first.

From the application team's perspective, the Benefits Intake API looked healthy. The Rails app was not throwing obvious errors. GovCloud logs were not showing failed requests. AWS infrastructure appeared stable. The S3 upload flow was doing what it was designed to do: keep large PDF files away from the application servers and let customers upload directly through a secure, one-time URL.

But customers were seeing something different. They were attempting to upload benefits PDFs and getting a strange response: '200 Closed TCP.'

That was the confusing part. A normal 200 implies success. But 'Closed TCP' suggested the connection had been cut off unexpectedly. So partners were left in an awkward state. From their side, the upload had failed or partially failed. From the VA application logs, it often looked like nothing had happened at all.

That mismatch became the first big clue.

The API team could not find corresponding failures inside the application stack. If the app had been rejecting files, timing out, throwing exceptions, or failing downstream submission into OBIPI or VBMS, there should have been evidence in the logs. But the failures were mostly invisible from inside GovCloud. That suggested the breakage was happening before the request reached the application layer.

At low volume, the system seemed reliable. A few hundred submissions worked fine. But as usage grew, the error rate started climbing. What had originally looked like scattered customer-side issues became a pattern. Error rates moved from roughly 1 to 5 percent into much more serious territory, eventually reaching 30 to 40 percent during heavier submission periods. That made it clear this was not just a random partner integration problem.

The team started by checking the normal suspects. They reviewed the application code, the upload flow, AWS and S3 behavior, Kong, and request routing. They examined timeouts, file sizes, retries, and customer upload behavior. Nothing fully explained the symptoms.

The architecture had been intentionally designed to avoid server bottlenecks. Customers were not uploading giant PDFs directly through the Rails application. They were receiving pre-signed S3 URLs and uploading to a controlled bucket location. That should have made the app lightweight and resilient. So if the app was not choking, and AWS was not obviously failing, the team had to look further upstream. That led to the VA firewall.

At first, the firewall team was skeptical. That is pretty common in large organizations. From their point of view, firewall policies were in place, traffic was allowed, and there was no obvious evidence that the firewall was the problem. It was easier to assume the issue was happening on the customer side or somewhere in the API implementation. But the API team had a strong reason to keep pushing: the failures were not showing up in GovCloud logs. That meant the traffic might be dying at the edge, before the application ever had a chance to record the failure.

The challenge was proving it. Anecdotes from customers were not enough. Application logs were not enough because they did not show the failure. Firewall dashboards alone were not enough because the issue was not apparent under normal traffic. The team needed a way to reproduce the failure on demand, at scale, in a way that the firewall engineers could observe directly.

So the team built a custom stress-testing tool. The tool simulated large numbers of PDF submissions through the same basic path real customers used. Instead of testing with one file at a time, it could generate hundreds of simultaneous upload attempts and push the staging environment in a way that resembled real production load. The tool was open-sourced so anyone could repeat the test exactly:

https://github.com/bastosmichael/dsva-firewall-tester
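
The core idea of such a stress tester can be sketched in a few lines. The following is a minimal Python illustration, not the code from the repository above: `run_burst`, `sweep`, and the `upload` callable are hypothetical names, and the concurrency levels are placeholders. The `upload` callable stands in for whatever performs one PDF submission and is assumed to raise an exception on failure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_burst(upload: Callable[[int], None], concurrency: int) -> float:
    """Fire `concurrency` simultaneous uploads and return the failure rate (0.0 to 1.0)."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")

    def attempt(index: int) -> bool:
        # Any exception counts as a failed upload for this sketch.
        try:
            upload(index)
            return True
        except Exception:
            return False

    # One worker per upload so all attempts are in flight at the same time.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(attempt, range(concurrency)))

    return results.count(False) / concurrency


def sweep(upload: Callable[[int], None],
          levels: Iterable[int] = (10, 100, 400)) -> Dict[int, float]:
    """Measure the failure rate at each concurrency level (levels are illustrative)."""
    return {level: run_burst(upload, level) for level in levels}
```

Running the same sweep at increasing concurrency turns an intermittent complaint into a curve: the failure rate at each level can be compared directly with what the firewall engineers observe on their side.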

That changed the investigation. Once the stress tester was running, the failures became reproducible. The issue was no longer a vague complaint from partners or a mysterious intermittent failure. The team could create the problem, watch it happen, and show exactly when it appeared. When the firewall engineers saw the test results inside their own environment, the root cause became much harder to dismiss.

The investigation revealed two major firewall problems.

First, only one of the four national firewall entry points was actually configured to accept the API traffic. So even though there were multiple available firewall paths in theory, the Benefits Intake API traffic was effectively being funneled through a single choke point. That meant the system was not truly load-balanced at the network edge.

Second, the firewall hardware did not have enough buffer memory to handle the combination of large PDF files and high concurrent upload volume. At lower usage, the buffers were fine. But as more partners submitted larger files at the same time, the firewall buffers filled up. When that happened, the firewall would prematurely close the TCP connection.
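
A toy model helps show why both problems only surfaced under load. The sketch below is an illustration under loud assumptions, not a description of the actual firewall: it assumes each in-flight upload holds a fixed amount of buffer memory, that traffic spreads evenly across the enabled entry points, and that any connection without buffer space is closed. The function name and every number used with it are hypothetical.

```python
MIB = 1024 * 1024


def modeled_error_rate(concurrent_uploads: int,
                       buffered_bytes_per_upload: int,
                       entry_points: int,
                       buffer_bytes_per_entry_point: int) -> float:
    """Fraction of uploads dropped in a simplified buffer-exhaustion model."""
    if concurrent_uploads == 0:
        return 0.0
    # How many uploads one entry point can hold in its buffer at once.
    slots_per_entry_point = buffer_bytes_per_entry_point // buffered_bytes_per_upload
    capacity = entry_points * slots_per_entry_point
    # Uploads beyond total capacity have their connections closed early.
    dropped = max(0, concurrent_uploads - capacity)
    return dropped / concurrent_uploads
```

With made-up values of 1 MiB buffered per upload and 64 MiB of buffer per entry point, 50 concurrent uploads through a single entry point fit comfortably, 100 concurrent uploads lose 36 connections, and the same 100 uploads spread across four entry points lose none. The real hardware behaves in more complicated ways, but the shape matches the symptoms: fine at low volume, sharply worse past a threshold.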

That premature close created the misleading customer-facing behavior: a response that looked like a 200, but with the connection closed before the upload completed cleanly. The application never saw the complete failure because the request had been interrupted before it reached the systems the API team controlled. That is why the GovCloud logs were clean. The problem was real, but it was happening outside the app's visibility.
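
Because the failure was only visible from the client side, a test client has to record the premature close itself. Here is a minimal sketch of that bookkeeping, assuming a Python client built on the standard library's `http.client`; the `classify_failure` function and its category labels are hypothetical and are not taken from the team's tooling.

```python
import http.client
import socket


def classify_failure(exc: BaseException) -> str:
    """Bucket a client-side upload exception into a coarse failure category."""
    early_close = (
        http.client.RemoteDisconnected,  # peer closed before sending a response
        http.client.IncompleteRead,      # response body cut off mid-transfer
        ConnectionResetError,            # TCP reset from the peer or a middlebox
        BrokenPipeError,                 # write attempted on an already closed socket
    )
    if isinstance(exc, early_close):
        return "connection_closed_early"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"
    return "other"
```

Counting the `connection_closed_early` bucket separately from timeouts and other errors gives a number that can be lined up against server-side logs. A high count on the client with nothing on the server points at something in between.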

The resolution came once everyone had the same evidence. The firewall team enabled all four firewall entry points for the API traffic. That removed the single-lane bottleneck and allowed traffic to be distributed properly. They also increased the firewall buffer memory so the hardware could handle larger files and higher concurrency. Finally, traffic was balanced across the available entry points, making the upload path more resilient. After those changes, the '200 Closed TCP' errors disappeared.

The key lesson from the firewall investigation was that the failure lived in the gap between teams. The application team could not see it in their logs. The firewall team could not easily reproduce it under normal conditions. Customers were the only ones consistently experiencing the pain.

The breakthrough was not just finding the firewall issue. It was building a tool that made the invisible failure visible to everyone. That stress-testing bot turned a debate into a shared diagnostic exercise. Once the firewall team could reproduce the failure themselves, the conversation shifted from 'Are we sure this is the firewall?' to 'Now that we can see it, how do we fix it?' That was the real turning point.

Key Challenges

  • Capturing an intermittent 200 Closed TCP response invisible to application logs
  • Earning buy-in from firewall engineers skeptical of an infrastructure root cause
  • Building compliant stress-testing harnesses for federal network boundaries
  • Coordinating changes across nationally distributed firewall hardware
  • Maintaining veteran-facing service reliability during remediation

Technologies & Solutions

  • Custom Go and Ruby load-testing tooling
  • Kong and api.va.gov traffic proxying
  • AWS GovCloud observability stack
  • Firewall buffer instrumentation and tuning
  • Collaborative incident response playbooks

Key Metrics

30-40% error rate reduced to <1%
Four national firewall entry points enabled for API traffic
Firewall buffer memory expanded to handle large PDFs
40,000+ monthly submissions restored without incident

Results & Impact

200 Closed TCP errors eliminated through firewall reconfiguration, restoring 40K+ monthly uploads
