
The Cloud’s Achilles Heel – The Network

SoftNAS began its life in the cloud and rapidly rose to become the #1 best-selling NAS in the AWS cloud in 2014, a leadership position we have maintained and continue to build upon today. We and our customers have been operating cloud native since 2013, when we originally launched on AWS. Over that time, we have helped thousands of customers move everything from mission-critical applications to entire data centers of applications and infrastructure into the cloud. In 2015, we expanded support to Microsoft Azure, which has become a tremendous area of growth for our business.

By working closely with so many customers across widely varying environments over the years, we’ve learned a lot as an organization about the challenges customers face in the cloud, and about getting to the cloud in the first place with data sets ranging from hundreds of terabytes to petabyte scale.

Aside from security, the biggest challenge area tends to be the network – the Internet. Hybrid cloud uses a mixture of on-premises and public cloud services with data transfers and messaging orchestration between them, so it all relies on the networks. Cloud migrations must often navigate various corporate networks and the WAN, in addition to the Internet.

The Internet is the data transmission system for the cloud, much as power lines distribute power throughout the electrical grid. While the Internet has certainly improved over the years, it’s still the wild west of networking.

The network is the Achilles heel of the cloud.

Developers tend to assume that the components of an application operate in close proximity to one another (that is, a few milliseconds apart across reliable networks) and that, if there’s an issue, TCP/IP will handle retries and recover from any errors. That’s the context many applications get developed in, so it’s little surprise that the network becomes such a sore spot.

In reality, production cloud applications must hold up to higher, more stringent standards of security and performance than when everything ran wholly contained within our own data centers over leased lines with conditioning and predictable performance. And the business still expects SLAs to be met.

Hybrid clouds increasingly make use of site-to-site VPNs and/or encrypted SSL tunnels through which web services integrate third-party and SaaS sites and interoperate with cloud platform services. Public cloud provider networks tend to be very high quality between their data center regions, particularly when communications remain on the same continent and within the same provider. For those needing low-latency tunnels, AWS Direct Connect and Azure ExpressRoute can provide additional conditioning for a modest fee, if they’re available where you need them.

But what about corporate WANs, which are often overloaded and plagued by latency and congestion? What about all those remote offices, branch offices, global manufacturing facilities and other remote stations that aren’t operating on pristine networks and remain unreachable by cost-effective network conditioning options?

Latency, congestion and packet loss are the usual culprits

It’s easy to overlook the fact that hybrid cloud applications, bulk data transfers and data integrations take place globally. And globally it’s common to see latencies in the many hundreds of milliseconds, with packet loss in the several percent range or higher.

In the US, we take our excellent networks for granted. The rest of the world’s networks aren’t always up to par with what we have grown accustomed to in pure cloud use cases, especially where many remote facilities are located. It’s common to see latency in the 200 to 300 millisecond range when communicating globally. When dealing with satellite, wireless or radio communications, latency and packet loss are even greater.

Unfortunately, the lingua franca of the Internet is TCP over IP; that is, TCP/IP. Here’s a chart that shows what happens to TCP/IP in the face of latency and packet loss resulting from common congestion.

The X axis represents round trip latency in milliseconds, with the Y axis showing effective throughput in Kbps up to 1 Gbps, along with network packet loss in percent along the right side. It’s easy to see how rapidly TCP throughput degrades when facing more than 40 to 60 milliseconds of latency with even a tiny bit of packet loss. And if packet loss is more than a few tenths of a percent, forget about using TCP/IP at all for any significant data transfers – it becomes virtually unusable.
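The shape of that degradation can be approximated with a widely cited rule of thumb, the Mathis model, which estimates steady-state TCP throughput as roughly MSS / (RTT × √p), where p is the packet loss rate. The sketch below is illustrative only. The function name, the 1,460-byte segment size and the 1 Gbps link cap are assumptions made for this example, and the model describes classic loss-based TCP at low loss rates rather than every modern congestion control variant.

```python
import math


def mathis_throughput_bps(rtt_s, loss_rate, mss_bytes=1460, link_bps=1e9):
    """Rough steady-state TCP throughput in bits/s (Mathis et al. approximation).

    rtt_s      round-trip time in seconds
    loss_rate  packet loss probability, e.g. 0.01 for 1%
    mss_bytes  assumed maximum segment size (typical Ethernet value)
    link_bps   assumed link capacity, used as an upper bound
    """
    if rtt_s <= 0 or loss_rate <= 0:
        # The approximation is undefined here; fall back to the link rate.
        return link_bps
    c = math.sqrt(3.0 / 2.0)  # constant from the model
    rate = (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))
    return min(rate, link_bps)


if __name__ == "__main__":
    for rtt_ms in (10, 50, 200, 800):
        for loss in (0.0001, 0.001, 0.01):
            mbps = mathis_throughput_bps(rtt_ms / 1000, loss) / 1e6
            print(f"RTT {rtt_ms:>3} ms, loss {loss:.2%}: ~{mbps:,.1f} Mbps")
```

Under these assumptions, 200 milliseconds of round-trip latency with 1% packet loss yields an estimate well under 1 Mbps, no matter how large the underlying pipe is.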

Congestion and packet loss are the real killer for TCP-based communications. And since TCP/IP is used for most everything today, it can affect most modern network services and hybrid cloud operation.

This is because TCP’s congestion control was designed to prioritize reliable delivery over throughput and performance. Here’s how it works. Each time a packet is lost, TCP treats the loss as a sign of congestion and cuts its congestion “window” (the amount of data it allows in flight) roughly in half, reducing the number of packets being sent and slowing the throughput rate. When operating over less-than-pristine global networks, sporadic packet loss is very common, which is problematic when one must transfer large amounts of data to and from the cloud: TCP/IP’s susceptibility to latency and congestion renders it practically unusable for such transfers. This isn’t a new problem. IT managers and architects are all too familiar with it and have been combating it for many years, on some networks by deploying specialized “WAN Optimizer” appliances.
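To make the window-halving behavior concrete, here is a minimal toy simulation of additive-increase, multiplicative-decrease (AIMD), the general scheme behind classic TCP congestion avoidance. It is a simplified sketch, not a model of any real TCP stack: it ignores slow start, timeouts and retransmissions, and the function name, segment size, window cap and random seed are all assumptions chosen for illustration.

```python
import random


def simulate_aimd(rtt_s, loss_rate, rounds=2000, mss_bytes=1460,
                  max_cwnd=10_000, seed=1):
    """Return the average throughput (bits/s) of a toy AIMD sender.

    Each loop iteration represents one round trip. The sender transmits
    `cwnd` packets, then either grows the window by one packet (no loss)
    or halves it (at least one packet in the round was lost).
    """
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    cwnd = 1.0
    packets_sent = 0
    for _ in range(rounds):
        in_flight = int(cwnd)
        packets_sent += in_flight
        # Probability that at least one of the in-flight packets is lost.
        p_any_loss = 1 - (1 - loss_rate) ** in_flight
        if rng.random() < p_any_loss:
            cwnd = max(1.0, cwnd / 2)        # multiplicative decrease
        else:
            cwnd = min(max_cwnd, cwnd + 1)   # additive increase
    return packets_sent * mss_bytes * 8 / (rounds * rtt_s)


if __name__ == "__main__":
    for loss in (0.0, 0.0001, 0.001, 0.01):
        mbps = simulate_aimd(rtt_s=0.2, loss_rate=loss) / 1e6
        print(f"loss {loss:.2%} at 200 ms RTT: ~{mbps:.2f} Mbps")
```

Because the window can only grow by one packet per round trip, every halving takes many round trips to recover from, and each of those round trips is longer on a high-latency path. That is why latency and loss compound each other.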

Latency and packet loss turn data transfers from hours into days, and days into weeks and months

So even though we may have paid for a 1 Gbps network pipe, latency and congestion conspire with TCP/IP to limit actual throughput to a fraction of what it would be otherwise; e.g., just a few hundred kilobits per second. When you are moving gigabytes to terabytes of data to and from the cloud or between remote locations or over the hybrid cloud, what should take minutes takes hours, and days turn into weeks or months.

We regularly see these issues with customers who are migrating large amounts of data from their on-premises datacenters over the WAN and Internet into the public cloud. A 50TB migration project that should take a few weeks turns into 6 to 8 months, dragging out migration projects, causing elongated content freezes and sending manpower and cost overruns through the roof vs. what was originally planned and budgeted.
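The arithmetic behind slipping timelines like these is easy to check. The helper below converts a data set size and a sustained throughput into days of transfer time. The function name and the sample rates are illustrative assumptions, not measurements from any customer project, and the calculation ignores protocol overhead and downtime.

```python
def transfer_days(size_tb, throughput_mbps):
    """Days needed to move `size_tb` terabytes at a sustained rate in Mbps.

    Uses decimal units: 1 TB = 10**12 bytes, 1 Mbps = 10**6 bits/s.
    """
    total_bits = size_tb * 1e12 * 8
    seconds = total_bits / (throughput_mbps * 1e6)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    for mbps in (1000, 100, 20, 0.5):
        days = transfer_days(50, mbps)
        print(f"50 TB at {mbps:>6} Mbps sustained: {days:,.1f} days")
```

At a full 1 Gbps, 50 TB would move in under five days. At a sustained 20 Mbps, the same transfer takes roughly 230 days, which is in line with the 6 to 8 months described above.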

As we repeatedly waited for customer data to arrive in the public cloud to complete cloud migration projects involving SoftNAS Cloud NAS, we realized this problem was acute and needed to be addressed. As far back as 2014, many customers approached us and asked whether we had thought about helping in this area. Several even suggested we have a look at IBM Aspera, which they said was a great solution.

In late 2014, we kicked off what turned into a multi-year R&D project to address this problem area. Our original attempts used machine learning to adapt dynamically to latency and congestion conditions. That approach failed to yield the kind of results we wanted.

Eventually, we ended up inventing a completely new network congestion algorithm (now patent pending) to break through and achieve the kind of results we see below.

We call this technology “UltraFast™.”

As can be easily seen here, UltraFast overcomes both latency and packet loss to achieve 90% or higher throughput, even when facing up to 800 milliseconds of latency and several percent packet loss. Even when packet loss is in the 5% to 10% range, UltraFast continues to get the data through these dirty network conditions.

I’ll save the details of how UltraFast does this for another blog post, but suffice it to say here that it uses a “congestion discriminator” that provides the optimization guidance. The congestion discriminator determines the ideal maximum rate to send packets without causing congestion and packet loss. And since IP networks constantly re-route packets globally, the algorithm quickly adapts and optimizes for whatever path(s) the data ends up taking end-to-end.

What UltraFast means for cloud migrations

We combine UltraFast technology with what we call “Lift and Shift” data replication and orchestration. This combo makes migration of business applications and data into the public cloud from anywhere in the world a faster, easier operation. The user simply answers some questions about the data migration project by filling in some wizard forms, then the Lift and Shift system handles the entire migration, including acceleration using UltraFast. This makes moving terabytes of data globally a simple job any IT or DevOps person can do.

Additionally, we designed Lift and Shift for “live migration”, so once it replicates a full backup copy of the data from on-premises into the cloud, it then refreshes that data so the copy in the cloud remains synchronized with the live production data still running on-premises. And if there’s a network burp along the way, everything automatically resumes from where it left off, so the replication job doesn’t have to start over each time there’s a network issue of some kind.
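The general idea of resuming instead of restarting can be shown with a small, generic sketch. This is not how Lift and Shift is implemented (those details aren’t covered here). It is a hypothetical single-file example that assumes the destination file is only ever written by this function, so its current length is a trustworthy record of how much has already been copied.

```python
import os


def resumable_copy(src, dst, chunk_size=1024 * 1024):
    """Copy `src` to `dst`, continuing from wherever a previous run stopped.

    Returns the number of bytes copied during this call.
    """
    # Bytes already present at the destination count as completed work.
    offset = os.path.getsize(dst) if os.path.exists(dst) else 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(offset)                 # skip what was already transferred
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            fout.flush()                 # keep on-disk progress current
    return os.path.getsize(dst) - offset
```

A production replication job would also need to verify that the already-copied portion still matches the source (for example, with checksums), which this sketch deliberately leaves out.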

Lift and Shift and UltraFast take a lot of the pain and waiting out of cloud migrations and global data movement. It took us several years to perfect it, but now it’s finally here.

What UltraFast means for global data movement and hybrid cloud

UltraFast can be combined with FlexFiles™, our flexible file replication capabilities, to move bulk data around to and from anywhere globally. Transfers can be point-to-point, one to many (1-M) and/or many to one (M-1). There is no limitation on the topologies that can be configured and deployed.

Finally, UltraFast can be used with Apache NiFi, so that any kind of data can be transferred and integrated anywhere in the world, over any kind of network conditions.

SUMMARY

The network is the Achilles heel of the cloud. Internet and WAN latency, congestion and packet loss hinder hybrid cloud performance, prevent timely and cost-effective cloud migrations, and slow global data integration and bulk data transfers.

SoftNAS’ new UltraFast technology, combined with Lift and Shift migration and Apache NiFi data integration and data flow management capabilities, yields a flexible, powerful set of tools for solving what have historically been expensive and difficult problems with a purely software solution that runs everywhere; i.e., on VMware or VMware-compatible hypervisors and in the AWS and Azure clouds. This powerful combination puts IT in the driver’s seat and in control of its data, overcoming the cloud’s Achilles heel.

NEXT STEPS

Visit Buurst, Inc to learn more about how SoftNAS is used by thousands of organizations around the world to protect their business data in the cloud, achieve a 100% up-time SLA for business-critical applications and move applications, data and workloads into the cloud with confidence. Register here to learn more and for early access to UltraFast, Lift and Shift, FlexFiles and NiFi technologies.

ABOUT THE AUTHOR

Rick Braddy is an innovator, leader and visionary with more than 30 years of technology experience and a proven track record of taking on business and technology challenges and making high-stakes decisions. Rick is a serial entrepreneur and former Chief Technology Officer of the CITRIX Systems XenApp and XenDesktop group and former Group Architect with BMC Software. During his 6 years with CITRIX, Rick led the product management, architecture, business and technology strategy teams that helped the company grow from a $425 million, single-product company into a leading, diversified global enterprise software company with more than $1 billion in annual revenues. Rick is also a United States Air Force veteran, with military experience in top-secret cryptographic voice and data systems at NORAD / Cheyenne Mountain Complex. Rick is responsible for SoftNAS business and technology strategy, marketing and R&D.