A transport protocol's view of Starlink
177 points | 1 month ago | 10 comments | blog.apnic.net
cagenut
1 month ago
[-]
I am soooo grateful for this post. After years and years of words and one-off measurements, this is the first time I have seen clear measuring of the key metrics, specifically for someone considering the 'fast twitch shooter' (counterstrike et al) use case. To sum:

  - the satellite handoff period is 15 seconds
  - you cycle through the three nearest satellites
  - on the farthest one, inside the 15 second stable connection window, latency is 40 - 55ms
  - on the middle one latency is 30 - 45ms
  - on the closest one latency is 20 - 35ms
  - the moment of handoff between satellites is a 70 - 100ms latency spike
  - more importantly it is a near guaranteed moment of ~10% packet loss
so my takeaway here is "it will mostly seem fine but you will stutter-lag every 15 seconds". given that not every 15 seconds will be an active moment of 'twitch off' shooting the engine will probably smooth over most of them without you noticing.
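
For anyone who wants to play with those numbers, here is a minimal sketch that turns the summary above into a synthetic latency trace. The latency ranges and the 15 second period come from the list; the strict round-robin order over the three satellites and the one-sample-per-second resolution are assumptions made only for illustration.

```python
import random

HANDOFF_PERIOD_S = 15  # handoff period from the summary above
# (low_ms, high_ms) for the closest, middle and farthest satellite
SLOT_LATENCY_MS = [(20, 35), (30, 45), (40, 55)]
SPIKE_LATENCY_MS = (70, 100)  # latency at the moment of handoff


def synthetic_trace(duration_s, seed=0):
    """Return one latency sample (ms) per second of simulated time.

    Assumes a strict round-robin over the three satellites, which the
    summary does not actually state.
    """
    rng = random.Random(seed)
    samples = []
    for t in range(duration_s):
        if t > 0 and t % HANDOFF_PERIOD_S == 0:
            # The second in which a handoff happens gets the spike range.
            low, high = SPIKE_LATENCY_MS
        else:
            slot = (t // HANDOFF_PERIOD_S) % len(SLOT_LATENCY_MS)
            low, high = SLOT_LATENCY_MS[slot]
        samples.append(rng.uniform(low, high))
    return samples
```

Feeding a trace like this into a netcode simulator or a jitter buffer model is one way to check whether the spikes would be noticeable for a given tick rate.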

this could probably be used in subtle ways the same as laghacking already is. like if you knew some packet loss was coming and you knew how your engine handled it you could do things like time jumps around corners so you appear to teleport. or if the engine is very conservative then you could at least know "don't push that corner right now, wait 2 seconds for the handoff, then go".

edit: side note, thinking about the zoom use case, this would be kind of awful? imagine dropping a syllable every 15 seconds in a conversation.

reply
bhbh
1 month ago
[-]
This is not how it works in practise. I’ve been on Starlink for more than three months, and I play CS and other competitive shooters daily. There is no such thing as 100ms latency spikes every X seconds. Same for packet loss. Teams calls are no problem at all, and the latency and quality of the Internet is comparable to that of a fibre line. You wouldn’t be able to tell the difference if you didn’t know you are on the Starlink network.
reply
brokenmachine
1 month ago
[-]
>the latency and quality of the Internet is comparible to that of a fibre lane

#doubt

reply
EricE
1 month ago
[-]
Based on what? The receiver knows where it is since it has GPS. Starlink knows the orbits of their own satellites. Why wouldn't they queue up connections to upcoming satellites and then hand off from one to the next in anticipation of the one you are on eventually rotating out?

What's up with the apparent assumption they only track one satellite at a time or until there is a communication problem? That would be stupid (and why they obviously don't do that).

reply
brokenmachine
1 month ago
[-]
Based on life experience with anything wireless.

From cellphones to wifi to bluetooth, it's all shit compared to fiber.

reply
j45
1 month ago
[-]
I have tried Starlink as well and video calls have been fine.
reply
teleforce
1 month ago
[-]
Such a quality written article on satellite networking technology, kudos to APNIC.

This makes me wonder perhaps TCP is not really suitable or optimized for satellite network.

John Ousterhout (of Tcl/Tk fame) recently proposed Homa, a new transport protocol, as an alternative to TCP in the data center [1]. Perhaps a new, more suitable transport protocol for satellite or NTN links is also needed. That's the beauty of the Internet: the transport protocol is expendable, but the network protocol (IP) is not, as shown by the fact that IPv6 is still fringe rather than mainstream even though it's arguably better than IPv4.

[1] Homa, a transport protocol to replace TCP for low-latency RPC in data centers:

https://news.ycombinator.com/item?id=28204808

reply
supriyo-biswas
1 month ago
[-]
Throwing out TCP for a message-oriented layer as Homa does is not really required for addressing this need.

Perhaps what would be more useful in this context would be for operating system vendors to perform an HTTP request to a globally distributed endpoint, similar to captive portal detection, and then use a more aggressive congestion control algorithm on networks with good throughput but high latency.
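
A minimal sketch of the selection step, assuming the probe has already produced an RTT and a throughput estimate. The thresholds and the choice of "bbr" as the more aggressive algorithm are hypothetical placeholders, not values any operating system is known to use.

```python
def pick_congestion_control(rtt_ms, throughput_mbps,
                            rtt_threshold_ms=100.0,
                            throughput_threshold_mbps=50.0):
    """Pick a congestion control name from probe results.

    Both thresholds are illustrative assumptions. A path counts as
    "good throughput but high latency" only if it exceeds both.
    """
    high_latency = rtt_ms >= rtt_threshold_ms
    good_throughput = throughput_mbps >= throughput_threshold_mbps
    if high_latency and good_throughput:
        return "bbr"
    return "cubic"
```

A real implementation would also have to decide how often to re-probe, since the path can change when the device moves between networks.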

reply
teleforce
1 month ago
[-]
I am not suggesting something like Homa, which caters to data center traffic, but a new transport protocol that copes with the rapid delay variation (jitter) that is unique to LEO networking.
reply
orangeboats
1 month ago
[-]
>The fact that IPv6 still a fringe rather than becoming mainstream although it's arguably better than IPv4.

FWIW, Starlink does IPv6 and CGNAT'd IPv4.

reply
NelsonMinar
1 month ago
[-]
I noticed a huge improvement just from switching to stock BBR on my Starlink connection as well. During a particularly congested time I was bouncing between 5 and 12 Mbps via Starlink. With BBR enabled I got a steady 12. The main problem is that you need BBR on the server for this to work; as a client using Starlink I don't have any control over what all the servers I connect to are doing (other than the one server I was testing with).
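
For reference, on Linux the sending side can also request BBR per socket rather than system-wide. A minimal sketch, assuming a Linux host where the `tcp_bbr` module is available; the helper name `try_enable_bbr` is made up for this example.

```python
import socket


def try_enable_bbr(sock):
    """Ask the kernel to use BBR for this one socket.

    Returns True on success, False if the platform or kernel refuses.
    """
    # TCP_CONGESTION is only defined by Python's socket module on Linux.
    opt = getattr(socket, "TCP_CONGESTION", None)
    if opt is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, b"bbr")
    except OSError:
        # For example: BBR not loaded, or not allowed for this user.
        return False
    return True
```

This only affects data the socket's owner sends, which is why it has to run on the server for downloads to benefit.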

I like Huston's idea of a Starlink-tuned BBR. I wonder if it's a kind of traffic shaping that SpaceX could apply themselves in their ground station datacenters? That'd involve messing with the TCP stream though, so maybe a bad idea.

The fact that Starlink has this 15 second switching built in is pretty weird, but you can definitely see it in every continuous latency measure. Even weirder it seems to be globally synchronized: all the hundreds of thousands of dishes are switching to new satellites the same millisecond, globally. Having a customized BBR aware of that 15 second cycle is an interesting idea.
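
If the switching really is aligned to the wall clock, anything that wants to be "aware of the cycle" mostly needs to know how far away the next boundary is. A minimal sketch: the 15 second period is from the discussion above, while the 12 second offset within the minute is an assumption that should be replaced with whatever your own latency measurements show.

```python
import time

HANDOFF_PERIOD_S = 15.0
HANDOFF_OFFSET_S = 12.0  # assumed phase within the minute; measure your own


def seconds_until_handoff(now_s, period=HANDOFF_PERIOD_S,
                          offset=HANDOFF_OFFSET_S):
    """Seconds from now_s (Unix time) to the next handoff boundary.

    Returns 0.0 when now_s falls exactly on a boundary.
    """
    return (offset - now_s) % period


# Usage, assuming the local clock is NTP-synchronised:
# remaining = seconds_until_handoff(time.time())
```

A congestion controller or an application could use a value like this to discount loss and delay samples taken right around the boundary.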

reply
btilly
1 month ago
[-]
If you use a VPN, wouldn't it suffice to just make your VPN connection use BBR?

Ditto if you use an https proxy of some kind.

reply
Hikikomori
1 month ago
[-]
Proxy yes, VPN no. TCP over a TCP-based VPN is bad, and a non-TCP VPN would make no difference compared to no VPN.
reply
jofla_net
1 month ago
[-]
I would guess that it would be beneficial, but again only if you're using a TCP VPN, which is suboptimal for other reasons. I think the problem is called TCP meltdown. If that is all you have access to, though, I'm sure it would help.
reply
Alex-Programs
1 month ago
[-]
https://github.com/apernet/hysteria has the option to use https://github.com/apernet/tcp-brutal, a deliberately unfair/selfish congestion control algorithm.

It's designed to mitigate certain methods of blocking-via-throttling.

I looked into it for a report I wrote a while back, and I was surprised to find that nobody has made something purpose-built for greedy TCP congestion handling in order to improve performance at the expense of others. If there is such a thing, I couldn't find it. Perhaps I'm a little too cynical in my expectations!

Maybe TCP-over-TCP is so bad that it's not worth it?

reply
kjellsbells
1 month ago
[-]
In the US, there is a program kicking off known as BEAD that throws money at states to implement broadband access to underserved communities. Fiber being the classic mechanism: cheap, fast, reliable.

With new occupants in the White House, I expect that the FCC will come under intense pressure to allow Starlink service to qualify as broadband. This article makes me wonder whether Starlink could ever provide decent broadband service.

reply
chiph
1 month ago
[-]
I'm not sure BEAD is needed, as the Telecommunications Act of 1996 expanded the definition of Universal Service to include high-speed internet. The FCC later set up the Connect America Fund in 2013 to administer this, defining high speed as 10/1 broadband.

https://www.fcc.gov/general/universal-service

https://www.fcc.gov/general/connect-america-fund-caf

A friend is served by a rural cooperative telco and has fiber to the home. They were able to get grants from the Connect America Fund to run fiber in their service area, so I think it's been a success. It may be that there are other rural telcos who don't know they can do this, or are just bad at writing grant requests. And BEAD won't help solve that.

reply
tonyarkles
1 month ago
[-]
It’s an interesting article. I’ve had Starlink for close to two years now at a Canadian farm and while I have noticed (and tested to verify) the kind of variance they’re seeing… in practice it has been a complete non-issue. Teams calls with 20 people, massive bulk data uploads and downloads from S3, etc. and the service has been basically flawless. There have been a couple of 1-2 minute outages here and there, generally during very intense thunderstorms.

And then through a series of fortunate events for me I ended up with gigabit symmetric fibre out there this fall. I haven’t cancelled the Starlink subscription yet, still waiting to feel confident that the fibre is reliable long-term.

reply
kevincox
1 month ago
[-]
> the endpoints need to use large buffers to hold a copy of all the unacknowledged data, as is required by the TCP protocol.

It makes me wonder if anyone has tried to break down the layers to optimize this. In the fairly common case of serving a file off of long-term storage you can just fetch the old data if needed (likely from the page cache anyways, but still better than duplicating it) and some encryption algorithms are seekable so you can redo the encryption as well.

Right now the kernel doesn't really have a choice but to buffer all unacknowledged data as the UNIX socket API has no provision for requesting old data from the writer. But a smarter API could put the application in charge of making that data available as required and avoid the need for extra copies in common cases.
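
To make the idea concrete, here is a toy user-space sketch of such an API. Nothing like `FileBackedSender` exists in the socket API; the class and its method names are hypothetical. It tracks which byte ranges are unacknowledged but keeps no copy of the data, re-reading the file when a retransmission is requested.

```python
class FileBackedSender:
    """Toy sender that re-reads the source file instead of buffering."""

    def __init__(self, path, segment_size=1024):
        self._file = open(path, "rb")
        self._segment_size = segment_size
        self._next_offset = 0   # first byte not yet sent
        self._acked_up_to = 0   # all bytes below this are acknowledged

    def _read(self, offset):
        self._file.seek(offset)
        return self._file.read(self._segment_size)

    def next_segment(self):
        """Return (offset, data) for the next unsent segment."""
        offset = self._next_offset
        data = self._read(offset)
        self._next_offset += len(data)
        return offset, data

    def ack(self, offset):
        """Record that every byte below offset has been acknowledged."""
        self._acked_up_to = max(self._acked_up_to,
                                min(offset, self._next_offset))

    def retransmit(self, offset):
        """Re-read a sent but unacknowledged segment from the file."""
        if not (self._acked_up_to <= offset < self._next_offset):
            raise ValueError("offset is not in the unacknowledged range")
        return self._read(offset)

    def close(self):
        self._file.close()
```

The sketch assumes the file does not change while it is in flight, which is the same assumption a real implementation would have to enforce or detect.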

I know that Netflix did lots of work with the FreeBSD kernel for file to socket and eventually adding in-kernel TLS to remove user space from the equation. But I don't know if they went as far as removing the socket buffers.

reply
moandcompany
1 month ago
[-]
This is an "old" problem that has historically been addressed through things like "Performance Enhancing Proxies (PEPs)" that are defined in RFC 3135 and RFC 3449. (https://en.wikipedia.org/wiki/Performance-enhancing_proxy)

In internet-style communications over satellite links (LEO or, with much longer round-trip times, GEO), link latency is substantially higher than in most terrestrial wired or wireless applications, so the acknowledgements that TCP requires take much longer to arrive. A PEP augments the connection: end-user/client devices behind an inline PEP retain their normal network settings, while the PEP takes on the task of running or starting sessions with larger TCP window sizes as a method for improving overall throughput.
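
The reason window size matters can be shown with a short calculation: a sender can have at most one window of data in flight per round trip, so throughput is capped at window / RTT. The 600 ms round-trip time used in the example below is an assumed, typical figure for a GEO path, not a number from this thread.

```python
def max_throughput_bps(window_bytes, rtt_s):
    """Upper bound on TCP throughput for a given window and RTT."""
    return window_bytes * 8 / rtt_s


def window_needed_bytes(bandwidth_bps, rtt_s):
    """Window (the bandwidth-delay product) needed to fill a link."""
    return bandwidth_bps * rtt_s / 8


# Example with an assumed 600 ms GEO round trip:
#   a 65535-byte window (no window scaling) caps out below 1 Mbps,
#   while filling a 100 Mbps link needs a window of about 7.5 MB.
# print(max_throughput_bps(65535, 0.6))
# print(window_needed_bytes(100e6, 0.6))
```

A PEP sitting next to the satellite terminal can apply the large window once, on the satellite leg, instead of relying on every client being tuned for it.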

The utility of a PEP, or PEP-acting device goes up when you imagine multiple devices, or a network of devices attached to a satellite communications terminal for WAN/backhaul connections as the link's performance can be managed at one point versus on all downstream client devices.

reply
lxgr
1 month ago
[-]
TIL that that's what they're called. Thank you!

Do you know if they became obsolete due to modern TCP stacks handling LFNs better or for some other reason?

I could imagine them being quite useful for high-loss, high-latency paths (i.e. in addition to LFNs combined with poorly tuned TCP implementations), but most wireless protocols I know (802.11, GPRS and beyond etc.) just implement an ARQ process that masks the loss at the packet or lower layer.

So maybe between that and LFN-aware TCPs, there wasn't much left for them to improve to justify the additional complexity?

reply
moandcompany
1 month ago
[-]
It's been a long time since I've paid attention to that world, but AFAIK, PEPs are still used and essential equipment for internet-style communication (i.e. TCP over IP) via GEO satellites.

It looks like in this 2022 blog post evaluating latency and throughput over Starlink, they concluded that PEPs were not being used in the Starlink network (and probably unnecessary) due to the lower latency characteristics from use of LEO satellites. They also mention that PEPs are (still) commonly employed by GEO-satcom based operators.

https://blog.apnic.net/2022/11/28/fact-checking-starlinks-pe...

reply
lxgr
1 month ago
[-]
> they concluded that PEPs were not being used in the Starlink network

That makes sense, given that they're probably most useful for high-latency networks. But what I find quite surprising is that Starlink does nothing about the 1-2% packet loss, as described in TFA; I'd really have expected them to fix that using an ARQ at a lower layer.

Then again, maybe that's a blessing – indiscriminate ARQs like that would be terrible for time critical things like A/V, which can usually tolerate packet loss much better than ARQ-induced jitter.

Thinking about it, that actually strengthens the case for PEPs: They could improve TCP performance (and maybe things like QUIC?), while leaving non-stream oriented things (like non-QUIC UDP) alone.

Maybe Starlink just expects BBR to eventually become the dominant TCP congestion control algorithm and the problem to solve itself that way?

reply
cyberax
1 month ago
[-]
> It makes me wonder if anyone has tried to break down the layers to optimize this.

Yep. There was a bunch of proxy servers that optimized HTTP for satellite service. I used Globax back in the day to speed up one-way satellite service: https://web.archive.org/web/20040602203838/http://globax.inf...

Back then traffic was around 10 cents per megabyte in my city, so satellite service was a good way to shave off these costs.

reply
lxgr
1 month ago
[-]
There aren't necessarily extra copies even when just using TCP, thanks to sendfile(2) and similar mechanisms.

Buffer size isn't that much of an issue either, given the relatively low latencies involved and that you can indicate which parts exactly are missing pretty accurately these days with selective TCP acknowledgements, so you'll need at most a few round trips to identify these to the sender and eventually receive them.
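
As an illustration of how precisely the gaps can be identified, here is a small sketch that derives the missing byte ranges from a cumulative acknowledgement plus a list of selectively acknowledged blocks. The function name and the use of plain byte offsets (rather than wrapping 32-bit sequence numbers) are simplifications for this example.

```python
def missing_ranges(cumulative_ack, sack_blocks):
    """Return the [start, end) byte ranges the receiver is still missing.

    cumulative_ack: first byte the receiver has not received in order.
    sack_blocks: list of (start, end) ranges received out of order.
    Only holes below the highest SACKed byte are reported.
    """
    holes = []
    cursor = cumulative_ack
    for start, end in sorted(sack_blocks):
        if start > cursor:
            holes.append((cursor, start))
        cursor = max(cursor, end)
    return holes
```

For example, a cumulative ACK of 1000 with SACK blocks (2000, 3000) and (4000, 5000) tells the sender that exactly bytes 1000-2000 and 3000-4000 need to be resent.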

Practically, you'll probably not see much loss anyway, for better or worse: TCP historically interpreted packet loss as congestion, instead of actual non-congestion-induced loss on the physical layer. This is why most lower-layer protocols with higher error rates than Ethernet usually implement some sort of lower-layer ARQ to present themselves as "Ethernet-like" to upper layers in terms of error/loss rate, and this in turn has made loss-tolerant TCP (such as BBR, as described in the article) less of a research priority, I believe.
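
To put a rough number on how much a loss-based TCP suffers when loss is not actually congestion, the well-known Mathis approximation gives throughput ≈ (MSS / RTT) · C / sqrt(p), with C ≈ sqrt(3/2). The sketch below just evaluates that formula; the 1460-byte MSS, 40 ms RTT and 1% loss in the example are illustrative inputs, and the model describes Reno-style loss-based congestion control, not BBR.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate steady-state throughput of loss-based TCP.

    Uses the Mathis et al. approximation with C = sqrt(3/2).
    """
    c = math.sqrt(3 / 2)
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate)


# Example: 1460-byte MSS, 40 ms RTT, 1% random loss
# print(mathis_throughput_bps(1460, 0.040, 0.01) / 1e6, "Mbps")
```

With those inputs the model lands in the low single-digit Mbps range regardless of how much capacity the link has, which is the behaviour a loss-tolerant algorithm is meant to avoid.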

reply
sgt101
1 month ago
[-]
Fascinating that the throughput is about 250 Mbps. Presumably that's over the area served by one satellite? I wonder how much cache they put in each one... I vaguely remember a stat that 90% of requests (in data terms) are served from a TB of cache on the consumer internet, perhaps having the satellites gossip for cache hits would work to preserve uplink bandwidth as well. Maybe downlink bandwidth is the thing for this network though and caches just won't work.
reply
echoangle
1 month ago
[-]
I would be surprised if there is a lot or even any cache on the satellites themselves. Fast large storage that's radiation hardened would be extremely expensive, and they have a lot of satellites. The satellites are low enough that general radiation isn't that bad, but every pass through the South Atlantic Anomaly would risk damage if regular flash storage is used.
reply
Alex-Programs
1 month ago
[-]
Beyond that, how would the cache actually work?

Everything is HTTPS nowadays; you can't just MITM and stick a caching proxy in front. You could put DNS on the sat I suppose, but other than that you'd need to have a full Netflix/Cloudflare node, and the sats are far too small and numerous for that.

reply
michaelt
1 month ago
[-]
I'm working on a new type of protocol so that when there's a large-scale online event like the superbowl or a major boxing match on a streaming platform,

satellite and cable internet providers will be able to send a single copy of the video stream on the uplink, and using a new technique we've named "casting the net broadly" multiple viewers will be able to receive the same downlink packets

which we believe will have excellent scalability, enabling web-scale streaming video.

reply
spockz
1 month ago
[-]
Isn’t this what multicast does?
reply
pests
1 month ago
[-]
I can't tell if he is being sincere or just making a joke
reply
vardump
1 month ago
[-]
I think a single Starlink v1 satellite has a maximum bandwidth of about 20 Gbps. Newer versions might have a lot more.
reply
sgt101
1 month ago
[-]
I wonder how they could do that if they are on 8 250MHz bearers?
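
One way to frame the question is with the Shannon limit, C = B · log2(1 + SNR). The sketch below computes the signal-to-noise ratio needed to push a target bit rate through a given amount of spectrum. The 20 Gbps and 8 x 250 MHz figures are the ones quoted in this thread; the `reuse_factor` parameter, which models the same spectrum being reused across several spot beams or polarisations, is an assumption about how the gap might be closed, not a statement about Starlink's design.

```python
import math


def required_snr_db(target_bps, bandwidth_hz, reuse_factor=1):
    """SNR (dB) needed per beam to reach target_bps at the Shannon limit.

    reuse_factor: how many times the same spectrum is reused in parallel
    (hypothetical; 1 means a single stream over the whole band).
    """
    efficiency = target_bps / (bandwidth_hz * reuse_factor)  # bit/s/Hz
    snr_linear = 2 ** efficiency - 1
    return 10 * math.log10(snr_linear)


# Example with the thread's figures:
# print(required_snr_db(20e9, 8 * 250e6))                  # single stream
# print(required_snr_db(20e9, 8 * 250e6, reuse_factor=4))  # 4x reuse
```

A single stream would need roughly 10 bit/s/Hz, which is very demanding, so some form of spatial reuse seems like the more plausible explanation if the 20 Gbps figure is right.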
reply
jeffrallen
1 month ago
[-]
I'm wondering if what TCP wants is an ICMP side channel where the link layer can say, "for the TCP session started with cookie Z, you should know that X packet drops in Y seconds are normal and should not count against the bandwidth estimation".

Or even more precisely, "expect up to Z drops during this X ms window every Y ms".
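
A minimal sketch of how a sender might apply a hint of that second form. The `LossHint` structure and the filtering function are hypothetical, and how the hint would actually be delivered (ICMP or otherwise) is left out entirely; the sketch only shows the bookkeeping of forgiving up to Z drops per window.

```python
from dataclasses import dataclass


@dataclass
class LossHint:
    period_ms: float       # Y: how often the lossy window recurs
    window_ms: float       # X: length of each lossy window
    max_drops: int         # Z: drops to forgive per window
    phase_ms: float = 0.0  # start time of the first window


def congestion_losses(loss_times_ms, hint):
    """Return the loss timestamps that should still count as congestion."""
    counted = []
    forgiven = {}  # window index -> number of drops already forgiven
    for t in sorted(loss_times_ms):
        shifted = t - hint.phase_ms
        index = shifted // hint.period_ms
        in_window = (shifted % hint.period_ms) < hint.window_ms
        used = forgiven.get(index, 0)
        if in_window and used < hint.max_drops:
            forgiven[index] = used + 1  # expected loss, ignore it
        else:
            counted.append(t)           # unexpected loss, treat as congestion
    return counted
```

With a 15000 ms period, a 100 ms window and two forgiven drops, losses at 10, 50, 90, 7000 and 15020 ms would leave only the ones at 90 and 7000 ms counting against the bandwidth estimate.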

reply
trebligdivad
1 month ago
[-]
How different is this behaviour from using a mobile phone in a car or on a train? Don't you also get odd changes in latency there, and notice the handoffs between cells?
reply
lxgr
1 month ago
[-]
The distances involved are orders of magnitude smaller, so you don't get these effects nearly as much.
reply
koksik202
1 month ago
[-]
Retransmission and packet loss not shown?
reply