FilterHN

Efficient set-membership filters and dictionaries based on SAT

49 points

by keepamovin

5 days ago

| 3 comments

| github.com

▲

gopalv

2 days ago

My favourite part of these research publications from the US Gov is the licensing.

All of the USDS work is published with "No Copyright".

The SAT filters, however, still do not support incremental building, which is one of Bloom filters' fun features when you use them in distributed databases (you can build N of them and then OR the Bloom filters together to get a single one).

I imagine it will still be incredibly useful where you can iterate over them and do the OR the old-fashioned way, but at higher accuracy for the same size.
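To make the OR trick concrete, here is a minimal sketch of merging two Bloom filters built on separate shards. The `BloomFilter` class, its `m`/`k` defaults, and the salted SHA-256 hashing are all illustrative assumptions, not anything taken from the linked repo.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: m bits, k hash functions (both values are illustrative)."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        # Only meaningful when both filters share m, k, and the hash scheme.
        if (self.m, self.k) != (other.m, other.k):
            raise ValueError("filters must share parameters")
        merged = BloomFilter(self.m, self.k)
        merged.bits = self.bits | other.bits  # bitwise OR of the two bit arrays
        return merged


# Build one filter per shard, then OR them into a single filter.
shard_a, shard_b = BloomFilter(), BloomFilter()
for key in ("alice", "bob"):
    shard_a.add(key)
for key in ("carol", "dave"):
    shard_b.add(key)
combined = shard_a.union(shard_b)
```

The merged filter answers "maybe present" for every key added to either shard, which is exactly what a filter built from the union of the keys would do with the same parameters.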

▲

inasio

2 days ago

Membership filters are very efficient filters that guarantee no false negatives, but false positives are possible (the false-positive rate can be adjusted based on the dataset and the filter's parameters). An obvious application could be something like checking whether passengers are on a no-fly list, where false positives could be handled by further checks. As far as I know, cuckoo filters [0] are the state of the art for this, but per this work you could in principle make very efficient filters using a SAT (or XORSAT) solver that could generate many feasible solutions out of random SAT problems.

- Google Scholar pointed me to this link for a PDF of one of the papers cited in the repo [1]

[0] https://en.wikipedia.org/wiki/Cuckoo_filter

[1] http://t-news.cn/Floc2018/FLoC2018-pages/proceedings_paper_4...

▲

thaumasiotes

2 days ago

> An obvious application could be something like checking whether passengers are on a no-fly list, where false positives could be handled by further checks.

Why is this an obvious application? How does this application benefit from a "very efficient" first pass? Just the boarding process on an airplane takes 20-30 minutes; you can easily check the entire passenger manifest in an error-free way in much less time than that. People have to buy their tickets before the boarding process begins.

▲

jauntywundrkind

2 days ago

If 99% of people aren't on the list, and 1% are, if your check is super fast but makes 1% false positives, you still end up having to only do a full check on 2%. Which could be a huge huge huge win computationally.
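As a quick sanity check on that 2% figure, here is the arithmetic written out. It assumes the filter has no false negatives and that the 1% false-positive rate applies only to people who are not on the list; the function name is just for illustration.

```python
def full_check_fraction(on_list_rate, false_positive_rate):
    # Everyone on the list passes the filter (no false negatives),
    # plus a false_positive_rate share of everyone who is not on it.
    return on_list_rate + (1 - on_list_rate) * false_positive_rate


fraction = full_check_fraction(on_list_rate=0.01, false_positive_rate=0.01)
print(f"{fraction:.4f}")  # 0.0199, i.e. roughly 2% of passengers need the full check
```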

Your post is really weird to me, talking about boarding times? You start skeptical of the example & I'm confused how you think this is anything but a fine example. Ultimately there's some service running in the cloud somewhere that needs to have checks run against it. 2.9m people fly a day in the US, and whether the servers doing that work can do it efficiently or whether they do it in a dogshit bad manner seems like an obvious concern to me? https://www.faa.gov/air_traffic/by_the_numbers

I suspect the actual usage for this is in much broader, higher-traffic systems. For things that watch sizable chunks of the internet for patterns and traffic. But checking passengers against no-fly lists sounds like a pretty reasonable example use to me, and the criticism seems off base & weird in a number of dimensions that straight up don't make sense.

▲

FridgeSeal

2 days ago

Assuming the airport runs from 6am to 11pm, 2.9m people a day works out to about 47 reqs/second, which is not terribly much.

Even if we check them at both ends, and effectively double the load, that's only ~100 reqs/second. A single machine would happily handle that.
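For anyone who wants to redo the back-of-the-envelope numbers, here is a small sketch. It assumes one lookup per passenger and load spread evenly across the operating window, both of which are simplifications.

```python
PASSENGERS_PER_DAY = 2_900_000  # FAA figure quoted upthread


def requests_per_second(passengers, operating_hours):
    # Assumes one lookup per passenger, spread evenly over the window.
    return passengers / (operating_hours * 3600)


rate_17h = requests_per_second(PASSENGERS_PER_DAY, 17)  # 6am to 11pm
rate_both_ends = 2 * rate_17h  # checked at departure and at arrival

print(f"{rate_17h:.1f} req/s, {rate_both_ends:.1f} req/s when checked at both ends")
```

Real traffic is bursty, so peak load would be higher than these averages.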

▲

thaumasiotes

2 days ago

> Assuming the airport runs from 6am to 11pm

That's a strange assumption. The airports that have significant traffic are operating 24 hours.

Under the assumption that airports close between 11 and 6, there would be no such thing as a redeye flight.

▲

jauntywundrkind

1 day ago

Congratulations on re-highlighting my greatest complaint about your qualm-making, a propensity for adding factors into the mix that have nothing to do with the big picture.

To me, 47 or 37 req/s seems like a fantastically immaterial difference. It's just not a big enough change in magnitude to really affect the situation.

It's an accurate qualm, and technically correct. Personally, I'd try to find a more liberal-minded approach when holding in mind the question of what efficient set membership might be good for.

▲

thaumasiotes

1 day ago

The change in magnitude is over 20%; if you're thinking about it in terms of "change in magnitude", it's huge.

As FridgeSeal points out, both numbers are very small, but that's not a reason you'd want to set up an inaccurate triage system on top of the accurate one. If you don't have very much work to do, you don't need to invest much in optimizing it.

▲

AlotOfReading

1 day ago

Most airports are not open 24/7 in the sense that flights are departing 24/7 or that you can get through security checkpoints 24/7. They simply don't kick people out of the secure area when they shut down. You'll still have difficulties showing up to your 5am flight 3 hours early before the checkpoints open at most major airports.

▲

thaumasiotes

1 day ago

You have difficulties showing up to your flight three hours early because check-in is not available that far in advance. But the airport obviously is open that far in advance. Planes are departing and landing at that time. The check-in counter may be open (or may not; volume does go down).

It is likely true that "most airports" are not operating 24/7, but how is that relevant? It could be just as true that "most airports" don't serve commercial flights at all. The airports that have a lot of passengers are operating 24/7. We're talking about a metric assessed per passenger.

▲

AlotOfReading

10 hours ago

I'm not talking about the check-in counter, I'm talking about general security that non-charter passengers always have to go through to board a flight. Security checkpoints operate limited hours at even the largest airports, like LAX where they're open from 4h00-22h00 max. If we look at the departures for LAX today, we can see that no passenger planes departed between 2h00-5h00, except for one charter jet out of the private terminal. Yesterday had no such departures, only the usual nighttime cargo flights.

▲

rurban

1 day ago

The false-positive check is trivially just comparing the given key to the key at the resulting index. Trivial.

The only size problem is with non-ordered MPHFs, where you also need to reference the index through an index order table.

The SAT approach is cute, but doesn't scale. It might have better runtime costs as you can spare one additional table lookup. Efficient MPHFs are miles better at construction time.
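A rough sketch of the "compare the given key to the key at the resulting index" check described above. The brute-force seed search is only a toy stand-in for a real MPHF construction, and `slot`, `build`, and `contains` are hypothetical names; it assumes a small, non-empty set of distinct keys.

```python
import hashlib


def slot(key, seed, n):
    # Seeded hash of the key, reduced to a table index in [0, n).
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


def build(keys):
    """Brute-force a seed that sends each key to its own slot.

    Toy stand-in for a real MPHF; only practical for a handful of keys,
    and it assumes the keys are distinct and the list is non-empty.
    """
    n = len(keys)
    seed = 0
    while len({slot(k, seed, n) for k in keys}) != n:
        seed += 1
    table = [None] * n
    for k in keys:
        table[slot(k, seed, n)] = k
    return seed, table


def contains(key, seed, table):
    # The hash maps any key to some slot; comparing against the stored key
    # is what rules out false positives.
    return table[slot(key, seed, len(table))] == key


keys = ["alice", "bob", "carol", "dave"]
seed, table = build(keys)
```

Storing the keys themselves is what makes the answer exact, and it is also why this costs more space than a filter that keeps only a few bits per key.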

▲

convolvatron

2 days ago

The reference in the repo is paywalled (US$ 30!). I did find this https://arxiv.org/pdf/1912.08258 which may or may not be related. But what I found interesting is that the construction looks a lot like perfect hashes.

▲

joe_the_user

2 days ago

They list two references, I think. The first is from the Journal on Satisfiability. The link appears empty in my browser, but this link leads to an article by the same authors in the same journal: https://www.cs.uky.edu/~marek/papers.dir/11.dir/JSAT8_10_Wea...

The second paper was originally from a conference, and I found this link to it through Google Scholar (also listed in another comment): http://t-news.cn/Floc2018/FLoC2018-pages/proceedings_paper_4...