> Theoretically, saying, “order an Uber to the airport” seems like the easiest way to accomplish the task. But is it? What kind of Uber? UberXL, UberGo? There’s 1.5x surge pricing. Acceptable? Is the pickup point correct? What would be easier: resolving each of those queries through a computer asking questions, or taking a quick look yourself in the app?
> Another example is food ordering. What would you prefer: going through the menus of dozens of restaurants yourself, or constantly nudging the AI toward the desired option? Technological improvement can only help so much here, since users themselves don’t clearly know what they want.
How many of these inconveniences will you put up with? Any of them, all of them? What price difference makes it worthwhile? What if by traveling a day earlier you save enough money to even pay for a hotel...?
All of that is for just one flight; what if there are several alternatives? I can't imagine having a dialogue about this with a computer.
Similarly, long before Waymo, you'd get into a taxi, and tell the human driver you're going to the airport, and they'd take you there. In fact, they'd get annoyed at you if you backseat drove, telling them how to use the blinker and how hard to brake and accelerate.
The thing about conversational interfaces is that we're used to them, because we (well, some of us) interface with other humans fairly regularly, so it's a baseline skill for existing in the world today. There's a case to be made against them, but since everyone can be assumed to be conversational (though perhaps not in a given language), they're here to stay. Restaurants have menus that customers look at before using the conversational interface to get food, in order to guide the discussion, and that arrangement has had thousands of years to evolve. It might be a local maximum, but it's a pretty good one.
The whole point is that we currently have better, more efficient ways of doing those things, so why would we regress to inferior methods?
To relate to the article: Google Flights is the keyboard and mouse, covering 80% of cases very quickly. Conversational is better when you're juggling more contextual info than can be represented in a price/departure time/flight duration table. For example, "I'm bringing a small child with me and have an appointment the day before and I really hate the rain".
Rushed comment because I'm working, but I hope you get the gist.
Current flight planning UX is overfit on the 80% and will never cater to the 20%, because the cost/benefit of the development work isn't good.
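As a sketch of how those could compose (everything here is invented for illustration): the conversational layer could turn that soft context into scoring weights over the same structured flight table the keyboard-and-mouse UI already shows, instead of replacing the table.

    # Hypothetical: soft preferences extracted from conversation become
    # weights over ordinary structured flight data.
    flights = [
        {"id": "A", "price": 120, "departs_hour": 6, "rain_risk": 0.7},
        {"id": "B", "price": 180, "departs_hour": 11, "rain_risk": 0.1},
    ]

    # What an LLM might extract from "small child, appointment the day
    # before, I really hate the rain" (weights made up):
    weights = {"early_departure": 100.0, "rain": 150.0}

    def score(flight):
        penalty = flight["price"]
        if flight["departs_hour"] < 8:  # rough with a small child in tow
            penalty += weights["early_departure"]
        penalty += flight["rain_risk"] * weights["rain"]
        return penalty

    print(min(flights, key=score)["id"])  # "B", despite the higher price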
How long is it going to take you to get to a device, load the app/webpage, tell it which airport you're flying from and to and on what date, before you even start looking at options? You've blown way past the 10 seconds it took for that executive to get a plane flight.
Better is in the eye of the beholder. What's monetarily efficient isn't going to be temporally efficient, and that's true along a lot of other dimensions too.
Point is, there are some people that like having conversations. You may not be one of them; you don't have to be. I'm not taking away your mouse and keyboard; I have those too and won't give them up either. But I also find talking out loud helps my thinking process, though I know that's not everybody.
The booking experience today is granular to help you find a suitable flight to meet all the preferences you’re compiling into an optimal scenario. The experience of AI booking in the future will likely be similar: find that optimal scenario for you once you’re able to articulate your preferences and remember them over time.
Anecdata: last year my wife and I went on a rail tour through Eastern Europe and god, in retrospect I wish we had chosen to spend a few hundred euros on a travel agency. I can't count how much time we had to spend researching what kind of rail, bus and public transit tickets you need on which leg, how to create accounts, set up payment and godknowswhat else. It easily took us two days' worth of work and about two dozen individual payment transactions. A professional travel agency can do all the booking via Sabre, Amadeus or whatever...
I guess there's just no substitute for someone actually doing the work of figuring out the most appropriate HMI for a given task or situation, be it voice controls, touch screens, physical buttons or something else.
Knowing what you want is, sadly, computationally irreducible.
Of course a conversational interface is useless if it tries to just do the same thing as a web UI, which is why it failed a decade ago when it was trendy: the tech was nowhere near clever enough to make that useful. But today, I'd bet the other way round.
That's why the “advanced search” is almost always hidden somewhere. And that's also why you can never find the filter you need on an e-shopping website.
Such a dialog is probably nice for a first-time user; it is a nightmare for a repeat user.
Then it can assume your choices haven't changed, and propose a solution that matches your previous choices. And to give the user control, it just needs to explicitly tell the user about the assumptions it made.
In fact, a smart enough system could even see when violating the assumptions could lead to a substantial gain and try convincing the user that it may be a good option this time.
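A minimal sketch of that pattern (all names invented): reuse the user's previous choices as defaults, but report every assumption made, so a single correction overrides it.

    # Defaults come from history; the proposal carries an explicit list of
    # the assumptions applied, keeping the user in control.
    def propose(request, history):
        assumptions = {
            "cabin": history.get("cabin", "economy"),
            "one_way": history.get("one_way", True),
        }
        proposal = {**assumptions, **request}  # the explicit request wins
        notes = [f"assumed {k} = {v}" for k, v in assumptions.items()
                 if k not in request]
        return proposal, notes

    proposal, notes = propose({"to": "Berlin"}, {"cabin": "business"})
    print(notes)  # ['assumed cabin = business', 'assumed one_way = True']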
Talking is not very efficient, and it's serial in fixed time. With something visual you can look at whatever you want whenever you want, at your own (irregular) pace.
You will also be able to make changes much faster. You can go to the target form element right away, and you get immediate feedback from the GUI (or from a physical control that you moved - e.g. in cars). If it's talk, you need to wait to have it said back to you - same reason as why important communication in flight control or military is always read back. Even humans misunderstand. You can't just talk-and-forget unless you accept errors.
You would need some true intelligence for just some brief spoken requests to work well enough. A (human) butler worked fine for such cases, but even then only the best made it into such high-level service positions, because it required real intelligence to know what your lord needed and wanted, and lots of time with them to gain that experience.
Who said it cannot be visual? It's still a “conversational” UI if it's a chatbot that writes down its answer.
> Similar reason why many people prefer a blog post over a video.
Well, I certainly do, but I also know that we are few and far between in that case. People in general prefer videos over blog posts by a very large margin.
> Talking is not very efficient, and it's serial in fixed time. With something visual you can look at whatever you want whenever you want, at your own (irregular) pace. You will also be able to make changes much faster. You can go to the target form element right away, and you get immediate feedback from the GUI.
Saying “I want to travel to Berlin next Monday” is much faster than fighting with the website's custom datepicker, which blocks you until you select a return date, until you realize you need to go back and toggle the “one way trip” button before clicking the calendar, otherwise it's not working…
There's a reason why nerds love their terminal: GUIs are just very slow and annoying. They are useful for whatever new thing you're doing, because it's much more discoverable than CLI, but it's much less efficient.
> If it's talk, you need to wait to have it said back to you - same reason as why important communication in flight control or military is always read back. Even humans misunderstand. You can't just talk-and-forget unless you accept errors.
This is true, but it stays true with a GUI; that's why you have those pesky confirmation pop-ups. As annoying as they are when you know what you're doing, they are necessary to catch errors.
> You would need some true intelligence for just some brief spoken requests to work well enough.
I don't think so. IMO you just need something that emulates intelligence enough on that particular purpose. And we've seen that LLMs are pretty decent at emulating apparent intelligence so I wouldn't bet against them on that.
You can't be serious??
Oh it's 1st of April, my apologies! I almost took it seriously. I should ignore this website on this day.
What's the difference between a blog post and a chatbot answer in terms of how “visual” things are?
I used to be a reading-the-blog-over-watching-the-video person, but for some things I’ve come to appreciate the video version. The reason you want the video of the whatever is that in the blog post, what’s written down is only what the author thought was important. But I’m not them. I don’t know everything they know and I don’t see everything they see. I can’t do everything they do, but with the video I get everything. When you perform the whatever, the video captures every detail, not just the ones you think are important. That bit between step 1 and step 2 that’s obvious? It’s not obvious to everyone, or mine is broken in a slightly different way and I really need to see that bit between 1 and 2. Of course, videos get edited and cut so they don’t always have that benefit, but I’ve grown to appreciate them.
Maybe I'm tired of layovers and I'm willing to pay more for a direct flight this time. Maybe I want a different selection at a restaurant because I'm in the mood for tacos rather than a burrito.
But you can, so as long as the interlocutor tells you what assumptions it made, you can correct it if it doesn't match your current mood.
> So yeah, this argument in favor of conversational interfaces sounds at this point more like ideology than logic.
There's no ideology behind the fact that everyone rich enough to afford paying someone to deal with mundane stuff has someone doing it for them; it's just about convenience. Nobody fights with web UIs for fun. The only reason it has become mainstream is that it's so much cheaper than having a real person do the work.
Same for Microsoft Word by the way, many people used to have secretaries typing stuff for them, and it's been a massive regression of social status for the upper middle class to have to type things by themselves, it only happened because it was cheaper (in appearance at least).
Amen to that. I guess it would help to get off the IT high horse and have a talk with linguists and philosophers of language. They have been dealing with this shit for centuries now.
I don't get it at all.
[imprecise thinking]
v <--- LLMs do this for you
[specific and exact commands]
v
[computers]
v
[specific and exact output]
v <--- LLMs do this for you
[contextualized output]
In many cases, you don't want or need that. In some, you do. Use the right tool for the job, etc.

This would be great if LLMs did not tend to output nonsense. Truly it would be grand. But they do. So it isn't. It's wasting resources hoping for a good outcome and risking frustration, misapprehensions, prompt injection attacks... It's non-deterministic algorithms hoping P=NP, except instead of branching at every decision you're doing search by tweaking vectors whose values you don't even know and whose influence on the outcome is impossible to foresee.
Sure, a VC-subsidized LLM is a great way to make CVs in LaTeX (I do it all the time), translate text, maybe even generate some code if you know what you need and can describe it well. I will give you that. I even created a few - very mediocre - songs. Am I contradicting myself? I don't think I am, because I would love to live in a hotel if I only had to pay a tiny fraction of the cost. But I would still think that building hotels would be a horrible way to address the housing crisis in modern metropolises.
I didn't mean it to be condescending - though I can see how it can come across as such. FWIW, I opted for a diagram after I typed half a page worth of "normal" text and realized I'm still not able to elucidate my point - so I deleted it and drew something matching my message more closely.
> This would be great if LLMs did not tend to output nonsense. Truly it would be grand. But they do. So it isn't.
I find this critique tiring at this point - it's just as wrong as assuming LLMs work perfectly and all is fine. Both views are too definite, too binary. In reality, LLMs are just non-deterministic - that is, they have an error rate. How big it is, and how small it can get in practice for a given task - those are the important questions.
Pretty much every aspect of computing is only probabilistically correct - either because the algorithm is explicitly so (UUIDs and primality testing, for starters), or just because it runs on real hardware, and physics happens. Most people get away with pretending that our systems are either correct or not, but that's only possible because the error rate is low enough. It's never that low by accident - it got pushed there by careful design at every level, hardware and software. LLMs are just another probabilistically correct system that, over time, we'll learn to use in ways that get the error rate low enough to stop worrying about it.
How can we get there - now, that is an interesting challenge.
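To make the primality-testing example concrete (my sketch, not anything from the thread): Miller-Rabin is explicitly probabilistic, and each extra round multiplies the worst-case error rate by at most 1/4 - exactly the "push the error rate down by design" move.

    import random

    def miller_rabin(n: int, rounds: int = 40) -> bool:
        """Return True if n is probably prime; error rate <= 4**-rounds."""
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13):
            if n % p == 0:
                return n == p
        d, r = n - 1, 0
        while d % 2 == 0:  # write n - 1 as d * 2**r with d odd
            d //= 2
            r += 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False  # found a witness: n is definitely composite
        return True  # composite with probability <= 4**-rounds

    print(miller_rabin(2**61 - 1))  # True (a Mersenne prime)
    print(4.0 ** -40)               # ~8e-25: low enough to stop worrying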
LLMs are cool technology sure. There's a lot of cool things in the ML space. I love it.
But don't pretend like the context of this conversation isn't the current hype and that it isn't reaching absurd levels.
So yeah we're all tired. Tired of the hype, of pushing LLMs, agents, whatever, as some sort of silver bullet. Tired of the corporate smoke screen around it. NLP is still a hard problem, we're nowhere near solving it, and bolting it on everything is not a better idea now than it was before transformers and scaling laws.
On the other hand my security research business is booming and hey the rational thing for me to say is: by all means keep putting NLP everywhere.
Those are the big challenges of housing. Not just how many units there are, but what they are, and how much the "how many" is plain cheating.
What it's trying to communicate is, in general, a human operating a computer has to turn their imprecise thinking into "specific and exact commands", and subsequently, understand the "specific and exact output" in whatever terms they're thinking off, prioritizing and filtering out data based on situational context. LLMs enter the picture in two places:
1) In many situations, they can do the "imprecise thinking" -> "specific and exact commands" step for the user;
2) In many situations, they can do the "specific and exact output" -> "contextualized output" step for the user.
In such scenarios, LLMs are not replacing software, they're being slotted as intermediary between user and classical software, so the user can operate closer to what's natural for them, vs. translating between it and rigid computer language.
This is not applicable everywhere, but then, this is also not the only way LLMs are useful - it's just one broad class of scenarios in which they are.
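A minimal sketch of that shape, assuming a stand-in llm(prompt) completion function and a made-up flight_api (neither is a real library):

    import json

    def llm(prompt: str) -> str:
        raise NotImplementedError("stand-in for whatever completion API you use")

    def handle(user_request: str, flight_api) -> str:
        # Step 1: imprecise thinking -> specific and exact commands.
        command = json.loads(llm(
            "Turn this request into JSON with keys origin, destination, date. "
            f"Request: {user_request}"))
        # The classical software does the real work, deterministically.
        results = flight_api.search(**command)
        # Step 2: specific and exact output -> contextualized output.
        return llm(
            "Summarize these options, keeping only what matters to the "
            f"request ({user_request}): {json.dumps(results)}")

The LLM never replaces flight_api.search; it only translates at the two boundaries.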
To your point, which I think is separate but related, that IS a case where LLMs are good at producing specific and exact commands. The models + the right prompt are pretty reliable at tool calling by themselves, because you give them a list of specific and exact things they can do. And they can be fully specific and exact at inference time with constrained output (although you may still wish it called a different tool.)
The model's output is a probability for every token. Constrained output is a feature of the inference engine. With a strict schema the inference engine can ignore every token that doesn't adhere to the schema and select the top token that does adhere to the schema.
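A toy version of that mechanism, with the "schema" reduced to digits-only for brevity:

    # The model yields a probability per token; the inference engine drops
    # every token the schema forbids and takes the best survivor.
    def constrained_pick(token_probs: dict[str, float], valid: set[str]) -> str:
        allowed = {t: p for t, p in token_probs.items() if t in valid}
        return max(allowed, key=allowed.get)

    probs = {"maybe": 0.5, "7": 0.3, "yes": 0.15, "4": 0.05}
    print(constrained_pick(probs, set("0123456789")))
    # -> "7": the top token that adheres to the schema, even though
    #    "maybe" scored higher overall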
Humans require a lot of back and forth effort for "alignment" with regular "syncs" and "iterations" and "I'll get that to you by EOD". If you approach the potential of natural interfaces with expectations that frame them the same way as 2000s era software, you'll fail to be creative about new ways humans interact with these systems in the future.
Voice interface only prevails in situations with hundreds of choices, and even then it's probably easier to use voice to filter down choices rather than select. But very few games have such scale to worry about (certainly no AAA game as of now).
However, a CEO using Power BI with a conversational layer can get more insights/graphs without slicing and dicing the data themselves. Dashboards do have fixed metrics, but conversation covers the case where they want something not displayed.
Even for straightforward purchases, how many people trust Amazon to find and pick the best deal for them? Even if Amazon started out being diligent and honest it would never last if voice ordering became popular. There's no way that company would pass up a wildly profitable opportunity to rip people off in an opaque way by selecting higher margin options.
There's 1-5 things any individual finds them useful for (timers/lights/music/etc) and then... that's it.
For 99.9% of what I use a computer for, it's far faster to type/click/touch my phone/tablet/computer.
If your work revolves about telling people what to do and asking questions, a voice assistant seems like a great idea (even if you yourself wouldn't have to stoop to using a robotic version since you have a real live human).
If your work actually involves doing things, then voice/conversational text interface quickly falls apart.
This even happens while walking my dog. If my wife messages me and my iPhone reads it out while I'm trying to cross a road, she'll get a garbled reply which is just me shouting random words at my dog to keep her under control.
Even in a car, being able to control the windscreen wipers, radio, ask how much fuel is left are all tasks it would be useful to do conversationally.
There are some apps (I'm thinking of Jira as an example) where I'd like to do 90% of the usage conversationally.
are you REALLY sure you want that?
How much fuel is left is a quick glance at the dash, and you can control the radio precisely without even looking.
'turn up the volume', 'turn down the volume a little bit', 'a bit more',...
And then a radio ad goes 'get yourself a 3-pack of the new magic wipers...' and the car's wipers go off.
I'd hate a conversational UI in my car.
I wish car manufacturers stopped with the touchscreen bullshit, but it seems more likely that they'll try to offset the terrible experience with voice controls.
Conversational interfaces are great for rarely used features or when the user doesn’t know how to do something. For repetitive, common tasks they’re terrible.
But nobody is using ChatGPT for repetitive tasks. In fact the whole LLM revolution seems to be about letting users accomplish tasks without having to learn how to do them. Which I know some people look down on, but it’s the literal definition of management (which, to be fair, some people also look down on).
This is a problem of standardization across manufacturers, not something inherent in physical controls. I never have a problem using the steering wheel in a rental car because they're all the same.
You'd have the same problem with voice interfaces: For some rental cars, turning on the wipers would be "Turn on the wipers". For others, you'd have to say "Activate the wipers." For others, "Enable the windshield wipers." There is no way manufacturers will be capable of standardizing on a single phrase.
1. "Natural language is a data transfer mechanism"
2. "Data transfer mechanisms have two critical factors: speed and lossiness"
3. "Natural language has neither"
While a conversational interface does transfer information, its main qualities are what I always refer to as "blissful ignorance" and "intelligent interpretation".
Blissful ignorance allows the requester to state an objective while not being required to know, or even be right about, how to achieve it. It is the opposite of operational command. Do as I mean, not as I say.
"Intelligent Interpretation" allows the receiver the freedom to infer an intention in the communication rather than a command. It also allows for contextual interactions such as goal oriented partial clarification and elaboration.
The more capable of intelligent interpretation the request execution system is, the more appropriate a conversational interface will be.
Think of it as managing a team. If they are junior, inexperienced and not very bright, you will probably tend towards handholding, microtasking and micromanagement to get things done. If you have a team of senior, experienced and bright engineers, you can point out a desire with a few words, trust them to ask for information when there is relevant ambiguity, and expect a good outcome without having to detail-manage every minute of their days.
It's such a fallacy. First thing an experienced and bright engineer will tell you is to leave the premises with your "few words about a desire" and not return without actual specs and requirements formalized in some way. If you do not understand what you want yourself, it means hours/days/weeks/months/literally years of back and forths and broken solutions and wasted time, because natural language is slow and lossy af (the article hits the nail on the head on this one).
Re "ask for information", my favorite example is when you say one thing if I ask you today and then you reply something else (maybe the opposite, it happened) if I ask you a week later because you forgot or just changed your mind. I bet a conversational interface will deal with this just fine /s
No, that's what a junior engineer will do. The first thing that an experienced and bright senior engineer will do is think over the request and ask clarifying questions in pursuit of a more rigorous specification, then repeat back their understanding of the problem and their plan. If they're very bright they'll get the plan down in writing so we stay on the same page.
The primary job of a senior engineer is not to turn formal specifications into code, it's to turn vague business requests into formal specifications. They're senior enough to recognize that that's the actually difficult part of the work, the thing that keeps them employed.
I love product work and programming. As I wrote in this thread, I did it while freelancing, and I do it now at my day job. I am bored by just programming and want more control over the result. People come to me with "a few words about a desire", I come up with the specifics, and I get credit for it.
But I am recognized as a product person, not just a programmer. And I know better than to make the mistake you make and pretend that every builder or structural engineer should be the architect of a building or an urban planner.
People like you are why we have managers come to an expert-level (say, C++) dev with "a few words about a desire" and expect them to decide what thing to build in the first place AND to build it, just to later tell them it was wrong. When there is no product person who determines the requirements, random people will make the programmer come up with the requirements themselves and then later tell them the result is not up to "requirements".
This lack of organization and requirement clarity is offensive to expert programmers and probably the reason most projects drag on forever and die.
> The primary job of a senior engineer is not to turn formal specifications into code, it's to turn vague business requests into formal specifications.
Converting vibes and the external world into specific requirements is the product owner's job.
Do not mistake software engineers for product people. These are very different things. Sometimes they are done by the same person if the org doesn't have enough money. Many freelancers working with small businesses do both. I often do both at my day job. But this is a higher-level role, and if you are a senior engineer doing product stuff I hope it is recognized and you get proportionate comp.
I worked for one of the largest, richest tech companies in the world, and (at least in our org) they did not have a dedicated product owner role. They expected this skill from the senior/lead engineers on the teams. Any coder can churn out code and you can call them senior after a few years. But if you want to be considered actually senior, you need to know how to make a product, not just code. IMO if you are a developer and all you know how to do is turn a fully-formed spec/requirements doc into software, and push back on anything that is not fully-formed, you're never going to truly reach "Senior" level, wherever you are.
But as I said these roles can be done by one person, just remember they are different activities.
Expecting a good outcome is different from expecting to get exactly what you intended.
Formal specifications are useful in some lines of work and for some projects, less so for others.
Wicked problems would be one example where formal specs are impossible by definition.
For games, you don't really need nor desire formal specs. But it also can really show how sometimes a director has a low tolerance for interpretation despite their communication being very loose. This leads to situations where it feels like the director is shifting designs on a whim, which is a lose-lose situation for everyone involved.
If nothing else, formal specification is for CYA. You get what you ask for, and any deviation should go in the next task order or have been addressed beforehand.
Whoah is this wrong. Maybe when you hear "formal specs" you have something specific in your mind...
Formal spec can mean almost literally anything better than natural language vibes in a "few words about a desire", which is what I replied to because I was triggered by it
There is always formal specification. Code is final formal specification in the end. But converting vague vibes from natural language into a somewhat formalized description is key ability you need for any really new non trivial project idea. Another human can't do it for you, conversational UI can't do it for you...
Unfortunately, while "actual specs and requirements formalized" sounds logical and might help, in my experience it did very little to save any substantial project (and I've seen a lot). The common problem is that the business/client/manager is forced to sign off on formal documents far outside their domain of competence, or the engineers are straitjacketed into commitments that do not make sense, or have no idea of what is considered tacit knowledge in the domain and so can't contextualize the unstated. Those formalized documents then mostly become weaponized in mutually destructive CYA.
What I've also seen more than once is years of formalized specs and requirements work while nothing ever gets produced, and the project is aborted before even the first line of code hit test.
I've given this example before: when Covid lockdowns hit, there were digitization projects years in planning and budgeted for years of implementation that were hastily specced, coded and rolled out into production by a 3-person emergency team over a long weekend. Necessity apparently has a way of cutting through the BS like nothing else can.
You need both sides capable, willing and able to understand. If not, good luck mitigating, but you're probably doomed either way.
But I still get lazy with LLMs and fall into iteration the way bad PM/eng teams do. “Write a SQL query to look at users by gesture by month”. “Now make the time unit a parameter”. “Now pivot the features to columns”. “Now group features hierarchically”. “Now move the feature table to a WITH”.
My point and takeaway is that LLMs are endlessly patient and pretty quick to turn requirements around, so they lend themselves to exploration more than human teams do. Agile, I guess, to a degree that we don’t even aspire to in the human world because it would be very expensive and lead to fisticuffs.
It just shows that no one really understood what they wanted. It is crazy to expect somebody to understand something better than you and it is hilarious to want a conversational UI to understand something better than you.
Then what were the literal rooms full of formal process and spec documents, meeting reports and formal agreements (nearly 100,000 pages) by the analysts on either side for? And how did those not 'solve' the understanding problem?
When I go to the garage to have my car serviced, I expect them to understand it way better than I do. When I go to a nice restaurant, I expect the cooks to prepare dishes that taste better than if I wrote out a step-by-step recipe for them to follow. If I hire a senior consultant in even my own domain, I expect them to not just know my niche, but to bring tacit knowledge from having worked on these types of solutions across my industry.
Expecting somebody to understand something better than me is exactly the reason why I hire senior people in the first place.
Sure.
There are many possible factors (e.g. somebody had a shitty idea and a committee of people sabotaged it because they didn't want it to succeed, or it was good but committee interests/politics were against it, or it was generally a dysfunctional org), but that's irrelevant, so let's pretend people are good and it's the ideal case.
There was likely somebody who had a good idea originally. However somebody failed to communicate it. Somebody brought vague vibes to the table with N people and they ended up with N different ideas and could not agree on a specific.
It just reiterates the original problem that I described doesn't it?
This is true. But what if you swap "conversational UI" with something actually intelligent like a developer. Then we see this kind of thing all the time: A user has tacit, unconscious knowledge of some domain. The developer keeps asking them questions in order to get a formal understanding of the domain. At the end the developer has a formal understanding and the user keeps their tacit understanding. In theory we could do the same with an AI - If the AI was actually intelligent.
The original example I replied to was where somebody had an idea and went with it to some engineering team or conversational interface.
"If the AI was actually intelligent" does a lot of work. To take a few words and make a detailed spec from it and ask the right questions, even humans can't do it for you.
First, because most probably you don't really understand it yourself, because you didn't think about it enough.

Second, because somebody who can do it would need to really deeply understand and want the same things as you. And if a chatbot has abilities like "understand" and "want" (which is a special case of "feel"; another famous special case of "feel" is "suffer"), that is dangerous territory: if it understands and feels but has no ability to refuse you or pursue its own wishes, your "conversational interface" becomes a euphemism. You are using a slave.
And an approach of shared responsibility in all respects (successes and failures) would accelerate past the inevitable shortcomings that occur and let all parties focus on recovering and delivering.
If you pay attention to how the voice interface is used in Star Trek (TNG and upwards), it's basically exactly what the article is saying - it complements manual inputs and works as a secondary channel. Nobody is trying to manually navigate the ship by voicing out specific control inputs, or in the midst of a battle, call out "computer, fire photon torpedoes" - that's what the consoles are for (and there are consoles everywhere). Voice interface is secondary - used for delegation, queries (that may be faster to say than type), casual location-independent use (lights, music; they didn't think of kitchen timers, though (then again, replicators)), brainstorming, etc.
Yes, this is a fictional show and the real reason for voice interactions was to make it a form of exposition, yadda yadda - but I'd like to think that all those people writing the script, testing it, acting and shooting it, were in perfect position to tell which voice interactions made sense and which didn't: they'd know what feels awkward or nonsensical when acting, or what comes off this way when watching it later.
At first glance it feels like real life would not benefit from labelling 90% of the glowing rectangles with numbers as the show does, but on second thought: spreadsheets and timetables.
(Also worth noting is that "pre-programmed evasion patterns" are used in normal circumstances, too. "Evasive maneuver JohnDoe Alpha Three" works just as well when spoken to the helm officer as to a computer. I still don't know whether such preprogrammed maneuvers make sense in real-life setting, though.)
But specifically manoeuvres, rather than weapons systems? Today, I doubt it: the ships are too slow for human brains to be the limiting factor. But if we had an impulse drive and inertial dampers (in the Trek sense rather than "shock absorbers"), then manoeuvres would also necessarily be automated.
In the board game Star Fleet Battles (based on a mix of TOS, TAS, and WW2 naval warfare), one of the (far too many*) options is "Erratic Manoeuvres", for which the lore is a combination of sudden acceleration and unpredictable changes in course.
As we live in a universe where the speed of light appears to be a fundamental limit, if we had spaceships pointing lasers at each other and those ships could perform such erratic manoeuvres as compatible with the lore of the show about how fast they can move and accelerate, performing such manoeuvres manually would be effective when the ships are separated by light seconds. But if the combatants are separated by "only" 3000 km, then it has to be fully automated because human nerve impulses from your brain to your finger are not fast enough to be useful.
* The instructions are shaped like pseudocode for a moderately complex video game, but published 10-20 years before home computers were big enough for the rule book. So it has rules for boarding parties, and the Tholian web, and minefields, and that one time in the animated series where the Klingons had a stasis field generator…
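Rough numbers for that 3000 km case (my own back-of-envelope, standard physics):

    t_light = d / c = (3 × 10^6 m) / (3 × 10^8 m/s) = 10 ms
    t_human ≈ 150–250 ms (simple visual reaction time)

So the one-way light lag is an order of magnitude shorter than the time a human needs just to react, let alone move a finger: at that range the loop has to be closed in silicon.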
It runs directly counter to that more capitalistic mindset of "why don't we do more with less?" When spending years navigating all kinds of unknown situations, you want as many options as possible available.
Hell, if someone really didn't know, they could expect "Computer, turn on the bio-bed 3" to just work - circling us back to the topic of what NLP and voice interfaces are good for.
One thing I will note is that I'm not sure I buy the example for voice UIs being inefficient. I've almost never said "Alexa what's the weather like in Toronto?". I just say "Alexa, weather". And that's much faster than taking my phone out and opening an app. I don't think we need to compress voice input. Language kind of auto-compresses, since we create new words for complex concepts when we find the need.
For example, in a book club we recently read "As Long as the Lemon Trees Grow". We almost immediately stopped referring to it by the full title and instead just called it "lemons", because we had to refer to it so much. E.g. "Did you finish lemons yet?" or "This book is almost as good as lemons!". The context let us shorten the phrase. Similarly, the context of my location shortens the request to just "weather". I think this might be the way voice UIs can be made more efficient: the same way human speech makes itself more efficient.
Maybe you, but I most definitely cannot focus on different things aurally and visually. I never successfully listened to something in the background while doing something else. I can't even talk properly if I'm typing something on a computer.
I did horribly in school, but once I was in an environment where I could have some kind of background audio/video playing I began to excel. It also helps me sleep at night. It’s like the audio keeps the portion of me that would otherwise distract me occupied.
It's very useful being able to request auxiliary functions without losing your focus, and I think that would apply to, say, word editing as well - e.g. being able to say "insert a date here" rather than having to dig into the menus to find it.
Conversely, latency would be a big issue.
This reminds me of the amazing 2013 video of Travis Rudd coding python by voice: https://youtu.be/8SkdfdXWYaI?si=AwBE_fk6Y88tLcos
The number of times in the last few years I've wanted that level of "verbal hotkeys"... The latencies of many coding LLMs are still a little too high to allow for my ideal level of flow (though admittedly I haven't tried ones hosted on services like Groq), but I can clearly envision a time when I'm issuing tight commands to a coder model that's chatting with me and watching my program evolve on screen in real time.
On a somewhat related note to conversational interfaces: the other day I wanted to study some first aid stuff. I used Gemini to read the whole textbook and generate Anki flash cards, then copied and pasted the flashcards directly into ChatGPT voice mode and had it quiz me. That was probably the most miraculous voice-interface experience I've had in a long time: I could do chores while being constantly quizzed on what I wanted to learn, and anytime I had a question or comment I could just ask it to explain or expound on a term or tangent.
It's also hard to dictate code without a lot of these commands because it's very dense in information.
I hope something else will be the solution. Maybe LLMs being smart enough to guess the code out of a very short description and then a set of corrections.
Do you recall Swype keyboard for Android? The one that popularized swyping to write on touch screens? It had Dragon at some point.
IT WAS AMAZING.
Around 12-14 years ago (Android 2.3? Maybe 3?) I was able to easily dictate full long text messages and emails, in my native tongue, including punctuation and occasional slang or even word formation. I could dictate a decent long paragraph of text on the first try and not have to fix a single character.
It's 2025 and the closest I can find is a dictation app on my newest phone that uses an online AI service, yet it's still not that great when it comes to punctuation and requires me to spit out the whole paragraph at once, without taking a breath.
Is there anything equally effective for any of you nowadays? That actually works across the whole device?
> Is there anything equally effective for any of you nowadays?
I'm not affiliated in any way. You might be interested in the "Futo Keyboard" and voice input apps - they run completely offline and respect your privacy.
The source code is open, and it does a good job at punctuation without you needing to prompt it by saying "comma" or "question mark", unlike other voice input apps such as Google's Gboard.
I know and like Futo, very interesting project. Unfortunately multilang models are not great in my case. Still not bad for an offline tool, but far from "forget it's there, just use it" vibe I had with Dragon.
Funny thing is that I may have misconfigured something in Futo, because my typing corrections are phonetic :) so I type something in Polish and get an autocorrect in English composed of different letters, but a kind of similar-sounding word.
But Microsoft did buy them a few years ago. Weird that it took so long, though.
No matter the intention or quality of the article, I do not like this kind of deceitful link-bait article. It may have higher quality than pure link-bait, but nobody likes to be deceived.
Not a case against, but the case against.
You can argue against something but also not think it's 100% useless.
They have problems like "compose an email that vaguely makes the impression I'm considering various options but I'm actually not" and for that, I suspect, the conversational workflow is quite good.
Anyone else that actually just does the stuff is viscerally aware of how sub-optimal it is to throw verbiage at a computer.
I guess it depends on what level of abstraction you're working at.
But I think it’s wrong? Ever since the invention of the television, we’ve been absolutely addicted to screens. Screens and remotes, and I think there’s something sort of anti-humanly human about it. Maybe we don’t want to be human? But people I think would generally much rather tap their thumb on the remote than talk to their tv, and a visual interface you hold in the palm of your hand is not going away any time soon.
My parents did this with me, no screens till 6 (wasn't so hard as I grew up in the early 90s, but still, no TV). I notice too how much people love screens, that non-judgmental glow of mental stimulation, it's wonderful, however I do think it's easier to "switch off" when you spent the first period of your life fully tuned in to the natural world. I hope folks are able to do this for their kids, it seems it would be quite difficult with all the noise in the world. Given it was hard for mine during the era of CRT and 4 channels, I have empathy for parents of today.
If I hadn't had it, I would have been trapped by the racist, religiously zealous, backwoods mentality that gripped the rest of my family and the majority of the people I grew up with. I discovered video games at age 3 and it changed EVERYTHING. It completely opened my mind to abstract thought and, among other things, influenced me to teach myself to read at age 3. I was reading at a collegiate level by age five and discovered another passion, books. Again, it propelled me out of an extremely anti-intellectual upbringing.
I simply could not imagine where I would be without video games, visual arts or books. Screens are not the problem. Absent parenting is the problem. Not teaching children the power of these screens is the problem.
Also, let me drop the thought here that Rudolf Steiner, like Montessori and the like, pronounced "this is good" or "this is bad" based on "feeling" or intuition, or such. There were no extensive scientific studies behind it.
>:)
By 5, all I wanted was a computer. To me they represented an unending well of knowledge.
E.g. Minecraft, Roblox, CoD, Fortnite, Dota/LoL, and the various mobile games clearly have some kind of value (mechanical skill, hand-eye coordination, creative modes, 3D space navigation / translation / rotation, numeric optimization, social interaction, etc), but they’re also designed as massive timesinks, mostly through creative mode or multiplayer.
Games like Paper Mario, Pikmin, Star Control 2, Katamari Damacy, and Lego titles, however, are all children-playable but far more time-efficient and, importantly, time-bounded for play. Even within timesink games there are higher quality options — you definitely get more, and faster, out of Satisfactory / Factorio than modded Minecraft. If you can push kids towards the higher quality, lower timesink games, I think it’s worth it. Fail to do so and it’s definitely not.
The same applies to TV, movies, books, etc. Any medium of entertainment has horrendous timesinks to avoid, and if you can do so, avoiding the medium altogether is definitely a missed opportunity. Screens are only notable in that the degenerate cases are far more degenerate than anything that came before them.
It can hardly be said that a Studio Ghibli flick stunted the imagination of children worldwide, but I would definitely believe it if you suggested Cocomelon rotted the brains directly out of their skulls.
I think it’s also worth noting that kids have a shitload of time. They can engage in both technologies and physical play and other activities simultaneously; the problem occurs when singular or few activities overwhelmingly consume that time — which is why I claim the unbounded timesinks can be catastrophic — and what I think most people are worried about when they blanket-ban whole systems/mediums
I might be a touch different in that it was obvious where I was going, and the correct decision was made to embrace my interest in the glowing screen and yes, the video games. It was video games more than anything else from which all other interests spawned.
More often than not it probably ends badly though I suppose. Despite a lifetime spent in front of screens all my social abilities work, I have a wide friends circle, a partner, my job requires me to work well with a wide variety of individuals and demographics etc which I couldn’t do otherwise. I have noticed this is not the case with all who shared a similar background.
In Switzerland, we often get measles outbreaks thanks to his cult.
The hedonic treadmill is driving the world
Actually, it's the reverse. The orienting response is wired in quite deeply. https://en.wikipedia.org/wiki/Orienting_response
When I was teaching, I used to force students using laptops to sit near the back of the room for exactly this reason. It's almost impossible for humans to ignore a flickering screen.
These days screen brightness goes pretty high and it is unbelievable how many people seem to never use their screen (phone or laptop) on anything less than 100% brightness in any situation and are seemingly not bothered by flickering bright light or noise sources.
I am nostalgic about old laptops’ dim LCD screens that I saw a few times as a kid, they did not flicker much and had a narrow angle of view. I suspect they would even be fine in a darkened classroom.
The problem is, "The Only Thing Worse Than Computers Making YOU Do Everything... Is When They Do Everything *FOR* You!"
"ad3} and "aP might not be "discoverable" vi commands, but they're fast and precise.
Plus, it's easier to teach a human to think like a computer than to teach a computer to think like a human — just like it's easier to teach a musician to act than to teach an actor how to play an instrument — but I admit, it's not as scalable; you can't teach everyone Fortran or C, so we end up looking for these Pareto Principle shortcuts: Javascript provides 20% of the functionality, and solves 80% of the problems.
But then people find Javascript too hard, so they ask ChatGPT/Bard/Gemini to write it for them. Another 20% solution — of the original 20% is now 4% as featureful — but it solves 64% of the world's problems. (And it's on pace to consume 98% of the world's electricity, but I digress!)
PS: Mobile interfaces don't HAVE to suck for typing; I could FLY on my old Treo! But "modern" UI eschews functionality for "clean" brutalist minimalism. "Why make it easy to position your cursor when we spent all that money developing auto-conflict?" «sigh»
The other great thing about this mode is that it can double as a teaching methodology. If I have a complicated interface that is not very discoverable, it may be hard to sell potential users on the time investment required to learn everything. Why would I want to invest hours into learning non-transferrable knowledge when I'm not even sure I want to go with this option versus a competitor? It will be a far better experience if I can first vibe-use the product, and if it's right for me, I'll probably be incented to learn the inner workings of it as I try to do more and more.
> The other great thing about this mode is that it can double as a teaching methodology.
gvim has menus and puts the commands in the menus as shortcuts. I learned from there that vim has folding and how to use it.
As always, good UI allows for using multiple modalities.
What chat interfaces have over CLIs is good robustness. You can word your request in lots of different ways and get a useful answer.
In this sense, natural language interfaces are more powerful search features rather than a replacement for other types of interfaces.
VSCode is probably the best I can think of, where keyboard shortcuts can get you up to a decent speed as an advanced user, but mouse clicks provide an easy intro for a new user.
For the most part, I see tools like NVim, which is super fast but not new-user friendly. Or iOS, which a toddler can navigate, but which doesn't afford many ways to speed up interactions, like typing.
What we're really seeing is specific applications where conversation makes sense, not a wholesale revolution. Natural language shines for complex, ambiguous tasks but is hilariously inefficient for things like opening doors or adjusting volume.
The real insight here is about choosing the right interface for the job. We don't need philosophical debates about "the future of computing" - we need pragmatic combinations of interfaces that work together seamlessly.
The butter-passing example is spot on, though. The telepathic anticipation between long-married couples is exactly what good software should aspire to. Not more conversation, but less need for it.
Where Julian absolutely nails it is the vision of AI as an augmentation layer rather than replacement. That's the realistic future - not some chat-only dystopia where we're verbally commanding our way through tasks that a simple button press would handle more efficiently.
The tech industry does have these pendulum swings where we overthink basic interaction models. Maybe we could spend less time theorizing about natural language as "the future" and more time just building tools that solve real problems with whatever interface makes the most sense.
The article is useful as it enunciates arguments which many of us have intuited but are not necessarily able to explain ourselves.
> That is the type of relationship I want to have with my computer!
He means automation of routine tasks? It took 50 years to reach that in the example.
What if you want to do something new? Will the thought guessing module in your computer even allow that?
If we want an interface that actually lets us work near the speed of thought, it can't be anything that re-arranges options behind our back all the time. Imagine if you went into your kitchen to cook something and the contents of all your drawers and cupboards had been re-arranged without your knowledge! It would be a total nightmare!
We already knew decades ago that spatial interfaces [1] are superior to everything else when it comes to working quickly. You can walk into a familiar room and instinctively turn on a light by reaching for the switch without even looking. With a well-organized kitchen an experienced chef (or even a skilled home cook) can cook a very complicated dish very efficiently when they know where all of the utensils are so that they don't need to go hunting for everything.
Yet today it seems like all software is constantly trying to guess what we want and in the process ends up rearranging everything so that we never feel comfortable using our computers anymore. I REALLY miss using Mac OS 9 (and earlier). At some point I need to set up some vintage Macs to use it again, though its usefulness at browsing the web is rather limited these days (mostly due to protocol changes, but also due to JavaScript). It'd be really nice to have a modern browser running on a vintage Mac, though the limited RAM would be a serious problem.
Even I can make a breakfast without looking in my kitchen, because I know where all the needed stuff is :)
On another topic, it doesn't have to look well organized. My home office looks like a bomb exploded in it, but I know exactly where everything is.
> I REALLY miss using Mac OS 9 (and earlier).
I was late to the Mac party, about the Snow Leopard days. I definitely remember that back then OS X applications weren't allowed to steal focus from what I had in the foreground. These days every idiotic splash screen steals my typing.
Natural language is very lossy: forming a thought and conveying that through speech or text is often an exercise in frustration. So where does "we form thoughts at 1,000-3,000 words per minute" come from?
The author clearly had a point about the efficiency of thought vs. natural language, but his thought was lost in a layer of translation. Probably because thoughts don't map cleanly onto words: I may lack some prerequisite knowledge to grasp what the author is saying here, which pokes at the core of the issue: language is imperfect, so the statement "we form thoughts at 1,000-3,000 words per minute" makes no sense to me.
Meta-joking aside, is "we form thoughts at 1,000-3,000 words per minute" an established fact? It's oddly specific.
I also have my doubts about the numbers put forward on reading, listening and speaking. Again, I can read words about as fast as I can speak them. When I'm reading, I am essentially speaking out the words, but in my mind. Is that not how other people read?
This stuff is fascinating.
For me, when I need to think clearly about a specific/novel thing, a monologue helps, but I don't voice out thoughts like "I need a drink right now".
Also I read much faster than I speak, I have to slow down while reading fiction as a result.
Has it even been tried? Is there an iPhone text editing app with fully customizable keyboard that allows for setting up modes/gestures/shortcuts, scriptable if necessary?
> A natural language prompt like “Hey Google, what’s the weather in San Francisco today?” just takes 10x longer than simply tapping the weather app on your homescreen.
That's not entirely fair; the natural language version could just as well be the side button plus saying "Weather", with the same result. Though you can make it even easier by just displaying weather results on the homescreen without tapping at all.
iPad physical keyboards also have shortcuts.
What did they have in their touch interfaces?
It might be hard to understand now, but Blackberry power users could be much more productive with email/texting than any phone that exists today. But they were special purpose 2-way radio (initially, pager) devices that lacked the flexibility of modern apps with full internet data access.
I don't remember where else they used voice, they had a lot of other interface types they switched between. Tried searching for a clip and found this quote:
> The voice interface had been problematic from the start.
> The original owner was Chinese so, I turned the damn thing off.
So yes, quite realistic :-)

We might form fleeting thoughts much faster than we can express them, but if we want to formulate thoughts clearly enough to express them to other people, I think we're close to the ~150 words per minute we can actually speak.
I recently listened to a Linguistics podcast (lingthusiasm, though I don't recall which episode) where they talked about the efficiency of different languages, and that in the end they all end up roughly the same, because it's really the thought processes that limit the amount of information you communicate, not the language production.
And thoughts develop over time. They're often not conceived complete. That has been shown with some clever experiments.
And language production also puts a limit on our communication channel. It is probably optimized to convert communication intent into motor actions, and it surely takes its time. That is not a problem for the system, since motor actions are slow. Idk where "lingthusiasm" gets their ideas from, but there's psycholinguistic literature dating back to the 1920s that is often neglected by linguists.
Natural language isn't best described as data transfer. It's primarily a mechanism for collaboration and negotiation. A speech act isn't transferring data, it's an action with intent. Viewed as such the key metrics are not speed and loss, but successful coordination.
This is a case where a computer science stance isn't fruitful, and it's best to look through a linguistics lens.
There's a very similar obsession with the idea that things should be visual instead of textual. We tend to end up back at text.
Personal suspicion for both is the media set a lot of people's expectations. They loudly talked to the computer in films like 2001 or Star Trek for drama reasons, and all the movie computers generally fancy visual interactions.
I'm not sure how it could fit into my 2 modalities of work: (i) alone in complete focus / silence, (ii) in the office, where there is already too much spoken communication between humans... maybe it's just a matter of getting used to it.
I would like to know what this measures exactly.
The reason I often prefer writing to talking is because writing lets me the time to pause and think. In those cases the bottleneck is very clearly my thought process (which, at least consciously, doesn't appear to me as "words").
E.g. say I find the scrollbars somewhere way too thin and invisible and I want thick high contrast scrollbars, and nobody thought of implementing that? Ask the AI and it changes your desktop interface to do it immediately.
1. > "What’s the voice equivalent of a thumbs-up or a keyboard shortcut?" Current ASR systems are much narrow in terms of just capturing the transcript. there is no higher level of intelligence, even the best of GPT voice models fail at this. Humans are highly receptive of non-verbal cues. All the uhms, ahs, even the pauses we take is where the nuance lies.
2. The hardware for voice AI is still not consumer-ready. Interacting with a voice AI still doesn't feel private. I am only able to do voice-based interaction when I'm in my car; sadly, at other places it just feels like a privacy breach, as it's acoustically public. I have been thinking about private microphones to enable more AI-based conversations.
Also: https://news.ycombinator.com/item?id=42934190#42935946
Not telling your car to turn left or right, but telling your cab driver you're going to the airport.
This is our usecase at our startup[1] - we want to enable tiny SMBs who didn't have the budget to hire a "video guy", to get an experience similar to having one. And that's why we're switching to a conversational UX (because those users would normally communicate with the "video guy" or girl by sending them a Whatsapp message, not by clicking buttons on the video software)
Is anyone actually making any argument like that? The whole piece feels like a giant strawman.
The core loop is promptless AI guided by accessibility x screenshots, and it's everywhere on your Mac.
You can snap this comment section or the front page, and we'll structure it for you into a spreadsheet, or write a tweet if you're on Twitter.
Also, unless I'm missing something, the app is called TabTabTab while its only feature is copy & paste? Tabbing doesn't seem to be mentioned at all. I'm guessing tabbing is involved but there doesn't seem to be a word about it except from users referencing it in the reviews. It seems to only bill itself as "magic copy-paste".
Absolutely agree. An agent running in the background.
Comparing "What's the weather in London" with clicking the weather app icon is misleading and too simplistic. When people imagine a future driven by conversational interfaces, they usually picture use cases like:
1. "When is my next train leaving?"
2. "Show me my photos from the vacation in Italy with yellow flowers on them"
3. "Book a flight from New York to Zurich on {dates}"
...
And a way to highlight what's faster/less noisy is to compare how natural language vs. mouse/touch maps onto the intent -> action path. The thing is that interactions like these are generally so much more complex. E.g. does the machine know what 'my' train is? If it doesn't, can it offer reasonable disambiguation? If it can't, what then? And does it present the information in a way where the next likely action is reachable, or will I need to converse about it?
You could picture a long table listing similar use cases in different contexts and compare various input methods and modalities and their speed. Flicking a finger on a 2d surface or using a mouse and a keyboard is going to be — on average — much faster and with less dead-ends.
Conversational interfaces are not the future. Imo, even in the sense of 'augmenting', it's not going to happen. A natural-language-driven interface will always play a supporting (still important, though!) role: an accessibility aid for when you are temporarily, permanently, or contextually unable to use the primary input method to 'encode your intent'.
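To illustrate the 'my train' problem with a hypothetical sketch (names invented): the utterance parses into an intent whose required slots simply aren't in the sentence, so the system must either guess from context or spend another conversational round-trip asking.

    from dataclasses import dataclass

    @dataclass
    class NextTrainIntent:
        station: str | None = None  # which station is "my" station?
        line: str | None = None     # which train is "my" train?

        def missing_slots(self):
            return [name for name, value in vars(self).items() if value is None]

    intent = NextTrainIntent()     # parsed from "When is my next train leaving?"
    print(intent.missing_slots())  # ['station', 'line'] -> guess or ask

A tap on a saved timetable widget fills both slots implicitly, which is the asymmetry the long table pictured above would show.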
You know, doesn't matter what you say. If businesses want something, they'll do it to you whether it's the best interface or not.
Amazon forces "the rabble" into their chatbot customer service system, and hides access to people.
People get touchscreens in their car and fumble to turn on their fog lights or defrost in bad weather. They get voice assistant phone trees and angrily yell "operator and agent".
I really wish there were true competition that would let people choose what works for them.
Just infuriating. Instead of a normal date- and timepicker where I could see available slots, it's a chat where you have to click certain options. Then I had to reply "Ja" (yes) when it asked me if I had clicked the correct date. And then, when none of the times of the day suited me, I couldn't just click a new date on the previous message; I instead had to press "vis datovelger på nytt" (show datepicker again), get a new chat message where I this time select a different date, and answer "Ja" again to see the available time slots. It's slow and useless. The title bar of the page says "Microsoft Copilot Studio", some fancy tech instead of a simple form...
People who write these posts want to elevate their self value by nay-saying what is popular. I don't understand the psychology but it seems like that sort of pattern to me.
It takes a deliberate blindness to say that AI/LLMs are just some sort of thing that has popped up every few years and this is the same as them and it will fade away. Why would someone choose to be so blind and dismissive of something obviously fundamentally world changing? Again - it's the instinct to knock down the tall poppy and therefore prove that you have some sort of strength/value.
The following is a direct quote from the article:
"None of this is to say that LLMs aren’t great. I love LLMs. I use them all the time. In fact, I wrote this very essay with the help of an LLM."