Show HN: An unstructured data workspace for data transformations with LLM
hi HN!

A couple of months ago I had to analyze a few thousand audio recordings to help identify issues with customer support. I got some raw, high-level initial results with Python scripts invoking LLM APIs, but they were too general to be useful. Writing basic prompts is easy, but tuning them to be specific enough that no faint signal is missed is hard: you need to iterate through the data with an initial prompt, segment the data into buckets, chain another prompt for each bucket, and so on. Then you need to constantly review the raw data to tweak the prompts just the right way to get the desired results.

There are no good user-facing tools for scaling unstructured data analysis with LLMs to thousands of rows. Claude Cowork and agents with filesystem access scratch the surface, but a text-only UI becomes challenging when you want to go back and adjust your research pipeline, deterministically narrow down to a specific subset of your data with SQL-like filters, or do any cost management. Scaling past 100 files is not well supported, and deep research is difficult to steer and verify.

I needed a mini data warehouse that could help me get insights out of my data, optimize costs on bulk LLM operations (via cost estimation and model choice), and let me browse and verify the data in a user-friendly way, without requiring me to set up something like Databricks. So I built folio.

Folio is a free, local macOS app for analyzing your unstructured data with LLMs. It's a UI wrapper around a minimal data warehouse that lets users (and agents) run LLM-based transformations on big unstructured datasets. All you need to get started is an AI API key and a modal.com account.

Users bring their files into Folio, which loads them into a table where each row contains a markdown representation of the file's contents. Users can then run LLM operations in bulk on those files and use SQL filters to create views that narrow the scope of the transformations. Agents are first-class citizens and can plug into folio to do most of the work for you. To take the load of OCR and audio transcription, as well as the thousands of HTTP requests to AI APIs, off the desktop, we integrate with modal.com as the execution engine: a local orchestrator fans jobs out to modal, then fans them back in once complete. Data is never stored anywhere; it only moves in transit through the AI API provider and the user's own modal infrastructure.
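For the curious, the fan-out/fan-in step can be sketched in a few lines. This is not folio's actual code — it uses a thread pool as a stand-in for the modal execution engine, and `transform_row` is a placeholder for the real LLM call:

```python
from concurrent.futures import ThreadPoolExecutor

def transform_row(markdown_text: str) -> str:
    # Stand-in for a bulk LLM operation; in folio this would be an
    # HTTP request to an AI API routed through the user's modal infra.
    return f"summary of: {markdown_text[:20]}"

def fan_out_fan_in(rows, max_workers=8):
    # Fan one job out per row, then collect results back in row order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transform_row, rows))

rows = ["first file contents", "second file contents"]
results = fan_out_fan_in(rows)
```

In the real app, the local orchestrator plays the role of the pool and modal runs the workers, so the desktop never handles the request fan-out itself.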

folio workspaces are multi-modal (you can load different data types into the same workspace and move them through the same analysis pipeline), and they can support thousands of files.

People use folio today to:

- Review customer support tickets/emails: bucket issues into categories, narrow in on categories of interest, and then action that data by generating a response.

- Extract detailed data from financial documents: load all data that can be found on a particular company, then extract structured data like revenue numbers and projections.

- Do literature reviews: there are lots of agents that help you load data from research paper repositories. Once that data is in folio, users can run a steerable deep research over those files.

- Perform criteria-based search: generate yes/no criteria like "document contains data on XYZ", "document mentions ABC", "document cites XYZ".
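The criteria-based search pattern is easiest to see with a toy example. The schema below is hypothetical (folio's actual table layout may differ): one row per file, with a markdown column and a yes/no column that a bulk LLM pass would have filled in, then a SQL view that narrows the scope of the next transformation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: mentions_abc is a yes/no criterion column
# produced by a bulk LLM pass over each file's markdown.
conn.execute("CREATE TABLE files (name TEXT, markdown TEXT, mentions_abc TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("a.pdf", "... ABC quarterly revenue ...", "yes"),
        ("b.pdf", "... unrelated notes ...", "no"),
    ],
)
# A SQL view deterministically narrows the dataset to matching files,
# so the next (more expensive) LLM operation only runs on that subset.
conn.execute(
    "CREATE VIEW abc_docs AS SELECT * FROM files WHERE mentions_abc = 'yes'"
)
matches = conn.execute("SELECT name FROM abc_docs").fetchall()
```

The point of the view is cost control: cheap yes/no criteria filter the corpus before any heavier per-document extraction runs.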

Companies like V7 Labs, Hebbia, Legora, and Harvey have similar "Tabular Document Review" features, but they don't scale the same way and aren't compatible with outside agents like Claude Code. They also require expensive enterprise contracts.

I see folio moving beyond data analysis to become a companion for agentic tasks that need a human-facing UI/UX, cost management, and bulk actions on data.

Website: https://www.usefolio.ai

Github: https://github.com/usefolio/folio

X: https://x.com/usefolio_ai

Looking forward to hearing what people think!
