Show HN: Scriber Pro – Offline AI transcription for macOS

rezivor - 4d

Hey HN! Built this because I was tired of waiting hours for transcription services and didn't want to upload sensitive recordings to the cloud.

  Real metrics from my M1 Max: 4.5hr video file transcribed in 3 minutes 32
  seconds. Works completely offline.
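
For scale, the figures above work out to a real-time factor in the mid-70s. A quick sketch of the arithmetic follows. The function name is made up for illustration; only the two durations come from the post.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


audio = 4.5 * 3600   # 4.5 hours of video = 16200 s
wall = 3 * 60 + 32   # 3 min 32 s = 212 s
rtf = realtime_factor(audio, wall)
print(f"{rtf:.1f}x real time")  # prints "76.4x real time"
```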

  First 5 HN users who click the button on the page get it free: a promo code straight to the App Store.


  Key differences vs Rev/Otter:
  - No 2-hour file limits (handles any length)
  - Timecodes stay accurate on long files (no drift from chunking)
  - Supports MP3, WAV, MP4, MOV, M4A, FLAC
  - Exports to SRT, VTT, JSON, PDF, DOCX, CSV, Markdown

  Built for macOS. Happy to answer questions!
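
To make the "no drift from chunking" and SRT export points concrete, here is a minimal sketch of one way to keep timecodes absolute when audio is processed in chunks. It is not Scriber Pro's code (the app's internals were not shared). The `Segment` type, `to_srt`, and the chunk layout are assumptions for illustration. The idea is to add each chunk's known start offset to its chunk-relative timestamps, rather than accumulating decoded durations, which is where drift can creep in.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds, relative to the start of its chunk
    end: float    # seconds, relative to the start of its chunk
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(chunks: list[tuple[float, list[Segment]]]) -> str:
    """Build SRT text from (chunk_start_seconds, segments) pairs.

    Each chunk's start offset is taken as given (e.g. derived from the
    sample position in the file), so errors do not accumulate.
    """
    blocks = []
    index = 1
    for chunk_start, segments in chunks:
        for seg in segments:
            start = srt_timestamp(chunk_start + seg.start)
            end = srt_timestamp(chunk_start + seg.end)
            blocks.append(f"{index}\n{start} --> {end}\n{seg.text.strip()}\n")
            index += 1
    return "\n".join(blocks)


example = to_srt([
    (0.0, [Segment(0.0, 2.0, "hi")]),
    (1800.0, [Segment(1.5, 3.0, " there ")]),  # chunk starting at 30 min
])
print(example)
```

The second cue in the example lands at 00:30:01,500 because the chunk offset (1800 s) is added to the chunk-relative start (1.5 s).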

Source

I've been using MacWhisper for this, with a huge variety of transcription options and things like speaker detection. It works great for all the 1 hour and shorter videos I've fed it, but does this have more to offer?

I haven't tried a 4+ hour video with MacWhisper but I presume that would work the same.

Please be my guest to test my claims. No tall tales here!

MacWhisper handles multiple-hour-long recordings just fine for me. I regularly process 4hrs on MacWhisper. Even whisper-cpp works fine these days for long recordings too.

Cool product, but it would be better if you stopped spreading misinformation to support it.

> Cool product, but it would be better if you stopped spreading misinformation to support it.

I don’t see this sort of thing, has the page changed? Edit: the comments here…

The drop shadow on the pages does make it deeply unpleasant to read.

As a side project, I just launched a privacy-first web-based meeting transcriber (https://basilai.app/app). Everything runs entirely in your browser — both the transcription and AI summarization — so no audio or text ever leaves your device.

I'm using the browser's built-in transcription service plus downloading a model and running it via WebGPU. No login. At the end of your meeting, you get a zip file with the audio, transcript and summary.

You can also run Whisper locally in your browser for free: https://ggml.ai/whisper.cpp/

Great when you have time to kill and not a lot to process I suppose

What languages does this support? Does it support switching between multiple languages in one video?

For example, could it support a video that included spoken Latin, ancient Greek, German, and Italian?

So weird that this is nowhere stated on the website at all. Was literally the first and only thing I was interested in. So bad.

eng der fr de es it pt ru zh ko ar and ja

So, can it handle multiple languages in one video, or do you need to segment the different languages using LID first? This has been a thorny issue for people working in multilingual audio (there are at least two or three of us).

I haven't tested that specific edge case, I'm sorry. I tested two languages in a normal conversation and that worked fine. "Auto or English" handles multiple languages the best.

Does it support speaker diarization?

You use the word "transcribe" but the page doesn't appear to support that claim? This looks like straightforward STT? Or does it actually support transcription (diarization, etc.)?

(Also, the text is completely illegible on your site.)

r/#FF0000_rage

Question: can it discern (and label) different speakers? If so, could you kindly share the limit on speakers per video?

MacWhisper Pro supports this, if your need for this is time-sensitive. https://macwhisper.helpscoutdocs.com/article/32-automatic-sp...

You are looking for speaker diarization. No one is doing this well currently on device (in macOS land at least).

Or in the cloud tbh

No, not yet! That will definitely be included in the next update next month. Thank you for reminding me of people's unique need for this use case.

One thing that Rev and other online services have as well as MacWhisper is a good interface for editing the text to correct inevitable errors. Being able to click on the text and have it sync to the correct place in the audio is a must for my use case of transcribing interviews. Also speaker diarization.

Scriber’s iCloud system automatically backs up each transcription and organizes them in a three-pane folder view, somewhat inspired by Bars’ layout. This structure allows a surprising degree of customization for all your data needs, especially when transcribing interviews. It would probably make for a very comfortable workflow here.

Seconding/thirding the request for diarization! I would use this as my main transcription app if it had that.

I use it as my own transcription app, and I really do love it (biased, I know, but genuinely).

Also look at Vibe:

It even supports speaker differentiation/recognition and is open source on Mac/Windows/Linux:

https://github.com/thewh1teagle/vibe

It uses Whisper, but also directly calls other tools and puts everything under one nice GUI.

Does it do separate speaker identification (diarization)?

What's the stack, if I may ask? (I believe Whisper-X does the diarization thing)

I vibecoded a similar app. Here’s the open source link, if folks want to build their own:

https://github.com/naveedn/audio-transcriber

Slower

Yes, but by a negligible margin. My program is designed for multi-track audio, which means I run this in parallel on multiple 3 hour recordings, and get results in 12 minutes.

You haven’t shared any architectural details. What model? What size? How can anyone be sure that what you’re building is truly offline?

Yours isn't OSS, meaning I have no idea what I'm running

OSS would be incredibly slow, also seems like overkill for this use case

I was going to buy the app, but these responses are putting me off massively. How would making it OSS slow it down?

I suspect, from the responses of the creator here, that this app they are selling is likely violating a number of open source licenses…

The obnoxious site design and comments like this stopped me from clicking buy in the App Store.

What does that even mean? Why would OSS make it slower? Why would it be an overkill? This is not Producthunt, you have to give at least some kind of explanation for your claims.

OSS as in open source software. Not Open Sound System. Just in case.

Can you back up your claim that it's slow?

Timecode drift is an interesting issue, think I faced this recently while translating a Google Meet transcript into an incident report timeline.

The elapsed-time timestamps didn't correlate well with other data sources. I figured it was a mistake on my end, and just brushed it off.

How does it compare to MacWhisper?

MacWhisper crashes at about an hour of context. This uses smart, invisible regex in the text generation pipe, which makes it fast. Bonus: there is no context limit.

Smart invisible regex makes it fast and prevents it from crashing? What does that mean?

I've done 3+ hours with MacWhisper without issue. One downside is that the transcription is not real time. Can Scriber Pro do realtime?

I haven't worked in a while with transcription, but whisper.cpp itself (which I assume is the underlying tech behind MacWhisper) does realtime transcription on my MBP with an M1 Pro chip. When I first started writing my last completed novel, I fired it up and just started telling the story to test it out. Realtime.

That was back in 2023. I assume things work better now.

I am a MacWhisper Pro user, and I successfully transcribed and translated a 15-hour course inside the app without any issues

> MacWhisper crashes at about an hour of context.

This is not true. (I've been a MacWhisper user since 2023. I have hit two bugs during that time, which the author addressed quickly.)

"Smart, invisible regex" sounds like a lot of bs... could you give a more technical explanation?

Also the Whisper model doesn't really have a context window, it already segments the audio with a certain amount of overlap between the chunks, I really have a hard time understanding what you are trying to say here.
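
For readers unfamiliar with the idea, here is a rough sketch of what overlapping chunking looks like in general. It is not Whisper's or MacWhisper's actual code. The 30-second window matches Whisper's input length, but the fixed 5-second overlap and the function name are assumptions for illustration only.

```python
def overlapping_windows(total_seconds: float,
                        window: float = 30.0,
                        overlap: float = 5.0) -> list[tuple[float, float]]:
    """Split [0, total_seconds) into windows that overlap their neighbors."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    total = float(total_seconds)
    step = window - overlap  # how far each new window advances
    windows = []
    start = 0.0
    while start < total:
        end = min(start + window, total)
        windows.append((start, end))
        if end >= total:  # last window reached the end of the audio
            break
        start += step
    return windows


print(overlapping_windows(70))
# [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Because every window is cut from the file independently, the total length of the recording only changes how many windows there are, not how much audio the model sees at once.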

Whisper will fail >99%* of the time (*edit: most of the time) at lengths over 90 minutes, and at a fairly high rate over one hour.

This is absolutely not my experience. I regularly (weekly at least) use whisper for 90-120 minutes pieces of content and only rarely have problems.

This is just plain wrong. I have my own Whisper App in the AppStore (on iOS, with very limited memory capacity) and there are no problems at all with longer Audio / Video files.

I've never had whisper complete a single attempt at anything over 75 min.

Can't really declare that without declaring which whisper model in particular you are referring to, as there are a number of them

I’ve used whisper-cpp on 5-hour podcasts without problems.

Would also love to hear what you mean by “smart invisible regex,” sounds like AI slop to me.

> Smart invisible regex

I've never heard a regex person speak this way of a regex.

Please tell me you didn't vibecode the regex... one of the areas it's still not good at

What do you mean context limit?

Neither whisper nor MacWhisper has any context limit.

I’ve also been pretty careful with sensitive recordings, so the offline part really stands out to me. This looks great.

Not gonna lie, the pricing is super attractive! :) I'd love to see an API, so I can also run automations using this tool on my mac.

Will it transcribe audio in Czech (in future versions)?

Actually I would be happy if it could just identify occurrences (timestamps) of a specific word or a small set of words.
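
If a tool can export word-level timestamps (e.g. via JSON), finding those occurrences is a small post-processing step. A minimal sketch follows. The `Word` structure and `find_occurrences` are hypothetical, not any particular app's export format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str     # the word as transcribed, possibly with punctuation
    start: float  # start time in seconds


def find_occurrences(words: list[Word], targets: list[str]) -> dict[str, list[float]]:
    """Map each target word to the start times where it was spoken."""
    wanted = {t.casefold() for t in targets}
    hits: dict[str, list[float]] = {t: [] for t in wanted}
    for w in words:
        # Strip surrounding punctuation and compare case-insensitively.
        key = w.text.strip(".,!?;:\"'()").casefold()
        if key in wanted:
            hits[key].append(w.start)
    return hits


transcript = [Word("Ahoj,", 0.4), Word("svete", 1.0), Word("ahoj", 5.2)]
print(find_occurrences(transcript, ["ahoj"]))  # {'ahoj': [0.4, 5.2]}
```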

I sort of use SuperWhisper, it is sort of good. https://superwhisper.com/

What is your tech stack to make this? Is it end to end swift?

Swift 37.0% C++ 26.5% C 19.8% Rust 4.7% Shell 4.4% Objective-C 1.7% Other 5.9%

What language do you have the model architecture and implementation in? Feel like it would be the biggest proportion of the codebase, curious if you did it in Swift?

Is it only for English? Is a CLI available? There are thousands of files on my local machine and I'd like to save results to a local db. Thanks!