Reel Transcriber Bot

Instagram reel transcription via Telegram bot

Problem

I consume a lot of short-form video content for research and learning. Watching every reel fully is slow, and rewatching to catch a specific detail is slower. I needed a way to turn any reel into text I could read, save, and reference later.

Reading a transcript takes seconds. Watching a 90-second reel twice takes three minutes.

Stack

Make.com orchestrates the entire pipeline. A RapidAPI service downloads the reel from Instagram. The Gemini API processes the audio and generates a structured transcript with analysis. The bot processes audio only, not video frames. Input and output both go through the Telegram Bot API.

Make.com, RapidAPI, Gemini API, Telegram Bot API
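To make the Telegram side of the stack concrete, here is a minimal Python sketch of reading an incoming update and building a sendMessage request. The real bot does this inside Make.com HTTP modules, so the function names (parse_update, build_send_message) are hypothetical and the code only illustrates the request shape. It builds the request without sending it.

```python
import json
import urllib.request
from typing import Optional, Tuple

API_BASE = "https://api.telegram.org"


def parse_update(update: dict) -> Optional[Tuple[int, str]]:
    """Pull (chat_id, text) out of a Telegram update, or None if it has no text."""
    message = update.get("message") or {}
    chat_id = (message.get("chat") or {}).get("id")
    text = message.get("text")
    if chat_id is None or not text:
        return None
    return chat_id, text.strip()


def build_send_message(token: str, chat_id: int, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Bot API sendMessage method."""
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/bot{token}/sendMessage",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would deliver the message. The token and chat ID come from the bot's own configuration and from the incoming update.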

Flow diagram

7 steps — triggered instantly

Webhook → HTTP (RapidAPI download) → HTTP (process media) → HTTP (download file, conditional filter) → Resume (error handling) → HTTP (Gemini API transcription) → HTTP (send to Telegram)
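The same flow can be sketched in Python as a single function with each external service passed in as a callable. This is an illustration, not the Make.com scenario itself: the parameter names are hypothetical, and the two middle HTTP steps (process media, download file) are collapsed into one resolve step and one download step.

```python
from typing import Callable, Optional


def run_pipeline(
    reel_url: str,
    resolve_media_url: Callable[[str], Optional[str]],  # stands in for the RapidAPI call
    download_file: Callable[[str], bytes],              # fetches the media file
    transcribe: Callable[[bytes], str],                 # stands in for the Gemini call
    send_message: Callable[[str], None],                # stands in for Telegram delivery
) -> bool:
    """Run one reel through the stages in order; return True on success."""
    try:
        media_url = resolve_media_url(reel_url)
        # Conditional filter: continue only when a downloadable file was found.
        if not media_url:
            send_message("Could not resolve that reel. Check the link and try again.")
            return False
        media = download_file(media_url)
        transcript = transcribe(media)
    except Exception as exc:
        # Error-handling branch: report the failure instead of stopping silently.
        send_message(f"Processing failed: {exc}")
        return False
    send_message(transcript)
    return True
```

Injecting the services as callables keeps the control flow testable without network access: each stage can be replaced with a stub that returns canned data.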

Make.com scenario for Reel Transcriber — 7-step pipeline from webhook to Telegram delivery

~30 seconds per reel. No scheduling.

The output has four sections: Transcript, Summary, Tone and audience, and Takeaways.
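As a rough illustration of that four-section shape, the sketch below models the output as a Python dataclass and renders it as a plain-text message. The class and function names are hypothetical. The splitting helper is an added assumption: a single Telegram text message is capped at 4096 characters, so a long transcript may need more than one message.

```python
from dataclasses import dataclass, field
from typing import List

TELEGRAM_MAX_CHARS = 4096  # Bot API limit for one text message


@dataclass
class ReelAnalysis:
    transcript: str
    summary: str
    tone_and_audience: str
    takeaways: List[str] = field(default_factory=list)


def format_message(analysis: ReelAnalysis) -> str:
    """Render the four sections in a fixed order with plain-text headings."""
    bullets = "\n".join(f"- {item}" for item in analysis.takeaways)
    return (
        f"Transcript\n{analysis.transcript}\n\n"
        f"Summary\n{analysis.summary}\n\n"
        f"Tone and audience\n{analysis.tone_and_audience}\n\n"
        f"Takeaways\n{bullets}"
    )


def split_for_telegram(text: str, limit: int = TELEGRAM_MAX_CHARS) -> List[str]:
    """Cut text into consecutive chunks that each fit in one message."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]
```

The splitter cuts at fixed offsets for simplicity. Splitting on paragraph boundaries would read better but needs more code.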

Telegram message showing the structured reel transcript output with transcript, summary, tone, and takeaways

Prompt iterations

The first Gemini prompt returned raw, unformatted text. Round 1 defined explicit output sections (transcript, summary, tone and audience, takeaways). Round 2 tuned the handling of overlapping audio, background music, and unclear speech so the model flags gaps instead of guessing.

Hardest part: getting consistent output format across different reel styles (talking head, voiceover, interview).
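One way to pursue consistent formatting is to generate the prompt from a fixed list of section names and then check each response against the same list. The sketch below does that in Python. The prompt wording is illustrative only, not the prompt used in production, and both function names are hypothetical.

```python
from typing import List, Sequence

SECTIONS = ("Transcript", "Summary", "Tone and audience", "Takeaways")


def build_prompt(sections: Sequence[str] = SECTIONS) -> str:
    """Assemble a prompt that names every required section and the gap rules."""
    headings = "\n".join(f"{i}. {name}" for i, name in enumerate(sections, start=1))
    return (
        "Transcribe the spoken audio in this reel and analyze it.\n"
        "Respond with exactly these sections, in this order, "
        "using each heading verbatim on its own line:\n"
        f"{headings}\n"
        "Rules:\n"
        "- Transcribe speech only; do not transcribe background music.\n"
        "- When speakers overlap, transcribe what is intelligible.\n"
        "- Mark unclear speech as [inaudible] instead of guessing.\n"
    )


def missing_sections(response: str, sections: Sequence[str] = SECTIONS) -> List[str]:
    """Return the required headings that do not appear as their own line."""
    lines = {line.strip().lower() for line in response.splitlines()}
    return [name for name in sections if name.lower() not in lines]
```

An empty list from missing_sections means the response has every heading. A non-empty list could trigger a retry or a warning appended to the Telegram message.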

Failures fixed

RapidAPI failures

Instagram URLs don't always resolve. Added URL validation checks and fallback error messages sent to Telegram.
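A validation check of this kind might look like the Python sketch below, which rejects anything that is not a recognizable reel link before the download service is called. The accepted hosts and the path pattern are assumptions about what reel links look like, not the rules used in the Make.com scenario, and extract_shortcode is a hypothetical name.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumed link shape: /reel/<shortcode>/ or /reels/<shortcode>/
_REEL_PATH = re.compile(r"^/(?:reel|reels)/([A-Za-z0-9_-]+)/?$")
_HOSTS = {"instagram.com", "www.instagram.com"}


def extract_shortcode(text: str) -> Optional[str]:
    """Return the reel shortcode if text looks like a reel link, else None."""
    try:
        parsed = urlparse(text.strip())
        host = (parsed.hostname or "").lower()
    except ValueError:
        # urlparse raises on some malformed inputs, such as a broken IPv6 host.
        return None
    if parsed.scheme not in ("http", "https") or host not in _HOSTS:
        return None
    match = _REEL_PATH.match(parsed.path)
    return match.group(1) if match else None
```

When the function returns None, the bot can reply with a fallback error message right away instead of spending a download request on a link that cannot resolve.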

Timeout on longer reels

Reels over 60 seconds exceeded the Make.com execution window. Adjusted timeout settings and added a Resume module for recovery.
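In code terms, a Resume-style handler amounts to substituting a fallback value when a step fails so that the rest of the flow can continue. The Python sketch below shows that idea for timeouts. The retry with exponential backoff is an addition for illustration and is not described in the original setup. The function name is hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_resume(
    step: Callable[[], T],
    fallback: T,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run step; on TimeoutError retry with backoff, then resume with fallback."""
    for attempt in range(attempts):
        try:
            return step()
        except TimeoutError:
            # Wait before the next try, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt timed out: hand back the substitute value so the flow continues.
    return fallback
```

The fallback could be a short notice such as "Transcription timed out, try again", which the final Telegram step then delivers in place of a transcript.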

Result

Used multiple times a day. Send a link, get a structured transcript in ~30 seconds. Replaced watching, rewatching, and manual note-taking. Running in production on v2 with no interruptions.