In today's fast-paced world, finding the time to watch long videos can be challenging. Whether it's educational content, meetings, or webinars, sometimes you just need to quickly scan through the main points without watching the entire video. This is where VideoToText comes in handy. This project automates the process of extracting text from video files, allowing you to quickly read through the content.
The Problem
Many of us encounter situations where we have a video that we don't have time to watch but need to know its content. Manually transcribing video content is time-consuming and impractical. An automated solution can save significant amounts of time and effort by converting video to text quickly and accurately.
Technology Stack
To achieve this, we have chosen a robust technology stack:
- yt-dlp: A powerful command-line tool for downloading videos from YouTube and other sites.
- FFmpeg: A versatile tool for processing audio and video files, used here to extract audio from videos.
- OpenAI Whisper: A state-of-the-art speech recognition model from OpenAI, used for transcribing audio to text.
- Docker: Ensures the project is easy to set up and run in a consistent environment.
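To give a sense of how the first two tools fit together, here is a minimal sketch (not the project's actual code) that downloads a video's audio with yt-dlp and re-encodes it with FFmpeg into the 16 kHz mono WAV that Whisper works with. The file paths and function names are illustrative:
import subprocess

def download_audio(url, output_path="files/audio.m4a"):
    # Fetch the best available audio-only stream; yt-dlp writes it to output_path
    subprocess.run(["yt-dlp", "-f", "bestaudio", "-o", output_path, url], check=True)
    return output_path

def extract_wav(input_path, output_path="files/audio.wav"):
    # Re-encode to 16 kHz mono WAV with FFmpeg (format is probed from the file contents)
    subprocess.run(
        ["ffmpeg", "-y", "-i", input_path, "-ar", "16000", "-ac", "1", output_path],
        check=True,
    )
    return output_path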
Dockerfile Explained
The Dockerfile is designed to create a consistent environment for the project, ensuring that all dependencies are correctly installed and that the Whisper model is pre-downloaded for efficiency. Here’s a breakdown of the Dockerfile:
FROM python:3.12-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir yt-dlp ffmpeg-python whisper openai-whisper torch tqdm python-slugify
ARG WHISPER_MODEL=small
ENV WHISPER_MODEL=$WHISPER_MODEL
RUN python -c "import whisper; whisper.load_model('$WHISPER_MODEL')"
EXPOSE 80
CMD ["python", "main.py"]
Explanation:
- FROM python:3.12-slim: Uses a slim version of Python 3.12 as the base image.
- WORKDIR /app: Sets the working directory in the container to /app.
- COPY . /app: Copies the current directory contents into the container.
- RUN pip install --no-cache-dir ...: Installs the necessary Python packages without caching to keep the image size small.
- ARG WHISPER_MODEL=small: Defines a build-time variable for the Whisper model.
- ENV WHISPER_MODEL=$WHISPER_MODEL: Sets an environment variable for the Whisper model.
- RUN python -c "import whisper; whisper.load_model('$WHISPER_MODEL')": Pre-downloads the specified Whisper model so it's available inside the container.
- EXPOSE 80: Exposes port 80.
- CMD ["python", "main.py"]: Runs the main Python script when the container launches.
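Because the model name is a build argument, a different model can be baked into the image at build time. For example (an illustrative command; your setup may pass build arguments through docker-compose.yml instead):
docker compose build --build-arg WHISPER_MODEL=medium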
Getting Started
To get started with VideoToText, follow these steps:
Clone the Repository
git clone https://github.com/vshloda/VideoToText.git
cd VideoToText
Build the Docker Image
docker compose build
Run the Docker Container
For processing YouTube videos:
docker compose run --rm app python main.py --url "https://www.youtube.com/watch?v=example"
For processing local video files:
docker compose run --rm app python main.py --file "files/video.mp4"
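Under the hood, main.py chooses between these two modes based on which flag is given. A minimal sketch of that argument handling, assuming argparse (the actual script may be organized differently):
import argparse

def parse_args():
    # Accept either a remote video URL or a local file path, but not both
    parser = argparse.ArgumentParser(description="Transcribe a video to text")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--url", help="URL of a video to download with yt-dlp")
    group.add_argument("--file", help="Path to a local video file")
    return parser.parse_args()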
Segmented Audio Processing
The transcription process involves splitting the audio into smaller, manageable segments. This approach has several advantages:
- Efficiency: Smaller segments are easier and faster to process.
- Accuracy: The model can focus on shorter spans of audio, reducing the chances of errors.
Here is a minimal code snippet demonstrating how audio segmentation is implemented:
import whisper

def split_audio(audio, segment_length=10, sample_rate=16000):
    # Split a raw audio array into fixed-length segments (lengths in seconds)
    segments = []
    audio_length = len(audio) // sample_rate
    for start in range(0, audio_length, segment_length):
        end = min(start + segment_length, audio_length)
        segment = audio[start * sample_rate:end * sample_rate]
        segments.append((segment, start, end))
    return segments

def convert_audio_to_text_whisper(model, audio_path):
    # Transcribe an audio file segment by segment, returning timestamped text
    audio = whisper.load_audio(audio_path)
    segments = split_audio(audio, segment_length=10)
    result_list = []
    for segment, start, end in segments:
        # Whisper's encoder expects exactly 30 seconds of audio, so pad each segment
        segment = whisper.pad_or_trim(segment)
        mel = whisper.log_mel_spectrogram(segment).to(model.device)
        result = whisper.decode(model, mel)
        result_list.append({
            "start": start,
            "end": end,
            "text": result.text.strip()
        })
    return result_list
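The returned list of segments can then be rendered as timestamped text. A small illustrative helper (not part of the project's snippet above) might look like this:
def format_transcript(result_list):
    # Render each segment as "[MM:SS - MM:SS] text"
    lines = []
    for item in result_list:
        start = f"{item['start'] // 60:02d}:{item['start'] % 60:02d}"
        end = f"{item['end'] // 60:02d}:{item['end'] % 60:02d}"
        lines.append(f"[{start} - {end}] {item['text']}")
    return "\n".join(lines)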
The Whisper Model
OpenAI's Whisper model is a versatile and powerful speech recognition system. Whisper offers various model sizes, each designed to balance performance and resource usage:
- tiny: Fast and lightweight, suitable for quick transcriptions with limited resources.
- base: A good balance between speed and accuracy.
- small: Provides higher accuracy while still being relatively fast.
- medium: Delivers very accurate transcriptions with moderate resource requirements.
- large: The most accurate model, but requires significant computational resources.
Choosing the right model depends on your specific needs and available resources. In VideoToText, you can easily switch models by setting the WHISPER_MODEL environment variable.
import os
import whisper

# Load the Whisper model once and reuse it
model = whisper.load_model(os.getenv('WHISPER_MODEL', 'small'))
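Because the model name is read from the environment at runtime, you can also override it for a single run without rebuilding the image. For example (assuming the compose service is named app, as in the commands above; a model that was not pre-downloaded at build time will be fetched on first use):
docker compose run --rm -e WHISPER_MODEL=medium app python main.py --file "files/video.mp4"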
Conclusion
VideoToText is a tool for automating the transcription of video content, making it easier to quickly understand the main points without watching the entire video. While the current project setup provides accurate transcriptions, future enhancements could include a summarization feature to extract only the essential information from the text.
Check out the VideoToText project on GitHub to get started.