
Technical Analysis of Duplicate Content and Repost Detection on Video Platforms

An in-depth look at how major video platforms detect video duplication and re-uploading, the algorithmic technologies they employ, and how those technologies work, explained through specific examples.

Video platforms such as Douyin, TikTok, and YouTube face massive volumes of user-generated content, and detecting duplicate videos and repurposed (reposted) material has become a core technical problem in platform content management. This article examines how these platforms identify duplication and repurposing, surveys the algorithmic technologies involved, and explains their working principles through specific examples.

[Figure: Flowchart of video platform duplication detection technology]

Basic Methods of Video Duplication Detection

Hash Comparison Technology

Video platforms first employ hash comparison technology, which is the most basic yet fastest detection method. The platform generates multiple types of hash values for each uploaded video:

MD5 hashing is the simplest method, identifying identical files by calculating the MD5 value of the video file. When users directly upload unmodified videos, the system can detect duplicate content within milliseconds through MD5 value matching. However, this method cannot detect any edited videos, as even simple format conversion or compression produces completely different MD5 values.
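As a minimal sketch (the file names are purely illustrative), the file-level check amounts to:

```python
import hashlib

def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file, reading in chunks to bound memory use."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()

# Byte-identical uploads collide; any re-encode yields a different digest.
# file_md5("upload_a.mp4") == file_md5("upload_b.mp4")
```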

Perceptual hashing technology is more advanced, capable of detecting visually similar but technically different videos. The system extracts key frames from the video and generates a fixed-length hash code through DCT (Discrete Cosine Transform) or other algorithms. The perceptual hash values of two videos are compared for similarity using Hamming distance; if the Hamming distance is below a set threshold, the content is deemed duplicate.
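As a sketch of the comparison step, two 64-bit perceptual hashes can be compared like this (the threshold of 10 bits is a common rule of thumb, not any platform's published value):

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the differing bits between two 64-bit perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")

HAMMING_THRESHOLD = 10  # illustrative; tuned per platform in practice

def looks_duplicate(hash_a: int, hash_b: int) -> bool:
    return hamming_distance(hash_a, hash_b) <= HAMMING_THRESHOLD
```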

Audio Fingerprinting Technology

Audio fingerprinting technology is an important means for video platforms to detect repurposed content, with the most famous being audio recognition technology based on the Shazam algorithm. This technology identifies identical or similar audio content by analyzing the spectral features of audio signals to generate unique "audio fingerprints."

The generation of an audio fingerprint runs as follows: first, the audio is sampled at 44.1 kHz; then a spectrogram is generated through the **Short-Time Fourier Transform (STFT)**. The system extracts peak points from the spectrogram, which represent the most prominent frequency components in the audio signal. Next, the algorithm pairs these peak points to form a "constellation map," with each pair encoding two frequency values and the time difference between them:

$$Constellation(P_1, P_2, \Delta t) = (f_1, f_2, \Delta t)$$
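A simplified, hypothetical sketch of this pipeline (SciPy-based; the window length, peak criterion, and fan-out are illustrative choices, and production systems pack fingerprints into fixed-width integers rather than using Python's built-in hash):

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import stft

def audio_fingerprints(samples: np.ndarray, fs: int = 44100, fan_out: int = 5):
    """Shazam-style sketch: spectrogram peaks paired into (f1, f2, dt) hashes."""
    # 1. Time-frequency representation via the short-time Fourier transform.
    freqs, times, Z = stft(samples, fs=fs, nperseg=4096)
    spec = np.abs(Z)

    # 2. Keep local maxima that stand out well above the background energy.
    is_peak = (maximum_filter(spec, size=20) == spec) & \
              (spec > spec.mean() + 2 * spec.std())
    peaks = sorted(map(tuple, np.argwhere(is_peak)), key=lambda p: p[1])

    # 3. Pair each peak with the next few peaks in time: (f1, f2, dt) per pair.
    fingerprints = []
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1 : i + 1 + fan_out]:
            fingerprints.append((hash((int(f1), int(f2), int(t2 - t1))), int(t1)))
    return fingerprints
```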

Visual Feature Analysis

Modern video platforms widely adopt deep learning-based visual feature extraction techniques. Through deep learning models such as Convolutional Neural Networks (CNN), the system can extract high-level semantic features from video frames, which capture the essence of the video content rather than surface pixel information.

The advantage of this method lies in its ability to detect videos that have undergone complex edits, such as color grading, cropping, adding watermarks, or speed changes. Even if the video undergoes significant changes at the pixel level, its deep semantic features often remain relatively stable.
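As an illustration of the idea (not any platform's actual model), an off-the-shelf pretrained CNN can serve as the frame encoder; this sketch assumes torchvision 0.13 or newer:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Any pretrained CNN works as a generic frame encoder; ResNet-50 is one choice.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classifier, keep 2048-d features
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_embedding(pil_image) -> torch.Tensor:
    """Map one video frame to an L2-normalized semantic feature vector."""
    x = preprocess(pil_image).unsqueeze(0)
    return torch.nn.functional.normalize(backbone(x), dim=1).squeeze(0)

# Cosine similarity of normalized embeddings: values near 1.0 suggest the
# frames show the same content even after color grading, crops, or watermarks.
# similarity = frame_embedding(frame_a) @ frame_embedding(frame_b)
```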

[Figure: Comparison of video duplication detection algorithms]

Temporal Consistency Detection

Temporal consistency analysis is another important dimension for detecting repurposed videos. This technology identifies duplicate content by analyzing the temporal relationships and motion continuity between video frames. The dual-level detection method is a significant breakthrough in this field, encompassing Video Editing Detection (VED) and Frame Scene Detection (FSD).

The Video Editing Detection module first determines whether the video has been edited. For unedited videos, the system uses random vectors as descriptors to save computational resources. For edited videos, the system conducts deeper frame-level analysis, including detecting whether the video contains concatenated scenes.
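A toy sketch of this branching logic, under the strong assumption that a simple frame-difference heuristic stands in for the real scene detector and per-scene descriptors (frames are assumed to be a (T, H, W, C) array):

```python
import numpy as np

def detect_scene_cuts(frames: np.ndarray, thresh: float = 0.3) -> list:
    """Toy scene-cut detector: split where the mean frame difference spikes."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    cuts = [0] + [i + 1 for i, d in enumerate(diffs) if d / 255.0 > thresh]
    cuts.append(len(frames))
    return [frames[a:b] for a, b in zip(cuts, cuts[1:])]

def dual_level_descriptor(frames: np.ndarray, is_edited: bool, dim: int = 512):
    """Unedited videos get a cheap random descriptor; edited videos get
    per-scene frame-level analysis (placeholder: mean frame per scene)."""
    if not is_edited:
        return np.random.randn(dim)       # random vector saves computation
    scenes = detect_scene_cuts(frames)    # look for concatenated scenes
    return [scene.mean(axis=0) for scene in scenes]
```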

Detailed Explanation of Core Algorithm Technologies

Perceptual Hash Algorithm Family

The pHash (perceptual hash) algorithm is a widely used technique in video duplication detection. The algorithm generates hash values through the following steps: first, the image is scaled to a standard size of 32×32 pixels, then the Discrete Cosine Transform (DCT) is applied to extract frequency domain features. Next, the algorithm retains the top-left 8×8 region of the DCT coefficients (low-frequency part), calculates the mean of these coefficients, and finally generates a 64-bit binary hash code by comparing each coefficient to the mean.
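A direct implementation of these steps (using Pillow and SciPy) might look like the following; the text describes comparing against the mean, while some variants use the median or drop the DC coefficient:

```python
import numpy as np
from PIL import Image
from scipy.fft import dct

def phash(image_path: str) -> int:
    """64-bit pHash: 32x32 grayscale -> 2-D DCT -> top-left 8x8 vs. their mean."""
    img = Image.open(image_path).convert("L").resize((32, 32), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.float64)
    # 2-D DCT computed as two passes of the 1-D DCT (rows, then columns).
    coeffs = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")
    low = coeffs[:8, :8]                 # keep the low-frequency block
    bits = (low > low.mean()).flatten()  # 1 if coefficient above the mean
    return int("".join("1" if b else "0" for b in bits), 2)
```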

The dHash (difference hash) algorithm adopts a different strategy: it scales the image to 9×8 pixels, then calculates the difference between adjacent pixels. If a pixel is brighter than its right neighbor, a 1 is recorded in the hash code; otherwise, a 0. This method is more sensitive to horizontal changes in the image and can better capture the structural features of the image.
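And a matching dHash sketch following the 9×8 layout described above:

```python
import numpy as np
from PIL import Image

def dhash(image_path: str) -> int:
    """64-bit dHash: 9x8 grayscale, bit = (pixel brighter than right neighbor)."""
    img = Image.open(image_path).convert("L").resize((9, 8), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.int16)
    bits = (pixels[:, :-1] > pixels[:, 1:]).flatten()  # 8 diffs x 8 rows = 64
    return int("".join("1" if b else "0" for b in bits), 2)
```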

In-depth Analysis of Audio Fingerprint Algorithm

[Figure: Detailed workflow of the Shazam audio fingerprint algorithm]

The core of the Shazam algorithm lies in constellation-map matching. The algorithm first converts the time-domain audio signal into a time-frequency representation using the Short-Time Fourier Transform, computed via the Fast Fourier Transform (FFT):

$$STFT(t, f) = \sum_{n=0}^{N-1} x(t+n) \cdot e^{-j 2\pi f n}$$

where $x(t+n)$ represents the audio sample points within the time window, and $e^{-j 2\pi f n}$ is the complex exponential basis function.

The peak extraction process identifies significant feature points in the spectrogram by setting a threshold:

$$Peak(t, f) = \begin{cases} STFT(t, f) & \text{if } STFT(t, f) > \text{threshold} \\ 0 & \text{otherwise} \end{cases}$$

The construction of the constellation map is a key step in the algorithm. The system pairs the extracted peak points, with each pair containing two frequency values and the time difference between them. This pairing method makes the algorithm robust to noise and slight audio distortions. [4]

The hash generation process converts the constellation-map information into compact digital fingerprints:

$$Hash(P_1, P_2, \Delta t) = Hash(f_1, f_2, \Delta t)$$

This hash value serves as a unique identifier for the audio segment and is stored in the database for subsequent fast matching. [4]

Deep Learning Feature Extraction

Self-Supervised Video Hashing (SSVH) technology represents the latest application of deep learning in video duplication detection. This technology adopts a hierarchical binary autoencoder architecture, including an encoder and three decoders: a forward hierarchical binary decoder, a backward hierarchical binary decoder, and a global hierarchical binary decoder.

The encoder uses a binary LSTM (BLSTM) structure, which can directly generate binary hash codes without the need for post-processing steps. The data flow of the BLSTM follows the standard LSTM pattern, but a sign function $b_t = \mathrm{sgn}(h_t)$ is applied at the end to produce binary outputs.
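A minimal PyTorch sketch of this binarization step (an illustration, not the paper's code): the forward pass applies the hard sign, and the backward pass uses the gradient of the piecewise approximation formalized just below.

```python
import torch

class BinarySign(torch.autograd.Function):
    """Hard sign forward; clipped-identity ("straight-through") gradient back."""

    @staticmethod
    def forward(ctx, h):
        ctx.save_for_backward(h)
        return torch.sign(h)

    @staticmethod
    def backward(ctx, grad_output):
        (h,) = ctx.saved_tensors
        # Gradient flows only where -1 <= h <= 1, matching the approximation.
        return grad_output * (h.abs() <= 1).float()

# b_t = BinarySign.apply(h_t) binarizes LSTM hidden states end to end.
```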

To address the NP-hard problem of binary optimization, the algorithm employs an approximate sign function:

$$\mathrm{sgn}_{approx}(h) = \begin{cases} -1 & \text{when } h < -1 \\ h & \text{when } -1 \leq h \leq 1 \\ 1 & \text{when } h > 1 \end{cases}$$

This approximation allows gradients to pass through the sign function during backpropagation, enabling end-to-end training of the entire network. [9]

Temporal Consistency Analysis Algorithm

The temporal consistency re-ranking algorithm is a core technology for localizing video segments. The algorithm first extracts image-level features through keypoint aggregation and deep learning, then uses a multi-k-d tree structure for efficient KNN search to obtain a set of candidate video segments.

The innovation of the algorithm lies in the temporal consistency pruning step, which precisely identifies matching segments and their temporal positions in the sequence by analyzing the timestamp information and sequence IDs of candidate segments. This method can complete a single-frame query in 83.96 milliseconds in a database of 1 million frames, and 462.59 milliseconds in a database of 4.5 million frames.
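A simplified sketch of the retrieval side (a single SciPy k-d tree stands in for the multi-k-d-tree index, and the offset voting below is a deliberate simplification of the temporal-consistency re-ranking):

```python
import numpy as np
from scipy.spatial import cKDTree

def candidate_segments(query_feats, db_feats, db_video_ids, db_timestamps, k=10):
    """KNN search over frame features, then vote on (video, time offset) bins:
    frames of a true matching segment share a roughly constant offset."""
    tree = cKDTree(db_feats)
    _, idx = tree.query(query_feats, k=k)  # k nearest frames per query frame

    votes = {}
    for q_t, neighbors in enumerate(idx):
        for i in neighbors:
            key = (db_video_ids[i], int(db_timestamps[i]) - q_t)
            votes[key] = votes.get(key, 0) + 1

    # Bins with the most consistent votes are the best-matching segments.
    return sorted(votes.items(), key=lambda kv: -kv[1])[:5]
```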

Specific Implementation Case Analysis

YouTube Content ID System

YouTube's Content ID system is one of the most mature copyright detection technologies in the industry. The system employs a multi-level detection strategy:

The first level is audio fingerprint matching. The system generates an audio fingerprint for each uploaded video and compares it against a vast reference database. Even if the audio undergoes pitch changes, speed adjustments, or the addition of background noise, the system can still detect matching content through spectral analysis.

The second level is visual content analysis. The system uses deep learning models to analyze the visual features of the video, including color distribution, texture patterns, object recognition, and more. These features are encoded into high-dimensional vectors, and cosine similarity is calculated to determine video similarity.

The third level is metadata comparison. The system compares metadata such as the video's title, description, and tags, combined with the results of the above technologies, to make a comprehensive judgment.

TikTok/Douyin's Dual Detection Mechanism

Douyin and TikTok employ a dual detection mechanism to address the specificities of short videos:

Real-time detection: During the user's video upload process, the system calculates the perceptual hash value and audio fingerprint of the video in real time. Through rapid comparison with the existing database, the system can identify obvious duplicate content within seconds.

Offline deep analysis: For videos that pass real-time detection, the system conducts deeper analysis in the background. Semantic features are extracted using CNN models to analyze the content originality of the video. For videos detected with slight modifications, the system calculates a similarity score, and content exceeding the threshold is flagged as suspected repurposing.
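Schematically, the two stages might be wired together as below; every index, queue, and attribute name here is a hypothetical placeholder (reusing file_md5 from the MD5 sketch above), not Douyin's or TikTok's actual components:

```python
def moderate_upload(video):
    """Hypothetical two-stage pipeline mirroring the real-time/offline split."""
    # Stage 1 (synchronous, sub-second): cheap fingerprints against the index.
    if md5_index.contains(file_md5(video.path)):
        return "reject: exact duplicate"
    if phash_index.near_match(video.keyframe_hashes, max_distance=10):
        return "reject: near duplicate"

    # Stage 2 (asynchronous): queue deep CNN-based analysis in the background.
    offline_queue.enqueue("semantic_similarity_scan", video.id)
    return "accept: pending offline review"
```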

Actual Detection Performance Data

According to research data, the recognition accuracy of modern audio fingerprinting rises sharply with clip length, reaching 100% under ideal conditions:

  • 1-second audio clip: recognition accuracy 60%

  • 2-second audio clip: recognition accuracy 95.6%

  • 5 seconds and above: recognition accuracy 100%

For video detection, the dual-level detection method achieved a recall rate of 98.8% on the FIVR-200K dataset and 94.1% on the VCSL dataset.

The performance of perceptual hashing technology is as follows:

  • Processing speed: less than 1 millisecond per frame

  • Storage efficiency: only 8 bytes of hash storage per video frame

  • Detection accuracy: for slightly modified videos, detection accuracy can reach 85-90%

Challenges and Technology Development Trends

Countering Adversarial Attacks

As short video has grown popular worldwide, content repurposers are constantly upgrading their evasion methods. Adversarial attacks are one of the main challenges currently faced: attackers attempt to deceive detection systems by adding tiny perturbation signals to videos or by using specific editing techniques.

To address these challenges, platforms are developing more robust detection algorithms. For example, topological fingerprint technology analyzes the topological structure of audio signals through persistent homology theory, which is more robust to time stretching and pitch changes.

Multi-modal Fusion Detection

Modern video detection systems are increasingly adopting multi-modal fusion strategies. By simultaneously analyzing the visual content, audio features, text information (such as subtitles and titles), and social network propagation patterns of the video, the system can construct a more comprehensive content fingerprint.

The advantage of this approach is that even if one modality is deliberately modified, the features of other modalities can still provide effective detection signals. For example, even if the video footage is significantly altered, its audio features and propagation patterns may still reveal its repurposed nature.
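A toy late-fusion scorer makes the point concrete (the weights and threshold are invented for this example):

```python
def fused_duplicate_score(visual: float, audio: float, text: float,
                          weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted late fusion of per-modality similarity scores in [0, 1]."""
    w_v, w_a, w_t = weights
    return w_v * visual + w_a * audio + w_t * text

# Visuals heavily re-edited (0.35), but audio (0.95) and caption text (0.90)
# still match: the fused score 0.64 clears an illustrative 0.6 review threshold.
# fused_duplicate_score(0.35, 0.95, 0.90) -> 0.64
```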

Edge Computing Optimization

Video detection is moving toward real-time processing and lightweight deployment. New algorithm designs focus on:

Computational efficiency: Developing lightweight detection algorithms that can run on mobile devices, reducing dependence on cloud services.

Real-time capability: Implementing real-time detection during video upload, rather than traditional post-processing modes.

Privacy protection: Conducting content detection while protecting user privacy, avoiding the leakage of original video content.

Algorithm Performance Comparison

Different detection algorithms have their own advantages and applicable scenarios:

MD5 hashing is suitable for detecting identical files, offering extremely high speed and accuracy, but cannot handle any form of modification.

Perceptual hashing strikes a good balance between speed and robustness, making it suitable for detecting slightly modified content and the preferred technology for most platforms.

Audio fingerprinting offers extremely high accuracy in detecting audio content, maintaining good performance even with background noise, but has relatively high computational complexity.

Deep learning methods can understand the semantic content of videos and have strong detection capabilities for complex edits, but require significant computational resources and training data.

Temporal analysis excels at detecting the splicing and recombination of video segments, but processing speed is relatively slow, usually serving as a secondary verification method.

In practical applications, video platforms typically adopt a multi-algorithm fusion strategy, dynamically selecting the most suitable algorithm combination based on the characteristics of the video and detection requirements. This layered detection architecture ensures comprehensive detection while balancing computational efficiency and cost control.

Final Remarks

The current mainstream technical routes include perceptual hashing, audio fingerprinting, deep learning feature extraction, and temporal consistency analysis, each with its unique advantages and applicable scenarios. With the continuous development of artificial intelligence technology, future detection systems will become more intelligent, real-time, and precise, while also needing to find a better balance between technological progress and user experience.