Dataset collection
To meet the project requirements, football training video resources were collected from the internet. All of the data come from real football training videos. Three crawls were conducted, resulting in 90 football match videos with a total duration of approximately 85 h. To ensure video quality, a resolution of 720p was selected. After the videos were obtained, OpenCV was used to extract frames from them. To keep the images diverse, one frame was extracted every 10 s, yielding approximately 28,000 original images with a size of 720 × 1280. These images meet the requirements of the subsequent experiments. This dataset is self-collected.
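The sampling step can be illustrated with a minimal sketch. The function names, the 25 fps frame rate in the usage example, and the idea of computing indices before decoding are assumptions made for illustration; the paper states only that OpenCV was used with a 10 s interval. At that interval, 85 h of video gives at most 85 × 3600 / 10 = 30,600 frames, which is the same order as the roughly 28,000 images reported (both figures are approximate).

```python
from typing import List


def sample_frame_indices(total_frames: int, fps: float, interval_s: float = 10.0) -> List[int]:
    """Indices of the frames kept when one frame is taken every `interval_s` seconds."""
    if total_frames < 0 or fps <= 0 or interval_s <= 0:
        raise ValueError("total_frames must be >= 0; fps and interval_s must be > 0")
    # Number of frames between two kept frames (at least 1).
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))


def expected_image_count(duration_h: float, interval_s: float = 10.0) -> int:
    """Upper bound on the number of sampled images for a given total duration."""
    return int(duration_h * 3600 // interval_s)


# Hypothetical 30 s clip at 25 fps: frames 0, 250 and 500 are kept.
print(sample_frame_indices(750, 25.0))
# Upper bound for 85 h of video sampled every 10 s.
print(expected_image_count(85))
```

In an OpenCV pipeline, each index would typically be passed to the video reader to decode only the selected frames.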
Furthermore, a human-body parsing experiment was conducted to validate the model’s performance. The models compared were the youth soccer training keypoint detection model proposed in this study, local–global long short-term memory (LG-LSTM), Whole-Body Human Pose Estimation (WSHP), Pyramid Scene Parsing Network (PSPNet), Pose Guided Person Image Generation (PGN), DeepLab V2, and others. The experiment used the PASCAL-Person-Part dataset, which comprises 3533 images and divides human body parts and the background into seven classes: background, head, torso, upper arms, lower arms, upper legs, and lower legs. The dataset link is: PASCAL-Part Dataset (roozbehm.info). Accuracy and the mean Intersection over Union (mIoU) across all categories were employed as evaluation criteria. The calculation of mIoU is described by Eq. (4):
$$mIoU=\frac{1}{n+1}\sum_{i=0}^{n}\frac{{M}_{ii}}{\sum_{j=0}^{n}{M}_{ij}+\sum_{j=0}^{n}{M}_{ji}-{M}_{ii}}$$
(4)
In Eq. (4), the categories are indexed from 0 to \(n\), so \(n+1\) is the total number of categories (seven for PASCAL-Person-Part). \({M}_{ii}\) denotes the number of true positives for category i, that is, pixels whose true category is i and that are predicted as i. For \(j\ne i\), \({M}_{ij}\) represents the false positives of category i and \({M}_{ji}\) its false negatives. Accuracy is a metric for assessing the degree of match between the model’s predicted results and the actual labels [36]. The calculation of accuracy is shown in Eq. (5):
$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(5)
In Eq. (5), TP represents True Positive, TN represents True Negative, FP represents False Positive, and FN represents False Negative. The higher the accuracy, the closer the model’s predicted results align with the actual situation. The calculation of Recall is given by Eq. (6):
$$Recall=\frac{TP}{TP+FN}$$
(6)
Equation (6) measures the proportion of true targets that the model successfully detects [37]. Additionally, Average Precision (AP) assesses the precision of the model across confidence thresholds. Precision and recall are first calculated at different confidence thresholds, the Precision-Recall curve is then plotted from these values, and finally the area under this curve is computed, which represents the AP [38]. The calculation of AP is given by Eq. (7):
$$AP=\sum_{r\in recalls}(P(r)\times \Delta r)$$
(7)
In Eq. (7), \(P(r)\) represents precision at a recall rate of r, and \(\Delta r\) indicates the change in recall rate.
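Equations (4) to (7) can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the study: the function names, the toy confusion matrix, the toy precision-recall points, and the choice to score a class that never occurs as 0 are all assumptions. Eq. (4) gives the same value whether rows index true or predicted classes, because the denominator sums both the row and the column of each class.

```python
from typing import List, Sequence, Tuple


def mean_iou(conf: Sequence[Sequence[int]]) -> float:
    """Eq. (4): mean over classes of M_ii / (row sum + column sum - M_ii)."""
    k = len(conf)  # k = n + 1 classes
    ious = []
    for i in range(k):
        row = sum(conf[i])                       # sum_j M_ij
        col = sum(conf[j][i] for j in range(k))  # sum_j M_ji
        union = row + col - conf[i][i]
        # Assumption: a class with an empty union contributes an IoU of 0.
        ious.append(conf[i][i] / union if union else 0.0)
    return sum(ious) / k


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Eq. (5)."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp: int, fn: int) -> float:
    """Eq. (6)."""
    return tp / (tp + fn)


def average_precision(pr_points: List[Tuple[float, float]]) -> float:
    """Eq. (7): sum of P(r) * delta_r, with pr_points given as (recall, precision)."""
    ap, prev_r = 0.0, 0.0
    for r, p in sorted(pr_points):
        ap += p * (r - prev_r)  # rectangle of height P(r) and width delta_r
        prev_r = r
    return ap


# Toy two-class example (values are illustrative, not from the paper).
toy_conf = [[50, 10], [5, 35]]
print(mean_iou(toy_conf))
print(accuracy(tp=35, tn=50, fp=10, fn=5))
print(recall(tp=35, fn=5))
print(average_precision([(0.5, 1.0), (1.0, 0.5)]))
```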
Detailed information regarding the training and testing sets of both datasets is presented in Table 5.
Experimental environment
In terms of hardware, the experimental server used in this study was equipped with two NVIDIA GeForce GTX 1080 graphics cards, each with a memory size of 10,000 MB. All model training was conducted on this hardware configuration. As for the software, the research server employed a 64-bit Ubuntu 16.04 operating system. The graphics driver version was 390.67, and the CUDA version was 9.1.85. These hardware and software configurations provide the necessary support and foundation for the experiments.
Parameter settings
This study used a CNN for the keypoint detection task. Several factors were considered in selecting the parameters of the CNN model to ensure its effectiveness in image feature extraction and keypoint detection.
Three convolutional layers were used, with kernel sizes increasing from 3 × 3 to 5 × 5 and 7 × 7. The smaller kernels capture fine-grained features in the image, such as edges and corners, while the larger kernels extract larger-scale features, such as motion posture and ball position. The stride of all convolutional layers was set to 1 to prevent loss of fine detail, and “same” padding was used so that the convolutions do not shrink the feature maps and important information is preserved. The number of convolutional filters increases gradually from 32 to 128, so that more complex features can be extracted without high computational complexity.
Each convolutional layer is followed by a 2 × 2 max-pooling layer with a stride of 2, which downsamples the feature maps while retaining key image information. Pooling reduces the feature map size, decreases the computational load, and enhances the model’s generalization ability.
The fully connected layers, located at the end of the network, map the extracted features to the output space. Two fully connected layers with 512 and 256 neurons, respectively, were used to enhance the model’s representational power while limiting overfitting. The number of neurons in the output layer corresponds to the number of keypoints, so that each keypoint’s regression or classification task is handled precisely. ReLU was used as the activation function because it effectively alleviates the vanishing gradient problem and accelerates network convergence.
To further prevent overfitting, Dropout regularization was applied in the fully connected layers, randomly dropping some neurons during training to enhance the model’s generalization ability. The Adam optimizer was chosen because it combines the advantages of AdaGrad and RMSProp, adapting the learning rate to provide better training performance and convergence. The loss function was the cross-entropy loss, which is well suited to classification tasks and measures the difference between the model’s predictions and the true values, guiding model optimization. All parameter selections were validated through multiple experiments and hyperparameter tuning, using cross-validation to check the model’s stability and robustness across different training and validation datasets. The hyperparameters tuned in this process included the learning rate, batch size, and dropout rate.
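The effect of the layer design described above on feature map size can be traced with a short sketch. Several values are assumptions, because the text does not specify them: the 720 × 1280 RGB input (taken from the extracted frame size), the 64 filters in the middle layer (the text gives only the range 32 to 128), and the function and constant names.

```python
from typing import List, Tuple

# (kernel size, output channels). The middle value 64 is an assumption.
CONV_LAYERS: List[Tuple[int, int]] = [(3, 32), (5, 64), (7, 128)]


def trace_feature_maps(height: int, width: int, in_channels: int = 3) -> List[Tuple[int, int, int, int]]:
    """Return (channels, height, width, conv parameters) after each conv + pool stage."""
    stages = []
    channels = in_channels
    for kernel, out_channels in CONV_LAYERS:
        # Stride-1 convolution with "same" padding leaves height and width unchanged.
        params = kernel * kernel * channels * out_channels + out_channels  # weights + biases
        # 2 x 2 max pooling with stride 2 halves each spatial dimension.
        height, width = height // 2, width // 2
        channels = out_channels
        stages.append((channels, height, width, params))
    return stages


for stage in trace_feature_maps(720, 1280):
    print(stage)
```

Under these assumptions the three stages reduce a 720 × 1280 input to a 90 × 160 map with 128 channels, which shows why the pooling layers matter before the 512-neuron fully connected layer.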
For the second-level network, training ran for 50 epochs using the mini-batch Adam algorithm for parameter optimization. The initial learning rate was set to 0.001 and was decayed stepwise during training: the learning rate was multiplied by a decay factor of 0.1 at the 25th, 40th, and 50th epochs. Constructing the ground truth for the discriminator requires defining the “true” and “false” heat maps, which involves two hyperparameters, ω and μ. ω is the threshold on pixel intensity, and μ is the threshold on the number of pixels whose intensity exceeds ω. The two parameters do not need to be adjusted simultaneously during training: an appropriate ω is selected first, and μ is then fine-tuned. In this study, ω was set to 0.4, and the optimal μ value of 75 was obtained through experimental tuning. The selection of these parameters is crucial for the accuracy of the training process and results.
Hyperparameter tuning was also carried out to ensure the stability and reliability of the model. The key hyperparameters examined were the learning rate, batch size, and weight decay. The learning rate was adjusted systematically, starting from small values and progressively increasing, while the model’s performance on both the training and validation sets was monitored; the learning rate giving the best balance between the two was selected. The tuning results are shown in Table 6. According to Table 6, the optimal parameter selection is a learning rate of 0.01, a batch size of 64, and a weight decay of 0.1. 
The model exhibits high accuracy and low loss values on both the training and validation sets, demonstrating relatively strong performance and robust generalization capabilities.
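The learning rate schedule and the two heat map thresholds described for the second-level network can be sketched as follows. The function names are hypothetical, epochs are assumed to be counted from 1, and the decay is assumed to take effect in the milestone epoch itself. How the pixel count maps to the “true” or “false” label is not stated in the text, so the second function only reports whether the count exceeds μ.

```python
from typing import Sequence


def step_lr(epoch: int, base_lr: float = 0.001,
            milestones: Sequence[int] = (25, 40, 50), gamma: float = 0.1) -> float:
    """Learning rate in a 1-indexed epoch: multiplied by gamma at each milestone reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** reached


def exceeds_pixel_count(heatmap: Sequence[Sequence[float]],
                        omega: float = 0.4, mu: int = 75) -> bool:
    """True if more than mu pixels of the heat map have intensity above omega."""
    strong = sum(1 for row in heatmap for value in row if value > omega)
    return strong > mu


for epoch in (1, 25, 40, 50):
    print(epoch, step_lr(epoch))

# Hypothetical heat map with 80 pixels at intensity 0.5.
print(exceeds_pixel_count([[0.5] * 10 for _ in range(8)]))
```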
Performance evaluation
This study does not use pre-trained CNN models, for three reasons.
1. Specificity of the dataset: This study focuses on keypoint detection in youth soccer training, a domain whose characteristics differ from those of general image classification tasks, such as those addressed by datasets like ImageNet. Images from soccer training are distinguished by dynamic scenes, rapid movements, and complex motion patterns. While pre-trained CNN models, such as ResNet and VGGNet, perform exceptionally well in general image recognition tasks, they may not effectively capture the critical features specific to the soccer training domain.
2. Custom architecture design: To better accommodate the keypoint detection tasks in youth soccer training, a customized CNN architecture was designed and trained from scratch. With convolutional layer structures and pooling strategies tailored to the task’s specific requirements, the model is more capable of learning features relevant to soccer training rather than relying on generic features learned by pre-trained models. This approach enables the model to adapt more effectively to rapidly changing motion scenarios, complex movement patterns, and the diverse characteristics of players across different age groups.
3. Limitations of pre-trained models: Although pre-trained models can offer computational convenience and enhance performance through transfer learning, their features may differ significantly from those required for this study’s task-specific data. This disparity could introduce unnecessary biases, potentially undermining the model’s generalization ability and suitability for the task. Consequently, training the model from scratch ensures that it learns the features most relevant to youth soccer training.
In summary, the decision to forgo pre-trained models and train a specialized CNN from scratch was driven by the need to capture domain-specific features, improve adaptability to complex motion patterns, and avoid biases introduced by generic features. This approach ensures that the model is finely tuned to the specific demands of keypoint detection in youth soccer training.
The experimental design was specified in detail to ensure the reliability and replicability of the experiments. Its main elements are as follows.
Independent variable: PASCAL-Person-Part data.
Dependent variables: accuracy and performance of the keypoint detection model.
Control variables:
1. Data preprocessing: The PASCAL-Person-Part dataset is preprocessed to ensure data quality and consistency.
2. Training dataset: Images, including keypoint annotations, are extracted from the PASCAL-Person-Part dataset using OpenCV to ensure the adequacy and quality of the training data.
3. Hardware and software environment: The experiments are conducted on a server equipped with two NVIDIA GeForce GTX 1080 graphics cards, running a 64-bit Ubuntu 16.04 operating system, ensuring consistency in hardware and software environments.
4. Data quality: Strict quality control measures, including data cleaning and denoising, are applied to the PASCAL-Person-Part dataset to ensure the accuracy and reliability of the data.
5. Data annotation: Keypoints in the PASCAL-Person-Part dataset are annotated precisely to ensure that the model learns sufficient information for accurate keypoint detection.
Randomization procedures, adopted to avoid experimental biases and enhance replicability:
1. Data selection: Images and keypoint annotations are selected at random from the PASCAL-Person-Part dataset to ensure diversity and uniformity in the training data.
2. Training dataset: Images are selected at random during the image data extraction process to ensure a uniform distribution and diversity of image data.
First, model performance is analyzed. The accuracy and position error on the test set are shown in Fig. 9:

Analysis of accuracy and position error on the test set.
In Fig. 9, keypoint predictions are relatively poor in the vicinity of the penalty area, primarily because keypoints are densely distributed there. However, the keypoint prediction errors for foot positions and curve positions are relatively small. This study therefore further conducted a regional statistical analysis of coordinate errors in the test dataset. The results suggest that keypoint visibility is not a significant challenge, as the model provides fairly accurate results. In summary, the model predicts keypoint visibility well and captures keypoints reliably, especially in critical areas of the soccer field. This result has significant implications for soccer training and skill improvement.
On the PASCAL-Person-Part dataset, the results of human body parsing classification for each model are presented in Table 7 and Fig. 10. The comparison models in Table 7 (LG-LSTM, WSHP, PSPNet, PGN, etc.) were selected from recent research in related fields, and each was implemented following the model architecture of the source listed below. For a fair comparison, these models were reproduced with the same dataset (PASCAL-Person-Part) and training strategies (such as data augmentation and optimizer settings). LG-LSTM follows the temporal action recognition framework of Sun et al. (2022) [39]. WSHP is a whole-body pose estimation network based on Jung et al. (2022) [40]. PSPNet adopts the scene parsing network of Yuan et al. (2022) [41]. PGN reproduces the pose-guided generation model of Bodaghi et al. (2018) [42].

Results of human body parsing classification experiment.
In Fig. 10, the mIoU of the proposed soccer training keypoint detection model is 73.78%. Compared to the LG-LSTM, WSHP, PSPNet, and PGN models, this is an improvement of 27.29%, 9.16%, 8.34%, and 7.88%, respectively, in the human body parsing task. The results indicate that this study’s model performs better in human body parsing. The recall and average precision curves for these models are depicted in Fig. 11.

Recall and average precision curves for different models.
In Fig. 11, this study’s model achieves a recall of 84.2% and an average precision of 84.6%. Compared to the other models, the largest improvements are 24.2% and 18.8%, respectively, and the smallest are 8.8% and 6.0%, respectively. The data suggest that the proposed CNN-based soccer training keypoint detection model achieves higher detection accuracy. The detection speeds of the five models are compared in Fig. 12.

Comparison of detection speeds for different models.
In Fig. 12, the detection speed of this study’s model is 35 frames per second (fps). The LG-LSTM model achieves the fastest detection speed at 55 fps but has a relatively lower mIoU of only 57.96%.
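The two mIoU values quoted for the proposed model (73.78%) and LG-LSTM (57.96%) allow a quick check of how the reported improvement is measured. The sketch below uses a hypothetical helper name. For this pair, the 27.29% figure matches a relative gain rather than a difference in percentage points; whether the other reported improvements are computed the same way cannot be verified from the text.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


proposed_miou = 73.78  # proposed model, from the Fig. 10 discussion
lg_lstm_miou = 57.96   # LG-LSTM, from the Fig. 12 discussion

absolute_gain = proposed_miou - lg_lstm_miou
relative_gain = relative_improvement(proposed_miou, lg_lstm_miou)
print(f"absolute: {absolute_gain:.2f} points, relative: {relative_gain:.2f}%")
# prints "absolute: 15.82 points, relative: 27.29%"
```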
To put the accuracy of the deep learning model in context, the experiment pays special attention to the design and results of baseline evaluations. Two kinds of baseline are used: traditional computer vision methods and simple machine learning models. For the first baseline, classical computer vision algorithms such as edge detection and corner detection are applied to the PASCAL-Person-Part dataset. For the second, simple machine learning models such as support vector machines and decision trees are applied to keypoint detection on the same dataset; these models were selected for their simplicity and widespread application in image processing tasks. Comparing both baselines with the deep learning model gives a more comprehensive assessment of the relative advantages of deep learning in keypoint detection tasks. Table 8 presents the key indicators from the baseline evaluations. The proposed model outperforms both baselines on all metrics, achieving an accuracy of 94.6%, AP of 84.6%, recall of 84.2%, and F1 score of 89.2%. The simple machine learning models slightly surpass the traditional computer vision methods in accuracy, AP, recall, and F1 score but still lag behind the proposed model. This suggests that deep learning models have better performance and application prospects in human parsing tasks. 
The baseline evaluation results of the proposed deep learning model on the PASCAL-Person-Part dataset demonstrate its significant advantages in human parsing tasks, which are crucial for improving the accuracy and efficiency of human pose detection and analysis.
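Table 8 reports an F1 score, which the text does not define. Under the standard definition, F1 is the harmonic mean of precision and recall. The sketch below illustrates that definition only: the function name and the counts are hypothetical, and it is not known whether the study computed F1 from per-keypoint or per-pixel counts.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and their harmonic mean (F1) from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only (not taken from the paper).
print(precision_recall_f1(tp=80, fp=20, fn=20))
```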
This study focuses on the performance of the deep learning model in keypoint detection for youth soccer training. To test whether differences in prediction accuracy are statistically significant and practically meaningful, the experiment employs Analysis of Variance (ANOVA) as the primary statistical analysis method. Additionally, Cohen’s d is applied to measure the effect size, evaluating the practical differences in the effectiveness of different training techniques. The model’s prediction accuracy and position errors on all test sets are used as input data for the ANOVA, which determines whether keypoint prediction accuracy differs significantly between regions, such as penalty and non-penalty areas. The experiment also calculates the statistical power, which indicates whether the current sample size is sufficient to detect the actual effects and thus guards against Type II errors (failing to reject a false null hypothesis of no effect because the sample is too small). Table 9 presents the results of the ANOVA and the Cohen’s d calculation.
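The two statistics named above can be illustrated with a minimal sketch. The function names and the sample values are hypothetical, Cohen’s d is computed with the pooled standard deviation (one common variant; the paper does not say which was used), and the P value is omitted because it requires the F distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F statistic of a one-way ANOVA: between-group over within-group mean square."""
    values = [x for g in groups for x in g]
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical position errors for two regions (not data from the study).
penalty_area = [2.0, 4.0, 6.0]
other_area = [1.0, 3.0, 5.0]
print(cohens_d(penalty_area, other_area))
print(one_way_anova_f([penalty_area, other_area]))
```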
The results in Table 9 show a significant difference in keypoint prediction accuracy between penalty and non-penalty areas (P = 0.009), with a Cohen’s d of 0.62, a medium effect size by the conventional thresholds (0.5 for medium, 0.8 for large). The statistical power is 0.88, above the commonly accepted standard of 0.8, suggesting that the sample size is large enough to reveal the actual effects. To validate the model’s generalization ability, the experiment employed fivefold cross-validation. This technique randomly divides the entire dataset into five equal parts, with each part serving as the test set in turn while the remaining parts act as the training set, which gives a more reliable estimate of how well the model generalizes across different data.
Although the model demonstrates high overall prediction accuracy, its performance is lower in densely populated areas such as the penalty box. The ANOVA reveals that the prediction errors for keypoints in the penalty box are notably higher, primarily because of the high player density, complex movements, and rapid actions in this region, which make keypoint detection more difficult. While the prediction errors for foot positions and curved trajectories within the penalty box are relatively low, the remaining deviations could still affect the precision required for soccer training.
To account for the differences in player movements across regions, a region-adaptive technique was employed to adjust the model’s learning weights, allowing the model to identify keypoints more accurately in high-density areas such as the penalty box. During training, the dataset was also augmented with examples simulating densely populated scenarios, such as players moving rapidly or executing complex overlapping runs within the penalty area. These enhancements aim to improve the model’s performance in high-density regions. 
In practical soccer training, coaches can leverage the model’s performance insights in such regions to refine training priorities. Specifically, they can focus on enhancing players’ precision and rapid response abilities in critical areas like the penalty box. By implementing these strategies, the prediction accuracy of the model in dense keypoint regions can be effectively improved, thereby increasing its practical value in soccer training applications.
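The fivefold cross-validation procedure mentioned in this section can be sketched as an index split. The function name, the fixed seed, and the use of strided slicing to form the folds are assumptions made for illustration; the paper does not describe how its folds were generated.

```python
import random
from typing import List, Tuple


def k_fold_splits(n_samples: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle sample indices, cut them into k folds, and use each fold once as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


for train_idx, test_idx in k_fold_splits(10):
    print(len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, so the five test scores together cover the whole dataset.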
Additionally, to boost the model’s generalization capability, this study introduced an additional dataset and data augmentation techniques. Apart from the original PASCAL-Person-Part dataset, the study conducted additional validation on an independent dataset suitable for training detection models in sports analysis. This dataset was collected from the 2017 UEFA Super Cup match between Real Madrid and Manchester United (highlight reel). Data augmentation methods such as rotation, scaling, and cropping were applied during the preprocessing stage to enhance the model’s adaptability and generalization across different scenes and poses. Testing on this independent dataset confirms the effectiveness and generalization ability of the model. Table 10 presents the results of fivefold cross-validation and independent dataset validation:
Compared with other state-of-the-art models and the baseline evaluation methods, the proposed model demonstrates superior performance in both accuracy and average position error. Particularly noteworthy is the independent dataset validation, where the proposed model maintained high accuracy and low position error on previously unseen data, showing strong generalization ability and practical application potential.
To further validate the model’s effectiveness in real-world scenarios, this study conducted on-site tests in actual soccer training environments. These tests covered various weather conditions, training grounds, and time periods to assess the model’s performance in complex environments. The key performance indicators obtained by the proposed model in on-site tests are presented in Table 11. The results indicate that the proposed model maintains high accuracy across different environmental conditions and exhibits stability in predicting keypoint positions. Even in the face of challenging weather and lighting conditions, the model consistently provides reliable results, demonstrating its feasibility and practicality in real soccer training scenarios.
Second, interviews were conducted to understand the application of AI and deep learning technologies in campus football. The participants were five technology companies and 30 sports teachers. To gather their feedback, a questionnaire based on a scoring scale was designed to quantify each aspect of their responses. For instance, respondents rated technical performance, user experience, accuracy, and real-time responsiveness on a 5-point scale, which allowed an overall evaluation of performance across technical dimensions. The quantitative data were analyzed statistically, including the mean, standard deviation, and correlation analysis, to identify relationships and differences among the feedback dimensions. Responses to open-ended questions and descriptive feedback were categorized into thematic groups, such as “model accuracy,” “real-time response,” and “ease of operation.” Analyzing this qualitative feedback provided deeper insight into the specific needs of respondents and the practical application scenarios of the technology. Text analysis techniques, such as topic modeling and word frequency analysis, were used to automate the analysis of large volumes of textual feedback and to identify the most frequently mentioned themes and recommendations for technical improvement. The results are summarized in Fig. 13.

Understanding of the application of AI and deep learning technologies in campus football by enterprises and teachers (A: Technology Companies B: Teachers H: Wearable Devices I: VR Technology J: Big Data Analysis K: Video Analysis L: Positioning Systems).
The interviews with the five technology companies and 30 sports teachers assessed their understanding of the application of AI and deep learning technologies in campus football. All five technology companies apply wearable devices and big data analysis technologies in campus football, and four of them also use video analysis technology and positioning systems. Only one technology company offers VR technology products for campus football. Among the teachers, the majority have some understanding of the application of wearable devices and big data analysis technologies in campus football, and over half also have some knowledge of video analysis technology. Teachers generally have some understanding of the application of positioning systems in campus football, whereas only a few know about VR technology. These findings help characterize the current status and potential of AI and deep learning technologies in campus football.
Third, in youth football training, the application of deep learning and AI is influenced by various factors. The main influencing factors were identified through expert interviews and surveys of teachers and coaches, including policy, technological, hardware infrastructure, and cognitive and attitudinal factors. The specific details are shown in Fig. 14.

Perception of enterprises and sports teachers regarding the application of AI and deep learning in campus settings.
Figure 14 shows the factors, derived from expert interviews and surveys of teachers and coaches, that influence the application of deep learning and AI in youth soccer training. Policy factors are regarded as the most crucial among both technology companies and campus communities, with all five technology companies identifying them as key. Four experts also emphasized the importance of cognitive and attitudinal factors in the application of AI in campus soccer, and one recognized the significance of technological factors. In contrast, some physical education teachers and coaches view technological factors as the primary influence, while several teachers consider cognitive and attitudinal factors key. Some teachers also mentioned policy factors, whereas hardware infrastructure and other factors were mentioned by only a few teachers. It can therefore be concluded that policy, cognitive and attitudinal, and technological factors play crucial roles in the application of AI and deep learning technologies in campus soccer, while the impact of hardware infrastructure and other factors is relatively limited. Understanding these factors is important for promoting the application of AI in campus soccer.
Additionally, data on the number of requests for 360-degree VR soccer videos were collected to understand users’ interest and demand for these videos. The results are presented in Fig. 15.

Number of requests for 360-degree VR soccer videos.
In Fig. 15, the most popular video type among users is match highlights, accounting for 25% of the total requests. This suggests that users prefer to watch highlights and key moments of important soccer matches. Requests for goal highlights videos are also high, constituting 18.8%, indicating user interest in spectacular goals. While requests for training skill tutorials in educational content are relatively lower, they still have a certain audience, accounting for 8.3% of the total requests. These data reflect the diversity of 360-degree VR soccer videos, with different types of videos appealing to various audiences. When designing and promoting 360-degree VR soccer videos, consideration can be given to creating more relevant content based on user preferences and interests, such as more match highlights and goal highlights. Furthermore, this data can provide valuable insights into user preferences and market trends for soccer-related brands and platforms.
To further validate the computational efficiency of the proposed deep learning-based keypoint detection model for youth soccer training, this section compares it against other optimized models under similar hardware configurations, analyzing performance during training and inference. According to the test results, computation on a GTX 1080 GPU was significantly faster than on the other two configurations, especially when processing larger datasets, with notable improvements in both inference speed and training time. Training times on the training and testing datasets are compared in Table 12. The GTX 1080 GPU demonstrated outstanding performance in both training and inference times while significantly reducing memory usage. Table 12 also shows that the hardware configuration (such as the GPU model) mainly affects training speed and has no significant correlation with model accuracy. For example, the accuracy difference between the GTX 1080 and the GTX 1060 is less than 0.1%, and the improvement in validation performance is mainly due to model optimization.
To compare models across hardware configurations, this study evaluates its results against several published deep learning models run on similar hardware. For instance, the action recognition model proposed by Tsai et al. (2020) required 20 h of training on an NVIDIA GTX 1080 GPU, with training times exceeding 25 h on similar GPUs such as the GTX 1070 or GTX 1060; its inference time and memory usage were also relatively high [43]. In contrast, the model proposed in this study achieves significantly improved computational efficiency through optimized algorithms that reduce unnecessary computations and memory access. Specifically, during image processing and keypoint detection, the proposed model employs optimization strategies within the CNN framework, such as batch normalization and weight sharing, enabling a more efficient training process. The comparison across hardware demonstrates that the proposed model offers a clear advantage in computational efficiency: during both training and inference, it performs high-precision keypoint detection in less time while maximizing hardware resource utilization. Combined with the details of the training and test sets in Table 5, these results indicate that the proposed model optimizes memory usage and computation time, making large-scale dataset processing significantly more efficient.
To comprehensively evaluate the model’s ability to detect keypoints in youth soccer training, additional analyses focused on specific keypoints, such as foot positions and ball trajectories. The study also categorized model performance by age group and skill level to provide more targeted training recommendations for young players. Table 13 reports the precision and recall obtained for these specific keypoints. The table shows how the model performs on different keypoints; in scenes where the foot touches the ball in particular, the model identifies the keypoints effectively. Although ball trajectory detection performs slightly worse, the model can still accurately predict large-angle ball projections. These results provide a theoretical basis for accurate positioning and action capture in youth football training.
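As an illustration of how precision and recall can be computed for a single keypoint type, the following minimal sketch counts a prediction as correct when it lies within a pixel-distance threshold of the annotated location. The matching rule, the threshold value, and the function name `keypoint_precision_recall` are assumptions made for this example; the matching criterion behind Table 13 is not specified here.

```python
import math


def keypoint_precision_recall(predictions, ground_truths, threshold_px=10.0):
    """Compute precision and recall for one keypoint type.

    predictions[i] is the detected (x, y) in frame i, or None if nothing
    was detected; ground_truths[i] is the annotated (x, y), or None if the
    keypoint is not visible. The distance threshold is an assumed rule.
    """
    tp = fp = fn = 0
    for pred, truth in zip(predictions, ground_truths):
        if pred is not None and truth is not None:
            if math.dist(pred, truth) <= threshold_px:
                tp += 1  # detection close enough to the annotation
            else:
                fp += 1  # detection in the wrong place
                fn += 1  # and the true keypoint was missed
        elif pred is not None:
            fp += 1      # detection where nothing is annotated
        elif truth is not None:
            fn += 1      # annotated keypoint with no detection
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative coordinates only (not data from the study).
preds = [(100, 100), (50, 50), None, (10, 10)]
truths = [(103, 104), (80, 80), (20, 20), None]
precision, recall = keypoint_precision_recall(preds, truths)
```

A stricter threshold lowers both values, so any comparison between keypoint types is only meaningful when the same threshold is used for all of them.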
To further refine the model’s performance and provide personalized training feedback for young athletes, the model was evaluated across different age groups and skill levels. Table 14 shows the performance of the model across age groups (6–8, 9–12, 13–16, and over 16 years old). Precision and recall improve significantly with age, especially in the 13–16 age group. This suggests that as players’ skills mature, the model predicts foot position and ball trajectory more reliably and maintains high precision even during fast and complex movements.
Table 15 shows the performance of the model among players with different skill levels (beginner, intermediate, and advanced). Detection accuracy is relatively low for beginners, mainly in the prediction of foot movement and ball trajectory: the model captures most movements but makes some errors in the details. For intermediate players, performance is balanced between foot position and ball trajectory detection. For advanced players, the model achieves very high precision and maintains high precision and recall even in complex and fast movements.
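A breakdown by age group or skill level can be obtained by pooling detection counts within each group before computing the metrics. The sketch below assumes that each evaluated clip is summarized as a (group, true positives, false positives, false negatives) record; this record layout, the function name `metrics_by_group`, and the counts shown are hypothetical and are not taken from Tables 14 or 15.

```python
from collections import defaultdict


def metrics_by_group(records):
    """Pool counts per group and return {group: (precision, recall)}.

    records: iterable of (group, tp, fp, fn) tuples, one per evaluated clip.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for group, tp, fp, fn in records:
        totals[group][0] += tp
        totals[group][1] += fp
        totals[group][2] += fn

    results = {}
    for group, (tp, fp, fn) in totals.items():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results[group] = (precision, recall)
    return results


# Hypothetical counts, for illustration only.
example_records = [
    ("beginner", 8, 2, 2),
    ("beginner", 8, 2, 2),
    ("advanced", 9, 1, 1),
]
group_metrics = metrics_by_group(example_records)
```

Pooling counts before dividing weights each clip by its number of keypoints, which avoids short clips dominating a group's score.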
To accommodate real-time application scenarios, the design of the model must consider the trade-off between speed and accuracy. Real-time applications typically require systems to complete data processing and feedback within milliseconds. Therefore, the model’s computational efficiency and response speed are critical factors. However, high-precision models often require more complex calculations and greater resources, which can lead to computation delays. The impact of different hardware configurations on the balance between speed and accuracy is shown in Table 16. There is a significant difference in performance between CPU and GPU when processing high-precision models. The GPU demonstrates a clear acceleration effect, completing computational tasks in a shorter time, while the CPU is slower. On embedded devices, lightweight models offer a more balanced trade-off between speed and accuracy, making them suitable for real-time feedback applications that are sensitive to delays.
Based on the results in Table 16, for training and analysis scenarios that require high precision, it is recommended to use GPUs with stronger computational power and employ the full deep learning model. While this may lead to extended processing times, it ensures high precision in training data and motion analysis. For applications requiring real-time feedback in sports and training, it is recommended to use a lightweight network architecture and rely on GPUs or embedded devices to balance accuracy and speed. Although the precision of the lightweight model slightly decreases, it offers significant advantages in processing speed and response time, meeting the real-time feedback requirements.
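The speed side of this trade-off can be checked with a simple per-frame timing loop. The following is a minimal sketch, assuming the model is wrapped in a callable `infer(frame)` that returns only once its result is ready; the function names and the 33 ms budget (roughly 30 frames per second) are illustrative assumptions and are not values from Table 16.

```python
import statistics
import time


def measure_latency_ms(infer, frames, warmup=3):
    """Time infer(frame) for each frame; return per-frame latencies in ms."""
    for frame in frames[:warmup]:
        infer(frame)  # warm-up runs are executed but not timed
    latencies = []
    for frame in frames:
        start = time.perf_counter()
        infer(frame)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms, budget_ms):
    """Summarize latencies and check the worst case against a budget."""
    mean_ms = statistics.fmean(latencies_ms)
    worst_ms = max(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "max_ms": worst_ms,
        "fps": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
        "within_budget": worst_ms <= budget_ms,
    }


# Stand-in for a real model: any callable taking one frame works here.
dummy_latencies = measure_latency_ms(lambda frame: sum(frame), [[1, 2, 3]] * 5)
report = summarize([10.0, 20.0, 30.0], budget_ms=33.0)
```

On a GPU, many frameworks launch work asynchronously, so the `infer` callable would need to wait for the result before returning; otherwise the measured times understate the true latency.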
To comprehensively evaluate the proposed CNN model, multiple comparison models were selected for the experiment, including state-of-the-art methods in the field as well as some classic benchmark models. Table 17 presents the performance of each comparison model in the keypoint detection task.
In Table 17, the proposed CNN model performs best across all reported metrics in the keypoint detection task, with an accuracy of 0.91, a recall of 0.88, and an F1 score of 0.89. This indicates that the model offers high precision and stability in the context of youth soccer training, effectively capturing keypoint locations, especially in fast-moving and complex scenarios. In contrast, the classic benchmark CNN model is weaker, with an accuracy of 0.84, a recall of 0.80, and an F1 score of 0.82. While this model still performs reasonably well in some scenarios, it lags significantly behind the more advanced CNN models, especially in recall, suggesting potential missed detections. The traditional machine learning method performs worst, with an accuracy of only 0.77, a recall of 0.75, and an F1 score of 0.76. This indicates that traditional machine learning methods have limited ability to handle complex movements and dynamic scenes and cannot fully leverage deep features in images, resulting in poorer performance than deep learning-based approaches. The deep learning-based keypoint detection method (OpenPose) performs relatively well, with an accuracy of 0.89, a recall of 0.85, and an F1 score of 0.87. Although this method also shows strong detection capabilities, it falls slightly short of the proposed CNN model, possibly due to differences in network structure or training strategies. Finally, the Simple CNN performs better than the traditional machine learning method but does not match the other deep learning methods. With an accuracy of 0.80, a recall of 0.78, and an F1 score of 0.79, this model captures some features, but its depth and complexity are insufficient to handle higher-level features.
These results indicate that the proposed model not only outperforms traditional methods but also effectively addresses the keypoint detection task in youth soccer training.
This study compares the performance of human experts and AI models in the keypoint detection task, thoroughly exploring the differences between the two and analyzing the reasons behind these differences. A comparison and analysis of the results between human and AI applications are shown in Table 18.
In Table 18, although the AI model performs comparably to human experts in most standard scenarios, its performance is relatively poorer in complex environments, such as fast movement or action occlusion. Furthermore, the real-time capability of the AI model is significantly superior to that of human experts. The AI can provide rapid feedback in a short time, while humans require more time for judgment and reaction. The AI model is trained on large datasets and can make predictions by learning features extracted from the data. In contrast, human experts rely more on intuition, experience, and an understanding of movement patterns when detecting key points. For example, although the AI model can accurately detect key points in standard scenarios through data learning, it often struggles to adapt and adjust in complex or unfamiliar scenarios, leading to a decrease in accuracy. Human experts, on the other hand, exhibit strong adaptability in dynamic and changing environments. For instance, in fast movements or under complex occlusion conditions, humans can infer the location of key points using contextual information, even when parts of the image are obscured or distorted. The AI model, however, relies heavily on image clarity and data consistency. When the input environment changes, the accuracy of the model may decline, especially in unfamiliar situations. While the AI model can achieve fast and accurate keypoint detection in standardized environments, its performance is limited in extreme conditions, such as intense lighting, complex movements, or severe occlusion. The model tends to rely on features present in its training dataset, and in cases of unseen situations or significant changes, the AI’s performance may be biased. In contrast, humans can compensate for these limitations through contextual awareness, accumulated experience, and rapid adaptation to the environment. 
Human experts are capable of making subjective judgments and adjusting actions based on real-time feedback. For example, in live matches, sports coaches or trainers not only rely on visual information but also consider the athlete’s condition, technical movements, and environmental changes. This flexibility is something current AI models find challenging to fully replicate. Although AI outperforms humans in certain aspects, such as real-time processing and data handling, AI models still face challenges in complex and dynamic scenarios. Human experts, with their experience and contextual judgment, demonstrate greater adaptability in such environments, while AI is better suited to standardized environments with sufficient data.
A performance comparison between the pre-trained models and the proposed model is shown in Table 19. The proposed model is superior to the pre-trained models in precision, recall, and F1 score. Specifically, the pre-trained ResNet50 model achieves a precision of 89.2%, a recall of 86.5%, and an F1 score of 87.8%. The pre-trained VGG16 model achieves a precision of 85.7%, a recall of 83.1%, and an F1 score of 84.4%. In contrast, the proposed model achieves a precision of 91.0%, a recall of 88.0%, and an F1 score of 89.5%. This shows that the proposed model has higher precision and recall in the keypoint detection task and that its overall performance is better than that of the pre-trained models.
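The F1 scores reported for Table 19 are consistent with the standard definition of F1 as the harmonic mean of precision and recall. The short sketch below recomputes them from the precision and recall values quoted above; the function name `f1_score` and the dictionary layout are choices made for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the text for Table 19.
reported = {
    "ResNet50 (pre-trained)": (0.892, 0.865),
    "VGG16 (pre-trained)": (0.857, 0.831),
    "Proposed model": (0.910, 0.880),
}

f1_values = {
    name: round(f1_score(p, r), 3) for name, (p, r) in reported.items()
}
# f1_values:
#   ResNet50 (pre-trained) -> 0.878
#   VGG16 (pre-trained)    -> 0.844
#   Proposed model         -> 0.895
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a model cannot obtain a high F1 score by excelling in precision alone while missing many keypoints.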
Finally, the advantages and disadvantages of the methods proposed in this study are summarized. The main advantages include high accuracy, practicality, and comprehensiveness. The study demonstrates that the model achieves good accuracy on the test dataset and can accurately detect key points in soccer training, including key movements and positions of soccer players. Such a model can be practically applied in soccer training, contributing to improved training efficiency and quality. By accurately detecting player movements, coaches can provide better guidance and enhance training programs. This study comprehensively combines deep learning CNNs and AI technology, offering a comprehensive solution for soccer training, including action recognition and position detection. The disadvantages include data requirements, computational complexity, and dependence on technical infrastructure. Deep learning models typically require a substantial amount of training data to achieve optimal performance. Insufficient soccer training data may limit the model’s performance. Deep learning models often have high computational complexity, demanding substantial computational resources and time for training and deployment. The successful application of the model may rely on the technical infrastructure of schools or soccer training facilities, including camera equipment and computing resources. In some environments, this could be a limiting factor. In conclusion, the approach proposed in this study holds great potential in soccer training but still needs to address certain challenges, such as data requirements and computational complexity. As technology continues to advance and data accumulates, these issues may gradually be resolved.
Discussion
This study developed a key point detection model for youth football training, which provides accurate results and enhances training efficiency. This is consistent with the findings of Sun and Ma (2021), whose research proposed an object recognition model that can detect the position information of key points, providing relative positional information for player identification and tracking in football video understanding44. It effectively resolves errors caused by rapid changes in video angles and provides necessary positional information for automated commentary and tactical generation.
This study identified policy, cognitive, attitudinal, and technological factors as important influencers of AI application in campus football, followed by hardware infrastructure and other factors. This is in line with the findings of Ahsan et al. (2022), which suggested that the optimal application of AI wearable devices in campus football includes smart vests, smart armbands, smart wristbands, motion patches, and football chest belts45. Additionally, Chidambaram et al. (2022) indicated that AI devices, through direct contact with players’ bodies, utilize sensors and big data technology to collect various performance and capability data of players, such as average speed, goals scored, playing time, high-intensity exercise duration, sprint counts, total running distance, and high-speed running distance46. These findings provide a theoretical basis for future experimental directions in research.
In addition, the soccer training keypoint detection model proposed in this study has industrial significance in several respects. Firstly, it enhances soccer training efficiency through real-time feedback. The model can monitor player movements in real time and provide feedback, which is valuable for immediate improvement and adjustments. This is beneficial for training before, during, and after matches. The model can assist coaches and players in conducting soccer training more effectively. By automatically detecting and correcting player movements and tactical errors, it can enhance the effectiveness of training, enabling players to improve their skill levels more rapidly. Secondly, it offers personalized training and reduces training costs. The model can provide personalized training recommendations based on each player’s performance and needs, making training more precise and tailored. Traditional soccer training often requires substantial human and time resources. This model can reduce the demand for human resources, thus lowering training costs. Additionally, by collecting and analyzing a large amount of training data, the model can help coaches and teams develop better training plans and tactical strategies, thereby enhancing the team’s competitiveness. Lastly, the application of AI and deep learning to soccer training represents the forefront of the sports technology field. The development of this field is expected to drive the growth of the sports technology industry, including hardware devices, software applications, and media communication. In summary, the proposed soccer training keypoint detection model has the potential to improve the efficiency and quality of soccer training, reduce costs, promote the development of sports technology, and provide better training and competitive advantages for soccer players and teams.
In addition to the aforementioned significance, it can promote intelligent sports training for adolescents, inspiring their interest and learning. Intelligent sports training can help them better understand soccer matches comprehensively and prepare for them more effectively. For young people interested in soccer, this study may spark their interest in science and technology. They can learn how to apply deep learning and AI to address real-world problems, which could be inspiring for their future academic and career development. Using modern technology to enhance sports training can positively impact the development of sports for the younger generation.