HumanoidText

Vision Transformers
Vision Transformers (ViTs) are a type of deep learning model designed for computer vision tasks. Introduced by researchers at Google in 2020, they adapt the transformer architecture, originally developed for natural language processing, to handle image data.
The example below applies this idea to golf swing analysis. Each row of the first table represents a different phase of the golf swing in the video sequence being analyzed.
Frame | Keypoint Annotations | Club Path | Angles (Joints) | Shot Outcome |
Frame 1 | Head (x1, y1), Shoulders (x2, y2), Hips (x3, y3) | Start position | Shoulder-hip (40°), Wrist angle (15°) | Setup position |
Frame 10 | Head (x1, y1), Shoulders (x2, y2), Hips (x3, y3) | Mid-backswing | Shoulder-hip (70°), Wrist angle (30°) | Backswing phase |
Frame 20 | Head (x1, y1), Shoulders (x2, y2), Hips (x3, y3) | Top-backswing | Shoulder-hip (90°), Wrist angle (45°) | Max backswing |
Frame 30 | Head (x1, y1), Shoulders (x2, y2), Hips (x3, y3) | Downswing start | Shoulder-hip (80°), Wrist angle (30°) | Transition phase |
Frame 40 | Head (x1, y1), Shoulders (x2, y2), Hips (x3, y3) | Impact | Shoulder-hip (50°), Wrist angle (15°) | Impact |
Frame 50 | Head (x1, y1), Shoulders (x2, y2), Hips (x3, y3) | Follow-through | Shoulder-hip (30°), Wrist angle (5°) | Post-impact |
Full Sequence with Body Postures and Club Movement:
Frame | Phase | Head | Shoulder (L/R) | Elbow (L/R) | Wrist (L/R) | Hip (L/R) | Knee (L/R) | Clubhead (C) | Club Shaft Angle, CSA (°) | Swing Metrics |
Frame 1 | Setup | (25, 35) | (50, 70), (75, 70) | (55, 95), (70, 95) | (50, 120), (75, 120) | (60, 200), (85, 200) | (65, 250), (85, 250) | (100, 250) | 60° | Start of the swing |
Frame 10 | Early Backswing | (25, 38) | (50, 75), (75, 75) | (60, 100), (70, 100) | (55, 115), (80, 115) | (60, 205), (85, 205) | (65, 252), (85, 252) | (110, 245) | 70° | Club moving upward |
Frame 20 | Mid Backswing | (25, 42) | (50, 80), (75, 80) | (65, 105), (75, 105) | (60, 110), (85, 110) | (60, 210), (85, 210) | (65, 254), (85, 254) | (120, 230) | 90° | Maximum shoulder rotation |
Frame 30 | Top of Backswing | (25, 44) | (48, 85), (78, 85) | (68, 110), (80, 110) | (65, 105), (90, 105) | (60, 212), (85, 212) | (65, 256), (85, 256) | (130, 225) | 120° | Club at max height |
Frame 40 | Transition | (25, 43) | (48, 83), (78, 83) | (66, 108), (80, 108) | (64, 103), (90, 103) | (62, 212), (84, 212) | (65, 258), (84, 258) | (140, 215) | 110° | Start of downswing |
Frame 50 | Downswing | (25, 42) | (48, 80), (78, 80) | (65, 105), (80, 105) | (63, 100), (89, 100) | (64, 210), (84, 210) | (65, 255), (84, 255) | (150, 200) | 80° | Club accelerating downward |
Frame 60 | Impact | (25, 40) | (50, 75), (75, 75) | (63, 100), (78, 100) | (60, 95), (85, 95) | (65, 205), (85, 205) | (65, 250), (85, 250) | (155, 190) | 0° | Ball contact |
Frame 70 | Early Follow-through | (25, 38) | (52, 70), (78, 70) | (65, 95), (80, 95) | (62, 90), (85, 90) | (65, 200), (85, 200) | (65, 248), (85, 248) | (160, 180) | 40° | Club rising post-impact |
Frame 80 | Follow-through | (25, 36) | (55, 65), (80, 65) | (68, 90), (83, 90) | (65, 85), (88, 85) | (65, 195), (85, 195) | (65, 246), (85, 246) | (170, 160) | 80° | Full shoulder rotation |
Breakdown of Vision Transformer Input:
Patches as Keypoints:
Each frame is reduced to the coordinates of key body and club positions. These keypoints play the role that image patches play in a standard Vision Transformer: each one becomes an input token.
A transformer model will learn the relationships between body movements (head, shoulder, elbow, wrist, hips) and club movement to detect patterns and deviations from optimal form.
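To make the keypoints-as-tokens idea concrete, here is a minimal sketch that turns the Frame 1 keypoints from the table above into one token per keypoint. The joint ordering, the 256 x 256 image size used for normalization, the embedding width of 8, and the randomly initialized weights are all assumptions for illustration; in a trained model the projection and joint embeddings would be learned.

```python
import random

# Joint order is an assumption; it follows the columns of the table above.
JOINTS = ["head", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow",
          "l_wrist", "r_wrist", "l_hip", "r_hip", "l_knee", "r_knee",
          "clubhead"]

# Frame 1 (Setup) keypoints copied from the table.
FRAME_1 = {
    "head": (25, 35),
    "l_shoulder": (50, 70), "r_shoulder": (75, 70),
    "l_elbow": (55, 95), "r_elbow": (70, 95),
    "l_wrist": (50, 120), "r_wrist": (75, 120),
    "l_hip": (60, 200), "r_hip": (85, 200),
    "l_knee": (65, 250), "r_knee": (85, 250),
    "clubhead": (100, 250),
}

def init_matrix(rows, cols, seed):
    """Small random weights standing in for learned parameters."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def keypoints_to_tokens(keypoints, coord_proj, joint_embed, width, height):
    """Project each normalized (x, y) to an embedding and add a per-joint embedding."""
    dim = len(coord_proj[0])
    tokens = []
    for j, name in enumerate(JOINTS):
        x, y = keypoints[name]
        nx, ny = x / width, y / height  # scale coordinates to [0, 1]
        token = [nx * coord_proj[0][d] + ny * coord_proj[1][d] + joint_embed[j][d]
                 for d in range(dim)]
        tokens.append(token)
    return tokens

EMBED_DIM = 8  # hypothetical embedding width
coord_proj = init_matrix(2, EMBED_DIM, seed=0)
joint_embed = init_matrix(len(JOINTS), EMBED_DIM, seed=1)
tokens = keypoints_to_tokens(FRAME_1, coord_proj, joint_embed, width=256, height=256)
print(len(tokens), len(tokens[0]))  # 12 tokens, each of width 8
```

The per-joint embedding plays the same role as the position embedding in a standard Vision Transformer: it tells the model which body part a token describes.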
Angles and Positions:
Angular data, such as shoulder-hip rotation, club shaft angle, and knee flex, are key features for training the model.
These numerical inputs help the model capture the mechanics of the swing and how they change over time.
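As a hedged illustration of how such angular features might be derived, the sketch below computes the angle at a joint from three 2D keypoints. It uses the left shoulder, elbow, and wrist of Frame 1 from the table above. The function name `joint_angle` is an assumption, and the source does not state how its own angles were measured, so the result is not expected to match the tabulated values.

```python
import math

def joint_angle(a, b, c):
    """Return the angle at point b, in degrees, between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("coincident keypoints: angle is undefined")
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

# Frame 1 (Setup), left arm keypoints from the table.
l_shoulder, l_elbow, l_wrist = (50, 70), (55, 95), (50, 120)
elbow_angle = joint_angle(l_shoulder, l_elbow, l_wrist)
print(f"Left elbow angle at setup: {elbow_angle:.1f} degrees")
```

Because the keypoints are 2D projections, an angle computed this way depends on the camera viewpoint; a 3D pose estimate would be needed for a view-independent measurement.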
Club Path and Joint Angles:
The club path, meaning how the club moves throughout the swing, is represented by the changing position of the clubhead from frame to frame.
Joint angles, such as shoulder rotation and wrist flexion, are essential for judging whether the player's swing is efficient.
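The club path described above can be sketched as a list of frame-to-frame displacements of the clubhead. The example below uses the clubhead coordinates from the full-sequence table; the function names are illustrative assumptions, and lengths are in the table's own coordinate units, which the source does not specify.

```python
import math

# Clubhead (x, y) positions for Frames 1, 10, 20, ..., 80, copied from the table.
CLUBHEAD = [(100, 250), (110, 245), (120, 230), (130, 225), (140, 215),
            (150, 200), (155, 190), (160, 180), (170, 160)]

def displacements(points):
    """Per-step (dx, dy) between consecutive clubhead positions."""
    return [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(points, points[1:])]

def path_length(points):
    """Total length of the polyline through the points."""
    return sum(math.hypot(dx, dy) for dx, dy in displacements(points))

steps = displacements(CLUBHEAD)
total = path_length(CLUBHEAD)
print(steps[0], round(total, 2))
```

Displacements like these, rather than raw positions, are one possible input feature for judging whether the path trends inside-out or outside-in.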
Time Steps and Sequence Data:
Each frame corresponds to a step in the sequence, which allows the transformer to analyze the swing as a continuous movement rather than isolated instances.
The relationships between consecutive frames will help the model understand transitions between swing phases.
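To illustrate how a transformer relates frames to one another, the following sketch adds sinusoidal positional encodings to a small per-frame feature vector and applies scaled dot-product self-attention across the nine frames of the table. It is a simplification under stated assumptions: the four features chosen, the scale factors 256 and 180, and the use of identity query, key, and value projections are all illustrative. A real transformer layer would use learned projections and multiple heads.

```python
import math

# (clubhead_x, clubhead_y, head_y, csa_deg) per frame, copied from the table.
FRAMES = [
    (100, 250, 35, 60), (110, 245, 38, 70), (120, 230, 42, 90),
    (130, 225, 44, 120), (140, 215, 43, 110), (150, 200, 42, 80),
    (155, 190, 40, 0), (160, 180, 38, 40), (170, 160, 36, 80),
]

def positional_encoding(num_steps, dim):
    """Sinusoidal encodings: even dimensions use sin, odd dimensions use cos."""
    pe = []
    for pos in range(num_steps):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with identity Q/K/V (a simplification)."""
    dim = len(x[0])
    scale = math.sqrt(dim)
    weights = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        weights.append(softmax(scores))
    out = [[sum(w * v[d] for w, v in zip(row, x)) for d in range(dim)]
           for row in weights]
    return out, weights

# Normalize features (scale factors are assumptions), then add time-step encodings.
features = [[cx / 256, cy / 256, hy / 256, csa / 180] for cx, cy, hy, csa in FRAMES]
pe = positional_encoding(len(features), len(features[0]))
x = [[f + p for f, p in zip(frow, prow)] for frow, prow in zip(features, pe)]

out, weights = self_attention(x)
print(len(out), len(out[0]))  # 9 frames, 4 features each
```

Each row of `weights` shows how strongly one frame attends to every other frame, which is the mechanism that lets the model link, for example, the top of the backswing to the impact position.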
Possible Outcomes from Vision Transformers:
Swing Quality: The transformer can provide feedback on whether the swing follows an optimal path or deviates from it, for example through an over-the-top downswing or incorrect wrist angles.
Posture Analysis: The transformer will analyze the golfer's posture at each phase, providing feedback on incorrect spine angles, weight shifts, or knee flex that may affect shot accuracy.
Club Path Optimization: The model can assess whether the club's path is inside-out, outside-in, or straight, and suggest corrections to achieve more consistent shots.
Shot Prediction: Based on the swing, the model can predict the likely outcome of the shot (a slice, hook, or straight ball flight), helping golfers understand why certain issues (like a slice) occur.
This structured, frame-by-frame approach lets a Vision Transformer treat each swing as a sequence of body positions and joint angles and give precise feedback and improvement suggestions based on learned patterns. Trained on such a dataset, an application can get better at detecting swing flaws, giving personalized advice, and helping golfers improve their performance.




