Clear Sky Science

ROBUST-MIPS: A Combined Skeletal Pose and Instance Segmentation Dataset for Laparoscopic Surgical Instruments


Smarter Eyes on Surgical Tools

Keyhole surgery relies on long, slender instruments guided by cameras inside the body. For computers to assist surgeons—by tracking tools, warning of danger zones, or even steering cameras—they first need to know exactly where each instrument is and how it is oriented. This article introduces ROBUST-MIPS, a large, carefully labeled image collection that teaches algorithms to follow surgical tools more efficiently and accurately, paving the way for safer and more automated operations.

Figure 1.

Why Following Tools Inside the Body Is Hard

During minimally invasive surgery, the camera shows a circular window into a crowded, shifting scene: tissue, blood, smoke, glare, and several overlapping instruments. Many research groups have tried to make computers understand these scenes by marking every pixel that belongs to a tool, a process called segmentation. Such pixel-level outlines are very detailed, but they are slow and tiring for people to draw, and they do not directly say where a tool begins, bends, and ends. Simple rectangular bounding boxes, which are common in everyday computer vision, fare poorly here: instruments are long and thin, so a box drawn around one covers a lot of irrelevant area and often overlaps with the boxes of other tools.
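The point about boxes can be made concrete with a little geometry. The sketch below treats an instrument shaft as a thin rectangle rotated in the image and computes how much of its axis-aligned bounding box the shaft actually fills. The function name and the 400 by 20 pixel shaft are illustrative assumptions, not values from the dataset.

```python
import math


def box_fill_ratio(length, width, angle_deg):
    """Fraction of the axis-aligned bounding box covered by a rotated rectangle.

    A rectangle of size length x width, rotated by angle_deg, has a bounding
    box of (length*|cos| + width*|sin|) by (length*|sin| + width*|cos|).
    """
    theta = math.radians(angle_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    box_w = length * c + width * s
    box_h = length * s + width * c
    return (length * width) / (box_w * box_h)


# Hypothetical shaft: 400 px long, 20 px wide, at several in-image angles.
for angle in (0, 15, 30, 45):
    print(f"{angle:>2} degrees: fill ratio {box_fill_ratio(400, 20, angle):.3f}")
```

With these assumed dimensions, the shaft fills its box completely when it is horizontal but only about 9% of it at 45 degrees, so most of the box describes background or neighbouring tools.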

A Stick-Figure View of Surgical Instruments

The authors argue for a different point of view: instead of painting every pixel, describe each instrument as a simple “stick figure” made of a few key points connected by straight lines. In their ROBUST-MIPS dataset, every tool in every image is labeled with four standard locations: where it enters the camera’s field of view (the entry point), where the shaft meets the movable or rigid tip (the hinge), and up to two possible tip positions. This design works for both rigid tools, such as probes, and jointed ones, such as graspers and scissors. For tools that only have a single tip, or tools whose tips overlap or disappear from view, the extra point is marked as missing but kept in the same format, so that algorithms always see a consistent structure.
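As a rough illustration of this fixed four-slot layout, the sketch below stores one instrument as a small Python record. The class name, field names, and the choice of connecting lines are assumptions made for illustration; the dataset's own file layout may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]

# Assumed slot order and connections; the article only states that the
# points are joined by straight lines.
KEYPOINT_NAMES = ("entry", "hinge", "tip_a", "tip_b")
SKELETON_EDGES = (("entry", "hinge"), ("hinge", "tip_a"), ("hinge", "tip_b"))


@dataclass
class ToolSkeleton:
    """One instrument as a stick figure; None means the point is missing."""
    entry: Optional[Point]
    hinge: Optional[Point]
    tip_a: Optional[Point]
    tip_b: Optional[Point] = None  # single-tip tools leave this slot empty

    def as_slots(self) -> List[Optional[Point]]:
        """Always return four slots, so every tool has the same structure."""
        return [getattr(self, name) for name in KEYPOINT_NAMES]

    def segments(self) -> List[Tuple[Point, Point]]:
        """Line segments that can be drawn from the points that are present."""
        pts = dict(zip(KEYPOINT_NAMES, self.as_slots()))
        return [(pts[a], pts[b]) for a, b in SKELETON_EDGES
                if pts[a] is not None and pts[b] is not None]


# A rigid probe has one tip; a grasper has two.
probe = ToolSkeleton(entry=(0.0, 100.0), hinge=(200.0, 120.0), tip_a=(260.0, 125.0))
grasper = ToolSkeleton((0.0, 300.0), (180.0, 260.0), (230.0, 240.0), (235.0, 265.0))
print(probe.as_slots())
print(len(grasper.segments()))
```

Keeping the empty slot for single-tip tools means a model can always predict four points and does not need a separate output shape per instrument type.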

Dealing with Hidden and Ambiguous Parts

Real operations are messy, and parts of an instrument are often hidden behind tissue, outside the camera's circular view, or altogether off screen. To handle this, the team adds a visibility label to every key point: visible, occluded (hidden, but its position can be confidently inferred), or missing (position unknown). For example, if only the shaft is visible, the tip locations are marked as missing; if a tip is behind tissue but its position can be inferred from the visible shaft and the tool's shape, it is marked as occluded with estimated coordinates. The authors even allow annotators to place points just beyond the image border when the instrument obviously continues out of frame, ensuring that the "stick figure" stays connected even when only part of it is visible.
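The three visibility states can be sketched as a small enumeration attached to each point. The numeric codes below follow the common convention from human-pose datasets (0 for not labeled, 1 for labeled but hidden, 2 for labeled and visible); that the dataset uses exactly these codes is an assumption, as are the function names.

```python
from enum import IntEnum


class Visibility(IntEnum):
    MISSING = 0   # position unknown, coordinates carry no meaning
    OCCLUDED = 1  # hidden, but position estimated from the visible parts
    VISIBLE = 2   # clearly seen in the image


def label_keypoint(xy, directly_seen):
    """Return an (x, y, visibility) triplet for one key point.

    xy is None when the annotator cannot place the point at all.
    """
    if xy is None:
        return (0.0, 0.0, Visibility.MISSING)
    x, y = xy
    state = Visibility.VISIBLE if directly_seen else Visibility.OCCLUDED
    return (float(x), float(y), state)


def is_outside_frame(x, y, width, height):
    """True for points placed beyond the image border, which is allowed
    when the instrument obviously continues out of frame."""
    return not (0 <= x < width and 0 <= y < height)


# Hypothetical tool in a 960 x 540 frame: the entry point sits just past the
# left border, one tip is behind tissue, and the other tip cannot be located.
entry = label_keypoint((-8, 270), directly_seen=False)
hinge = label_keypoint((420, 300), directly_seen=True)
tip_a = label_keypoint((480, 310), directly_seen=False)
tip_b = label_keypoint(None, directly_seen=False)
print(entry, hinge, tip_a, tip_b)
print(is_outside_frame(entry[0], entry[1], 960, 540))
```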

Figure 2.

Building and Sharing a Rich Training Ground

ROBUST-MIPS is built on top of an earlier, widely used dataset called ROBUST-MIS, which contains 10,040 frames from 30 colorectal surgeries. Each frame already came with detailed tool masks; the new work adds the skeletal labels and cleans up the masks by removing stationary camera ports, which do not help with tool tracking. Every frame is packaged with the original image, a refined mask that includes only the active tools, and a file describing the key points, their visibility, and how they connect. The authors convert this information into a popular standard format, originally developed for human pose estimation, so that many existing algorithms can use the data with minimal extra work.
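The article does not name the standard format. Assuming it resembles the COCO keypoint layout widely used for human pose, a single instrument annotation might be assembled as in the sketch below. The key point names, category name, ids, and coordinates are all hypothetical.

```python
import json

# Assumed names and connections; COCO-style skeleton pairs are 1-indexed.
KEYPOINT_NAMES = ["entry", "hinge", "tip_a", "tip_b"]
SKELETON = [[1, 2], [2, 3], [2, 4]]


def to_coco_style(image_id, ann_id, points):
    """Flatten four (x, y, v) triplets into one COCO-style annotation dict.

    v follows the usual convention: 0 not labeled, 1 hidden, 2 visible.
    """
    flat = [value for triplet in points for value in triplet]
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": 1,
        "keypoints": flat,
        "num_keypoints": sum(1 for _, _, v in points if v > 0),
    }


category = {
    "id": 1,
    "name": "instrument",
    "keypoints": KEYPOINT_NAMES,
    "skeleton": SKELETON,
}

ann = to_coco_style(
    image_id=7,
    ann_id=1,
    points=[(12.0, 240.0, 2), (310.0, 255.0, 2), (365.0, 250.0, 1), (0.0, 0.0, 0)],
)
print(json.dumps({"categories": [category], "annotations": [ann]}, indent=2))
```

Storing the skeleton once per category, rather than per annotation, is what lets existing human-pose tools read the file without modification.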

Putting the Dataset to the Test

To show that these annotations are not just neat on paper, the team trains several leading pose-estimation models—originally designed to track human joints—to follow surgical tools instead. In this setting, each tool point is treated like a human joint. Because the two tips of many instruments are interchangeable, the authors customize the usual scoring method to treat swapping the tips as harmless, rather than a mistake. They also adapt how size is measured so that long, thin tools are judged fairly, no matter how they are rotated in the image. Across thousands of unseen images, the models achieve strong accuracy, suggesting that a handful of well-chosen points is enough for reliable localization, even in the presence of smoke, blood, glare, and overlapping instruments.
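The two scoring adjustments can be illustrated with a simplified similarity measure. The sketch below is not the authors' metric: it uses a Gaussian score per point, in the style of the object keypoint similarity used for human pose, takes the better of the two tip assignments, and uses the skeleton's length as one possible rotation-independent size. The function names and the kappa constant are assumptions.

```python
import math


def skeleton_length(points):
    """Entry-to-hinge distance plus the longer hinge-to-tip distance.

    Unlike a bounding-box area, this does not change when the tool rotates.
    """
    entry, hinge, tip_a, tip_b = points
    length = 0.0
    if entry is not None and hinge is not None:
        length += math.dist(entry, hinge)
    tips = [math.dist(hinge, t) for t in (tip_a, tip_b)
            if hinge is not None and t is not None]
    return length + (max(tips) if tips else 0.0)


def similarity(pred, truth, scale, kappa=0.1):
    """Mean Gaussian score over the ground-truth points that are labeled."""
    scores = []
    for p, t in zip(pred, truth):
        if t is None:
            continue  # unlabeled ground truth is not scored
        if p is None:
            scores.append(0.0)  # the model failed to predict a labeled point
            continue
        d2 = (p[0] - t[0]) ** 2 + (p[1] - t[1]) ** 2
        scores.append(math.exp(-d2 / (2 * (scale * kappa) ** 2)))
    return sum(scores) / len(scores) if scores else 0.0


def swap_tolerant_similarity(pred, truth, scale, kappa=0.1):
    """Score both tip orderings and keep the better one."""
    swapped = [pred[0], pred[1], pred[3], pred[2]]
    return max(similarity(pred, truth, scale, kappa),
               similarity(swapped, truth, scale, kappa))


# Hypothetical grasper: the prediction is perfect except the tips are swapped.
truth = [(0.0, 0.0), (100.0, 0.0), (120.0, 10.0), (120.0, -10.0)]
pred = [(0.0, 0.0), (100.0, 0.0), (120.0, -10.0), (120.0, 10.0)]
scale = skeleton_length(truth)
print(similarity(pred, truth, scale))
print(swap_tolerant_similarity(pred, truth, scale))
```

Under the plain score the swapped tips are penalized as two localization errors, while the swap-tolerant version treats the prediction as an exact match.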

What This Means for Future Surgery

ROBUST-MIPS shows that representing surgical instruments as simple skeletal outlines can provide rich, practical information at a fraction of the labeling cost of pixel-wise masks. By releasing the dataset, the custom labeling software, and ready-to-use benchmark models, the authors give the community a solid foundation for building smarter systems that track tools robustly across different patients and procedures. In the long run, such capabilities could help power safer navigation, real-time safety checks, and more intuitive automation in the operating room.

Citation: Han, Z., Budd, C., Zhang, G. et al. ROBUST-MIPS: A Combined Skeletal Pose and Instance Segmentation Dataset for Laparoscopic Surgical Instruments. Sci Data 13, 684 (2026). https://doi.org/10.1038/s41597-026-06938-5

Keywords: surgical tool tracking, laparoscopic surgery, pose estimation, medical imaging dataset, computer-assisted surgery