Chen Yulin's Blog

Posted 2025-05-23 | Updated 2025-07-24 | Note | 3 minutes read (About 435 words)

Repository:
https://github.com/PSGBOT/pixtral-12B-Inference

Uploading a Local Image

import base64


def encode_image(image_path):
    """Encode the image to base64; return None if it cannot be read."""
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')
    except FileNotFoundError:
        print(f"Error: The file {image_path} was not found.")
        return None
    except Exception as e:  # Catch-all for other I/O errors (e.g. permissions)
        print(f"Error: {e}")
        return None
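
The base64 string still has to be wrapped into a chat message before it reaches the model. Below is a minimal sketch of that step, assuming the message layout from Mistral's vision examples (a `text` part plus an `image_url` part holding a data URL). The helper names `to_data_url` and `build_vision_message` are hypothetical and do not come from the repository.

```python
import base64
import mimetypes


def to_data_url(image_bytes, mime_type="image/jpeg"):
    """Wrap raw image bytes in a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_message(prompt, image_path):
    """Build one user message holding the text prompt and the local image."""
    # Guess the MIME type from the file extension; JPEG is an assumed fallback.
    mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
    with open(image_path, "rb") as image_file:
        data_url = to_data_url(image_file.read(), mime_type)
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": data_url},
            ],
        }
    ]
```

The returned list can be passed as the `messages` argument of a chat request. Unlike `encode_image`, this sketch lets file errors propagate to the caller instead of returning None.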

Prompt

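Prompt for the VLM object description:

Core requirements: accurately locate where the object is, avoid identifying distant background scenery as an object, and reduce false positives.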

Focus on the area highlighted in green in the image.

Step 1: Determine if the highlighted area represents a distinct, identifiable object or instance:
- If the highlighted area is clearly a distinct object, proceed to Step 2.
- If the highlighted area is abstract, ambiguous, or you cannot confidently identify it as a specific object (e.g., part of background, texture, partial view), respond with "Valid: No".

Step 2: If the highlighted area is a distinct object, provide:
1. The specific name of the object (be precise and use technical terms when appropriate)
2. The primary function or purpose of this object
3. Any notable features visible in the highlighted area (no color description)
4. If there is text visible on the object, include what it says

Remember, if you're uncertain about the highlighted area being a distinct object, respond only with "Valid: No".

Output:

Valid

Valid: Yes

1. The specific name of the object: Soap dispenser
2. The primary function or purpose of this object: To dispense liquid soap or hand sanitizer.
3. Notable features visible in the highlighted area:
	- The dispenser has a pump mechanism at the top.
	- The body of the dispenser is cylindrical.
	- The material appears to be translucent plastic.
4. There is no visible text on the object.

Invalid

Valid: No
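
Since both responses above carry a `Valid:` line, a cheap pre-check can read that flag before any further processing. This is only a sketch of one possible filter, not something the repository is known to do, and `read_valid_flag` is a hypothetical name.

```python
import re


def read_valid_flag(vlm_output):
    """Return True for "Valid: Yes", False for "Valid: No", None if absent."""
    match = re.search(
        r"^\s*valid\s*:\s*(yes|no)\b",
        vlm_output,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    if match is None:
        # The model ignored the requested format; let the caller decide.
        return None
    return match.group(1).lower() == "yes"
```

Returning None rather than False for a missing flag keeps "the model said no" separate from "the model did not follow the format", which matters when tracking false positives.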

VLM Output -> Structured Output

A second LLM parses the VLM output and converts it into JSON, using the interface provided by Mistral AI:

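from typing import List, Optional

from pydantic import BaseModel, Field


class Instance(BaseModel):
    valid: str  # validity flag taken from the VLM answer
    name: Optional[str] = None
    feature: Optional[List[str]] = Field(default_factory=list)
    usage: Optional[List[str]] = Field(default_factory=list)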

def parse_description_msg(msg):
    message = [
        {"role": "system", "content": "Extract the description information."},
        {
            "role": "user",
            "content": msg,
        },
    ]
    return message

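# This fragment runs inside a method (hence `self`); `msg` is the list built
# by parse_description_msg, and `json` must be imported at module level.
chat_response = self.client.chat.parse(
    model=self.llm,
    messages=msg,
    response_format=Instance,
    max_tokens=self.llm_max_tokens,
    temperature=self.llm_temperature,
)
# The message content is a JSON string, decoded here into a dict.
return json.loads(chat_response.choices[0].message.content)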
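
Because `valid` is a free-form string and the list fields are optional, downstream code may want predictable types. The following is a minimal standard-library sketch of such a normalization step. The function `normalize_instance` is hypothetical, and treating "yes"/"true" as the only truthy spellings is an assumption.

```python
import json


def normalize_instance(raw_json):
    """Turn the parser LLM's JSON string into a dict with predictable types."""
    data = json.loads(raw_json)
    # Assumption: the parser writes "Yes"/"No" (or "true"/"false") into `valid`.
    valid = str(data.get("valid", "")).strip().lower() in {"yes", "true"}
    if not valid:
        # Drop any stray fields so invalid regions always look the same.
        return {"valid": False, "name": None, "feature": [], "usage": []}
    return {
        "valid": True,
        "name": data.get("name"),
        "feature": list(data.get("feature") or []),
        "usage": list(data.get("usage") or []),
    }
```

For example, `normalize_instance('{"valid": "No"}')` gives a record with `valid` set to False and empty lists, so later stages can filter on a boolean instead of comparing strings.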