
Mariusz Kurman PRO

mkurman

AI & ML interests

AI Tech Lead | MD

Recent Activity

posted an update about 15 hours ago

How Do I Contribute (HDIC)

Exciting times to come? We are working on a layer "self-esteem" technique in which layers score their own contribution to the final prediction. So far, it unlocks a lot of knowledge already stored in the weights that we couldn't force the model to extract through further fine-tuning!

Reacted to AdinaY's post with 🔥 about 18 hours ago

HunyuanVideo 📹 The new open video generation model by Tencent!
👉 https://huggingface.co/tencent/HunyuanVideo
https://huggingface.co/collections/zh-ai-community/video-models-666afd86cfa4e4dd1473b64c
✨ 13B parameters: Probably the largest open video model to date
✨ Unified architecture for image & video generation
✨ Powered by advanced features: MLLM Text Encoder, 3D VAE, and Prompt Rewrite
✨ Delivers stunning visuals, diverse motion, and unparalleled stability
🔓 Fully open with code & weights

Reacted to singhsidhukuldeep's post with 🤗 about 18 hours ago

Exciting breakthrough in Document AI! Researchers from UNC Chapel Hill and Bloomberg have developed M3DocRAG, a revolutionary framework for multi-modal document understanding.

The innovation lies in its ability to handle complex document scenarios that traditional systems struggle with:
- Process 40,000+ pages across 3,000+ documents
- Answer questions requiring information from multiple pages
- Understand visual elements like charts, tables, and figures
- Support both closed-domain (single document) and open-domain (multiple documents) queries

Under the hood, M3DocRAG operates through three sophisticated stages:

>> Document Embedding:
- Converts PDF pages to RGB images
- Uses ColPali to project both text queries and page images into a shared embedding space
- Creates dense visual embeddings for each page while maintaining visual information integrity

>> Page Retrieval:
- Employs MaxSim scoring to compute relevance between queries and pages
- Implements inverted file indexing (IVFFlat) for efficient search
- Reduces retrieval latency from 20s to under 2s when searching 40K+ pages
- Supports approximate nearest neighbor search via Faiss

>> Question Answering:
- Leverages Qwen2-VL 7B as the multi-modal language model
- Processes retrieved pages through a visual encoder
- Generates answers considering both textual and visual context

The results are impressive:
- State-of-the-art performance on MP-DocVQA benchmark
- Superior handling of non-text evidence compared to text-only systems
- Significantly better performance on multi-hop reasoning tasks

This is a game-changer for industries dealing with large document volumes: finance, healthcare, and legal sectors can now process documents more efficiently while preserving crucial visual context.
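The MaxSim scoring mentioned in the Page Retrieval stage can be illustrated with a small sketch. This is not M3DocRAG's code; it is a minimal, brute-force illustration of late-interaction scoring as commonly described for ColBERT/ColPali-style retrievers: for each query token embedding, take the highest dot product against any page patch embedding, then sum those maxima. The function names (`maxsim_score`, `rank_pages`) and the toy 2-dimensional embeddings are hypothetical, and the exhaustive loop here stands in for the approximate Faiss IVFFlat search the post describes.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], page_patches: Sequence[Vector]) -> float:
    """MaxSim: for each query token, keep its best-matching page patch
    (maximum dot product), then sum these maxima over all query tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(
    query_tokens: Sequence[Vector],
    pages: Sequence[Sequence[Vector]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Score every page exhaustively and return the top_k (page_index, score) pairs.
    Assumption: exact brute-force search, not an approximate index."""
    scores = [maxsim_score(query_tokens, page) for page in pages]
    order = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


# Toy data (hypothetical): 2 query tokens, 3 pages with a few patch embeddings each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # matches both query tokens
    [[1.0, 0.0], [0.7, 0.0]],              # matches only the first query token
    [[0.0, 0.5]],                          # weak match on the second query token
]

ranking = rank_pages(query, pages, top_k=2)
print(ranking)
```

Because each query token picks its own best patch, a page can score well when different regions of it answer different parts of the query, which is one reason this style of scoring suits pages that mix text, tables, and figures.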



Posts 4


How Do I Contribute (HDIC)

Exciting times to come? We are working on a layer "self-esteem" technique in which layers score their own contribution to the final prediction. So far, it unlocks a lot of knowledge already stored in the weights that we couldn't force the model to extract through further fine-tuning!


What AI-enhanced research tools would you recommend for searching and analyzing scientific papers?


Spaces 1

Running


Llama 3.2 SUN 2.5B Chat Gguf

You can try MedIT Solutions' latest release of SUN 2.5B Llama

Models

None public yet

Datasets

None public yet