
Kuldeep Singh Sidhu

singhsidhukuldeep

AI & ML interests

😃 TOP 3 on HuggingFace for posts 🤗 Seeking contributors for a completely open-source 🚀 Data Science platform! singhsidhukuldeep.github.io


singhsidhukuldeep's activity

posted an update about 12 hours ago
It's always exciting to revisit Google's DCN paper—impractical but good!

Deep & Cross Network (DCN) - a groundbreaking approach to click-through rate prediction that's revolutionizing digital advertising!

Key Innovation:
DCN introduces a novel cross-network architecture that automatically learns feature interactions without manual engineering. What sets it apart is its ability to explicitly model bounded-degree feature crossings while maintaining the power of deep neural networks.

Technical Deep Dive:
- The architecture combines a cross network with a deep network in parallel.
- The cross network performs automatic feature crossing at each layer.
- The embedding layer transforms sparse categorical features into dense vectors.
- Cross layers use the update x_{l+1} = x_0 · (x_lᵀ w_l) + b_l + x_l, which builds efficient high-degree polynomial feature interactions (a minimal sketch follows this list).
- Memory-efficient design with linear complexity O(d) in the input dimension.
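
A minimal PyTorch sketch of that cross layer (the input dimension and number of layers below are placeholders, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One DCN cross layer: x_{l+1} = x0 * (x_l^T w) + b + x_l."""
    def __init__(self, input_dim: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(input_dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(input_dim))

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        # (batch, d) dotted with (d,) gives (batch, 1); the outer product with x0
        # is never materialized, which keeps the cost linear in the input dimension d.
        xl_w = (xl * self.w).sum(dim=1, keepdim=True)
        return x0 * xl_w + self.b + xl

class CrossNetwork(nn.Module):
    def __init__(self, input_dim: int, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(CrossLayer(input_dim) for _ in range(num_layers))

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        x = x0
        for layer in self.layers:
            x = layer(x0, x)   # cross depth controls the highest polynomial degree
        return x
```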

Performance Highlights:
- Outperforms traditional DNN models with 60% less memory usage.
- Achieved 0.4419 logloss on the Criteo Display Ads dataset.
- Consistently performs better than state-of-the-art models like Deep Crossing and Factorization Machines.
- Exceptional performance on non-CTR tasks like Forest Covertype (97.40% accuracy).

Under the Hood:
- Uses embedding vectors of dimension 6 × (category cardinality)^(1/4).
- Implements batch normalization and the Adam optimizer.
- The cross network depth determines the highest polynomial degree of feature interactions.
- An efficient projection mechanism reduces cubic computational cost to linear.
- Parameter sharing enables better generalization to unseen feature interactions.

Key Advantages:
1. No manual feature engineering required.
2. Explicit feature crossing at each layer.
3. Highly memory-efficient.
4. Scalable to web-scale data.
5. Robust performance across different domains.

Thoughts on how this could transform digital advertising?
posted an update 1 day ago
Sorry judge, my lawyer hallucinated? 😂 If you get an AI lawyer, you would want it to be hallucination-free!

New research from @Stanford and @Yale reveals surprising findings about leading AI legal research tools. Here's what you need to know:

>> Key Findings
The study tested LexisNexis (Lexis+ AI), Thomson Reuters (Westlaw AI & Ask Practical Law AI), and GPT-4, finding hallucination rates between 17% and 33% despite claims of being "hallucination-free".

>> Technical Deep Dive
The research evaluated these tools, which are built on a Retrieval-Augmented Generation (RAG) architecture that operates in two crucial steps (a minimal sketch follows the breakdown):

1. Retrieval System:
- Uses neural text embeddings to capture semantic meaning
- Employs both lexical and semantic search mechanisms
- Implements document filtering and extraction
- Retrieves relevant legal documents from vast databases

2. Generation Pipeline:
- Processes retrieved documents alongside original queries
- Synthesizes information from multiple legal sources
- Generates responses based on retrieved context
- Includes citation verification mechanisms
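
A minimal sketch of that retrieve-then-generate loop (the callables embed, keyword_score, and generate below are placeholders, not any vendor's actual API):

```python
from typing import Callable, List

def rag_answer(
    query: str,
    passages: List[str],
    embed: Callable[[str], List[float]],          # neural text embedder (semantic search)
    keyword_score: Callable[[str, str], float],   # lexical scorer, e.g. BM25-style
    generate: Callable[[str], str],               # the LLM
    k: int = 5,
) -> str:
    """Step 1: retrieve relevant documents. Step 2: generate grounded on them."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    q_vec = embed(query)
    # Hybrid retrieval: combine semantic similarity with a lexical score.
    retrieved = sorted(
        passages,
        key=lambda p: dot(q_vec, embed(p)) + keyword_score(query, p),
        reverse=True,
    )[:k]

    # Generation: retrieved context is placed alongside the original query.
    context = "\n\n".join(retrieved)
    prompt = (
        f"Answer the legal question using only the sources below.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer with citations:"
    )
    return generate(prompt)
```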

>> Performance Breakdown:
- Lexis+ AI: 65% accuracy rate
- Westlaw AI: 42% accuracy rate
- Ask Practical Law AI: Over 60% incomplete answers

>> Why This Matters
This research exposes critical vulnerabilities in AI legal tools that lawyers increasingly rely on. It's essential for legal professionals to understand these limitations when incorporating AI into their practice.
posted an update 3 days ago
Exciting breakthrough in LLM reasoning: Introducing "Thread of Thought" (ThoT) - a novel prompting strategy that revolutionizes how language models handle chaotic contexts!

Unlike traditional approaches that struggle with complex, interleaved information, ThoT enables LLMs to methodically segment and analyze extended contexts with remarkable precision. Here's how it works:

Technical Deep Dive:
- ThoT employs a two-step prompting mechanism:
1. Initial Analysis: Uses a template combining chaotic context (X) and query (Q) with a trigger sentence that initiates systematic reasoning (a prompt sketch follows this list).
2. Conclusion Refinement: Leverages the organized thought sequence to extract definitive answers.
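
A minimal sketch of the two-step prompting (llm below is a placeholder for any text-in/text-out call; the trigger wording paraphrases the paper):

```python
def thread_of_thought(llm, chaotic_context: str, query: str) -> str:
    """Two-step Thread-of-Thought prompting; `llm` is any text-completion callable."""
    # Step 1: initial analysis with the ThoT trigger sentence.
    step1 = (
        f"{chaotic_context}\n"
        f"Q: {query}\n"
        "Walk me through this context in manageable parts step by step, "
        "summarizing and analyzing as we go."
    )
    reasoning = llm(step1)

    # Step 2: conclusion refinement on top of the organized thought sequence.
    step2 = (
        f"{chaotic_context}\n"
        f"Q: {query}\n"
        f"{reasoning}\n"
        "Therefore, the answer is:"
    )
    return llm(step2)
```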

Implementation Details:
- Seamlessly integrates as a "plug-and-play" module with existing LLMs.
- Requires no model retraining or fine-tuning.
- Works with various prompting techniques and model architectures.

Performance Highlights:
- Outperformed traditional methods on PopQA and EntityQ datasets.
- Achieved 57.4% accuracy with GPT-3.5-turbo (vs. 48.2% for Chain-of-Thought).
- Demonstrated superior performance across model scales, from 7B to 70B parameters.

Key Applications:
- Retrieval-augmented generation.
- Multi-turn conversation responses.
- Complex reasoning tasks requiring information synthesis.

What makes it special: ThoT mirrors human cognitive processes by breaking down complex information into manageable segments while maintaining logical continuity – a game-changer for handling information-dense contexts.
posted an update 4 days ago
Good folks at @nvidia and @Tsinghua_Uni have released LLaMA-Mesh - A Revolutionary Approach to 3D Content Generation!

This innovative framework enables the direct generation of 3D meshes from natural language prompts while maintaining strong language capabilities.

Here is the Architecture & Implementation!

>> Core Components

Model Foundation
- If you haven't guessed it yet, it's built on the LLaMA-3.1-8B-Instruct base model
- Maintains original language capabilities while adding 3D generation
- Context length is set to 8,000 tokens

3D Representation Strategy
- Uses the OBJ file format for mesh representation
- Quantizes vertex coordinates into 64 discrete bins per axis
- Sorts vertices by z-y-x coordinates, from lowest to highest
- Sorts faces by their lowest vertex indices for consistency (a preprocessing sketch follows this list)
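
A rough NumPy sketch of that preprocessing, assuming 64 bins and the sort order above (the function and variable names are illustrative, not the paper's code):

```python
import numpy as np

def preprocess_mesh(vertices: np.ndarray, faces: np.ndarray, bins: int = 64):
    """Quantize vertex coordinates into discrete bins and sort vertices/faces canonically."""
    # Normalize to [0, 1] per axis, then quantize into `bins` integer levels.
    v_min, v_max = vertices.min(axis=0), vertices.max(axis=0)
    quantized = np.floor((vertices - v_min) / (v_max - v_min + 1e-8) * (bins - 1)).astype(int)

    # Sort vertices by z, then y, then x (lowest to highest).
    order = np.lexsort((quantized[:, 0], quantized[:, 1], quantized[:, 2]))
    quantized = quantized[order]

    # Remap face indices to the new vertex order, then sort faces by their lowest vertex index.
    remap = np.empty(len(order), dtype=int)
    remap[order] = np.arange(len(order))
    faces = remap[faces]
    faces = faces[np.argsort(faces.min(axis=1))]
    return quantized, faces
```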

Data Processing Pipeline
- Filters meshes to a maximum of 500 faces for computational efficiency
- Applies random rotations (0°, 90°, 180°, 270°) for data augmentation
- Generates ~125k mesh variations from 31k base meshes
- Uses Cap3D-generated captions for text descriptions

>> Training Framework

Dataset Composition
- 40% Mesh Generation tasks
- 20% Mesh Understanding tasks
- 40% General Conversation (UltraChat dataset)
- 8x training turns for generation, 4x for understanding

Training Configuration
- Deployed on 32 A100 GPUs (for Nvidia, this is literally in-house)
- 21,000 training iterations
- Global batch size: 128
- AdamW optimizer with a 1e-5 learning rate
- 30-step warmup with cosine scheduling
- Total training time: approximately 3 days (based on the paper)

This research opens exciting possibilities for intuitive 3D content creation through natural language interaction. The future of digital design is conversational!
posted an update 6 days ago
It's not every day you see the No. 1 ranked paper of the day open-sourcing a very powerful image editing app!

Fascinating to see MagicQuill - a groundbreaking interactive image editing system that makes precise photo editing effortless through advanced AI!

The system's architecture features three sophisticated components:

1. Editing Processor:
- Implements a dual-branch architecture integrated into a latent diffusion framework
- Utilizes PiDiNet for edge map extraction and content-aware per-pixel inpainting
- Features a specialized UNet architecture with zero-convolution layers for feature insertion
- Employs denoising score matching for training the control branch
- Processes both structural modifications via scribble guidance and color manipulation through downsampled color blocks
- Maintains pixel-level control through VAE-based latent space operations

2. Painting Assistor:
- Powered by a fine-tuned LLaVA multimodal LLM using Low-Rank Adaptation (LoRA)
- Trained on a custom dataset derived from Densely Captioned Images (DCI)
- Processes user brushstrokes through specialized Q&A tasks for add/subtract/color operations
- Features bounding box coordinate normalization for precise stroke localization
- Implements streamlined single-word/phrase outputs for real-time performance

3. Idea Collector:
- Built as a modular ReactJS component library
- Supports cross-platform deployment via HTTP protocols
- Compatible with Gradio and ComfyUI frameworks
- Features comprehensive layer management and parameter adjustment capabilities
- Implements real-time canvas updates and preview generation

The system outperforms existing solutions like SmartEdit and BrushNet in edge alignment and color fidelity while maintaining seamless integration with popular AI frameworks.

What are your thoughts on AI-powered creative tools?
replied to m-ric's post 6 days ago
replied to maxiw's post 6 days ago
posted an update 7 days ago
Sometimes, we forget that all these LLMs are trained on just raw text. Fundamentally, they are simply text-completion models. Imagine a model that keeps on writing follow-up questions when you ask, "How to make pizza?" rather than answering you!

That's where Instruction Tuning comes in—it’s a game-changer.

Instruction tuning has revolutionized how we interact with Large Language Models (LLMs), bridging the crucial gap between raw model capabilities and practical applications.

It’s what transforms a GPT into ChatGPT!

Think of instruction tuning as teaching AI to "speak human"—it's the difference between a model that merely predicts the next words and one that truly understands and executes our intentions.

The real magic? It enables zero-shot learning, meaning models can tackle new tasks they've never encountered before, as long as the instructions are clear. This versatility is what makes modern AI assistants so powerful and user-friendly.
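
To make the contrast concrete, here is a tiny, hypothetical example of an instruction-tuning training pair (the formatting template below is illustrative, not any specific dataset's):

```python
# Raw text-completion behavior (hypothetical):
#   Prompt: "How to make pizza?"
#   Continuation: "How to make pasta? How to make bread? ..."
#
# Instruction tuning instead trains on explicit (instruction, response) pairs:
example = {
    "instruction": "How to make pizza?",
    "response": "1. Prepare the dough. 2. Add sauce and toppings. 3. Bake until golden.",
}

# At training time the pair is flattened into a single sequence, e.g.:
formatted = (
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Response:\n{example['response']}"
)
print(formatted)
```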
replied to maxiw's post 7 days ago

Enough to make a grown man cry! 😃🤗

Anyway, for the whole of next week I will be posting about the best papers (according to me) on ways to reduce hallucinations, one every day (seven in total)... cheers 😬

posted an update 13 days ago
Thinking about upgrading from Python 3.10 to 3.11? Here's why you should make the move - a deep technical breakdown that might convince you:

>> Performance Revolution
The performance improvements are staggering, with benchmarks showing 10-60% faster execution across different workloads. Let me break down the game-changing features:

>> Core Architecture Changes
Python 3.11's interpreter now uses statically allocated core modules, eliminating the multi-step loading process we've dealt with in 3.10. This means your applications will start 10-15% faster out of the gate.

>> Function Optimization
The redesigned frame objects are a thing of beauty - they've been stripped of unnecessary baggage, resulting in a 3-7% speedup for all function calls. But it gets better: function calls are now inlined, giving us a 1-3% boost, with recursive functions like Fibonacci seeing up to 1.7x improvement!

>> Adaptive Intelligence
The new Specializing Interpreter is perhaps the most exciting addition. Think of it as a lightweight JIT - it identifies hot code paths and optimizes them automatically.

The interpreter now automatically specializes math operations, array indexing, and even sequence unpacking based on actual usage patterns.

>> Exception Handling Revolution
My favorite feature? Zero-cost exceptions! Your try-except blocks no longer carry overhead when no exceptions occur. The code runs at full speed until an exception actually happens.
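
If you want to see this yourself, here is a small benchmark sketch (absolute numbers will vary by machine; the point is only that the try/except adds essentially nothing when no exception is raised):

```python
import timeit

def without_try(n: int = 1_000) -> int:
    total = 0
    for i in range(n):
        total += i
    return total

def with_try(n: int = 1_000) -> int:
    total = 0
    for i in range(n):
        try:
            total += i          # never raises on this path
        except ValueError:
            total = 0
    return total

# On 3.11+, the two timings should be nearly identical on the happy path.
print("no try/except:  ", timeit.timeit(without_try, number=10_000))
print("with try/except:", timeit.timeit(with_try, number=10_000))
```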

Ready to make the switch? These improvements aren't just numbers - they're real-world performance gains waiting to be unlocked in your codebase.
posted an update 17 days ago
Exciting Research Alert: Revolutionizing Dense Passage Retrieval with Entailment Tuning!

The good folks at HKUST have developed a novel approach that significantly improves information retrieval by leveraging natural language inference.

The entailment tuning approach consists of several key steps to enhance dense passage retrieval performance.

Data Preparation
- Convert questions into existence claims using rule-based transformations.
- Combine retrieval data with NLI data from SNLI and MNLI datasets.
- Unify the format of both data types using a consistent prompting framework.

Entailment Tuning Process
- Initialize the model using pre-trained language models like BERT or RoBERTa.
- Apply aggressive masking (β=0.8) specifically to the hypothesis components while preserving premise information.
- Train the model to predict the masked hypothesis tokens from the premise content.
- Run the training for 10 epochs using 8 GPUs, taking approximately 1.5-3.5 hours.

Training Arguments for Entailment Tuning (Yes! They Shared Them)
- Use a learning rate of 2e-5 with 100 warmup steps.
- Set batch size to 128.
- Apply weight decay of 0.01.
- Utilize the Adam optimizer with beta values (0.9, 0.999).
- Maintain maximum gradient norm at 1.0 (a configuration sketch follows this list).
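
A hedged sketch of how those numbers map onto a Hugging Face TrainingArguments object (this uses the transformers library's field names; the paper's actual training script may differ):

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters listed above; everything else is left at defaults.
args = TrainingArguments(
    output_dir="entailment-tuning",
    learning_rate=2e-5,
    warmup_steps=100,
    per_device_train_batch_size=16,   # 16 x 8 GPUs = global batch size 128
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    max_grad_norm=1.0,
    num_train_epochs=10,
)
```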

Deployment
- Index passages using FAISS for efficient retrieval.
- Shard vector store across multiple GPUs.
- Enable sub-millisecond retrieval of the top-100 passages per query (indexing sketched below).
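
And a minimal FAISS sketch of that indexing and search step (the dimension and random embeddings below are placeholders for real passage/query encodings):

```python
import faiss
import numpy as np

d = 768                                                            # embedding dimension (placeholder)
passage_embeddings = np.random.rand(10_000, d).astype("float32")   # stand-in for encoded passages
query_embeddings = np.random.rand(4, d).astype("float32")          # stand-in for encoded queries

index = faiss.IndexFlatIP(d)       # inner-product index; normalize vectors for cosine similarity
index.add(passage_embeddings)

scores, ids = index.search(query_embeddings, 100)   # top-100 passages per query
```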

Integration with Existing Systems
- Insert entailment tuning between pre-training and fine-tuning stages.
- Maintain compatibility with current dense retrieval methods.
- Preserve existing contrastive learning approaches during fine-tuning.

Simple, intuitive, and effective!

This advancement significantly improves the quality of retrieved passages for question-answering systems and retrieval-augmented generation tasks.
posted an update 25 days ago
Good folks from @Microsoft have released an exciting breakthrough in GUI automation!

OmniParser – a game-changing approach for pure vision-based GUI agents that works across multiple platforms and applications.

Key technical innovations:
- Custom-trained interactable icon detection model using 67k screenshots from popular websites
- Specialized BLIP-v2 model fine-tuned on 7k icon-description pairs for extracting functional semantics
- Novel combination of icon detection, OCR, and semantic understanding to create structured UI representations (a hypothetical example follows this list)
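
Conceptually, the structured output looks something like this (a hypothetical sketch, not OmniParser's actual schema):

```python
# Hypothetical shape of a parsed screenshot: detected regions plus OCR text and
# icon descriptions, merged into one structured list an agent can act on.
parsed_screen = [
    {"id": 0, "type": "icon", "bbox": [0.12, 0.05, 0.18, 0.09],
     "description": "settings gear button", "interactable": True},
    {"id": 1, "type": "text", "bbox": [0.20, 0.05, 0.55, 0.09],
     "text": "Account settings", "interactable": False},
    # ... one entry per detected element
]
```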

The results are impressive:
- Outperforms GPT-4V baseline by significant margins on the ScreenSpot benchmark
- Achieves 73% accuracy on Mind2Web without requiring HTML data
- Demonstrates a 57.7% success rate on AITW mobile tasks

What makes OmniParser special is its ability to work across platforms (mobile, desktop, web) using only screenshot data – no HTML or view hierarchy needed. This opens up exciting possibilities for building truly universal GUI automation tools.

The team has open-sourced both the interactable region detection dataset and icon description dataset to accelerate research in this space.

Kudos to the Microsoft Research team for pushing the boundaries of what's possible with pure vision-based GUI understanding!

What are your thoughts on vision-based GUI automation?
posted an update 27 days ago
Good folks from @Microsoft Research have just released bitnet.cpp, a game-changing inference framework that achieves remarkable performance gains.

Key Technical Highlights:
- Achieves speedups of up to 6.17x on x86 CPUs and 5.07x on ARM CPUs
- Reduces energy consumption by 55.4–82.2%
- Enables running 100B parameter models at human reading speed (5–7 tokens/second) on a single CPU

Features Three Optimized Kernels:
1. I2_S: Uses a 2-bit weight representation (see the packing sketch after this list)
2. TL1: Implements 4-bit index lookup tables for every two weights
3. TL2: Employs 5-bit compression for every three weights
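
As a rough illustration of 2-bit weight storage (a sketch only, not bitnet.cpp's actual kernel layout): ternary weights {-1, 0, +1} can be packed four per byte.

```python
import numpy as np

def pack_ternary(weights: np.ndarray) -> np.ndarray:
    """Pack ternary weights (-1, 0, +1) into 2 bits each, four per byte (length must be a multiple of 4)."""
    codes = (weights + 1).astype(np.uint8)            # map {-1, 0, +1} -> {0, 1, 2}
    codes = codes.reshape(-1, 4)
    return (codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return codes.reshape(-1).astype(np.int8) - 1       # back to {-1, 0, +1}

w = np.random.choice([-1, 0, 1], size=16)
assert np.array_equal(unpack_ternary(pack_ternary(w)), w)
```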

Performance Metrics:
- Lossless inference with 100% accuracy compared to full-precision models
- Tested across model sizes from 125M to 100B parameters
- Evaluated on both Apple M2 Ultra and Intel i7-13700H processors

This breakthrough makes running large language models locally more accessible than ever, opening new possibilities for edge computing and resource-constrained environments.
posted an update 28 days ago
If you have 300+ GB of VRAM, you can run Mochi from @genmo

A SOTA model that dramatically closes the gap between closed and open video generation models.

Mochi 1 introduces revolutionary architecture featuring joint reasoning over 44,520 video tokens with full 3D attention. The model implements extended learnable rotary positional embeddings (RoPE) in three dimensions, with network-learned mixing frequencies for space and time axes.

The model incorporates cutting-edge improvements, including:
- SwiGLU feedforward layers
- Query-key normalization for enhanced stability
- Sandwich normalization for controlled internal activations

What is currently available?
The base model delivers impressive 480p video generation with exceptional motion quality and prompt adherence. Released under the Apache 2.0 license, it's freely available for both personal and commercial applications.

What's Coming?
Genmo has announced Mochi 1 HD, scheduled for release later this year, which will feature:
- Enhanced 720p resolution
- Improved motion fidelity
- Better handling of complex scene warping
posted an update about 1 month ago
Looks like @Meta thinks we forgot they created PyTorch, so now they've open-sourced Lingua, a powerful and flexible library for training and inferencing large language models.

Things that stand out:

- Architecture: Pure PyTorch nn.Module implementation for easy customization.

- Checkpointing: Uses the new PyTorch distributed saving method (.distcp format) for flexible model reloading across different GPU configurations.

- Configuration: Utilizes data classes and YAML files for intuitive setup and modification (pattern sketched below).

- Profiling: Integrates with xFormers' profiler for automatic MFU and HFU calculation, plus memory profiling.

- Slurm Integration: Includes stool.py for seamless job launching on Slurm clusters.
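
A tiny sketch of the dataclass-plus-YAML configuration pattern (assuming PyYAML; the field names below are illustrative, not Lingua's actual config schema):

```python
from dataclasses import dataclass
import yaml  # PyYAML

@dataclass
class TrainConfig:
    model_dim: int = 1024
    n_layers: int = 16
    lr: float = 3e-4
    batch_size: int = 64

yaml_text = """
model_dim: 2048
n_layers: 24
lr: 1.0e-4
"""

# Fields present in the YAML override the dataclass defaults.
cfg = TrainConfig(**yaml.safe_load(yaml_text))
print(cfg)
```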

Some results from @Meta to show off:

- 1B parameter models trained on 60B tokens achieve strong performance across various NLP tasks.

- 7B parameter Mamba model (trained on 200B tokens) shows competitive results with Llama 7B on benchmarks like ARC, MMLU, and BBH.

If you're working on LLM research or looking to experiment with cutting-edge language model architectures, Lingua is definitely worth exploring.
posted an update about 1 month ago
Good folks at @Apple have developed a novel method called KV Prediction that significantly reduces the "time to first token" (TTFT) for on-device LLM inference.

Some highlights of the paper:

• Uses a small auxiliary transformer model to efficiently predict the KV cache of a larger base model
• Reduces TTFT by up to 4x while retaining 60-80% accuracy on benchmarks
• Achieves Pareto-optimal efficiency-accuracy trade-off compared to baselines
• Demonstrates 15-50% relative accuracy improvements on TriviaQA at equal TTFT FLOP budgets
• Shows up to 30% accuracy gains on HumanEval code completion at fixed TTFT FLOP counts
• Validated on Apple M2 Pro CPU, proving FLOP gains translate to real-world speedups


So, how's it done?

Based on the KV Prediction method described in the paper, here are the key steps (a sketch of the KV predictor follows the list):

1. Choose a base model and an auxiliary model:
- The base model is a larger, pretrained transformer model that will be used for final generation.
- The auxiliary model is a smaller transformer model used to efficiently process the input prompt.

2. Design the KV predictor:
- Create a set of learned linear projections to map from the auxiliary model's KV cache to the base model's KV cache.
- Define a mapping from auxiliary cache layers to base cache layers.

3. Training process:
- Pass input tokens through the auxiliary model to get its KV cache.
- Use the KV predictor to generate a predicted KV cache for the base model.
- Run the base model using the predicted KV cache and compute losses.
- Backpropagate errors through the frozen base model to update the auxiliary model and KV predictor.

4. Inference process:
- Process the input prompt with the auxiliary model to get its KV cache.
- Use the KV predictor to generate the predicted base model KV cache.
- Run a single token generation step with the base model using the predicted KV cache.
- Continue autoregressive generation with the base model as normal.
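
A minimal PyTorch sketch of step 2, the learned linear projections from the auxiliary model's KV cache to the base model's (the layer mapping and dimensions are placeholders, not the paper's actual settings):

```python
import torch
import torch.nn as nn

class KVPredictor(nn.Module):
    """Predict each base-model layer's (K, V) from a chosen auxiliary-model layer's (K, V)."""
    def __init__(self, aux_dim: int, base_dim: int, n_base_layers: int, layer_map: list[int]):
        super().__init__()
        assert len(layer_map) == n_base_layers   # layer_map[i] = aux layer feeding base layer i
        self.layer_map = layer_map
        self.k_proj = nn.ModuleList(nn.Linear(aux_dim, base_dim) for _ in range(n_base_layers))
        self.v_proj = nn.ModuleList(nn.Linear(aux_dim, base_dim) for _ in range(n_base_layers))

    def forward(self, aux_kv):
        # aux_kv: list of (K, V) pairs from the auxiliary model, each of shape (batch, seq, aux_dim)
        predicted = []
        for i, aux_layer in enumerate(self.layer_map):
            k_aux, v_aux = aux_kv[aux_layer]
            predicted.append((self.k_proj[i](k_aux), self.v_proj[i](v_aux)))
        return predicted   # predicted (K, V) for every base-model layer
```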

Excited to hear your thoughts!
posted an update about 1 month ago
All the way from Korea, a novel approach called Mentor-KD significantly improves the reasoning abilities of small language models.

Mentor-KD introduces an intermediate-sized "mentor" model to augment training data and provide soft labels during knowledge distillation from large language models (LLMs) to smaller models.

Broadly, it’s a two-stage process:
1) Fine-tune the mentor on filtered Chain-of-Thought (CoT) annotations from an LLM teacher.
2) Use the mentor to generate additional CoT rationales and soft probability distributions.

The student model is then trained using:
- CoT rationales from both the teacher and mentor (rationale distillation).
- Soft labels from the mentor (soft label distillation). A combined-loss sketch follows below.
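
A hedged sketch of how the two signals could be combined into one student loss (PyTorch; alpha and temperature below are illustrative, not the paper's values):

```python
import torch
import torch.nn.functional as F

def mentor_kd_loss(student_logits, rationale_labels, mentor_logits,
                   alpha: float = 0.5, temperature: float = 2.0):
    """Rationale distillation (CE on CoT tokens) + soft-label distillation (KL to the mentor)."""
    # Cross-entropy against teacher/mentor-generated CoT rationale tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        rationale_labels.view(-1),
        ignore_index=-100,
    )
    # KL divergence between temperature-softened student and mentor distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(mentor_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1 - alpha) * kl
```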

Results show that Mentor-KD consistently outperforms baselines, with up to 5% accuracy gains on some tasks.

Mentor-KD is especially effective in low-resource scenarios, achieving comparable performance to baselines while using only 40% of the original training data.

This work opens up exciting possibilities for making smaller, more efficient language models better at complex reasoning tasks.

What are your thoughts on this approach?
posted an update about 1 month ago
While Google's Transformer might have introduced "Attention is all you need," Microsoft and Tsinghua University are here with the DIFF Transformer, stating, "Sparse-Attention is all you need."

The DIFF Transformer outperforms traditional Transformers in scaling properties, requiring only about 65% of the model size or training tokens to achieve comparable performance.

The secret sauce? A differential attention mechanism that amplifies focus on relevant context while canceling out noise, leading to sparser and more effective attention patterns.

How?
- It uses two separate softmax attention maps and subtracts them (sketched after this list).
- It employs a learnable scalar λ for balancing the attention maps.
- It implements GroupNorm for each attention head independently.
- It is compatible with FlashAttention for efficient computation.
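
A minimal single-head sketch of the differential attention step (ignoring the per-head GroupNorm and the λ re-parameterization details from the paper):

```python
import torch
import torch.nn.functional as F

def diff_attention(q1, k1, q2, k2, v, lam: torch.Tensor):
    """(softmax(Q1 K1^T / sqrt(d)) - lambda * softmax(Q2 K2^T / sqrt(d))) V"""
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v   # noise common to both maps cancels out

# Toy shapes: (batch, seq, head_dim); Q and K are split into two halves upstream.
b, n, d = 2, 8, 32
q1, q2, k1, k2 = (torch.randn(b, n, d) for _ in range(4))
v = torch.randn(b, n, 2 * d)     # value dim is 2x the split Q/K head dim
out = diff_attention(q1, k1, q2, k2, v, lam=torch.tensor(0.8))
```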

What do you get?
- Superior long-context modeling (up to 64K tokens).
- Enhanced key information retrieval.
- Reduced hallucination in question-answering and summarization tasks.
- More robust in-context learning, less affected by prompt order.
- Mitigation of activation outliers, opening doors for efficient quantization.

Extensive experiments show DIFF Transformer's advantages across various tasks and model sizes, from 830M to 13.1B parameters.

This innovative architecture could be a game-changer for the next generation of LLMs. What are your thoughts on DIFF Transformer's potential impact?
posted an update about 1 month ago
Good folks from Universitat Politècnica de Catalunya, University of Groningen, and Meta have released "A Primer on the Inner Workings of Transformer-based Language Models."

They don't make survey papers like they used to, but this is an exciting new survey on Transformer LM interpretability!

This comprehensive survey provides a technical deep dive into:

• Transformer architecture components (attention, FFN, residual stream)
• Methods for localizing model behavior:
- Input attribution (gradient & perturbation-based)
- Component importance (logit attribution, causal interventions)
• Information decoding techniques:
- Probing, linear feature analysis
- Sparse autoencoders for disentangling features
• Key insights on model internals:
- Attention mechanisms (induction heads, copy suppression)
- FFN neuron behaviors
- Residual stream properties
- Multi-component emergent behaviors

The paper offers a unified notation and connects insights across different areas of interpretability research. It's a must-read for anyone working on understanding large language models!

Some fascinating technical highlights:
- Detailed breakdowns of attention head circuits (e.g., IOI task)
- Analysis of factual recall mechanisms
- Overview of polysemanticity and superposition
- Discussion of grokking as circuit emergence

What interpretability insights do you find most intriguing?
posted an update about 1 month ago
Just started going through the latest "State of AI Report 2024", and I cannot get over the predictions!

The report predicts major developments in AI over the next 12 months, including a $10B+ investment from a sovereign state into a large US AI lab, triggering national security scrutiny, and a viral app created by someone without coding skills.

It forecasts changes in data collection practices due to frontier labs facing trials, softer-than-expected EU AI Act implementations, and the rise of an open-source alternative to OpenAI GPT-4 outperforming in benchmarks.

NVIDIA’s dominance will remain largely unchallenged, investment in humanoid robots will decline, Apple’s on-device AI research will gain momentum, and a research paper by an AI scientist will be accepted at a major conference.

Lastly, a GenAI-based video game is expected to achieve breakout success.

Yet to go through all 200+ pages... will post summarized thoughts later.