SwapAnything is a new method that allows swapping any object in an image with personalized concepts given by a reference image.
Key points:
1️⃣ It uses pre-trained diffusion models to enable precise and high-fidelity object swapping in images.
2️⃣ Targeted variable swapping ensures perfect background preservation while swapping specific areas.
3️⃣ SwapAnything achieves good results in single-object, multi-object, partial-object, and cross-domain swapping tasks.
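The "targeted variable swapping" idea boils down to blending latent variables under an object mask during generation: only the masked region is replaced, so the background latents pass through untouched. Below is a minimal, illustrative sketch of that masked-blending step in plain PyTorch; the tensor names, shapes, and mask are my own assumptions, not the paper's actual implementation.

```python
import torch

# Illustrative only: masked latent blending, the conceptual core of targeted
# variable swapping. Shapes and names are assumptions for this sketch.
B, C, H, W = 1, 4, 64, 64                    # typical latent-space dimensions
source_latent = torch.randn(B, C, H, W)      # latents of the original image (background)
concept_latent = torch.randn(B, C, H, W)     # latents carrying the personalized concept
mask = torch.zeros(B, 1, H, W)               # 1 inside the object region, 0 elsewhere
mask[:, :, 16:48, 16:48] = 1.0

# Swap only the masked variables; background latents are copied through
# unchanged, which is what keeps the background perfectly preserved.
blended_latent = mask * concept_latent + (1.0 - mask) * source_latent
```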
🔁 AutoMerger created the best 7B model on the Open LLM Leaderboard
By randomly combining top models from the Open LLM Leaderboard, AutoMerger created YamshadowExperiment28-7B. The model is three weeks old and has been at the top of the leaderboard for a week now. It was created through a simple SLERP merge of two top-ranked 7B models.
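For readers unfamiliar with the merge itself: SLERP (spherical linear interpolation) blends two checkpoints by interpolating each weight tensor along the arc between them rather than along a straight line. Here is a minimal sketch of that per-tensor operation; the tensor shapes and the t=0.5 midpoint are illustrative assumptions, and real merges (e.g. with mergekit) add per-layer t schedules and other details.

```python
import torch

def slerp(t: float, w0: torch.Tensor, w1: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors."""
    v0, v1 = w0.flatten().float(), w1.flatten().float()
    # Angle between the two weight vectors
    cos_omega = torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)
    omega = torch.arccos(cos_omega.clamp(-1 + eps, 1 - eps))
    so = torch.sin(omega)
    if so.abs() < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation
        merged = (1 - t) * v0 + t * v1
    else:
        merged = (torch.sin((1 - t) * omega) / so) * v0 + (torch.sin(t * omega) / so) * v1
    return merged.reshape(w0.shape).to(w0.dtype)

# Toy usage: merge one tensor from two hypothetical parent checkpoints at the midpoint.
w_a = torch.randn(4096, 4096)
w_b = torch.randn(4096, 4096)
w_merged = slerp(0.5, w_a, w_b)
```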
1/ On the Open LLM Leaderboard, it managed to outperform the excellent M7-7b model, which has been the #1 7B model for a while now.
2/ On the YALL leaderboard, YamshadowExperiment28-7B is ranked as the 9th best-performing automerge (but note that the scores are very close to each other). Compared to others, it does not perform particularly well on AGIEval or Bigbench.
3/ Thanks to @sam-paech, I have scores on EQ-Bench, where it managed to outperform all of my previous models. It even surpasses recent models such as DBRX instruct, Qwen1.5 32B Chat, and Cohere's Command R+.
Surprisingly, it does not support ChatML or Mistral Instruct, unlike my other merges (which are part of its family tree). Alpaca works well 99% of the time, but the model can sometimes produce a lot of "INST" tokens for no reason.
In my experiments, YamshadowExperiment28-7B doesn't seem smarter than other successful merges like AlphaMonarch. On the contrary, I found several mathematical or reasoning problems where it fails.
Considering these results, it looks like it might overfit the Open LLM Leaderboard. I guess it's anything but surprising when you randomly merge 156 models.
Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted during "zero-shot" evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets? We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample-inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets, and testing on purely synthetic data distributions. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the "Let it Wag!" benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
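To make the reported log-linear trend concrete: it means downstream accuracy grows roughly linearly in the logarithm of a concept's pretraining frequency, so each fixed gain in accuracy costs an exponential increase in data. A toy curve fit illustrating that relationship is sketched below, using made-up numbers rather than the paper's measurements.

```python
import numpy as np

# Made-up concept frequencies and downstream accuracies, purely to illustrate
# the log-linear relationship: accuracy ≈ a + b * log10(frequency).
frequency = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
accuracy = np.array([0.22, 0.31, 0.40, 0.49, 0.58])

b, a = np.polyfit(np.log10(frequency), accuracy, deg=1)
print(f"fit: accuracy ≈ {a:.2f} + {b:.2f} * log10(frequency)")
# Every 10x increase in a concept's pretraining frequency buys only a
# constant ~b improvement in downstream accuracy.
```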
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins of more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.
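As a rough illustration of what the pairwise preference data in such a tree can look like for preference learning, here is a minimal, assumed schema; the field names are mine and not the actual UltraInteract format, which also carries the multi-turn trajectories and critiques described above.

```python
# Assumed, simplified schema for one pairwise preference example (illustrative only).
preference_pair = {
    "instruction": "Prove that the sum of two even integers is even.",
    "chosen": "Let a = 2m and b = 2n. Then a + b = 2(m + n), which is even.",
    "rejected": "Two even numbers added together are usually even, so it holds.",
    "turn": 1,                          # position in the multi-turn interaction trajectory
    "planning_strategy": "direct_proof" # one of the diverse planning strategies per chain
}
```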
🆕 Releasing a new series of 8 zeroshot classifiers: better performance, fully commercially usable thanks to synthetic data, up to 8192 tokens, run on any hardware.
Summary:
🤖 The zeroshot-v2.0-c series replaces commercially restrictive training data with synthetic data generated with mistralai/Mixtral-8x7B-Instruct-v0.1 (Apache 2.0). All models are released under the MIT license.
🦾 The best model performs 17 percentage points better across 28 tasks vs. facebook/bart-large-mnli (the most downloaded commercially-friendly baseline).
🌏 The series includes a multilingual variant fine-tuned from BAAI/bge-m3 for zeroshot classification in 100+ languages, with a context window of 8192 tokens.
🪶 The models are only 0.2-0.6B parameters, so they run on any hardware. The base-size models are more than 2x faster than bart-large-mnli while performing significantly better.
🤏 The models are not generative LLMs; they are efficient encoder-only models specialized in zeroshot classification through the universal NLI task.
🤑 For users where commercially restrictive training data is not an issue, I've also trained variants with even more human data for improved performance.
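For anyone who hasn't used these NLI-based classifiers: they plug straight into the standard zero-shot classification pipeline in 🤗 transformers. The snippet below uses the bart-large-mnli baseline mentioned above; swapping in one of the new zeroshot-v2.0 checkpoints is just a matter of changing the model id.

```python
from transformers import pipeline

# Baseline model named above; replace the model id with one of the new
# zeroshot-v2.0 checkpoints to get the improved, commercially-friendly variant.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The new GPU drivers cut inference latency by 30%.",
    candidate_labels=["hardware", "sports", "politics"],
)
print(result["labels"][0], result["scores"][0])
```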
Next steps:
✍️ I'll publish a blog post with more details soon.
🔮 There are several improvements I'm planning for v2.1; the multilingual model in particular has room for improvement.
Anthropic introduces "Many-shot Jailbreaking" (MSJ), a new attack on large language models! MSJ exploits long context windows to override safety constraints.
Key Points:
* Prompts LLMs with hundreds of examples of harmful behavior formatted as a dialogue
* Generates malicious examples using an uninhibited "helpful-only" model
* Effective at jailbreaking models like Claude 2.0, GPT-3.5, GPT-4
* Standard alignment techniques provide limited protection against long context attacks
Currently contains 16 notebooks in English (and some in Chinese):
1. Using LLM-as-a-judge 🧑⚖️ for an automated and versatile evaluation
2. Create a legal preference dataset
3. Suggestions for Data Annotation with SetFit in Zero-shot Text Classification
4. Implementing semantic cache to improve a RAG system
5. Building A RAG Ebook “Librarian” Using LlamaIndex
6. Stable Diffusion Interpolation
7. Building A RAG System with Gemma, MongoDB and Open Source Models
8. Prompt Tuning with PEFT Library
9. Migrating from OpenAI to Open LLMs Using TGI’s Messages API
10. Automatic Embeddings with TEI through Inference Endpoints
11. Simple RAG for GitHub issues using Hugging Face Zephyr and LangChain
12. Embedding multimodal data for similarity search using 🤗 transformers, 🤗 datasets and FAISS
13. Fine-tuning a Code LLM on Custom Code on a single GPU
14. RAG Evaluation Using Synthetic data and LLM-As-A-Judge
15. Advanced RAG on HuggingFace documentation using LangChain
16. Detecting Issues in a Text Dataset with Cleanlab
We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.
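To picture the interleaving the abstract describes, here is a tiny sketch that lays out a hybrid layer schedule: mostly Mamba layers with periodic attention layers, and an MoE feed-forward block swapped in at regular intervals. The ratios below are illustrative assumptions for the sketch, not Jamba's definitive configuration.

```python
# Illustrative sketch of an interleaved Transformer/Mamba layer schedule with MoE.
# The exact ratios are assumptions for illustration, not Jamba's actual config.
N_LAYERS = 32
ATTN_EVERY = 8   # one attention layer per block of 8; the rest are Mamba layers
MOE_EVERY = 2    # use a mixture-of-experts feed-forward block every 2 layers

schedule = []
for i in range(N_LAYERS):
    mixer = "attention" if i % ATTN_EVERY == ATTN_EVERY - 1 else "mamba"
    ffn = "moe" if i % MOE_EVERY == 1 else "mlp"
    schedule.append((mixer, ffn))

for i, (mixer, ffn) in enumerate(schedule):
    print(f"layer {i:02d}: {mixer:9s} + {ffn}")
```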
Google DeepMind introduces Gecko, a new text embedding model! Gecko uses a two-step process that leverages synthetic data generation and reranking.
Key points:
* Uses an LLM to generate diverse synthetic queries and tasks from web passages
* Refines the data by retrieving candidate passages and relabeling positives/negatives using the same LLM
* Achieves very good results on the Massive Text Embedding Benchmark, where the compact 256D Gecko outperforms 768D models
* The 768D Gecko achieves state-of-the-art performance, competing with substantially larger models
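The relabeling step is the interesting part: the same LLM that generated a query re-scores the retrieved candidates, so the positive for a query may end up being a different passage than the one the query was generated from. A hedged sketch of that logic is below; `llm_relevance_score` is a hypothetical stub standing in for the LLM call, and the negative selection is simplified.

```python
def llm_relevance_score(query: str, passage: str) -> float:
    """Hypothetical stub: in the real pipeline an LLM grades relevance.
    Here, crude word overlap stands in so the sketch runs end to end."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))

def relabel(query: str, seed_passage: str, candidates: list[str]):
    """Pick the best-scoring passage as the positive (possibly not the seed
    passage the query came from) and a low-scoring one as a hard negative."""
    ranked = sorted(candidates + [seed_passage],
                    key=lambda p: llm_relevance_score(query, p), reverse=True)
    return ranked[0], ranked[-1]  # (positive, hard negative) -- simplified choice

# Toy usage with placeholder passages
query = "how do bees communicate the direction of food"
seed = "Bees are insects that live in hives and produce honey."
candidates = [
    "The waggle dance lets honeybees communicate the direction and distance of food.",
    "Ants leave pheromone trails to guide nestmates to food sources.",
]
print(relabel(query, seed, candidates))
```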
A new paper titled "Long-Form Factuality in Large Language Models" proposes a new approach to evaluate the long-form factuality of large language models using an AI agent! They introduce SAFE (Search-Augmented Factuality Evaluator) which leverages an LLM to break down responses into individual facts, query Google to verify each fact, and perform multi-step reasoning.
Key points:
* SAFE (Search-Augmented Factuality Evaluator) is an automated method using an LLM agent to evaluate factuality
* It also introduces LongFact, a 2,280-prompt set spanning 38 topics to test open-domain factual knowledge
* SAFE agrees with human annotators 72% of the time while being 20x cheaper. It also wins 76% of the disagreement cases in a small-scale experiment where a more thorough human procedure (researchers + full internet search) was used
* Larger models like GPT-4, Claude Opus and Gemini Ultra tend to exhibit better long-form factuality
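A rough sketch of the SAFE-style evaluation loop: split a response into atomic facts, search for evidence, and have an LLM judge each fact. The `split_into_facts`, `search_web`, and `llm_judge` helpers below are hypothetical stand-ins for the LLM and search calls, not the paper's actual implementation.

```python
def split_into_facts(response: str) -> list[str]:
    """Hypothetical stand-in: SAFE uses an LLM to split a response into
    self-contained atomic facts; here we naively split on sentences."""
    return [s.strip() for s in response.split(".") if s.strip()]

def search_web(fact: str) -> list[str]:
    """Hypothetical stand-in for issuing Google Search queries about the fact."""
    return [f"search snippet about: {fact}"]

def llm_judge(fact: str, evidence: list[str]) -> bool:
    """Hypothetical stand-in for the multi-step LLM reasoning that decides
    whether the retrieved evidence supports the fact."""
    return True

def safe_score(response: str) -> float:
    facts = split_into_facts(response)
    supported = sum(llm_judge(f, search_web(f)) for f in facts)
    return supported / max(len(facts), 1)  # fraction of supported facts

print(safe_score("Marie Curie won two Nobel Prizes. She was born in Warsaw."))
```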
A new paper introduces Visual CoT, a new approach that enhances multi-modal large language models with visual chain-of-thought reasoning capabilities. This allows language models to dynamically identify and focus on specific regions within images that are most relevant for answering questions, mimicking human-like efficient visual reasoning.
Key points:
* Introduces the 373k Visual CoT dataset with bounding box annotations highlighting essential image regions
* Proposes a multi-turn pipeline for focusing on relevant visual inputs
* Achieves strong results on multi-modal benchmarks
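The multi-turn idea can be pictured as: the first turn predicts a bounding box over the relevant region, and the second turn answers using a crop of that region alongside the original image. A tiny sketch of the cropping step is below, using a placeholder image and an invented box rather than a model-predicted one.

```python
from PIL import Image

# Placeholder image and an invented bounding box; in Visual CoT the box would
# be predicted by the model in the first reasoning turn.
image = Image.new("RGB", (640, 480), color="gray")
x0, y0, x1, y1 = 120, 80, 360, 300

region = image.crop((x0, y0, x1, y1))
# The second turn would then attend to `region` (plus the full image) to answer
# the question with the relevant details at higher effective resolution.
print(region.size)
```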
🚀💃🏻🌟 New Research Alert - CVPR 2024! 🌟🕺🚀
📄 Title: Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling 🌟🚀
📝 Description: Animatable Gaussians - a novel method for creating lifelike human avatars from RGB videos, utilizing 2D CNNs and 3D Gaussian splatting to capture pose-dependent garment details and dynamic appearances with high fidelity.
👥 Authors: Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu
📅 Conference: CVPR, Jun 17-21, 2024 | Seattle WA, USA 🇺🇸
"Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts" is a new framework designed to animate specific regions within an image through user inputs.
Key points:
* Enables precise animation of selected image regions with just a user click and a concise motion description.
* Achieves promising results for generating localized animations.
Synth^2 is a new approach that leverages large language models and text-to-image generators to create synthetic image-caption data for boosting visual-language model performance.
Key Points:
* Overcomes data limitations by generating high-quality synthetic image-caption pairs, reducing reliance on costly human annotations.
* Achieves competitive results on image captioning tasks using 40x less paired data than state-of-the-art methods.
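A hedged sketch of the overall recipe: prompt an LLM for captions, then render each caption with a text-to-image model to get synthetic (image, caption) training pairs. The snippet uses a Stable Diffusion checkpoint via 🤗 diffusers purely as an example generator; the captions are hard-coded stand-ins for LLM output, and this is not the paper's actual pipeline.

```python
# Requires: pip install diffusers transformers torch
import torch
from diffusers import StableDiffusionPipeline

# Stand-ins for LLM-generated captions; in the full recipe these would come
# from prompting a large language model for diverse, high-quality captions.
captions = [
    "a red vintage bicycle leaning against a brick wall",
    "a bowl of ramen with a soft-boiled egg, top-down view",
]

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example checkpoint id; any SD checkpoint works
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

synthetic_pairs = []
for caption in captions:
    image = pipe(caption).images[0]           # render the caption into an image
    synthetic_pairs.append((image, caption))  # (image, caption) pair for VLM training
```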
A recent paper titled "ShortGPT: Layers in Large Language Models are More Redundant Than You Expect" proposes a simple and effective approach to pruning Large Language Models (LLMs) by removing redundant layers.
Key points:
* Discovers significant redundancy across layers in LLMs, with some layers playing a negligible role in final performance.
* Defines a new metric called Block Influence (BI) to quantify the importance of each layer in an LLM.
* Removes layers with low BI scores, achieving up to a 25% reduction in parameters and computation while maintaining 92% of the LLM's performance.
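Based on my reading, Block Influence is essentially one minus the average cosine similarity between a layer's input and output hidden states: layers that barely transform their input get a low score and are candidates for removal. Below is a sketch of computing it from `output_hidden_states=True` in 🤗 transformers, using a small stand-in model; treat the formula as my paraphrase of the paper rather than a definitive implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model; the paper targets much larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[i] is the input to layer i, hidden_states[i+1] its output.
hs = out.hidden_states
for i in range(len(hs) - 1):
    cos = torch.nn.functional.cosine_similarity(hs[i], hs[i + 1], dim=-1)
    block_influence = 1.0 - cos.mean().item()
    print(f"layer {i:02d}: BI ≈ {block_influence:.4f}")

# Layers with the lowest BI would be pruned first under the ShortGPT recipe.
```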