Data Science Things Roundup #12
After a long hiatus, I’m bringing back the Data Science Things Roundup series. For those new here, this is where I share three interesting things from the world of data science that caught my attention. I tend to focus on developments that might have flown under the radar: breakthroughs like DeepSeek-R1 are incredibly exciting, but they already get extensive coverage elsewhere. Instead, I aim to highlight lesser-known but equally fascinating developments that deserve more attention.
ModernBERT: The Long-Awaited BERT Successor
Answer.AI and LightOn just released ModernBERT, and it might be exactly what we’ve been waiting for since BERT’s release in 2018. They’ve taken the architectural and training advances that have accumulated in large language models and applied them to build a faster, more accurate encoder that handles 8k tokens (versus BERT’s 512). What’s particularly clever is their focus on training data diversity, especially the inclusion of code, which makes it a strong fit for code-related tasks. Plus, they’ve open-sourced all intermediate checkpoints for further fine-tuning.
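If you want to kick the tires, here’s a minimal sketch of masked-token prediction with the standard transformers pipeline. I’m assuming the checkpoint identifier answerdotai/ModernBERT-base and fill-mask pipeline compatibility here (a recent transformers release is likely required), so treat this as a starting point rather than gospel:

```python
# Minimal sketch: masked-token prediction with ModernBERT.
# Assumes the checkpoint is published on the Hugging Face Hub as
# "answerdotai/ModernBERT-base" and works with the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

# The 8k-token context window means you can feed in inputs far beyond
# BERT's 512-token limit; a short example keeps things readable.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```

The same checkpoint can also be loaded with AutoModel for embeddings or classification fine-tuning, which is where an encoder like this really earns its keep.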
ReaderLM v2: Small Models, Big Impact
While the industry chases larger models, Jina AI took a different approach with ReaderLM v2. This 1.5B-parameter model specializes in converting HTML to markdown and JSON, and it surprisingly outperforms models 20 times its size. Their secret? Using larger models to generate and refine training data, creating a small but mighty model that can handle documents up to 512K tokens long.
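To get a feel for it, here’s a minimal sketch of HTML-to-markdown conversion, assuming the model is published on the Hugging Face Hub as jinaai/ReaderLM-v2 and accepts a standard chat-style prompt. The instruction wording below is my guess, so check Jina’s model card for the official prompt:

```python
# Minimal sketch: HTML-to-markdown conversion with ReaderLM v2.
# Assumes the model id "jinaai/ReaderLM-v2" and a chat-template tokenizer;
# the instruction text is an assumption, not the official prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/ReaderLM-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

html = "<html><body><h1>Hello</h1><p>ReaderLM turns HTML into markdown.</p></body></html>"
messages = [{
    "role": "user",
    "content": f"Extract the main content from the given HTML and convert it to Markdown format.\n\n{html}",
}]

# Build the prompt via the model's chat template and generate deterministically.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The generate-and-refine recipe is essentially distillation with extra polish: the big model’s job ends once the training data exists, so you only pay its cost once, at training time.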
Cohere’s Rerank 3.5: Making Search Smarter
Rerank 3.5 from Cohere might not grab headlines, but it’s solving a real problem in enterprise search. This update specifically targets the challenge of searching through technical and specialized content, with improved handling of industry-specific terminology and multiple languages. It’s the kind of advancement that could make finding that one crucial document in your company’s knowledge base actually work the way it should.
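In practice it’s a drop-in API call on top of whatever first-stage retrieval you already have: over-fetch candidates cheaply, then let the reranker sort out relevance. Here’s a minimal sketch with Cohere’s Python SDK, assuming the model identifier rerank-v3.5 and the v2 client; the document set is made up for illustration:

```python
# Minimal sketch: reranking search candidates with Cohere's Python SDK.
# Assumes the model id "rerank-v3.5" and the v2 client's rerank endpoint;
# the documents are invented for illustration.
import os

import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

documents = [
    "Q3 incident postmortem: database failover runbook",
    "HR policy update: parental leave",
    "Kubernetes upgrade checklist for the payments service",
]

response = co.rerank(
    model="rerank-v3.5",
    query="how do we fail over the payments database?",
    documents=documents,
    top_n=2,
)

# Results come back sorted by relevance, each carrying the index of the
# original document and a relevance score.
for result in response.results:
    print(f"{result.relevance_score:.3f}  {documents[result.index]}")
```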
Did you enjoy this? While they’re quite dated now, you might find these previous editions interesting: