65 private links
Maybe it's time to stop obsessing about scale and distributed architectures? Hardware has improved quite a bit in the right places, especially storage.
OK, the numbers are indeed impressive. And its API is apparently fully compatible, so it looks like a good drop-in replacement if you have Pandas code around.
A good reminder that I should probably evaluate DuckDB for some of my tooling.
OK, this is a rant about the state of the market and people drinking the Kool-Aid. A bit long, but I found it funny and well deserved at times.
The more releases are out there, the more vulnerabilities are (and could be) discovered. Some action is necessary to get things properly under control.
Polars really looks like a nice alternative to Pandas, with a smooth upgrade path from data exploration to production.
Polars looks like an interesting alternative to Pandas in the industrialization phase of a data processing pipeline. The performance differences are really notable with larger volumes. I'd be interested to see how much of that is lost when using its Python API, though.
And this is why you likely need to optimize your data pipelines at some point. There are plenty of levers available.
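One illustrative lever among many (the article covers more): switching low-cardinality string columns to a categorical dtype in pandas, which can cut memory use substantially. A minimal sketch:

```python
import pandas as pd

# A low-cardinality string column: few distinct values, many rows.
df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris"] * 10000})

before = df["city"].memory_usage(deep=True)

# Categoricals store each distinct string once plus small integer codes.
df["city"] = df["city"].astype("category")

after = df["city"].memory_usage(deep=True)
print(f"object: {before} bytes, category: {after} bytes")
```

The win grows with the ratio of rows to distinct values; for high-cardinality columns the categorical overhead can cancel it out.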
Interesting exploration of statistics around marriage (in the US). People in some jobs definitely tend to stay within their own circles more than others.
A good reminder to use the right tool for the task. Sometimes all you need is really a POSIX shell with a couple of well optimized tools.
Now that finally looks like an interesting approach to make GPU computation more accessible to the public. This seems to do a few things right to lower the complexity a bit while retaining good performance.
It's a very nice paper on spreadsheets and how we use them. It has enough history in it to pique my interest (it goes all the way back to the 1300s!). It's also well balanced: it doesn't just blindly blame the tools, but looks at their shortcomings and also at how we often use the wrong tool for the task... and then end up managing data and knowledge really badly.
A good example of using the best tool for the job. Having your whole data analysis pipeline in pandas might not be what you want for performance reasons. Very often there's a relational database you can leverage first.
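The pattern is simple: let the database do the heavy filtering and aggregation, and only pull the (small) result into pandas. A minimal sketch using an in-memory SQLite database as a stand-in for a real one (table and column names are invented for the example):

```python
import sqlite3

import pandas as pd

# Tiny in-memory table standing in for a real warehouse table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (user_id INTEGER, amount REAL);
    INSERT INTO events VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# The database does the grouping; pandas only sees the aggregated result.
df = pd.read_sql_query(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id",
    con,
)
print(df)
```

With millions of rows, the difference between shipping raw rows to pandas and shipping a grouped result is exactly the kind of win the article describes.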
Still keeping an eye on what's available for crunching numbers in Rust. Apache Arrow looks like an interesting option.
Ah, it finally looks like we have an interesting dataframe crate in the Rust world. Performance seems nice too.
It remains to be seen how it behaves in practice. The explanations of how it's designed are interesting in any case. :-)
Interesting discussion... could people go on a data strike, withholding their data from big tech to demand change? At this point it seems to me more like an interesting thought experiment than something really doable... but probably worth monitoring where the conversation goes.
Interesting comparison, even though the conclusion is slightly unsurprising: Pandas is slower but more convenient; Rust is fast and consumes less memory, but more work is involved. At least this gives a few indications of what kind of APIs could be added on the Rust side to ease things. It also suggests that Pandas can be great for developing the pipeline, with a switch to Rust when it needs to be optimized for higher volumes of data.