65 private links
Maybe it's time to stop obsessing about scale and distributed architectures? Hardware has improved quite a bit in the right places, especially storage.
OK, the numbers are indeed impressive. And its API is apparently fully compatible, so it looks like a good drop-in replacement if you have Pandas code around.
A good reminder that I should probably evaluate DuckDB for some of my tooling.
OK, this is a rant about the state of the market and people drinking the Kool-Aid. A bit long, but I found it funny and well deserved at times.
The more releases are out there, the more vulnerabilities are (and could be) discovered. Some action is necessary to get things properly under control.
Polars really looks like a nice alternative to Pandas, with a smooth upgrade path from data exploration to production.
Polars looks like an interesting alternative to Pandas in the industrialization phase of a data processing pipeline. The performance differences are really notable with larger volumes. I'd be interested to see how much of that is lost when using its Python API, though.
And this is why you likely need to optimize your data pipelines at some point. There are plenty of levers available.
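One illustrative lever among many (the article covers more): switching low-cardinality string columns to a categorical dtype in pandas, which can cut memory use substantially. A minimal sketch:

```python
import pandas as pd

# A low-cardinality string column: few distinct values, many rows.
df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris"] * 10000})

before = df["city"].memory_usage(deep=True)

# Categoricals store each distinct string once plus small integer codes.
df["city"] = df["city"].astype("category")

after = df["city"].memory_usage(deep=True)
print(f"object: {before} bytes, category: {after} bytes")
```

The win grows with the ratio of rows to distinct values; for high-cardinality columns the categorical overhead can cancel it out.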
Interesting exploration of statistics around marriage (in the US). People in some jobs definitely tend to stay within their own circles more than others.
A good reminder to use the right tool for the task. Sometimes all you need is really a POSIX shell with a couple of well optimized tools.
Now that finally looks like an interesting approach to make GPU computation more accessible to the public. This seems to do a few things right to lower the complexity a bit while retaining good performance.
It's a very nice paper on spreadsheets and how we use them. It has enough history in it to pique my interest (it goes all the way back to the 1300s!). It's also well balanced: it doesn't just blindly blame the tools, but looks at their shortcomings and also at how we often use the wrong tool for the task... and then end up managing data and knowledge really badly.
A good example of using the best tool for the job. Having your whole data analysis pipeline in pandas might not be what you want for performance reasons. Very often there's a relational database you can leverage first.
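The pattern is simple: let the database do the heavy filtering and aggregation, and only pull the (small) result into pandas. A minimal sketch using an in-memory SQLite database as a stand-in for a real one (table and column names are invented for the example):

```python
import sqlite3

import pandas as pd

# Tiny in-memory table standing in for a real warehouse table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (user_id INTEGER, amount REAL);
    INSERT INTO events VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# The database does the grouping; pandas only sees the aggregated result.
df = pd.read_sql_query(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id",
    con,
)
print(df)
```

With millions of rows, the difference between shipping raw rows to pandas and shipping a grouped result is exactly the kind of win the article describes.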
Still keeping an eye on what's available for crunching numbers in Rust. Apache Arrow looks like an interesting option.
Ah, it finally looks like we have an interesting dataframe crate in the Rust world. Performance seems nice too.
It remains to be seen how it behaves in practice. The explanations of how it's designed are interesting in any case. :-)
Interesting discussion... could people go on a data strike, withholding their data from big tech to demand change? At this point it seems to me more like an interesting thought experiment than something really doable... but probably worth monitoring where the conversation goes.
Interesting comparison, even though the conclusion is slightly unsurprising: Pandas is slower but more convenient; Rust is fast and consumes less memory, but more work is involved. At least this gives a few indications of what kind of APIs could be added on the Rust side to ease things. It also suggests that Pandas can be great for developing the pipeline, with a switch to Rust when it needs to be optimized for higher volumes of data.