Dran

Dran@lemmy.world · edit-2 30 days ago

It’s almost certainly related to cloud-init, (the canonical tool for handling deployment automation) or Ubuntu pro (extra long support for backporting security packages to older distros, plus some conveniences). They’re pre installed as a convenience to paid users of those services, that’s the (IMHO, quite reasonable) model they use to fund the distro. I would expect that some or all of that traffic would disappear if you disable/remove those two services.

https://cloud-init.io/

https://ubuntu.com/pro

Dran@lemmy.world · 1 month ago

Yes they were, so I’m offering you an actual theory as to why this may actually be true, yet difficult to “prove”.

Smoking was bad for your health long before anyone sat down and took the time to prove it. Autoregressive LLM tokenizer are a very new field of computer science and it’s going to take a while for the community to collectively understand everything we’re currently doing by trial and error.

Dran@lemmy.world · 1 month ago

Anecdotally, I use it a lot and I feel like my responses are better when I’m polite. I have a couple of theories as to why.

More tokens in the context window of your question, and a clear separator between ideas in a conversation make it easier for the inference tokenizer to recognize disparate ideas.
Higher quality datasets contain american boomer/millennial notions of “politeness” and when responses are structured in kind, they’re more likely to contain tokens from those higher quality datasets.

I haven’t mathematically proven any of this within the llama.cpp tokenizer, but I strongly suspect that I could at least prove a correlation between polite token input and dataset representation output tokens

Dran@lemmy.world · 1 month ago

I think that holds true in this case… but I’ll be damned if the ltt screwdriver isn’t the best hand tool I’ve ever owned.

Dran@lemmy.world · 2 months ago

RE: backups, I’d recommend altering your workflow. Instead of taking an image of a box, automate the creation of that box. Create a bash script that takes a base OS, and installs everything you use fresh. Then have it apply configuration files where appropriate, and lastly figure out which applications really need backup blobs to work properly (thunderbird, for example). Once you have that, your backups become just the data itself. Photos, documents, etc. Everything else is effectively ephemeral because it can be reproduced through automation.

Takes a lot less space, is a lot more portable. And much better in scenarios where something in your OS is broken or you get a new computer and want to replicate your setup.

Dran@lemmy.world · 2 months ago

They are indeed just that keen on our data.

They know they can’t get rid of it for all of their customers, but they do want to make it as hard as possible for random users to do so.

Dran@lemmy.world · edit-2 2 months ago

The problem with this is it doesn’t work for home users that want to pay for their software. Crazy… I know… but those people do exist.

Dran@lemmy.world · 2 months ago

For people with “that one game” there is a middle ground. Mine is Destiny 2 and they use a version of easy anticheat that refuses to run on Linux. My solution was to buy a $150 used Dell on eBay, a $180 GPU to be able to output to my 4 high-res displays, and install Debian + moonlight on it. I moved my gaming PC downstairs and a combination of wake-on-lan + sunshine means that I can game at functionally native performance, streaming from the basement. In my setup, windows only exists to play games on.

The added bonus here is now I can also stream games to my phone, or other ~thin clients~ in the house, saving me upgrade costs if I want to play something in the living room or upstairs. All you need is the bare minimum for native-framerate, native-res decoding, which you can find in just about anything made in the last 5-10 years.

Dran@lemmy.world · 2 months ago

“Open source” in ML is a really bad description for what it is. “Free binary with a bit of metadata” would be more accurate. The code used to create deepseek is not open source, nor is the training datasets. 99% of “open source” models are this way. The only interesting part of the open sourcing is the architecture used to run the models, as it lends a lot of insight into the training process, and allows for derivatives via post-training

Dran@lemmy.world · 2 months ago

There’s nothing magic about Soylent for weight loss. It’s a simple equation of calories in and calories out. The advantages that Soylent offered me was convenience for counting said calories, convenience for meal prep, and being reasonably certain my body was getting a decent distribution of micronutrients

Dran@lemmy.world · 2 months ago

Yes, 375 -> 250

Dran@lemmy.world · edit-2 2 months ago

I did ~1.5 years of only Soylent, then transitioned into 2/3 meals per day being Soylent, which I’ve done for the last ~6-7yrs.

I’m the healthiest I’ve ever been, but it does require discipline, exercise and attention like anything else. Calories are calories and if you consume more than you burn, you’ll poop a lot and gain weight. If you drink at a significant deficit (my 1.5years was at 1200kcal/day) you will poop once or twice a week and it will take a few months of your body getting used to it for it to be more than liquid.

As others have said though, it’s a deceptively dehydrating liquid. You absolutely still need to drink water, and your water intake will largely dictate how much you pee.

Dran@lemmy.world · edit-2 2 months ago

It’s a little deeper than that, a lot of advertising works on engagement -based heuristics. Today, most people would call it “AI” but it’s fundamentally just a reinforcement learning network that trains itself constantly on user interactions. It’s difficult-to-impossible to determine why input X is associated with output Y, but we can measure in aggregate how subtle changes propagate across engagement metrics.

It is absolutely truthful to say we don’t know how a modern reinforcement learning network got to the state it’s in today, because transactions on the network usually aren’t journaled, just periodically snapshot for A/B testing.

To be clear, that’s not an excuse for undesirable heuristic behavior. Somebody somewhere made the choice to do this, and they should be liable for the output of their code.

Dran@lemmy.world · 3 months ago

I think you’ve convinced me that it’s a slightly more complicated problem than I initially gave it credit for; thank you for that!

I think you could solve for the disparate community theme problem by also requiring title match for mergers. You could probably also solve for it by having a 2-way merger whitelist on links. E.g community A and B both maintain lists of “similar” communities and then if A’s list contains B and vice-versa they would merge.

Comment moderation I got nothing though. That’s a tough one.

Dran@lemmy.world · 3 months ago

Oh weird, I would not have expected to be in the minority there

Dran@lemmy.world · 3 months ago

It’s about halfway there I think, they still show up separately in clients and have separate comments threads.

Dran@lemmy.world · 3 months ago

Lemmy needs some sort of built-in way to merge them. That’d be the best solution I think. Then you could just pick a list of relevant communities and it’d be pretty seamless

Dran@lemmy.world · 3 months ago

Depends on where you work and what their policies are. My work does have many strict policies on following licenses, protecting sensitive data, etc

My solution was to MIT license and open source everything I write. It follows all policies while still giving me the flexibility to fork/share the code with any other institutions that want to run something similar.

It also had the added benefit of forcing me to properly manage secrets, gitignores, etc

Dran@lemmy.world · 4 months ago

The difference is I (the contributor of content) have the same access as anyone else to the data, and could use it for my own purposes if I wanted to.

On a platform like reddit, access to the raw data is controlled and cannot be format shifted / used in any way I wanted to.

Dran@lemmy.world · 5 months ago

I feel the sentiment though, my daily driver is built off of ubuntu-server headless. I find it’s just the right amount of “has searchable solutions for near everything” and “is properly minimal” to base my workstation off of. I run X11, pulse, awesomewm, firefox, lxterm… fairly standard stuff but without gdm, gnome/kde even installed it’s pretty lightweight. The entire os uses ~780mb of ram (+23gb for firefox tabs lmao). It gets the job done, keeps my skills relevant maintaining the automation that builds it, is dead simple to troubleshoot, and has very few black boxes.