Hacker News | DGoettlich's comments

data is 100% public domain.

very interesting observation!

well put.

Thanks. I think this just took on a weird dynamic. We never said we'd lock the model away; I'm not sure where that impression came from. That aside, this was an announcement of a release, not the release itself. The main purpose was to gather feedback on our methodology: standard procedure in our domain is to gather criticism first, incorporate it, then publish results. But I understand people just wanted to talk to it. Fair enough!

Thanks for the comment. Could you elaborate on what you find iffy about our approach? I'm sure we can improve!

Well, it would be nice to see examples (or, to be completely open, weights) for the baseline model, without any GPT-5 influence whatsoever. Basically, let people see what the "raw" output from historical texts looks like, and actively demonstrate why the extra tweaks and layers are needed to make a useful model. Show, don't tell, really.

Valid point. It's more of a stepping stone towards larger models; we're figuring out the best way to do this before scaling up.

If there's very little text before the internet, what would scaling up look like?

exactly

Also one of our fears. What we've done so far is to drop docs where the data source was doubtful about the date of publication; if there are multiple possible dates, we take the latest to be conservative. During training, we validate that the model learns pre- but not post-cutoff facts. https://github.com/DGoettlich/history-llms/blob/main/ranke-4...
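
Roughly, the filtering rule looks like this. A minimal sketch, not our actual pipeline; the `Doc` fields and the cutoff year are placeholders for illustration:

    from dataclasses import dataclass

    CUTOFF_YEAR = 1900  # placeholder; the real cutoff depends on the model variant

    @dataclass
    class Doc:
        text: str
        candidate_years: list[int]  # possible publication years from the data source
        date_certain: bool          # whether the source is confident about the date

    def keep_for_training(doc: Doc) -> bool:
        """Drop docs with doubtful dates; with multiple candidate dates,
        conservatively take the latest and keep the doc only if it is
        at or before the cutoff."""
        if not doc.date_certain or not doc.candidate_years:
            return False
        return max(doc.candidate_years) <= CUTOFF_YEAR

    corpus = [
        Doc("...", [1895], True),        # kept
        Doc("...", [1898, 1921], True),  # latest candidate is post-cutoff -> dropped
        Doc("...", [1890], False),       # doubtful date -> dropped
    ]
    train_docs = [d for d in corpus if keep_for_training(d)]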

If you have other ideas or think that's not enough, I'd be curious to know! (history-llms@econ.uzh.ch)


What makes you think we trained on only a few gigabytes? https://github.com/DGoettlich/history-llms/blob/main/ranke-4...

Fully understand you. We'd like to provide access, but also to guard against misrepresentations of our project's goals, e.g. by people pointing to racist generations. If you have thoughts on how we should do that, perhaps you could reach out at history-llms@econ.uzh.ch? Thanks in advance!

What is your worst-case scenario here?

Something like a pop-sci article along the lines of "Mad scientists create racist, imperialistic AI"?

I honestly don't see publication of the weights as a relevant risk factor, because sensationalist misrepresentation is trivially possible with the given example responses alone.

I don't think such pseudo-malicious misrepresentation of scientific research can be reliably prevented anyway, and the disclaimers make your stance very clear.

On the other hand, publishing weights might lead to interesting insights from others tinkering with the models. A good example of this is the published word prevalence data (M. Brysbaert et al @ Ghent University), which led to interesting follow-ups like this: https://observablehq.com/@yurivish/words

I hope you can get the models out in some form, would be a waste not to, but congratulations on a fascinating project regardless!


It seems that if a tool has an obvious misuse, one has a moral imperative to restrict use of the tool.

Every tool can be misused. Hammers are as good for bashing heads as for building houses. Restricting hammers would be silly and counterproductive.

Yes, but if you are building a voice-activated autonomous flying hammer, then you either want it to be very good at differentiating heads from nails, OR you should restrict its use.

OR you respect individual liberty and agency, hold individuals responsible for their actions, instead of tools, and avoid becoming everyone's condescending nanny.

Your pre-judgement of acceptable hammer uses would rob hammer owners of responsible and justified self-defense, and defense of others, in situations where there are no other options, as well as of other legally and socially accepted uses that don't fit your preconceived ideas.


Perhaps you could detect these... "dated"... conclusions and prepend a warning to the responses (rough sketch below)? IDK.

I think the uncensored response is still valuable, with context. "Those who cannot remember the past are condemned to repeat it" sort of thing.
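
Something like this minimal sketch; the trigger patterns and warning text are made-up placeholders, and a real version would presumably use a classifier or an annotated set rather than keywords:

    import re

    # Placeholder patterns for "dated" framings; purely illustrative.
    DATED_PATTERNS = [
        r"\bcivilizing mission\b",
        r"\bracial hierarch(y|ies)\b",
        r"\binferior races?\b",
    ]

    WARNING = (
        "[Note: the following text reflects views common in the training-era "
        "sources and does not represent the views of the model's authors.]\n\n"
    )

    def with_warning(response: str) -> str:
        """Prepend a disclaimer when the output matches a 'dated' pattern."""
        if any(re.search(p, response, flags=re.IGNORECASE) for p in DATED_PATTERNS):
            return WARNING + response
        return response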


You can guard against misrepresentations of your goals by stating your goals clearly, which you already do. Any further misrepresentation is going to be either malicious or idiotic; a university should simply be able to deal with that.

Edit: just thought of a practical step you can take: host it somewhere other than GitHub. If there's ever a backlash, the Microsoft moderators might not take too kindly to the stuff about e.g. homosexuality, no matter how academic.

