Finetune LLMs via the Finetuning Hub

Finetuning has never been easier :)

Hi community, I have been working on benchmarking publicly available LLMs over the past couple of weeks. More precisely, I am interested in the finetuning piece, since many businesses are starting to entertain the idea of self-hosting LLMs trained on their proprietary data rather than relying on third-party APIs.

GitHub repo: https://github.com/georgian-io/LLM-Finetuning-Hub

So far, I am tracking the following four pillars of evaluation that businesses typically look into:

- Performance
- Time to train an LLM
- Cost to train an LLM
- Inference (throughput / latency / cost per token)
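
For the inference pillar, here is a minimal sketch of how throughput and latency could be measured with the Hugging Face transformers API. The model name, prompts, and generation length are placeholders, not the repo's actual benchmark harness:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and prompts -- swap in your own finetuned checkpoint and eval set.
model_name = "tiiuae/falcon-7b"
prompts = ["Summarize the following support ticket: ...", "Classify the sentiment of: ..."]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

total_new_tokens = 0
start = time.perf_counter()
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    # Count only the newly generated tokens, not the prompt tokens.
    total_new_tokens += output.shape[-1] - inputs["input_ids"].shape[-1]
elapsed = time.perf_counter() - start

print(f"avg latency: {elapsed / len(prompts):.2f} s/prompt")
print(f"throughput:  {total_new_tokens / elapsed:.1f} tokens/s")
```

Cost per token then follows from throughput and the hourly price of whatever hardware you run on.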

My aim is to benchmark each LLM on popular tasks, namely classification and summarization. Moreover, I would like to compare them against each other.

So far, I have benchmarked Flan-T5-Large, Falcon-7B and RedPajama and have found them to be very efficient in low-data situations, i.e., when there are very few annotated samples. Llama2-7B/13B and Writer’s Palmyra are in the pipeline.
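
As a rough illustration of the low-data setup (the repo's actual finetuning scripts may differ), a parameter-efficient LoRA finetune with peft on a handful of labeled samples could look like this. The base model, hyperparameters, and toy examples below are assumptions for the sketch, not the repo's recipe:

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "tiiuae/falcon-7b"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Wrap the base model with low-rank adapters so only a small fraction of weights are trained.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Tiny toy dataset standing in for a few annotated samples.
samples = [
    "Review: great product, arrived on time. Label: positive",
    "Review: broke after one day. Label: negative",
]
dataset = Dataset.from_dict({"text": samples}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256), batched=True
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1, num_train_epochs=3),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```

Because only the adapter weights are updated, this kind of setup keeps training time and cost low even on modest hardware, which is exactly why it shines when annotated data is scarce.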

But there are so many LLMs out there! If this work interests you, it would be great to join forces.

GitHub repo attached — feedback is always welcome :)

Happy hacking!