6 Commits

Author SHA1 Message Date
Roman Rizzi 9a79afcdbf
DEV: Better strategies for summarization (#88)
* DEV: Better strategies for summarization

The strategy responsibility needs to be "Given a collection of texts, I know how to summarize them most efficiently, using the minimum amount of requests and maximizing token usage".

There are different token limits for each model, so it all boils down to two different strategies:

1. Fold all these texts into a single one, doing the summarization in chunks, and then build a summary from those.
2. Build it by combining texts in a single prompt, and truncate it according to your token limits.

While the latter is less than ideal, we need it for "bart-large-cnn-samsum" and "flan-t5-base-samsum", both with low limits. The rest will rely on folding.

* Expose summarized chunks to users
2023-06-27 12:26:33 -03:00
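
The two strategies described in 9a79afcdbf can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `WordTokenizer`, `chunk_texts`, `fold`, and `truncate_strategy` are hypothetical names, the "tokenizer" simply splits on whitespace, and the summarizer is passed in as a block so no model is needed.

```ruby
# Hypothetical whitespace "tokenizer"; real tokenizers are model-specific.
module WordTokenizer
  def self.size(text)
    text.split.size
  end

  def self.truncate(text, max_tokens)
    text.split.first(max_tokens).join(" ")
  end
end

# Greedily pack texts into chunks that stay within max_tokens.
# A single text larger than the limit still gets its own (oversized) chunk.
def chunk_texts(texts, max_tokens)
  chunks = [[]]
  used = 0
  texts.each do |text|
    size = WordTokenizer.size(text)
    if used + size > max_tokens && !chunks.last.empty?
      chunks << []
      used = 0
    end
    chunks.last << text
    used += size
  end
  chunks.reject(&:empty?).map { |chunk| chunk.join(" ") }
end

# Strategy 1 (fold): summarize each chunk, then summarize the partial summaries.
# A fuller version would repeat the fold if the partials still exceed the limit.
def fold(texts, max_tokens, &summarize)
  chunks = chunk_texts(texts, max_tokens)
  return summarize.call(chunks.first) if chunks.size == 1

  partials = chunks.map { |chunk| summarize.call(chunk) }
  summarize.call(partials.join(" "))
end

# Strategy 2 (truncate): combine everything into one prompt and cut it to the limit.
def truncate_strategy(texts, max_tokens, &summarize)
  summarize.call(WordTokenizer.truncate(texts.join(" "), max_tokens))
end

# Stand-in summarizer for demonstration: keeps the first two words.
first_two_words = ->(text) { text.split.first(2).join(" ") }
posts = ["a b c d", "e f g h", "i j k l"]
puts fold(posts, 8, &first_two_words)
puts truncate_strategy(posts, 5, &first_two_words)
```

Folding costs one request per chunk plus one for the final pass, while truncation always costs a single request but discards whatever does not fit.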
Rafael dos Santos Silva e457c687ca
FIX: OpenAI Tokenizer was failing to truncate mid emojis (#91)
* FIX: OpenAI Tokenizer was failing to truncate mid emojis

* Update spec/shared/tokenizer.rb

Co-authored-by: Joffrey JAFFEUX <j.jaffeux@gmail.com>

---------

Co-authored-by: Joffrey JAFFEUX <j.jaffeux@gmail.com>
2023-06-16 15:15:36 -03:00
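
The commit message for e457c687ca does not spell out the failure, but one plausible cause is that an emoji occupies several bytes and can span more than one token, so cutting at a token boundary may leave an incomplete character. The sketch below assumes that failure mode and uses a hypothetical byte-level "tokenizer" (one token per UTF-8 byte) with a made-up helper name; it is not the plugin's fix.

```ruby
# Hypothetical byte-level tokenizer: every UTF-8 byte counts as one token.
# Assumes the input text is valid UTF-8 to begin with.
def truncate_bytes_safely(text, max_tokens)
  bytes = text.bytes.first(max_tokens)
  candidate = bytes.pack("C*").force_encoding(Encoding::UTF_8)

  # Drop trailing bytes until the result decodes as valid UTF-8 again.
  until candidate.valid_encoding?
    bytes.pop
    candidate = bytes.pack("C*").force_encoding(Encoding::UTF_8)
  end
  candidate
end

# "hi " is 3 bytes and the emoji is 4, so a limit of 5 lands mid-emoji.
puts truncate_bytes_safely("hi \u{1F600}", 5).inspect
```

Backing off to the last complete character keeps the result within the limit, at the cost of returning slightly fewer tokens than requested.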
Rafael dos Santos Silva 739b314312
Fixes for embeddings and truncate (#67)
2023-05-18 09:21:28 +10:00
Rafael dos Santos Silva 3c9513e754
Refinements to embeddings and tokenizers (#61)
* Refinements to embeddings and tokenizers

* lint

* Truncate with tokenizers for summary

* fix
2023-05-15 15:10:42 -03:00
Sam e76fc77189
fixes (#53)
* Minor... use username suggester in case username already exists

* FIX: ensure we truncate long prompts

Previously we:

1. Used raw length instead of token counts when measuring prompt length
2. Dropped a prompt entirely if it was too long

The new implementation truncates "raw" if it gets too long, while
maintaining meaning.
2023-05-06 07:31:53 -03:00
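
The change described in e76fc77189 can be illustrated with a small before/after sketch. Everything here is hypothetical: `old_fit_prompt` and `fit_prompt` are invented names, and tokens are approximated by whitespace-separated words rather than by a real model tokenizer.

```ruby
# Old behaviour as described in the message: measure by character length
# and drop the prompt entirely (nil) when it is too long.
def old_fit_prompt(raw, max_chars)
  raw.length > max_chars ? nil : raw
end

# New behaviour as described: measure by token count and truncate instead of dropping.
# Tokens are approximated here by whitespace-separated words (an assumption).
def fit_prompt(raw, max_tokens)
  tokens = raw.split
  return raw if tokens.size <= max_tokens

  tokens.first(max_tokens).join(" ")
end

raw = "one two three four five"
puts old_fit_prompt(raw, 10).inspect
puts fit_prompt(raw, 3).inspect
```

Keeping the beginning of the text is only one way to preserve meaning; the message does not say which part of "raw" the real implementation retains.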
Rafael dos Santos Silva 9783e3b025
FEATURE: Add a basic tokenizer API (#37)
* FEATURE: Add a basic tokenizer API

* Add tests

* lint
2023-04-19 11:55:59 -03:00