LLM fine-tuning by santiatpml · Pull Request #1350 · postgresml/postgresml

santiatpml · 2024-03-04T19:21:39Z

Example: https://github.com/postgresml/postgresml/tree/santi-llm-fine-tuning?tab=readme-ov-file#llm-fine-tuning
Refactored TextDataSet to handle different NLP tasks
Three tasks: text classification, text pair classification, conversation
PEFT/LoRA for conversation task
Pypgrx for callbacks to print info statements and insert logs into pgml.logs table
New tasks have to be added to the pgml.task enum type:
ALTER TYPE pgml.task ADD VALUE IF NOT EXISTS 'conversation';
ALTER TYPE pgml.task ADD VALUE IF NOT EXISTS 'text_pair_classification';
New pgml.logs table has to be added:

CREATE TABLE pgml.logs (
    id SERIAL PRIMARY KEY,
    model_id BIGINT,
    project_id BIGINT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    logs JSONB
);

Text classification. The model will be pushed to the /financial_phrasebank_sentiment repo on the Hugging Face Hub. Example repo: https://huggingface.co/santiadavani/financial_phrasebank_sentiment

SELECT pgml.tune(
    'financial_phrasebank_sentiment',
    task => 'text-classification',
    relation_name => 'pgml.financial_phrasebank_view',
    model_name => 'distilbert-base-uncased',
    test_size => 0.2,
    test_sampling => 'last',
    hyperparams => '{
        "training_args" : {
          "learning_rate": 2e-5,
          "per_device_train_batch_size": 16,
          "per_device_eval_batch_size": 16,
          "num_train_epochs": 10,
          "weight_decay": 0.01,
          "hub_token" : "token",
          "push_to_hub" : true
        },
        "dataset_args" : { 
          "text_column" : "sentence", 
          "class_column" : "class" 
        }
    }'
);

Text pair classification.
Note: Training is initialized from a previous run, using a model pulled from the HF Hub.

SELECT pgml.tune(
    'glue_mrpc_nli_2',
    task => 'text_pair_classification',
    relation_name => 'pgml.glue_view',
    model_name => 'santiadavani/glue_mrpc_nli_2',
    test_size => 0.5,
    test_sampling => 'last',
    hyperparams => '{
        "training_args" : {
            "learning_rate": 2e-5,
            "per_device_train_batch_size": 16,
            "per_device_eval_batch_size": 16,
            "num_train_epochs": 1,
            "weight_decay": 0.01
        },
        "dataset_args" : { "text1_column" : "sentence1", "text2_column" : "sentence2", "class_column" : "class" }
    }'
);

Conversation

SELECT pgml.tune(
    'alpaca-gpt4-conversation-llama2-7b-chat',
    task => 'conversation',
    relation_name => 'pgml.chat_sample',
    model_name => 'meta-llama/Llama-2-7b-chat-hf',
    test_size => 0.8,
    test_sampling => 'last',
    hyperparams => '{
        "training_args" : {
            "learning_rate": 2e-5,
            "per_device_train_batch_size": 4,
            "per_device_eval_batch_size": 4,
            "num_train_epochs": 1,
            "weight_decay": 0.01,
            "hub_token" : "read_write_token", 
            "push_to_hub" : true,
            "optim" : "adamw_bnb_8bit",
            "gradient_accumulation_steps" : 4,
            "gradient_checkpointing" : true
        },
        "dataset_args" : { "system_column" : "instruction", "user_column" : "input", "assistant_column" : "output" },
        "lora_config" : {"r": 2, "lora_alpha" : 4, "lora_dropout" : 0.05, "bias": "none", "task_type": "CAUSAL_LM"},
        "load_in_8bit" : false,
        "token" : "read_token"
    }'
);
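For a rough sense of what the lora_config above implies, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. It assumes one hypothetical 4096 x 4096 projection (4096 is the hidden size of Llama-2-7b). Which modules actually receive adapters is not stated in this PR, and the function name is illustrative only.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """LoRA adds two low-rank factors: B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


hidden = 4096  # assumption: one square projection at Llama-2-7b's hidden size
r = 2          # "r" from the lora_config in the example above

full = hidden * hidden                              # frozen weights in the matrix
added = lora_trainable_params(hidden, hidden, r)    # trainable adapter weights
fraction = added / full

print(f"{added} adapter params vs {full} frozen params ({fraction:.4%})")
```

With such a small rank, the adapter is a tiny fraction of the matrix it wraps, which is why the conversation example can fine-tune a 7B model with a per-device batch size of 4.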

levkk · 2024-03-04T19:30:38Z

+fn insert_logs(project_id: i64, model_id: i64, logs: String) -> PyResult<String> {
+
+    let id_value = Spi::get_one_with_args::<i64>(
+        "INSERT INTO pgml.logs (project_id, model_id, logs) VALUES ($1, $2, $3::JSONB) RETURNING id;",


Did we include a migration for this table somewhere? We need to make sure it's created on all databases running PostgresML.

Yes, need to add the following three to our migration once we freeze on the version number.

ALTER TYPE pgml.task ADD VALUE IF NOT EXISTS 'conversation';
ALTER TYPE pgml.task ADD VALUE IF NOT EXISTS 'text_pair_classification';
CREATE TABLE IF NOT EXISTS pgml.logs (
    id SERIAL PRIMARY KEY,
    model_id BIGINT,
    project_id BIGINT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    logs JSONB
);

levkk · 2024-03-04T19:32:12Z

 MarkupSafe==2.1.3
 marshmallow==3.20.1
 matplotlib==3.8.2
+maturin==1.4.0


Don't think you need maturin inside PostgresML deployments. This may be a "leak" from the pypgrx extension venv.

levkk · 2024-03-04T19:32:41Z

    task: default!(Option<&str>, "NULL"),
    relation_name: default!(Option<&str>, "NULL"),
-    y_column_name: default!(Option<&str>, "NULL"),
+    _y_column_name: default!(Option<&str>, "NULL"),


Why the underscore? Is it because it's not used?

That's correct.

levkk · 2024-03-04T19:35:20Z

+from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
+from trl.trainer import ConstantLengthDataset
+from peft import LoraConfig, get_peft_model
+from pypgrx import print_info, insert_logs


We need to make sure we import this conditionally (only for fine-tuning) and include it in requirements.linux.txt. I didn't see a macOS build for this, and for the M1/M2 architecture we've been doing releases manually from our Macs (GitHub Actions doesn't have M1 builders).

This makes me think we should start cross-compiling soon. Rust supports this pretty well; maturin may need a patch.

I couldn't get fine tuning to work on Mac OS. It keeps crashing. How about I check for the operating system and bail out if it is mac?
requirements.linux.txt is updated with trl and peft.
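A minimal sketch of the two ideas in this thread, importing the fine-tuning dependencies only when needed and bailing out on macOS. The function names and the injectable importer argument are hypothetical and are not taken from the PR; the importer parameter exists only so the sketch can be exercised without trl and peft installed.

```python
import importlib
import platform
from typing import Callable, Dict, Optional

# The two packages the thread says were added to requirements.linux.txt.
FINE_TUNING_MODULES = ("trl", "peft")


def fine_tuning_supported(system: Optional[str] = None) -> bool:
    """Return False on macOS, where fine-tuning was reported to crash."""
    system = system or platform.system()
    return system != "Darwin"  # platform.system() reports macOS as "Darwin"


def load_fine_tuning_modules(
    system: Optional[str] = None,
    importer: Callable[[str], object] = importlib.import_module,
) -> Dict[str, object]:
    """Import the fine-tuning dependencies only when a tune call needs them."""
    if not fine_tuning_supported(system):
        raise RuntimeError("fine-tuning is not supported on macOS")
    return {name: importer(name) for name in FINE_TUNING_MODULES}
```

Calling load_fine_tuning_modules() at the start of a tune request, rather than at module import time, would keep other code paths working on platforms where trl and peft are not installed.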

levkk · 2024-03-04T19:37:13Z

+            logs["step"] = state.global_step
+            logs["max_steps"] = state.max_steps
+            logs["timestamp"] = str(datetime.now())
+            print_info(json.dumps(logs))


If you use print(), this will appear in Postgres logs. It won't be pretty, but we can add a function that formats it correctly.

I will add indent in json.dumps() to pretty print.
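As a sketch of the proposed change, the helper below builds the same log payload as the diff above and serializes it with an indent. The helper name format_logs and the injectable now argument are assumptions for illustration; in the PR this logic lives inside the trainer callback.

```python
import json
from datetime import datetime
from typing import Optional


def format_logs(logs: dict, global_step: int, max_steps: int,
                now: Optional[datetime] = None) -> str:
    """Attach progress fields to a copy of the logs and pretty-print as JSON."""
    payload = dict(logs)  # copy so the caller's dict is left untouched
    payload["step"] = global_step
    payload["max_steps"] = max_steps
    payload["timestamp"] = str(now or datetime.now())
    return json.dumps(payload, indent=4)


print(format_logs({"loss": 0.42, "learning_rate": 2e-5}, global_step=10, max_steps=100))
```

The indent only changes whitespace, so the string can still be cast to JSONB when it is inserted into pgml.logs.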

levkk · 2024-03-04T19:38:01Z

+                trainable_model_params += param.numel()
+
+        # Calculate and print the number and percentage of trainable parameters
+        print_info(f"Trainable model parameters: {trainable_model_params}")


@kczimm This will require us to use the main thread for ML workloads in our cloud.

A PR with that is close. What's the reason we need main thread here?

We need logging visibility during fine tuning.

Thanks to a commit by @levkk, we should be able to log from any thread.

montanalow · 2024-03-09T04:10:03Z

+#######################
+
+
+class PGMLCallback(TrainerCallback):


I wouldn't be opposed to this functionality living in its own file like tune.py, since transformers is getting a bit beefy.

transformers.py is hardcoded in several places. Moving the fine-tuning code to tune.py needs some more refactoring and testing. Will revisit this in the next iteration. #1378

montanalow · 2024-03-09T04:11:29Z

+        self.model_id = model_id
+
+    def on_log(self, args, state, control, logs=None, **kwargs):
+        _ = logs.pop("total_flos", None)


Why throw away total_flos?

montanalow · 2024-03-09T04:13:25Z

+}
+
+#[pyfunction]
+fn print_info(info: String) -> PyResult<String> {


I think this would be more reusable as log(level, msg)
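To illustrate the shape of the suggested interface, here is a Python-side sketch of a log(level, msg) function with print_info kept as a thin wrapper. It uses the standard logging module as a stand-in; the real function in this PR is a Rust pyfunction, and the logger name and accepted level strings below are assumptions.

```python
import logging

logger = logging.getLogger("pgml_sketch")  # hypothetical logger name

_LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


def log(level: str, msg: str) -> str:
    """Emit msg at the named level and return it, mirroring print_info's return."""
    try:
        numeric_level = _LEVELS[level.lower()]
    except KeyError:
        raise ValueError(f"unknown log level: {level!r}") from None
    logger.log(numeric_level, msg)
    return msg


def print_info(info: str) -> str:
    """Backwards-compatible wrapper: info-level logging only."""
    return log("info", info)
```

Keeping print_info as a wrapper would let existing callers keep working while new code passes an explicit level.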

montanalow · 2024-03-09T04:16:07Z

+        else:
+            self.model_name = hyperparameters.pop("model_name")
+
+        if "token" in hyperparameters:


Isn't this a model init param, not a hyperparam, like many other things in this list? Maybe hyperparams covers everything?

That's correct. Moved all the parameters to hyperparams.
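One way to picture "everything lives in hyperparams" is the sketch below, which splits a single hyperparams JSON string into the nested sections used in the PR description and treats whatever remains at the top level as model-init options. The function name and the "model_init" bucket are hypothetical; the PR's own parsing code may differ.

```python
import json

# Nested sections that appear in the pgml.tune examples in the PR description.
SECTION_KEYS = ("training_args", "dataset_args", "lora_config")


def split_hyperparams(raw: str) -> dict:
    """Separate the nested sections from top-level model-init options."""
    hyperparams = json.loads(raw)
    sections = {key: hyperparams.pop(key, {}) for key in SECTION_KEYS}
    # Anything left at the top level (e.g. "token", "load_in_8bit") is
    # assumed here to be a model-init option rather than a training argument.
    sections["model_init"] = hyperparams
    return sections


raw = """
{
    "training_args": {"learning_rate": 2e-5, "num_train_epochs": 1},
    "dataset_args": {"system_column": "instruction", "user_column": "input",
                     "assistant_column": "output"},
    "lora_config": {"r": 2, "lora_alpha": 4},
    "load_in_8bit": false,
    "token": "read_token"
}
"""
sections = split_hyperparams(raw)
print(sections["model_init"])
```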

montanalow · 2024-03-09T04:17:03Z

+                trainable_model_params += param.numel()
+
+        # Calculate and print the number and percentage of trainable parameters
+        print_info(f"Trainable model parameters: {trainable_model_params}")


We need logging visibility during fine tuning.

montanalow · 2024-03-09T04:22:10Z

-                y_train,
-                x_test,
-                y_test,
+            Ok::<std::option::Option<()>, i64>(Some(())) // this return type is nonsense


montanalow · 2024-03-09T04:23:28Z

+
+            let text1_column_value = dataset_args
+                .0
+                .get("text1_column")


do we require these column names?

Yes. For text pair classification (natural language inference, QNLI, etc.), we need three columns: text1, text2, and the class.
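A small sketch of how the required dataset_args could be checked up front. The key names are the ones used in the examples in the PR description; treating them as strictly required for every task, and the helper name itself, are assumptions.

```python
from typing import Dict, List

# dataset_args keys taken from the pgml.tune examples in the PR description.
REQUIRED_DATASET_ARGS = {
    "text_classification": ("text_column", "class_column"),
    "text_pair_classification": ("text1_column", "text2_column", "class_column"),
    "conversation": ("system_column", "user_column", "assistant_column"),
}


def missing_dataset_args(task: str, dataset_args: Dict[str, str]) -> List[str]:
    """Return the dataset_args keys that the task expects but were not given."""
    if task not in REQUIRED_DATASET_ARGS:
        raise ValueError(f"unsupported task: {task!r}")
    return [key for key in REQUIRED_DATASET_ARGS[task] if key not in dataset_args]


args = {"text1_column": "sentence1", "class_column": "class"}
print(missing_dataset_args("text_pair_classification", args))
```

Reporting all missing keys at once gives the caller a single actionable error instead of one failure per column.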

montanalow · 2024-03-09T04:24:03Z

+
+            let system_column_value = dataset_args
+                .0
+                .get("system_column")


How standard are these names these days?

For conversation task, system, user and assistant have become standard keys.
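As an illustration of the system/user/assistant convention, the sketch below maps one table row to a list of role-tagged messages using the dataset_args from the conversation example. The helper name and the sample row are hypothetical, and the PR may build its training text differently.

```python
from typing import Dict, List

ROLES = ("system", "user", "assistant")


def row_to_messages(row: Dict[str, str], dataset_args: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one row into chat messages, looking up each role's column name."""
    return [
        {"role": role, "content": row[dataset_args[f"{role}_column"]]}
        for role in ROLES
    ]


dataset_args = {
    "system_column": "instruction",
    "user_column": "input",
    "assistant_column": "output",
}
row = {  # made-up sample row for illustration
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
messages = row_to_messages(row, dataset_args)
print(messages)
```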

montanalow · 2024-03-10T01:14:53Z

+    Ok(info)
+}
+/// A Python module implemented in Rust.
+#[pymodule]


Since this crate is interdependent, what if we moved this whole pymodule into the main pgml-extension crate, under bindings/python/mod.rs instead of publishing it as a separate crate?

montanalow · 2024-03-10T06:02:48Z



+# venv
+pgml-venv


levkk · 2024-03-26T20:15:07Z

+    project_id BIGINT,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    logs JSONB
+);


santiatpml requested review from SilasMarvin, kczimm, levkk and montanalow March 4, 2024 19:21

levkk reviewed Mar 4, 2024

Comment thread pgml-extension/src/orm/model.rs

santiatpml and others added 19 commits March 5, 2024 09:50

fine-tuning text classification in progress

e3bea27

More commit messages

c4cf332

Working text classification with dataset args and training args

fb7cc2a

finetuing with text dataset enum to handle different tasks

5584487

text pair classification task support

82cb4f7

saving model after training

c10de47

removed device to cpu

63ee09b

updated transforemrs

865ae28

Working e2e finetunig for two tasks

097a8cf

Integration with huggingface hub and wandb

2dd50e6

Conversation dataset + training placeholder

6ac8722

Updated rust to fix failing tests

1e40cd8

working version of conversation with lora + load 8bit + hf hub

312d893

Tested llama2-7b finetuning

afc2e93

pypgrx first working version

22ee5c7

refactoring finetuning code to add callbacks

97d455d

fixed merge conflicts

b700944

Refactored finetuning + conversation + pgml callbacks

65d2f8b

removed wandb dependency

5f1b5f4

Santi Adavani added 4 commits March 5, 2024 09:54

removed local pypgrx from requirements

08084bf

removed maturin from requirements

dc0c6ee

removed flash attn

421af8f

Added indent for info display

4bbca96

santiatpml force-pushed the santi-llm-fine-tuning branch from 189e9f0 to 4bbca96 Compare March 5, 2024 17:59

santiatpml added 4 commits March 6, 2024 19:47

Updated readme with LLM fine-tuning for text classification

3db857c

README updates

7cbee43

Added a tutorial for 9 classes - draft 1

9284cf1

README updates

66c65c8

montanalow reviewed Mar 9, 2024

montanalow reviewed Mar 10, 2024

SilasMarvin and others added 8 commits March 18, 2024 11:24

Moved python functions (#1374)

5759ee3

README updates

b539168

migrations and removed pypgrx

31215b8

Added r_log to take log level and message

dae6b74

Updated version and requirements

dae5ffc

Changed version 2.8.3

435f5bd

README updates for conversation task fine-tuning using lora

aeb2683

minor readme updates

e5221cc

santiatpml requested a review from levkk March 26, 2024 19:54

levkk reviewed Mar 26, 2024

added new line

6db147e

santiatpml merged commit f75114b into master Mar 26, 2024

santiatpml deleted the santi-llm-fine-tuning branch March 26, 2024 20:31
