QagentS

AI & ML interests

Q learning and LLMs

Hi folks,

Colab: https://colab.research.google.com/drive/10av3SxFf0Psx_IkmZbcUhiVznStV5pVS?usp=sharing

#OpenSourcing
pip-code-bandit
-- a model to act as the intelligence unit in agentic workflows.

pipflow
-- a library to manage and run goal-oriented agentic systems.

pip-code-bandit attributes:
-- number of params ~ 1.3B [2.9 GB GPU memory footprint]
-- sequence length ~ 16.3k [can go higher, but performance will degrade]
-- license: Apache 2.0
-- instruction-following, RL-tuned.
-- tasks (task | input given to the model):
complex planning (plan) of sequential function calls | a list of callables and a goal
corrected plan | feedback instructions with error
function calling | doc or code and goal
code generation | plan and goal
code generation | goal
doc generation | code
code generation | doc
file parsed to JSON | any raw data
SQL generation | schema, question, instructions and examples
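
To make the planning task a bit more concrete, here is a minimal sketch of what running a plan of sequential function calls could look like once a plan exists. Everything in it is an assumption for illustration: the plan format (a list of steps, each with a function name and keyword arguments), the `$prev` placeholder, the `run_plan` helper, and the two toy callables. It is not the pip-code-bandit output format or the pipflow API.

```python
from typing import Any, Callable, Dict, List


def load_numbers(text: str) -> List[int]:
    """Toy callable: parse a comma-separated string into integers."""
    return [int(part) for part in text.split(",")]


def total(values: List[int]) -> int:
    """Toy callable: sum a list of integers."""
    return sum(values)


# Hypothetical plan format (an assumption, not the model's real output):
# each step names a callable and its keyword arguments.
# The string "$prev" stands for the result of the previous step.
PLAN: List[Dict[str, Any]] = [
    {"function": "load_numbers", "kwargs": {"text": "1,2,3"}},
    {"function": "total", "kwargs": {"values": "$prev"}},
]


def run_plan(plan: List[Dict[str, Any]],
             callables: Dict[str, Callable[..., Any]]) -> Any:
    """Run the steps in order, feeding each result into the next step."""
    prev: Any = None
    for step in plan:
        fn = callables[step["function"]]
        # Swap the "$prev" placeholder for the previous step's result.
        kwargs = {
            key: (prev if value == "$prev" else value)
            for key, value in step["kwargs"].items()
        }
        prev = fn(**kwargs)
    return prev


result = run_plan(PLAN, {"load_numbers": load_numbers, "total": total})
print(result)  # 6
```

In this sketch the "list of callables and goal" from the task list maps to the `callables` dict plus whatever goal produced `PLAN`; a "corrected plan" would simply be a new list of steps after a failed run.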

#Strategy

We used a simulator to create environments where the model could play games to achieve goals, given a set of actions available to it. All the model could do was find the right action and configuration to earn a positive reward. The reward policy is built around the idea of the model settling into a stable state of zero net-sum reward across both good and bad behaviour. In this setup, the model, which was pre-trained on code, function documentation, and similar open-source datasets, was RL-tuned for reliability and instruction-following.
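
As a rough illustration of the "find the right action and config to get a positive reward" idea, here is a toy sketch. It is not our training code: the `ToyEnv` class, the +1 / -1 rewards, and the epsilon-greedy value update are all assumptions chosen to keep the example small, and the zero net-sum reward policy described above is not reproduced here.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Action:
    """A hypothetical action: a name plus a single config value."""
    name: str
    config: int


class ToyEnv:
    """One-step game: exactly one (action, config) pair reaches the goal."""

    def __init__(self, actions: List[Action], target: Action) -> None:
        self.actions = actions
        self.target = target

    def step(self, action: Action) -> float:
        # Assumed reward scheme: +1 for the right action and config, -1 otherwise.
        return 1.0 if action == self.target else -1.0


def train(env: ToyEnv, episodes: int = 500, epsilon: float = 0.1,
          lr: float = 0.1, seed: int = 0) -> Dict[Action, float]:
    """Epsilon-greedy estimate of each action's expected reward."""
    rng = random.Random(seed)
    values = {action: 0.0 for action in env.actions}
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(env.actions)  # explore
        else:
            action = max(env.actions, key=lambda a: values[a])  # exploit
        reward = env.step(action)
        # Move the estimate a small step toward the observed reward.
        values[action] += lr * (reward - values[action])
    return values


actions = [Action(name, cfg) for name in ("read", "write", "call") for cfg in (0, 1)]
env = ToyEnv(actions, target=Action("call", 1))
values = train(env)
best = max(values, key=values.get)
print(best)
```

After training, `best` should be the target action, since it is the only one whose estimated reward ever rises above zero.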

Do try it out and let me know how it's working for you.

Thank you :)
