Hi folks,
Colab: https://colab.research.google.com/drive/10av3SxFf0Psx_IkmZbcUhiVznStV5pVS?usp=sharing
#OpenSourcing
pip-code-bandit
-- a model that acts as the intelligence unit in agentic workflows.
pipflow
-- a library to manage and run goal-oriented agentic systems.
pip-code-bandit attributes --
-- number of params ~ 1.3B [~2.9 GB GPU memory footprint]
-- sequence length ~ 16.3k [can go higher, but with performance degradation]
-- license: Apache 2.0
-- instruction-following, RL-tuned.
-- tasks [output | inputs]:
complex planning (plan) of sequential function calls | a list of callables and a goal
corrected plan | feedback instructions with the error
function calling | a doc or code and a goal
code generation | a plan and a goal
code generation | a goal
doc generation | code
code generation | a doc
file parsed to JSON | any raw data
SQL generation | schema, question, instructions, and examples
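If you want to poke at the model outside the Colab, here is a minimal loading-and-inference sketch using Hugging Face transformers. The repo ID PipableAI/pip-code-bandit and the <doc>/<goal> prompt tags are assumptions on my part; check the model card for the exact ID and prompt format.

```python
# Minimal sketch: load the model with Hugging Face transformers and run
# a function-calling style prompt (doc + goal -> call).
# ASSUMPTIONS: the repo ID and the <doc>/<goal> tags below are
# illustrative placeholders; verify both against the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PipableAI/pip-code-bandit"  # assumed Hugging Face repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~1.3B params, roughly the 2.9 GB footprint noted above
    device_map="auto",
)

prompt = (
    "<doc>requests.get(url, params=None, timeout=None) -> Response</doc>\n"
    "<goal>Fetch https://example.com with a 5 second timeout.</goal>\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```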
#Strategy
We used a simulator to build environments where the model plays games to achieve goals, given a set of available actions. All the model can do is pick the right action and configuration to earn a positive reward. The reward policy is designed so the model converges to a stable state where rewards for good behaviour and penalties for bad behaviour net out to zero. In this setup, the model, pre-trained on code, function documentation, and similar open-source datasets, was RL-tuned for reliability and instruction following.
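To make that reward idea concrete, here is a toy sketch. Everything in it (the environment, the actions, the reward values) is invented for illustration; it is not the actual simulator or training setup.

```python
# Toy sketch of a zero-net-sum reward scheme: rewards are balanced so
# that acting at random nets zero, and only consistently goal-directed
# behaviour accumulates positive reward. NOT the actual training code.
import random

ACTIONS = ["right_call", "wrong_call", "noop"]
GOAL_ACTION = "right_call"

def reward(action: str) -> float:
    # Balanced so uniform-random play nets zero: (+2 - 1 - 1) / 3 == 0.
    return 2.0 if action == GOAL_ACTION else -1.0

# Tabular preference "policy" nudged by the reward signal.
prefs = {a: 0.0 for a in ACTIONS}
lr = 0.1
for _ in range(1_000):
    # Noisy-greedy action selection so the policy keeps exploring.
    action = max(ACTIONS, key=lambda a: prefs[a] + random.gauss(0.0, 1.0))
    prefs[action] += lr * reward(action)

print(prefs)  # the goal action's preference should dominate
```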
Do try it out and let me know how it's working for you.
Thank you :)