wgcv committed
Commit d40f080
1 Parent(s): c02507b

add some notes
Files changed (3):
  1. .gitattributes +1 -0
  2. app.py +125 -6
  3. assets/banner_tabs.png +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/*.png filter=lfs diff=lfs merge=lfs -text
app.py CHANGED
@@ -16,6 +16,7 @@ st.markdown("Condense your Browser Tabs into a few impactful words. - Inspired i
 
 
 # Sidebar
+st.sidebar.title("🔬 Test this Model!")
 st.sidebar.caption("Tidy Tabs - Title")
 user_input_url = st.sidebar.text_input('Enter your url:')
 
 
@@ -34,21 +35,24 @@ def load_tab():
 else:
 
 with st.spinner('Wait for it...'):
-st.sidebar.write(f'<title>: **{title}**')
+st.sidebar.write(f'**<title>: **{title}')
 time.sleep(1)
 with st.spinner('Wait for it...'):
-st.sidebar.write(f'T5-small: **{predict_model_t5(text)}**')
+st.sidebar.write(f'**T5-small: **{predict_model_t5(text)}')
 with st.spinner('Wait for it...'):
-st.sidebar.write(f'Pegasus xsum: **{predict_model_pegasus(text)}**')
+st.sidebar.write(f'**Pegasus xsum: **{predict_model_pegasus(text)}')
 with st.spinner('Wait for it...'):
-st.sidebar.write(f'Pegasus Bart: **{predict_model_bart(text)}**')
+st.sidebar.write(f'**Bart-Large-Cnn: **{predict_model_bart(text)}')
 else:
 error_message = st.sidebar.error(f'{text} is not a valid URL. Please enter a valid URL.')
 
-button_clicked = st.sidebar.button("Load tab", on_click=load_tab())
+button_clicked = st.sidebar.button("Rename the tab", on_click=load_tab())
 
 st.sidebar.divider()
-
+###
+# Content
+###
+st.image('./assets/banner_tabs.png', width=350, caption='Navigate Through Powerful Features with Intuitive Tabs')
 
 with st.status("Loading models...", expanded=True, state="complete") as models:
 st.write("Loading https://huggingface.co/wgcv/tidy-tab-model-t5-small")
 
@@ -66,4 +70,119 @@ with st.status("Loading models...", expanded=True, state="complete") as models:
 models.update(label="All models loaded!", state="complete", expanded=False)
 
 
+st.info("All three models are deployed in a single Hugging Face Space using the free tier. Specifications: CPU-based (no GPU), 2 vCPU cores, 16 GB RAM, and 50 GB storage.", icon="ℹ️")
+###
+# Examples
+###
+
+
+st.markdown("""
+
+Here are some examples you can try that aren't included in the training or test datasets.
+# Examples
+```
+URLs:
+https://www.nytimes.com/2007/01/10/technology/10apple.html
+https://www.nytimes.com/2021/04/15/arts/design/Met-museum-roof-garden-da-corte.html
+https://www.forbes.com/sites/davidphelan/2024/07/09/apple-iphone-16-pro-major-design-upgrade-coming-new-report-claims/
+https://www.crn.com/news/channel-programs/18828789/microsoft-to-release-windows-xp-service-pack-1
+https://github.com/torvalds
+https://www.rickbayless.com/recipe/pastor-style-tacos/
+
+Some websites, like x.com, are not accessible because they load their content with client-side JavaScript, which is beyond the scope of this project.
+
+```
+
+# The Dataset
+The project's creator collected the dataset from various sources on the internet.
+The dataset includes:
+
+| Feature | Description |
+|---------|-------------|
+| URL | URL of the webpage |
+| title | Extracted from the HTML `<title>` |
+| description | Extracted from `<meta name="description" content="description">` |
+| paragraphs | Extracted from `<p>` tags |
+| headings | Extracted from `<h1>`, `<h2>`, `<h3>` tags |
+| combined text | Formatted as `[title] title \\n [description]description` |
+
+
+The dataset primarily comprises data gathered from nytimes.com and GitHub.com, supplemented by approximately 60 other websites featuring diverse content. From GitHub, 1,226 summaries were created programmatically in the format `Username GitHub Profiles`, to explore the model's ability to generate patterns with new words. For the New York Times, 1,056 websites were summarized from their text content using Anthropic's Claude 3.5 Sonnet with the prompt below.
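
A minimal sketch of how such a `combined text` field could be assembled (assuming `requests` and `beautifulsoup4`; the actual collection script is not shown in this commit):

```python
# Sketch: assemble the dataset's "combined text" feature from a live page.
# Assumes requests + beautifulsoup4; field extraction mirrors the table above.
import requests
from bs4 import BeautifulSoup

def build_combined_text(url: str) -> str:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    title = soup.title.get_text(strip=True) if soup.title else ""
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "") if meta else ""

    # Mirrors the documented format: [title] title \n [description]description
    return f"[title] {title} \n [description]{description}"

print(build_combined_text("https://github.com/torvalds"))
```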
+## Prompt for Label Generation
+```
+Claude 3.5 Sonnet Prompt
+
+I'm going to share with you a csv file with one column. I want you to create a summary of 1 to 3 words maximum of the text. The text could have HTML tags. The title is the title of the page and the description is the page's description.
+Give me the result like this:
+```
+summary 1
+summary 2
+...
+summary n
+```
+Only plain text and no additional instructions
+```
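
A hedged sketch of how batch labeling with this prompt might look using Anthropic's Python SDK (the SDK call and model id are assumptions; the commit only documents the prompt text):

```python
# Sketch: batch label generation with the prompt above via Anthropic's SDK.
import anthropic

PROMPT = (
    "I'm going to share with you a csv file with one column. "
    "I want you to create a summary of 1 to 3 words maximum of the text. "
    "The text could have HTML tags. "
    "Only plain text and no additional instructions."
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def label_rows(rows: list[str]) -> list[str]:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed model id
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT + "\n\n" + "\n".join(rows)}],
    )
    # Expect one 1-3 word summary per input row, one summary per line.
    return response.content[0].text.splitlines()
```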
+
+- This small dataset aims to provide an initial assessment of model performance on a pre-trained task limited to concise summaries of 1 to 4 words. Due to the inherent complexity of this task, I suggest future efforts focus on constructing a larger dataset comprising 50,000 to 500,000 websites to more comprehensively evaluate model capabilities.
+
+- Testing revealed that the description meta tag significantly enhanced result generation. Increasing dataset size and incorporating contextual data are expected to further improve model performance in larger-scale applications with millions of data points.
+
+- Of the roughly 60 additional websites included, only 41 are sourced from substack.com, meaning less than 2% of the dataset contains information from substack.com. This is valuable for understanding the impact of small data examples.
+
+- P.S. I tested ChatGPT-4.0, and the results were highly discouraging for a chunk of data consisting of 100 text field values.
+
+- In the future, we should aim to increase the dataset size to at least 10,000-15,000 samples and improve the train/test/validation split methodology.
+
+st.info("I crafted this dataset using a more powerful LLM and scripts, no need for boring manual labeling. The idea is to eliminate human labeling.", icon="ℹ️")
+
+#### Access to the data
+
+`https://huggingface.co/datasets/wgcv/website-title-description`
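
A minimal loading sketch with the `datasets` library (repo id taken from the URL above; the split name is an assumption):

```python
# Sketch: pull the published dataset from the Hub.
from datasets import load_dataset

ds = load_dataset("wgcv/website-title-description")
print(ds)              # available splits and features
print(ds["train"][0])  # first example ("train" split assumed)
```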
+
+# Models
+My objective was to show that it is possible to create a small ML model, derived from a larger LLM, that achieves comparable or better results on a specific task than the original LLM.
+
+Given the substantial volume of data required, training a model from scratch was impractical. Instead, our approach focused on evaluating the performance of existing pre-trained models as a baseline. This strategy served as an optimal starting point for developing a custom, lightweight model tailored to our specific use case: enhancing browser tab organization and efficiently summarizing the core concepts of favorited websites.
+
+### T5-small
+- The [T5-small](https://huggingface.co/wgcv/tidy-tab-model-t5-small) model is a fine-tune of google-t5/t5-small.
+- It's a text-to-text model.
+- It's a general model for all NLP tasks.
+- The task is defined by the input format.
+- To perform summarization, prefix the input text with 'summarize:' (see the sketch below).
+- 60.5M parameters
+
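
A minimal inference sketch with `transformers` (model id from the link above; the sample input and generation settings are illustrative):

```python
# Sketch: generating a short title with the fine-tuned T5-small.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "wgcv/tidy-tab-model-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# T5 is text-to-text: the task is selected by the input prefix.
text = "summarize: [title] Linus Torvalds \n [description]Linus Torvalds on GitHub"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
output = model.generate(**inputs, max_new_tokens=8)  # target titles are 1-4 words
print(tokenizer.decode(output[0], skip_special_tokens=True))
```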
+### Pegasus-xsum
+- The [Pegasus-xsum](https://huggingface.co/wgcv/tidy-tab-model-pegasus-xsum) model is a fine-tune of google/pegasus-xsum.
+- It's a text-to-text model.
+- It's a specialized summarization model.
+- 570M parameters
+
+### Bart-large
+- The [Bart-large](https://huggingface.co/wgcv/tidy-tab-model-bart-large-cnn) model is a fine-tune of facebook/bart-large-cnn.
+- Prior to our fine-tuning, it was fine-tuned on the CNN/Daily Mail dataset.
+- It's a BART model, using a transformer encoder-decoder (seq2seq) architecture.
+- BART models typically perform better with small datasets compared to text-to-text models.
+- 406M parameters
+
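
All three fine-tunes can be exercised the same way through the high-level `pipeline` API; a sketch (model ids from the links above, generation length is an assumption):

```python
# Sketch: run the same input through all three fine-tuned models.
from transformers import pipeline

MODELS = [
    "wgcv/tidy-tab-model-t5-small",       # expects the "summarize:" prefix
    "wgcv/tidy-tab-model-pegasus-xsum",
    "wgcv/tidy-tab-model-bart-large-cnn",
]

text = "[title] Pastor-Style Tacos \n [description]Tacos al pastor recipe"
for model_id in MODELS:
    summarizer = pipeline("summarization", model=model_id)
    prompt = ("summarize: " + text) if "t5" in model_id else text
    result = summarizer(prompt, max_new_tokens=8)[0]["summary_text"]
    print(f"{model_id}: {result}")
```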
+
+
+### Potential avenues for performance enhancement
+- Data preprocessing optimization
+- Dataset expansion
+- Comprehensive hyperparameter tuning
+
+These strategies could significantly improve model efficacy.
+
+## co2_eq_emissions
+- emissions: 0.16 grams of CO2
+- source: mlco2.github.io
+- training_type: fine-tuning
+- geographical_location: U.S.
+- hardware_used: 1 x T4 GPU
+
+
+
+""", unsafe_allow_html=False, help=None)
 
 
assets/banner_tabs.png ADDED

Git LFS Details

  • SHA256: 79baf5a0caaa8082e64ed4dd7c24be56d40f94e1fd8bf5fde3087ff914e3a687
  • Pointer size: 132 Bytes
  • Size of remote file: 1.57 MB