Hi I am speaker
After tag removal: P Hi I am speaker
We remove everything that starts with ["P", "BRK", "CHAPTER", "/P"]
and only keep tagname == SPEAKER
because line starting with ", language="html"),
" as a tag since it is used by some instruction tuning dataset, but realize the ",
D_code(" ", language="html"),
" tag can easily conflict with the original text.",
style="margin-bottom: -3px",
),
Li(
"As discussed above, the comment hierarchies required a thoughtful approach to extracting meaningful data. ",
style="margin-bottom: -3px",
),
Li(
"In the comment thread hierarchy, relationships had to be assigned between the comments, sub-comments, and original story ID. ",
style="margin-bottom: -3px",
),
),
P(B("Filters Applied: ")),
Ul(
Li("Language Filter: English", style="margin-bottom: -3px"),
Li("Minimum Word Count Filter: 10", style="margin-bottom: -3px"),
Li(
"Unigram Log Probability Threshold: -20",
style="margin-bottom: -3px",
),
),
table_div_hn,
),
),
Section(
Div(
H3("USPTO"),
P("Patent documents from the United States Patent and Trademark Office."),
P(
B("Download and Extraction: "),
"Data was downloaded and extracted using tags from ",
A(
"https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/",
href="https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/",
),
". Three different formats, split by year, each required its own function to download and extract the data: ",
I("Pre_2002"),
", ",
I("2002_to_2004"),
" and ",
I("post_2004"),
". We used the exact code used in The Pile (citation needed).",
),
P(B("Filters Applied: ")),
Ul(
Li("Language Filter: English", style="margin-bottom: -3px"),
Li("Minimum Word Count Filter: 50", style="margin-bottom: -3px"),
Li("Unigram Log Probability", style="margin-bottom: -3px"),
),
table_div_uspto,
),
),
Section(
Div(
H3("FreeLaw"),
P(
"Legal documents and court cases from various jurisdictions, provided by the Free Law Project, a US-registered non-profit. We include data from CourtListener, which contains millions of legal opinions from federal and state courts."
),
P(
B("Download and Extraction: "),
"The dataset was downloaded from: ",
A(
"https://storage.courtlistener.com/bulk-data/",
href="https://storage.courtlistener.com/bulk-data/",
),
". There are 19 CSV files, which contain overlapping content. Content can appear in multiple columns of a CSV file, so extraction has to consider all of the columns together. Text was extracted from the following columns using the html2text function. The block below shows how each text type was extracted.",
),
D_code(
"""
("html", html2text), ("html_lawbox", html2text),
("html_columbia", html2text), ("html_anon_2020", html2text),
("html_with_citations", html2text), ("xml_harvard", html2text),
plain_text
""",
language="python",
),
P(
"All content was downloaded, which led to a high number of documents being filtered out during local deduplication. Following The Pile, priority was given to plain_text first, followed by the columns in the block above in reverse order."
),
P(B("Unique Data Preparation Challenges: ")),
P("The FreeLaw text uses a lot of whitespace and newlines to format documents visually. These characters are not necessary for language model learning and sometimes carry confusing semantic meaning. We attempt to unify how whitespace appears in this dataset with the following heuristics:"),
Ul(
Li(
"Consecutive whitespace characters and tabs were reduced to a single whitespace character.",
style="margin-bottom: -3px",
),
Li(
"Whitespace found between newlines with no additional text was removed.",
style="margin-bottom: -3px",
),
Li(
"Some documents contained consecutive newlines that did not start a new paragraph. All consecutive newlines were reduced to a single newline.",
style="margin-bottom: -3px",
),
Li(
"All single newlines were converted to whitespace. If whitespace was found after a newline with no text, the whitespace was removed. All leading and trailing whitespace was removed.",
style="margin-bottom: -3px",
),
Li(
"All form feed (",
D_code("\\f", language="bash"),
") characters were removed.", style="margin-bottom: -3px"
),
),
P(B("Filters Applied: ")),
Ul(
Li("Language Filter: English", style="margin-bottom: -3px"),
Li("Minimum Word Count Filter: 50", style="margin-bottom: -3px"),
Li("Unigram Log Probability", style="margin-bottom: -3px"),
),
P(
"Note: Local deduplication within FreeLaw itself removed more than 90% of the dataset as duplicates."
),
table_div_freelaw,
Details(
Summary("FreeLaw Filtering Examples"),
Div(
freelaw_examples,
style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; ", # Styling for the DV2 part
),
style="""
background-color: #FFFAEA; /* Light yellow background */
padding: 15px;
border-radius: 12px;
margin-bottom: 15px
""",
),
),
),
Section(
Div(
H3("StackExchange"),
P(
"A network of question-and-answer websites on various subjects, including programming, science, mathematics, and more. This is one of the largest publicly available repositories of question-answer pairs. We also include comments to capture the overall discussion on each post."
),
P(
B("Download and Extraction: "),
"The archive dataset was used to download all data from StackExchange and its 364 sub-URLs, including: ",
A("math.stackexchange.com", href="https://math.stackexchange.com"),
". Raw data was extracted in XML format, and only two files, Posts.xml and Comments.xml, were considered. To match the StackExchange hierarchy, each file was parsed using post_id to connect questions to answers and then to comments. We will include the full list of sub-URLs when the code is released.",
),
D_code(
"""
1. Questions:
    2. Comment1:
    3. Comment2:
    4. Answer1:
        5. Comment1:
        6. Comment2:
    7. Answer2:
        8. Comment1:
        9. Comment2:""",
block="block",
language="python",
),
P(B("Unique Data Preparation Challenges: ")),
Ul(
Li(
"Handling code blocks required finding the specific blocks and extracting their contents into a single snippet.",
style="margin-bottom: -3px",
),
Li(
"Question and answer formatting had to be rewritten so that each answer is matched to its question.",
style="margin-bottom: -3px",
),
Li(
"Occasionally a title was not included at the beginning of a question. For consistent formatting, a title was added.",
style="margin-bottom: -3px",
),
),
P(B("Filters Applied: ")),
Ul(
Li("Minimum Word Count Filter: 10", style="margin-bottom: -3px"),
),
table_div_se,
Details(
Summary("StackExchange Filtering Examples"),
Div(
se_examples,
style="background-color: white; padding: 15px; margin-top: 10px; margin-bottom: 10px; border-radius: 8px; border: none; ", # Styling for the DV2 part
),
style="""
background-color: #FFFAEA; /* Light yellow background */
padding: 15px;
border-radius: 12px;
margin-bottom: 15px
""",
),
),
),
Section(
Div(
H3("Ubuntu IRC"),
P(
"Chat logs from the Ubuntu Internet Relay Chat (IRC) channels on the Freenode IRC chat server. This data is another form of dialog dataset on niche topics."
),
P(
B("Download and Extraction: "),
"The dataset was downloaded from: ",
A(
"https://irclogs.ubuntu.com/{date.year}/{date.month:02d}/{date.day:02d}/",
href="https://irclogs.ubuntu.com/{date.year}/{date.month:02d}/{date.day:02d}/",
),
" based on the year.",
),
P("During extraction, the logs were cleaned using the following functions:"),
D_code(
"""
def exclude_system(x):
    # Drop every system line (lines starting with '===').
    return '\\n'.join(line for line in x.split('\\n') if not line.startswith('==='))


def exclude_select_system(x):
    # Drop only the join/leave/topic/nick-change system lines.
    return '\\n'.join(
        line for line in x.split('\\n')
        if not (line.startswith('===') and any(
            term in line for term in
            ['has joined #', 'has left #', 'Topic for #', "Topic (#", "is now known as"]))
    )


def clean(x):
    # Mark remaining system lines with '* '; strip the first 8 characters from all other lines.
    return '\\n'.join('* ' + line[4:] if line.startswith('===') else line[8:] for line in x.split('\\n'))
""",
block="block",
language="python",
),
P(B("Unique Data Preparation Challenges: ")),
Ul(
Li(
"Similar to the HackerNews challenges, we had to map comments and sub-comments to the original question.",
style="margin-bottom: -3px",
),
Li(
"The dataset comes with the usernames of post authors. We attempt to replace them with strings such as