akhaliq HF staff commited on
Commit
e524c7a
1 Parent(s): e68ebcb

Upload folder using huggingface_hub

Browse files
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ data/example.jpg filter=lfs diff=lfs merge=lfs -text
ACKNOWLEDGEMENTS.md ADDED
@@ -0,0 +1,418 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Acknowledgements
2
+ Portions of this Software may utilize the following copyrighted
3
+ material, the use of which is hereby acknowledged.
4
+
5
+ ------------------------------------------------
6
+ PyTorch Image Models (timm)
7
+ Ross Wightman
8
+
9
+ Apache License
10
+ Version 2.0, January 2004
11
+ http://www.apache.org/licenses/
12
+
13
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
14
+
15
+ 1. Definitions.
16
+
17
+ "License" shall mean the terms and conditions for use, reproduction,
18
+ and distribution as defined by Sections 1 through 9 of this document.
19
+
20
+ "Licensor" shall mean the copyright owner or entity authorized by
21
+ the copyright owner that is granting the License.
22
+
23
+ "Legal Entity" shall mean the union of the acting entity and all
24
+ other entities that control, are controlled by, or are under common
25
+ control with that entity. For the purposes of this definition,
26
+ "control" means (i) the power, direct or indirect, to cause the
27
+ direction or management of such entity, whether by contract or
28
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
29
+ outstanding shares, or (iii) beneficial ownership of such entity.
30
+
31
+ "You" (or "Your") shall mean an individual or Legal Entity
32
+ exercising permissions granted by this License.
33
+
34
+ "Source" form shall mean the preferred form for making modifications,
35
+ including but not limited to software source code, documentation
36
+ source, and configuration files.
37
+
38
+ "Object" form shall mean any form resulting from mechanical
39
+ transformation or translation of a Source form, including but
40
+ not limited to compiled object code, generated documentation,
41
+ and conversions to other media types.
42
+
43
+ "Work" shall mean the work of authorship, whether in Source or
44
+ Object form, made available under the License, as indicated by a
45
+ copyright notice that is included in or attached to the work
46
+ (an example is provided in the Appendix below).
47
+
48
+ "Derivative Works" shall mean any work, whether in Source or Object
49
+ form, that is based on (or derived from) the Work and for which the
50
+ editorial revisions, annotations, elaborations, or other modifications
51
+ represent, as a whole, an original work of authorship. For the purposes
52
+ of this License, Derivative Works shall not include works that remain
53
+ separable from, or merely link (or bind by name) to the interfaces of,
54
+ the Work and Derivative Works thereof.
55
+
56
+ "Contribution" shall mean any work of authorship, including
57
+ the original version of the Work and any modifications or additions
58
+ to that Work or Derivative Works thereof, that is intentionally
59
+ submitted to Licensor for inclusion in the Work by the copyright owner
60
+ or by an individual or Legal Entity authorized to submit on behalf of
61
+ the copyright owner. For the purposes of this definition, "submitted"
62
+ means any form of electronic, verbal, or written communication sent
63
+ to the Licensor or its representatives, including but not limited to
64
+ communication on electronic mailing lists, source code control systems,
65
+ and issue tracking systems that are managed by, or on behalf of, the
66
+ Licensor for the purpose of discussing and improving the Work, but
67
+ excluding communication that is conspicuously marked or otherwise
68
+ designated in writing by the copyright owner as "Not a Contribution."
69
+
70
+ "Contributor" shall mean Licensor and any individual or Legal Entity
71
+ on behalf of whom a Contribution has been received by Licensor and
72
+ subsequently incorporated within the Work.
73
+
74
+ 2. Grant of Copyright License. Subject to the terms and conditions of
75
+ this License, each Contributor hereby grants to You a perpetual,
76
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77
+ copyright license to reproduce, prepare Derivative Works of,
78
+ publicly display, publicly perform, sublicense, and distribute the
79
+ Work and such Derivative Works in Source or Object form.
80
+
81
+ 3. Grant of Patent License. Subject to the terms and conditions of
82
+ this License, each Contributor hereby grants to You a perpetual,
83
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
84
+ (except as stated in this section) patent license to make, have made,
85
+ use, offer to sell, sell, import, and otherwise transfer the Work,
86
+ where such license applies only to those patent claims licensable
87
+ by such Contributor that are necessarily infringed by their
88
+ Contribution(s) alone or by combination of their Contribution(s)
89
+ with the Work to which such Contribution(s) was submitted. If You
90
+ institute patent litigation against any entity (including a
91
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
92
+ or a Contribution incorporated within the Work constitutes direct
93
+ or contributory patent infringement, then any patent licenses
94
+ granted to You under this License for that Work shall terminate
95
+ as of the date such litigation is filed.
96
+
97
+ 4. Redistribution. You may reproduce and distribute copies of the
98
+ Work or Derivative Works thereof in any medium, with or without
99
+ modifications, and in Source or Object form, provided that You
100
+ meet the following conditions:
101
+
102
+ (a) You must give any other recipients of the Work or
103
+ Derivative Works a copy of this License; and
104
+
105
+ (b) You must cause any modified files to carry prominent notices
106
+ stating that You changed the files; and
107
+
108
+ (c) You must retain, in the Source form of any Derivative Works
109
+ that You distribute, all copyright, patent, trademark, and
110
+ attribution notices from the Source form of the Work,
111
+ excluding those notices that do not pertain to any part of
112
+ the Derivative Works; and
113
+
114
+ (d) If the Work includes a "NOTICE" text file as part of its
115
+ distribution, then any Derivative Works that You distribute must
116
+ include a readable copy of the attribution notices contained
117
+ within such NOTICE file, excluding those notices that do not
118
+ pertain to any part of the Derivative Works, in at least one
119
+ of the following places: within a NOTICE text file distributed
120
+ as part of the Derivative Works; within the Source form or
121
+ documentation, if provided along with the Derivative Works; or,
122
+ within a display generated by the Derivative Works, if and
123
+ wherever such third-party notices normally appear. The contents
124
+ of the NOTICE file are for informational purposes only and
125
+ do not modify the License. You may add Your own attribution
126
+ notices within Derivative Works that You distribute, alongside
127
+ or as an addendum to the NOTICE text from the Work, provided
128
+ that such additional attribution notices cannot be construed
129
+ as modifying the License.
130
+
131
+ You may add Your own copyright statement to Your modifications and
132
+ may provide additional or different license terms and conditions
133
+ for use, reproduction, or distribution of Your modifications, or
134
+ for any such Derivative Works as a whole, provided Your use,
135
+ reproduction, and distribution of the Work otherwise complies with
136
+ the conditions stated in this License.
137
+
138
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
139
+ any Contribution intentionally submitted for inclusion in the Work
140
+ by You to the Licensor shall be under the terms and conditions of
141
+ this License, without any additional terms or conditions.
142
+ Notwithstanding the above, nothing herein shall supersede or modify
143
+ the terms of any separate license agreement you may have executed
144
+ with Licensor regarding such Contributions.
145
+
146
+ 6. Trademarks. This License does not grant permission to use the trade
147
+ names, trademarks, service marks, or product names of the Licensor,
148
+ except as required for reasonable and customary use in describing the
149
+ origin of the Work and reproducing the content of the NOTICE file.
150
+
151
+ 7. Disclaimer of Warranty. Unless required by applicable law or
152
+ agreed to in writing, Licensor provides the Work (and each
153
+ Contributor provides its Contributions) on an "AS IS" BASIS,
154
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
155
+ implied, including, without limitation, any warranties or conditions
156
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
157
+ PARTICULAR PURPOSE. You are solely responsible for determining the
158
+ appropriateness of using or redistributing the Work and assume any
159
+ risks associated with Your exercise of permissions under this License.
160
+
161
+ 8. Limitation of Liability. In no event and under no legal theory,
162
+ whether in tort (including negligence), contract, or otherwise,
163
+ unless required by applicable law (such as deliberate and grossly
164
+ negligent acts) or agreed to in writing, shall any Contributor be
165
+ liable to You for damages, including any direct, indirect, special,
166
+ incidental, or consequential damages of any character arising as a
167
+ result of this License or out of the use or inability to use the
168
+ Work (including but not limited to damages for loss of goodwill,
169
+ work stoppage, computer failure or malfunction, or any and all
170
+ other commercial damages or losses), even if such Contributor
171
+ has been advised of the possibility of such damages.
172
+
173
+ 9. Accepting Warranty or Additional Liability. While redistributing
174
+ the Work or Derivative Works thereof, You may choose to offer,
175
+ and charge a fee for, acceptance of support, warranty, indemnity,
176
+ or other liability obligations and/or rights consistent with this
177
+ License. However, in accepting such obligations, You may act only
178
+ on Your own behalf and on Your sole responsibility, not on behalf
179
+ of any other Contributor, and only if You agree to indemnify,
180
+ defend, and hold each Contributor harmless for any liability
181
+ incurred by, or claims asserted against, such Contributor by reason
182
+ of your accepting any such warranty or additional liability.
183
+
184
+ END OF TERMS AND CONDITIONS
185
+
186
+ APPENDIX: How to apply the Apache License to your work.
187
+
188
+ To apply the Apache License to your work, attach the following
189
+ boilerplate notice, with the fields enclosed by brackets "{}"
190
+ replaced with your own identifying information. (Don't include
191
+ the brackets!) The text should be enclosed in the appropriate
192
+ comment syntax for the file format. We also recommend that a
193
+ file or class name and description of purpose be included on the
194
+ same "printed page" as the copyright notice for easier
195
+ identification within third-party archives.
196
+
197
+ Copyright 2019 Ross Wightman
198
+
199
+ Licensed under the Apache License, Version 2.0 (the "License");
200
+ you may not use this file except in compliance with the License.
201
+ You may obtain a copy of the License at
202
+
203
+ http://www.apache.org/licenses/LICENSE-2.0
204
+
205
+ Unless required by applicable law or agreed to in writing, software
206
+ distributed under the License is distributed on an "AS IS" BASIS,
207
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
208
+ See the License for the specific language governing permissions and
209
+ limitations under the License.
210
+
211
+
212
+ ------------------------------------------------
213
+ DINOv2: Learning Robust Visual Features without Supervision
214
+ Github source: https://github.com/facebookresearch/dinov2
215
+
216
+
217
+
218
+ Apache License
219
+ Version 2.0, January 2004
220
+ http://www.apache.org/licenses/
221
+
222
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
223
+
224
+ 1. Definitions.
225
+
226
+ "License" shall mean the terms and conditions for use, reproduction,
227
+ and distribution as defined by Sections 1 through 9 of this document.
228
+
229
+ "Licensor" shall mean the copyright owner or entity authorized by
230
+ the copyright owner that is granting the License.
231
+
232
+ "Legal Entity" shall mean the union of the acting entity and all
233
+ other entities that control, are controlled by, or are under common
234
+ control with that entity. For the purposes of this definition,
235
+ "control" means (i) the power, direct or indirect, to cause the
236
+ direction or management of such entity, whether by contract or
237
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
238
+ outstanding shares, or (iii) beneficial ownership of such entity.
239
+
240
+ "You" (or "Your") shall mean an individual or Legal Entity
241
+ exercising permissions granted by this License.
242
+
243
+ "Source" form shall mean the preferred form for making modifications,
244
+ including but not limited to software source code, documentation
245
+ source, and configuration files.
246
+
247
+ "Object" form shall mean any form resulting from mechanical
248
+ transformation or translation of a Source form, including but
249
+ not limited to compiled object code, generated documentation,
250
+ and conversions to other media types.
251
+
252
+ "Work" shall mean the work of authorship, whether in Source or
253
+ Object form, made available under the License, as indicated by a
254
+ copyright notice that is included in or attached to the work
255
+ (an example is provided in the Appendix below).
256
+
257
+ "Derivative Works" shall mean any work, whether in Source or Object
258
+ form, that is based on (or derived from) the Work and for which the
259
+ editorial revisions, annotations, elaborations, or other modifications
260
+ represent, as a whole, an original work of authorship. For the purposes
261
+ of this License, Derivative Works shall not include works that remain
262
+ separable from, or merely link (or bind by name) to the interfaces of,
263
+ the Work and Derivative Works thereof.
264
+
265
+ "Contribution" shall mean any work of authorship, including
266
+ the original version of the Work and any modifications or additions
267
+ to that Work or Derivative Works thereof, that is intentionally
268
+ submitted to Licensor for inclusion in the Work by the copyright owner
269
+ or by an individual or Legal Entity authorized to submit on behalf of
270
+ the copyright owner. For the purposes of this definition, "submitted"
271
+ means any form of electronic, verbal, or written communication sent
272
+ to the Licensor or its representatives, including but not limited to
273
+ communication on electronic mailing lists, source code control systems,
274
+ and issue tracking systems that are managed by, or on behalf of, the
275
+ Licensor for the purpose of discussing and improving the Work, but
276
+ excluding communication that is conspicuously marked or otherwise
277
+ designated in writing by the copyright owner as "Not a Contribution."
278
+
279
+ "Contributor" shall mean Licensor and any individual or Legal Entity
280
+ on behalf of whom a Contribution has been received by Licensor and
281
+ subsequently incorporated within the Work.
282
+
283
+ 2. Grant of Copyright License. Subject to the terms and conditions of
284
+ this License, each Contributor hereby grants to You a perpetual,
285
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
286
+ copyright license to reproduce, prepare Derivative Works of,
287
+ publicly display, publicly perform, sublicense, and distribute the
288
+ Work and such Derivative Works in Source or Object form.
289
+
290
+ 3. Grant of Patent License. Subject to the terms and conditions of
291
+ this License, each Contributor hereby grants to You a perpetual,
292
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
293
+ (except as stated in this section) patent license to make, have made,
294
+ use, offer to sell, sell, import, and otherwise transfer the Work,
295
+ where such license applies only to those patent claims licensable
296
+ by such Contributor that are necessarily infringed by their
297
+ Contribution(s) alone or by combination of their Contribution(s)
298
+ with the Work to which such Contribution(s) was submitted. If You
299
+ institute patent litigation against any entity (including a
300
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
301
+ or a Contribution incorporated within the Work constitutes direct
302
+ or contributory patent infringement, then any patent licenses
303
+ granted to You under this License for that Work shall terminate
304
+ as of the date such litigation is filed.
305
+
306
+ 4. Redistribution. You may reproduce and distribute copies of the
307
+ Work or Derivative Works thereof in any medium, with or without
308
+ modifications, and in Source or Object form, provided that You
309
+ meet the following conditions:
310
+
311
+ (a) You must give any other recipients of the Work or
312
+ Derivative Works a copy of this License; and
313
+
314
+ (b) You must cause any modified files to carry prominent notices
315
+ stating that You changed the files; and
316
+
317
+ (c) You must retain, in the Source form of any Derivative Works
318
+ that You distribute, all copyright, patent, trademark, and
319
+ attribution notices from the Source form of the Work,
320
+ excluding those notices that do not pertain to any part of
321
+ the Derivative Works; and
322
+
323
+ (d) If the Work includes a "NOTICE" text file as part of its
324
+ distribution, then any Derivative Works that You distribute must
325
+ include a readable copy of the attribution notices contained
326
+ within such NOTICE file, excluding those notices that do not
327
+ pertain to any part of the Derivative Works, in at least one
328
+ of the following places: within a NOTICE text file distributed
329
+ as part of the Derivative Works; within the Source form or
330
+ documentation, if provided along with the Derivative Works; or,
331
+ within a display generated by the Derivative Works, if and
332
+ wherever such third-party notices normally appear. The contents
333
+ of the NOTICE file are for informational purposes only and
334
+ do not modify the License. You may add Your own attribution
335
+ notices within Derivative Works that You distribute, alongside
336
+ or as an addendum to the NOTICE text from the Work, provided
337
+ that such additional attribution notices cannot be construed
338
+ as modifying the License.
339
+
340
+ You may add Your own copyright statement to Your modifications and
341
+ may provide additional or different license terms and conditions
342
+ for use, reproduction, or distribution of Your modifications, or
343
+ for any such Derivative Works as a whole, provided Your use,
344
+ reproduction, and distribution of the Work otherwise complies with
345
+ the conditions stated in this License.
346
+
347
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
348
+ any Contribution intentionally submitted for inclusion in the Work
349
+ by You to the Licensor shall be under the terms and conditions of
350
+ this License, without any additional terms or conditions.
351
+ Notwithstanding the above, nothing herein shall supersede or modify
352
+ the terms of any separate license agreement you may have executed
353
+ with Licensor regarding such Contributions.
354
+
355
+ 6. Trademarks. This License does not grant permission to use the trade
356
+ names, trademarks, service marks, or product names of the Licensor,
357
+ except as required for reasonable and customary use in describing the
358
+ origin of the Work and reproducing the content of the NOTICE file.
359
+
360
+ 7. Disclaimer of Warranty. Unless required by applicable law or
361
+ agreed to in writing, Licensor provides the Work (and each
362
+ Contributor provides its Contributions) on an "AS IS" BASIS,
363
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
364
+ implied, including, without limitation, any warranties or conditions
365
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
366
+ PARTICULAR PURPOSE. You are solely responsible for determining the
367
+ appropriateness of using or redistributing the Work and assume any
368
+ risks associated with Your exercise of permissions under this License.
369
+
370
+ 8. Limitation of Liability. In no event and under no legal theory,
371
+ whether in tort (including negligence), contract, or otherwise,
372
+ unless required by applicable law (such as deliberate and grossly
373
+ negligent acts) or agreed to in writing, shall any Contributor be
374
+ liable to You for damages, including any direct, indirect, special,
375
+ incidental, or consequential damages of any character arising as a
376
+ result of this License or out of the use or inability to use the
377
+ Work (including but not limited to damages for loss of goodwill,
378
+ work stoppage, computer failure or malfunction, or any and all
379
+ other commercial damages or losses), even if such Contributor
380
+ has been advised of the possibility of such damages.
381
+
382
+ 9. Accepting Warranty or Additional Liability. While redistributing
383
+ the Work or Derivative Works thereof, You may choose to offer,
384
+ and charge a fee for, acceptance of support, warranty, indemnity,
385
+ or other liability obligations and/or rights consistent with this
386
+ License. However, in accepting such obligations, You may act only
387
+ on Your own behalf and on Your sole responsibility, not on behalf
388
+ of any other Contributor, and only if You agree to indemnify,
389
+ defend, and hold each Contributor harmless for any liability
390
+ incurred by, or claims asserted against, such Contributor by reason
391
+ of your accepting any such warranty or additional liability.
392
+
393
+ END OF TERMS AND CONDITIONS
394
+
395
+ APPENDIX: How to apply the Apache License to your work.
396
+
397
+ To apply the Apache License to your work, attach the following
398
+ boilerplate notice, with the fields enclosed by brackets "[]"
399
+ replaced with your own identifying information. (Don't include
400
+ the brackets!) The text should be enclosed in the appropriate
401
+ comment syntax for the file format. We also recommend that a
402
+ file or class name and description of purpose be included on the
403
+ same "printed page" as the copyright notice for easier
404
+ identification within third-party archives.
405
+
406
+ Copyright [yyyy] [name of copyright owner]
407
+
408
+ Licensed under the Apache License, Version 2.0 (the "License");
409
+ you may not use this file except in compliance with the License.
410
+ You may obtain a copy of the License at
411
+
412
+ http://www.apache.org/licenses/LICENSE-2.0
413
+
414
+ Unless required by applicable law or agreed to in writing, software
415
+ distributed under the License is distributed on an "AS IS" BASIS,
416
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
417
+ See the License for the specific language governing permissions and
418
+ limitations under the License.
CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,71 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, sex characteristics, gender identity and expression,
9
+ level of experience, education, socio-economic status, nationality, personal
10
+ appearance, race, religion, or sexual identity and orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies within all project spaces, and it also applies when
49
+ an individual is representing the project or its community in public spaces.
50
+ Examples of representing a project or community include using an official
51
+ project e-mail address, posting via an official social media account, or acting
52
+ as an appointed representative at an online or offline event. Representation of
53
+ a project may be further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the open source team at [[email protected]](mailto:[email protected]). All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 1.4,
71
+ available at [https://www.contributor-covenant.org/version/1/4/code-of-conduct.html](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html)
CONTRIBUTING.md ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Contribution Guide
2
+
3
+ Thanks for your interest in contributing. This project was released to accompany a research paper for purposes of reproducibility, and beyond its publication there are limited plans for future development of the repository.
4
+
5
+ While we welcome new pull requests and issues please note that our response may be limited. Forks and out-of-tree improvements are strongly encouraged.
6
+
7
+ ## Before you get started
8
+
9
+ By submitting a pull request, you represent that you have the right to license your contribution to Apple and the community, and agree by submitting the patch that your contributions are licensed under the [LICENSE](LICENSE).
10
+
11
+ We ask that all community members read and observe our [Code of Conduct](CODE_OF_CONDUCT.md).
LICENSE ADDED
@@ -0,0 +1,47 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Copyright (C) 2024 Apple Inc. All Rights Reserved.
2
+
3
+ Disclaimer: IMPORTANT: This Apple software is supplied to you by Apple
4
+ Inc. ("Apple") in consideration of your agreement to the following
5
+ terms, and your use, installation, modification or redistribution of
6
+ this Apple software constitutes acceptance of these terms. If you do
7
+ not agree with these terms, please do not use, install, modify or
8
+ redistribute this Apple software.
9
+
10
+ In consideration of your agreement to abide by the following terms, and
11
+ subject to these terms, Apple grants you a personal, non-exclusive
12
+ license, under Apple's copyrights in this original Apple software (the
13
+ "Apple Software"), to use, reproduce, modify and redistribute the Apple
14
+ Software, with or without modifications, in source and/or binary forms;
15
+ provided that if you redistribute the Apple Software in its entirety and
16
+ without modifications, you must retain this notice and the following
17
+ text and disclaimers in all such redistributions of the Apple Software.
18
+ Neither the name, trademarks, service marks or logos of Apple Inc. may
19
+ be used to endorse or promote products derived from the Apple Software
20
+ without specific prior written permission from Apple. Except as
21
+ expressly stated in this notice, no other rights or licenses, express or
22
+ implied, are granted by Apple herein, including but not limited to any
23
+ patent rights that may be infringed by your derivative works or by other
24
+ works in which the Apple Software may be incorporated.
25
+
26
+ The Apple Software is provided by Apple on an "AS IS" basis. APPLE
27
+ MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION
28
+ THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS
29
+ FOR A PARTICULAR PURPOSE, REGARDING THE APPLE SOFTWARE OR ITS USE AND
30
+ OPERATION ALONE OR IN COMBINATION WITH YOUR PRODUCTS.
31
+
32
+ IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL
33
+ OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
34
+ SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
35
+ INTERRUPTION) ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION,
36
+ MODIFICATION AND/OR DISTRIBUTION OF THE APPLE SOFTWARE, HOWEVER CAUSED
37
+ AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE),
38
+ STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE
39
+ POSSIBILITY OF SUCH DAMAGE.
40
+
41
+
42
+ -------------------------------------------------------------------------------
43
+ SOFTWARE DISTRIBUTED IN THIS REPOSITORY:
44
+
45
+ This software includes a number of subcomponents with separate
46
+ copyright notices and license terms - please see the file ACKNOWLEDGEMENTS.
47
+ -------------------------------------------------------------------------------
README.md CHANGED
@@ -1,12 +1,97 @@
1
- ---
2
- title: Depth Pro
3
- emoji: 🐠
4
- colorFrom: red
5
- colorTo: green
6
- sdk: gradio
7
- sdk_version: 4.44.1
8
- app_file: app.py
9
- pinned: false
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
2
+
3
+ This software project accompanies the research paper:
4
+ **[Depth Pro: Sharp Monocular Metric Depth in Less Than a Second](https://arxiv.org/abs/2410.02073)**,
5
+ *Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun*.
6
+
7
+ ![](data/depth-pro-teaser.jpg)
8
+
9
+ We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image.
10
+
11
+
12
+ The model in this repository is a reference implementation, which has been re-trained. Its performance is close to the model reported in the paper but does not match it exactly.
13
+
14
+ ## Getting Started
15
+
16
+ We recommend setting up a virtual environment. Using e.g. miniconda, the `depth_pro` package can be installed via:
17
+
18
+ ```bash
19
+ conda create -n depth-pro -y python=3.9
20
+ conda activate depth-pro
21
+
22
+ pip install -e .
23
+ ```
24
+
25
+ To download pretrained checkpoints follow the code snippet below:
26
+ ```bash
27
+ source get_pretrained_models.sh # Files will be downloaded to `checkpoints` directory.
28
+ ```
29
+
30
+ ### Running from commandline
31
+
32
+ We provide a helper script to directly run the model on a single image:
33
+ ```bash
34
+ # Run prediction on a single image:
35
+ depth-pro-run -i ./data/example.jpg
36
+ # Run `depth-pro-run -h` for available options.
37
+ ```
38
+
39
+ ### Running from python
40
+
41
+ ```python
42
+ from PIL import Image
43
+ import depth_pro
44
+
45
+ # Load model and preprocessing transform
46
+ model, transform = depth_pro.create_model_and_transforms()
47
+ model.eval()
48
+
49
+ # Load and preprocess an image.
50
+ image, _, f_px = depth_pro.load_rgb(image_path)
51
+ image = transform(image)
52
+
53
+ # Run inference.
54
+ prediction = model.infer(image, f_px=f_px)
55
+ depth = prediction["depth"] # Depth in [m].
56
+ focallength_px = prediction["focallength_px"] # Focal length in pixels.
57
+ ```
58
+
59
+
60
+ ### Evaluation (boundary metrics)
61
+
62
+ Our boundary metrics can be found under `eval/boundary_metrics.py` and used as follows:
63
+
64
+ ```python
65
+ # for a depth-based dataset
66
+ boundary_f1 = SI_boundary_F1(predicted_depth, target_depth)
67
+
68
+ # for a mask-based dataset (image matting / segmentation)
69
+ boundary_recall = SI_boundary_Recall(predicted_depth, target_mask)
70
+ ```
71
+
72
+
73
+ ## Citation
74
+
75
+ If you find our work useful, please cite the following paper:
76
+
77
+ ```bibtex
78
+ @article{Bochkovskii2024:arxiv,
79
+ author = {Aleksei Bochkovskii and Ama\"{e}l Delaunoy and Hugo Germain and Marcel Santos and
80
+ Yichao Zhou and Stephan R. Richter and Vladlen Koltun}
81
+ title = {Depth Pro: Sharp Monocular Metric Depth in Less Than a Second},
82
+ journal = {arXiv},
83
+ year = {2024},
84
+ url = {https://arxiv.org/abs/2410.02073},
85
+ }
86
+ ```
87
+
88
+ ## License
89
+ This sample code is released under the [LICENSE](LICENSE) terms.
90
+
91
+ The model weights are released under the [LICENSE](LICENSE) terms.
92
+
93
+ ## Acknowledgements
94
+
95
+ Our codebase is built using multiple opensource contributions, please see [Acknowledgements](ACKNOWLEDGEMENTS.md) for more details.
96
+
97
+ Please check the paper for a complete list of references and datasets used in this work.
data/depth-pro-teaser.jpg ADDED
data/example.jpg ADDED

Git LFS Details

  • SHA256: 7b07a583db12943bc90c2429afeca0aca63450e8eb7a2f29314ffbb1acd8d710
  • Pointer size: 132 Bytes
  • Size of remote file: 2.33 MB
get_pretrained_models.sh ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env bash
2
+ #
3
+ # For licensing see accompanying LICENSE file.
4
+ # Copyright (C) 2024 Apple Inc. All Rights Reserved.
5
+ #
6
+ mkdir -p checkpoints
7
+ # Place final weights here:
8
+ wget https://ml-site.cdn-apple.com/models/depth-pro/depth_pro.pt -P checkpoints
pyproject.toml ADDED
@@ -0,0 +1,59 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [project]
2
+ name = "depth_pro"
3
+ version = "0.1"
4
+ description = "Inference/Network/Model code for Apple Depth Pro monocular depth estimation."
5
+ readme = "README.md"
6
+ dependencies = [
7
+ "torch",
8
+ "torchvision",
9
+ "timm",
10
+ "numpy<2",
11
+ "pillow_heif",
12
+ "matplotlib",
13
+ ]
14
+
15
+ [project.scripts]
16
+ depth-pro-run = "depth_pro.cli:run_main"
17
+
18
+ [project.urls]
19
+ Homepage = "https://github.com/apple/ml-depth-pro"
20
+ Repository = "https://github.com/apple/ml-depth-pro"
21
+
22
+ [build-system]
23
+ requires = ["setuptools", "setuptools-scm"]
24
+ build-backend = "setuptools.build_meta"
25
+
26
+ [tool.setuptools.packages.find]
27
+ where = ["src"]
28
+
29
+ [tool.pyright]
30
+ include = ["src"]
31
+ exclude = [
32
+ "**/node_modules",
33
+ "**/__pycache__",
34
+ ]
35
+ pythonVersion = "3.9"
36
+
37
+ [tool.pytest.ini_options]
38
+ minversion = "6.0"
39
+ addopts = "-ra -q"
40
+ testpaths = [
41
+ "tests"
42
+ ]
43
+ filterwarnings = [
44
+ "ignore::DeprecationWarning"
45
+ ]
46
+
47
+ [tool.lint.per-file-ignores]
48
+ "__init__.py" = ["F401", "D100", "D104"]
49
+
50
+ [tool.ruff]
51
+ line-length = 100
52
+ lint.select = ["E", "F", "D", "I"]
53
+ lint.ignore = ["D100", "D105"]
54
+ extend-exclude = [
55
+ "*external*",
56
+ "third_party",
57
+ ]
58
+ src = ["depth_pro", "tests"]
59
+ target-version = "py39"
src/depth_pro/__init__.py ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ # Copyright (C) 2024 Apple Inc. All Rights Reserved.
2
+ """Depth Pro package."""
3
+
4
+ from .depth_pro import create_model_and_transforms # noqa
5
+ from .utils import load_rgb # noqa
src/depth_pro/cli/__init__.py ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ # Copyright (C) 2024 Apple Inc. All Rights Reserved.
2
+ """Depth Pro CLI and tools."""
3
+
4
+ from .run import main as run_main # noqa
src/depth_pro/cli/run.py ADDED
@@ -0,0 +1,149 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """Sample script to run DepthPro.
3
+
4
+ Copyright (C) 2024 Apple Inc. All Rights Reserved.
5
+ """
6
+
7
+
8
+ import argparse
9
+ import logging
10
+ from pathlib import Path
11
+
12
+ import numpy as np
13
+ import PIL.Image
14
+ import torch
15
+ from matplotlib import pyplot as plt
16
+ from tqdm import tqdm
17
+
18
+ from depth_pro import create_model_and_transforms, load_rgb
19
+
20
+ LOGGER = logging.getLogger(__name__)
21
+
22
+
23
+ def get_torch_device() -> torch.device:
24
+ """Get the Torch device."""
25
+ device = torch.device("cpu")
26
+ if torch.cuda.is_available():
27
+ device = torch.device("cuda:0")
28
+ elif torch.backends.mps.is_available():
29
+ device = torch.device("mps")
30
+ return device
31
+
32
+
33
+ def run(args):
34
+ """Run Depth Pro on a sample image."""
35
+ if args.verbose:
36
+ logging.basicConfig(level=logging.INFO)
37
+
38
+ # Load model.
39
+ model, transform = create_model_and_transforms(
40
+ device=get_torch_device(),
41
+ precision=torch.half,
42
+ )
43
+ model.eval()
44
+
45
+ image_paths = [args.image_path]
46
+ if args.image_path.is_dir():
47
+ image_paths = args.image_path.glob("**/*")
48
+ relative_path = args.image_path
49
+ else:
50
+ relative_path = args.image_path.parent
51
+
52
+ if not args.skip_display:
53
+ plt.ion()
54
+ fig = plt.figure()
55
+ ax_rgb = fig.add_subplot(121)
56
+ ax_disp = fig.add_subplot(122)
57
+
58
+ for image_path in tqdm(image_paths):
59
+ # Load image and focal length from exif info (if found.).
60
+ try:
61
+ LOGGER.info(f"Loading image {image_path} ...")
62
+ image, _, f_px = load_rgb(image_path)
63
+ except Exception as e:
64
+ LOGGER.error(str(e))
65
+ continue
66
+ # Run prediction. If `f_px` is provided, it is used to estimate the final metric depth,
67
+ # otherwise the model estimates `f_px` to compute the depth metricness.
68
+ prediction = model.infer(transform(image), f_px=f_px)
69
+
70
+ # Extract the depth and focal length.
71
+ depth = prediction["depth"].detach().cpu().numpy().squeeze()
72
+ if f_px is not None:
73
+ LOGGER.debug(f"Focal length (from exif): {f_px:0.2f}")
74
+ elif prediction["focallength_px"] is not None:
75
+ focallength_px = prediction["focallength_px"].detach().cpu().item()
76
+ LOGGER.info(f"Estimated focal length: {focallength_px}")
77
+
78
+ # Save Depth as npz file.
79
+ if args.output_path is not None:
80
+ output_file = (
81
+ args.output_path
82
+ / image_path.relative_to(relative_path).parent
83
+ / image_path.stem
84
+ )
85
+ LOGGER.info(f"Saving depth map to: {str(output_file)}")
86
+ output_file.parent.mkdir(parents=True, exist_ok=True)
87
+ np.savez_compressed(output_file, depth=depth)
88
+
89
+ # Save as color-mapped "turbo" jpg image.
90
+ cmap = plt.get_cmap("turbo_r")
91
+ normalized_depth = (depth - depth.min()) / (
92
+ depth.max() - depth.min()
93
+ )
94
+ color_depth = (cmap(normalized_depth)[..., :3] * 255).astype(
95
+ np.uint8
96
+ )
97
+ color_map_output_file = str(output_file) + ".jpg"
98
+ LOGGER.info(f"Saving color-mapped depth to: : {color_map_output_file}")
99
+ PIL.Image.fromarray(color_depth).save(
100
+ color_map_output_file, format="JPEG", quality=90
101
+ )
102
+
103
+ # Display the image and estimated depth map.
104
+ if not args.skip_display:
105
+ ax_rgb.imshow(image)
106
+ ax_disp.imshow(depth, cmap="turbo_r")
107
+ fig.canvas.draw()
108
+ fig.canvas.flush_events()
109
+
110
+ LOGGER.info("Done predicting depth!")
111
+ if not args.skip_display:
112
+ plt.show(block=True)
113
+
114
+
115
+ def main():
116
+ """Run DepthPro inference example."""
117
+ parser = argparse.ArgumentParser(
118
+ description="Inference scripts of DepthPro with PyTorch models."
119
+ )
120
+ parser.add_argument(
121
+ "-i",
122
+ "--image-path",
123
+ type=Path,
124
+ default="./data/example.jpg",
125
+ help="Path to input image.",
126
+ )
127
+ parser.add_argument(
128
+ "-o",
129
+ "--output-path",
130
+ type=Path,
131
+ help="Path to store output files.",
132
+ )
133
+ parser.add_argument(
134
+ "--skip-display",
135
+ action="store_true",
136
+ help="Skip matplotlib display.",
137
+ )
138
+ parser.add_argument(
139
+ "-v",
140
+ "--verbose",
141
+ action="store_true",
142
+ help="Show verbose output."
143
+ )
144
+
145
+ run(parser.parse_args())
146
+
147
+
148
+ if __name__ == "__main__":
149
+ main()
src/depth_pro/depth_pro.py ADDED
@@ -0,0 +1,298 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (C) 2024 Apple Inc. All Rights Reserved.
2
+ # Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
3
+
4
+
5
+ from __future__ import annotations
6
+
7
+ from dataclasses import dataclass
8
+ from typing import Mapping, Optional, Tuple, Union
9
+
10
+ import torch
11
+ from torch import nn
12
+ from torchvision.transforms import (
13
+ Compose,
14
+ ConvertImageDtype,
15
+ Lambda,
16
+ Normalize,
17
+ ToTensor,
18
+ )
19
+
20
+ from .network.decoder import MultiresConvDecoder
21
+ from .network.encoder import DepthProEncoder
22
+ from .network.fov import FOVNetwork
23
+ from .network.vit_factory import VIT_CONFIG_DICT, ViTPreset, create_vit
24
+
25
+
26
+ @dataclass
27
+ class DepthProConfig:
28
+ """Configuration for DepthPro."""
29
+
30
+ patch_encoder_preset: ViTPreset
31
+ image_encoder_preset: ViTPreset
32
+ decoder_features: int
33
+
34
+ checkpoint_uri: Optional[str] = None
35
+ fov_encoder_preset: Optional[ViTPreset] = None
36
+ use_fov_head: bool = True
37
+
38
+
39
+ DEFAULT_MONODEPTH_CONFIG_DICT = DepthProConfig(
40
+ patch_encoder_preset="dinov2l16_384",
41
+ image_encoder_preset="dinov2l16_384",
42
+ checkpoint_uri="./checkpoints/depth_pro.pt",
43
+ decoder_features=256,
44
+ use_fov_head=True,
45
+ fov_encoder_preset="dinov2l16_384",
46
+ )
47
+
48
+
49
+ def create_backbone_model(
50
+ preset: ViTPreset
51
+ ) -> Tuple[nn.Module, ViTPreset]:
52
+ """Create and load a backbone model given a config.
53
+
54
+ Args:
55
+ ----
56
+ preset: A backbone preset to load pre-defind configs.
57
+
58
+ Returns:
59
+ -------
60
+ A Torch module and the associated config.
61
+
62
+ """
63
+ if preset in VIT_CONFIG_DICT:
64
+ config = VIT_CONFIG_DICT[preset]
65
+ model = create_vit(preset=preset, use_pretrained=False)
66
+ else:
67
+ raise KeyError(f"Preset {preset} not found.")
68
+
69
+ return model, config
70
+
71
+
72
+ def create_model_and_transforms(
73
+ config: DepthProConfig = DEFAULT_MONODEPTH_CONFIG_DICT,
74
+ device: torch.device = torch.device("cpu"),
75
+ precision: torch.dtype = torch.float32,
76
+ ) -> Tuple[DepthPro, Compose]:
77
+ """Create a DepthPro model and load weights from `config.checkpoint_uri`.
78
+
79
+ Args:
80
+ ----
81
+ config: The configuration for the DPT model architecture.
82
+ device: The optional Torch device to load the model onto, default runs on "cpu".
83
+ precision: The optional precision used for the model, default is FP32.
84
+
85
+ Returns:
86
+ -------
87
+ The Torch DepthPro model and associated Transform.
88
+
89
+ """
90
+ patch_encoder, patch_encoder_config = create_backbone_model(
91
+ preset=config.patch_encoder_preset
92
+ )
93
+ image_encoder, _ = create_backbone_model(
94
+ preset=config.image_encoder_preset
95
+ )
96
+
97
+ fov_encoder = None
98
+ if config.use_fov_head and config.fov_encoder_preset is not None:
99
+ fov_encoder, _ = create_backbone_model(preset=config.fov_encoder_preset)
100
+
101
+ dims_encoder = patch_encoder_config.encoder_feature_dims
102
+ hook_block_ids = patch_encoder_config.encoder_feature_layer_ids
103
+ encoder = DepthProEncoder(
104
+ dims_encoder=dims_encoder,
105
+ patch_encoder=patch_encoder,
106
+ image_encoder=image_encoder,
107
+ hook_block_ids=hook_block_ids,
108
+ decoder_features=config.decoder_features,
109
+ )
110
+ decoder = MultiresConvDecoder(
111
+ dims_encoder=[config.decoder_features] + list(encoder.dims_encoder),
112
+ dim_decoder=config.decoder_features,
113
+ )
114
+ model = DepthPro(
115
+ encoder=encoder,
116
+ decoder=decoder,
117
+ last_dims=(32, 1),
118
+ use_fov_head=config.use_fov_head,
119
+ fov_encoder=fov_encoder,
120
+ ).to(device)
121
+
122
+ if precision == torch.half:
123
+ model.half()
124
+
125
+ transform = Compose(
126
+ [
127
+ ToTensor(),
128
+ Lambda(lambda x: x.to(device)),
129
+ Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
130
+ ConvertImageDtype(precision),
131
+ ]
132
+ )
133
+
134
+ if config.checkpoint_uri is not None:
135
+ state_dict = torch.load(config.checkpoint_uri, map_location="cpu")
136
+ missing_keys, unexpected_keys = model.load_state_dict(
137
+ state_dict=state_dict, strict=True
138
+ )
139
+
140
+ if len(unexpected_keys) != 0:
141
+ raise KeyError(
142
+ f"Found unexpected keys when loading monodepth: {unexpected_keys}"
143
+ )
144
+
145
+ # fc_norm is only for the classification head,
146
+ # which we would not use. We only use the encoding.
147
+ missing_keys = [key for key in missing_keys if "fc_norm" not in key]
148
+ if len(missing_keys) != 0:
149
+ raise KeyError(f"Keys are missing when loading monodepth: {missing_keys}")
150
+
151
+ return model, transform
152
+
153
+
154
+ class DepthPro(nn.Module):
155
+ """DepthPro network."""
156
+
157
+ def __init__(
158
+ self,
159
+ encoder: DepthProEncoder,
160
+ decoder: MultiresConvDecoder,
161
+ last_dims: tuple[int, int],
162
+ use_fov_head: bool = True,
163
+ fov_encoder: Optional[nn.Module] = None,
164
+ ):
165
+ """Initialize DepthPro.
166
+
167
+ Args:
168
+ ----
169
+ encoder: The DepthProEncoder backbone.
170
+ decoder: The MultiresConvDecoder decoder.
171
+ last_dims: The dimension for the last convolution layers.
172
+ use_fov_head: Whether to use the field-of-view head.
173
+ fov_encoder: A separate encoder for the field of view.
174
+
175
+ """
176
+ super().__init__()
177
+
178
+ self.encoder = encoder
179
+ self.decoder = decoder
180
+
181
+ dim_decoder = decoder.dim_decoder
182
+ self.head = nn.Sequential(
183
+ nn.Conv2d(
184
+ dim_decoder, dim_decoder // 2, kernel_size=3, stride=1, padding=1
185
+ ),
186
+ nn.ConvTranspose2d(
187
+ in_channels=dim_decoder // 2,
188
+ out_channels=dim_decoder // 2,
189
+ kernel_size=2,
190
+ stride=2,
191
+ padding=0,
192
+ bias=True,
193
+ ),
194
+ nn.Conv2d(
195
+ dim_decoder // 2,
196
+ last_dims[0],
197
+ kernel_size=3,
198
+ stride=1,
199
+ padding=1,
200
+ ),
201
+ nn.ReLU(True),
202
+ nn.Conv2d(last_dims[0], last_dims[1], kernel_size=1, stride=1, padding=0),
203
+ nn.ReLU(),
204
+ )
205
+
206
+ # Set the final convoultion layer's bias to be 0.
207
+ self.head[4].bias.data.fill_(0)
208
+
209
+ # Set the FOV estimation head.
210
+ if use_fov_head:
211
+ self.fov = FOVNetwork(num_features=dim_decoder, fov_encoder=fov_encoder)
212
+
213
+ @property
214
+ def img_size(self) -> int:
215
+ """Return the internal image size of the network."""
216
+ return self.encoder.img_size
217
+
218
+ def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
219
+ """Decode by projection and fusion of multi-resolution encodings.
220
+
221
+ Args:
222
+ ----
223
+ x (torch.Tensor): Input image.
224
+
225
+ Returns:
226
+ -------
227
+ The canonical inverse depth map [m] and the optional estimated field of view [deg].
228
+
229
+ """
230
+ _, _, H, W = x.shape
231
+ assert H == self.img_size and W == self.img_size
232
+
233
+ encodings = self.encoder(x)
234
+ features, features_0 = self.decoder(encodings)
235
+ canonical_inverse_depth = self.head(features)
236
+
237
+ fov_deg = None
238
+ if hasattr(self, "fov"):
239
+ fov_deg = self.fov.forward(x, features_0.detach())
240
+
241
+ return canonical_inverse_depth, fov_deg
242
+
243
+ @torch.no_grad()
244
+ def infer(
245
+ self,
246
+ x: torch.Tensor,
247
+ f_px: Optional[Union[float, torch.Tensor]] = None,
248
+ interpolation_mode="bilinear",
249
+ ) -> Mapping[str, torch.Tensor]:
250
+ """Infer depth and fov for a given image.
251
+
252
+ If the image is not at network resolution, it is resized to 1536x1536 and
253
+ the estimated depth is resized to the original image resolution.
254
+ Note: if the focal length is given, the estimated value is ignored and the provided
255
+ focal length is use to generate the metric depth values.
256
+
257
+ Args:
258
+ ----
259
+ x (torch.Tensor): Input image
260
+ f_px (torch.Tensor): Optional focal length in pixels corresponding to `x`.
261
+ interpolation_mode (str): Interpolation function for downsampling/upsampling.
262
+
263
+ Returns:
264
+ -------
265
+ Tensor dictionary (torch.Tensor): depth [m], focallength [pixels].
266
+
267
+ """
268
+ if len(x.shape) == 3:
269
+ x = x.unsqueeze(0)
270
+ _, _, H, W = x.shape
271
+ resize = H != self.img_size or W != self.img_size
272
+
273
+ if resize:
274
+ x = nn.functional.interpolate(
275
+ x,
276
+ size=(self.img_size, self.img_size),
277
+ mode=interpolation_mode,
278
+ align_corners=False,
279
+ )
280
+
281
+ canonical_inverse_depth, fov_deg = self.forward(x)
282
+ if f_px is None:
283
+ f_px = 0.5 * W / torch.tan(0.5 * torch.deg2rad(fov_deg.to(torch.float)))
284
+
285
+ inverse_depth = canonical_inverse_depth * (W / f_px)
286
+ f_px = f_px.squeeze()
287
+
288
+ if resize:
289
+ inverse_depth = nn.functional.interpolate(
290
+ inverse_depth, size=(H, W), mode=interpolation_mode, align_corners=False
291
+ )
292
+
293
+ depth = 1.0 / torch.clamp(inverse_depth, min=1e-4, max=1e4)
294
+
295
+ return {
296
+ "depth": depth.squeeze(),
297
+ "focallength_px": f_px,
298
+ }
src/depth_pro/eval/boundary_metrics.py ADDED
@@ -0,0 +1,332 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import List, Tuple
2
+
3
+ import numpy as np
4
+
5
+
6
+ def connected_component(r: np.ndarray, c: np.ndarray) -> List[List[int]]:
7
+ """Find connected components in the given row and column indices.
8
+
9
+ Args:
10
+ ----
11
+ r (np.ndarray): Row indices.
12
+ c (np.ndarray): Column indices.
13
+
14
+ Yields:
15
+ ------
16
+ List[int]: Indices of connected components.
17
+
18
+ """
19
+ indices = [0]
20
+ for i in range(1, r.size):
21
+ if r[i] == r[indices[-1]] and c[i] == c[indices[-1]] + 1:
22
+ indices.append(i)
23
+ else:
24
+ yield indices
25
+ indices = [i]
26
+ yield indices
27
+
28
+
29
+ def nms_horizontal(ratio: np.ndarray, threshold: float) -> np.ndarray:
30
+ """Apply Non-Maximum Suppression (NMS) horizontally on the given ratio matrix.
31
+
32
+ Args:
33
+ ----
34
+ ratio (np.ndarray): Input ratio matrix.
35
+ threshold (float): Threshold for NMS.
36
+
37
+ Returns:
38
+ -------
39
+ np.ndarray: Binary mask after applying NMS.
40
+
41
+ """
42
+ mask = np.zeros_like(ratio, dtype=bool)
43
+ r, c = np.nonzero(ratio > threshold)
44
+ if len(r) == 0:
45
+ return mask
46
+ for ids in connected_component(r, c):
47
+ values = [ratio[r[i], c[i]] for i in ids]
48
+ mi = np.argmax(values)
49
+ mask[r[ids[mi]], c[ids[mi]]] = True
50
+ return mask
51
+
52
+
53
+ def nms_vertical(ratio: np.ndarray, threshold: float) -> np.ndarray:
54
+ """Apply Non-Maximum Suppression (NMS) vertically on the given ratio matrix.
55
+
56
+ Args:
57
+ ----
58
+ ratio (np.ndarray): Input ratio matrix.
59
+ threshold (float): Threshold for NMS.
60
+
61
+ Returns:
62
+ -------
63
+ np.ndarray: Binary mask after applying NMS.
64
+
65
+ """
66
+ return np.transpose(nms_horizontal(np.transpose(ratio), threshold))
67
+
68
+
69
+ def fgbg_depth(
70
+ d: np.ndarray, t: float
71
+ ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
72
+ """Find foreground-background relations between neighboring pixels.
73
+
74
+ Args:
75
+ ----
76
+ d (np.ndarray): Depth matrix.
77
+ t (float): Threshold for comparison.
78
+
79
+ Returns:
80
+ -------
81
+ Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]: Four matrices indicating
82
+ left, top, right, and bottom foreground-background relations.
83
+
84
+ """
85
+ right_is_big_enough = (d[..., :, 1:] / d[..., :, :-1]) > t
86
+ left_is_big_enough = (d[..., :, :-1] / d[..., :, 1:]) > t
87
+ bottom_is_big_enough = (d[..., 1:, :] / d[..., :-1, :]) > t
88
+ top_is_big_enough = (d[..., :-1, :] / d[..., 1:, :]) > t
89
+ return (
90
+ left_is_big_enough,
91
+ top_is_big_enough,
92
+ right_is_big_enough,
93
+ bottom_is_big_enough,
94
+ )
95
+
96
+
97
+ def fgbg_depth_thinned(
98
+ d: np.ndarray, t: float
99
+ ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
100
+ """Find foreground-background relations between neighboring pixels with Non-Maximum Suppression.
101
+
102
+ Args:
103
+ ----
104
+ d (np.ndarray): Depth matrix.
105
+ t (float): Threshold for NMS.
106
+
107
+ Returns:
108
+ -------
109
+ Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]: Four matrices indicating
110
+ left, top, right, and bottom foreground-background relations with NMS applied.
111
+
112
+ """
113
+ right_is_big_enough = nms_horizontal(d[..., :, 1:] / d[..., :, :-1], t)
114
+ left_is_big_enough = nms_horizontal(d[..., :, :-1] / d[..., :, 1:], t)
115
+ bottom_is_big_enough = nms_vertical(d[..., 1:, :] / d[..., :-1, :], t)
116
+ top_is_big_enough = nms_vertical(d[..., :-1, :] / d[..., 1:, :], t)
117
+ return (
118
+ left_is_big_enough,
119
+ top_is_big_enough,
120
+ right_is_big_enough,
121
+ bottom_is_big_enough,
122
+ )
123
+
124
+
125
+ def fgbg_binary_mask(
126
+ d: np.ndarray,
127
+ ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
128
+ """Find foreground-background relations between neighboring pixels in binary masks.
129
+
130
+ Args:
131
+ ----
132
+ d (np.ndarray): Binary depth matrix.
133
+
134
+ Returns:
135
+ -------
136
+ Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]: Four matrices indicating
137
+ left, top, right, and bottom foreground-background relations in binary masks.
138
+
139
+ """
140
+ assert d.dtype == bool
141
+ right_is_big_enough = d[..., :, 1:] & ~d[..., :, :-1]
142
+ left_is_big_enough = d[..., :, :-1] & ~d[..., :, 1:]
143
+ bottom_is_big_enough = d[..., 1:, :] & ~d[..., :-1, :]
144
+ top_is_big_enough = d[..., :-1, :] & ~d[..., 1:, :]
145
+ return (
146
+ left_is_big_enough,
147
+ top_is_big_enough,
148
+ right_is_big_enough,
149
+ bottom_is_big_enough,
150
+ )
151
+
152
+
153
+ def edge_recall_matting(pr: np.ndarray, gt: np.ndarray, t: float) -> float:
154
+ """Calculate edge recall for image matting.
155
+
156
+ Args:
157
+ ----
158
+ pr (np.ndarray): Predicted depth matrix.
159
+ gt (np.ndarray): Ground truth binary mask.
160
+ t (float): Threshold for NMS.
161
+
162
+ Returns:
163
+ -------
164
+ float: Edge recall value.
165
+
166
+ """
167
+ assert gt.dtype == bool
168
+ ap, bp, cp, dp = fgbg_depth_thinned(pr, t)
169
+ ag, bg, cg, dg = fgbg_binary_mask(gt)
170
+ return 0.25 * (
171
+ np.count_nonzero(ap & ag) / max(np.count_nonzero(ag), 1)
172
+ + np.count_nonzero(bp & bg) / max(np.count_nonzero(bg), 1)
173
+ + np.count_nonzero(cp & cg) / max(np.count_nonzero(cg), 1)
174
+ + np.count_nonzero(dp & dg) / max(np.count_nonzero(dg), 1)
175
+ )
176
+
177
+
178
+ def boundary_f1(
179
+ pr: np.ndarray,
180
+ gt: np.ndarray,
181
+ t: float,
182
+ return_p: bool = False,
183
+ return_r: bool = False,
184
+ ) -> float:
185
+ """Calculate Boundary F1 score.
186
+
187
+ Args:
188
+ ----
189
+ pr (np.ndarray): Predicted depth matrix.
190
+ gt (np.ndarray): Ground truth depth matrix.
191
+ t (float): Threshold for comparison.
192
+ return_p (bool, optional): If True, return precision. Defaults to False.
193
+ return_r (bool, optional): If True, return recall. Defaults to False.
194
+
195
+ Returns:
196
+ -------
197
+ float: Boundary F1 score, or precision, or recall depending on the flags.
198
+
199
+ """
200
+ ap, bp, cp, dp = fgbg_depth(pr, t)
201
+ ag, bg, cg, dg = fgbg_depth(gt, t)
202
+
203
+ r = 0.25 * (
204
+ np.count_nonzero(ap & ag) / max(np.count_nonzero(ag), 1)
205
+ + np.count_nonzero(bp & bg) / max(np.count_nonzero(bg), 1)
206
+ + np.count_nonzero(cp & cg) / max(np.count_nonzero(cg), 1)
207
+ + np.count_nonzero(dp & dg) / max(np.count_nonzero(dg), 1)
208
+ )
209
+ p = 0.25 * (
210
+ np.count_nonzero(ap & ag) / max(np.count_nonzero(ap), 1)
211
+ + np.count_nonzero(bp & bg) / max(np.count_nonzero(bp), 1)
212
+ + np.count_nonzero(cp & cg) / max(np.count_nonzero(cp), 1)
213
+ + np.count_nonzero(dp & dg) / max(np.count_nonzero(dp), 1)
214
+ )
215
+ if r + p == 0:
216
+ return 0.0
217
+ if return_p:
218
+ return p
219
+ if return_r:
220
+ return r
221
+ return 2 * (r * p) / (r + p)
222
+
223
+
224
+ def get_thresholds_and_weights(
225
+ t_min: float, t_max: float, N: int
226
+ ) -> Tuple[np.ndarray, np.ndarray]:
227
+ """Generate thresholds and weights for the given range.
228
+
229
+ Args:
230
+ ----
231
+ t_min (float): Minimum threshold.
232
+ t_max (float): Maximum threshold.
233
+ N (int): Number of thresholds.
234
+
235
+ Returns:
236
+ -------
237
+ Tuple[np.ndarray, np.ndarray]: Array of thresholds and corresponding weights.
238
+
239
+ """
240
+ thresholds = np.linspace(t_min, t_max, N)
241
+ weights = thresholds / thresholds.sum()
242
+ return thresholds, weights
243
+
244
+
245
+ def invert_depth(depth: np.ndarray, eps: float = 1e-6) -> np.ndarray:
246
+ """Inverts a depth map with numerical stability.
247
+
248
+ Args:
249
+ ----
250
+ depth (np.ndarray): Depth map to be inverted.
251
+ eps (float): Minimum value to avoid division by zero (default is 1e-6).
252
+
253
+ Returns:
254
+ -------
255
+ np.ndarray: Inverted depth map.
256
+
257
+ """
258
+ inverse_depth = 1.0 / depth.clip(min=eps)
259
+ return inverse_depth
260
+
261
+
262
+ def SI_boundary_F1(
263
+ predicted_depth: np.ndarray,
264
+ target_depth: np.ndarray,
265
+ t_min: float = 1.05,
266
+ t_max: float = 1.25,
267
+ N: int = 10,
268
+ ) -> float:
269
+ """Calculate Scale-Invariant Boundary F1 Score for depth-based ground-truth.
270
+
271
+ Args:
272
+ ----
273
+ predicted_depth (np.ndarray): Predicted depth matrix.
274
+ target_depth (np.ndarray): Ground truth depth matrix.
275
+ t_min (float, optional): Minimum threshold. Defaults to 1.05.
276
+ t_max (float, optional): Maximum threshold. Defaults to 1.25.
277
+ N (int, optional): Number of thresholds. Defaults to 10.
278
+
279
+ Returns:
280
+ -------
281
+ float: Scale-Invariant Boundary F1 Score.
282
+
283
+ """
284
+ assert predicted_depth.ndim == target_depth.ndim == 2
285
+ thresholds, weights = get_thresholds_and_weights(t_min, t_max, N)
286
+ f1_scores = np.array(
287
+ [
288
+ boundary_f1(invert_depth(predicted_depth), invert_depth(target_depth), t)
289
+ for t in thresholds
290
+ ]
291
+ )
292
+ return np.sum(f1_scores * weights)
293
+
294
+
295
+ def SI_boundary_Recall(
296
+ predicted_depth: np.ndarray,
297
+ target_mask: np.ndarray,
298
+ t_min: float = 1.05,
299
+ t_max: float = 1.25,
300
+ N: int = 10,
301
+ alpha_threshold: float = 0.1,
302
+ ) -> float:
303
+ """Calculate Scale-Invariant Boundary Recall Score for mask-based ground-truth.
304
+
305
+ Args:
306
+ ----
307
+ predicted_depth (np.ndarray): Predicted depth matrix.
308
+ target_mask (np.ndarray): Ground truth binary mask.
309
+ t_min (float, optional): Minimum threshold. Defaults to 1.05.
310
+ t_max (float, optional): Maximum threshold. Defaults to 1.25.
311
+ N (int, optional): Number of thresholds. Defaults to 10.
312
+ alpha_threshold (float, optional): Threshold for alpha masking. Defaults to 0.1.
313
+
314
+ Returns:
315
+ -------
316
+ float: Scale-Invariant Boundary Recall Score.
317
+
318
+ """
319
+ assert predicted_depth.ndim == target_mask.ndim == 2
320
+ thresholds, weights = get_thresholds_and_weights(t_min, t_max, N)
321
+ thresholded_target = target_mask > alpha_threshold
322
+
323
+ recall_scores = np.array(
324
+ [
325
+ edge_recall_matting(
326
+ invert_depth(predicted_depth), thresholded_target, t=float(t)
327
+ )
328
+ for t in thresholds
329
+ ]
330
+ )
331
+ weighted_recall = np.sum(recall_scores * weights)
332
+ return weighted_recall
src/depth_pro/eval/dis5k_sample_list.txt ADDED
@@ -0,0 +1,200 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ DIS5K/DIS-TE1/im/12#Graphics#4#TrafficSign#8245751856_821be14f86_o.jpg
2
+ DIS5K/DIS-TE1/im/13#Insect#4#Butterfly#16023994688_7ff8cdccb1_o.jpg
3
+ DIS5K/DIS-TE1/im/14#Kitchenware#4#Kitchenware#IMG_20210520_205538.jpg
4
+ DIS5K/DIS-TE1/im/14#Kitchenware#8#SweetStand#4848284981_fc90f54b50_o.jpg
5
+ DIS5K/DIS-TE1/im/17#Non-motor Vehicle#4#Cart#15012855035_d10b57014f_o.jpg
6
+ DIS5K/DIS-TE1/im/2#Aircraft#5#Kite#13104545564_5afceec9bd_o.jpg
7
+ DIS5K/DIS-TE1/im/20#Sports#10#Skateboarding#8472763540_bb2390e928_o.jpg
8
+ DIS5K/DIS-TE1/im/21#Tool#14#Sword#32473146960_dcc6b77848_o.jpg
9
+ DIS5K/DIS-TE1/im/21#Tool#15#Tapeline#9680492386_2d2020f282_o.jpg
10
+ DIS5K/DIS-TE1/im/21#Tool#4#Flag#507752845_ef852100f0_o.jpg
11
+ DIS5K/DIS-TE1/im/21#Tool#6#Key#11966089533_3becd78b44_o.jpg
12
+ DIS5K/DIS-TE1/im/21#Tool#8#Scale#31946428472_d28def471b_o.jpg
13
+ DIS5K/DIS-TE1/im/22#Weapon#4#Rifle#8472656430_3eb908b211_o.jpg
14
+ DIS5K/DIS-TE1/im/8#Electronics#3#Earphone#1177468301_641df8c267_o.jpg
15
+ DIS5K/DIS-TE1/im/8#Electronics#9#MusicPlayer#2235782872_7d47847bb4_o.jpg
16
+ DIS5K/DIS-TE2/im/11#Furniture#13#Ladder#3878434417_2ed740586e_o.jpg
17
+ DIS5K/DIS-TE2/im/13#Insect#1#Ant#27047700955_3b3a1271f8_o.jpg
18
+ DIS5K/DIS-TE2/im/13#Insect#11#Spider#5567179191_38d1f65589_o.jpg
19
+ DIS5K/DIS-TE2/im/13#Insect#8#Locust#5237933769_e6687c05e4_o.jpg
20
+ DIS5K/DIS-TE2/im/14#Kitchenware#2#DishRack#70838854_40cf689da7_o.jpg
21
+ DIS5K/DIS-TE2/im/14#Kitchenware#8#SweetStand#8467929412_fef7f4275d_o.jpg
22
+ DIS5K/DIS-TE2/im/16#Music Instrument#2#Harp#28058219806_28e05ff24a_o.jpg
23
+ DIS5K/DIS-TE2/im/17#Non-motor Vehicle#1#BabyCarriage#29794777180_2e1695a0cf_o.jpg
24
+ DIS5K/DIS-TE2/im/19#Ship#3#Sailboat#22442908623_5977e3becf_o.jpg
25
+ DIS5K/DIS-TE2/im/2#Aircraft#5#Kite#44654358051_1400e71cc4_o.jpg
26
+ DIS5K/DIS-TE2/im/21#Tool#11#Stand#IMG_20210520_205442.jpg
27
+ DIS5K/DIS-TE2/im/21#Tool#17#Tripod#9318977876_34615ec9a0_o.jpg
28
+ DIS5K/DIS-TE2/im/5#Artifact#3#Handcraft#50860882577_8482143b1b_o.jpg
29
+ DIS5K/DIS-TE2/im/8#Electronics#10#Robot#3093360210_fee54dc5c5_o.jpg
30
+ DIS5K/DIS-TE2/im/8#Electronics#6#Microphone#47411477652_6da66cbc10_o.jpg
31
+ DIS5K/DIS-TE3/im/14#Kitchenware#4#Kitchenware#2451122898_ef883175dd_o.jpg
32
+ DIS5K/DIS-TE3/im/15#Machine#4#SewingMachine#9311164128_97ba1d3947_o.jpg
33
+ DIS5K/DIS-TE3/im/16#Music Instrument#2#Harp#7670920550_59e992fd7b_o.jpg
34
+ DIS5K/DIS-TE3/im/17#Non-motor Vehicle#1#BabyCarriage#8389984877_1fddf8715c_o.jpg
35
+ DIS5K/DIS-TE3/im/17#Non-motor Vehicle#3#Carriage#5947122724_98e0fc3d1f_o.jpg
36
+ DIS5K/DIS-TE3/im/2#Aircraft#2#Balloon#2487168092_641505883f_o.jpg
37
+ DIS5K/DIS-TE3/im/2#Aircraft#4#Helicopter#8401177591_06c71c8df2_o.jpg
38
+ DIS5K/DIS-TE3/im/20#Sports#1#Archery#12520003103_faa43ea3e0_o.jpg
39
+ DIS5K/DIS-TE3/im/21#Tool#11#Stand#IMG_20210709_221507.jpg
40
+ DIS5K/DIS-TE3/im/21#Tool#2#Clip#5656649687_63d0c6696d_o.jpg
41
+ DIS5K/DIS-TE3/im/21#Tool#6#Key#12878459244_6387a140ea_o.jpg
42
+ DIS5K/DIS-TE3/im/3#Aquatic#1#Lobster#109214461_f52b4b6093_o.jpg
43
+ DIS5K/DIS-TE3/im/4#Architecture#19#Windmill#20195851863_2627117e0e_o.jpg
44
+ DIS5K/DIS-TE3/im/5#Artifact#2#Cage#5821476369_ea23927487_o.jpg
45
+ DIS5K/DIS-TE3/im/8#Electronics#7#MobileHolder#49732997896_7f53c290b5_o.jpg
46
+ DIS5K/DIS-TE4/im/13#Insect#6#Centipede#15302179708_a267850881_o.jpg
47
+ DIS5K/DIS-TE4/im/17#Non-motor Vehicle#11#Tricycle#5771069105_a3aef6f665_o.jpg
48
+ DIS5K/DIS-TE4/im/17#Non-motor Vehicle#2#Bicycle#4245936196_fdf812dcb7_o.jpg
49
+ DIS5K/DIS-TE4/im/17#Non-motor Vehicle#9#ShoppingCart#4674052920_a5b7a2b236_o.jpg
50
+ DIS5K/DIS-TE4/im/18#Plant#1#Bonsai#3539420884_ca8973e2c0_o.jpg
51
+ DIS5K/DIS-TE4/im/2#Aircraft#6#Parachute#33590416634_9d6f2325e7_o.jpg
52
+ DIS5K/DIS-TE4/im/20#Sports#1#Archery#46924476515_0be1caa684_o.jpg
53
+ DIS5K/DIS-TE4/im/20#Sports#8#Racket#19337607166_dd1985fb59_o.jpg
54
+ DIS5K/DIS-TE4/im/21#Tool#6#Key#3193329588_839b0c74ce_o.jpg
55
+ DIS5K/DIS-TE4/im/5#Artifact#2#Cage#5821886526_0573ba2d0d_o.jpg
56
+ DIS5K/DIS-TE4/im/5#Artifact#3#Handcraft#50105138282_3c1d02c968_o.jpg
57
+ DIS5K/DIS-TE4/im/8#Electronics#1#Antenna#4305034305_874f21a701_o.jpg
58
+ DIS5K/DIS-TR/im/1#Accessories#1#Bag#15554964549_3105e51b6f_o.jpg
59
+ DIS5K/DIS-TR/im/1#Accessories#1#Bag#41104261980_098a6c4a56_o.jpg
60
+ DIS5K/DIS-TR/im/1#Accessories#2#Clothes#2284764037_871b2e8ca4_o.jpg
61
+ DIS5K/DIS-TR/im/1#Accessories#3#Eyeglasses#1824643784_70d0134156_o.jpg
62
+ DIS5K/DIS-TR/im/1#Accessories#3#Eyeglasses#3590020230_37b09a29b3_o.jpg
63
+ DIS5K/DIS-TR/im/1#Accessories#3#Eyeglasses#4809652879_4da8a69f3b_o.jpg
64
+ DIS5K/DIS-TR/im/1#Accessories#3#Eyeglasses#792204934_f9b28f99b4_o.jpg
65
+ DIS5K/DIS-TR/im/1#Accessories#5#Jewelry#13909132974_c4750c5fb7_o.jpg
66
+ DIS5K/DIS-TR/im/1#Accessories#7#Shoe#2483391615_9199ece8d6_o.jpg
67
+ DIS5K/DIS-TR/im/1#Accessories#8#Watch#4343266960_f6633b029b_o.jpg
68
+ DIS5K/DIS-TR/im/10#Frame#2#BicycleFrame#17897573_42964dd104_o.jpg
69
+ DIS5K/DIS-TR/im/10#Frame#5#Rack#15898634812_64807069ff_o.jpg
70
+ DIS5K/DIS-TR/im/10#Frame#5#Rack#23928546819_c184cb0b60_o.jpg
71
+ DIS5K/DIS-TR/im/11#Furniture#19#Shower#6189119596_77bcfe80ee_o.jpg
72
+ DIS5K/DIS-TR/im/11#Furniture#2#Bench#3263647075_9306e280b5_o.jpg
73
+ DIS5K/DIS-TR/im/11#Furniture#5#CoatHanger#12774091054_cd5ff520ef_o.jpg
74
+ DIS5K/DIS-TR/im/11#Furniture#6#DentalChair#13878156865_d0439dcb32_o.jpg
75
+ DIS5K/DIS-TR/im/11#Furniture#9#Easel#5861024714_2070cd480c_o.jpg
76
+ DIS5K/DIS-TR/im/12#Graphics#4#TrafficSign#40621867334_f3c32ec189_o.jpg
77
+ DIS5K/DIS-TR/im/13#Insect#1#Ant#3295038190_db5dd0d4f4_o.jpg
78
+ DIS5K/DIS-TR/im/13#Insect#10#Mosquito#24341339_a88a1dad4c_o.jpg
79
+ DIS5K/DIS-TR/im/13#Insect#11#Spider#27171518270_63b78069ff_o.jpg
80
+ DIS5K/DIS-TR/im/13#Insect#11#Spider#49925050281_fa727c154e_o.jpg
81
+ DIS5K/DIS-TR/im/13#Insect#2#Beatle#279616486_2f1e64f591_o.jpg
82
+ DIS5K/DIS-TR/im/13#Insect#3#Bee#43892067695_82cf3e536b_o.jpg
83
+ DIS5K/DIS-TR/im/13#Insect#6#Centipede#20874281788_3e15c90a1c_o.jpg
84
+ DIS5K/DIS-TR/im/13#Insect#7#Dragonfly#14106671120_1b824d77e4_o.jpg
85
+ DIS5K/DIS-TR/im/13#Insect#8#Locust#21637491048_676ef7c9f7_o.jpg
86
+ DIS5K/DIS-TR/im/13#Insect#9#Mantis#1381120202_9dff6987b2_o.jpg
87
+ DIS5K/DIS-TR/im/14#Kitchenware#1#Cup#12812517473_327d6474b8_o.jpg
88
+ DIS5K/DIS-TR/im/14#Kitchenware#10#WineGlass#6402491641_389275d4d1_o.jpg
89
+ DIS5K/DIS-TR/im/14#Kitchenware#3#Hydrovalve#3129932040_8c05825004_o.jpg
90
+ DIS5K/DIS-TR/im/14#Kitchenware#4#Kitchenware#2881934780_87d5218ebb_o.jpg
91
+ DIS5K/DIS-TR/im/14#Kitchenware#4#Kitchenware#IMG_20210520_205527.jpg
92
+ DIS5K/DIS-TR/im/14#Kitchenware#6#Spoon#32989113501_b69eccf0df_o.jpg
93
+ DIS5K/DIS-TR/im/14#Kitchenware#8#SweetStand#2867322189_c56d1e0b87_o.jpg
94
+ DIS5K/DIS-TR/im/15#Machine#1#Gear#19217846720_f5f2807475_o.jpg
95
+ DIS5K/DIS-TR/im/15#Machine#2#Machine#1620160659_9571b7a7ab_o.jpg
96
+ DIS5K/DIS-TR/im/16#Music Instrument#2#Harp#6012801603_1a6e2c16a6_o.jpg
97
+ DIS5K/DIS-TR/im/16#Music Instrument#5#Trombone#8683292118_d223c17ccb_o.jpg
98
+ DIS5K/DIS-TR/im/16#Music Instrument#6#Trumpet#8393262740_b8c216142c_o.jpg
99
+ DIS5K/DIS-TR/im/16#Music Instrument#8#Violin#1511267391_40e4949d68_o.jpg
100
+ DIS5K/DIS-TR/im/17#Non-motor Vehicle#1#BabyCarriage#6989512997_38b3dbc88b_o.jpg
101
+ DIS5K/DIS-TR/im/17#Non-motor Vehicle#12#Wheel#14627183228_b2d68cf501_o.jpg
102
+ DIS5K/DIS-TR/im/17#Non-motor Vehicle#12#Wheel#2932226475_1b2403e549_o.jpg
103
+ DIS5K/DIS-TR/im/17#Non-motor Vehicle#12#Wheel#5420155648_86459905b8_o.jpg
104
+ DIS5K/DIS-TR/im/17#Non-motor Vehicle#2#Bicycle#IMG_20210513_134904.jpg
105
+ DIS5K/DIS-TR/im/17#Non-motor Vehicle#3#Carriage#3311962551_6f211b7bd6_o.jpg
106
+ DIS5K/DIS-TR/im/17#Non-motor Vehicle#4#Cart#2609732026_baf7fff3a1_o.jpg
107
+ DIS5K/DIS-TR/im/17#Non-motor Vehicle#5#Handcart#5821282211_201cefeaf2_o.jpg
108
+ DIS5K/DIS-TR/im/17#Non-motor Vehicle#7#Mower#5779003232_3bb3ae531a_o.jpg
109
+ DIS5K/DIS-TR/im/17#Non-motor Vehicle#9#ShoppingCart#10051622843_ace07e32b8_o.jpg
110
+ DIS5K/DIS-TR/im/17#Non-motor Vehicle#9#ShoppingCart#8075259294_f23e243849_o.jpg
111
+ DIS5K/DIS-TR/im/18#Plant#2#Tree#44800999741_e377e16dbb_o.jpg
112
+ DIS5K/DIS-TR/im/2#Aircraft#1#Airplane#2631761913_3ac67d0223_o.jpg
113
+ DIS5K/DIS-TR/im/2#Aircraft#1#Airplane#37707911566_e908a261b6_o.jpg
114
+ DIS5K/DIS-TR/im/2#Aircraft#3#HangGlider#2557220131_b8506920c5_o.jpg
115
+ DIS5K/DIS-TR/im/2#Aircraft#4#Helicopter#6215659280_5dbd9b4546_o.jpg
116
+ DIS5K/DIS-TR/im/2#Aircraft#6#Parachute#20185790493_e56fcaf8c6_o.jpg
117
+ DIS5K/DIS-TR/im/20#Sports#1#Archery#3871269982_ae4c59a7eb_o.jpg
118
+ DIS5K/DIS-TR/im/20#Sports#9#RockClimbing#9662433268_51299bc50e_o.jpg
119
+ DIS5K/DIS-TR/im/21#Tool#14#Sword#26258479365_2950d7fa37_o.jpg
120
+ DIS5K/DIS-TR/im/21#Tool#15#Tapeline#15505703447_e0fdeaa5a6_o.jpg
121
+ DIS5K/DIS-TR/im/21#Tool#4#Flag#26678602024_9b665742de_o.jpg
122
+ DIS5K/DIS-TR/im/21#Tool#4#Flag#5774823110_d603ce3cc8_o.jpg
123
+ DIS5K/DIS-TR/im/21#Tool#5#Hook#6867989814_dba18d673c_o.jpg
124
+ DIS5K/DIS-TR/im/22#Weapon#4#Rifle#4451713125_cd91719189_o.jpg
125
+ DIS5K/DIS-TR/im/3#Aquatic#2#Seadragon#4910944581_913139b238_o.jpg
126
+ DIS5K/DIS-TR/im/4#Architecture#12#Scaffold#3661448960_8aff24cc4d_o.jpg
127
+ DIS5K/DIS-TR/im/4#Architecture#13#Sculpture#6385318715_9a88d4eba7_o.jpg
128
+ DIS5K/DIS-TR/im/4#Architecture#17#Well#5011603479_75cf42808a_o.jpg
129
+ DIS5K/DIS-TR/im/5#Artifact#2#Cage#4892828841_7f1bc05682_o.jpg
130
+ DIS5K/DIS-TR/im/5#Artifact#3#Handcraft#15404211628_9e9ff2ce2e_o.jpg
131
+ DIS5K/DIS-TR/im/5#Artifact#3#Handcraft#3200169865_7c84cfcccf_o.jpg
132
+ DIS5K/DIS-TR/im/5#Artifact#3#Handcraft#5859295071_c217e7c22f_o.jpg
133
+ DIS5K/DIS-TR/im/6#Automobile#10#SteeringWheel#17200338026_f1e2122d8e_o.jpg
134
+ DIS5K/DIS-TR/im/6#Automobile#3#Car#3780893425_1a7d275e09_o.jpg
135
+ DIS5K/DIS-TR/im/6#Automobile#5#Crane#15282506502_1b1132a7c3_o.jpg
136
+ DIS5K/DIS-TR/im/7#Electrical#1#Cable#16767791875_8e6df41752_o.jpg
137
+ DIS5K/DIS-TR/im/7#Electrical#1#Cable#3291433361_38747324c4_o.jpg
138
+ DIS5K/DIS-TR/im/7#Electrical#1#Cable#4195104238_12a754c61a_o.jpg
139
+ DIS5K/DIS-TR/im/7#Electrical#1#Cable#49645415132_61e5664ecf_o.jpg
140
+ DIS5K/DIS-TR/im/7#Electrical#1#Cable#IMG_20210521_232406.jpg
141
+ DIS5K/DIS-TR/im/7#Electrical#10#UtilityPole#3298312021_92f431e3e9_o.jpg
142
+ DIS5K/DIS-TR/im/7#Electrical#10#UtilityPole#47950134773_fbfff63f4e_o.jpg
143
+ DIS5K/DIS-TR/im/7#Electrical#11#VacuumCleaner#5448403677_6a29e21881_o.jpg
144
+ DIS5K/DIS-TR/im/7#Electrical#2#CeilingLamp#611568868_680ed5d39f_o.jpg
145
+ DIS5K/DIS-TR/im/7#Electrical#3#Fan#3391683115_990525a693_o.jpg
146
+ DIS5K/DIS-TR/im/7#Electrical#6#StreetLamp#150049122_0692266618_o.jpg
147
+ DIS5K/DIS-TR/im/7#Electrical#9#TransmissionTower#31433908671_7e7e277dfe_o.jpg
148
+ DIS5K/DIS-TR/im/8#Electronics#1#Antenna#8727884873_e0622ee5c4_o.jpg
149
+ DIS5K/DIS-TR/im/8#Electronics#2#Camcorder#4172690390_7e5f280ace_o.jpg
150
+ DIS5K/DIS-TR/im/8#Electronics#3#Earphone#413984555_f290febdf5_o.jpg
151
+ DIS5K/DIS-TR/im/8#Electronics#5#Headset#30574225373_3717ed9fa4_o.jpg
152
+ DIS5K/DIS-TR/im/8#Electronics#6#Microphone#538006482_4aae4f5bd6_o.jpg
153
+ DIS5K/DIS-TR/im/8#Electronics#9#MusicPlayer#1306012480_2ea80d2afd_o.jpg
154
+ DIS5K/DIS-TR/im/9#Entertainment#1#GymEquipment#33071754135_8f3195cbd1_o.jpg
155
+ DIS5K/DIS-TR/im/9#Entertainment#2#KidsPlayground#2305807849_be53d724ea_o.jpg
156
+ DIS5K/DIS-TR/im/9#Entertainment#2#KidsPlayground#3862040422_5bbf903204_o.jpg
157
+ DIS5K/DIS-TR/im/9#Entertainment#3#OutdoorFitnessEquipment#10814507005_3dacaa28b3_o.jpg
158
+ DIS5K/DIS-TR/im/9#Entertainment#4#FerrisWheel#81640293_4b0ee62040_o.jpg
159
+ DIS5K/DIS-TR/im/9#Entertainment#5#Swing#49867339188_08073f4b76_o.jpg
160
+ DIS5K/DIS-VD/im/1#Accessories#1#Bag#6815402415_e01c1a41e6_o.jpg
161
+ DIS5K/DIS-VD/im/1#Accessories#5#Jewelry#2744070193_1486582e8d_o.jpg
162
+ DIS5K/DIS-VD/im/10#Frame#1#BasketballHoop#IMG_20210521_232650.jpg
163
+ DIS5K/DIS-VD/im/10#Frame#5#Rack#6156611713_49ebf12b1e_o.jpg
164
+ DIS5K/DIS-VD/im/11#Furniture#11#Handrail#3276641240_1b84b5af85_o.jpg
165
+ DIS5K/DIS-VD/im/11#Furniture#13#Ladder#33423266_5391cf47e9_o.jpg
166
+ DIS5K/DIS-VD/im/11#Furniture#17#Table#3725111755_4fc101e7ab_o.jpg
167
+ DIS5K/DIS-VD/im/11#Furniture#2#Bench#35556410400_7235b58070_o.jpg
168
+ DIS5K/DIS-VD/im/11#Furniture#4#Chair#3301769985_e49de6739f_o.jpg
169
+ DIS5K/DIS-VD/im/11#Furniture#6#DentalChair#23811071619_2a95c3a688_o.jpg
170
+ DIS5K/DIS-VD/im/11#Furniture#9#Easel#8322807354_df6d56542e_o.jpg
171
+ DIS5K/DIS-VD/im/13#Insect#10#Mosquito#12391674863_0cdf430d3f_o.jpg
172
+ DIS5K/DIS-VD/im/13#Insect#7#Dragonfly#14693028899_344ea118f2_o.jpg
173
+ DIS5K/DIS-VD/im/14#Kitchenware#10#WineGlass#4450148455_8f460f541a_o.jpg
174
+ DIS5K/DIS-VD/im/14#Kitchenware#3#Hydrovalve#IMG_20210520_203410.jpg
175
+ DIS5K/DIS-VD/im/15#Machine#3#PlowHarrow#34521712846_df4babb024_o.jpg
176
+ DIS5K/DIS-VD/im/16#Music Instrument#5#Trombone#6222242743_e7189405cd_o.jpg
177
+ DIS5K/DIS-VD/im/17#Non-motor Vehicle#12#Wheel#25677578797_ea47e1d9e8_o.jpg
178
+ DIS5K/DIS-VD/im/17#Non-motor Vehicle#2#Bicycle#5153474856_21560b081b_o.jpg
179
+ DIS5K/DIS-VD/im/17#Non-motor Vehicle#7#Mower#16992510572_8a6ff27398_o.jpg
180
+ DIS5K/DIS-VD/im/19#Ship#2#Canoe#40571458163_7faf8b73d9_o.jpg
181
+ DIS5K/DIS-VD/im/2#Aircraft#1#Airplane#4270588164_66a619e834_o.jpg
182
+ DIS5K/DIS-VD/im/2#Aircraft#4#Helicopter#86789665_650b94b2ee_o.jpg
183
+ DIS5K/DIS-VD/im/20#Sports#14#Wakesurfing#5589577652_5061c168d2_o.jpg
184
+ DIS5K/DIS-VD/im/21#Tool#10#Spade#37018312543_63b21b0784_o.jpg
185
+ DIS5K/DIS-VD/im/21#Tool#14#Sword#24789047250_42df9bf422_o.jpg
186
+ DIS5K/DIS-VD/im/21#Tool#18#Umbrella#IMG_20210513_140445.jpg
187
+ DIS5K/DIS-VD/im/21#Tool#6#Key#43939732715_5a6e28b518_o.jpg
188
+ DIS5K/DIS-VD/im/22#Weapon#1#Cannon#12758066705_90b54295e7_o.jpg
189
+ DIS5K/DIS-VD/im/22#Weapon#4#Rifle#8019368790_fb6dc469a7_o.jpg
190
+ DIS5K/DIS-VD/im/3#Aquatic#5#Shrimp#2582833427_7a99e7356e_o.jpg
191
+ DIS5K/DIS-VD/im/4#Architecture#12#Scaffold#1013402687_590750354e_o.jpg
192
+ DIS5K/DIS-VD/im/4#Architecture#13#Sculpture#17176841759_272a3ed6e3_o.jpg
193
+ DIS5K/DIS-VD/im/4#Architecture#14#Stair#15079108505_0d11281624_o.jpg
194
+ DIS5K/DIS-VD/im/4#Architecture#19#Windmill#2928111082_ceb3051c04_o.jpg
195
+ DIS5K/DIS-VD/im/4#Architecture#3#Crack#3551574032_17dd106d31_o.jpg
196
+ DIS5K/DIS-VD/im/4#Architecture#5#GasStation#4564307581_c3069bdc62_o.jpg
197
+ DIS5K/DIS-VD/im/4#Architecture#8#ObservationTower#2704526950_d4f0ddc807_o.jpg
198
+ DIS5K/DIS-VD/im/5#Artifact#3#Handcraft#10873642323_1bafce3aa5_o.jpg
199
+ DIS5K/DIS-VD/im/6#Automobile#11#Tractor#8594504006_0c2c557d85_o.jpg
200
+ DIS5K/DIS-VD/im/8#Electronics#3#Earphone#8106454803_1178d867cc_o.jpg
src/depth_pro/network/__init__.py ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ # Copyright (C) 2024 Apple Inc. All Rights Reserved.
2
+ """Depth Pro network blocks."""
src/depth_pro/network/decoder.py ADDED
@@ -0,0 +1,206 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Copyright (C) 2024 Apple Inc. All Rights Reserved.
2
+
3
+ Dense Prediction Transformer Decoder architecture.
4
+
5
+ Implements a variant of Vision Transformers for Dense Prediction, https://arxiv.org/abs/2103.13413
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ from typing import Iterable
11
+
12
+ import torch
13
+ from torch import nn
14
+
15
+
16
+ class MultiresConvDecoder(nn.Module):
17
+ """Decoder for multi-resolution encodings."""
18
+
19
+ def __init__(
20
+ self,
21
+ dims_encoder: Iterable[int],
22
+ dim_decoder: int,
23
+ ):
24
+ """Initialize multiresolution convolutional decoder.
25
+
26
+ Args:
27
+ ----
28
+ dims_encoder: Expected dims at each level from the encoder.
29
+ dim_decoder: Dim of decoder features.
30
+
31
+ """
32
+ super().__init__()
33
+ self.dims_encoder = list(dims_encoder)
34
+ self.dim_decoder = dim_decoder
35
+ self.dim_out = dim_decoder
36
+
37
+ num_encoders = len(self.dims_encoder)
38
+
39
+ # At the highest resolution, i.e. level 0, we apply projection w/ 1x1 convolution
40
+ # when the dimensions mismatch. Otherwise we do not do anything, which is
41
+ # the default behavior of monodepth.
42
+ conv0 = (
43
+ nn.Conv2d(self.dims_encoder[0], dim_decoder, kernel_size=1, bias=False)
44
+ if self.dims_encoder[0] != dim_decoder
45
+ else nn.Identity()
46
+ )
47
+
48
+ convs = [conv0]
49
+ for i in range(1, num_encoders):
50
+ convs.append(
51
+ nn.Conv2d(
52
+ self.dims_encoder[i],
53
+ dim_decoder,
54
+ kernel_size=3,
55
+ stride=1,
56
+ padding=1,
57
+ bias=False,
58
+ )
59
+ )
60
+
61
+ self.convs = nn.ModuleList(convs)
62
+
63
+ fusions = []
64
+ for i in range(num_encoders):
65
+ fusions.append(
66
+ FeatureFusionBlock2d(
67
+ num_features=dim_decoder,
68
+ deconv=(i != 0),
69
+ batch_norm=False,
70
+ )
71
+ )
72
+ self.fusions = nn.ModuleList(fusions)
73
+
74
+ def forward(self, encodings: torch.Tensor) -> torch.Tensor:
75
+ """Decode the multi-resolution encodings."""
76
+ num_levels = len(encodings)
77
+ num_encoders = len(self.dims_encoder)
78
+
79
+ if num_levels != num_encoders:
80
+ raise ValueError(
81
+ f"Got encoder output levels={num_levels}, expected levels={num_encoders+1}."
82
+ )
83
+
84
+ # Project features of different encoder dims to the same decoder dim.
85
+ # Fuse features from the lowest resolution (num_levels-1)
86
+ # to the highest (0).
87
+ features = self.convs[-1](encodings[-1])
88
+ lowres_features = features
89
+ features = self.fusions[-1](features)
90
+ for i in range(num_levels - 2, -1, -1):
91
+ features_i = self.convs[i](encodings[i])
92
+ features = self.fusions[i](features, features_i)
93
+ return features, lowres_features
94
+
95
+
96
+ class ResidualBlock(nn.Module):
97
+ """Generic implementation of residual blocks.
98
+
99
+ This implements a generic residual block from
100
+ He et al. - Identity Mappings in Deep Residual Networks (2016),
101
+ https://arxiv.org/abs/1603.05027
102
+ which can be further customized via factory functions.
103
+ """
104
+
105
+ def __init__(self, residual: nn.Module, shortcut: nn.Module | None = None) -> None:
106
+ """Initialize ResidualBlock."""
107
+ super().__init__()
108
+ self.residual = residual
109
+ self.shortcut = shortcut
110
+
111
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
112
+ """Apply residual block."""
113
+ delta_x = self.residual(x)
114
+
115
+ if self.shortcut is not None:
116
+ x = self.shortcut(x)
117
+
118
+ return x + delta_x
119
+
120
+
121
+ class FeatureFusionBlock2d(nn.Module):
122
+ """Feature fusion for DPT."""
123
+
124
+ def __init__(
125
+ self,
126
+ num_features: int,
127
+ deconv: bool = False,
128
+ batch_norm: bool = False,
129
+ ):
130
+ """Initialize feature fusion block.
131
+
132
+ Args:
133
+ ----
134
+ num_features: Input and output dimensions.
135
+ deconv: Whether to use deconv before the final output conv.
136
+ batch_norm: Whether to use batch normalization in resnet blocks.
137
+
138
+ """
139
+ super().__init__()
140
+
141
+ self.resnet1 = self._residual_block(num_features, batch_norm)
142
+ self.resnet2 = self._residual_block(num_features, batch_norm)
143
+
144
+ self.use_deconv = deconv
145
+ if deconv:
146
+ self.deconv = nn.ConvTranspose2d(
147
+ in_channels=num_features,
148
+ out_channels=num_features,
149
+ kernel_size=2,
150
+ stride=2,
151
+ padding=0,
152
+ bias=False,
153
+ )
154
+
155
+ self.out_conv = nn.Conv2d(
156
+ num_features,
157
+ num_features,
158
+ kernel_size=1,
159
+ stride=1,
160
+ padding=0,
161
+ bias=True,
162
+ )
163
+
164
+ self.skip_add = nn.quantized.FloatFunctional()
165
+
166
+ def forward(self, x0: torch.Tensor, x1: torch.Tensor | None = None) -> torch.Tensor:
167
+ """Process and fuse input features."""
168
+ x = x0
169
+
170
+ if x1 is not None:
171
+ res = self.resnet1(x1)
172
+ x = self.skip_add.add(x, res)
173
+
174
+ x = self.resnet2(x)
175
+
176
+ if self.use_deconv:
177
+ x = self.deconv(x)
178
+ x = self.out_conv(x)
179
+
180
+ return x
181
+
182
+ @staticmethod
183
+ def _residual_block(num_features: int, batch_norm: bool):
184
+ """Create a residual block."""
185
+
186
+ def _create_block(dim: int, batch_norm: bool) -> list[nn.Module]:
187
+ layers = [
188
+ nn.ReLU(False),
189
+ nn.Conv2d(
190
+ num_features,
191
+ num_features,
192
+ kernel_size=3,
193
+ stride=1,
194
+ padding=1,
195
+ bias=not batch_norm,
196
+ ),
197
+ ]
198
+ if batch_norm:
199
+ layers.append(nn.BatchNorm2d(dim))
200
+ return layers
201
+
202
+ residual = nn.Sequential(
203
+ *_create_block(dim=num_features, batch_norm=batch_norm),
204
+ *_create_block(dim=num_features, batch_norm=batch_norm),
205
+ )
206
+ return ResidualBlock(residual)
src/depth_pro/network/encoder.py ADDED
@@ -0,0 +1,332 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (C) 2024 Apple Inc. All Rights Reserved.
2
+ # DepthProEncoder combining patch and image encoders.
3
+
4
+ from __future__ import annotations
5
+
6
+ import math
7
+ from typing import Iterable, Optional
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.nn.functional as F
12
+
13
+
14
+ class DepthProEncoder(nn.Module):
15
+ """DepthPro Encoder.
16
+
17
+ An encoder aimed at creating multi-resolution encodings from Vision Transformers.
18
+ """
19
+
20
+ def __init__(
21
+ self,
22
+ dims_encoder: Iterable[int],
23
+ patch_encoder: nn.Module,
24
+ image_encoder: nn.Module,
25
+ hook_block_ids: Iterable[int],
26
+ decoder_features: int,
27
+ ):
28
+ """Initialize DepthProEncoder.
29
+
30
+ The framework
31
+ 1. creates an image pyramid,
32
+ 2. generates overlapping patches with a sliding window at each pyramid level,
33
+ 3. creates batched encodings via vision transformer backbones,
34
+ 4. produces multi-resolution encodings.
35
+
36
+ Args:
37
+ ----
38
+ img_size: Backbone image resolution.
39
+ dims_encoder: Dimensions of the encoder at different layers.
40
+ patch_encoder: Backbone used for patches.
41
+ image_encoder: Backbone used for global image encoder.
42
+ hook_block_ids: Hooks to obtain intermediate features for the patch encoder model.
43
+ decoder_features: Number of feature output in the decoder.
44
+
45
+ """
46
+ super().__init__()
47
+
48
+ self.dims_encoder = list(dims_encoder)
49
+ self.patch_encoder = patch_encoder
50
+ self.image_encoder = image_encoder
51
+ self.hook_block_ids = list(hook_block_ids)
52
+
53
+ patch_encoder_embed_dim = patch_encoder.embed_dim
54
+ image_encoder_embed_dim = image_encoder.embed_dim
55
+
56
+ self.out_size = int(
57
+ patch_encoder.patch_embed.img_size[0] // patch_encoder.patch_embed.patch_size[0]
58
+ )
59
+
60
+ def _create_project_upsample_block(
61
+ dim_in: int,
62
+ dim_out: int,
63
+ upsample_layers: int,
64
+ dim_int: Optional[int] = None,
65
+ ) -> nn.Module:
66
+ if dim_int is None:
67
+ dim_int = dim_out
68
+ # Projection.
69
+ blocks = [
70
+ nn.Conv2d(
71
+ in_channels=dim_in,
72
+ out_channels=dim_int,
73
+ kernel_size=1,
74
+ stride=1,
75
+ padding=0,
76
+ bias=False,
77
+ )
78
+ ]
79
+
80
+ # Upsampling.
81
+ blocks += [
82
+ nn.ConvTranspose2d(
83
+ in_channels=dim_int if i == 0 else dim_out,
84
+ out_channels=dim_out,
85
+ kernel_size=2,
86
+ stride=2,
87
+ padding=0,
88
+ bias=False,
89
+ )
90
+ for i in range(upsample_layers)
91
+ ]
92
+
93
+ return nn.Sequential(*blocks)
94
+
95
+ self.upsample_latent0 = _create_project_upsample_block(
96
+ dim_in=patch_encoder_embed_dim,
97
+ dim_int=self.dims_encoder[0],
98
+ dim_out=decoder_features,
99
+ upsample_layers=3,
100
+ )
101
+ self.upsample_latent1 = _create_project_upsample_block(
102
+ dim_in=patch_encoder_embed_dim, dim_out=self.dims_encoder[0], upsample_layers=2
103
+ )
104
+
105
+ self.upsample0 = _create_project_upsample_block(
106
+ dim_in=patch_encoder_embed_dim, dim_out=self.dims_encoder[1], upsample_layers=1
107
+ )
108
+ self.upsample1 = _create_project_upsample_block(
109
+ dim_in=patch_encoder_embed_dim, dim_out=self.dims_encoder[2], upsample_layers=1
110
+ )
111
+ self.upsample2 = _create_project_upsample_block(
112
+ dim_in=patch_encoder_embed_dim, dim_out=self.dims_encoder[3], upsample_layers=1
113
+ )
114
+
115
+ self.upsample_lowres = nn.ConvTranspose2d(
116
+ in_channels=image_encoder_embed_dim,
117
+ out_channels=self.dims_encoder[3],
118
+ kernel_size=2,
119
+ stride=2,
120
+ padding=0,
121
+ bias=True,
122
+ )
123
+ self.fuse_lowres = nn.Conv2d(
124
+ in_channels=(self.dims_encoder[3] + self.dims_encoder[3]),
125
+ out_channels=self.dims_encoder[3],
126
+ kernel_size=1,
127
+ stride=1,
128
+ padding=0,
129
+ bias=True,
130
+ )
131
+
132
+ # Obtain intermediate outputs of the blocks.
133
+ self.patch_encoder.blocks[self.hook_block_ids[0]].register_forward_hook(
134
+ self._hook0
135
+ )
136
+ self.patch_encoder.blocks[self.hook_block_ids[1]].register_forward_hook(
137
+ self._hook1
138
+ )
139
+
140
+ def _hook0(self, model, input, output):
141
+ self.backbone_highres_hook0 = output
142
+
143
+ def _hook1(self, model, input, output):
144
+ self.backbone_highres_hook1 = output
145
+
146
+ @property
147
+ def img_size(self) -> int:
148
+ """Return the full image size of the SPN network."""
149
+ return self.patch_encoder.patch_embed.img_size[0] * 4
150
+
151
+ def _create_pyramid(
152
+ self, x: torch.Tensor
153
+ ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
154
+ """Create a 3-level image pyramid."""
155
+ # Original resolution: 1536 by default.
156
+ x0 = x
157
+
158
+ # Middle resolution: 768 by default.
159
+ x1 = F.interpolate(
160
+ x, size=None, scale_factor=0.5, mode="bilinear", align_corners=False
161
+ )
162
+
163
+ # Low resolution: 384 by default, corresponding to the backbone resolution.
164
+ x2 = F.interpolate(
165
+ x, size=None, scale_factor=0.25, mode="bilinear", align_corners=False
166
+ )
167
+
168
+ return x0, x1, x2
169
+
170
+ def split(self, x: torch.Tensor, overlap_ratio: float = 0.25) -> torch.Tensor:
171
+ """Split the input into small patches with sliding window."""
172
+ patch_size = 384
173
+ patch_stride = int(patch_size * (1 - overlap_ratio))
174
+
175
+ image_size = x.shape[-1]
176
+ steps = int(math.ceil((image_size - patch_size) / patch_stride)) + 1
177
+
178
+ x_patch_list = []
179
+ for j in range(steps):
180
+ j0 = j * patch_stride
181
+ j1 = j0 + patch_size
182
+
183
+ for i in range(steps):
184
+ i0 = i * patch_stride
185
+ i1 = i0 + patch_size
186
+ x_patch_list.append(x[..., j0:j1, i0:i1])
187
+
188
+ return torch.cat(x_patch_list, dim=0)
189
+
190
+ def merge(self, x: torch.Tensor, batch_size: int, padding: int = 3) -> torch.Tensor:
191
+ """Merge the patched input into a image with sliding window."""
192
+ steps = int(math.sqrt(x.shape[0] // batch_size))
193
+
194
+ idx = 0
195
+
196
+ output_list = []
197
+ for j in range(steps):
198
+ output_row_list = []
199
+ for i in range(steps):
200
+ output = x[batch_size * idx : batch_size * (idx + 1)]
201
+
202
+ if j != 0:
203
+ output = output[..., padding:, :]
204
+ if i != 0:
205
+ output = output[..., :, padding:]
206
+ if j != steps - 1:
207
+ output = output[..., :-padding, :]
208
+ if i != steps - 1:
209
+ output = output[..., :, :-padding]
210
+
211
+ output_row_list.append(output)
212
+ idx += 1
213
+
214
+ output_row = torch.cat(output_row_list, dim=-1)
215
+ output_list.append(output_row)
216
+ output = torch.cat(output_list, dim=-2)
217
+ return output
218
+
219
+ def reshape_feature(
220
+ self, embeddings: torch.Tensor, width, height, cls_token_offset=1
221
+ ):
222
+ """Discard class token and reshape 1D feature map to a 2D grid."""
223
+ b, hw, c = embeddings.shape
224
+
225
+ # Remove class token.
226
+ if cls_token_offset > 0:
227
+ embeddings = embeddings[:, cls_token_offset:, :]
228
+
229
+ # Shape: (batch, height, width, dim) -> (batch, dim, height, width)
230
+ embeddings = embeddings.reshape(b, height, width, c).permute(0, 3, 1, 2)
231
+ return embeddings
232
+
233
+ def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
234
+ """Encode input at multiple resolutions.
235
+
236
+ Args:
237
+ ----
238
+ x (torch.Tensor): Input image.
239
+
240
+ Returns:
241
+ -------
242
+ Multi resolution encoded features.
243
+
244
+ """
245
+ batch_size = x.shape[0]
246
+
247
+ # Step 0: create a 3-level image pyramid.
248
+ x0, x1, x2 = self._create_pyramid(x)
249
+
250
+ # Step 1: split to create batched overlapped mini-images at the backbone (BeiT/ViT/Dino)
251
+ # resolution.
252
+ # 5x5 @ 384x384 at the highest resolution (1536x1536).
253
+ x0_patches = self.split(x0, overlap_ratio=0.25)
254
+ # 3x3 @ 384x384 at the middle resolution (768x768).
255
+ x1_patches = self.split(x1, overlap_ratio=0.5)
256
+ # 1x1 # 384x384 at the lowest resolution (384x384).
257
+ x2_patches = x2
258
+
259
+ # Concatenate all the sliding window patches and form a batch of size (35=5x5+3x3+1x1).
260
+ x_pyramid_patches = torch.cat(
261
+ (x0_patches, x1_patches, x2_patches),
262
+ dim=0,
263
+ )
264
+
265
+ # Step 2: Run the backbone (BeiT) model and get the result of large batch size.
266
+ x_pyramid_encodings = self.patch_encoder(x_pyramid_patches)
267
+ x_pyramid_encodings = self.reshape_feature(
268
+ x_pyramid_encodings, self.out_size, self.out_size
269
+ )
270
+
271
+ # Step 3: merging.
272
+ # Merge highres latent encoding.
273
+ x_latent0_encodings = self.reshape_feature(
274
+ self.backbone_highres_hook0,
275
+ self.out_size,
276
+ self.out_size,
277
+ )
278
+ x_latent0_features = self.merge(
279
+ x_latent0_encodings[: batch_size * 5 * 5], batch_size=batch_size, padding=3
280
+ )
281
+
282
+ x_latent1_encodings = self.reshape_feature(
283
+ self.backbone_highres_hook1,
284
+ self.out_size,
285
+ self.out_size,
286
+ )
287
+ x_latent1_features = self.merge(
288
+ x_latent1_encodings[: batch_size * 5 * 5], batch_size=batch_size, padding=3
289
+ )
290
+
291
+ # Split the 35 batch size from pyramid encoding back into 5x5+3x3+1x1.
292
+ x0_encodings, x1_encodings, x2_encodings = torch.split(
293
+ x_pyramid_encodings,
294
+ [len(x0_patches), len(x1_patches), len(x2_patches)],
295
+ dim=0,
296
+ )
297
+
298
+ # 96x96 feature maps by merging 5x5 @ 24x24 patches with overlaps.
299
+ x0_features = self.merge(x0_encodings, batch_size=batch_size, padding=3)
300
+
301
+ # 48x84 feature maps by merging 3x3 @ 24x24 patches with overlaps.
302
+ x1_features = self.merge(x1_encodings, batch_size=batch_size, padding=6)
303
+
304
+ # 24x24 feature maps.
305
+ x2_features = x2_encodings
306
+
307
+ # Apply the image encoder model.
308
+ x_global_features = self.image_encoder(x2_patches)
309
+ x_global_features = self.reshape_feature(
310
+ x_global_features, self.out_size, self.out_size
311
+ )
312
+
313
+ # Upsample feature maps.
314
+ x_latent0_features = self.upsample_latent0(x_latent0_features)
315
+ x_latent1_features = self.upsample_latent1(x_latent1_features)
316
+
317
+ x0_features = self.upsample0(x0_features)
318
+ x1_features = self.upsample1(x1_features)
319
+ x2_features = self.upsample2(x2_features)
320
+
321
+ x_global_features = self.upsample_lowres(x_global_features)
322
+ x_global_features = self.fuse_lowres(
323
+ torch.cat((x2_features, x_global_features), dim=1)
324
+ )
325
+
326
+ return [
327
+ x_latent0_features,
328
+ x_latent1_features,
329
+ x0_features,
330
+ x1_features,
331
+ x_global_features,
332
+ ]
src/depth_pro/network/fov.py ADDED
@@ -0,0 +1,82 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (C) 2024 Apple Inc. All Rights Reserved.
2
+ # Field of View network architecture.
3
+
4
+ from typing import Optional
5
+
6
+ import torch
7
+ from torch import nn
8
+ from torch.nn import functional as F
9
+
10
+
11
+ class FOVNetwork(nn.Module):
12
+ """Field of View estimation network."""
13
+
14
+ def __init__(
15
+ self,
16
+ num_features: int,
17
+ fov_encoder: Optional[nn.Module] = None,
18
+ ):
19
+ """Initialize the Field of View estimation block.
20
+
21
+ Args:
22
+ ----
23
+ num_features: Number of features used.
24
+ fov_encoder: Optional encoder to bring additional network capacity.
25
+
26
+ """
27
+ super().__init__()
28
+
29
+ # Create FOV head.
30
+ fov_head0 = [
31
+ nn.Conv2d(
32
+ num_features, num_features // 2, kernel_size=3, stride=2, padding=1
33
+ ), # 128 x 24 x 24
34
+ nn.ReLU(True),
35
+ ]
36
+ fov_head = [
37
+ nn.Conv2d(
38
+ num_features // 2, num_features // 4, kernel_size=3, stride=2, padding=1
39
+ ), # 64 x 12 x 12
40
+ nn.ReLU(True),
41
+ nn.Conv2d(
42
+ num_features // 4, num_features // 8, kernel_size=3, stride=2, padding=1
43
+ ), # 32 x 6 x 6
44
+ nn.ReLU(True),
45
+ nn.Conv2d(num_features // 8, 1, kernel_size=6, stride=1, padding=0),
46
+ ]
47
+ if fov_encoder is not None:
48
+ self.encoder = nn.Sequential(
49
+ fov_encoder, nn.Linear(fov_encoder.embed_dim, num_features // 2)
50
+ )
51
+ self.downsample = nn.Sequential(*fov_head0)
52
+ else:
53
+ fov_head = fov_head0 + fov_head
54
+ self.head = nn.Sequential(*fov_head)
55
+
56
+ def forward(self, x: torch.Tensor, lowres_feature: torch.Tensor) -> torch.Tensor:
57
+ """Forward the fov network.
58
+
59
+ Args:
60
+ ----
61
+ x (torch.Tensor): Input image.
62
+ lowres_feature (torch.Tensor): Low resolution feature.
63
+
64
+ Returns:
65
+ -------
66
+ The field of view tensor.
67
+
68
+ """
69
+ if hasattr(self, "encoder"):
70
+ x = F.interpolate(
71
+ x,
72
+ size=None,
73
+ scale_factor=0.25,
74
+ mode="bilinear",
75
+ align_corners=False,
76
+ )
77
+ x = self.encoder(x)[:, 1:].permute(0, 2, 1)
78
+ lowres_feature = self.downsample(lowres_feature)
79
+ x = x.reshape_as(lowres_feature) + lowres_feature
80
+ else:
81
+ x = lowres_feature
82
+ return self.head(x)
src/depth_pro/network/vit.py ADDED
@@ -0,0 +1,123 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (C) 2024 Apple Inc. All Rights Reserved.
2
+
3
+
4
+ try:
5
+ from timm.layers import resample_abs_pos_embed
6
+ except ImportError as err:
7
+ print("ImportError: {0}".format(err))
8
+ import torch
9
+ import torch.nn as nn
10
+ from torch.utils.checkpoint import checkpoint
11
+
12
+
13
+ def make_vit_b16_backbone(
14
+ model,
15
+ encoder_feature_dims,
16
+ encoder_feature_layer_ids,
17
+ vit_features,
18
+ start_index=1,
19
+ use_grad_checkpointing=False,
20
+ ) -> nn.Module:
21
+ """Make a ViTb16 backbone for the DPT model."""
22
+ if use_grad_checkpointing:
23
+ model.set_grad_checkpointing()
24
+
25
+ vit_model = nn.Module()
26
+ vit_model.hooks = encoder_feature_layer_ids
27
+ vit_model.model = model
28
+ vit_model.features = encoder_feature_dims
29
+ vit_model.vit_features = vit_features
30
+ vit_model.model.start_index = start_index
31
+ vit_model.model.patch_size = vit_model.model.patch_embed.patch_size
32
+ vit_model.model.is_vit = True
33
+ vit_model.model.forward = vit_model.model.forward_features
34
+
35
+ return vit_model
36
+
37
+
38
+ def forward_features_eva_fixed(self, x):
39
+ """Encode features."""
40
+ x = self.patch_embed(x)
41
+ x, rot_pos_embed = self._pos_embed(x)
42
+ for blk in self.blocks:
43
+ if self.grad_checkpointing:
44
+ x = checkpoint(blk, x, rot_pos_embed)
45
+ else:
46
+ x = blk(x, rot_pos_embed)
47
+ x = self.norm(x)
48
+ return x
49
+
50
+
51
+ def resize_vit(model: nn.Module, img_size) -> nn.Module:
52
+ """Resample the ViT module to the given size."""
53
+ patch_size = model.patch_embed.patch_size
54
+ model.patch_embed.img_size = img_size
55
+ grid_size = tuple([s // p for s, p in zip(img_size, patch_size)])
56
+ model.patch_embed.grid_size = grid_size
57
+
58
+ pos_embed = resample_abs_pos_embed(
59
+ model.pos_embed,
60
+ grid_size, # img_size
61
+ num_prefix_tokens=(
62
+ 0 if getattr(model, "no_embed_class", False) else model.num_prefix_tokens
63
+ ),
64
+ )
65
+ model.pos_embed = torch.nn.Parameter(pos_embed)
66
+
67
+ return model
68
+
69
+
70
+ def resize_patch_embed(model: nn.Module, new_patch_size=(16, 16)) -> nn.Module:
71
+ """Resample the ViT patch size to the given one."""
72
+ # interpolate patch embedding
73
+ if hasattr(model, "patch_embed"):
74
+ old_patch_size = model.patch_embed.patch_size
75
+
76
+ if (
77
+ new_patch_size[0] != old_patch_size[0]
78
+ or new_patch_size[1] != old_patch_size[1]
79
+ ):
80
+ patch_embed_proj = model.patch_embed.proj.weight
81
+ patch_embed_proj_bias = model.patch_embed.proj.bias
82
+ use_bias = True if patch_embed_proj_bias is not None else False
83
+ _, _, h, w = patch_embed_proj.shape
84
+
85
+ new_patch_embed_proj = torch.nn.functional.interpolate(
86
+ patch_embed_proj,
87
+ size=[new_patch_size[0], new_patch_size[1]],
88
+ mode="bicubic",
89
+ align_corners=False,
90
+ )
91
+ new_patch_embed_proj = (
92
+ new_patch_embed_proj * (h / new_patch_size[0]) * (w / new_patch_size[1])
93
+ )
94
+
95
+ model.patch_embed.proj = nn.Conv2d(
96
+ in_channels=model.patch_embed.proj.in_channels,
97
+ out_channels=model.patch_embed.proj.out_channels,
98
+ kernel_size=new_patch_size,
99
+ stride=new_patch_size,
100
+ bias=use_bias,
101
+ )
102
+
103
+ if use_bias:
104
+ model.patch_embed.proj.bias = patch_embed_proj_bias
105
+
106
+ model.patch_embed.proj.weight = torch.nn.Parameter(new_patch_embed_proj)
107
+
108
+ model.patch_size = new_patch_size
109
+ model.patch_embed.patch_size = new_patch_size
110
+ model.patch_embed.img_size = (
111
+ int(
112
+ model.patch_embed.img_size[0]
113
+ * new_patch_size[0]
114
+ / old_patch_size[0]
115
+ ),
116
+ int(
117
+ model.patch_embed.img_size[1]
118
+ * new_patch_size[1]
119
+ / old_patch_size[1]
120
+ ),
121
+ )
122
+
123
+ return model
src/depth_pro/network/vit_factory.py ADDED
@@ -0,0 +1,124 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (C) 2024 Apple Inc. All Rights Reserved.
2
+ # Factory functions to build and load ViT models.
3
+
4
+
5
+ from __future__ import annotations
6
+
7
+ import logging
8
+ import types
9
+ from dataclasses import dataclass
10
+ from typing import Dict, List, Literal, Optional
11
+
12
+ import timm
13
+ import torch
14
+ import torch.nn as nn
15
+
16
+ from .vit import (
17
+ forward_features_eva_fixed,
18
+ make_vit_b16_backbone,
19
+ resize_patch_embed,
20
+ resize_vit,
21
+ )
22
+
23
+ LOGGER = logging.getLogger(__name__)
24
+
25
+
26
+ ViTPreset = Literal[
27
+ "dinov2l16_384",
28
+ ]
29
+
30
+
31
+ @dataclass
32
+ class ViTConfig:
33
+ """Configuration for ViT."""
34
+
35
+ in_chans: int
36
+ embed_dim: int
37
+
38
+ img_size: int = 384
39
+ patch_size: int = 16
40
+
41
+ # In case we need to rescale the backbone when loading from timm.
42
+ timm_preset: Optional[str] = None
43
+ timm_img_size: int = 384
44
+ timm_patch_size: int = 16
45
+
46
+ # The following 2 parameters are only used by DPT. See dpt_factory.py.
47
+ encoder_feature_layer_ids: List[int] = None
48
+ """The layers in the Beit/ViT used to constructs encoder features for DPT."""
49
+ encoder_feature_dims: List[int] = None
50
+ """The dimension of features of encoder layers from Beit/ViT features for DPT."""
51
+
52
+
53
+ VIT_CONFIG_DICT: Dict[ViTPreset, ViTConfig] = {
54
+ "dinov2l16_384": ViTConfig(
55
+ in_chans=3,
56
+ embed_dim=1024,
57
+ encoder_feature_layer_ids=[5, 11, 17, 23],
58
+ encoder_feature_dims=[256, 512, 1024, 1024],
59
+ img_size=384,
60
+ patch_size=16,
61
+ timm_preset="vit_large_patch14_dinov2",
62
+ timm_img_size=518,
63
+ timm_patch_size=14,
64
+ ),
65
+ }
66
+
67
+
68
+ def create_vit(
69
+ preset: ViTPreset,
70
+ use_pretrained: bool = False,
71
+ checkpoint_uri: str | None = None,
72
+ use_grad_checkpointing: bool = False,
73
+ ) -> nn.Module:
74
+ """Create and load a VIT backbone module.
75
+
76
+ Args:
77
+ ----
78
+ preset: The VIT preset to load the pre-defined config.
79
+ use_pretrained: Load pretrained weights if True, default is False.
80
+ checkpoint_uri: Checkpoint to load the wights from.
81
+ use_grad_checkpointing: Use grandient checkpointing.
82
+
83
+ Returns:
84
+ -------
85
+ A Torch ViT backbone module.
86
+
87
+ """
88
+ config = VIT_CONFIG_DICT[preset]
89
+
90
+ img_size = (config.img_size, config.img_size)
91
+ patch_size = (config.patch_size, config.patch_size)
92
+
93
+ if "eva02" in preset:
94
+ model = timm.create_model(config.timm_preset, pretrained=use_pretrained)
95
+ model.forward_features = types.MethodType(forward_features_eva_fixed, model)
96
+ else:
97
+ model = timm.create_model(
98
+ config.timm_preset, pretrained=use_pretrained, dynamic_img_size=True
99
+ )
100
+ model = make_vit_b16_backbone(
101
+ model,
102
+ encoder_feature_dims=config.encoder_feature_dims,
103
+ encoder_feature_layer_ids=config.encoder_feature_layer_ids,
104
+ vit_features=config.embed_dim,
105
+ use_grad_checkpointing=use_grad_checkpointing,
106
+ )
107
+ if config.patch_size != config.timm_patch_size:
108
+ model.model = resize_patch_embed(model.model, new_patch_size=patch_size)
109
+ if config.img_size != config.timm_img_size:
110
+ model.model = resize_vit(model.model, img_size=img_size)
111
+
112
+ if checkpoint_uri is not None:
113
+ state_dict = torch.load(checkpoint_uri, map_location="cpu")
114
+ missing_keys, unexpected_keys = model.load_state_dict(
115
+ state_dict=state_dict, strict=False
116
+ )
117
+
118
+ if len(unexpected_keys) != 0:
119
+ raise KeyError(f"Found unexpected keys when loading vit: {unexpected_keys}")
120
+ if len(missing_keys) != 0:
121
+ raise KeyError(f"Keys are missing when loading vit: {missing_keys}")
122
+
123
+ LOGGER.info(model)
124
+ return model.model
src/depth_pro/utils.py ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (C) 2024 Apple Inc. All Rights Reserved.
2
+
3
+ import logging
4
+ from pathlib import Path
5
+ from typing import Any, Dict, List, Tuple, Union
6
+
7
+ import numpy as np
8
+ import pillow_heif
9
+ from PIL import ExifTags, Image, TiffTags
10
+ from pillow_heif import register_heif_opener
11
+
12
+ register_heif_opener()
13
+ LOGGER = logging.getLogger(__name__)
14
+
15
+
16
+ def extract_exif(img_pil: Image) -> Dict[str, Any]:
17
+ """Return exif information as a dictionary.
18
+
19
+ Args:
20
+ ----
21
+ img_pil: A Pillow image.
22
+
23
+ Returns:
24
+ -------
25
+ A dictionary with extracted EXIF information.
26
+
27
+ """
28
+ # Get full exif description from get_ifd(0x8769):
29
+ # cf https://pillow.readthedocs.io/en/stable/releasenotes/8.2.0.html#image-getexif-exif-and-gps-ifd
30
+ img_exif = img_pil.getexif().get_ifd(0x8769)
31
+ exif_dict = {ExifTags.TAGS[k]: v for k, v in img_exif.items() if k in ExifTags.TAGS}
32
+
33
+ tiff_tags = img_pil.getexif()
34
+ tiff_dict = {
35
+ TiffTags.TAGS_V2[k].name: v
36
+ for k, v in tiff_tags.items()
37
+ if k in TiffTags.TAGS_V2
38
+ }
39
+ return {**exif_dict, **tiff_dict}
40
+
41
+
42
+ def fpx_from_f35(width: float, height: float, f_mm: float = 50) -> float:
43
+ """Convert a focal length given in mm (35mm film equivalent) to pixels."""
44
+ return f_mm * np.sqrt(width**2.0 + height**2.0) / np.sqrt(36**2 + 24**2)
45
+
46
+
47
+ def load_rgb(
48
+ path: Union[Path, str], auto_rotate: bool = True, remove_alpha: bool = True
49
+ ) -> Tuple[np.ndarray, List[bytes], float]:
50
+ """Load an RGB image.
51
+
52
+ Args:
53
+ ----
54
+ path: The url to the image to load.
55
+ auto_rotate: Rotate the image based on the EXIF data, default is True.
56
+ remove_alpha: Remove the alpha channel, default is True.
57
+
58
+ Returns:
59
+ -------
60
+ img: The image loaded as a numpy array.
61
+ icc_profile: The color profile of the image.
62
+ f_px: The optional focal length in pixels, extracting from the exif data.
63
+
64
+ """
65
+ LOGGER.debug(f"Loading image {path} ...")
66
+
67
+ path = Path(path)
68
+ if path.suffix.lower() in [".heic"]:
69
+ heif_file = pillow_heif.open_heif(path, convert_hdr_to_8bit=True)
70
+ img_pil = heif_file.to_pillow()
71
+ else:
72
+ img_pil = Image.open(path)
73
+
74
+ img_exif = extract_exif(img_pil)
75
+ icc_profile = img_pil.info.get("icc_profile", None)
76
+
77
+ # Rotate the image.
78
+ if auto_rotate:
79
+ exif_orientation = img_exif.get("Orientation", 1)
80
+ if exif_orientation == 3:
81
+ img_pil = img_pil.transpose(Image.ROTATE_180)
82
+ elif exif_orientation == 6:
83
+ img_pil = img_pil.transpose(Image.ROTATE_270)
84
+ elif exif_orientation == 8:
85
+ img_pil = img_pil.transpose(Image.ROTATE_90)
86
+ elif exif_orientation != 1:
87
+ LOGGER.warning(f"Ignoring image orientation {exif_orientation}.")
88
+
89
+ img = np.array(img_pil)
90
+ # Convert to RGB if single channel.
91
+ if img.ndim < 3 or img.shape[2] == 1:
92
+ img = np.dstack((img, img, img))
93
+
94
+ if remove_alpha:
95
+ img = img[:, :, :3]
96
+
97
+ LOGGER.debug(f"\tHxW: {img.shape[0]}x{img.shape[1]}")
98
+
99
+ # Extract the focal length from exif data.
100
+ f_35mm = img_exif.get(
101
+ "FocalLengthIn35mmFilm",
102
+ img_exif.get(
103
+ "FocalLenIn35mmFilm", img_exif.get("FocalLengthIn35mmFormat", None)
104
+ ),
105
+ )
106
+ if f_35mm is not None and f_35mm > 0:
107
+ LOGGER.debug(f"\tfocal length @ 35mm film: {f_35mm}mm")
108
+ f_px = fpx_from_f35(img.shape[1], img.shape[0], f_35mm)
109
+ else:
110
+ f_px = None
111
+
112
+ return img, icc_profile, f_px