1. Datasets
1.1 Horizontal-Text Datasets
ICDAR 2003(IC03):
- Introduction: It contains 509 images in total, 258 for training and 251 for testing, with 1,110 text instances in the training set and 1,156 in the testing set. It has word-level annotations. IC03 only considers English text.
- Link: IC03-download
ICDAR 2011(IC11):
- Introduction: IC11 is an English dataset for text detection. It contains 484 images, 229 for training and 255 for testing, with 1,564 text instances in total. It provides both word-level and character-level annotations.
- Link: IC11-download
ICDAR 2013(IC13):
- Introduction: IC13 largely overlaps with IC11. It contains 462 images in total, 229 for training and 233 for testing, with 849 text instances in the training set and 1,095 in the testing set.
- Link: IC13-download
1.2 Arbitrary-Quadrilateral-Text Datasets
USTB-SV1K:
- Introduction: USTB-SV1K is an English dataset. It contains 1,000 street images from Google Street View with 2,955 text instances in total. It only provides word-level annotations.
- Link: USTB-SV1K-download
SVT:
- Introduction: It contains 350 images with 725 English text instances in total. SVT has both character-level and word-level annotations. The images were harvested from Google Street View and are of low resolution.
- Link: SVT-download
SVT-P:
- Introduction: It contains 639 cropped word images for testing. The images were selected from side-view snapshots in Google Street View, so most of them are heavily distorted by the non-frontal view angle. It is the perspective counterpart of SVT.
- Link: SVT-P-download (Password : vnis)
ICDAR 2015(IC15):
- Introduction: It contains 1,500 images in total, 1,000 for training and 500 for testing, with 17,548 text instances. It provides word-level quadrilateral annotations (see the parsing sketch below). IC15 was the first incidental scene text dataset, and it only considers English words.
- Link: IC15-download
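IC15-style word-level ground truth is typically distributed as one UTF-8 text file per image, where each line lists the four corner points of a quadrilateral followed by the transcription, and "###" marks ignored (don't-care) regions. The minimal sketch below assumes that standard format; the file name in the usage comment is hypothetical and not part of this repository.

```python
from typing import List, Tuple

def load_ic15_gt(path: str) -> List[Tuple[List[Tuple[int, int]], str, bool]]:
    """Parse an IC15-style ground-truth file into (quad, transcription, ignored) tuples."""
    instances = []
    with open(path, encoding="utf-8-sig") as f:  # "utf-8-sig" tolerates a leading BOM
        for line in f:
            parts = line.strip().split(",")
            if len(parts) < 9:
                continue  # skip empty or malformed lines
            coords = list(map(int, parts[:8]))                 # x1,y1,...,x4,y4
            quad = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
            transcription = ",".join(parts[8:])                # transcription may itself contain commas
            ignored = transcription == "###"                   # "###" marks don't-care regions
            instances.append((quad, transcription, ignored))
    return instances

# Hypothetical usage:
# boxes = load_ic15_gt("gt_img_1.txt")
```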
COCO-Text:
- Introduction: It contains 63,686 images in total: 43,686 for training, 10,000 for validation and 10,000 for testing. It contains 145,859 annotated text instances, including handwritten and printed, clear and blurred, English and non-English text.
- Link: COCO-Text-download
MSRA-TD500:
- Introduction: It contains 500 images in total. It provides text-line-level annotations rather than word-level ones, and oriented rectangles rather than axis-aligned rectangles for text regions. It contains both English and Chinese text instances.
- Link: MSRA-TD500-download
MLT 2017:
- Introduction: It contains 18,000 natural images in total. It provides word-level annotations and covers 9 languages. It is a more realistic and complex dataset for scene text detection and recognition.
- Link: MLT-download
MLT 2019:
- Introduction: It contains 20,000 images in total, 10,000 for training and 10,000 for testing. It provides word-level annotations. Compared to MLT 2017, this dataset covers 10 languages. It is a more realistic and complex dataset for scene text detection and recognition.
- Link: MLT-2019-download
CTW:
- Introduction: It contains 32,285 high-resolution street-view images of Chinese text, with 1,018,402 character instances in total. All images are annotated at the character level, including the underlying character category, the bounding box, and six other attributes. These attributes indicate whether the background is complex, whether the character is raised, whether it is hand-written or printed, whether it is occluded, whether it is distorted, and whether it uses word-art.
- Link: CTW-download
RCTW-17:
- Introduction: It contains 12,514 images in total, 11,514 for training and 1,000 for testing. Images in RCTW-17 were mostly collected by camera or mobile phone; the rest are digitally generated. Text instances are annotated with parallelograms. It was the first large-scale Chinese dataset, and the largest published one at the time.
- Link: RCTW-17-download
ReCTS:
- Introduction: ReCTS is a large-scale dataset of Chinese text on street-view signboards. Text is annotated at the Chinese word and text-line level, and regions are labeled with arbitrary quadrilaterals. It contains 20,000 images in total.
- Link: ReCTS-download
1.3 Irregular-Text Datasets
CUTE80:
- Introduction: It contains 80 high-resolution images taken in natural scenes, from which 288 cropped word images are used for testing. The dataset focuses on curved text. No lexicon is provided.
- Link: CUTE80-download
Total-Text:
- Introduction: It contains 1,555 images in total, with 11,459 annotated word instances covering three different text orientations: horizontal, multi-oriented and curved.
- Link: Total-Text-download
SCUT-CTW1500:
- Introduction: It contains 1,500 images in total, 1,000 for training and 500 for testing, with 10,751 annotated text instances. Annotations in CTW-1500 are polygons with 14 vertices. The dataset mainly consists of Chinese and English text.
- Link: CTW-1500-download
LSVT:
- Introduction: LSVT consists of 20,000 testing images, 30,000 training images with full annotations and 400,000 training images with weak annotations (referred to as partial labels). The labeled text regions demonstrate the diversity of text: horizontal, multi-oriented and curved.
- Link: LSVT-download
ArT:
- Introduction: ArT consists of 10,166 images, 5,603 for training and 4,563 for testing. They were collected with text-shape diversity in mind, and all text shapes (horizontal, multi-oriented and curved) are well represented in ArT.
- Link: ArT-download
1.4 Synthetic Datasets
Synth80k:
- Introduction: It contains 800,000 images with approximately 8 million synthetic word instances. Each text instance is annotated with its text string as well as word-level and character-level bounding boxes.
- Link: Synth80k-download
SynthText:
- Introduction: It contains 6 million cropped word images. The generation process is similar to that of Synth90k. It is also annotated with horizontal-style bounding boxes.
- Link: SynthText-download
1.5 Comparison of Datasets
Datasets | Language | Images (Total) | Images (Train) | Images (Test) | Text Instances (Total) | Text Instances (Train) | Text Instances (Test) | Horizontal | Arbitrary-Quadrilateral | Multi-oriented | Char | Word | Text-Line |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
IC03 | English | 509 | 258 | 251 | 2266 | 1110 | 1156 | ✓ | ✕ | ✕ | ✕ | ✓ | ✕ |
IC11 | English | 484 | 229 | 255 | 1564 | ~ | ~ | ✓ | ✕ | ✕ | ✓ | ✓ | ✕ |
IC13 | English | 462 | 229 | 233 | 1944 | 849 | 1095 | ✓ | ✕ | ✕ | ✓ | ✓ | ✕ |
USTB-SV1K | English | 1000 | 500 | 500 | 2955 | ~ | ~ | ✓ | ✓ | ✕ | ✕ | ✓ | ✕ |
SVT | English | 350 | 100 | 250 | 725 | 211 | 514 | ✓ | ✓ | ✕ | ✓ | ✓ | ✕ |
SVT-P | English | 238 | ~ | ~ | 639 | ~ | ~ | ✓ | ✓ | ✕ | ✕ | ✓ | ✕ |
IC15 | English | 1500 | 1000 | 500 | 17548 | 12318 | 5230 | ✓ | ✓ | ✕ | ✕ | ✓ | ✕ |
COCO-Text | English | 63686 | 43686 | 20000 | 145859 | 118309 | 27550 | ✓ | ✓ | ✕ | ✕ | ✓ | ✕ |
MSRA-TD500 | English/Chinese | 500 | 300 | 200 | ~ | ~ | ~ | ✓ | ✓ | ✕ | ✕ | ✕ | ✓ |
MLT 2017 | Multi-lingual | 18000 | 7200 | 10800 | ~ | ~ | ~ | ✓ | ✓ | ✕ | ✕ | ✓ | ✕ |
MLT 2019 | Multi-lingual | 20000 | 10000 | 10000 | ~ | ~ | ~ | ✓ | ✓ | ✕ | ✕ | ✓ | ✕ |
CTW | Chinese | 32285 | 25887 | 6398 | 1018402 | 812872 | 205530 | ✓ | ✓ | ✕ | ✓ | ✓ | ✕ |
RCTW-17 | English/Chinese | 12514 | 11514 | 1000 | ~ | ~ | ~ | ✓ | ✓ | ✕ | ✕ | ✕ | ✓ |
ReCTS | Chinese | 20000 | ~ | ~ | ~ | ~ | ~ | ✓ | ✓ | ✕ | ✓ | ✓ | ✕ |
CUTE80 | English | 80 | ~ | ~ | ~ | ~ | ~ | ✕ | ✕ | ✓ | ✕ | ✓ | ✓ |
Total-Text | English | 1555 | 1255 | 300 | 11459 | ~ | ~ | ✓ | ✓ | ✓ | ✕ | ✓ | ✓ |
CTW-1500 | English/Chinese | 1500 | 1000 | 500 | 10751 | ~ | ~ | ✓ | ✓ | ✓ | ✕ | ✓ | ✓ |
LSVT | English/Chinese | 450000 | 430000 | 20000 | ~ | ~ | ~ | ✓ | ✓ | ✓ | ✕ | ✓ | ✓ |
ArT | English/Chinese | 10166 | 5603 | 4563 | ~ | ~ | ~ | ✓ | ✓ | ✓ | ✕ | ✓ | ✕ |
Synth80k | English | 80k | ~ | ~ | 8m | ~ | ~ | ✓ | ✕ | ✕ | ✓ | ✓ | ✕ |
SynthText | English | 800k | ~ | ~ | 6m | ~ | ~ | ✓ | ✓ | ✕ | ✕ | ✓ | ✕ |
2. Summary of Scene Text Detection Resources
2.1 Comparison of Methods
Scene text detection methods can be divided into four categories:
(a) Traditional methods;
(b) Segmentation-based methods;
(c) Regression-based methods;
(d) Hybrid methods.
Note that: (1) "Hori" stands for horizontal scene text datasets. (2) "Quad" stands for arbitrary-quadrilateral-text datasets. (3) "Irreg" stands for irregular scene text datasets. (4) "Traditional methods" are methods that do not rely on deep learning.
2.1.1 Traditional Methods
Method | Model | Code | Hori | Quad | Irreg | Source | Time | Highlight |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
Yao et al. [1] | TD-Mixture | ✕ | ✓ | ✓ | ✕ | CVPR | 2012 | 1) A new dataset MSRA-TD500 and protocol for evaluation. 2) Equipped a two-level classification scheme and two sets of features extractor. |
Yin et al. [2] | ~ | ✕ | ✓ | ✕ | ✕ | TPAMI | 2013 | Extract Maximally Stable Extremal Regions (MSERs) as character candidates and group them together. |
Le et al. [5] | HOCC | ✕ | ✓ | ✓ | ✕ | CVPR | 2014 | HOCC + MSERs |
Yin et al. [7] | ~ | ✕ | ✓ | ✓ | ✕ | TPAMI | 2015 | Presenting a unified distance metric learning framework for adaptive hierarchical clustering. |
Wu et al. [9] | ~ | ✕ | ✓ | ✓ | ✕ | TMM | 2015 | Exploring gradient directional symmetry at the component level for smoothing edge components before text detection. |
Tian et al. [17] | ~ | ✕ | ✓ | ✕ | ✕ | IJCAI | 2016 | Scene text is first detected locally in individual frames and finally linked by an optimal tracking trajectory. |
Yang et al. [33] | ~ | ✕ | ✓ | ✓ | ✕ | TIP | 2017 | A text detector locates character candidates and extracts text regions, which are then linked by an optimal tracking trajectory. |
Liang et al. [8] | ~ | ✕ | ✓ | ✓ | ✓ | TIP | 2015 | Exploring maximally stable extremal regions along with the stroke width transform for detecting candidate text regions. |
Michal et al.[12] | FASText | ✕ | ✓ | ✓ | ✕ | ICCV | 2015 | Stroke keypoints are efficiently detected and then exploited to obtain stroke segmentations. |
2.1.2 Segmentation-based Methods
Method | Model | Code | Hori | Quad | Irreg | Source | Time | Highlight |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
Li et al. [3] | ~ | ✕ | ✓ | ✓ | ✕ | TIP | 2014 | (1) Developing three novel cues tailored for character detection and a Bayesian method for their integration; (2) designing a Markov random field model to exploit the inherent dependencies between characters. |
Zhang et al. [14] | ~ | ✕ | ✓ | ✓ | ✕ | CVPR | 2016 | Utilizing an FCN for saliency map detection and per-character centroid prediction. |
Zhu et al. [16] | ~ | ✕ | ✓ | ✓ | ✕ | CVPR | 2016 | Performs a graph-based segmentation of connected components into words (Word-Graph). |
He et al. [18] | Text-CNN | ✕ | ✓ | ✓ | ✕ | TIP | 2016 | Developing a new learning mechanism to train the Text-CNN with multi-level and rich supervised information. | ||||||||||||
Yao et al. [21] | ~ | ✕ | ✓ | ✓ | ✕ | arXiv | 2016 | Proposing to localize text in a holistic manner, by casting scene text detection as a semantic segmentation problem. |
Hu et al. [27] | WordSup | ✕ | ✓ | ✓ | ✕ | ICCV | 2017 | Proposing a weakly supervised framework that can utilize word annotations. Then the detected characters are fed to a text structure analysis module. | ||||||||||||
Wu et al. [28] | ~ | ✕ | ✓ | ✓ | ✕ | ICCV | 2017 | Introducing the border class to the text detection problem for the first time, and validating that the decoding process is largely simplified with the help of text borders. |
Tang et al. [32] | ~ | ✕ | ✓ | ✕ | ✕ | TIP | 2017 | A text-aware candidate text region (CTR) extraction model + CTR refinement model. |
Dai et al. [35] | FTSN | ✕ | ✓ | ✓ | ✕ | arXiv | 2017 | Detecting and segmenting the text instance jointly and simultaneously, leveraging merits from both semantic segmentation task and region proposal based object detection task. | ||||||||||||
Wang et al. [38] | ~ | ✕ | ✓ | ✕ | ✕ | ICDAR | 2017 | This paper proposes a novel character candidate extraction method based on super-pixel segmentation and hierarchical clustering. |
Deng et al. [40] | PixelLink | ✓ | ✓ | ✓ | ✕ | AAAI | 2018 | Text instances are first segmented out by linking pixels within the same instance together. |
Liu et al. [42] | MCN | ✕ | ✓ | ✓ | ✕ | CVPR | 2018 | Stochastic Flow Graph (SFG) + Markov Clustering. | ||||||||||||
Lyu et al. [43] | ~ | ✕ | ✓ | ✓ | ✕ | CVPR | 2018 | Detect scene text by localizing corner points of text bounding boxes and segmenting text regions in relative positions. |
Chu et al. [45] | Border | ✕ | ✓ | ✓ | ✕ | ECCV | 2018 | The paper presents a novel scene text detection technique that makes use of semantics-aware text borders and bootstrapping based text segment augmentation. | ||||||||||||
Long et al. [46] | TextSnake | ✕ | ✓ | ✓ | ✓ | ECCV | 2018 | The paper proposes TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms based on symmetry axis. | ||||||||||||
Yang et al. [47] | IncepText | ✕ | ✓ | ✓ | ✕ | IJCAI | 2018 | Designing a novel Inception-Text module and introduce deformable PSROI pooling to deal with multi-oriented text detection. | ||||||||||||
Yue et al. [48] | ~ | ✕ | ✓ | ✓ | ✕ | BMVC | 2018 | Proposing a general framework for text detection called Guided CNN to achieve the two goals simultaneously. |
Zhong et al. [53] | AF-RPN | ✕ | ✓ | ✓ | ✕ | arXiv | 2018 | Presenting AF-RPN(anchor-free) as an anchor-free and scale-friendly region proposal network for the Faster R-CNN framework. | ||||||||||||
Wang et al. [54] | PSENet | ✓ | ✓ | ✓ | ✓ | CVPR | 2019 | Proposing a novel Progressive Scale Expansion Network (PSENet), designed as a segmentation-based detector with multiple predictions for each text instance. | ||||||||||||
Xu et al.[57] | TextField | ✕ | ✓ | ✓ | ✓ | arXiv | 2018 | Presenting a novel direction field which can represent scene texts of arbitrary shapes. | ||||||||||||
Tian et al. [58] | FTDN | ✕ | ✓ | ✓ | ✕ | ICIP | 2018 | FTDN is able to segment text region and simultaneously regress text box at pixel-level. | ||||||||||||
Tian et al. [83] | ~ | ✕ | ✓ | ✓ | ✓ | CVPR | 2019 | Constraining embedding feature of pixels inside the same text region to share similar properties. |
Huang et al. [4] | MSERs-CNN | ✕ | ✓ | ✕ | ✕ | ECCV | 2014 | Combining MSERs with CNN | ||||||||||||
Sun et al. [6] | ~ | ✕ | ✓ | ✕ | ✕ | PR | 2015 | Presenting a robust text detection approach based on color-enhanced CER and neural networks. |
Baek et al. [62] | CRAFT | ✕ | ✓ | ✓ | ✓ | CVPR | 2019 | Proposing CRAFT effectively detect text area by exploring each character and affinity between characters. | ||||||||||||
Richardson et al. [87] | ~ | ✕ | ✓ | ✓ | ✕ | WACV | 2019 | Presenting an additional scale predictor to estimate the appropriate scale of text regions for testing. |
Wang et al. [88] | SAST | ✕ | ✓ | ✓ | ✓ | ACMM | 2019 | Presenting a context attended multi-task learning framework for scene text detection. | ||||||||||||
Wang et al. [90] | PAN | ✕ | ✓ | ✓ | ✓ | ICCV | 2019 | Proposing an efficient and accurate arbitrary-shaped text detector called Pixel Aggregation Network (PAN). |
2.1.3 Regression-based Methods
Method | Model | Code | Hori | Quad | Irreg | Source | Time | Highlight |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
Gupta et al. [15] | FCRN | ✓ | ✓ | ✕ | ✕ | CVPR | 2016 | (a) Proposing a fast and scalable engine to generate synthetic images of text in clutter; (b) FCRN. | ||||||||||||
Zhong et al. [20] | DeepText | ✕ | ✓ | ✕ | ✕ | arXiv | 2016 | (a) Inception-RPN; (b) Utilize ambiguous text category (ATC) information and multilevel region-of-interest pooling (MLRP). | ||||||||||||
Liao et al. [22] | TextBoxes | ✓ | ✓ | ✕ | ✕ | AAAI | 2017 | Mainly based on the SSD object detection framework. |
Liu et al. [25] | DMPNet | ✕ | ✓ | ✓ | ✕ | CVPR | 2017 | Quadrilateral sliding windows + shared Monte-Carlo method for fast and accurate computing of the polygonal areas + a sequential protocol for relative regression. | ||||||||||||
He et al. [26] | DDR | ✕ | ✓ | ✓ | ✕ | ICCV | 2017 | Proposing an FCN that has bi-task outputs where one is pixel-wise classification between text and non-text, and the other is direct regression to determine the vertex coordinates of quadrilateral text boundaries. | ||||||||||||
Jiang et al. [36] | R2CNN | ✕ | ✓ | ✓ | ✕ | arXiv | 2017 | Using the Region Proposal Network (RPN) to generate axis-aligned bounding boxes that enclose the texts with different orientations. | ||||||||||||
Xing et al. [37] | ArbiText | ✕ | ✓ | ✓ | ✕ | arXiv | 2017 | Adopting the circle anchors and incorporating a pyramid pooling module into the Single Shot MultiBox Detector framework. | ||||||||||||
Zhang et al. [39] | FEN | ✕ | ✓ | ✕ | ✕ | AAAI | 2018 | Proposing a refined scene text detector with a novel Feature Enhancement Network (FEN) for Region Proposal and Text Detection Refinement. | ||||||||||||
Wang et al. [41] | ITN | ✕ | ✓ | ✓ | ✕ | CVPR | 2018 | ITN is presented to learn the geometry-aware representation encoding the unique geometric configurations of scene text instances with in-network transformation embedding. | ||||||||||||
Liao et al. [44] | RRD | ✕ | ✓ | ✓ | ✕ | CVPR | 2018 | The regression branch extracts rotation-sensitive features, while the classification branch extracts rotation-invariant features by pooling the rotation sensitive features. | ||||||||||||
Liao et al. [49] | TextBoxes++ | ✓ | ✓ | ✓ | ✕ | TIP | 2018 | Mainly based on the SSD object detection framework; it replaces the rectangular box representation of conventional object detectors with a quadrilateral or oriented-rectangle representation. |
He et al. [50] | ~ | ✕ | ✓ | ✓ | ✕ | TIP | 2018 | Proposing a scene text detection framework based on a fully convolutional network with a bi-task prediction module. |
Ma et al. [51] | RRPN | ✓ | ✓ | ✓ | ✕ | TMM | 2018 | RRPN + RRoI Pooling. | ||||||||||||
Zhu et al. [55] | SLPR | ✕ | ✓ | ✓ | ✓ | arXiv | 2018 | SLPR regresses multiple points on the edge of text line and then utilizes these points to sketch the outlines of the text. | ||||||||||||
Deng et al. [56] | CRPN | ✓ | ✓ | ✓ | ✕ | arXiv | 2018 | CRPN employs corners to estimate the possible locations of text instances, and also designs an embedded data augmentation module inside the region-wise subnetwork. |
Cai et al. [59] | FFN | ✕ | ✓ | ✕ | ✕ | ICIP | 2018 | Proposing a Feature Fusion Network to deal with text regions differing in enormous sizes. | ||||||||||||
Sabyasachi et al. [60] | RGC | ✕ | ✓ | ✓ | ✕ | ICIP | 2018 | Proposing a novel recurrent architecture to improve the learnings of a feature map at a given time. | ||||||||||||
Liu et al. [63] | CTD | ✓ | ✓ | ✓ | ✓ | PR | 2019 | CTD + TLOC + PNMS | ||||||||||||
Xie et al. [79] | DeRPN | ✓ | ✓ | ✕ | ✕ | AAAI | 2019 | DeRPN utilizes anchor string mechanism instead of anchor box in RPN. | ||||||||||||
Wang et al. [82] | ~ | ✕ | ✓ | ✓ | ✓ | CVPR | 2019 | Text-RPN + RNN |
Liu et al. [84] | ~ | ✕ | ✓ | ✓ | ✓ | CVPR | 2019 | CSE mechanism |
He et al. [29] | SSTD | ✓ | ✓ | ✓ | ✕ | ICCV | 2017 | Proposing an attention mechanism. Then developing a hierarchical inception module which efficiently aggregates multi-scale inception features. | ||||||||||||
Tian et al. [11] | ~ | ✕ | ✓ | ✕ | ✕ | ICCV | 2015 | Cascade boosting detects character candidates, and a min-cost flow network model produces the final result. |
Tian et al. [13] | CTPN | ✓ | ✓ | ✕ | ✕ | ECCV | 2016 | 1) RPN + LSTM. 2) The RPN incorporates a new vertical anchor mechanism, and the LSTM connects regions to obtain the final result. |
He et al. [19] | ~ | ✕ | ✓ | ✓ | ✕ | ACCV | 2016 | An ER detector detects regions to obtain a coarse prediction of text regions; the local context is then aggregated to classify the remaining regions and obtain the final prediction. |
Shi et al. [23] | SegLink | ✓ | ✓ | ✓ | ✕ | CVPR | 2017 | Decomposing text into segments and links. A link connects two adjacent segments. | ||||||||||||
Tian et al. [30] | WeText | ✕ | ✓ | ✕ | ✕ | ICCV | 2017 | Proposing a weakly supervised scene text detection method (WeText). | ||||||||||||
Zhu et al. [31] | RTN | ✕ | ✓ | ✕ | ✕ | ICDAR | 2017 | Mainly based on the CTPN vertical proposal mechanism. |
Ren et al. [34] | ~ | ✕ | ✓ | ✕ | ✕ | TMM | 2017 | Proposing a CNN-based detector. It contains a text structure component detector layer, a spatial pyramid layer, and a multi-input-layer deep belief network (DBN). |
Zhang et al. [10] | ~ | ✕ | ✓ | ✕ | ✕ | CVPR | 2015 | The proposed algorithm exploits the symmetry property of character groups and allows for direct extraction of text lines from natural images. |
Wang et al. [86] | DSRN | ✕ | ✓ | ✓ | ✕ | IJCAI | 2019 | Presenting a scale-transfer module and scale relationship module to handle the problem of scale variation. | ||||||||||||
Tang et al.[89] | Seglink++ | ✕ | ✓ | ✓ | ✓ | PR | 2019 | Presenting instance aware component grouping (ICG) for arbitrary-shape text detection. | ||||||||||||
Wang et al.[92] | ContourNet | ✓ | ✓ | ✓ | ✓ | CVPR | 2020 | 1.A scale-insensitive Adaptive Region Proposal Network (AdaptiveRPN); 2. Local Orthogonal Texture-aware Module (LOTM). |
2.1.4 Hybrid Methods
Method | Model | Code | Hori | Quad | Irreg | Source | Time | Highlight |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
Tang et al. [52] | SSFT | ✕ | ✓ | ✕ | ✕ | TMM | 2018 | Proposing a novel scene text detection method that involves superpixel-based stroke feature transform (SSFT) and deep learning based region classification (DLRC). | ||||||||||||
Xie et al.[61] | SPCNet | ✕ | ✓ | ✓ | ✓ | AAAI | 2019 | Text Context module + Re-Score mechanism. | ||||||||||||
Liu et al. [64] | PMTD | ✓ | ✓ | ✓ | ✕ | arXiv | 2019 | Perform “soft” semantic segmentation. It assigns a soft pyramid label (i.e., a real value between 0 and 1) for each pixel within text instance. | ||||||||||||
Liu et al. [80] | BDN | ✓ | ✓ | ✓ | ✕ | IJCAI | 2019 | Discretizing bounding boxes into key edges to address label confusion for text detection. |
Zhang et al. [81] | LOMO | ✕ | ✓ | ✓ | ✓ | CVPR | 2019 | DR + IRM + SEM | ||||||||||||
Zhou et al. [24] | EAST | ✓ | ✓ | ✓ | ✕ | CVPR | 2017 | The pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images with instance segmentation. | ||||||||||||
Yue et al. [48] | ~ | ✕ | ✓ | ✓ | ✕ | BMVC | 2018 | Proposing a general framework for text detection called Guided CNN to achieve the two goals simultaneously. |
Zhong et al. [53] | AF-RPN | ✕ | ✓ | ✓ | ✕ | arXiv | 2018 | Presenting AF-RPN(anchor-free) as an anchor-free and scale-friendly region proposal network for the Faster R-CNN framework. | ||||||||||||
Xue et al. [85] | MSR | ✕ | ✓ | ✓ | ✓ | IJCAI | 2019 | Presenting a novel multi-scale regression network. |
Liao et al. [91] | DB | ✓ | ✓ | ✓ | ✓ | AAAI | 2020 | Presenting differentiable binarization module to adaptively set the thresholds for binarization, which simplifies the post-processing. | ||||||||||||
Xiao et al. [93] | SDM | ✕ | ✓ | ✓ | ✓ | ECCV | 2020 | 1. A novel sequential deformation method; 2. auxiliary character counting supervision. |
2.2 Detection Results
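In the tables below, P, R and F denote precision, recall and F-measure (the harmonic mean of P and R). Each benchmark has its own matching protocol (e.g., DetEval for IC13 versus IoU-based matching for IC15), so the short sketch below is only a generic illustration assuming the matched / predicted / ground-truth counts have already been obtained; it is not the official evaluation script of any benchmark.

```python
def precision_recall_f(num_matches: int, num_detections: int, num_ground_truths: int):
    """Generic P/R/F computation from match counts; F = 2PR / (P + R)."""
    p = num_matches / num_detections if num_detections else 0.0
    r = num_matches / num_ground_truths if num_ground_truths else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

# Example: 66 matched detections out of 75 predictions against 80 ground-truth words
# precision_recall_f(66, 75, 80)  ->  (0.88, 0.825, ~0.852)
```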
2.2.1 Detection Results on Horizontal-Text Datasets
Method | Model | Source | Time | Method Category | P (IC11 [68]) | R (IC11) | F (IC11) | P (IC13 [69]) | R (IC13) | F (IC13) | P (IC05 [67]) | R (IC05) | F (IC05) |
Yao et al. [1] | TD-Mixture | CVPR | 2012 | Traditional | ~ | ~ | ~ | 0.69 | 0.66 | 0.67 | ~ | ~ | ~ |
Yin et al. [2] | TPAMI | 2013 | 0.86 | 0.68 | 0.76 | ~ | ~ | ~ | ~ | ~ | ~ | ||
Yin et al. [7] | TPAMI | 2015 | 0.838 | 0.66 | 0.738 | ~ | ~ | ~ | ~ | ~ | ~ | ||
Wu et al. [9] | TMM | 2015 | ~ | ~ | ~ | 0.76 | 0.70 | 0.73 | ~ | ~ | ~ | ||
Liang et al. [8] | TIP | 2015 | 0.77 | 0.68 | 0.71 | 0.76 | 0.68 | 0.72 | ~ | ~ | ~ | ||
Michal et al.[12] | FASText | ICCV | 2015 | ~ | ~ | ~ | 0.84 | 0.69 | 0.77 | ~ | ~ | ~ | |
Li et al. [3] | TIP | 2014 | Segmentation | 0.80 | 0.62 | 0.70 | ~ | ~ | ~ | ~ | ~ | ~ | |
Zhang et al. [14] | CVPR | 2016 | ~ | ~ | ~ | 0.88 | 0.78 | 0.83 | ~ | ~ | ~ | ||
He et al. [18] | Text-CNN | TIP | 2016 | 0.91 | 0.74 | 0.82 | 0.93 | 0.73 | 0.82 | 0.87 | 0.73 | 0.79 | |
Yao et al. [21] | arXiv | 2016 | ~ | ~ | ~ | 0.889 | 0.802 | 0.843 | ~ | ~ | ~ | ||
Hu et al. [27] | WordSup | ICCV | 2017 | ~ | ~ | ~ | 0.933 | 0.875 | 0.903 | ~ | ~ | ~ | |
Tang et al.[32] | TIP | 2017 | 0.90 | 0.86 | 0.88 | 0.92 | 0.87 | 0.89 | ~ | ~ | ~ | ||
Wang et al. [38] | ICDAR | 2017 | 0.87 | 0.78 | 0.82 | 0.87 | 0.82 | 0.84 | ~ | ~ | ~ | ||
Deng et al. [40] | PixelLink | AAAI | 2018 | ~ | ~ | ~ | 0.886 | 0.875 | 0.881 | ~ | ~ | ~ | |
Liu et al. [42] | MCN | CVPR | 2018 | ~ | ~ | ~ | 0.88 | 0.87 | 0.88 | ~ | ~ | ~ | |
Lyu et al. [43] | CVPR | 2018 | ~ | ~ | ~ | 0.92 | 0.844 | 0.880 | ~ | ~ | ~ | ||
Chu et al. [45] | Border | ECCV | 2018 | ~ | ~ | ~ | 0.915 | 0.871 | 0.892 | ~ | ~ | ~ | |
Wang et al. [54] | PSENet | CVPR | 2019 | ~ | ~ | ~ | 0.94 | 0.90 | 0.92 | ~ | ~ | ~ | |
Huang et al. [4] | MSERs-CNN | ECCV | 2014 | 0.88 | 0.71 | 0.78 | ~ | ~ | ~ | 0.84 | 0.67 | 0.75 | |
Sun et al. [6] | PR | 2015 | 0.92 | 0.91 | 0.91 | 0.94 | 0.92 | 0.93 | ~ | ~ | ~ | ||
Gupta et al. [15] | FCRN | CVPR | 2016 | Regression | 0.94 | 0.77 | 0.85 | 0.938 | 0.764 | 0.842 | ~ | ~ | ~ |
Zhong et al. [20] | DeepText | arXiv | 2016 | 0.87 | 0.83 | 0.85 | 0.85 | 0.81 | 0.83 | ~ | ~ | ~ | |
Liao et al. [22] | TextBoxes | AAAI | 2017 | 0.89 | 0.82 | 0.86 | 0.89 | 0.83 | 0.86 | ~ | ~ | ~ | |
Liu et al. [25] | DMPNet | CVPR | 2017 | ~ | ~ | ~ | 0.93 | 0.83 | 0.870 | ~ | ~ | ~ | |
Jiang et al. [36] | R2CNN | arXiv | 2017 | ~ | ~ | ~ | 0.92 | 0.81 | 0.86 | ~ | ~ | ~ | |
Xing et al. [37] | ArbiText | arXiv | 2017 | ~ | ~ | ~ | 0.826 | 0.936 | 0.877 | ~ | ~ | ~ | |
Wang et al. [41] | ITN | CVPR | 2018 | 0.896 | 0.889 | 0.892 | 0.941 | 0.893 | 0.916 | ~ | ~ | ~ | |
Liao et al. [49] | TextBoxes++ | TIP | 2018 | ~ | ~ | ~ | 0.92 | 0.86 | 0.89 | ~ | ~ | ~ | |
He et al. [50] | TIP | 2018 | ~ | ~ | ~ | 0.91 | 0.84 | 0.88 | ~ | ~ | ~ | ||
Ma et al. [51] | RRPN | TMM | 2018 | ~ | ~ | ~ | 0.95 | 0.89 | 0.91 | ~ | ~ | ~ | |
Zhu et al. [55] | SLPR | arXiv | 2018 | ~ | ~ | ~ | 0.90 | 0.72 | 0.80 | ~ | ~ | ~ | |
Cai et al. [59] | FFN | ICIP | 2018 | ~ | ~ | ~ | 0.92 | 0.84 | 0.876 | ~ | ~ | ~ | |
Sabyasachi et al. [60] | RGC | ICIP | 2018 | ~ | ~ | ~ | 0.89 | 0.77 | 0.83 | ~ | ~ | ~ | |
Wang et al. [82] | CVPR | 2019 | ~ | ~ | ~ | 0.937 | 0.878 | 0.907 | ~ | ~ | ~ | ||
Liu et al. [84] | CVPR | 2019 | ~ | ~ | ~ | 0.937 | 0.897 | 0.917 | ~ | ~ | ~ | ||
He et al. [29] | SSTD | ICCV | 2017 | ~ | ~ | ~ | 0.89 | 0.86 | 0.88 | ~ | ~ | ~ | |
Tian et al. [11] | ICCV | 2015 | 0.86 | 0.76 | 0.81 | 0.852 | 0.759 | 0.802 | ~ | ~ | ~ | ||
Tian et al. [13] | CTPN | ECCV | 2016 | ~ | ~ | ~ | 0.93 | 0.83 | 0.88 | ~ | ~ | ~ | |
He et al. [19] | ACCV | 2016 | ~ | ~ | ~ | 0.90 | 0.75 | 0.81 | ~ | ~ | ~ | ||
Shi et al. [23] | SegLink | CVPR | 2017 | ~ | ~ | ~ | 0.877 | 0.83 | 0.853 | ~ | ~ | ~ | |
Tian et al. [30] | WeText | ICCV | 2017 | ~ | ~ | ~ | 0.911 | 0.831 | 0.869 | ~ | ~ | ~ | |
Zhu et al. [31] | RTN | ICDAR | 2017 | ~ | ~ | ~ | 0.94 | 0.89 | 0.91 | ~ | ~ | ~ | |
Ren et al. [34] | TMM | 2017 | 0.78 | 0.67 | 0.72 | 0.81 | 0.67 | 0.73 | ~ | ~ | ~ | ||
Zhang et al. [10] | CVPR | 2015 | 0.84 | 0.76 | 0.80 | 0.88 | 0.74 | 0.80 | ~ | ~ | ~ | ||
Tang et al. [52] | SSFT | TMM | 2018 | Hybrid | 0.906 | 0.847 | 0.876 | 0.911 | 0.861 | 0.885 | ~ | ~ | ~ |
Xie et al.[61] | SPCNet | AAAI | 2019 | ~ | ~ | ~ | 0.94 | 0.91 | 0.92 | ~ | ~ | ~ | |
Liu et al. [80] | BDN | IJCAI | 2019 | ~ | ~ | ~ | 0.887 | 0.894 | 0.89 | ~ | ~ | ~ | |
Zhou et al. [24] | EAST | CVPR | 2017 | ~ | ~ | ~ | 0.93 | 0.83 | 0.870 | ~ | ~ | ~ | |
Yue et al. [48] | BMVC | 2018 | ~ | ~ | ~ | 0.885 | 0.846 | 0.870 | ~ | ~ | ~ | ||
Zhong et al. [53] | AF-RPN | arXiv | 2018 | ~ | ~ | ~ | 0.94 | 0.90 | 0.92 | ~ | ~ | ~ | |
Xue et al.[85] | MSR | IJCAI | 2019 | ~ | ~ | ~ | 0.918 | 0.885 | 0.901 | ~ | ~ | ~ |
2.2.2 Detection Results on Arbitrary-Quadrilateral-Text Datasets
Method | Model | Source | Time | Method Category | P (IC15 [70]) | R (IC15) | F (IC15) | P (MSRA-TD500 [71]) | R (MSRA-TD500) | F (MSRA-TD500) | P (USTB-SV1K [65]) | R (USTB-SV1K) | F (USTB-SV1K) | P (SVT [66]) | R (SVT) | F (SVT) |
Le et al. [5] | HOCC | CVPR | 2014 | Traditional | ~ | ~ | ~ | 0.71 | 0.62 | 0.66 | ~ | ~ | ~ | ~ | ~ | ~ |
Yin et al. [7] | TPAMI | 2015 | ~ | ~ | ~ | 0.81 | 0.63 | 0.71 | 0.499 | 0.454 | 0.475 | ~ | ~ | ~ | ||
Wu et al. [9] | TMM | 2015 | ~ | ~ | ~ | 0.63 | 0.70 | 0.66 | ~ | ~ | ~ | ~ | ~ | ~ | ||
Tian et al. [17] | IJCAI | 2016 | ~ | ~ | ~ | 0.95 | 0.58 | 0.721 | 0.537 | 0.488 | 0.51 | ~ | ~ | ~ | ||
Yang et al. [33] | TIP | 2017 | ~ | ~ | ~ | 0.95 | 0.58 | 0.72 | 0.54 | 0.49 | 0.51 | ~ | ~ | ~ | ||
Liang et al. [8] | TIP | 2015 | ~ | ~ | ~ | 0.74 | 0.66 | 0.70 | ~ | ~ | ~ | ~ | ~ | ~ | ||
Zhang et al. [14] | CVPR | 2016 | Segmentation | 0.71 | 0.43 | 0.54 | 0.83 | 0.67 | 0.74 | ~ | ~ | ~ | ~ | ~ | ~ | |
Zhu et al. [16] | CVPR | 2016 | 0.81 | 0.91 | 0.85 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ||
He et al. [18] | Text-CNN | TIP | 2016 | ~ | ~ | ~ | 0.76 | 0.61 | 0.69 | ~ | ~ | ~ | ~ | ~ | ~ | |
Yao et al. [21] | arXiv | 2016 | 0.723 | 0.587 | 0.648 | 0.765 | 0.753 | 0.759 | ~ | ~ | ~ | ~ | ~ | ~ | ||
Hu et al. [27] | WordSup | ICCV | 2017 | 0.793 | 0.77 | 0.782 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Wu et al. [28] | ICCV | 2017 | 0.91 | 0.78 | 0.84 | 0.77 | 0.78 | 0.77 | ~ | ~ | ~ | ~ | ~ | ~ | ||
Dai et al. [35] | FTSN | arXiv | 2017 | 0.886 | 0.80 | 0.841 | 0.876 | 0.771 | 0.82 | ~ | ~ | ~ | ~ | ~ | ~ | |
Deng et al. [40] | PixelLink | AAAI | 2018 | 0.855 | 0.820 | 0.837 | 0.830 | 0.732 | 0.778 | ~ | ~ | ~ | ~ | ~ | ~ | |
Liu et al. [42] | MCN | CVPR | 2018 | 0.72 | 0.80 | 0.76 | 0.88 | 0.79 | 0.83 | ~ | ~ | ~ | ~ | ~ | ~ | |
Lyu et al. [43] | CVPR | 2018 | 0.895 | 0.797 | 0.843 | 0.876 | 0.762 | 0.815 | ~ | ~ | ~ | ~ | ~ | ~ | ||
Chu et al. [45] | Border | ECCV | 2018 | ~ | ~ | ~ | 0.830 | 0.774 | 0.801 | ~ | ~ | ~ | ~ | ~ | ~ | |
Long et al. [46] | TextSnake | ECCV | 2018 | 0.849 | 0.804 | 0.826 | 0.832 | 0.739 | 0.783 | ~ | ~ | ~ | ~ | ~ | ~ | |
Yang et al. [47] | IncepText | IJCAI | 2018 | 0.938 | 0.873 | 0.905 | 0.875 | 0.790 | 0.830 | ~ | ~ | ~ | ~ | ~ | ~ | |
Wang et al. [54] | PSENet | CVPR | 2019 | 0.8692 | 0.845 | 0.8569 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Xu et al.[57] | TextField | arXiv | 2018 | 0.843 | 0.805 | 0.824 | 0.874 | 0.759 | 0.813 | ~ | ~ | ~ | ~ | ~ | ~ | |
Tian et al. [58] | FTDN | ICIP | 2018 | 0.847 | 0.773 | 0.809 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Tian et al. [83] | CVPR | 2019 | 0.883 | 0.850 | 0.866 | 0.842 | 0.817 | 0.829 | ~ | ~ | ~ | ~ | ~ | ~ | ||
Baek et al. [62] | CRAFT | CVPR | 2019 | 0.898 | 0.843 | 0.869 | 0.882 | 0.782 | 0.829 | ~ | ~ | ~ | ~ | ~ | ~ | |
Richardson et al. [87] | IJCAI | 2019 | 0.853 | 0.83 | 0.827 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ||
Wang et al. [88] | SAST | ACMM | 2019 | 0.8755 | 0.8734 | 0.8744 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Wang et al. [90] | PAN | ICCV | 2019 | 0.84 | 0.819 | 0.829 | 0.844 | 0.838 | 0.821 | ~ | ~ | ~ | ~ | ~ | ~ | |
Gupta et al. [15] | FCRN | CVPR | 2016 | Regression | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | 0.651 | 0.599 | 0.624 |
Liu et al. [25] | DMPNet | CVPR | 2017 | 0.732 | 0.682 | 0.706 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
He et al. [26] | DDR | ICCV | 2017 | 0.82 | 0.80 | 0.81 | 0.77 | 0.70 | 0.74 | ~ | ~ | ~ | ~ | ~ | ~ | |
Jiang et al. [36] | R2CNN | arXiv | 2017 | 0.856 | 0.797 | 0.825 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Xing et al. [37] | ArbiText | arXiv | 2017 | 0.792 | 0.735 | 0.759 | 0.78 | 0.72 | 0.75 | ~ | ~ | ~ | ~ | ~ | ~ | |
Wang et al. [41] | ITN | CVPR | 2018 | 0.857 | 0.741 | 0.795 | 0.903 | 0.723 | 0.803 | ~ | ~ | ~ | ~ | ~ | ~ | |
Liao et al. [44] | RRD | CVPR | 2018 | 0.88 | 0.8 | 0.838 | 0.876 | 0.73 | 0.79 | ~ | ~ | ~ | ~ | ~ | ~ | |
Liao et al. [49] | TextBoxes++ | TIP | 2018 | 0.878 | 0.785 | 0.829 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
He et al. [50] | TIP | 2018 | 0.85 | 0.80 | 0.82 | 0.91 | 0.81 | 0.86 | ~ | ~ | ~ | ~ | ~ | ~ | ||
Ma et al. [51] | RRPN | TMM | 2018 | 0.822 | 0.732 | 0.774 | 0.821 | 0.677 | 0.742 | ~ | ~ | ~ | ~ | ~ | ~ | |
Zhu et al. [55] | SLPR | arXiv | 2018 | 0.855 | 0.836 | 0.845 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Deng et al. [56] | arXiv | 2018 | 0.89 | 0.81 | 0.845 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ||
Sabyasachi et al. [60] | RGC | ICIP | 2018 | 0.83 | 0.81 | 0.82 | 0.85 | 0.76 | 0.80 | ~ | ~ | ~ | ~ | ~ | ~ | |
Wang et al. [82] | CVPR | 2019 | 0.892 | 0.86 | 0.876 | 0.852 | 0.821 | 0.836 | ~ | ~ | ~ | ~ | ~ | ~ | ||
He et al. [29] | SSTD | ICCV | 2017 | 0.80 | 0.73 | 0.77 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Tian et al. [13] | CTPN | ECCV | 2016 | 0.74 | 0.52 | 0.61 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
He et al. [19] | ACCV | 2016 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | 0.87 | 0.73 | 0.79 | ||
Shi et al. [23] | SegLink | CVPR | 2017 | 0.731 | 0.768 | 0.75 | 0.86 | 0.70 | 0.77 | ~ | ~ | ~ | ~ | ~ | ~ | |
Wang et al. [86] | DSRN | IJCAI | 2019 | 0.832 | 0.796 | 0.814 | 0.876 | 0.712 | 0.785 | ~ | ~ | ~ | ~ | ~ | ~ | |
Tang et al.[89] | Seglink++ | PR | 2019 | 0.837 | 0.803 | 0.820 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Wang et al. [92] | ContourNet | CVPR | 2020 | 0.876 | 0.861 | 0.869 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Tang et al. [52] | SSFT | TMM | 2018 | Hybrid | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | 0.541 | 0.758 | 0.631 |
Xie et al.[61] | SPCNet | AAAI | 2019 | 0.89 | 0.86 | 0.87 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Liu et al. [64] | PMTD | arXiv | 2019 | 0.913 | 0.874 | 0.893 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Liu et al. [80] | BDN | IJCAI | 2019 | 0.881 | 0.846 | 0.863 | 0.87 | 0.815 | 0.842 | ~ | ~ | ~ | ~ | ~ | ~ | |
Zhang et al. [81] | LOMO | CVPR | 2019 | 0.878 | 0.876 | 0.877 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Zhou et al. [24] | EAST | CVPR | 2017 | 0.833 | 0.783 | 0.807 | 0.873 | 0.674 | 0.761 | ~ | ~ | ~ | ~ | ~ | ~ | |
Yue et al. [48] | BMVC | 2018 | 0.866 | 0.789 | 0.823 | ~ | ~ | ~ | ~ | ~ | ~ | 0.691 | 0.660 | 0.675 | ||
Zhong et al. [53] | AF-RPN | arXiv | 2018 | 0.89 | 0.83 | 0.86 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | |
Xue et al.[85] | MSR | IJCAI | 2019 | ~ | ~ | ~ | 0.874 | 0.767 | 0.817 | ~ | ~ | ~ | ~ | ~ | ~ | |
Liao et al. [91] | DB | AAAI | 2020 | 0.918 | 0.832 | 0.873 | 0.915 | 0.792 | 0.849 | ~ | ~ | ~ | ~ | ~ | ~ | |
Xiao et al. [93] | SDM | ECCV | 2020 | 0.9196 | 0.8922 | 0.9057 | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ | ~ |